Description:
We are seeking an experienced Machine Learning (ML) expert to assist in preparing a dataset of the 100k most common English words.
The goal is to compile, structure, and process a comprehensive set of metadata for each word entry, including pronunciation, part of speech, definitions, synonyms, usage examples, and more.
Responsibilities:
Dataset Compilation: Extract and compile lists of the 50k, 100k, and 200k most common English words, ensuring that all entries are lemmatized (i.e., in their base or dictionary form).
Metadata Collection: Develop scripts or use APIs to gather relevant metadata for each word, including:
Pronunciation (preferably in a standard dictionary format).
Part of speech.
Concise definitions.
Example sentences.
Synonyms and antonyms.
Etymology (optional).
Word frequency data.
Note: All of this metadata must be commercially usable.
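For illustration only, here is a minimal Python sketch of how a word record and a lemma-level frequency ranking might look. The `WordEntry` fields mirror the metadata listed above. The `source_license` field, the `top_lemmas` helper, and the toy lemma table are assumptions made for this example and are not part of the project specification. A real pipeline would likely swap the toy lookup for a proper lemmatizer (for example from spaCy or NLTK) and a real corpus.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WordEntry:
    """One dataset row; field names are illustrative, not a fixed schema."""
    lemma: str
    frequency: int
    pronunciation: Optional[str] = None
    part_of_speech: Optional[str] = None
    definitions: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    etymology: Optional[str] = None       # optional per the posting
    source_license: Optional[str] = None  # assumed field for tracking commercial usability


def top_lemmas(
    token_counts: Dict[str, int],
    lemmatize: Callable[[str], str],
    n: int,
) -> List[WordEntry]:
    """Merge counts of inflected forms into their lemma and keep the n most frequent."""
    totals: Counter = Counter()
    for token, count in token_counts.items():
        totals[lemmatize(token.lower())] += count
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [WordEntry(lemma=lemma, frequency=count) for lemma, count in ranked[:n]]


# Hypothetical stand-in for a real lemmatizer.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "cats": "cat"}


def toy_lemmatize(token: str) -> str:
    return TOY_LEMMAS.get(token, token)


# Made-up counts, used only to exercise the function.
counts = {"run": 5, "running": 3, "ran": 2, "cat": 4, "cats": 4, "the": 20}
entries = top_lemmas(counts, toy_lemmatize, n=2)
```

In this toy run, "run", "running", and "ran" collapse into a single "run" entry before ranking, which is the behavior the lemmatization requirement asks for.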
Requirements:
Expertise in Machine Learning and Data Science: Proven experience in data extraction, processing, and analysis.
Familiarity with Linguistic Data: Experience working with linguistic datasets, corpora, or dictionary projects is a strong plus.
Programming Skills: Proficiency in Python or similar programming languages, with experience using libraries such as NLTK, spaCy, or similar for natural language processing (NLP).
API Experience: Experience working with APIs or data sources such as Wiktionary, WordNet, or other linguistic databases.
Attention to Detail: Strong focus on data quality and accuracy.
Communication: Ability to clearly communicate progress, challenges, and results.
Deliverables:
A structured dataset of the 100k most common lemmatized English words to start, with complete metadata.
Scripts or tools used to gather and process the data, with clear documentation.
Location: Anywhere
Posted: Sept. 1, 2024, 7:25 a.m.