> 💡 Write-up of my recreation of Word2Vec from scratch. Code repo here.
I wanted to recreate Word2Vec from scratch, including:
- Sourcing Data (English Wikipedia)
- Cleaning Data (Regex & various rules; example rules are sketched after this list)
- Creating and Optimizing a Tokenizer (BPE Algorithm; the core merge loop is sketched below)
- Training Word Embeddings (CBOW with Negative Sampling; one training step is sketched below)
- Analyzing Word Embeddings (PCA, Cosine Similarity; a nearest-neighbor lookup is sketched after the figure)
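To make the cleaning step concrete, here is a minimal sketch of regex-based rules in the spirit of the ones used. The specific patterns below are illustrative assumptions, not the project's actual rule set.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative Wikipedia-text cleaning rules (placeholders, not the
    project's actual rule set)."""
    text = text.lower()
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # drop wiki templates
    text = re.sub(r"<[^>]+>", " ", text)         # drop leftover HTML tags
    text = re.sub(r"[^a-z0-9'\s]", " ", text)    # keep letters, digits, apostrophes
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.strip()

print(clean_text("The <b>Quick</b> fox {{citation needed}} jumped!"))
# -> "the quick fox jumped"
```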
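The heart of BPE is a simple loop: count adjacent symbol pairs across the vocabulary, merge the most frequent pair into a new symbol, and repeat. A minimal sketch of that loop on a toy character-level vocabulary (the data and function names are invented for the example, not taken from the repo):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by
    word frequency, and return the most frequent pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    a, b = pair
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}
for _ in range(10):  # learn 10 merges
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    vocab = merge_pair(vocab, pair)
    print("merged:", pair)
```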
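And a rough numpy sketch of one CBOW training step with negative sampling, to make the objective concrete: the mean of the context vectors should score high against the true target word and low against a handful of randomly drawn negatives. The hyperparameters are placeholders, and negatives are drawn uniformly here for brevity, whereas word2vec proper samples them from the unigram distribution raised to the 3/4 power.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, K, LR = 1000, 100, 5, 0.05      # placeholder hyperparameters
W_in = rng.normal(0, 0.01, (VOCAB, DIM))    # input (context) embeddings
W_out = np.zeros((VOCAB, DIM))              # output (target) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(context_ids, target_id):
    """One CBOW step with negative sampling (binary cross-entropy)."""
    h = W_in[context_ids].mean(axis=0)             # averaged context vector
    negatives = rng.integers(0, VOCAB, size=K)     # uniform here; word2vec
                                                   # samples from unigram^0.75
    ids = np.concatenate(([target_id], negatives))
    labels = np.array([1.0] + [0.0] * K)           # target=1, negatives=0
    scores = sigmoid(W_out[ids] @ h)
    grad = scores - labels                         # dLoss/dlogit for BCE
    W_in[context_ids] -= LR * (grad @ W_out[ids]) / len(context_ids)
    W_out[ids] -= LR * np.outer(grad, h)

train_step(context_ids=np.array([3, 17, 42, 99]), target_id=7)
```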
*Output from the cosine similarity analysis: semantically meaningful embeddings!*
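For reference, an analysis like the one above reduces to a matrix-vector product: normalize every embedding to unit length, and cosine similarity becomes a plain dot product. A minimal sketch (the `word_to_id`/`id_to_word` mappings are assumed inputs, not the repo's API):

```python
import numpy as np

def nearest_neighbors(word, embeddings, word_to_id, id_to_word, k=5):
    """Return the k words with the highest cosine similarity to `word`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[word_to_id[word]]  # cosine sim to every word
    best = np.argsort(-sims)[1:k + 1]         # skip the query word itself
    return [(id_to_word[i], float(sims[i])) for i in best]
```

Called with a trained embedding matrix and a query word, this returns ranked neighbor lists of the kind shown in the figure.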
Below is a write-up detailing some key learnings and documenting the process.
- Sourcing Data
- Cleaning Data
- Creating and Optimizing a Tokenizer
- Training Word Embeddings
- Analyzing Word Embeddings
- Notable References