<aside> 💡 Write-up of my recreation of Word2Vec from scratch. Code repo here.

</aside>

I wanted to recreate Word2Vec from scratch, including:

  1. Sourcing Data (English Wikipedia)
  2. Cleaning Data (Regex & various rules)
  3. Creating and Optimizing a Tokenizer (BPE Algorithm)
  4. Training Word Embeddings (CBOW with Negative Sampling)
  5. Analyzing Word Embeddings (PCA, Cosine Similarity)

Output from the cosine similarity analysis — semantically meaningful embeddings!

Below is a write-up detailing some key learnings and documenting the process.

Sourcing Data
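
The corpus is English Wikipedia. As a sketch of one way to pull the raw text, assuming the Hugging Face `datasets` library and its pre-extracted `wikipedia` dump (the repo may well source and parse the dump differently):

```python
from datasets import load_dataset

# Stream the pre-extracted dump so the full corpus never sits in memory at once.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

with open("wiki_raw.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        f.write(article["text"] + "\n")
```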

Cleaning Data
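
The cleaning pass is regex plus various hand-written rules. A minimal sketch of what such a pass might look like; the specific rules below are illustrative, not necessarily the repo's:

```python
import re

def clean(text: str) -> str:
    text = text.lower()
    # Strip leftover wiki section headings like "== history ==".
    text = re.sub(r"==+[^=]*==+", " ", text)
    # Replace any run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()

print(clean("== History ==\nWord2Vec (2013) changed NLP!"))
# -> "word2vec 2013 changed nlp"
```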

Creating and Optimizing a Tokenizer
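
BPE builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols until a target number of merges is reached. A compact, deliberately unoptimized sketch of the training loop:

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Each word starts as a sequence of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with every occurrence of the best pair fused.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["low", "lower", "lowest", "low"], num_merges=3))
# -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Naively recounting every pair after each merge is what makes this loop slow; an optimized trainer updates the pair counts incrementally instead, which is presumably where most of the tokenizer optimization effort goes.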

Training Word Embeddings
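
CBOW with negative sampling averages the context-word vectors and trains the model to score the true center word above k randomly sampled noise words under a logistic loss, instead of computing a full softmax over the vocabulary. A minimal PyTorch sketch (module and argument names are mine, not necessarily the repo's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWNegSampling(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # context (input) vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # center (output) vectors

    def forward(self, context, target, negatives):
        # context: (B, 2*window), target: (B,), negatives: (B, k)
        h = self.in_embed(context).mean(dim=1)      # averaged context representation
        pos = (self.out_embed(target) * h).sum(-1)  # score for the true center word
        neg = torch.bmm(self.out_embed(negatives), h.unsqueeze(-1)).squeeze(-1)
        # Push the true word's score up and the k noise words' scores down.
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

model = CBOWNegSampling(vocab_size=10_000, dim=100)
loss = model(torch.randint(0, 10_000, (32, 8)),   # 32 examples, window of 4 per side
             torch.randint(0, 10_000, (32,)),
             torch.randint(0, 10_000, (32, 5)))   # 5 negatives per example
```

In the original Word2Vec, negatives are drawn from the unigram distribution raised to the 3/4 power, which upweights rare words relative to their raw frequency.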

Analyzing Word Embeddings
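
Two analyses provide the sanity check: nearest neighbors under cosine similarity and a 2-D PCA projection. A sketch of both, where `E` and `vocab` are assumed names for the trained `(vocab_size, dim)` embedding matrix and the word-to-row-index mapping:

```python
import numpy as np
from sklearn.decomposition import PCA

def nearest(word: str, vocab: dict[str, int], E: np.ndarray, k: int = 5):
    # Cosine similarity reduces to a dot product once rows are L2-normalized.
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[vocab[word]]
    inv = {i: w for w, i in vocab.items()}
    top = np.argsort(-sims)[1:k + 1]  # rank 0 is the query word itself
    return [(inv[i], float(sims[i])) for i in top]

# Stand-in embeddings so the snippet runs end to end.
E = np.random.default_rng(0).normal(size=(1000, 100))
coords = PCA(n_components=2).fit_transform(E)  # 2-D projection for plotting
```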

Notable References