Word2Vec: The Mechanics

Skip-gram and CBOW: how prediction tasks create meaningful vectors

Word2Vec, published in 2013, demonstrated that simple neural networks trained on word prediction could learn rich semantic representations. The key insight: you do not need labeled data about word meanings. You only need text. Lots of text.

The task is self-supervised: predict words from their context, or predict context from a word. In learning to make these predictions, the model discovers meaningful structure in language.

The Prediction Task

Consider the sentence: "The quick brown fox jumps over the lazy dog."

For each word, we define a context window—the words within some distance. If the window size is 2, the context for "fox" is ["quick", "brown", "jumps", "over"].

Now we ask the model: given "fox", predict which words appear in its context. Or, given the context words, predict "fox" in the center.
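The window extraction above can be sketched in a few lines of Python (a toy illustration over one sentence; real implementations stream over a large corpus):

```python
# Sketch: extracting skip-gram training pairs with a window of 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Context = words within `window` positions of the center word.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# The context of "fox" (index 3) with window size 2:
print([ctx for c, ctx in pairs if c == "fox"])  # ['quick', 'brown', 'jumps', 'over']
```

Each `(center, context)` tuple is one training example.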

Skip-gram training pairs for center word "fox" (window size 2):

fox → quick
fox → brown
fox → jumps
fox → over

Each pair becomes a training example: given "fox", predict the context word.

This is not a useful task in itself. We do not need a model to tell us what words appear near "fox." But in learning to make these predictions accurately across billions of words, the model must learn which words are semantically related—because related words predict similar contexts.

Skip-gram: Predict Context from Word

In the skip-gram architecture, we take a center word and predict each context word independently.

Input: "fox" (represented as a one-hot vector)
Output: a probability distribution over the vocabulary for each context position

The model has two weight matrices. The first, W, maps from input to hidden layer—this transforms the center word into an embedding. The second, W', maps from hidden to output—this transforms the embedding into context word predictions.

For center word $w_c$ and context word $w_o$, the model computes:

$$P(w_o \mid w_c) = \frac{\exp(\vec{v}'_{w_o} \cdot \vec{v}_{w_c})}{\sum_{w \in V} \exp(\vec{v}'_w \cdot \vec{v}_{w_c})}$$

The numerator is high when the context embedding $\vec{v}'_{w_o}$ aligns with the center embedding $\vec{v}_{w_c}$. Training adjusts these vectors to maximize the probability of actual context words.
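The full-softmax probability can be sketched with NumPy; the toy vocabulary size, embedding dimension, and random matrices below are assumptions for illustration:

```python
# Sketch of the skip-gram probability under the full softmax.
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))       # input embeddings v_w (center words)
W_out = rng.normal(size=(V, d))   # output embeddings v'_w (context words)

def p_context_given_center(o, c):
    scores = W_out @ W[c]         # v'_w . v_c for every w in V
    scores -= scores.max()        # subtract max for numerical stability
    exp = np.exp(scores)
    return exp[o] / exp.sum()

probs = np.array([p_context_given_center(o, 3) for o in range(V)])
assert np.isclose(probs.sum(), 1.0)  # a proper distribution over the vocabulary
```

Note that the denominator touches every row of `W_out`, which is exactly the cost the "Softmax Problem" section below addresses.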

CBOW: Predict Word from Context

The Continuous Bag of Words (CBOW) architecture inverts the task. Given all context words, predict the center word.


CBOW averages the embeddings of all context words, then predicts the center word from this average. It trains faster than skip-gram because it makes one prediction per window instead of multiple.

Skip-gram tends to produce better embeddings for rare words (each occurrence updates that word's embedding directly). CBOW works better for frequent words and is faster to train.
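A CBOW forward pass can be sketched as follows; the toy sizes and context indices are assumptions for illustration:

```python
# Sketch of a CBOW forward pass: average the context embeddings,
# then score the whole vocabulary to predict the center word.
import numpy as np

rng = np.random.default_rng(1)
V, d = 10, 4
W = rng.normal(size=(V, d))       # input embeddings
W_out = rng.normal(size=(V, d))   # output embeddings

context_ids = [1, 2, 4, 5]        # e.g. "quick", "brown", "jumps", "over"
h = W[context_ids].mean(axis=0)   # one averaged hidden vector per window

scores = W_out @ h                # one prediction over V for the center word
probs = np.exp(scores - scores.max())
probs /= probs.sum()
predicted_center = int(np.argmax(probs))
```

Contrast with skip-gram, where the same window would produce four separate predictions (one per context word) instead of this single averaged one.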

The Softmax Problem

The denominator in the probability formula sums over the entire vocabulary—potentially 100,000+ words. Computing this for every training example is prohibitively expensive.

Two solutions emerged:

Hierarchical Softmax: Organize vocabulary as a binary tree. Prediction becomes a sequence of binary decisions. Complexity drops from O(V) to O(log V).

Negative Sampling: Instead of predicting exact probabilities, distinguish true context words from random "negative" words. This reframes classification as binary discrimination.

Negative Sampling: The Practical Solution

Negative sampling transforms the problem. Instead of asking "what is the probability of every word in context?", we ask "is this specific word a context word or a random word?"

For example, take the positive pair ("fox", "quick"): its score should be high ($\sigma \to 1$). Sampled negative pairs such as ("fox", "computer"), ("fox", "television"), ("fox", "democracy"), ("fox", "algorithm"), and ("fox", "elephant") should score low ($\sigma \to 0$). The objective pushes positive pairs together (high dot product) and negative pairs apart (low dot product):

$$\log \sigma(\vec{v}'_{\text{quick}} \cdot \vec{v}_{\text{fox}}) + \sum_{\text{neg}} \log \sigma(-\vec{v}'_{\text{neg}} \cdot \vec{v}_{\text{fox}})$$

Instead of computing a softmax over the entire vocabulary, we compare against only six pairs per example: one positive and five negatives.

For each positive pair (center word, actual context word), we sample k negative pairs (center word, random word). The model learns to assign high scores to positive pairs and low scores to negative pairs.

The objective for skip-gram with negative sampling becomes:

$$\log \sigma(\vec{v}'_{w_o} \cdot \vec{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\vec{v}'_{w_i} \cdot \vec{v}_{w_c}) \right]$$

Where $\sigma$ is the sigmoid function and $P_n(w)$ is a noise distribution over the vocabulary (typically $\text{unigram}^{0.75}$, to upweight rare words).

This reduces computation dramatically while empirically producing similar quality embeddings.
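The objective can be sketched with NumPy; the vocabulary size, embedding dimension, number of negatives k, and toy unigram counts below are assumptions for illustration:

```python
# Sketch of the skip-gram negative-sampling objective for one positive
# pair and k sampled negatives, with P_n(w) proportional to unigram^0.75.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
V, d, k = 10, 4, 5
W = rng.normal(scale=0.1, size=(V, d))       # center embeddings v_w
W_out = rng.normal(scale=0.1, size=(V, d))   # context embeddings v'_w

counts = rng.integers(1, 100, size=V)        # toy unigram counts (assumption)
p_noise = counts ** 0.75                     # flatten the distribution...
p_noise = p_noise / p_noise.sum()            # ...and renormalize

def sgns_objective(c, o):
    negs = rng.choice(V, size=k, p=p_noise)            # w_i ~ P_n(w)
    pos = np.log(sigmoid(W_out[o] @ W[c]))             # log sigma(v'_o . v_c)
    neg = np.log(sigmoid(-W_out[negs] @ W[c])).sum()   # sum of log sigma(-v'_i . v_c)
    return pos + neg

obj = sgns_objective(c=3, o=1)  # training maximizes this quantity
```

The loop over k negatives replaces the sum over all V words in the softmax denominator, which is the whole point of the trick.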

Training Dynamics

What happens during training? Embeddings start as random vectors scattered throughout the space. Then gradients push them around. Words that predict similar contexts get pulled together—their embeddings align. Words that predict different contexts get pushed apart—their embeddings diverge.
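A single update for one positive pair can be sketched as follows; the 3-dimensional toy vectors and learning rate are assumptions, and the gradient follows from the negative-sampling objective above:

```python
# Sketch of one gradient-ascent step on log sigma(v_c . v'_o) for a
# positive pair: the update pulls the two embeddings toward each other.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_c = np.array([0.1, -0.2, 0.3])   # center embedding (toy values)
v_o = np.array([0.4, 0.1, -0.1])   # context (output) embedding (toy values)
lr = 0.5                           # learning rate (assumption)

before = v_c @ v_o
g = 1.0 - sigmoid(before)          # gradient scale for a positive pair
v_c_new = v_c + lr * g * v_o       # both updates use the pre-step vectors
v_o_new = v_o + lr * g * v_c
after = v_c_new @ v_o_new

assert after > before              # the pair's score increased
```

A negative pair gets the mirror-image update (scale `-sigmoid(before)`), which pushes the two vectors apart.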


After millions of training steps on billions of words, the embedding space organizes itself. Semantic relationships become geometric relationships. Analogies become vector arithmetic.

No one explicitly teaches the model that "king" and "queen" are related. It discovers this because they predict similar contexts: "throne", "crown", "reign", "royal".

Subsampling Frequent Words

Common words like "the", "a", "is" appear constantly but carry little meaning. Without intervention, they dominate training.

Word2Vec uses subsampling: randomly discard frequent words from training with probability proportional to their frequency. The formula:

$$P(\text{discard } w) = 1 - \sqrt{\frac{t}{f(w)}}$$

Where $f(w)$ is the word's corpus frequency and $t$ is a threshold (typically $10^{-5}$). "The", with frequency 0.05, gets discarded about 99% of the time; "serendipity", with frequency $10^{-5}$, is always kept.

This speeds training and improves embedding quality by focusing on meaningful words.
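The subsampling rule is short enough to sketch directly (the frequencies are the examples from the text):

```python
# Sketch of Word2Vec's subsampling rule: P(discard w) = 1 - sqrt(t / f(w)).
import math

t = 1e-5  # typical threshold

def p_discard(freq):
    # Clamp at 0 so words rarer than the threshold are never discarded.
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(round(p_discard(0.05), 3))  # "the" at f = 0.05     -> 0.986 (~99%)
print(p_discard(1e-5))            # "serendipity" at 1e-5 -> 0.0 (always kept)
```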

What Word2Vec Learns

The resulting embeddings capture multiple kinds of structure. Similarity: words like "dog," "cat," and "pet" cluster together in the space. Analogy: relationships become directions, so king - man + woman ≈ queen. Hierarchy: hypernyms like "animal" separate from hyponyms like "dog." Syntactic patterns: verb tenses align with each other, as do plurals with their singulars.
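The analogy arithmetic can be demonstrated on toy vectors; the 2-dimensional embeddings below are hand-picked assumptions (not trained values), chosen so the "gender" direction is consistent across the royal pair:

```python
# Sketch of analogy arithmetic: king - man + woman lands nearest "queen".
import numpy as np

emb = {
    "king":   np.array([0.9, 0.8]),
    "queen":  np.array([0.9, 0.2]),
    "man":    np.array([0.1, 0.8]),
    "woman":  np.array([0.1, 0.2]),
    "throne": np.array([0.5, 0.5]),
}

def nearest(vec, exclude):
    # Return the most cosine-similar word, skipping the query words.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Real embedding libraries do the same thing at scale: nearest-neighbor search by cosine similarity over the offset vector.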

These properties emerge without supervision. No one labels which words are similar. No one annotates analogies. The structure exists in language itself; Word2Vec simply exposes it through prediction.

Limitations

Word2Vec has fundamental limitations that motivate more advanced models.

The first is static embeddings: each word gets exactly one vector regardless of context. "Bank" has the same embedding whether it appears in "river bank" or "bank account." The model cannot represent polysemy—multiple meanings for the same word.

The second is out-of-vocabulary words. Words not seen during training have no embedding at all. If your corpus lacks "cryptocurrency," you cannot embed it. New terminology requires retraining.

The third is single granularity. Word2Vec operates at the word level only. It cannot directly embed sentences, paragraphs, or documents. Combining word vectors (by averaging, for example) loses structural information.

These limitations drove the development of contextual embeddings like BERT, where each word's representation depends on its surrounding context, and subword tokenization like BPE and WordPiece, which handle novel words by breaking them into familiar pieces.

Key Takeaways

  • Word2Vec learns embeddings through a self-supervised prediction task: predict context from words (skip-gram) or words from context (CBOW)
  • Negative sampling makes training tractable by replacing full softmax with binary discrimination
  • Training dynamics push similar words together and dissimilar words apart based on co-occurrence patterns
  • Subsampling frequent words focuses training on meaningful content
  • The resulting embeddings capture semantic similarity and analogical relationships
  • Limitations include static (context-free) embeddings, out-of-vocabulary problems, and word-only granularity