Attention and Context

How transformers compute context-dependent representations

Word2Vec assigns one vector per word. But meaning depends on context. "Bank" in "river bank" differs from "bank account." We need representations that incorporate surrounding words.

Transformers solve this through attention—a mechanism that allows each word to "look at" every other word and adjust its representation accordingly.

The Problem with Static Embeddings

Consider the word "crane." Without context:

  • Could be a bird
  • Could be construction equipment
  • Could be the verb "to crane one's neck"

A static embedding must somehow encode all meanings in one vector. It averages across uses, becoming imperfect for any specific use.

Contextual embeddings compute a different vector for "crane" depending on surrounding words. In "the crane flew overhead," the embedding shifts toward bird-related concepts. In "the crane lifted the beam," it shifts toward machinery.

Interactive: Same word, different contexts

Example sentence: "The fisherman sat on the bank watching the water flow by."

Static embedding: "bank" gets the same vector regardless of context, averaging all its meanings.

Contextual embedding: "bank" resolves to its river sense, landing near shore, water, stream, and riverside.

Attention: The Core Idea

Attention computes how much each word should "attend to" each other word when building its representation.

For a sentence of N words, attention produces an N×N matrix of weights. Each row sums to 1 and represents how one word distributes its attention across all words.

The computation involves three transformations of each word embedding:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

Interactive: Query, Key, Value

Attention weights from "bank" over "The bank approved the loan": The 10%, bank 15%, approved 35%, the 10%, loan 30%.

For each word, we:

  1. Compute its Query
  2. Compare Query to every word's Key (dot product)
  3. Softmax the scores to get attention weights
  4. Weighted sum of Values to get the new representation

Mathematically:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The √d_k scaling prevents the dot products from growing too large in high dimensions, which would saturate the softmax and flatten its gradients.
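The four steps above can be sketched in a few lines of NumPy. This is a toy setup: random embeddings stand in for learned Q/K/V projections, so it reduces to self-attention with Q = K = V = X.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 2: compare every query to every key
    weights = softmax(scores, axis=-1)  # step 3: each row sums to 1
    return weights @ V, weights         # step 4: weighted sum of values

# Toy example: 5 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out, w = attention(X, X, X)             # self-attention with Q = K = V = X
print(w.sum(axis=1))                    # every row of the 5x5 weight matrix sums to 1
```

The returned `out` has the same shape as the input, so the new contextual representations can feed directly into the next layer.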

Multi-Head Attention

One attention pattern cannot capture all relationships. "Bank" might need to attend differently for:

  • Syntactic relationships (subject-verb agreement)
  • Semantic relationships (what kind of bank?)
  • Positional relationships (nearby modifiers)

Multi-head attention runs multiple attention computations in parallel, each with different learned projections.

Interactive: Multiple attention heads

One head's weights (%) for "The bank approved the loan", capturing subject-verb-object relationships (rows attend to columns; each row sums to 100):

            The  bank  approved  the  loan
The          10    20        50   10    10
bank         20    10        30   20    20
approved     40    20        10   10    20
the          20    20        20   20    20
loan         10    10        40   10    30
Each head learns to capture different patterns. Some heads specialize in syntax, others in coreference, others in domain relationships. The outputs are concatenated and projected back to the embedding dimension.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
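A minimal NumPy sketch of the head-splitting, assuming the common implementation in which d_model is divided evenly across heads; the projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concat, project."""
    N, d_model = X.shape
    d_k = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_k, (h + 1) * d_k)          # this head's slice of dimensions
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])     # independent attention per head
    # Concat(head_1, ..., head_h) W^O: back to the embedding dimension
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
N, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)   # (5, 8)
```

Because each head sees a different learned projection, the heads are free to specialize in different relationships, as described above.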

Transformer Architecture

A transformer stacks multiple layers of attention. Each layer:

  1. Self-attention: Each word attends to all words (including itself)
  2. Feed-forward network: Applies non-linear transformation
  3. Residual connections: Add input to output
  4. Layer normalization: Stabilize training
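The four sub-steps can be sketched as a single NumPy function. This is deliberately simplified: identity Q/K/V projections, a ReLU feed-forward with random toy weights, and layer norm without the learned scale/shift parameters that real implementations add.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Simplified: no learned Wq/Wk/Wv, so Q = K = V = X
    d_k = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d_k)) @ X

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + self_attention(X))       # steps 1, 3, 4
    ff = np.maximum(X @ W1 + b1, 0) @ W2 + b2   # step 2: ReLU feed-forward
    return layer_norm(X + ff)                   # steps 3, 4 again

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
out = transformer_layer(X, W1, b1, W2, b2)
print(out.shape)   # (5, 8): same shape in and out, so layers stack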

Interactive: Transformer layers

Deeper layers capture phrase-level meaning and semantic relationships.

Lower layers capture local patterns (syntax, nearby words). Higher layers capture global patterns (document topic, long-range dependencies).

By layer 12 (in BERT-base), the embedding for "bank" in "river bank" has been transformed to be near water and nature concepts. The same word in "bank account" has been transformed to be near finance concepts.

Encoder vs Decoder

Encoder models (BERT, sentence transformers): Process entire text at once. Each word attends to all words. Good for understanding, classification, embeddings.

Decoder models (GPT): Process left-to-right. Each word only attends to previous words (causal attention). Good for generation.

Encoder-decoder models (T5): Encoder processes input, decoder generates output attending to encoder states. Good for translation, summarization.

For semantic search, encoder models are standard. We need rich, bidirectional understanding of text.
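Mechanically, the difference between encoder-style and decoder-style attention is just a mask on the score matrix. A toy NumPy sketch, where `causal=True` zeroes out attention to future positions:

```python
import numpy as np

def attention_weights(X, causal=False):
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    if causal:
        # Decoder-style: mask future positions so token i only sees tokens <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)   # softmax turns -inf into 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.default_rng(0).normal(size=(4, 8))
w = attention_weights(X, causal=True)
print(np.allclose(np.triu(w, k=1), 0))   # True: upper triangle (the future) is zeroed
```

With `causal=False`, every token attends to every other token in both directions, which is the bidirectional behavior encoder models rely on.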

From Tokens to Embeddings

Transformers process text as tokens, not words. Each token gets a contextual embedding. For semantic search, we need a single vector for the entire text.

Common strategies:

CLS token: BERT prepends a special [CLS] token. Its final-layer embedding represents the whole sequence. Works but was designed for classification, not similarity.

Mean pooling: Average all token embeddings. Simple, often effective. What sentence transformers typically use.

Max pooling: Take element-wise maximum across tokens. Captures strongest signal per dimension.

Weighted pooling: Weight tokens by attention or importance. More sophisticated but requires tuning.
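The first three strategies are one line each in NumPy (toy random token embeddings; the masked mean at the end mirrors how sentence-transformers-style mean pooling ignores padding tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(6, 8))   # 6 token embeddings (index 0 = [CLS]), dim 8

cls_vec  = token_emb[0]               # CLS pooling: the first token's embedding
mean_vec = token_emb.mean(axis=0)     # mean pooling: average over all tokens
max_vec  = token_emb.max(axis=0)      # max pooling: strongest signal per dimension

# In practice sequences are padded; mask out pad tokens before averaging
mask = np.array([1, 1, 1, 1, 0, 0])   # last two positions are padding
masked_mean = (token_emb * mask[:, None]).sum(axis=0) / mask.sum()

print(cls_vec.shape, mean_vec.shape, max_vec.shape)   # all (8,)
```

Whichever strategy is used, the result is a single fixed-size vector per text, which is what a vector index stores and compares.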

Positional Encoding

Attention is permutation-invariant—it does not inherently know word order. "Dog bites man" and "man bites dog" would produce identical attention if we did not encode position.

Sinusoidal encoding (original transformer): Add sine/cosine waves of different frequencies to embeddings. Position information is injected.

Learned encoding (BERT): Learn a position embedding for each position up to maximum sequence length.

Rotary encoding (RoPE, modern): Rotate embedding dimensions by angle proportional to position. Enables better length generalization.
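The sinusoidal scheme is simple enough to write out directly, following the usual formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sine waves on even dims, cosine on odd dims, frequency falling with i."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)   # (50, 16): added to the token embeddings before the first layer
```

Each position gets a unique pattern across frequencies, so attention can distinguish "dog bites man" from "man bites dog" even though the mechanism itself is permutation-invariant.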

Why Transformers Work for Search

Contextual embeddings are powerful for semantic search because:

  1. Disambiguation: Same word, different contexts → different vectors
  2. Compositionality: Phrase meaning emerges from word interactions
  3. Long-range dependencies: Words can influence distant words
  4. Transfer learning: Pretrained on massive corpora, fine-tuned for retrieval

The attention mechanism explicitly models relationships between all words. This is exactly what we need for understanding meaning.

Key Takeaways

  • Static embeddings (Word2Vec) assign one vector per word regardless of context
  • Transformers compute contextual embeddings where meaning depends on surrounding words
  • Attention allows each word to "attend to" every other word through Query/Key/Value projections
  • Multi-head attention captures different types of relationships in parallel
  • Stacking layers captures increasingly global patterns
  • For search, we pool token embeddings into a single vector (CLS, mean, max)
  • Positional encoding provides word order information to the permutation-invariant attention