Attention and Context

How transformers compute context-dependent representations

Word2Vec assigns one vector per word. But meaning depends on context. "Bank" in "river bank" differs from "bank account." We need representations that incorporate surrounding words.

Transformers solve this through attention—a mechanism that allows each word to "look at" every other word and adjust its representation accordingly.

The Problem with Static Embeddings

Consider the word "crane." Without context:

  • Could be a bird
  • Could be construction equipment
  • Could be the verb "to crane one's neck"

A static embedding must somehow encode all meanings in one vector. It averages across uses, becoming imperfect for any specific use.

Contextual embeddings compute a different vector for "crane" depending on surrounding words. In "the crane flew overhead," the embedding shifts toward bird-related concepts. In "the crane lifted the beam," it shifts toward machinery.

Interactive: Same word, different contexts

Example sentence: "The fisherman sat on the bank watching the water flow by."

Static embedding: "bank" gets the same vector regardless of context, averaging all its meanings.

Contextual embedding: "bank" resolves to its river sense, landing near shore, water, stream, and riverside.

Attention: The Core Idea

Attention computes how much each word should "attend to" each other word when building its representation.

For a sentence of N words, attention produces an N×N matrix of weights. Each row sums to 1 and represents how one word distributes its attention across all words.

The computation involves three transformations of each word embedding:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

Interactive: Query, Key, Value

Attention weights from "bank" over "The bank approved the loan": The 10%, bank 15%, approved 35%, the 10%, loan 30%.

For each word, we:

  1. Compute its Query
  2. Compare Query to every word's Key (dot product)
  3. Softmax the scores to get attention weights
  4. Weighted sum of Values to get the new representation

Mathematically:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The √d_k scaling prevents the dot products from growing too large in high dimensions, which would saturate the softmax and flatten its gradients.
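The four steps above can be sketched in a few lines of NumPy. This is a toy setup: random embeddings stand in for learned Q/K/V projections, so it reduces to self-attention with Q = K = V = X.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 2: compare every query to every key
    weights = softmax(scores, axis=-1)  # step 3: each row sums to 1
    return weights @ V, weights         # step 4: weighted sum of values

# Toy example: 5 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out, w = attention(X, X, X)             # self-attention with Q = K = V = X
print(w.sum(axis=1))                    # every row of the 5x5 weight matrix sums to 1
```

The returned `out` has the same shape as the input, so the new contextual representations can feed directly into the next layer.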

Multi-Head Attention

One attention pattern cannot capture all relationships. "Bank" might need to attend differently for:

  • Syntactic relationships (subject-verb agreement)
  • Semantic relationships (what kind of bank?)
  • Positional relationships (nearby modifiers)

Multi-head attention runs multiple attention computations in parallel, each with different learned projections.

Interactive: Multiple attention heads

One head's weights (%) for "The bank approved the loan", capturing subject-verb-object relationships (rows attend to columns; each row sums to 100):

            The  bank  approved  the  loan
The          10    20        50   10    10
bank         20    10        30   20    20
approved     40    20        10   10    20
the          20    20        20   20    20
loan         10    10        40   10    30
Each head learns to capture different patterns. Some heads specialize in syntax, others in coreference, others in domain relationships. The outputs are concatenated and projected back to the embedding dimension.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
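A minimal NumPy sketch of the head-splitting, assuming the common implementation in which d_model is divided evenly across heads; the projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, concat, project."""
    N, d_model = X.shape
    d_k = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_k, (h + 1) * d_k)          # this head's slice of dimensions
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])     # independent attention per head
    # Concat(head_1, ..., head_h) W^O: back to the embedding dimension
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
N, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)   # (5, 8)
```

Because each head sees a different learned projection, the heads are free to specialize in different relationships, as described above.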

Transformer Architecture

A transformer stacks multiple layers of attention. Each layer:

  1. Self-attention: Each word attends to all words (including itself)
  2. Feed-forward network: Applies non-linear transformation
  3. Residual connections: Add input to output
  4. Layer normalization: Stabilize training
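The four sub-steps can be sketched as a single NumPy function. This is deliberately simplified: identity Q/K/V projections, a ReLU feed-forward with random toy weights, and layer norm without the learned scale/shift parameters that real implementations add.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Simplified: no learned Wq/Wk/Wv, so Q = K = V = X
    d_k = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d_k)) @ X

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + self_attention(X))       # steps 1, 3, 4
    ff = np.maximum(X @ W1 + b1, 0) @ W2 + b2   # step 2: ReLU feed-forward
    return layer_norm(X + ff)                   # steps 3, 4 again

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
out = transformer_layer(X, W1, b1, W2, b2)
print(out.shape)   # (5, 8): same shape in and out, so layers stack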

Interactive: Transformer layers

Deeper layers capture phrase-level meaning and semantic relationships.

Lower layers capture local patterns (syntax, nearby words). Higher layers capture global patterns (document topic, long-range dependencies).

By layer 12 (in BERT-base), the embedding for "bank" in "river bank" has been transformed to be near water and nature concepts. The same word in "bank account" has been transformed to be near finance concepts.

Encoder vs Decoder

Encoder models (BERT, sentence transformers): Process entire text at once. Each word attends to all words. Good for understanding, classification, embeddings.

Decoder models (GPT): Process left-to-right. Each word only attends to previous words (causal attention). Good for generation.

Encoder-decoder models (T5): Encoder processes input, decoder generates output attending to encoder states. Good for translation, summarization.

For semantic search, encoder models are standard. We need rich, bidirectional understanding of text.
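Mechanically, the difference between encoder-style and decoder-style attention is just a mask on the score matrix. A toy NumPy sketch, where `causal=True` zeroes out attention to future positions:

```python
import numpy as np

def attention_weights(X, causal=False):
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    if causal:
        # Decoder-style: mask future positions so token i only sees tokens <= i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)   # softmax turns -inf into 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.default_rng(0).normal(size=(4, 8))
w = attention_weights(X, causal=True)
print(np.allclose(np.triu(w, k=1), 0))   # True: upper triangle (the future) is zeroed
```

With `causal=False`, every token attends to every other token in both directions, which is the bidirectional behavior encoder models rely on.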

From Tokens to Embeddings

Transformers process text as tokens, not words. Each token gets a contextual embedding. For semantic search, we need a single vector for the entire text.

Common strategies:

CLS token: BERT prepends a special [CLS] token. Its final-layer embedding represents the whole sequence. Works but was designed for classification, not similarity.

Mean pooling: Average all token embeddings. Simple, often effective. What sentence transformers typically use.

Max pooling: Take element-wise maximum across tokens. Captures strongest signal per dimension.

Weighted pooling: Weight tokens by attention or importance. More sophisticated but requires tuning.
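The first three strategies are one line each in NumPy (toy random token embeddings; the masked mean at the end mirrors how sentence-transformers-style mean pooling ignores padding tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(6, 8))   # 6 token embeddings (index 0 = [CLS]), dim 8

cls_vec  = token_emb[0]               # CLS pooling: the first token's embedding
mean_vec = token_emb.mean(axis=0)     # mean pooling: average over all tokens
max_vec  = token_emb.max(axis=0)      # max pooling: strongest signal per dimension

# In practice sequences are padded; mask out pad tokens before averaging
mask = np.array([1, 1, 1, 1, 0, 0])   # last two positions are padding
masked_mean = (token_emb * mask[:, None]).sum(axis=0) / mask.sum()

print(cls_vec.shape, mean_vec.shape, max_vec.shape)   # all (8,)
```

Whichever strategy is used, the result is a single fixed-size vector per text, which is what a vector index stores and compares.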

Positional Encoding

Attention is permutation-invariant—it does not inherently know word order. "Dog bites man" and "man bites dog" would produce identical attention if we did not encode position.

Sinusoidal encoding (original transformer): Add sine/cosine waves of different frequencies to embeddings. Position information is injected.

Learned encoding (BERT): Learn a position embedding for each position up to maximum sequence length.

Rotary encoding (RoPE, modern): Rotate embedding dimensions by angle proportional to position. Enables better length generalization.
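The sinusoidal scheme is simple enough to write out directly, following the usual formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sine waves on even dims, cosine on odd dims, frequency falling with i."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)   # (50, 16): added to the token embeddings before the first layer
```

Each position gets a unique pattern across frequencies, so attention can distinguish "dog bites man" from "man bites dog" even though the mechanism itself is permutation-invariant.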

Why Transformers Work for Search

Contextual embeddings are powerful for semantic search because:

  1. Disambiguation: Same word, different contexts → different vectors
  2. Compositionality: Phrase meaning emerges from word interactions
  3. Long-range dependencies: Words can influence distant words
  4. Transfer learning: Pretrained on massive corpora, fine-tuned for retrieval

The attention mechanism explicitly models relationships between all words. This is exactly what we need for understanding meaning.

Key Takeaways

  • Static embeddings (Word2Vec) assign one vector per word regardless of context
  • Transformers compute contextual embeddings where meaning depends on surrounding words
  • Attention allows each word to "attend to" every other word through Query/Key/Value projections
  • Multi-head attention captures different types of relationships in parallel
  • Stacking layers captures increasingly global patterns
  • For search, we pool token embeddings into a single vector (CLS, mean, max)
  • Positional encoding provides word order information to the permutation-invariant attention