Scaled Dot-Product Attention

The core operation

From Intuition to Math

In the previous chapter, we built intuition for attention: queries looking for relevant keys, values being retrieved and combined. Now we formalize this with matrices. The math is beautiful once you see it.

Every token in our sequence has an embedding — a vector representation. We stack these into a matrix X, where each row is one token's embedding. From this single input, we create three different views through learned linear projections:

Q = XW^Q \quad K = XW^K \quad V = XW^V

These weight matrices W^Q, W^K, and W^V are learned during training. They transform the same input into three different roles:

  • Queries (Q): What each position is looking for
  • Keys (K): What each position offers to be found
  • Values (V): What each position contributes when matched

The genius is that these projections are learned. The network discovers what makes a good query, what makes positions findable, and what information should flow when attention happens.
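The three projections above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's exact numbers: the toy dimensions and the randomly initialized weight matrices are assumptions standing in for what training would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 4, 8, 8                 # 4 tokens, toy embedding dim
X = rng.standard_normal((n, d_model))     # one row per token embedding

# Learned projection matrices (randomly initialized here for illustration)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q   # what each position is looking for
K = X @ W_K   # what each position offers to be found
V = X @ W_V   # what each position contributes when matched

print(Q.shape, K.shape, V.shape)   # each (4, 8)
```

Note that all three come from the same X; only the learned weights differ.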

Interactive: Step Through Attention

Steps: 1. Input X → 2. Compute Q → 3. Compute K → 4. Compute V → 5. QKᵀ → 6. Scale → 7. Softmax → 8. Output
We start with our input embeddings X — one row per token.

X (Input Embeddings)

  The    0.30   0.60   0.10   0.70
  cat    0.54   0.51   0.37   0.68
  sat    0.72   0.27   0.34   0.63
  down   0.80  -0.00   0.05   0.55

Walk through each step to see how a simple 4-token sentence gets transformed. Notice how the input X gets projected into Q, K, V, then combined to produce contextualized outputs.

The Dot Product as Similarity

How do we measure which keys match which queries? The dot product.

When two vectors point in similar directions, their dot product is large and positive. When they are orthogonal, it is zero. When they point opposite ways, it is negative. This makes the dot product a natural measure of alignment — exactly what we need for attention.
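A tiny worked example makes the three cases concrete. The vectors here are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 0.0])
similar    = np.array([0.9, 0.1])
orthogonal = np.array([0.0, 1.0])
opposite   = np.array([-1.0, 0.0])

print(a @ similar)     # 0.9  -> large and positive: aligned directions
print(a @ orthogonal)  # 0.0  -> orthogonal: no alignment
print(a @ opposite)    # -1.0 -> opposite directions
```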

For a single query vector q_i and key vector k_j, the score is simply q_i \cdot k_j. But we need all pairwise scores between every query and every key. Matrix multiplication gives us this in one shot:

QK^T

The result is an n \times n matrix where entry (i, j) tells us how well query i matches key j. Position i can now see which other positions are relevant to it.
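The "all pairwise scores in one shot" claim is easy to check numerically. A small sketch with random Q and K (the sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

scores = Q @ K.T   # shape (n, n)
# scores[i, j] is exactly the single dot product q_i · k_j
print(scores.shape)                          # (4, 4)
print(np.isclose(scores[1, 2], Q[1] @ K[2]))  # True
```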

Interactive: Attention Weights Heatmap

(The heatmap shows attention weights for the sentence "The curious cat watched", with the same four tokens along both axes.)

Click on any cell to see the underlying computation. Notice how some words attend strongly to related words — "cat" attends to "curious" because adjectives modify nouns, "watched" attends to "cat" because verbs need their subjects.

Why Scale by √d_k?

Here is where many explanations gloss over something crucial. The original paper includes a seemingly arbitrary \sqrt{d_k} in the denominator. Why?

"We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."

Attention Is All You Need, Vaswani et al.

Let's unpack this. When d_k is large, dot products between random vectors have larger variance. If the components of a query q and a key k are independent random variables with mean 0 and variance 1, their dot product q \cdot k has mean 0 and variance d_k — the typical score grows like \sqrt{d_k}. Large dot products push softmax toward extreme values — nearly 1 for the maximum, nearly 0 for everything else.

This is a problem. When softmax saturates:

  1. Gradients vanish. The derivative of softmax approaches zero in saturated regions. No gradient means no learning.
  2. Attention becomes one-hot. Instead of blending information from multiple sources, each position fixates on exactly one other position.

Dividing by \sqrt{d_k} normalizes the variance of dot products back to roughly 1, keeping softmax in a regime where it can still learn.
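Both claims — that raw dot-product variance grows with d_k, and that scaling restores it to roughly 1 — can be checked empirically. A sketch under the paper's assumption of independent, zero-mean, unit-variance components (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 query/key pairs with independent N(0, 1) components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)        # raw dot products
scaled = raw / np.sqrt(d_k)

print(f"raw variance:    {raw.var():.1f}")     # close to d_k = 512
print(f"scaled variance: {scaled.var():.3f}")  # close to 1

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

# Unscaled scores for 5 keys saturate softmax toward one-hot;
# the scaled version stays spread out.
row = rng.standard_normal(5) * np.sqrt(d_k)
print(f"max prob, unscaled: {softmax(row).max():.3f}")
print(f"max prob, scaled:   {softmax(row / np.sqrt(d_k)).max():.3f}")
```

Scaling acts like a temperature: dividing scores by \sqrt{d_k} flattens the softmax output, which is exactly what keeps its gradients alive.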

Interactive: Softmax Saturation Demo

(The demo divides five raw scores by √64 = 8.0. With scaling on, the softmax output stays spread across the keys — k1: 16.1%, k2: 24.0%, k3: 9.8%, k4: 32.4%, k5: 17.8% — the gradient magnitude stays healthy at 0.2190, and the maximum probability is only 32.4%.)

Try it: Increase d_k with scaling OFF and watch the distribution collapse to nearly one-hot. The gradient approaches zero, making learning impossible. Turn scaling ON to see how dividing by \sqrt{d_k} keeps gradients healthy regardless of dimension.

The Complete Formula

Putting it all together, scaled dot-product attention is:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let's trace through what happens:

  1. Compute scores: QK^T gives raw attention scores — how much each query matches each key.

  2. Scale: Divide by \sqrt{d_k} to stabilize gradients.

  3. Softmax: Apply softmax row-wise. Each row now sums to 1, giving us a probability distribution over keys.

  4. Weighted sum: Multiply by V. Each output row is a weighted combination of value vectors, where weights come from attention.

The output has the same shape as the input. But now each position's representation has been enriched with information from other positions, weighted by relevance.
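The four steps above translate almost line for line into NumPy. A minimal sketch — the toy self-attention input at the end is an illustrative assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1–2: scores, then scale
    weights = softmax(scores, axis=-1)   # step 3: row-wise softmax
    return weights @ V                   # step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
X = rng.standard_normal((n, d_k))
out = attention(X, X, X)   # toy self-attention: Q = K = V = X
print(out.shape)           # (4, 8) — same shape as the input
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors — the "enriched with information from other positions" property in matrix form.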

This is what makes attention powerful. A word's meaning can now depend on its context. "Bank" in "river bank" attends to "river" and gets a different representation than "bank" in "savings bank" attending to "savings."

Key Takeaways

  • Queries, Keys, and Values are learned projections from the same input: Q = XW^Q, K = XW^K, V = XW^V
  • The dot product QK^T measures pairwise similarity between all queries and keys
  • Scaling by \sqrt{d_k} prevents softmax saturation and keeps gradients healthy
  • The complete formula \text{softmax}(QK^T / \sqrt{d_k}) \cdot V produces context-aware representations
  • Each output position is a weighted sum of values, where attention weights reflect query-key similarity