Scaled Dot-Product Attention

The core operation

From Intuition to Math

In the previous chapter, we built intuition for attention: queries looking for relevant keys, values being retrieved and combined. Now we formalize this with matrices. The math is beautiful once you see it.

Every token in our sequence has an embedding — a vector representation. We stack these into a matrix X, where each row is one token's embedding. From this single input, we create three different views through learned linear projections:

Q = XW^Q \quad K = XW^K \quad V = XW^V

These weight matrices W^Q, W^K, and W^V are learned during training. They transform the same input into three different roles:

  • Queries (Q): What each position is looking for
  • Keys (K): What each position offers to be found
  • Values (V): What each position contributes when matched

The genius is that these projections are learned. The network discovers what makes a good query, what makes positions findable, and what information should flow when attention happens.
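The three projections above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's exact numbers: the toy dimensions and the randomly initialized weight matrices are assumptions standing in for what training would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 4, 8, 8                 # 4 tokens, toy embedding dim
X = rng.standard_normal((n, d_model))     # one row per token embedding

# Learned projection matrices (randomly initialized here for illustration)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q   # what each position is looking for
K = X @ W_K   # what each position offers to be found
V = X @ W_V   # what each position contributes when matched

print(Q.shape, K.shape, V.shape)   # each (4, 8)
```

Note that all three come from the same X; only the learned weights differ.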

Interactive: Step Through Attention

Steps: 1. Input X → 2. Compute Q → 3. Compute K → 4. Compute V → 5. QKᵀ → 6. Scale → 7. Softmax → 8. Output
We start with our input embeddings X — one row per token.

X (Input Embeddings)

  The    0.30   0.60   0.10   0.70
  cat    0.54   0.51   0.37   0.68
  sat    0.72   0.27   0.34   0.63
  down   0.80  -0.00   0.05   0.55

Walk through each step to see how a simple 4-token sentence gets transformed. Notice how the input X gets projected into Q, K, V, then combined to produce contextualized outputs.

The Dot Product as Similarity

How do we measure which keys match which queries? The dot product.

When two vectors point in similar directions, their dot product is large and positive. When they are orthogonal, it is zero. When they point opposite ways, it is negative. This makes the dot product a natural measure of alignment — exactly what we need for attention.
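A tiny worked example makes the three cases concrete. The vectors here are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 0.0])
similar    = np.array([0.9, 0.1])
orthogonal = np.array([0.0, 1.0])
opposite   = np.array([-1.0, 0.0])

print(a @ similar)     # 0.9  -> large and positive: aligned directions
print(a @ orthogonal)  # 0.0  -> orthogonal: no alignment
print(a @ opposite)    # -1.0 -> opposite directions
```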

For a single query vector q_i and key vector k_j, the score is simply q_i \cdot k_j. But we need all pairwise scores between every query and every key. Matrix multiplication gives us this in one shot:

QK^T

The result is an n \times n matrix where entry (i, j) tells us how well query i matches key j. Position i can now see which other positions are relevant to it.
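The "all pairwise scores in one shot" claim is easy to check numerically. A small sketch with random Q and K (the sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

scores = Q @ K.T   # shape (n, n)
# scores[i, j] is exactly the single dot product q_i · k_j
print(scores.shape)                          # (4, 4)
print(np.isclose(scores[1, 2], Q[1] @ K[2]))  # True
```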

Interactive: Attention Weights Heatmap

(The heatmap shows attention weights for the sentence "The curious cat watched", with the same four tokens along both axes.)

Click on any cell to see the underlying computation. Notice how some words attend strongly to related words — "cat" attends to "curious" because adjectives modify nouns, "watched" attends to "cat" because verbs need their subjects.

Why Scale by √d_k?

Here is where many explanations gloss over something crucial. The original paper includes a seemingly arbitrary \sqrt{d_k} in the denominator. Why?

"We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."

Attention Is All You Need, Vaswani et al.

Let's unpack this. When d_k is large, dot products between random vectors have larger variance. If the components of a query q and a key k are independent random variables with mean 0 and variance 1, their dot product q \cdot k has mean 0 and variance d_k — the typical score grows like \sqrt{d_k}. Large dot products push softmax toward extreme values — nearly 1 for the maximum, nearly 0 for everything else.

This is a problem. When softmax saturates:

  1. Gradients vanish. The derivative of softmax approaches zero in saturated regions. No gradient means no learning.
  2. Attention becomes one-hot. Instead of blending information from multiple sources, each position fixates on exactly one other position.

Dividing by \sqrt{d_k} normalizes the variance of dot products back to roughly 1, keeping softmax in a regime where it can still learn.
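Both claims — that raw dot-product variance grows with d_k, and that scaling restores it to roughly 1 — can be checked empirically. A sketch under the paper's assumption of independent, zero-mean, unit-variance components (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 query/key pairs with independent N(0, 1) components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)        # raw dot products
scaled = raw / np.sqrt(d_k)

print(f"raw variance:    {raw.var():.1f}")     # close to d_k = 512
print(f"scaled variance: {scaled.var():.3f}")  # close to 1

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

# Unscaled scores for 5 keys saturate softmax toward one-hot;
# the scaled version stays spread out.
row = rng.standard_normal(5) * np.sqrt(d_k)
print(f"max prob, unscaled: {softmax(row).max():.3f}")
print(f"max prob, scaled:   {softmax(row / np.sqrt(d_k)).max():.3f}")
```

Scaling acts like a temperature: dividing scores by \sqrt{d_k} flattens the softmax output, which is exactly what keeps its gradients alive.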

Interactive: Softmax Saturation Demo

(The demo divides five raw scores by √64 = 8.0. With scaling on, the softmax output stays spread across the keys — k1: 16.1%, k2: 24.0%, k3: 9.8%, k4: 32.4%, k5: 17.8% — the gradient magnitude stays healthy at 0.2190, and the maximum probability is only 32.4%.)

Try it: Increase d_k with scaling OFF and watch the distribution collapse to nearly one-hot. The gradient approaches zero, making learning impossible. Turn scaling ON to see how dividing by \sqrt{d_k} keeps gradients healthy regardless of dimension.

The Complete Formula

Putting it all together, scaled dot-product attention is:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let's trace through what happens:

  1. Compute scores: QK^T gives raw attention scores — how much each query matches each key.

  2. Scale: Divide by \sqrt{d_k} to stabilize gradients.

  3. Softmax: Apply softmax row-wise. Each row now sums to 1, giving us a probability distribution over keys.

  4. Weighted sum: Multiply by V. Each output row is a weighted combination of value vectors, where weights come from attention.

The output has the same shape as the input. But now each position's representation has been enriched with information from other positions, weighted by relevance.
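The four steps above translate almost line for line into NumPy. A minimal sketch — the toy self-attention input at the end is an illustrative assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1–2: scores, then scale
    weights = softmax(scores, axis=-1)   # step 3: row-wise softmax
    return weights @ V                   # step 4: weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
X = rng.standard_normal((n, d_k))
out = attention(X, X, X)   # toy self-attention: Q = K = V = X
print(out.shape)           # (4, 8) — same shape as the input
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors — the "enriched with information from other positions" property in matrix form.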

This is what makes attention powerful. A word's meaning can now depend on its context. "Bank" in "river bank" attends to "river" and gets a different representation than "bank" in "savings bank" attending to "savings."

Key Takeaways

  • Queries, Keys, and Values are learned projections from the same input: Q = XW^Q, K = XW^K, V = XW^V
  • The dot product QK^T measures pairwise similarity between all queries and keys
  • Scaling by \sqrt{d_k} prevents softmax saturation and keeps gradients healthy
  • The complete formula \text{softmax}(QK^T / \sqrt{d_k}) \cdot V produces context-aware representations
  • Each output position is a weighted sum of values, where attention weights reflect query-key similarity