The Transformer Block

Putting pieces together

The Anatomy of a Block

We have explored attention in depth. We have seen how queries match with keys, how multiple heads learn different patterns, and how positional information lets the model know word order. Now it is time to assemble these pieces into the complete unit that makes transformers work: the transformer block.

The original "Attention Is All You Need" paper describes it clearly:

"Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network."

A transformer block has two main components working together:

  1. Multi-head attention — allows tokens to communicate with each other
  2. Feed-forward network — processes each token's information independently

But there is more to the story. The paper also introduces two critical techniques that make deep transformers trainable: residual connections and layer normalization. Without these, a 12-layer or 96-layer transformer would be nearly impossible to train.

Interactive: Transformer Block

[Interactive diagram: Input → LayerNorm → Multi-Head Attention → Add → LayerNorm → Feed-Forward Network (position-wise) → Add → Output, with residual connections around each sublayer]

Notice how data flows through the block: attention first, then feed-forward, with residual connections adding the original input back at each stage.

The Feed-Forward Network

After attention mixes information between positions, we need to process that mixed information. This is the job of the feed-forward network (FFN), sometimes called the MLP (multi-layer perceptron).

The structure is simple: two linear transformations with an activation function between them.

FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂

Modern transformers typically use GELU or SiLU instead of ReLU, but the principle remains the same. The first linear layer expands the dimension (often by 4×), the activation adds non-linearity, and the second linear layer projects back down.
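For reference, GELU is commonly implemented via its tanh approximation. A minimal sketch:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, widely used in transformer implementations."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Like ReLU, GELU passes large positive values through and suppresses large
# negative ones, but the transition around zero is smooth rather than a hard kink.
```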

Here is the key insight: the same FFN is applied to every position independently. This is what "position-wise" means. If your sequence has 512 tokens, the exact same FFN processes each of them separately, with no communication between positions at this stage.
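To make "position-wise" concrete, here is a minimal NumPy sketch (toy dimensions, random weights for illustration): applying the FFN to the whole sequence at once gives exactly the same result as applying it to each token on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 16, 3  # toy sizes; real models use e.g. 768 and 3072

# One shared set of FFN weights, used at every position
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """Position-wise FFN: expand, activate (ReLU here), project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

tokens = rng.standard_normal((seq_len, d_model))

# Applying the FFN to the whole sequence at once...
batched = ffn(tokens)
# ...matches applying it to each token separately: no cross-position mixing.
per_token = np.stack([ffn(t) for t in tokens])
assert np.allclose(batched, per_token)
```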

FFN Processing

[Interactive diagram: Input (d=4) → W₁ expands to d=16 (4× larger) → activation → W₂ projects back to d=4]

The FFN first expands to a higher dimension (typically 4× the model dimension), applies a non-linear activation like GELU, then projects back. This same transformation is applied independently to every token position.

Why this division of labor?

  • Attention moves information between positions. It answers: "What context does this token need from elsewhere?"
  • FFN transforms information within each position. It answers: "Given all this gathered context, what should this token represent now?"

Think of attention as gathering ingredients and FFN as cooking them. Each position collects the relevant information from the sequence, then the FFN processes that gathered information into a more useful representation.

Residual Connections

Here is a problem: if we stack many layers, gradients during training tend to vanish or explode. Information struggles to flow through a deep network. The solution is beautifully simple—residual connections (also called skip connections).

output = x + Sublayer(x)

Instead of replacing the input with the sublayer's output, we add them together. The original input takes a shortcut around the sublayer.

Residual Connections and Gradient Flow

[Interactive diagram: gradient flow through six layers from Output back to Input, with residual skip paths shown as dashed lines]

With residual connections, gradients have a direct path (dashed green lines) to early layers. Even with many layers, gradient strength stays high.

Why does this work so well?

During backpropagation, gradients need to flow from the output back to early layers. Without residual connections, gradients must pass through every transformation in sequence. Each layer can shrink or distort the gradient, and after many layers, the signal becomes too weak to provide useful learning.

With residual connections, gradients have a direct path. Even if the gradient through the sublayer is small, the gradient through the skip connection passes unchanged. The network can always learn at least the identity function—passing the input through untouched—and then learn useful transformations on top.
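A toy scalar calculation (purely illustrative, with a made-up per-layer gradient) shows the effect. Suppose each sublayer's local gradient is 0.1 and we stack 20 layers:

```python
n_layers, g = 20, 0.1  # hypothetical per-sublayer gradient and depth

# Without residuals: the end-to-end gradient is the product of every
# per-layer gradient, and it vanishes.
plain_grad = g ** n_layers            # 1e-20, far too small to learn from

# With residuals: each layer computes x + sublayer(x), so its local
# gradient is (1 + g); the skip path keeps the gradient at or above 1.
residual_grad = (1 + g) ** n_layers   # about 6.7

print(plain_grad, residual_grad)
```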

This is why modern transformers can be so deep. GPT-3 has 96 layers, and each layer benefits from clear gradient flow thanks to residual connections.

Layer Normalization

Even with residual connections, training deep networks can be unstable. Activations can grow very large or very small as they pass through layers. Layer normalization keeps activations in a reasonable range.

LayerNorm(x) = γ ⊙ (x − μ) / (σ + ε) + β

For each token, we normalize across its feature dimensions: compute the mean (μ) and standard deviation (σ), subtract and divide to get zero mean and unit variance, then scale and shift using learned parameters γ and β.
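A minimal NumPy sketch of this formula (function and variable names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)       # per-token mean
    sigma = x.std(axis=-1, keepdims=True)     # per-token standard deviation
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])          # one token, d=4
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# out now has (approximately) zero mean and unit variance per token
```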

The original transformer paper used post-norm—normalization after each sublayer:

```text
x -> Sublayer -> Add -> LayerNorm
```

Most modern transformers use pre-norm—normalization before each sublayer:

```text
x -> LayerNorm -> Sublayer -> Add
```

Pre-norm tends to be more stable during training, especially for very deep networks. The gradients are more predictable, and training converges more reliably.
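The two orderings can be written side by side as small higher-order functions (a sketch; `sublayer` and `norm` are placeholders for the real components):

```python
def post_norm(x, sublayer, norm):
    # Original paper: apply the sublayer, add the residual, then normalize
    return norm(x + sublayer(x))

def pre_norm(x, sublayer, norm):
    # Modern practice: normalize first, apply the sublayer, then add the residual
    return x + sublayer(norm(x))
```

Note that in pre-norm the residual path carries the raw, unnormalized input all the way through, which is part of why gradients behave more predictably.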

The Complete Block

Putting it all together, here is the data flow through a single transformer block (using modern pre-norm):

  1. Input arrives — token representations from the previous layer (or from embeddings for the first layer)
  2. First normalization — LayerNorm stabilizes the activations
  3. Multi-head attention — tokens attend to each other, gathering context
  4. First residual addition — add the original input back
  5. Second normalization — another LayerNorm before FFN
  6. Feed-forward network — each position is processed independently
  7. Second residual addition — add the pre-FFN input back
  8. Output — refined representations ready for the next layer

In code, this looks like:

```python
def transformer_block(x):
    # Attention sublayer with residual (pre-norm: normalize before the sublayer)
    normed = layer_norm(x)
    attended = multi_head_attention(normed)
    x = x + attended

    # FFN sublayer with residual
    normed = layer_norm(x)
    fed = feed_forward(normed)
    x = x + fed

    return x
```

Each block refines the representations. The first blocks might learn basic patterns—which words are nouns, which are verbs. Middle blocks might capture syntactic structure—subjects, objects, relationships. Deeper blocks develop more abstract semantic understanding.

By stacking many such blocks, transformers build increasingly sophisticated representations of language.
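The stacking pattern itself is just a loop. A self-contained toy sketch, with a single hypothetical sublayer standing in for attention plus FFN:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_layers = 8, 4, 6  # toy sizes

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

# One hypothetical sublayer per block, standing in for attention + FFN
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_layers)]

x = rng.standard_normal((n_tokens, d))
for W in weights:
    # Pre-norm block: normalize, transform, add the residual back
    x = x + np.tanh(layer_norm(x) @ W)
```

Each pass through the loop leaves the representation shape unchanged, which is exactly what lets blocks stack to arbitrary depth.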

Key Takeaways

  • A transformer block has two main components: multi-head attention and a feed-forward network (FFN)
  • Attention mixes information between positions; FFN processes each position independently
  • Residual connections allow gradients to flow directly through skip paths, enabling very deep networks
  • Layer normalization stabilizes activations, preventing them from growing too large or small
  • Modern transformers use pre-norm (normalize before sublayers) for more stable training
  • Each block refines representations, building from local patterns to abstract understanding