The Decoder Stack

Generating output autoregressively

The Decoder's Dual Role

The encoder understands input. The decoder generates output. In a sequence-to-sequence transformer (like the original translation model), the decoder takes the encoder's understanding and produces new tokens, one at a time.

The original paper describes the decoder's unique structure:

"In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack."

A decoder block has three sub-layers instead of two:

  1. Masked self-attention — the decoder attends to its own generated tokens (with masking)
  2. Cross-attention — the decoder attends to the encoder's output
  3. Feed-forward network — same position-wise processing as the encoder
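The three sub-layers can be sketched in NumPy. This is a minimal single-head illustration, not the full architecture: the learned Q/K/V projection matrices, multi-head splitting, layer normalization, and dropout are all omitted, and the function names are my own.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    # Scaled dot-product attention; the causal flag adds the triangular mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        mask = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(mask, scores, -1e9)  # future positions effectively -inf
    return softmax(scores) @ v

def decoder_block(x, enc_out, w1, w2):
    # Sub-layer 1: masked self-attention over the decoder's own tokens
    x = x + attention(x, x, x, causal=True)
    # Sub-layer 2: cross-attention; queries from decoder, keys/values from encoder
    x = x + attention(x, enc_out, enc_out)
    # Sub-layer 3: position-wise feed-forward network (ReLU between two linears)
    x = x + np.maximum(0, x @ w1) @ w2
    return x
```

Note that the output keeps the decoder's sequence length even though cross-attention reads from an encoder output of a different length.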

Interactive: Decoder Block

[Interactive diagram: a decoder block. Generated tokens flow through masked self-attention (which cannot see the future), then cross-attention (which queries the encoder output for keys and values), then a feed-forward network, each followed by a residual add, ending in output probabilities. In the attention-mask display, blue cells mark positions that can be attended to.]

Hover over each component to learn what it does. Notice the key difference from the encoder: there are two attention mechanisms. The first attends to the decoder's own tokens, masked so it cannot peek ahead. The second attends to the encoder's output, where the input information lives.

Masked Self-Attention

The decoder generates tokens one at a time. When predicting the third token, it should only see tokens 1 and 2—not tokens 4, 5, or 6. We enforce this with masking.

The attention mask is triangular: position 1 can only attend to position 1. Position 2 can attend to positions 1 and 2. Position 3 can attend to positions 1, 2, and 3. And so on.

$$\text{Mask} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$

Where there is a 0, the attention score is set to $-\infty$ before the softmax. This makes those positions contribute zero weight. The future is invisible.
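The masking step can be demonstrated directly: build the triangular mask, set the masked scores to negative infinity, and observe that every strictly-future position gets exactly zero weight after the softmax. A minimal NumPy sketch, with random numbers standing in for real attention scores:

```python
import numpy as np

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))              # raw attention scores
mask = np.tril(np.ones((T, T), dtype=bool))   # 1s on and below the diagonal
scores = np.where(mask, scores, -np.inf)      # 0 entries in the mask -> -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

print(np.triu(weights, k=1))   # strictly-future weights: all zeros
print(weights.sum(axis=-1))    # each row still sums to 1
```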

Why is this necessary? During training, we feed the entire target sequence to the decoder at once (for efficiency). Without masking, the decoder could cheat—it could simply copy the answer from future positions instead of learning to predict. The mask forces it to predict each token using only the past.

During generation, the mask is somewhat redundant—we literally do not have future tokens yet. But the same trained model works in both scenarios because the mask was there during training.

Cross-Attention to Encoder

Here is where the encoder's output becomes useful. In cross-attention:

  • Queries come from the decoder (the current state of generation)
  • Keys and Values come from the encoder (the encoded input)

Each decoder position asks: "What information from the input is relevant to me right now?"

For translation, imagine generating "El gato se sentó" from "The cat sat." When the decoder is about to generate "gato," it needs to know that the source said "cat." Cross-attention lets the decoder look at the encoder's representation of "cat" and gather that information.

The formula is the same scaled dot-product attention:

$$\text{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\left(\frac{Q_{\text{dec}} K_{\text{enc}}^T}{\sqrt{d_k}}\right) V_{\text{enc}}$$

This is how information flows from input to output. The encoder creates rich representations of the source. The decoder queries those representations to decide what to generate next.
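The formula above can be sketched in a few lines of NumPy. The dimensions are made up for illustration: 4 decoder positions querying 7 encoder positions. Note there is no mask here; the source sequence is fully known, so every decoder position may look at every encoder position.

```python
import numpy as np

def cross_attention(q_dec, k_enc, v_enc):
    d_k = q_dec.shape[-1]
    scores = q_dec @ k_enc.T / np.sqrt(d_k)   # (dec_len, enc_len)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax over encoder positions
    return w @ v_enc                          # one output per decoder position

rng = np.random.default_rng(0)
dec = rng.normal(size=(4, 8))   # 4 tokens generated so far
enc = rng.normal(size=(7, 8))   # 7 encoded source tokens
out = cross_attention(dec, enc, enc)
print(out.shape)  # (4, 8): decoder length, information gathered from the source
```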

Autoregressive Generation

Here is the key point: generation is a loop.

The decoder does not produce the entire output at once. It generates one token, then uses that token as input to generate the next one. Each step builds on all previous steps.

Interactive: Autoregressive Generation

Source (Encoder Input):Hello worldDecoder Output (Generated):?Next Token Probabilities:revoir15%le12%<end>10%Bonjour7%au7%Select highestprobability →"<start>"Step 0 / 5

Each step generates one token by:

  1. Processing all previous tokens through the decoder
  2. Computing attention to both previous output and encoder
  3. Predicting a probability for each vocabulary token
  4. Selecting the highest probability token (or sampling)
  5. Adding it to the sequence and repeating

The process works like this:

  1. Start with a special token (often <start> or <bos> for "beginning of sequence")
  2. Predict the probability distribution over all possible next tokens
  3. Sample (or select) the most likely token
  4. Append that token to the sequence
  5. Repeat until we generate an end token or reach maximum length
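The loop above can be sketched as plain Python. Here `decode_step` is a stand-in for a full decoder forward pass; the toy model at the bottom, which just predicts the next integer, exists only to make the sketch runnable.

```python
import numpy as np

def generate(decode_step, bos_id, eos_id, max_len=20):
    """Greedy autoregressive decoding. `decode_step` maps the tokens so far
    to a probability distribution over the vocabulary."""
    tokens = [bos_id]                    # 1. start with <bos>
    for _ in range(max_len):
        probs = decode_step(tokens)      # 2. distribution over the next token
        next_id = int(np.argmax(probs))  # 3. greedy: pick the most likely
        tokens.append(next_id)           # 4. append it to the sequence
        if next_id == eos_id:            # 5. stop on the end token
            break
    return tokens

# Toy "model": predicts (last token + 1) over a vocab of 5, with eos = 4
toy = lambda toks: np.eye(5)[(toks[-1] + 1) % 5]
print(generate(toy, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

Swapping `np.argmax` for sampling from `probs` (e.g. with `rng.choice`) turns greedy decoding into stochastic sampling.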

At each step, the decoder has access to:

  • All previously generated tokens (through masked self-attention)
  • The entire encoder output (through cross-attention)

This autoregressive nature is why generation is slow. Each new token requires a full forward pass through the decoder. For a 100-token response, that is 100 sequential forward passes. This is fundamentally different from the encoder, which processes all tokens in parallel.

Modern techniques like KV caching speed this up by storing previous attention computations, but the sequential nature remains. The decoder must wait for token 1 before generating token 2.
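The caching idea can be sketched as follows. Each new token computes its key and value once and appends them to a cache; attention for the new token then reads the whole cache, and earlier tokens' keys and values are never recomputed. This is a conceptual single-head sketch with invented names, not any library's actual API.

```python
import numpy as np

def attend_with_cache(x_new, cache, w_q, w_k, w_v):
    q = x_new @ w_q
    cache["k"].append(x_new @ w_k)   # computed once per token, then reused
    cache["v"].append(x_new @ w_v)
    K = np.stack(cache["k"])         # keys for every token seen so far
    V = np.stack(cache["v"])
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                     # attention output for the new token only

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"k": [], "v": []}
for step in range(3):                # three sequential generation steps
    out = attend_with_cache(rng.normal(size=d), cache, w_q, w_k, w_v)
print(len(cache["k"]))  # 3: one cached key per generated token
```

Causal masking is implicit here: the cache only ever contains past tokens, so there is nothing in the future to attend to.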

Decoder-Only Models

Not all transformers have encoders. Decoder-only models like GPT use just the decoder stack.

Without an encoder, there is no cross-attention—only masked self-attention and FFN. The input (prompt) and output (completion) are concatenated into a single sequence. The model predicts each token based on all previous tokens, whether they came from the user or were generated.

```text
[User: What is 2+2?] [Assistant: The answer is 4.]
 ↑ input (masked)     ↑ generated autoregressively
```

The masking still applies: when generating "The answer is 4," the model cannot see ahead to "4" while generating "is."

Decoder-only architectures dominate modern language models (GPT-4, Claude, Llama) because:

  • Simpler architecture (no encoder/decoder separation)
  • Unified handling of input and output
  • Natural fit for chat and completion tasks

Encoder-decoder models (like T5 or the original transformer) still excel at tasks with clear input/output separation, like translation or summarization.

Key Takeaways

  • The decoder generates output autoregressively—one token at a time, each conditioned on previous tokens
  • Decoder blocks have three sub-layers: masked self-attention, cross-attention, and FFN
  • Masked self-attention uses a triangular mask to prevent attending to future positions
  • Cross-attention allows the decoder to query the encoder's representations
  • Generation is a sequential loop: predict next token, append, repeat
  • Decoder-only models (GPT, Claude) skip the encoder and use only masked self-attention
  • The autoregressive nature makes generation slower than encoding, but KV caching helps