GPT Deep Dive

Autoregressive generation

GPT's Key Insight

GPT stands for Generative Pre-trained Transformer. Where BERT uses only the encoder, GPT uses only the decoder—but with a crucial modification.

The original transformer decoder included cross-attention to the encoder's output. GPT removes this entirely. What remains is a stack of layers with causal self-attention: each token can only attend to tokens that came before it.

GPT Architecture

[Figure: GPT decoder-only architecture. Input tokens ("The", "cat", "sat", "on", "mat") pass through token + position embeddings into a transformer decoder of 12 layers with causal self-attention; each position outputs a probability distribution over the next token, P(t2) through P(t6).]

Each token can only attend to itself and previous tokens—never to future tokens.

This constraint makes GPT an autoregressive model. It generates text one token at a time, each new token conditioned only on what came before. The same architecture works for both training and generation—a beautiful simplicity that proved remarkably powerful.

Unlike BERT, which processes text bidirectionally, GPT always moves left to right. When processing "The cat sat," the word "sat" can attend to "The" and "cat," but neither "The" nor "cat" can see "sat." This mimics how humans write: we can only use what we've written so far to decide what comes next.
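The rule is simple enough to state in one line of code: a query at position i may attend to a key at position j only when j ≤ i (a minimal sketch, assuming 0-based positions):

```python
def can_attend(query_pos, key_pos):
    """Causal rule: a token may attend only to itself and earlier positions."""
    return key_pos <= query_pos

# "sat" (position 2) can see "The" (0) and "cat" (1)...
print(can_attend(2, 0), can_attend(2, 1))  # True True
# ...but "The" (position 0) cannot see "sat" (position 2).
print(can_attend(0, 2))  # False
```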

Next Token Prediction

GPT's training objective could not be simpler: predict the next token.

Given a sequence of tokens, the model learns to predict what comes next. For the input "The quick brown," GPT should predict "fox" (or at least assign it high probability).

Training: Given "The quick brown" → Predict "fox"
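In practice, one sentence yields a training pair at every position: the targets are simply the inputs shifted by one token. A minimal sketch:

```python
tokens = ["The", "quick", "brown", "fox"]

inputs  = tokens[:-1]   # what the model sees
targets = tokens[1:]    # what it must predict at each position

for i in range(1, len(tokens)):
    context = " ".join(tokens[:i])
    print(f"{context!r} -> predict {tokens[i]!r}")
# 'The quick brown' -> predict 'fox'
```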

The key point: this deceptively simple objective is remarkably effective.

To predict well, the model must learn:

  • Grammar: "The quick brown ___" calls for a noun, not a verb
  • Facts: Brown foxes are common in stories and language
  • Common sense: Animals jump, inanimate objects usually do not
  • Style: Formal text differs from casual conversation
  • Reasoning: Later models can even perform multi-step inference

All of this emerges from learning to predict the next word. No labeled data, no task-specific training—just prediction.

The Causal Mask

The mathematical heart of GPT is the causal attention mask.

Causal Attention Mask

The 6×6 mask for "The cat sat on the mat" (rows are queries—who is attending; columns are keys—what is attended to; 1 = allowed, 0 = masked):

           The  cat  sat  on   the  mat
    The     1    0    0    0    0    0
    cat     1    1    0    0    0    0
    sat     1    1    1    0    0    0
    on      1    1    1    1    0    0
    the     1    1    1    1    1    0
    mat     1    1    1    1    1    1

The triangular pattern ensures tokens only see the past, never the future.

In standard attention, every position can attend to every other position. GPT's mask is triangular: position 0 sees only itself, position 1 sees positions 0-1, position 4 sees positions 0-4, and so on. This creates a strict left-to-right information flow and prevents "cheating" during training by peeking at future tokens.

The elegance: during training, GPT processes entire sequences in parallel—all positions compute simultaneously—but the mask ensures each position only "sees" what it should. At generation time, this same architecture produces tokens one at a time, naturally extending the sequence.
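A minimal sketch of the mask in plain Python (here blocked positions contribute zero to the softmax sum, which is equivalent to adding −∞ to their scores before the softmax):

```python
import math

def causal_mask(n):
    """Triangular mask: row i (the query) may attend to columns 0..i (the keys)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Blocked positions contribute nothing, so they get exactly zero weight."""
    out = []
    for row, allowed_row in zip(scores, mask):
        exps = [math.exp(s) if allowed else 0.0
                for s, allowed in zip(row, allowed_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With uniform raw scores, each position spreads weight evenly over its visible past.
weights = masked_softmax([[0.5] * 3 for _ in range(3)], causal_mask(3))
print(weights[0])  # position 0 attends only to itself: [1.0, 0.0, 0.0]
```

Because the mask is applied inside a single matrix operation, all rows are computed in parallel during training, yet each row respects its own horizon.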

Why Decoder-Only Won

The original transformer used both encoder and decoder. BERT kept only the encoder. GPT kept only the decoder. Why did the decoder-only approach scale so successfully?

Simplicity. With one component instead of two, there are fewer moving parts. The same architecture, same objective, same code works for pre-training and every downstream task.

Efficient scaling. Compute scales with sequence length and model size. A single stack is easier to parallelize across GPUs than an encoder-decoder pair.

Unified objective. Pre-training uses next-token prediction. Fine-tuning uses next-token prediction. Generation uses next-token prediction. Everything aligns.

Emergent generality. By scaling up, models learn to perform diverse tasks without task-specific training—just by continuing to predict the next token in various contexts.

The encoder-decoder architecture is not dead; it excels at tasks like translation where distinct input and output sequences exist. But for general-purpose language modeling, the decoder-only design proved easier to scale and surprisingly general.

The GPT Family

GPT evolved rapidly from research curiosity to global phenomenon.

GPT-1 (2018): 117 million parameters. OpenAI's proof of concept that pre-training followed by fine-tuning could work across diverse NLP tasks. The model was notable but not world-changing.

GPT-2 (2019): 1.5 billion parameters. A 10× scale-up that produced remarkably coherent text. OpenAI initially withheld the full model, citing concerns about misuse—the first time a language model made headlines for being "too dangerous to release."

GPT-3 (2020): 175 billion parameters. Another 100× scale-up. GPT-3 demonstrated emergent abilities: it could perform tasks without any fine-tuning, simply by being prompted with examples. This was the model that convinced the world that something fundamental had changed.

GPT-4 (2023): Multimodal, rumored to use a mixture of experts with 1.7 trillion total parameters. GPT-4 set new benchmarks across almost every evaluation, including passing professional exams at human expert levels.

Each generation showed that scale unlocks capabilities. More parameters, more data, more compute—and suddenly the model does things its predecessors could not.

In-Context Learning

GPT-3 revealed a surprising capability: in-context learning.

Without updating any weights, the model can "learn" new tasks from examples provided in the prompt.


Few-shot: Provide several examples in the prompt.

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese => fromage
plush giraffe =>

The model completes with "girafe en peluche" despite never being explicitly trained on this format.

One-shot: Provide a single example.

Zero-shot: Provide no examples—just describe the task.
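These prompt formats are just string assembly; a minimal sketch of a few-shot prompt builder, using the translation pairs shown above:

```python
def few_shot_prompt(task, examples, query):
    """Build a few-shot prompt: task description, demonstrations, then the query."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # the model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "plush giraffe",
)
print(prompt)
```

One-shot passes a single pair in `examples`; zero-shot passes an empty list, leaving only the task description and the query.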

This capability emerged from scale. Smaller models cannot do it reliably. But GPT-3 and beyond can adapt to new patterns from just a few demonstrations, without any gradient updates.

The mechanism is still debated. Some argue the model is "learning" in the prompt; others say it is retrieving patterns from its vast training data. Regardless, the practical effect transformed how we interact with language models: instead of fine-tuning, we can often just explain what we want.

Token-by-Token Generation

How does GPT actually generate text? The process is surprisingly simple.

Example: generating from the prompt "Once upon a". At step 1 of 5, the model outputs a probability distribution over next tokens:

    time     72%
    day      12%
    night     8%
    moment    5%

Selected: "time". GPT generates "Once upon a time, there was a" one token at a time: at each step, the model predicts a probability distribution and samples the next token.

  1. Start with a prompt: "Once upon a"
  2. Feed the prompt through the model
  3. The model outputs a probability distribution over all tokens
  4. Sample from this distribution (or pick the most likely token)
  5. Append the sampled token to the sequence
  6. Repeat from step 2 with the extended sequence

Each step adds one token. The model never "plans ahead"—it just predicts the next token, over and over. Yet from this simple loop emerge stories, code, poems, and conversations.
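The six steps above can be sketched as a short loop. Here `next_token_probs` is a hypothetical stand-in for the model (a real implementation would run the whole sequence through the network at this point):

```python
import random

def next_token_probs(sequence):
    """Stand-in for the model: a toy distribution over a tiny vocabulary."""
    if sequence[-1] == "a":
        return {"time": 0.72, "day": 0.12, "night": 0.08, "moment": 0.05, ",": 0.03}
    return {",": 0.5, "there": 0.3, "was": 0.2}

def generate(prompt_tokens, n_steps, greedy=True):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = next_token_probs(tokens)           # step 2-3: predict distribution
        if greedy:
            token = max(probs, key=probs.get)      # step 4: pick the most likely
        else:
            token = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(token)                       # step 5: extend, then repeat
    return tokens

print(generate(["Once", "upon", "a"], 1))  # ['Once', 'upon', 'a', 'time']
```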

Temperature controls randomness. Temperature 0 always picks the most likely token (greedy decoding). Higher temperatures make the distribution more uniform, encouraging diversity. Too high, and the output becomes nonsensical.

Top-k and top-p (nucleus) sampling restrict choices to the most likely candidates, balancing coherence and creativity.
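Temperature and top-k can be sketched together in a few lines (a minimal illustration, not how production decoders are implemented; top-p would instead keep the smallest set of tokens whose cumulative probability exceeds p):

```python
import math
import random

def sample(probs, temperature=1.0, top_k=None):
    """Temperature rescales log-probabilities; top-k keeps only the k best tokens."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]                  # restrict to the k most likely
    if temperature == 0:
        return items[0][0]                     # greedy decoding
    logits = [math.log(p) / temperature for _, p in items]
    exps = [math.exp(l) for l in logits]       # renormalize after rescaling
    weights = [e / sum(exps) for e in exps]
    return random.choices([t for t, _ in items], weights=weights)[0]

probs = {"time": 0.72, "day": 0.12, "night": 0.08, "moment": 0.05}
print(sample(probs, temperature=0))            # always "time"
```

Raising the temperature flattens the rescaled distribution, so rarer tokens like "moment" are chosen more often; lowering it sharpens the distribution toward "time".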

Key Takeaways

  • GPT is a decoder-only transformer using causal (left-to-right) attention
  • Next-token prediction is a simple objective that teaches grammar, facts, and reasoning
  • The causal mask ensures tokens cannot see future tokens during training or generation
  • Decoder-only architectures proved easier to scale than encoder-decoder
  • Each GPT generation demonstrated new emergent capabilities from scale
  • In-context learning allows task adaptation without fine-tuning, using only prompt examples
  • Generation works by repeatedly predicting and appending the next token