How LLMs Work

Inside a modern language model

You type a question. Moments later, coherent text appears. The response seems to understand context, follows instructions, and produces fluent language. But what is actually happening inside the model?

This chapter pulls back the curtain on the inference process—the journey from your input to the model's output.

The Inference Pipeline

When you send text to a language model, it passes through a series of transformations. Each step converts the input into a form the next step can process.

Let us walk through each stage:

1. Tokenization: Your text is split into tokens—subword units the model understands. "Hello world" might become ["Hello", " world"]. Unusual words get split further: "tokenization" becomes something like ["token", "ization"].

2. Embedding: Each token is converted into a vector—a list of numbers that represents its meaning. These vectors live in a high-dimensional space where similar tokens are close together.

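3. Position Encoding: Since transformers process all tokens in parallel, they need explicit position information. Position encodings tell the model which token comes first, second, third. In the original transformer design they are added to the embeddings; some newer architectures, such as those using rotary position embeddings, inject position inside the attention computation instead.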

4. Transformer Layers: The embedded, positioned tokens pass through many transformer blocks. Each block applies attention (tokens looking at other tokens) and feed-forward networks (processing each token independently). Layer by layer, the representations become richer.

5. Output Projection: The final representations are projected to vocabulary size—a score for every possible next token. These raw scores are called logits.

6. Sampling: The logits are converted to probabilities (via softmax), and a token is selected. This token is added to the sequence, and the process repeats for the next token.
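
The six stages above can be sketched end to end in a few lines of Python. This is a toy under loud assumptions: the five-entry vocabulary, the sine-based "weights", the averaging function standing in for the transformer layers, and every function name are invented for illustration, and reusing the embedding table for the output projection is only one possible design. Nothing here reflects a trained model.

```python
import math

# Toy vocabulary (an assumption); real models learn tens of thousands of subword tokens.
VOCAB = ["Hello", " world", "token", "ization", "!"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
DIM = 4  # embedding width; real models use thousands of dimensions

# Made-up deterministic "weights"; in a real model training sets these.
EMBED = [[math.sin(i * DIM + j) for j in range(DIM)] for i in range(len(VOCAB))]

def tokenize(text):
    """Stage 1: greedy longest-match split into known tokens (a stand-in for BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        match = max((t for t in VOCAB if text.startswith(t, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches at position {pos}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids

def embed_with_positions(ids):
    """Stages 2-3: look up each token vector and add a sinusoidal position signal."""
    return [[EMBED[t][j] + math.sin(p / (10 ** j)) for j in range(DIM)]
            for p, t in enumerate(ids)]

def mix(vectors):
    """Stage 4 stand-in: each position averages itself and all earlier positions."""
    return [[sum(v[j] for v in vectors[:p + 1]) / (p + 1) for j in range(DIM)]
            for p in range(len(vectors))]

def logits_for_next_token(hidden):
    """Stage 5: score every vocabulary entry against the last position's vector."""
    last = hidden[-1]
    return [sum(a * b for a, b in zip(last, EMBED[i])) for i in range(len(VOCAB))]

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(ids):
    return softmax(logits_for_next_token(mix(embed_with_positions(ids))))

def generate(text, steps=3):
    """Stage 6: pick a token (greedily here), append it, and repeat."""
    ids = tokenize(text)
    for _ in range(steps):
        probs = next_token_probs(ids)
        ids.append(max(range(len(VOCAB)), key=probs.__getitem__))
    return "".join(VOCAB[i] for i in ids)

print(tokenize("tokenization"))
print(generate("Hello world"))
```

The generated continuation is meaningless because the weights are arbitrary; the point is only the shape of the loop, where each new token is fed back in as input for the next prediction.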

LLMs Do Not "Know" Things

Here is a crucial insight about how language models work: LLMs predict likely next tokens. That is all.

When GPT appears to "know" that Paris is the capital of France, it is not retrieving from a database. It is not looking up facts. It is simply predicting that "Paris" is the most likely continuation after "The capital of France is."

This prediction ability comes from patterns learned during training. The model saw "The capital of France is Paris" countless times. It learned the statistical association. But there is no separate knowledge store, no fact retrieval system.
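
As a rough analogy, a tiny count-based predictor shows how "Paris" can fall out of statistics alone. Everything here is hypothetical: the three-sentence corpus, the two-word context, and the `predict_next` helper. A real LLM does not keep a table like `counts`; it compresses far richer statistics into its weights, so treat this only as a sketch of the idea.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training data".
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of italy is rome",
]

# "Training": count which word follows each two-word context.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        counts[(words[i - 2], words[i - 1])][words[i]] += 1

def predict_next(prompt):
    """Return (word, probability) pairs for the next word, most likely first."""
    context = tuple(prompt.split()[-2:])
    following = counts[context]
    total = sum(following.values())
    return [(word, n / total) for word, n in following.most_common()]

print(predict_next("the capital of france is"))
print(predict_next("the capital of"))
```

The predictor answers "paris" only because that continuation was frequent in its data. Given a context it never saw, it returns nothing useful, and given wrong data it would be just as confident in the wrong answer.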

The key point: Knowledge is implicit in the weights. Everything the model "knows" is encoded as patterns in billions of parameters. These patterns influence which tokens are likely in which contexts.

This explains some quirks of LLMs:

  • They can be confidently wrong (the training data was wrong, or the pattern they matched does not apply)
  • They cannot reliably distinguish what they "know" from what they are guessing
  • They have no built-in awareness of when their training data ended
  • They can produce plausible-sounding nonsense

Pattern completion that looks like knowing is not the same as actually knowing. The model is a sophisticated autocomplete, not an oracle.

Temperature and Sampling

The model outputs a probability distribution over possible next tokens. How do we choose which token to generate?

Temperature controls randomness. It scales the logits before converting to probabilities:

$$P(\text{token}_i) = \frac{e^{\text{logit}_i / T}}{\sum_j e^{\text{logit}_j / T}}$$

  • Temperature = 0: Always pick the highest-probability token (deterministic, often repetitive). The formula is undefined at T = 0, so implementations treat this case as greedy selection.
  • Temperature = 1: Sample according to the original distribution
  • Temperature > 1: Flatten the distribution, increase randomness
  • Temperature < 1: Sharpen the distribution, decrease randomness
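
The effect of temperature can be checked numerically. Below is a minimal sketch, assuming a made-up three-token logit vector; the function name `softmax_with_temperature` is illustrative, and treating T = 0 as a one-hot greedy pick is a convention rather than something the formula itself defines.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Convention: T = 0 means greedy decoding, so all mass goes to the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running this shows the top token's share shrinking as the temperature rises, while the probabilities always sum to 1.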

Low temperature sharpens the distribution, making the model more deterministic. High temperature flattens it, increasing randomness.

Other sampling strategies provide finer control:

Top-k sampling: Only consider the k highest-probability tokens. If k=50, tokens ranked 51st and below are never selected, regardless of their probability.
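
A minimal sketch of top-k filtering, assuming the input is already a list of probabilities. The function name `top_k_filter` and the choice to renormalize the survivors so they sum to 1 are illustrative assumptions.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical four-token distribution; only the top two survive.
print(top_k_filter([0.5, 0.3, 0.15, 0.05], k=2))
```

Note that the cutoff depends only on rank: the third token is dropped here even though it holds 15% of the probability mass.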

Top-p (nucleus) sampling: Keep the smallest set of top-ranked tokens whose cumulative probability reaches p. If p=0.9, keep adding tokens (in descending probability order) until they sum to at least 90%, then sample from that set. This adapts to the distribution: when the model is confident, fewer tokens are considered.
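
The adaptive behavior is easy to see in code. This sketch assumes a list of probabilities as input and stops as soon as the running total reaches p; the name `top_p_filter`, the "greater than or equal" stopping rule, and the example distributions are illustrative choices.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    return [probs[i] / cumulative if i in kept else 0.0 for i in range(len(probs))]

confident = [0.92, 0.05, 0.02, 0.01]  # hypothetical: model is sure of one token
uncertain = [0.3, 0.3, 0.2, 0.2]      # hypothetical: model is unsure
print(sum(1 for q in top_p_filter(confident, 0.9) if q > 0))  # tokens kept
print(sum(1 for q in top_p_filter(uncertain, 0.9) if q > 0))  # tokens kept
```

With the same p, the confident distribution keeps a single candidate while the uncertain one keeps all four.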

Min-p sampling: Only consider tokens with probability at least p times the highest probability. This filters out unlikely tokens while adapting to confidence.
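
A minimal sketch of min-p filtering under the definition above, again assuming probabilities as input. The function name `min_p_filter` and the renormalization step are assumptions made for illustration.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [q if q >= threshold else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept]

# Hypothetical distribution: the threshold is 0.1 * 0.6 = 0.06,
# so the two 0.05 tokens are removed.
print(min_p_filter([0.6, 0.3, 0.05, 0.05], min_p=0.1))
```

Because the threshold scales with the top probability, a flat distribution keeps many candidates while a peaked one keeps few.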

In practice, many APIs let you combine these: temperature adjusts randomness, top-p or top-k removes unlikely tokens, and together they produce coherent but varied text.
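
One way such a combination might look is sketched below: temperature scaling first, then top-p filtering, then a weighted random draw. The function name `sample_next_token`, its default values, and the order of operations are assumptions; real inference libraries differ in how they order and combine these steps.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Pick a token index using temperature scaling followed by top-p filtering."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    rng = rng or random.Random()
    if temperature == 0:
        # Greedy decoding: no randomness at all.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest high-probability set reaching top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices normalizes the weights, so no explicit renormalization is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)  # seeded only to make the demo repeatable
print([sample_next_token([2.0, 1.5, 0.5, -1.0], rng=rng) for _ in range(10)])
```

Passing a seeded `random.Random` makes runs repeatable, which helps when comparing settings.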

The Chat Format

Modern LLMs are trained to follow a specific conversation format. Understanding this format helps you use them effectively.

A typical chat conversation is built from three kinds of messages:

System Prompt: Instructions that set behavior for the entire conversation. "You are a helpful coding assistant" or "Respond only in JSON format." The model sees this first and treats it as context for everything that follows.

User Messages: Your inputs. Questions, instructions, or context you provide.

Assistant Messages: The model's responses. In multi-turn conversations, the model sees its own previous responses and continues accordingly.

Under the hood, this is all concatenated into one sequence:

<|system|>You are a helpful assistant.<|end|>
<|user|>What is 2+2?<|end|>
<|assistant|>2+2 equals 4.<|end|>
<|user|>And 3+3?<|end|>
<|assistant|>

The model generates tokens to continue after that final <|assistant|> marker.
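
The concatenation step can be sketched as a small function. The `<|role|>` and `<|end|>` markers are copied from the example above and are only illustrative: each model family defines its own special tokens. The name `render_chat` and its `add_generation_prompt` flag are hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} messages into one prompt string."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to continue.
        parts.append("<|assistant|>")
    return "\n".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
print(render_chat(conversation))
```

Each new turn means appending a message to the list and rendering the whole thing again, which is why the model always sees the full history.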

Multi-turn conversations work because the entire history is included in the context. The model does not "remember" previous turns—it sees them all as input and predicts what should come next.

This has implications:

  • Longer conversations use more tokens (and more compute)
  • The model might "forget" details from early in a long conversation if they scroll out of the context window
  • You can steer behavior by including examples in the conversation history

Key Takeaways

  • Inference is a pipeline: tokenize → embed → add positions → transform → project → sample
  • LLMs predict likely next tokens—knowledge is patterns in weights, not stored facts
  • Temperature controls randomness: 0 is deterministic, higher values increase variety
  • Top-k and top-p sampling filter which tokens can be selected
  • Chat models use structured formats with system, user, and assistant roles
  • Multi-turn conversations include full history—the model does not have persistent memory