The Inference Pipeline
Making generation fast
Generating text one token at a time sounds simple. But when each token requires a full forward pass through a model with billions of parameters, the compute adds up quickly. Making inference fast and memory-efficient is an engineering challenge that shapes how we deploy language models.
The Cost of Generation
Every time the model generates a token, it must run a complete forward pass. The input—everything generated so far—flows through every layer, every attention head, every feed-forward network.
For a model like GPT-3 with 175 billion parameters:
- Each token requires approximately 350 billion floating-point operations (roughly two per parameter)
- Generating 100 tokens means 100 forward passes
- That is 35 trillion operations just for a short paragraph
This explains why language model APIs charge by the token. Each token costs compute. Longer responses cost more.
But the real issue is not just total compute—it is sequential compute. You cannot generate token 5 until you have token 4. You cannot start the 5th forward pass until the 4th completes. This serial dependency limits parallelism and creates latency.
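To make the serial dependency concrete, here is a toy autoregressive loop in plain Python. `next_token` is a stand-in for a full model forward pass (a hypothetical dummy, not a real model); the point is that each call depends on the previous call's output, so the steps cannot run in parallel:

```python
def next_token(context):
    # Dummy "model": the next token is the sum of the context, mod 100.
    # A real model would run a full forward pass here.
    return sum(context) % 100

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # Each forward pass consumes everything generated so far,
        # and cannot start until the previous token exists.
        tokens.append(next_token(tokens))
    return tokens

print(generate([1, 2, 3], 4))  # [1, 2, 3, 6, 12, 24, 48]
```

Each iteration waits on the one before it, which is exactly why latency grows linearly with output length no matter how much parallel hardware is available.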
KV Caching
Here is the insight that makes modern inference practical: when generating token N, the Keys and Values for tokens 1 through N-1 do not change.
In attention, each token computes:
- Query (Q): What am I looking for?
- Key (K): What do I contain that others might want?
- Value (V): What information do I contribute?
When we generate a new token, only that token's Q, K, V need to be computed. The K and V for all previous tokens are exactly the same as when we computed them before.
[Interactive: KV Cache in Action — each token's Key and Value are computed once, then cached; when generating the next token, only that token's K and V are computed, and all previous values are reused from the cache.]
The KV cache stores these computed Keys and Values. When generating the next token, we:
- Compute Q, K, V only for the new token
- Append the new K, V to the cache
- Use the cached K, V from all previous tokens in attention
Without caching, generating N tokens requires O(N²) attention computations—each token attends to all previous tokens, and we recompute everything each step.
With caching, we only need O(N) attention computations—each new token attends to all previous tokens once, using cached values.
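Those three steps can be sketched in plain Python. This is a toy single-query, single-head version with no batching; `attend` and `KVCache` are illustrative names, not a real library API:

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for one query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the NEW token's K and V are computed and appended;
        # everything already in the cache is reused as-is.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
out = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])  # first token
cache.step([0.0, 1.0], [0.0, 1.0], [4.0, 5.0])        # second token reuses the first's K, V
```

The cache grows by one entry per generated token, which is where the memory cost discussed below comes from.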
[Interactive: With vs Without KV Cache — for an 8-token sequence, the no-cache side performs 36 K,V computations (O(N²) scaling), while the cached side performs only 8 new K,V computations (O(N) scaling). In the grid, red/green cells are computed and blue cells are read from cache.]
The savings are dramatic. For a 1000-token generation:
- Without cache: ~500,000 K and V computations (1 + 2 + ... + 1000 ≈ N²/2, quadratic)
- With cache: ~1,000 K and V computations, one per new token
The total number of attention dot products is the same either way; what caching eliminates is the redundant recomputation of Keys and Values at every step. In practice, KV caching provides a 10-100× speedup for long generations.
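The counting argument can be checked in a few lines. Without a cache, step i recomputes K and V for all i tokens of the prefix; with a cache, each step computes them for one token:

```python
def kv_computations(n_tokens, cached):
    # With a cache, each step computes K,V for exactly one new token.
    # Without one, step i recomputes K,V for all i tokens so far.
    if cached:
        return n_tokens
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + N ≈ N²/2

print(kv_computations(8, cached=False))     # 36  (the O(N²) case)
print(kv_computations(8, cached=True))      # 8   (the O(N) case)
print(kv_computations(1000, cached=False))  # 500500
```

The 36-vs-8 figures for an 8-token sequence and the ~500,000-vs-~1,000 figures for 1000 tokens both fall out of the same formula.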
The Memory Problem
There is no free lunch. What we save in compute, we spend in memory.
The KV cache must store Keys and Values for every token, in every layer, for every attention head. For a large model:
- 96 layers
- 96 attention heads per layer
- 128 dimensions per head
- 16 bits per value
For a single sequence of 2048 tokens:

2 (K and V) × 96 layers × 96 heads × 128 dims × 2 bytes × 2048 tokens ≈ 9.7 GB

That is nearly 10 GB of memory just for the KV cache of one sequence. The model weights take another 350 GB for a 175B model.
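That arithmetic is easy to reproduce. A small sketch, using the dimensions above (the 2× factor is for storing both Keys and Values at every position):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for storing both Keys and Values at every position.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

gb = kv_cache_bytes(layers=96, heads=96, head_dim=128, seq_len=2048) / 1e9
print(f"{gb:.1f} GB")  # 9.7 GB for one 2048-token sequence
```

Doubling the context doubles the cache, which is why long-context serving is memory-bound before it is compute-bound.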
This memory pressure explains several constraints:
Context length limits: Longer contexts mean larger KV caches. A 128K context window requires proportionally more memory. This is why some models have shorter context limits.
Batch size trade-offs: Serving multiple users means multiple KV caches. Memory limits how many concurrent sequences you can handle.
GPU memory requirements: Inference at scale requires high-memory GPUs (80GB A100s, H100s) or distributed systems.
Optimization Techniques
Engineers have developed many techniques to make inference faster and more memory-efficient:
Quantization reduces precision. Instead of 16-bit floats, use 8-bit or even 4-bit integers:
- 16-bit → 8-bit: Half the memory, minor accuracy loss
- 16-bit → 4-bit: Quarter the memory, more accuracy trade-off
Modern quantization techniques (GPTQ, AWQ, GGML) preserve accuracy surprisingly well. A 4-bit quantized 70B model often outperforms a full-precision 13B model while fitting in similar memory.
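The core idea behind all of these schemes is simple linear quantization. A minimal sketch of the symmetric 8-bit variant (real methods like GPTQ and AWQ are far more sophisticated, calibrating per-channel scales to minimize error):

```python
def quantize_int8(values):
    # Symmetric linear quantization: map floats onto integers in [-127, 127]
    # using one shared scale factor.
    scale = max(abs(v) for v in values) / 127
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; the error is bounded by scale / 2.
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Each weight now occupies one byte instead of two (or four), at the cost of a small rounding error per value.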
Batching processes multiple sequences together. The model weights are loaded once but applied to many sequences. This amortizes the cost of loading weights from memory.
Continuous batching is smarter: as sequences finish, new ones start immediately, keeping the GPU busy. No waiting for the longest sequence in a batch.
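A toy scheduler shows the difference. Here each request needs a fixed number of decode steps, and a waiting request joins the batch the moment a slot frees up, rather than waiting for the whole batch to drain (all names are illustrative, not a real serving API):

```python
from collections import deque

def continuous_batching(requests, max_batch, steps_needed):
    # Toy continuous-batching scheduler: as soon as a sequence finishes,
    # a waiting one takes its slot in the very next step.
    waiting = deque(requests)
    active = {}     # request -> remaining decode steps
    timeline = []   # which requests ran together at each step
    while waiting or active:
        while waiting and len(active) < max_batch:
            req = waiting.popleft()
            active[req] = steps_needed[req]
        timeline.append(sorted(active))
        for req in list(active):
            active[req] -= 1
            if active[req] == 0:
                del active[req]
    return timeline

steps = {"a": 1, "b": 3, "c": 2}
print(continuous_batching(["a", "b", "c"], max_batch=2, steps_needed=steps))
```

With static batching, "c" would have to wait until both "a" and "b" finished; here it slips in as soon as the short request "a" completes.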
Flash Attention is an algorithmic optimization that:
- Reduces reads/writes between slow GPU high-bandwidth memory and fast on-chip SRAM
- Never materializes the full attention matrix
- Provides 2-4× speedup with identical outputs
Speculative decoding uses a smaller "draft" model to propose multiple tokens, then the large model verifies them in parallel. If the draft is usually correct, you generate multiple tokens per large-model forward pass.
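A greedy sketch of the idea, using toy callables for both models (production systems verify all draft positions in one batched forward pass and use a rejection-sampling acceptance rule; here verification is shown as a sequential loop for clarity):

```python
def speculative_step(draft_model, target_model, context, k=4):
    # Draft model cheaply proposes k tokens.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target model checks the proposals and keeps the longest
    # agreeing prefix; its own token fills the first disagreement.
    accepted, ctx = [], list(context)
    for t in proposed:
        correct = target_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)  # target's token replaces the miss
            break
    else:
        accepted.append(target_model(ctx))  # bonus token when all k match
    return accepted

# Toy models: a "perfect" draft yields k+1 tokens per target step.
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: len(ctx) % 3
print(speculative_step(draft, target, [0], k=4))  # [1, 2, 0, 1, 2]
```

When the draft agrees often, several tokens come out of each large-model step; when it misses, you still make one token of progress, so the output is never worse.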
Paged Attention (vLLM) manages KV cache like virtual memory. Instead of pre-allocating maximum sequence length, it allocates cache in pages as needed. This reduces memory waste and enables more concurrent sequences.
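The allocation idea can be sketched without any attention math at all. A toy page-based cache that grabs a new fixed-size page only when the current one fills (a vLLM-style idea; `PagedKVCache` is an illustrative name, not the vLLM API):

```python
class PagedKVCache:
    # KV storage grows in fixed-size pages on demand, instead of
    # reserving max-sequence-length memory up front.
    def __init__(self, page_size=16):
        self.page_size = page_size
        self.pages = []  # each page holds up to page_size (K, V) pairs

    def append(self, k, v):
        if not self.pages or len(self.pages[-1]) == self.page_size:
            self.pages.append([])  # allocate a new page only when needed
        self.pages[-1].append((k, v))

    def __len__(self):
        return sum(len(p) for p in self.pages)

cache = PagedKVCache(page_size=16)
for t in range(40):
    cache.append(f"k{t}", f"v{t}")
print(len(cache.pages))  # 3 pages cover 40 tokens; nothing reserved beyond that
```

A sequence that stops at 40 tokens holds 3 pages instead of a full max-length reservation, and the freed memory lets more sequences run concurrently.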
The Latency Budget
In production, latency matters as much as throughput. Users expect responses to start quickly.
Time to first token (TTFT): How long before the first token appears? This requires processing the entire input prompt through the model—the "prefill" phase. Longer prompts mean longer TTFT.
Time per output token (TPOT): Once generation starts, how fast do tokens appear? This is bounded by the forward pass time for one token.
A typical latency budget:
- TTFT: < 500ms (users notice longer delays)
- TPOT: < 50ms (feels like smooth streaming)
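These two budgets combine into total response time. A quick back-of-the-envelope helper, with purely illustrative per-token costs (the 0.25 ms/token prefill rate is an assumption, not a measured figure):

```python
def generation_latency_ms(prompt_tokens, output_tokens,
                          prefill_ms_per_token=0.25, tpot_ms=50):
    # TTFT grows with prompt length (the prefill phase);
    # each output token then adds one TPOT on top.
    ttft = prompt_tokens * prefill_ms_per_token
    total = ttft + output_tokens * tpot_ms
    return ttft, total

ttft, total = generation_latency_ms(prompt_tokens=1000, output_tokens=100)
print(f"TTFT: {ttft:.0f} ms, total: {total:.0f} ms")
```

Under these assumed numbers, a 1000-token prompt stays inside the 500 ms TTFT budget, but the 100 output tokens dominate the total wait, which is why streaming the response matters.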
Meeting these budgets requires:
- Fast hardware (modern GPUs, optimized kernels)
- Efficient batching (spread fixed costs across sequences)
- Model size trade-offs (smaller models are faster)
- Quantization (fewer bits = faster math)
Key Takeaways
- Each generated token requires a full forward pass—100 tokens means 100 forward passes
- KV caching is essential: store computed Keys and Values instead of recomputing them
- Without caching, attention is O(N²) per token; with caching, new computation is O(N)
- The KV cache grows with sequence length and consumes significant GPU memory
- Quantization (8-bit, 4-bit) reduces memory and speeds up computation with minor accuracy loss
- Flash Attention, speculative decoding, and paged attention further optimize inference
- Production systems balance latency (time to first token, time per token) with throughput