The Encoder Stack
Understanding context bidirectionally
Stacking Transformer Blocks
A single transformer block is useful, but the real power comes from stacking them. Each block refines the representations further, building increasingly sophisticated understanding.
The original "Attention Is All You Need" paper describes this clearly:
"The encoder is composed of a stack of N = 6 identical layers."
Six layers was the starting point. Modern transformer models go much deeper (GPT-3 and PaLM are decoder-only stacks rather than encoders, but the principle of depth is the same):
- BERT-base: 12 layers
- BERT-large: 24 layers
- GPT-3: 96 layers
- PaLM: 118 layers
Why stack so many? Each layer performs one round of "look at everything and update." The first layer captures basic patterns. The second layer refines those patterns. By the time you reach layer 12 or 96, the representations encode remarkably abstract relationships.
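The stacking loop itself is simple. Below is a minimal NumPy sketch — single-head attention, random untrained weights, every name hypothetical — of six blocks that share one architecture but carry separate weights:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(x, params):
    """One simplified block: single-head self-attention, then a two-layer
    FFN, each followed by a residual connection and layer norm."""
    Wq, Wk, Wv, W1, W2 = params
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])    # every token scores every token
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
    x = layer_norm(x + weights @ v)            # attention sublayer + residual
    return layer_norm(x + np.maximum(0, x @ W1) @ W2)  # FFN sublayer + residual

rng = np.random.default_rng(0)
d, n_layers = 16, 6                            # "a stack of N = 6 identical layers"
x = rng.normal(size=(5, d))                    # 5 tokens, d-dimensional each
# Identical architecture in every layer, but independently initialized weights.
layers = [tuple(rng.normal(scale=0.1, size=shape)
                for shape in [(d, d), (d, d), (d, d), (d, 4 * d), (4 * d, d)])
          for _ in range(n_layers)]
for params in layers:
    x = encoder_block(x, params)               # each layer refines the previous output
print(x.shape)  # (5, 16): shape is preserved, so blocks stack cleanly
```

Because each block maps (tokens, d) to (tokens, d), blocks compose freely — going from 6 layers to 96 is just a longer loop.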
Interactive: Encoder Stack
Data flows from the input embeddings at the bottom, through each encoder layer, to the output at the top. Each layer contains attention (tokens communicate) and an FFN (each token is processed independently). Each block receives the previous block's output and produces refined representations. The architecture is identical in every layer, but the weights are not—each layer learns to extract different patterns.
Bidirectional Attention
The encoder has a special property: bidirectional attention. Every token can attend to every other token, regardless of position. There is no masking.
When processing "The cat sat on the mat":
- "The" can see "cat," "sat," "on," "the," and "mat"
- "cat" can see "The," "sat," "on," "the," and "mat"
- "sat" can see everything else
- And so on...
This is fundamentally different from how humans read (left to right) or how language models generate text. The attention mechanism itself is order-agnostic—word order enters only through the positional encodings—so the encoder sees the whole sentence simultaneously.
Why is this useful? Consider the word "bank" in "I sat on the bank of the river." To understand that this means a riverbank (not a financial institution), we need context that arrives after the word itself ("river"). With bidirectional attention, "bank" has direct access to "river," making the disambiguation trivial.
Contrast this with a left-to-right model, which must commit to a representation of "bank" before it has seen "river." Bidirectional attention gives the encoder a significant advantage for tasks that require understanding complete sentences, like classification or question answering.
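The difference is literally one mask. The toy sketch below (random scores, not a trained model) shows that without a mask, "bank" attends to the later token "river," while a decoder-style causal mask forces that weight to zero:

```python
import numpy as np

tokens = ["I", "sat", "on", "the", "bank", "of", "the", "river"]

def softmax_rows(s):
    # Numerically stable row-wise softmax.
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(len(tokens), len(tokens)))  # stand-in for q @ k.T

bidirectional = softmax_rows(scores)                  # encoder: no masking

masked = scores.copy()
masked[np.triu_indices(len(tokens), k=1)] = -np.inf   # hide all future tokens
causal = softmax_rows(masked)                         # decoder-style attention

bank, river = tokens.index("bank"), tokens.index("river")
print(bidirectional[bank, river] > 0)  # True: "bank" sees "river"
print(causal[bank, river] == 0)        # True: the future token is masked out
```

Softmax weights are always strictly positive unless a score is set to negative infinity, which is exactly what the causal mask does for positions to the right.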
Layer-by-Layer Refinement
The key insight about deep transformers: each layer builds more abstract representations.
In early layers, tokens mostly capture their own identity plus local context. "cat" knows it is an animal word, "sat" knows it is a verb. The attention patterns are often local—words pay attention to their immediate neighbors.
In middle layers, syntactic structure emerges. The model learns grammatical roles. Which word is the subject? Which is the object? How do clauses relate to each other? The attention patterns become more structured, following linguistic relationships.
In later layers, semantic understanding develops. The model grasps meaning, inference, and higher-level concepts. "The cat that ate the fish was hungry"—by the final layers, the model understands that "hungry" describes the cat, not the fish.
Representation Evolution Across Layers
Watch how "river" evolves through the layers.
This layered refinement is not programmed—it emerges from training. Researchers have probed transformer layers and found consistent patterns: early layers for surface features, middle layers for syntax, late layers for semantics.
Think of it like this: layer 1 asks "what words are here?" Layer 6 asks "what is the grammatical structure?" Layer 12 asks "what does this mean?"
The Encoder Output
Let us trace the full journey through an encoder:
Input: Raw tokens are converted to embeddings. Each word becomes a vector—768 dimensions in BERT-base. Positional encodings are added so the model knows word order.
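This input step can be sketched with the original paper's sinusoidal scheme (BERT itself learns its position embeddings instead, but the idea—adding position information to the word vector—is the same; the tiny vocabulary and dimensions here are purely illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from the original Transformer:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy embedding table; real models look up learned vectors by token id.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
token_ids = [0, 1, 2, 3, 0, 4]     # "the cat sat on the mat"
d_model = 8                         # 768 in BERT-base
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Same word at different positions now gets different input vectors.
x = embedding_table[token_ids] + positional_encoding(len(token_ids), d_model)
print(x.shape)  # (6, 8): one position-aware vector per token
```

Note that the two occurrences of "the" share one embedding but receive different positional encodings, which is how the order-agnostic attention layers can still tell them apart.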
Processing: The embeddings flow through all N encoder layers. At each layer, attention allows tokens to exchange information, and the FFN processes each position.
Output: After the final layer, we have contextualized representations. Each token's vector now encodes not just its own meaning, but information gathered from the entire sequence.
The word "bank" in isolation might have a generic embedding that blends all possible meanings. After passing through the encoder with "river" nearby, its output representation is specific—it encodes the riverbank meaning.
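Even a single untrained attention layer shows this effect. In the sketch below (random weights, names hypothetical), "bank" starts from the identical input vector in two sentences, but its output differs once attention mixes in the neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
vecs = {w: rng.normal(size=d) for w in ["the", "bank", "river", "money"]}
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # shared projections

def contextualize(words):
    """One single-head self-attention pass over a list of words."""
    x = np.stack([vecs[w] for w in words])
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    s = q @ k.T / np.sqrt(d)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return x + w @ v                 # residual + attention output

out_river = contextualize(["the", "river", "bank"])
out_money = contextualize(["the", "money", "bank"])

# "bank" enters with the same vector both times, but leaves differently:
bank_river, bank_money = out_river[-1], out_money[-1]
print(np.allclose(bank_river, bank_money))  # False
```

A trained encoder does the same thing far more usefully: the context-dependent output for "bank" lands near riverbank-related representations in one sentence and near finance-related ones in the other.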
This is the encoder's gift: it transforms context-independent word embeddings into context-dependent representations. Every token "knows" about every other token. The outputs are ready for downstream tasks:
- Classification: Use the [CLS] token's representation to predict sentiment, topic, or category
- Question answering: Find which tokens in the passage answer the question
- Named entity recognition: Classify each token as a person, place, organization, etc.
- Feeding a decoder: For translation, the encoder outputs become the "memory" that the decoder attends to
The encoder does not generate text—it understands text. It creates rich, contextual representations that other systems can use.
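For the classification case, the downstream head can be as small as one linear layer over the [CLS] vector. The sketch below uses random stand-ins for the encoder output and the head weights (in practice both come from pre-training and fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 8, 2                     # 768 and task-specific in practice
encoder_out = rng.normal(size=(6, d_model))   # [CLS] + 5 word tokens

cls_vec = encoder_out[0]                      # convention: [CLS] sits at position 0
W = rng.normal(size=(d_model, n_classes))     # the fine-tuned classifier head
b = np.zeros(n_classes)
logits = cls_vec @ W + b
probs = np.exp(logits) / np.exp(logits).sum() # softmax over classes
print(probs.shape)  # (2,): one probability per class, e.g. positive/negative
```

Because [CLS] has attended to every other token through all the layers, its final vector works as a summary of the whole sequence—which is why this one position suffices for sentence-level predictions.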
Key Takeaways
- An encoder stack consists of multiple identical transformer blocks, each with different learned weights
- Modern transformer stacks range from 12 layers (BERT-base) to over 100 (PaLM)—though the deepest examples, GPT-3 and PaLM, are decoder-only models
- Bidirectional attention lets every token attend to every other token—no masking
- This bidirectional nature gives encoders an advantage for understanding tasks
- Early layers capture local patterns and word identity; deep layers capture semantic meaning
- The encoder transforms context-independent embeddings into context-dependent representations
- Encoder outputs can be used for classification, extraction, or as input to a decoder