The Sequence Problem
Why sequences are hard for computers
Context Changes Everything
Consider the word "bank." Read it in isolation and notice how your mind hovers between meanings. Now read these two sentences:
- She sat by the river bank and watched the fish swim by.
- She walked into the bank and deposited her paycheck.
The exact same word, but the surrounding words completely change its meaning. In the first sentence, "bank" conjures images of grass and water. In the second, you picture marble floors and teller windows.
This is the fundamental challenge of processing language: words get their meaning from context. A single word in isolation is ambiguous. Only when we see it alongside its neighbors does the meaning crystallize.
Interactive: Context Changes Meaning
She sat by the bank🏞️ and watched the fish swim by.
Meaning: The edge of a river or stream
The word "bank" is identical in both sentences, but the surrounding words completely determine its meaning. This is why context is essential for understanding language.
This phenomenon goes far deeper than simple ambiguity. Even when words aren't ambiguous, their exact shade of meaning shifts based on neighbors. The word "cold" in "cold water" versus "cold war" versus "cold personality" carries subtly different connotations each time.
For computers to truly understand language, they must somehow capture these context-dependent meanings.
Word Order Matters
Context isn't just about which words appear together—it's about their arrangement. Swap two words and the entire meaning can flip.
Interactive: Drag Words to Reorder
"The dog bit the man"
The dog attacked the man
Drag words to rearrange them. Notice how swapping "dog" and "man" completely changes who is doing the biting.
The same five words, yet the meaning depends entirely on their sequence. This is why we call language data "sequential"—the order carries as much information as the words themselves.
Think about it from a computer's perspective. If you simply count which words appear in a sentence (a "bag of words"), "The dog bit the man" and "The man bit the dog" look identical. Both contain the same words with the same frequencies. Only by tracking the sequence can we distinguish these very different events.
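To make this concrete, here is a minimal sketch (in Python, written for this example) of the bag-of-words comparison: counting the words in the two sentences yields exactly the same tallies, so a count-based representation cannot tell the two events apart.

```python
from collections import Counter

# A bag-of-words representation keeps only word counts, discarding order.
def bag_of_words(sentence: str) -> Counter:
    return Counter(sentence.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

# The two sentences describe opposite events, yet their bags are identical.
print(a == b)  # → True
```

Any model built on top of such counts inherits this blindness to order, which is exactly why sequence-aware architectures are needed.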
This sequential nature is what makes language fundamentally different from tabular data or images. In a spreadsheet, rearranging rows might not change the analysis. In an image, local patches often make sense on their own. But in language, remove the sequence and you lose the meaning.
The Naive Approach: One Word at a Time
Early neural networks tried to handle sequences by processing one word at a time, maintaining a running "memory" of what came before. These are called Recurrent Neural Networks (RNNs).
The idea is elegant: read a word, update your hidden state, then pass that state to process the next word. The hidden state acts as a compressed summary of everything seen so far. In theory, this lets the network carry context forward through the entire sequence.
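The update loop described above can be sketched in a few lines of NumPy. This is a toy illustration, not a production RNN: the dimensions, weight matrices, and random "word vectors" are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the text): 8-dim word vectors, 16-dim hidden state.
d_in, d_hidden = 8, 16
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
b = np.zeros(d_hidden)

def rnn_step(h_prev, x):
    """One recurrent update: fold the new word vector x into the running state."""
    return np.tanh(W_x @ x + W_h @ h_prev + b)

# Process a "sentence" of 5 random word vectors, one at a time.
h = np.zeros(d_hidden)  # empty memory before the first word
for x in rng.normal(size=(5, d_in)):
    h = rnn_step(h, x)  # the fixed-size state must summarize everything so far

print(h.shape)  # → (16,) — the summary stays the same size, however long the input
```

Note the key constraint: no matter how many words flow through the loop, everything must fit into that one fixed-size vector `h`.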
Animation: RNN Information Flow
Watch how information flows through the RNN...
RNNs process one word at a time, compressing all prior context into a fixed-size hidden state. Information from early words must pass through many update steps, and it decays exponentially along the way.
Watch what happens to the hidden state as we move through the sequence. Early information must pass through many transformation steps before it can influence processing of later words. At each step, the state gets updated, and older information gradually fades.
This creates the information decay problem. By the time an RNN reaches the end of a long sentence, it has "forgotten" much of the beginning. The hidden state can only hold so much, and recent words tend to dominate. Critical context from early in the sequence becomes diluted, compressed, distorted.
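A toy calculation makes the decay concrete. Suppose, purely for illustration, that each update retains a fixed fraction of the existing state, say 0.9. The first word's contribution to the state then shrinks geometrically with sequence length:

```python
# Toy model of decay: a linear "memory" h_t = a*h_{t-1} + x_t with a < 1.
# After t updates, the first word's weight in the state is a**t.
a = 0.9  # illustrative retention factor, not a measured value

for t in [1, 10, 50, 100]:
    print(t, a ** t)  # the first word's remaining influence after t steps
```

After 50 steps the first word's weight has dropped below one percent; after 100 it is effectively gone. Real RNNs are nonlinear, but the same compounding shrinkage is what drives the vanishing of long-range information.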
Imagine trying to summarize a whole book while only being allowed to keep a single page of notes. Every time you read a new chapter, you must compress your notes further. By the end, crucial early details are lost.
This bottleneck limits RNNs severely. They struggle with long-range dependencies—cases where understanding a word requires remembering something from far earlier in the sequence. Consider: "The cat that ate the fish that lived in the pond that my grandmother built was hungry." By the time we reach "was," we need to remember that the subject is "cat," not "fish" or "pond."
What If We Could See Everything at Once?
Here's a radical thought: what if instead of processing words one at a time, we could look at all positions simultaneously?
Imagine spreading the entire sentence out in front of you like cards on a table. You can see every word at once. When you want to understand what "bank" means, you simply glance at the surrounding words—no matter how far away they are. No need to pass information through a chain of hidden states. Just direct access.
This is the key insight behind transformers: parallel processing of all positions. Every word can attend to every other word directly, without the information having to flow through intermediate steps.
Instead of a sequential chain that loses information, we have a fully connected web where any word can inform the meaning of any other word. The cost of looking at a distant word is the same as looking at an adjacent one.
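As a preview, a bare-bones self-attention step can be sketched in NumPy. This simplified version omits the learned query/key/value projections of a real transformer; the point is only that every position is compared to every other position in a single parallel operation, with no chain of hidden states in between.

```python
import numpy as np

def attention(X):
    """Simplified scaled dot-product self-attention over word vectors X.

    Every position scores its similarity against every other position,
    then takes a weighted average of all positions: direct access to
    distant words at the same cost as adjacent ones.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # all pairwise similarities at once
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X                               # each word: a blend of all words

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # 6 "words" as 4-dim vectors (illustrative sizes)
out = attention(X)
print(out.shape)  # → (6, 4): one context-aware vector per position
```

Notice that nothing in `attention` depends on distance or order: position 1 reaches position 6 exactly as easily as it reaches position 2. (How order gets back in, via positional encodings, is a later topic.)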
This "attention" to all positions at once eliminates the bottleneck. Long-range dependencies become just as easy to capture as short-range ones. The beginning of a sentence is just as accessible as the end.
In the coming chapters, we'll see exactly how this attention mechanism works—how the model decides which words to attend to, how it combines information from multiple positions, and how this simple idea scales to handle the complexity of human language.
But first, we need to solve another problem: how do we turn words into numbers that a neural network can process?
Key Takeaways
- Words derive meaning from context—the same word means different things in different sentences
- Word order carries crucial information; rearranging words can completely change meaning
- RNNs process sequences one step at a time, maintaining a hidden state as compressed memory
- The sequential nature of RNNs causes information decay over long sequences—early context fades
- Transformers solve this by processing all positions in parallel, allowing direct attention between any pair of words