Positional Encoding
How transformers know word order
The Permutation Problem
There is a fundamental issue with the attention mechanism we just learned: it has no sense of order.
Consider these two sentences:
"Dog bites man"
"Man bites dog"
They contain the exact same words. Without any position information, attention treats them identically—it sees a bag of words, not a sequence. But these sentences have completely different meanings.
The Permutation Problem
Without position encoding: Both sentences have the same attention pattern! The model cannot distinguish "Dog bites man" from "Man bites dog" because it only sees word content, not order.
When attention computes how much "bites" should attend to "dog" versus "man," it only looks at the content of those embeddings. It does not know that "dog" comes first in one sentence and "man" comes first in the other.
This is called the permutation invariance problem. Shuffle the input words, and you get the same attention patterns. For language, where word order is crucial, this is a serious limitation.
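Permutation invariance is easy to verify directly. The sketch below (NumPy, with identity query/key/value projections purely for illustration) shows that shuffling the input rows just shuffles the attention output rows the same way, so no ordering information survives:

```python
import numpy as np

def attention(X):
    # Simplified self-attention: identity projections, for illustration only.
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 token embeddings: "dog", "bites", "man"
perm = [2, 1, 0]              # reversed order: "man", "bites", "dog"

out = attention(X)
out_perm = attention(X[perm])

# The permuted input produces the same outputs, just permuted:
assert np.allclose(out_perm, out[perm])
```

The assertion holds for any permutation: attention is permutation-equivariant, so "Dog bites man" and "Man bites dog" produce identical per-token results.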
Adding Position Information
The solution is elegantly simple: add position information directly to the embeddings.
Before any attention computation, we inject each word's position into its representation. The embedding for "dog" at position 0 becomes different from "dog" at position 2. Now the model can distinguish between subjects and objects based on where they appear.
Position Encoding Effect
Base embeddings only: The two instances of "the" have identical embeddings. Without position information, the model cannot tell them apart.
The key insight from the original transformer paper: instead of teaching the model positions through complex mechanisms, we simply add a position encoding vector to each word embedding. The sum contains both what the word means and where it appears.
This addition works because the model learns to use both pieces of information. The semantic content of the word and its positional information coexist in the same vector space.
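A minimal sketch of this idea, using a random table of position vectors (stand-ins; the actual sinusoidal scheme is covered in the next section) and toy word embeddings:

```python
import numpy as np

d, max_len = 8, 16
rng = np.random.default_rng(1)
# Toy word embeddings (illustrative values, not from a trained model).
vocab = {"the": rng.normal(size=d),
         "cat": rng.normal(size=d),
         "sat": rng.normal(size=d)}
# One position vector per sequence slot.
pos_table = rng.normal(size=(max_len, d))

tokens = ["the", "cat", "the"]
inputs = [vocab[w] + pos_table[p] for p, w in enumerate(tokens)]

# The two occurrences of "the" are no longer identical vectors:
assert not np.allclose(inputs[0], inputs[2])
```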
Sinusoidal Encodings
The original "Attention Is All You Need" paper introduced a clever scheme using sine and cosine waves at different frequencies.
For a position $pos$ and dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $d_{model}$ is the total embedding dimension.
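The formula translates into a few lines of NumPy. This sketch fills even dimensions with sines and odd dimensions with cosines, one frequency per sin/cos pair:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = np.arange(max_len)[:, None]        # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / 10000 ** (two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(64, 16)
# Position 0: all sines are 0, all cosines are 1.
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

Each row is one position's encoding; adding `pe[p]` to the embedding of the token at position `p` injects the order information.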
Sinusoidal Encoding Patterns
Each row is a position, each column is a dimension. Lower dimensions (left) have high-frequency waves that change rapidly. Higher dimensions (right) have low-frequency waves that change slowly. This creates unique patterns for each position.
Why sines and cosines? The authors hypothesized this would help the model learn to attend to relative positions. For any fixed offset $k$, the encoding at position $pos + k$ can be expressed as a linear function of the encoding at position $pos$.
Think of it like a clock: the hour hand and minute hand together tell you the exact time, and you can easily compute "3 hours from now" by adding to the current position. Similarly, sinusoidal encodings let the model reason about "2 positions ahead" or "5 positions back."
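This linearity can be checked numerically. For each sin/cos pair at frequency $w$, the pair at position $pos + k$ is a fixed rotation of the pair at position $pos$, and the rotation matrix depends only on the offset $k$, not on $pos$:

```python
import numpy as np

w = 1 / 10000 ** (2 / 16)   # one example frequency from the formula
pos, k = 5, 3

def pair(p):
    # The (sin, cos) pair for a single frequency at position p.
    return np.array([np.sin(p * w), np.cos(p * w)])

# Rotation matrix determined only by the offset k (angle-addition identities):
M = np.array([[ np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])

assert np.allclose(pair(pos + k), M @ pair(pos))
```

Because `M` is the same for every `pos`, a model can learn one linear map per offset and apply it anywhere in the sequence.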
The paper notes:
"We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, can be represented as a linear function of ."
Another benefit: the model can extrapolate to sequence lengths it has never seen during training. Since sine and cosine are continuous functions defined everywhere, position 1000 has a well-defined encoding even if training only used sequences up to length 512.
Learned vs Fixed Encodings
There are two main approaches to positional encoding:
Fixed (Sinusoidal):
- Determined by a mathematical formula
- No learnable parameters
- Can extrapolate to longer sequences
- Used in the original transformer
Learned:
- A lookup table of position vectors
- Learned during training
- More flexible—can learn arbitrary patterns
- Limited to positions seen during training (e.g., max 512)
In practice, learned positional embeddings often perform similarly to sinusoidal ones for typical sequence lengths. Many modern models like BERT use learned positions.
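A learned positional encoding is conceptually just a trainable lookup table, indexed by position. A NumPy sketch (in a real model this would be a trainable parameter, e.g. an embedding layer, updated by gradient descent):

```python
import numpy as np

max_len, d_model = 512, 16
rng = np.random.default_rng(0)

# Trainable lookup table of position vectors, one row per position.
pos_table = rng.normal(scale=0.02, size=(max_len, d_model))

positions = np.arange(10)
pos_vectors = pos_table[positions]   # looked up by index: (10, d_model)
assert pos_vectors.shape == (10, d_model)

# Indexing position 512 or beyond would raise an IndexError:
# unlike sinusoids, the table cannot extrapolate past max_len.
```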
Modern variants have pushed further:
- RoPE (Rotary Position Embedding): Encodes positions by rotating the embedding vectors. Used in LLaMA and other modern models. Preserves relative position information elegantly.
- ALiBi (Attention with Linear Biases): Instead of adding to embeddings, adds a bias to attention scores based on distance. Allows extrapolation to much longer sequences than training length.
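A minimal sketch of the RoPE idea: rotate consecutive pairs of query/key dimensions by a position-dependent angle. The payoff is that the dot product between a rotated query and key depends only on their relative offset, not their absolute positions:

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate consecutive dimension pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rope(q, 5) @ rope(k, 7)       # positions 5 and 7 (offset 2)
b = rope(q, 100) @ rope(k, 102)   # positions 100 and 102 (offset 2)
assert np.allclose(a, b)          # same offset, same score
```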
These newer approaches address a key limitation: handling sequences much longer than those seen during training. As language models process increasingly long documents, position encoding becomes crucial for maintaining coherent understanding across thousands of tokens.
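The ALiBi idea can be sketched in a few lines: a bias proportional to the query-key distance is added to the raw attention scores (real ALiBi uses a different fixed slope per head; the single slope here is illustrative):

```python
import numpy as np

def alibi_bias(seq_len, slope=0.5):
    """Causal attention bias that linearly penalizes distant positions."""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]     # dist[i, j] = i - j
    bias = -slope * dist.astype(float)     # closer tokens -> higher score
    bias[dist < 0] = -np.inf               # mask future positions
    return bias

scores = np.zeros((4, 4))                  # pretend raw q @ k scores
biased = scores + alibi_bias(4)
assert biased[3, 3] == 0.0                 # no penalty at distance 0
assert biased[3, 0] == -1.5                # distance 3, slope 0.5
```

Because the bias is computed from distances rather than stored per position, the same formula applies to any sequence length, which is what enables extrapolation.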
Key Takeaways
- Attention is naturally order-agnostic—it treats inputs as an unordered set
- Positional encoding solves this by adding position information to each embedding
- Sinusoidal encodings use waves at different frequencies to create unique position patterns
- The encoding allows models to reason about relative positions through linear transformations
- Modern variants like RoPE and ALiBi improve long-sequence handling
- Position information is essential for language understanding where word order changes meaning