Vision Transformers

Transformers see images

The Radical Idea

For a decade, Convolutional Neural Networks (CNNs) dominated computer vision. Their locally-connected filters, pooling layers, and hierarchical feature extraction seemed perfectly suited to images. The field assumed that visual tasks required architectures explicitly designed for spatial structure.

Then came a provocative paper: "An Image is Worth 16x16 Words."

The idea was almost absurdly simple: take a transformer encoder—the same architecture designed for text—and apply it directly to images. No convolutions. No pooling. No special treatment for spatial relationships. Just pure attention.

This was the Vision Transformer (ViT), and it worked remarkably well. The key insight? An image can be treated as a sequence, just like text. The question is: a sequence of what?

Images as Sequences of Patches

The approach: instead of processing an image pixel by pixel (too expensive) or with convolutions (architecturally constrained), we split the image into a grid of patches.

Take a 224×224 image. Divide it into non-overlapping 16×16 patches. Since 224 ÷ 16 = 14, we get a 14×14 grid—that's 196 patches total.
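This patch extraction can be done with pure array reshapes, no convolutions required. A minimal numpy sketch (the random image is just a stand-in for real pixel data):

```python
import numpy as np

# Split a 224x224 RGB image into non-overlapping 16x16 patches.
image = np.random.rand(224, 224, 3)   # H x W x C
P = 16                                # patch size

# (224, 224, 3) -> (14, 16, 14, 16, 3) -> (14, 14, 16, 16, 3) -> (196, 768)
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

print(patches.shape)  # (196, 768): a 14x14 grid, each patch flattened
```

Each row of `patches` is one flattened 16×16×3 patch, ready to be embedded as a token.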

Interactive: Image to Patches

The 224×224 image is divided into 196 non-overlapping 16×16 patches (each containing 16 × 16 × 3 = 768 values), forming a sequence that the transformer processes like text tokens.

Each patch is like a "word" in this visual sentence. A transformer can process these patches exactly as it would process tokens in text: embed them, add positional information, and let attention figure out the relationships.

This reframing is elegant. We are not forcing the transformer to learn convolution-like operations. We are letting it discover what visual relationships matter, using the same attention mechanism that works so well for language.

The Patch Embedding Process

Once we have patches, we need to convert them into the vectors that transformers expect. Each 16×16 patch contains 16 × 16 × 3 = 768 values (assuming RGB color channels).

The embedding process is straightforward:

  1. Flatten each patch into a 768-dimensional vector
  2. Apply a linear projection to map it to the model's hidden dimension (e.g., 768 or 1024)
  3. Add a learnable position embedding so the model knows where each patch came from
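The three steps above can be sketched in numpy. The random matrices here are stand-ins for parameters the model would learn:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, hidden = 196, 768, 768

# Step 1: flattened patches (random stand-ins for real pixel values)
patches = rng.standard_normal((num_patches, patch_dim))

# Learnable parameters (randomly initialized here)
W_proj = rng.standard_normal((patch_dim, hidden)) * 0.02   # projection matrix
pos_emb = rng.standard_normal((num_patches, hidden)) * 0.02  # position embeddings

tokens = patches @ W_proj   # step 2: linear projection to the hidden dimension
tokens = tokens + pos_emb   # step 3: add a position embedding per patch

print(tokens.shape)  # (196, 768): one embedded token per patch
```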

Patch Embedding Visualization

Each 16×16 RGB patch (768 values) is flattened into a vector, linearly projected to the model dimension, and combined with a learnable position embedding to encode spatial location.

That linear projection is actually equivalent to applying a single convolution with 16×16 kernels and stride 16. But conceptually, we are treating patches as independent tokens that will learn to communicate through attention.
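This equivalence is easy to verify numerically. The sketch below compares a naive stride-16 convolution against patchify-then-matmul on a tiny 32×32 image (small output dimension D for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, D = 16, 3, 8                        # patch size, channels, output dim
image = rng.standard_normal((32, 32, C))  # tiny image: a 2x2 grid of patches
kernels = rng.standard_normal((D, P, P, C))

# Path A: naive convolution with 16x16 kernels, stride 16, no padding
conv_out = np.empty((2, 2, D))
for i in range(2):
    for j in range(2):
        block = image[i*P:(i+1)*P, j*P:(j+1)*P, :]
        conv_out[i, j] = np.tensordot(kernels, block, axes=([1, 2, 3], [0, 1, 2]))

# Path B: patchify, flatten, multiply by the same weights as a matrix
patches = image.reshape(2, P, 2, P, C).transpose(0, 2, 1, 3, 4).reshape(4, -1)
W = kernels.reshape(D, -1).T              # (P*P*C, D)
proj_out = (patches @ W).reshape(2, 2, D)

print(np.allclose(conv_out, proj_out))  # True: the two paths agree
```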

The position embeddings are crucial. Unlike convolutions, which implicitly encode location through their receptive fields, attention has no notion of position. The position embeddings give the model spatial awareness: "this patch is in the top-left corner" or "this patch is adjacent to that one."

The [CLS] Token

ViT borrows another trick from language models: the [CLS] token. Before passing patches through the transformer, we prepend a special learnable token to the sequence.

This [CLS] token has no corresponding image patch. Instead, it attends to all patches across all transformer layers, gradually accumulating a holistic representation of the entire image.

[CLS] Token Aggregation

The [CLS] token attends to all patches, weighting the subject highly and the background weakly, and aggregates this information into a representation of the whole image used for classification.

For classification tasks, we take the final representation of the [CLS] token and pass it through a simple linear classifier. The idea is that after many layers of attention, the [CLS] token has "seen" everything and can summarize the image's content.

This mirrors BERT's approach for text classification. The [CLS] token acts as a global aggregator, learning to extract whatever information is most useful for the downstream task.
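A sketch of the mechanics, with the transformer layers themselves elided (`encoded` stands in for their output, and all weights are random placeholders for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_classes = 768, 1000

patch_tokens = rng.standard_normal((196, hidden))
cls_token = rng.standard_normal((1, hidden))          # learnable parameter

sequence = np.concatenate([cls_token, patch_tokens])  # length 197: [CLS] + patches
encoded = sequence                                    # placeholder for the transformer layers

# Classification head: a linear layer on the final [CLS] representation
W_head = rng.standard_normal((hidden, num_classes)) * 0.02
logits = encoded[0] @ W_head                          # read off position 0, the [CLS] slot

print(sequence.shape, logits.shape)  # (197, 768) (1000,)
```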

Why ViT Needs Scale

There is a catch. ViT lacks the inductive biases that make CNNs data-efficient:

  • CNNs assume locality matters—nearby pixels are related. This is baked into the architecture.
  • CNNs assume translation equivariance—the same filters apply everywhere, so a cat in the corner is processed the same way as a cat in the center.
  • CNNs build hierarchical features—edges → textures → parts → objects.

ViT has none of these assumptions. It starts from scratch, learning all spatial relationships purely from data. This flexibility is a double-edged sword.

With limited data, ViT underperforms CNNs. The model has too many degrees of freedom and not enough signal to constrain them. It might overfit to spurious patterns or fail to discover basic visual primitives.

But with enough data—millions or billions of images—ViT surpasses CNNs. The lack of inductive bias becomes an advantage. ViT can learn relationships that CNNs cannot express, discovering visual patterns that human-designed architectures would never capture.

This is the lesson of scale: simpler, more general architectures win when data is abundant. The attention mechanism's flexibility, which seems wasteful with small datasets, becomes powerful at scale.

Attention on Patches

What does attention look like when applied to image patches?

Unlike text, where attention patterns often highlight syntactic relationships, visual attention reveals semantic and spatial groupings. Patches attend to related patches: sky to sky, face to face, wheel to wheel.

Patch Attention Patterns

Attention patterns in ViT reveal how patches relate to each other. Different heads learn different patterns: local spatial relationships, semantic groupings, or global context.

Several interesting patterns emerge:

  • Local attention: Nearby patches often attend to each other, discovering local structure without explicit locality bias.
  • Global attention: Some patches attend across the entire image, capturing long-range relationships that CNNs would need many layers to model.
  • Object-based attention: Patches from the same object tend to form attention clusters, even without any object supervision.

This is the power of learned attention. The model discovers what relationships matter, rather than having them hardcoded. For some images, local context dominates. For others, distant patches are highly relevant. Attention adapts to the content.

Early layers tend to show more local patterns, while later layers show more global, semantic attention. The model builds up from low-level features to high-level concepts, not through architectural hierarchy, but through learned attention patterns.
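The mechanics behind all of these patterns is ordinary scaled dot-product attention over the patch tokens. A minimal single-head sketch with random weights (a real model has many heads and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 197, 64                        # 196 patches + [CLS], small head dimension

x = rng.standard_normal((n, d))       # stand-in for embedded patch tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)         # (197, 197) pairwise patch affinities

# Softmax over each row: every token gets a distribution over all tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ v                     # each token mixes information from all patches

print(round(weights[0].sum(), 6))  # 1.0: the [CLS] row is a distribution over all tokens
```

Whether a given row concentrates its weight on nearby patches, on distant ones, or on patches from the same object is entirely determined by the learned projections, which is exactly why the patterns above emerge from data rather than architecture.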

Key Takeaways

  • Vision Transformers treat images as sequences of patches—each patch is like a "word"
  • A 224×224 image becomes 196 patches of size 16×16, plus a special [CLS] token
  • Each patch is flattened and linearly projected, then given a position embedding
  • The [CLS] token attends to all patches and aggregates global information for classification
  • ViT has no built-in inductive biases—it needs large datasets to learn what CNNs get for free
  • At scale, this flexibility becomes an advantage: ViT discovers visual relationships CNNs cannot express
  • Attention patterns on images reveal local, global, and object-based groupings