Multimodal Transformers

Combining vision, language, and beyond

The Multimodal Insight

We have seen transformers conquer text (GPT, BERT), images (ViT), and audio (Whisper). Each domain required adapting the input representation—tokens for text, patches for images, spectrogram frames for audio—but the core architecture remained the same.

This raises a tantalizing question: what if one model could understand all modalities at once?

The insight is almost obvious in hindsight. A transformer does not care what its input embeddings represent. It simply processes sequences of vectors, letting attention discover relationships. Whether those vectors come from words, image patches, or audio frames is irrelevant to the attention mechanism itself.

The transformer is modality-agnostic. It just processes sequences of embeddings.
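To make this concrete, here is a minimal single-head self-attention in NumPy. The weights and dimensions are illustrative, not from any real model; the point is that the same function, with the same weights, processes a "text" sequence and an "image patch" sequence of different lengths without modification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over any (seq_len, d_model) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

text_embeddings  = rng.normal(size=(10, d))   # 10 "word" vectors
patch_embeddings = rng.normal(size=(49, d))   # 49 "image patch" vectors

# The same layer, same weights, handles both sequences.
print(self_attention(text_embeddings,  w_q, w_k, w_v).shape)  # (10, 16)
print(self_attention(patch_embeddings, w_q, w_k, w_v).shape)  # (49, 16)
```

Nothing in the attention computation inspects where the input vectors came from; only the sequence length and embedding dimension matter.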

This realization opened the door to multimodal models—systems that can see, read, and listen, all within a single architecture.

Early vs Late Fusion

When combining modalities, the fundamental question is: when do we merge them?

Early fusion combines modalities at the input. We project all inputs into a shared embedding space, concatenate them into a single sequence, and let the transformer process everything together. Text tokens sit alongside image patches in the same attention computation.

Late fusion keeps modalities separate. Each has its own encoder, processing independently. Only at the end do we combine their outputs—perhaps by concatenating final representations or using a small fusion network.

Cross-modal attention offers a middle ground. One modality can query another through cross-attention, similar to how a decoder attends to an encoder. The image encoder's outputs become keys and values for the text decoder's queries.

Interactive: Early vs Late Fusion

[Diagram: Image, Text, and Audio inputs are each embedded, then concatenated and fed to a single Transformer]

All modalities are embedded into a shared space and concatenated into a single sequence. The transformer processes everything together, allowing deep cross-modal interaction from the start.

Each approach has tradeoffs. Early fusion allows deep interaction between modalities but requires processing everything together—expensive for long sequences. Late fusion is efficient but limits cross-modal reasoning. Modern systems often use hybrid approaches.
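The two extremes can be sketched in a few lines of NumPy. The `encode` function below is a stand-in for a real transformer encoder, and all dimensions are made up; the sketch only illustrates where the concatenation happens.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # shared embedding dimension (illustrative)

# Per-modality embeddings, already projected to d dimensions.
text  = rng.normal(size=(12, d))   # 12 text tokens
image = rng.normal(size=(49, d))   # 49 image patches
audio = rng.normal(size=(30, d))   # 30 spectrogram frames

def encode(x, w):
    # Stand-in encoder: one nonlinear layer plus mean pooling.
    return np.tanh(x @ w).mean(axis=0)

# Early fusion: one sequence, one model, cross-modal interaction throughout.
w_shared = rng.normal(size=(d, d)) / np.sqrt(d)
fused_input = np.concatenate([text, image, audio], axis=0)  # (91, d)
early_out = encode(fused_input, w_shared)                   # (d,)

# Late fusion: independent encoders, combined only at the end.
w_t, w_i, w_a = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
late_out = np.concatenate([encode(text, w_t),
                           encode(image, w_i),
                           encode(audio, w_a)])             # (3 * d,)
```

Note the cost asymmetry: early fusion attends over all 91 positions jointly (quadratic in the combined length), while late fusion runs three shorter, independent computations.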

CLIP: Contrastive Language-Image Pretraining

CLIP, released by OpenAI in 2021, demonstrated a powerful approach to connecting vision and language. Rather than training on labeled datasets like ImageNet, CLIP learned from 400 million image-text pairs scraped from the internet.

The architecture is elegantly simple: two separate encoders, one for images (a ViT) and one for text (a transformer). Each encoder produces a single embedding vector for its input.

The training objective is contrastive learning. Given a batch of image-text pairs, CLIP learns to:

  • Maximize similarity between matching pairs (an image and its caption)
  • Minimize similarity between non-matching pairs (an image paired with a different image's caption)
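This objective is a symmetric InfoNCE (contrastive) loss over the batch's similarity matrix. A minimal NumPy sketch follows; the temperature of 0.07 matches the value CLIP initializes its learnable temperature to, and everything else is illustrative.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matching (image, text) pairs."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    n = len(logits)

    def cross_entropy(lg):
        # Row-wise log-softmax; the correct "class" for row i is column i.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
loss_matched  = clip_contrastive_loss(emb, emb)        # captions line up
loss_shuffled = clip_contrastive_loss(emb, emb[::-1])  # captions scrambled
print(loss_matched < loss_shuffled)  # True
```

When pairs line up, the diagonal of the similarity matrix dominates each row's softmax and the loss is near zero; scrambling the captions moves the large similarities off the diagonal and the loss jumps.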

Interactive: CLIP Contrastive Learning

                     "a photo     "a photo     "a photo     "a photo
                     of a cat"    of a dog"    of a car"    of a tree"
  🐱 (cat image)       0.92         0.14         0.25         0.17
  🐕 (dog image)       0.12         0.92         0.24         0.16
  🚗 (car image)       0.20         0.30         0.96         0.12
  🌳 (tree image)      0.17         0.16         0.13         0.96

  Diagonal = matching pairs (maximize); off-diagonal = non-matching (minimize)

CLIP learns by maximizing similarity on the diagonal (matching image-text pairs) while minimizing off-diagonal similarity.

The result is a shared embedding space where semantically similar concepts cluster together, regardless of modality. The embedding of "a photo of a cat" lands near the embedding of actual cat images.

This enables something remarkable: zero-shot classification. To classify an image, we do not need to train a classifier. We simply:

  1. Embed the image
  2. Embed text descriptions of each class ("a photo of a dog", "a photo of a cat", ...)
  3. Find which text embedding is closest to the image embedding
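The three steps above reduce to a cosine-similarity lookup. In this sketch the embeddings are synthetic stand-ins for real CLIP encoder outputs: a "cat direction" plays the role of the caption embedding, and the image embedding is that direction plus noise.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Return the class whose text embedding is closest to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per prompt
    return class_names[int(np.argmax(sims))], sims

# Synthetic embeddings standing in for real CLIP encoder outputs.
rng = np.random.default_rng(0)
cat_direction = rng.normal(size=64)                    # "cat-ness" direction
image_emb = cat_direction + 0.1 * rng.normal(size=64)  # an image of a cat
text_embs = np.stack([
    cat_direction,             # embedding of "a photo of a cat"
    rng.normal(size=64),       # embedding of "a photo of a dog"
    rng.normal(size=64),       # embedding of "a photo of a car"
])
label, sims = zero_shot_classify(image_emb, text_embs, ["cat", "dog", "car"])
print(label)  # cat
```

Adding a new category means adding one more text prompt to `text_embs`; no classifier weights are trained or updated.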

Interactive: Zero-Shot Classification

Selected image: 🐱

  "a photo of a cat"    0.93   ← highest similarity
  "a photo of a dog"    0.17
  "a photo of a bird"   0.39
  "a photo of a car"    0.39

Zero-shot classification: Compare the image embedding against text embeddings for each class. The highest similarity wins—no training required for new categories.

CLIP can classify images into categories it has never explicitly been trained on, as long as it can understand the text description. This is the power of learning in a shared semantic space.

Vision-Language Models

Models like GPT-4V and Claude bring multimodal understanding into conversational AI. You can show them an image and ask questions about it—"What's happening in this photo?" or "Can you read the text in this screenshot?"

How do they work? The key insight is treating image patches as tokens in the language model's sequence.

When you send an image to GPT-4V:

  1. The image is split into patches (like ViT)
  2. Each patch is embedded into a vector
  3. These image embeddings are inserted into the token sequence
  4. The language model processes text and image tokens together
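In code, the four steps amount to embedding patches and splicing them into the token sequence. This NumPy sketch uses made-up dimensions and hypothetical learned vectors for the [IMG]/[/IMG] special tokens; real systems differ in details such as the patch projection and the number of image tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Steps 1-2: split the image into patches and embed each one.
patches = rng.normal(size=(4, 3 * 16 * 16))           # 4 flattened RGB patches
w_patch = rng.normal(size=(3 * 16 * 16, d_model)) * 0.01
patch_tokens = patches @ w_patch                      # (4, d_model)

# Step 3: insert the image tokens into the text sequence, bracketed by
# [IMG]/[/IMG] special-token embeddings (hypothetical learned vectors).
img_open, img_close = rng.normal(size=(2, d_model))
text_tokens = rng.normal(size=(6, d_model))           # "What color is the car ?"
sequence = np.concatenate([img_open[None], patch_tokens, img_close[None],
                           text_tokens])              # (12, d_model)

# Step 4: the language model processes this single sequence; text positions
# can attend to patch positions through ordinary self-attention.
print(sequence.shape)  # (12, 64)
```

From the transformer's point of view, the result is just one sequence of twelve embeddings; the distinction between patch and word exists only in how each vector was produced.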

Interactive: Multimodal Token Sequence

[IMG]  patch1  patch2  patch3  patch4  [/IMG]  What  color  is  the  car  ?

(Special tokens: [IMG], [/IMG] · Image patches: patch1–patch4 · Text tokens: the question words)

Notice how "car" attends to image patches to answer the question.

The attention mechanism handles the rest. Text tokens can attend to image patches, allowing the model to ground its language understanding in visual information. When asked "What color is the car?", the word "car" can attend to the relevant image patches to find the answer.

This is fundamentally different from CLIP. CLIP produces a single embedding per image—good for classification but not for detailed understanding. Vision-language models maintain the full patch-level representation, enabling fine-grained visual reasoning.

The Unified Architecture

The frontier of multimodal AI moves toward truly unified models. GPT-4o and Gemini process text, images, and audio through a single architecture, not as separate systems stitched together.

In these models, all modalities become tokens:

  • Text: Subword tokens (as usual)
  • Images: Patch embeddings
  • Audio: Spectrogram frame embeddings

These tokens flow through the same transformer layers, sharing parameters. The model learns unified representations where concepts connect across modalities. The sound of a dog barking, the image of a dog, and the word "dog" all activate similar patterns.

This unification enables new capabilities:

  • Native multimodal generation: Generate images, audio, and text in a single forward pass
  • Cross-modal reasoning: "This sounds like..." or "This looks like it would sound like..."
  • Seamless context: Reference earlier images when discussing later text

We are still early in this journey. Current models handle some modalities better than others, and truly fluid multimodal generation remains challenging. But the direction is clear: the transformer's modality-agnostic nature makes unified multimodal AI not just possible, but natural.

Key Takeaways

  • Transformers are modality-agnostic—they process sequences of embeddings regardless of what those embeddings represent
  • Early fusion combines modalities at input; late fusion combines at output; cross-modal attention allows modalities to query each other
  • CLIP learns a shared embedding space through contrastive learning, enabling zero-shot classification
  • Vision-language models insert image patches as tokens, allowing text to attend to visual information
  • Unified multimodal models process all modalities through shared transformer layers, moving toward truly integrated understanding