34 Multimodal transformers

Where we are. We close Part V by stepping out of text. The transformer is not just for language: with one simple trick, chopping any input into “tokens”, it sees images (ViT), unites vision and language (CLIP, LLaVA) and hears audio (Whisper). As a bonus, we will see that ViTs develop attention sinks just as text models do, which closes the circle with Part II.

34.1 The idea in one sentence

An image or a sound is turned into a sequence of tokens (patches, spectrogram fragments) and, from there, the same transformer processes them: the architecture does not change, only the way you chop the input.

34.2 Key concepts and their role in the transformer

Before diving in, we define this chapter’s terms and what each one is for inside a transformer:

Tokenizing any modality. Definition: chopping image or audio into “tokens” (patches, spectrogram fragments). In the transformer: it is the trick that lets you process image or sound with the same transformer, without changing the architecture.
Vision Transformer (ViT). Definition: applying the transformer to an image chopped into patches. In the transformer: it treats each patch like a word → “an image is worth 16×16 words”.
Patch. Definition: a fixed chunk of image (e.g. 16×16 px) flattened and projected into a vector. In the transformer: it is the visual equivalent of a token.
CLIP / contrastive objective. Definition: two encoders (image and text) trained to pull together the correct pairs in a shared space. In the transformer: it uses language as supervision → zero-shot classification.
Bridge / projection (Q-Former). Definition: a module that translates vision features into the LLM’s token space. In the transformer: it “gives eyes” to a frozen LLM without retraining it whole.
Cross-attention. Definition: the queries come from one modality and the keys/values from another. In the transformer: the mechanism that lets text “look at” the image patches.
Fusion strategies. Definition: early fusion (concatenate everything), dedicated cross-attention or projection. In the transformer: the three ways to combine modalities, with different cost and modularity.
Sinks / register tokens. Definition: high-norm artifact tokens the model uses as a scratchpad. In the transformer: the same phenomenon as attention sinks (Part II) reappears in ViTs.

With that in mind, let’s start with the image.

34.3 Vision Transformer (ViT): the image as a sentence of “stamps”

The ViT (Dosovitskiy et al. 2021) applied the transformer to images with a direct trick:

Split the image into fixed patches (e.g. 16×16 pixels), non-overlapping.
Flatten and project each patch into a vector → a “token” (like a word).
Add positional embeddings and a [CLS] token (whose final state does the classification).
Feed it all into a standard transformer encoder, unchanged.

Hence its title: “an image is worth 16×16 words”.
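The four steps above can be sketched in a few lines of NumPy. This is a minimal sketch, not the reference implementation: the weights are random stand-ins for learned parameters, `d_model = 64` is an arbitrary illustrative width, and the names `patchify` and `vit_embed` are hypothetical. It shows the token arithmetic: a 224×224 image cut into 16×16 patches gives 14 × 14 = 196 tokens, plus one [CLS].

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh*gw, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

def vit_embed(image, patch, W_proj, cls_token, pos_embed):
    """Patches -> linear projection -> prepend [CLS] -> add positions."""
    tokens = patchify(image, patch) @ W_proj                        # (N, d)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N + 1, d)
    return tokens + pos_embed

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real RGB image
patch, d_model = 16, 64                    # d_model is an illustrative choice
n_patches = (224 // patch) ** 2            # 14 * 14 = 196

# Random stand-ins for parameters that would be learned in training.
W_proj = rng.normal(0, 0.02, (patch * patch * 3, d_model))
cls_token = rng.normal(0, 0.02, d_model)
pos_embed = rng.normal(0, 0.02, (n_patches + 1, d_model))

tokens = vit_embed(image, patch, W_proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 64)
```

The resulting `(197, 64)` array is what a standard encoder would consume; nothing downstream needs to know the tokens came from pixels.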

🧩 Analogy: the grid of stamps. You cut the photo into a grid of postage stamps and treat each stamp as a “word”. The transformer then “reads” the sentence of stamps, attending between patches just as it attended between words.

⚠ Honest — the ViT is data-hungry (with caveats)

The original ViT matches or beats CNNs (ResNet) only when pretrained on massive data (JFT-300M); trained on ImageNet-1k alone, it falls short. Why? Because it lacks the inductive biases that convolution brings “out of the box”, namely locality and translation equivariance; the ViT has to learn them from data. Important caveat: DeiT (Touvron et al. 2021) showed that with strong augmentation and distillation a ViT is competitive when trained on ImageNet-1k alone, so part of the hunger was about the training recipe, not the architecture. Let’s not oversell it as an insurmountable limit.
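To make “translation equivariance” concrete, here is a toy 1-D illustration, under the assumption that a circular convolution stands in for a CNN layer and a plain reshape stands in for patch tokenization (both helper names are hypothetical). Shifting the input by one sample simply shifts the convolution output, whereas the patch tokens only line up again when the shift is a whole patch.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Apply the same kernel at every position, wrapping around the ends."""
    n = len(x)
    return np.array([
        sum(kernel[k] * x[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ])

def patch_tokens(x, patch):
    """Cut a 1-D signal into non-overlapping patches (one row per token)."""
    return x.reshape(-1, patch)

rng = np.random.default_rng(0)
x = rng.random(16)
kernel = np.array([0.25, 0.5, 0.25])

# Convolution: shift-then-convolve equals convolve-then-shift.
conv_is_equivariant = np.allclose(
    circular_conv1d(np.roll(x, 1), kernel),
    np.roll(circular_conv1d(x, kernel), 1),
)

# Patches: a 1-sample shift changes the content of every token...
tokens = patch_tokens(x, 4)
tokens_shift1 = patch_tokens(np.roll(x, 1), 4)
any_token_survives = any(
    np.allclose(a, b) for a in tokens_shift1 for b in tokens
)

# ...while a whole-patch shift only reorders the tokens.
tokens_shift4 = patch_tokens(np.roll(x, 4), 4)
whole_patch_shift_reorders = np.allclose(tokens_shift4, np.roll(tokens, 1, axis=0))

print(conv_is_equivariant, any_token_survives, whole_patch_shift_reorders)
```

The convolution gets this behavior for free from weight sharing; a patch-based model has to learn from data that slightly shifted inputs mean the same thing.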

34.4 CLIP: a shared image-text meaning space

What if the supervision were not fixed labels, but language? CLIP (Radford et al. 2021) trains two encoders, one for images and one for text, on ~400 million (image, caption) pairs with the contrastive objective of Ch. 26: pull the correct pairs together in a shared space and push the incorrect ones apart.
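As a hedged sketch of that objective, the snippet below computes a symmetric cross-entropy over the batch similarity matrix, where the diagonal holds the correct (image, caption) pairs. The embeddings are random stand-ins for encoder outputs, the function name `clip_style_loss` is hypothetical, and the temperature is fixed at 0.07 for simplicity (CLIP learns it during training).

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    idx = np.arange(len(logits))
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text -> image: same idea along columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
txt = rng.normal(size=(8, 32))                       # stand-in text embeddings
aligned = txt + 0.05 * rng.normal(size=(8, 32))      # images close to their captions
unrelated = rng.normal(size=(8, 32))                 # images unrelated to captions

loss_aligned = clip_style_loss(aligned, txt)
loss_unrelated = clip_style_loss(unrelated, txt)
print(loss_aligned < loss_unrelated)
```

The loss is low when each image sits next to its own caption and high when the pairing is no better than chance, which is the pressure that shapes the shared space.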

The standout result is zero-shot classification: to classify an image, you embed one phrase per candidate label, such as “a photo of a {label}”, and assign the label whose phrase is most similar to the image. Without training on a single ImageNet label, CLIP matches a supervised ResNet-50 on ImageNet. The big idea: natural language as a flexible, scalable supervision signal.
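The classification step itself is tiny once the embeddings exist. In this sketch the embeddings are synthetic: `text_emb` and `image_emb` are random stand-ins for what the two trained encoders would output, the label set is made up, and `zero_shot_classify` is a hypothetical helper. Only the decision rule (highest cosine similarity wins) reflects the method described above.

```python
import numpy as np

labels = ["cat", "dog", "car"]                        # illustrative label set
prompts = [f"a photo of a {label}" for label in labels]

rng = np.random.default_rng(0)
# Stand-ins for encoder outputs: one text embedding per prompt, and an
# image embedding placed near the "dog" prompt to mimic a dog photo.
text_emb = rng.normal(size=(len(prompts), 32))
image_emb = text_emb[1] + 0.1 * rng.normal(size=32)

def zero_shot_classify(image_emb, text_emb):
    """Return the index of the most similar prompt and all similarities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = txt @ img                                  # cosine similarity per prompt
    return int(np.argmax(sims)), sims

pred, sims = zero_shot_classify(image_emb, text_emb)
print(prompts[pred])  # a photo of a dog
```

Changing the task means changing the list of prompts, not retraining, which is also why the result is sensitive to how the prompts are worded.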

🧩 Analogy — matching photos with their captions. CLIP learns to match photos with their descriptions over and over until it builds a shared “meaning space”: the photo of a dog and the words “a dog” land in the same neighborhood. Classifying is then a matter of seeing which phrase the image is closest to.

⚠ Honest — what CLIP does NOT do well

The authors themselves acknowledge it: (1) social biases inherited from uncurated web data; (2) weakness on fine-grained classification (e.g. telling car models apart) and on abstract or systematic tasks (e.g. counting objects); (3) sensitivity to prompt wording. “Matches ResNet-50 zero-shot” is a real result on ImageNet, not universal parity.

34.5 Giving an LLM eyes: generative vision-language models

For an LLM to talk about an image, the common pattern is: vision encoder → bridge/projection → the LLM’s token space.

BLIP-2 (Li et al. 2023): a lightweight Q-Former (with learnable queries) extracts, via cross-attention, the features of the frozen image encoder and projects them into the frozen LLM. Only the bridge is trained → cheap.
Flamingo (Alayrac et al. 2022): it interleaves gated cross-attention layers inside a frozen LM (the gate starts at 0 so as not to break it); strong multimodal few-shot.
LLaVA (Liu et al. 2023): it connects a CLIP encoder to an LLM with a simple projection and instruction-tunes on synthetic multimodal data. It is the simple recipe that dominates today.
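The simplest of these bridges, a projection, can be sketched as a single linear layer. All sizes below are small illustrative assumptions (real encoders and LLMs are much wider), the arrays are random stand-ins for encoder outputs and token embeddings, and `project` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_vision, d_llm, n_text = 196, 48, 64, 10     # illustrative sizes

vision_feats = rng.normal(size=(n_patches, d_vision))    # stand-in: vision encoder output
text_embeds = rng.normal(size=(n_text, d_llm))           # stand-in: LLM token embeddings

# The bridge: the only new parameters in this sketch.
W = rng.normal(0, 0.02, (d_vision, d_llm))
b = np.zeros(d_llm)

def project(vision_feats, W, b):
    """Map each patch feature into the LLM's embedding space."""
    return vision_feats @ W + b

visual_tokens = project(vision_feats, W, b)                       # (196, 64)
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)  # (206, 64)

bridge_params = W.size + b.size
print(llm_input.shape, bridge_params)  # (206, 64) 3136
```

After projection the LLM sees 196 extra “words” in front of the prompt. The bridge holds very few parameters compared with either model, which is why training it is cheap.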

🧩 Analogy: the “image-language” translator. The bridge (Q-Former or projection) is a translator that turns the “language of images” into tokens the LLM already knows how to read. It does not teach the LLM to see from scratch; it hands it vision in its own vocabulary.

34.6 How two modalities are fused: cross-attention

The mechanism for one modality to “look at” another is cross-attention: the queries (Q) come from one modality and the keys/values (K, V) from the other; for example, the text queries the image patches. There are three strategies for combining modalities:

Early fusion / concatenation: you join the tokens of both modalities into a single sequence and apply self-attention over everything (everything attends to everything). Simple and expressive, but quadratic cost in the combined length. (It is, in essence, what LLaVA does after projecting.)
Dedicated cross-attention (Flamingo): separate streams with cross-attention layers where one modality queries the other; modular, it leaves the base LM intact.
Projection (LLaVA): it adds no new attention; it translates vision into the LLM’s space and lets its normal self-attention mix them.
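A single-head sketch of the mechanism and of the trade-off between strategies, using random stand-in features and weights (the sizes and function names are illustrative assumptions). It shows three things: text queries attending over image keys/values, a tanh gate that leaves the text stream untouched when initialized at 0 (the Flamingo idea mentioned in 34.5), and a rough count of attention scores per layer for each approach.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, image, Wq, Wk, Wv):
    """Queries from text, keys and values from image patches."""
    Q = text @ Wq                                      # (T, d)
    K = image @ Wk                                     # (P, d)
    V = image @ Wv                                     # (P, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (T, P)
    return weights @ V, weights

def gated_cross_attention(text, image, Wq, Wk, Wv, gate):
    """Residual update scaled by tanh(gate); gate = 0 is a no-op."""
    attended, _ = cross_attention(text, image, Wq, Wk, Wv)
    return text + np.tanh(gate) * attended

rng = np.random.default_rng(0)
T, P, d = 10, 196, 32                                  # text tokens, patches, width
text = rng.normal(size=(T, d))
image = rng.normal(size=(P, d))
Wq, Wk, Wv = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))

out, weights = cross_attention(text, image, Wq, Wk, Wv)
unchanged = gated_cross_attention(text, image, Wq, Wk, Wv, gate=0.0)

# Attention scores computed in one layer (a rough cost proxy):
scores_early_fusion = (T + P) ** 2        # self-attention over the joint sequence
scores_cross_plus_self = T * T + T * P    # text self-attention + text-to-image
print(weights.shape, scores_early_fusion, scores_cross_plus_self)
```

Each row of `weights` is one text token's distribution over the 196 patches. With these sizes, early fusion computes 42,436 scores per layer against 2,060 for the separate-stream design, at the price of the image tokens never attending to the text.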

🧩 Analogy — questions and answers. In cross-attention, the text asks the questions (the queries) and the image patches answer (they provide keys and values): each word “looks at” the regions of the image that matter to it.

34.7 Audio: turning sound into an “image”

The pattern repeats: spectrogram (image-like) or learned features → transformer.

Whisper (Radford et al. 2022): an encoder-decoder transformer over log-Mel spectrograms, trained on 680,000 hours of weakly supervised audio → robust zero-shot speech recognition.
Wav2Vec 2.0 (Baevski et al. 2020): self-supervised pretraining that masks features and solves a contrastive task, followed by fine-tuning on a small amount of labeled audio.
AST (Gong et al. 2021): a ViT applied directly to spectrograms, with no convolutions; the most literal example of “audio as image”.
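To see why a spectrogram is “image-like”, here is a minimal sketch that turns a waveform into a 2-D time-frequency array. It is a simplification: it computes a plain log-power spectrogram and omits the Mel filterbank that Whisper and AST apply on top. The 16 kHz sample rate with 25 ms windows and a 10 ms hop (400 and 160 samples) follows the framing Whisper reports; the test tone and the name `log_spectrogram` are assumptions for illustration.

```python
import numpy as np

def log_spectrogram(wave, win=400, hop=160):
    """Framed FFT -> log power. Returns (n_frames, win // 2 + 1)."""
    n_frames = 1 + (len(wave) - win) // hop
    # Row i selects samples [i*hop, i*hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)               # taper each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log10(power + 1e-10)                     # epsilon avoids log(0)

sr = 16_000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 1000 * t)                    # 1 s of a 1 kHz tone

spec = log_spectrogram(wave)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 400                          # bin spacing = sr / win = 40 Hz
print(spec.shape, peak_hz)  # (98, 201) 1000.0
```

The result is a 98 × 201 grid (time × frequency) that can be cut into patches exactly like the image in 34.3, or fed frame by frame as a sequence.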

34.8 The frontier: natively multimodal (brief)

The trend is to train models that are multimodal from scratch (not an encoder + LLM glued on afterward) with unified tokenization (“any-to-any”: text, image, audio in a common vocabulary). Models like GPT-4V/4o or Gemini point that way, but their architecture and data are proprietary and unpublished, so we cite them by name only, as a direction of the field, not as a verifiable fact.

34.9 The circle closes: ViTs have sinks too

A nice connection with our Part II. ViTs develop very-high-norm artifact tokens in low-information background patches; the model reuses them as memory/scratchpad for global computation, and they show up as spikes in the attention maps (Darcet et al. 2023). It is the same phenomenon as attention sinks in text (Ch. 17): “cheap” tokens that accumulate a disproportionate mass of attention. The solution, adding dedicated register tokens, is the visual parallel of reserving sinks in LLMs. (Honest caveat: later work debates whether all ViTs need them; it is a robust phenomenon, with caveats by architecture.)
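Both halves of that story fit in a short sketch. The attention matrix below is synthetic: a bias is added to one column to imitate a sink, so it illustrates the diagnostic, not real ViT behavior. The helper names are hypothetical and the registers are zeros where a real model would learn them. The first part measures how much attention each token receives; the second shows the mechanics of register tokens, which are appended to the sequence on the way in and discarded on the way out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mass(weights):
    """Average attention each key token receives across all queries."""
    return weights.mean(axis=0)

def add_registers(tokens, registers):
    """Append extra tokens the model can use as scratch space."""
    return np.concatenate([tokens, registers], axis=0)

def drop_registers(tokens, n_registers):
    """Discard the register positions at the output."""
    return tokens[:-n_registers]

rng = np.random.default_rng(0)
n_tokens, d, n_reg = 50, 16, 4

# Synthetic attention: every query is biased toward token 7 (the "sink").
logits = rng.normal(size=(n_tokens, n_tokens))
logits[:, 7] += 6.0
weights = softmax(logits)                  # rows sum to 1
mass = attention_mass(weights)
sink = int(mass.argmax())

# Register mechanics: 50 patch tokens + 4 registers in, 50 tokens out.
tokens = rng.normal(size=(n_tokens, d))
registers = np.zeros((n_reg, d))           # learnable in a real model
extended = add_registers(tokens, registers)
restored = drop_registers(extended, n_reg)
print(sink, extended.shape, restored.shape)  # 7 (54, 16) (50, 16)
```

A uniform spread would give every token a mass of 1/50; one token holding most of it is the signature to look for, in a ViT's patches as much as in text.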

🧪 Try it — tafagent

The text-sink ↔︎ image-register parallel is exactly the kind of phenomenon that tafagent diagnoses in text (concentration mass, η-regime; Ch. 17). The same lens of attention across the sequence (γ, sinks) applies to a ViT’s patches: concentration in a few tokens is not exclusive to language.

34.10 Summary

ViT (Dosovitskiy et al. 2021): image → patches = tokens → a normal transformer encoder. Data-hungry for lack of inductive biases, though DeiT mitigates this with a better training recipe.
CLIP (Radford et al. 2021): image+text encoders, contrastive (Ch. 26) → shared space → zero-shot classification (matches ResNet-50 on ImageNet). Honest: weak at counting/abstracting, with biases.
Vision→LLM: the pattern encoder → bridge → the LLM’s tokens: BLIP-2 (Q-Former), Flamingo (gated cross-attention), LLaVA (projection + instruction tuning).
Fusion: cross-attention (Q from one modality, K/V from another); early-fusion vs cross-attention vs projection.
Audio: spectrogram → transformer (Whisper, Wav2Vec 2.0, AST).
Frontier: natively multimodal / unified tokenization (GPT-4o, Gemini; cited by name only, architectures unpublished).
Circle closed: ViTs have sinks/registers (Darcet et al. 2023) just like text (Part II).

Next (Part VI): we have used the model; now it is time to make it efficient and deployable (quantization, distillation, pruning, serving), where our KV window (Ch. 20) reappears.

34.11 Exercises

Patches. How does the ViT turn an image into “tokens”? What role does the [CLS] play?
Data hunger. Why does the ViT need more data than a CNN, and what did DeiT show in that regard?
CLIP zero-shot. Explain how CLIP classifies an image without training labels for that task. With what objective is it trained?
The bridge. What does a Q-Former / a projection do to “give eyes” to a frozen LLM?
Cross-attention. Which modality do the queries come from and which the keys/values when text looks at an image?
The circle. What do the register tokens of a ViT and the attention sinks of a text LLM have in common?

References

Alayrac, Jean-Baptiste et al. 2022. “Flamingo: A Visual Language Model for Few-Shot Learning.” NeurIPS. https://arxiv.org/abs/2204.14198.

Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” NeurIPS. https://arxiv.org/abs/2006.11477.

Darcet, Timothée, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2023. Vision Transformers Need Registers. https://arxiv.org/abs/2309.16588.

Dosovitskiy, Alexey et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ICLR. https://arxiv.org/abs/2010.11929.

Gong, Yuan, Yu-An Chung, and James Glass. 2021. “AST: Audio Spectrogram Transformer.” Interspeech. https://arxiv.org/abs/2104.01778.

Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. “BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models.” ICML. https://arxiv.org/abs/2301.12597.

Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. “Visual Instruction Tuning.” NeurIPS. https://arxiv.org/abs/2304.08485.

Radford, Alec, Jong Wook Kim, Chris Hallacy, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” ICML. https://arxiv.org/abs/2103.00020.

Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. https://arxiv.org/abs/2212.04356.

Touvron, Hugo, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. “Training Data-Efficient Image Transformers and Distillation Through Attention.” ICML. https://arxiv.org/abs/2012.12877.