34 Multimodal transformers
Where we are. We close Part V by stepping out of text. The transformer is not just for language: with one simple trick, chopping any input into “tokens”, it sees images (ViT), unites vision and language (CLIP, LLaVA) and hears audio (Whisper). As a bonus, we will see that ViTs develop attention sinks just as text models do, which closes the circle with Part II.
34.1 The idea in one sentence
An image or a sound is turned into a sequence of tokens (patches, spectrogram fragments) and, from there, the same transformer processes them: the architecture does not change, only the way you chop the input.
34.2 Key concepts and their role in the transformer
Before diving in, we define this chapter’s terms and what each one is for inside a transformer:
- Tokenizing any modality. Definition: chopping image or audio into “tokens” (patches, spectrogram fragments). In the transformer: it is the trick that lets you process image or sound with the same transformer, without changing the architecture.
- Vision Transformer (ViT). Definition: applying the transformer to an image chopped into patches. In the transformer: it treats each patch like a word → “an image is worth 16×16 words”.
- Patch. Definition: a fixed chunk of image (e.g. 16×16 px) flattened and projected into a vector. In the transformer: it is the visual equivalent of a token.
- CLIP / contrastive objective. Definition: two encoders (image and text) trained to pull together the correct pairs in a shared space. In the transformer: it uses language as supervision → zero-shot classification.
- Bridge / projection (Q-Former). Definition: a module that translates vision features into the LLM’s token space. In the transformer: it “gives eyes” to a frozen LLM without retraining it whole.
- Cross-attention. Definition: the queries come from one modality and the keys/values from another. In the transformer: the mechanism that lets text “look at” the image patches.
- Fusion strategies. Definition: early fusion (concatenate everything), dedicated cross-attention or projection. In the transformer: the three ways to combine modalities, with different cost and modularity.
- Sinks / register tokens. Definition: high-norm artifact tokens the model uses as a scratchpad. In the transformer: the same phenomenon as attention sinks (Part II) reappears in ViTs.
With that in mind, let’s start with the image.
34.3 Vision Transformer (ViT): the image as a sentence of “stamps”
The ViT (Dosovitskiy et al. 2021) applied the transformer to images with a direct trick:
- Split the image into fixed patches (e.g. 16×16 pixels), non-overlapping.
- Flatten and project each patch into a vector → a “token” (like a word).
- Add positional embeddings and a [CLS] token (whose final state does the classification).
- Feed it all into a standard transformer encoder, unchanged.
Hence its title: “an image is worth 16×16 words”.
🧩 Analogy — the grid of stamps. You cut the photo into a grid of postage stamps and treat each stamp as a “word”. The transformer then “reads” the sentence of stamps, attending between patches just as it attended between words.
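The patch steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference ViT implementation: the function names `patchify` and `vit_tokens` are hypothetical, the width `d_model = 64` is arbitrary, and random matrices stand in for the learned projection, [CLS] token and positional embeddings.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def vit_tokens(image, w_proj, cls_token, pos_emb, patch=16):
    """Patches -> linear projection -> prepend [CLS] -> add positional embeddings."""
    patches = patchify(image, patch)          # (N, patch*patch*C)
    tokens = patches @ w_proj                 # (N, D): one vector per patch
    tokens = np.vstack([cls_token, tokens])   # (N + 1, D): [CLS] goes first
    return tokens + pos_emb                   # same shape, now position-aware

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a real RGB image
d_model = 64                                  # arbitrary width for the sketch
n_patches = (224 // 16) ** 2                  # 14 x 14 grid = 196 patches

# Random stand-ins for parameters that would be learned in a real model.
w_proj = rng.normal(scale=0.02, size=(16 * 16 * 3, d_model))
cls_token = np.zeros((1, d_model))
pos_emb = rng.normal(scale=0.02, size=(n_patches + 1, d_model))

tokens = vit_tokens(image, w_proj, cls_token, pos_emb)
print(tokens.shape)  # (197, 64): 196 patch tokens plus [CLS]
```

From here on, `tokens` is just a sequence of vectors, which is why a standard transformer encoder can consume it without modification.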
The ViT matches or beats CNNs (ResNet) only if pretrained on massive data (JFT-300M); on ImageNet-1k alone, it falls short. Why? Because it lacks the inductive biases that convolution brings “out of the box”, namely locality and translation equivariance; the ViT has to learn them from data. Important caveat: DeiT (Touvron et al. 2021) showed that with good augmentation + distillation a ViT competes on ImageNet-1k alone → part of the hunger was about the recipe, not the architecture. So the data hunger should not be oversold as an insurmountable limit.
34.5 Giving an LLM eyes: generative vision-language models
For an LLM to talk about an image, the common pattern is: vision encoder → bridge/projection → the LLM’s token space.
- BLIP-2 (Li et al. 2023): a lightweight Q-Former (with learnable queries) extracts, via cross-attention, the features of the frozen image encoder and projects them into the frozen LLM. Only the bridge is trained → cheap.
- Flamingo (Alayrac et al. 2022): it interleaves gated cross-attention layers inside a frozen LM (the gate starts at 0 so as not to break it); strong multimodal few-shot.
- LLaVA (Liu et al. 2023): it connects a CLIP encoder to an LLM with a simple projection and instruction-tunes with synthetic multimodal data. It is the simple recipe that dominates today.
🧩 Analogy — the “image-language” translator. The bridge (Q-Former or projection) is a translator that turns the “language of images” into tokens the LLM already knows how to read: it does not teach the LLM to see from scratch, it hands it vision in its own vocabulary.
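To make the projection variant of the bridge concrete, here is a minimal sketch under stated assumptions: the sizes are hypothetical, random arrays stand in for the frozen encoder output and for the LLM's token embeddings, and a single linear layer plays the role of the bridge. It does not reproduce any particular model's code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patches, d_vision = 196, 512   # hypothetical vision encoder output size
n_text, d_llm = 12, 1024         # hypothetical prompt length and LLM width

# Stand-ins for the outputs of the two frozen components.
vision_feats = rng.normal(size=(n_patches, d_vision))   # one feature vector per patch
text_embeds = rng.normal(size=(n_text, d_llm))          # embedded prompt tokens

# The bridge: in this sketch, the only piece that would be trained.
w_bridge = rng.normal(scale=0.02, size=(d_vision, d_llm))
b_bridge = np.zeros(d_llm)

# Translate vision features into the LLM's embedding width.
visual_tokens = vision_feats @ w_bridge + b_bridge      # (196, 1024)

# The LLM then reads visual tokens and text tokens as one ordinary sequence.
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (208, 1024)
```

The design point is that nothing inside the LLM changes: the bridge only has to produce vectors of the right width, and the LLM's existing self-attention does the mixing.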
34.6 How two modalities are fused: cross-attention
The mechanism for one modality to “look at” another is cross-attention: the queries (Q) come from one modality and the keys/values (K,V) from the other, e.g. the text queries the image patches. There are three fusion strategies:
- Early fusion / concatenation: you join the tokens of both modalities into a single sequence and apply self-attention over everything (everything attends to everything). Simple and expressive, but quadratic cost in the combined length. (It is, in essence, what LLaVA does after projecting.)
- Dedicated cross-attention (Flamingo): separate streams with cross-attention layers where one modality queries the other; modular, it leaves the base LM intact.
- Projection (LLaVA): it adds no new attention; it translates vision into the LLM’s space and lets its normal self-attention mix them.
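The cost difference between these strategies can be made tangible by counting attention-score entries per head and per layer. The sketch below is a rough back-of-the-envelope count only: the function name and the token counts (32 text tokens, 576 image tokens) are hypothetical, and it ignores projections, MLPs and the number of layers of each kind.

```python
def attention_score_counts(n_text: int, n_image: int) -> dict:
    """Count attention-score entries per head, per layer, for each fusion style."""
    return {
        # Early fusion / projection: one self-attention over the joint sequence.
        "joint_self_attention": (n_text + n_image) ** 2,
        # Dedicated cross-attention: text queries (rows) x image keys (columns).
        "cross_attention": n_text * n_image,
        # Text-only self-attention that the base LM performs anyway.
        "text_self_attention": n_text ** 2,
    }

counts = attention_score_counts(n_text=32, n_image=576)
print(counts["joint_self_attention"])  # 369664 = (32 + 576) ** 2
print(counts["cross_attention"])       # 18432  = 32 * 576
print(counts["text_self_attention"])   # 1024   = 32 ** 2
```

With many image tokens and few text tokens, the joint sequence is dominated by image-to-image scores, which is where the quadratic cost of concatenation comes from.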
🧩 Analogy — questions and answers. In cross-attention, the text asks the questions (the queries) and the image patches answer (they provide keys and values): each word “looks at” the regions of the image that matter to it.
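The question-and-answer picture corresponds to the following single-head sketch in NumPy. The function `cross_attention`, the sizes and the random weights are illustrative assumptions; a real layer would add multiple heads, an output projection and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, image, w_q, w_k, w_v):
    """Single-head cross-attention: text asks (Q), image patches answer (K, V)."""
    q = text @ w_q                             # (T, d) queries from the text
    k = image @ w_k                            # (P, d) keys from the image
    v = image @ w_v                            # (P, d) values from the image
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P): one score per word-patch pair
    weights = softmax(scores, axis=-1)         # each word's distribution over patches
    return weights @ v, weights                # (T, d) image-informed text states

rng = np.random.default_rng(0)
T, P = 5, 9                                    # hypothetical: 5 words, 9 patches
d_text, d_img, d = 32, 48, 16                  # the two modalities may differ in width

text = rng.normal(size=(T, d_text))
image = rng.normal(size=(P, d_img))
w_q = rng.normal(scale=d_text ** -0.5, size=(d_text, d))
w_k = rng.normal(scale=d_img ** -0.5, size=(d_img, d))
w_v = rng.normal(scale=d_img ** -0.5, size=(d_img, d))

out, weights = cross_attention(text, image, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (5, 16) (5, 9)
```

Each row of `weights` sums to 1 and says how much one word draws from each patch; the output has one vector per word, not per patch, because the text is the side asking the questions.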
34.7 Audio: turning sound into an “image”
The pattern repeats: spectrogram (image-like) or learned features → transformer.
- Whisper (Radford et al. 2022): an encoder-decoder transformer over log-Mel spectrograms, trained on 680,000 hours of weakly supervised audio → robust zero-shot speech recognition.
- Wav2Vec 2.0 (Baevski et al. 2020): self-supervised (it masks features and solves a contrastive task), then fine-tuned with little labeled audio.
- AST (Gong et al. 2021): a ViT applied directly to spectrograms, with no convolutions: the most literal example of “audio as image”.
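The “audio as image” idea can be sketched as follows. This is a simplified illustration, not the pipeline of any of the models above: it computes a plain log-magnitude STFT rather than a log-Mel spectrogram, it cuts non-overlapping patches, and the function names are hypothetical. The 400-sample window and 160-sample hop correspond to 25 ms and 10 ms at 16 kHz, a common framing for speech.

```python
import numpy as np

def log_spectrogram(wave, n_fft=400, hop=160):
    """Log-magnitude STFT, shape (frames, frequency bins). Not a Mel spectrogram."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=-1))   # (n_frames, n_fft // 2 + 1)
    return np.log(mag + 1e-6)                    # small constant avoids log(0)

def spectrogram_patches(spec, patch=16):
    """Crop to a multiple of the patch size, then cut into flattened patches."""
    f, b = spec.shape
    f, b = f - f % patch, b - b % patch
    x = spec[:f, :b].reshape(f // patch, patch, b // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

sr = 16000
t = np.arange(sr) / sr                           # one second of audio
wave = np.sin(2 * np.pi * 440 * t)               # synthetic 440 Hz tone as a stand-in

spec = log_spectrogram(wave)
tokens = spectrogram_patches(spec)
print(spec.shape)    # (98, 201): time frames x frequency bins
print(tokens.shape)  # (72, 256): a 6 x 12 grid of 16 x 16 patches
```

Once the spectrogram is a 2D array, the patch step is the same one used for images, which is why the same encoder can be reused for sound.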
34.8 The frontier: natively multimodal (brief)
The trend is to train models that are multimodal from scratch (not an encoder + LLM glued on afterward) with unified tokenization (“any-to-any”: text, image, audio in a common vocabulary). Models like GPT-4V/4o or Gemini point that way, but their architecture and data are proprietary and unpublished, so we cite them by name only, as a direction of the field, not as a verifiable fact.
34.9 The circle closes: ViTs have sinks too
A nice connection with our Part II. ViTs develop very-high-norm artifact tokens in low-information background patches, which the model reuses as memory/scratchpad for global computation and which show up as spikes in the attention maps (Darcet et al. 2023). It is the same phenomenon as attention sinks in text (Ch. 17): “cheap” tokens that accumulate a disproportionate mass of attention. The solution, adding dedicated register tokens, is the visual parallel of reserving sinks in LLMs. (Honest caveat: later work debates whether all ViTs need them; a robust phenomenon, with caveats by architecture.)
The text-sink ↔︎ image-register parallel is exactly the kind of phenomenon that tafagent diagnoses in text (concentration mass, η-regime; Ch. 17). The same lens of attention across the sequence —γ, sinks— applies to a ViT’s patches: concentration in a few tokens is not exclusive to language.
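As a toy illustration of “concentration in a few tokens”, the sketch below measures how much attention each key token receives and flags tokens that receive far more than a uniform share. It is a generic measurement on synthetic data, not tafagent's metric: the function `attention_mass`, the planted artifact at index 7 and the 5x threshold are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mass(attn):
    """Mean attention received by each key token; attn is (queries, keys), rows sum to 1."""
    return attn.mean(axis=0)

rng = np.random.default_rng(0)
n = 50                                   # hypothetical number of patches (or text tokens)
logits = rng.normal(size=(n, n))
logits[:, 7] += 5.0                      # planted artifact: every query scores token 7 highly

attn = softmax(logits, axis=-1)
mass = attention_mass(attn)

uniform = 1.0 / n                        # share each token would get if attention were flat
sinks = np.flatnonzero(mass > 5 * uniform)
print(sinks)                             # indices of tokens hoarding attention
print(round(float(mass[7]), 3), "vs uniform share", uniform)
```

The same measurement works whether the keys are words or patches, because it only looks at the attention matrix.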
34.10 Summary
- ViT (Dosovitskiy et al. 2021): image → patches = tokens → a normal transformer encoder. Data-hungry for lack of inductive biases, although DeiT shows that a better training recipe mitigates this.
- CLIP (Radford et al. 2021): image+text encoders, contrastive (Ch. 26) → shared space → zero-shot classification (matches ResNet-50 on ImageNet). Honest caveat: weak at counting/abstracting, with biases.
- Vision→LLM: the pattern encoder → bridge → the LLM’s tokens: BLIP-2 (Q-Former), Flamingo (gated cross-attention), LLaVA (projection + instruction tuning).
- Fusion: cross-attention (Q from one modality, K/V from another); early-fusion vs cross-attention vs projection.
- Audio: spectrogram → transformer (Whisper, Wav2Vec 2.0, AST).
- Frontier: natively multimodal / unified tokenization (GPT-4o, Gemini — by name, unpublished).
- Circle closed: ViTs have sinks/registers (Darcet et al. 2023) just like text (Part II).
Next (Part VI): we have used the model; now it is time to make it efficient and deployable (quantization, distillation, pruning, serving), where our KV window (Ch. 20) reappears.
34.11 Exercises
- Patches. How does the ViT turn an image into “tokens”? What role does the [CLS] token play?
- Data hunger. Why does the ViT need more data than a CNN, and what did DeiT show in that regard?
- CLIP zero-shot. Explain how CLIP classifies an image without training labels for that task. With what objective is it trained?
- The bridge. What does a Q-Former / a projection do to “give eyes” to a frozen LLM?
- Cross-attention. Which modality do the queries come from and which the keys/values when text looks at an image?
- The circle. What do the register tokens of a ViT and the attention sinks of a text LLM have in common?