11 Encoder, decoder and encoder-decoder

Where we are. In Ch. 9 we built a generative transformer (a decoder, with a causal mask). But with the same pieces, by changing what each token can look at and whether there’s an encoder, a decoder or both, three families of models emerge: BERT, GPT and T5. This chapter is the transformers’ “tree of life”: it gives you the map to place any model you come across.

11.1 The idea in one sentence

The three families use the same blocks (attention + FFN + scaffolding); what sets them apart is, above all, the attention mask (what each token can see) and whether there’s cross-attention between an encoder and a decoder.

11.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:

Encoder. Definition: a stack of blocks with bidirectional attention. In the transformer: it serves to understand a text in depth; each token sees the whole context, past and future.
Decoder. Definition: a stack of blocks with causal attention. In the transformer: it serves to generate; each token only sees the past and predicts what comes next.
Bidirectional mask. Definition: no restriction; each token attends to every token. In the transformer: it’s the switch that sets a model up for “comprehension” (BERT).
Causal mask. Definition: blocks the future (the positions yet to come). In the transformer: it’s the switch that sets a model up for “generation” (GPT).
Cross-attention. Definition: the decoder’s queries look at the encoder’s keys and values. In the transformer: the input→output bridge; it lets the text being written consult the input text (translate, summarize).
Masked LM (fill in the blanks). Definition: an objective that hides random words and asks the model to guess them using both sides. In the transformer: the typical training of encoder-only models; it produces bidirectional comprehension.
Next-token prediction. Definition: an objective that predicts what comes after the context. In the transformer: the training of decoder-only models; a single objective that scales very well and yields generation.
Denoising (seq2seq). Definition: an objective that reconstructs corrupted text. In the transformer: the training of encoder-decoder models (T5/BART); it fits input→output tasks.

With this vocabulary, the three families stop being loose names and become combinations of mask + objective.

11.3 The three families

Encoder-only (BERT, RoBERTa). Bidirectional attention: each token sees all the others, past and future. It’s trained by filling in the blanks (masked language modeling: hiding words and guessing them from the context on both sides). Function: to understand a text in depth → classification, entity recognition, sentence embeddings. It doesn’t generate text naturally.

Decoder-only (GPT, LLaMA, Mistral). Causal attention: each token only sees the past. It’s trained by predicting the next token. Function: to generate. It’s the dominant architecture of today’s LLMs.

Encoder-decoder (original Transformer, T5, BART). An encoder reads the input bidirectionally; a decoder generates the output causally, looking at the input via cross-attention. Function: sequence to sequence → translation, summarization.

Table 11.1: The three families of transformers

Family	What each token sees (mask)	Objective	What for	Examples
Encoder-only	bidirectional (everything)	fill in blanks (MLM)	understand	BERT, RoBERTa
Decoder-only	causal (the past only)	next token	generate	GPT, LLaMA
Encoder-decoder	bidirectional encoder + causal decoder + cross-attention	denoising (span corruption)	seq2seq	T5, BART

11.4 The key difference: the mask (and cross-attention)

It’s striking how little changes inside: what really separates the families is the attention mask, which lets each token see either everything (bidirectional) or only the past (causal). That single switch largely decides whether the model understands or generates.
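To make the switch concrete, here is a minimal NumPy sketch of the two masks applied to the same attention function. It is an illustration under simplifying assumptions, not how any particular library does it: a single head, no learned projections (queries, keys and values are the raw token vectors), and made-up names such as attention and x_changed. It checks one property: under the causal mask, changing the last token leaves the outputs at earlier positions untouched, while under the bidirectional mask it does not.

```python
import numpy as np

def attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] is True where token i may look at token j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked positions end up with weight 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 4, 8  # toy sizes (assumption): 4 tokens, 8 dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))

bidirectional = np.ones((T, T), dtype=bool)     # every token sees every token
causal = np.tril(np.ones((T, T), dtype=bool))   # token i sees positions 0..i only

# Perturb only the last token and compare the outputs at the earlier positions.
x_changed = x.copy()
x_changed[-1] += 1.0

out_causal = attention(x, x, x, causal)
out_causal_changed = attention(x_changed, x_changed, x_changed, causal)
out_bidir = attention(x, x, x, bidirectional)
out_bidir_changed = attention(x_changed, x_changed, x_changed, bidirectional)

causal_prefix_unchanged = np.allclose(out_causal[:-1], out_causal_changed[:-1])
bidir_prefix_unchanged = np.allclose(out_bidir[:-1], out_bidir_changed[:-1])
print(causal_prefix_unchanged, bidir_prefix_unchanged)
```

The only difference between the two runs is the boolean matrix passed as mask; the attention function itself is identical.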

In the encoder-decoder there’s also a third piece: cross-attention. In plain terms: while the decoder writes the output, its queries look at the encoder’s keys and values. That is, the text being written consults the input text to decide what to say. It’s the input→output bridge.
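A minimal sketch of that bridge, again in NumPy and under stated assumptions: a single head, random matrices w_q, w_k and w_v standing in for learned projections, and hypothetical names (cross_attention, enc_states, dec_states). The point is only who supplies what: queries come from the decoder, keys and values from the encoder, so the weight matrix has one row per output position and one column per input position.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_states, w_q, w_k, w_v):
    q = dec_states @ w_q  # queries: from the text being written
    k = enc_states @ w_k  # keys: from the input text
    v = enc_states @ w_v  # values: from the input text
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape (T_dec, T_enc)
    return weights @ v, weights

rng = np.random.default_rng(0)
T_enc, T_dec, d = 5, 3, 8  # toy sizes (assumption)
enc_states = rng.normal(size=(T_enc, d))
dec_states = rng.normal(size=(T_dec, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

context, weights = cross_attention(dec_states, enc_states, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Note that this sketch applies no mask: the whole input is already known when the decoder starts writing, so each output position may consult every input position. The causal mask lives in the decoder’s self-attention, not here.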

🧩 Analogy. Think of three distinct roles: Encoder (BERT) = a reader who can look at the whole page at once to understand it, but doesn’t write. Decoder (GPT) = a writer who only sees what’s already been written and adds the next word. Encoder-decoder (T5/BART) = a translator who first reads the whole source and then writes the target word by word, glancing sideways at the original (that glance is cross-attention).

11.5 Each family, its training objective

The pretraining objective shapes what each model is good for:

BERT → masked LM (fill in the blanks) → bidirectional comprehension.
GPT → next token → generation.
T5 / BART → denoising (reconstruct corrupted text) → seq2seq.
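The three objectives differ mainly in how a raw sentence is turned into an (input, target) pair. The sketch below builds one pair per objective on a toy sentence. It is deliberately simplified: the positions and spans to hide are passed in by hand instead of being sampled at random, the function names and the sentinel tokens like <S0> are illustrative choices, and details of the real recipes are left out (BERT, for instance, does not always substitute [MASK] for a chosen word, and T5 closes its target with a final sentinel).

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def next_token_pair(tokens):
    """Decoder-only: the target is the input shifted one position to the left."""
    return tokens[:-1], tokens[1:]

def masked_lm_pair(tokens, positions, mask_token="[MASK]"):
    """Encoder-only: hide the chosen positions; the target is the hidden words."""
    hidden = set(positions)
    inputs = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return inputs, targets

def span_corruption_pair(tokens, spans):
    """Encoder-decoder: replace each (start, end) span with a sentinel;
    the target spells out the removed spans, each introduced by its sentinel."""
    inputs, targets, cursor = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<S{n}>"  # illustrative sentinel name
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        cursor = end
    inputs += tokens[cursor:]
    return inputs, targets

gpt_in, gpt_out = next_token_pair(tokens)
bert_in, bert_out = masked_lm_pair(tokens, positions=[1, 4])
t5_in, t5_out = span_corruption_pair(tokens, spans=[(1, 3), (4, 5)])

print(gpt_in, "->", gpt_out)
print(bert_in, "->", bert_out)
print(t5_in, "->", t5_out)
```

In the first case every position has a target (the token that follows it); in the second only the hidden positions do, but the model sees both sides of each gap; in the third the encoder reads the corrupted input and the decoder writes out the missing spans.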

11.6 Why “decoder-only” won for LLMs

The field consolidated around decoder-only for large models for several reasons: a single objective (next token) that scales very well and yields both generation and, via prompting, comprehension, plus in-context learning and generality. Honest caveat: for pure embeddings and classification, encoder-only models are still cheaper and often better, and encoder-decoder models shine on tasks with a well-defined input→output (translation, summarization).

🧠 Curiosity — architecture isn’t destiny

How much does choosing the right architecture really matter? Less than it seems. An empirical study (Wang et al. 2022) found that the “best” architecture changes with the recipe: a causal decoder-only model trained on next-token prediction gives the strongest zero-shot results right after unsupervised pretraining, but a non-causal model pretrained with a fill-in (masked) objective comes out ahead once multitask finetuning is added. And, surprisingly, a pretrained model can be adapted from one setup to another by continuing to train its weights, without retraining from scratch. A single model can even be trained on a mixture of objectives at once (UL2 (Tay et al. 2022), a mixture of denoisers). Moral: architecture decides what kind of model it is, but scale and the training objective matter more, and the families are more fluid than their names suggest.

11.7 Summary

The three families share pieces; what sets them apart is the mask (bidirectional vs causal) and the presence of cross-attention.
Encoder-only (BERT): bidirectional, fills in blanks, understands (doesn’t generate).
Decoder-only (GPT/LLaMA): causal, next token, generates; dominant today.
Encoder-decoder (T5/BART): encoder + decoder + cross-attention, seq2seq.
Decoder-only won for LLMs through scale + generality; encoder-only still wins at cheap embeddings/classification.

Next (Chapter 12): we now know the architectures. How is one of these models actually trained? Objective, optimization, data and scaling laws.

11.8 Exercises

The switch. What is the single main change that turns a “comprehension” model into a “generation” one?
Why doesn’t BERT write an essay? Explain it in terms of its mask and its training objective.
Cross-attention. In a translator (T5), who looks at whom, and what for?
Choosing the tool. Which family would you use for: (a) classifying reviews as positive/negative; (b) a chatbot; (c) translating from English to Spanish? Justify your answer.

References

Tay, Yi et al. 2022. UL2: Unifying Language Learning Paradigms. https://arxiv.org/abs/2205.05131.

Wang, Thomas et al. 2022. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? https://arxiv.org/abs/2204.05832.