8 Residual connections and normalization

Where we are. We now have the two pieces that do the work: attention (blending between tokens) and the FFN (processing each token). But a transformer stacks dozens of these blocks, and naively stacking that many layers doesn’t work: the signal vanishes or degrades. This chapter covers the scaffolding that makes depth possible: residual connections and normalization. They don’t “compute” anything about language; they’re what keeps the machine standing and trainable when it’s deep.

8.1 The idea in one sentence

Each sublayer adds its result to what was already there (it doesn’t replace it), and a normalization keeps the signal at a stable scale. That’s what lets you stack many layers without the model breaking.

8.2 Key concepts and their role in the transformer

Before diving into the details, let’s define this chapter’s terms and what each one is for inside a transformer:

Residual connection. Definition: each sublayer adds its result to the vector that was already there, instead of replacing it (x ← x + Sublayer(x)). In the transformer: it’s the highway the vector travels on; each layer only proposes a correction, which makes depth possible and creates the residual stream from Ch. 3.
Residual stream. Definition: the shared vector that runs through the whole block and that every layer reads from and adds to. In the transformer: the model’s common workspace, its shared “whiteboard” across layers.
Gradient path. Definition: the direct route the residual sum leaves so the learning signal can cross many layers without vanishing. In the transformer: it’s what makes it possible to train deep networks (the ResNet idea; He et al. 2016).
Rank collapse. Definition: without residuals or FFN, pure attention makes all tokens converge to the same vector. In the transformer: it explains why the scaffolding isn’t optional; without it, a deep network self-destructs (Dong et al. 2021).
Normalization. Definition: readjusting the scale of each token’s vector to keep it in a comfortable range. In the transformer: it’s the volume control that keeps the signal from exploding or shrinking as it passes from layer to layer.
LayerNorm. Definition: normalizes to mean 0 and variance 1 and applies a learned gain \(\gamma\) and shift \(\beta\) (Ba et al. 2016). In the transformer: the classic normalization, with knobs to recover whatever scale is convenient.
RMSNorm. Definition: a simpler version that skips re-centering (it doesn’t subtract the mean), only rescaling by the root mean square (Zhang and Sennrich 2019). In the transformer: nearly as good and cheaper; that’s why LLaMA and company use it.
Pre-LN vs Post-LN. Definition: where the normalization is placed relative to the residual sum: before the sublayer (Pre-LN) or after adding (Post-LN). In the transformer: Pre-LN trains stably, can do without warmup, and is today’s standard; switching to it was key to training enormous models.

With these in mind, let’s see what role each part of the scaffolding plays.

8.3 What they’re for (their role in the transformer)

Two pieces of scaffolding, two jobs:

The residual connection is the highway the vector travels on: each layer doesn’t write a new vector, it proposes a correction to the one already running. That makes depth possible (information and gradient don’t get lost along the way) and creates the residual stream from Ch. 3, the common workspace every layer reads and every layer adds to.
The normalization is the volume control: it keeps the signal neither too strong nor too weak, layer after layer, so that training is stable.

Without them, a deep transformer simply doesn’t train (the gradients vanish) or collapses (we see this below).

8.4 The residual connection

The formula couldn’t be simpler:

\[ x \leftarrow x + \text{Sublayer}(x) \tag{8.1}\]

What each part does:

\(x\) = the vector traveling along the residual stream (what’s known about the token so far).
\(\text{Sublayer}(x)\) = what this layer contributes (the attention or the FFN).
\(x + \dots\) = it’s added, not substituted. The layer only adds an edit.

🧩 Analogy. It’s a conveyor belt with a shared whiteboard: the vector travels along the belt and, at each station (layer), someone reads the whiteboard and adds their notes without erasing what came before. (The same image from Ch. 3: here you can see why the writing is additive.)

Why add instead of replace? For two reasons: (1) it leaves a direct path for the gradient, which then doesn’t vanish as it crosses many layers (the ResNet idea; He et al. 2016); and (2) it preserves what’s already been computed, instead of risking losing it at every layer.
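To see the direct path from reason (1), differentiate Eq. 8.1. A short sketch, writing \(F\) for the sublayer, \(y\) for the output of the sum, and \(\mathcal{L}\) for the loss (these three symbols are notation introduced here, not the chapter’s):

\[ y = x + F(x) \;\Longrightarrow\; \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right) = \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x} \]

The first term is the incoming gradient passed through unchanged, whatever the sublayer does. With replacement (\(y = F(x)\)) only the second term would remain, and across many layers that turns into a long product of Jacobians that can shrink toward zero.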

🧠 Curiosity — without residuals, everything becomes the same

Are they really that important? Decisively so. Dong et al. (2021) showed that pure attention, with no residual connections and no FFN, collapses doubly exponentially with depth: all tokens converge to the same vector (a rank-1 matrix), so the information that told them apart is lost, and the collapse accelerates as the network gets deeper. What rescues it is the residual connections and the FFN together. Hence the paper’s title: “Attention is not all you need.” The scaffolding isn’t optional: it’s what keeps the transformer from self-destructing as it grows.
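You can watch a toy version of this collapse in a few lines. This is only a sketch under strong assumptions, not the paper’s setup: a single head, identity projections instead of learned weights, random token vectors, and a made-up helper `token_spread` that measures how far apart the tokens still are.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pure_attention_layer(x):
    # Attention only: no residual sum, no FFN, identity Q/K/V projections (assumption).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def token_spread(x):
    # Hypothetical helper: largest per-coordinate gap between any two tokens.
    # It is 0 exactly when every token vector is the same.
    return float((x.max(axis=0) - x.min(axis=0)).max())

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))  # (n_tokens, d_model)

spreads = [token_spread(x)]
for _ in range(24):
    x = pure_attention_layer(x)
    spreads.append(token_spread(x))

print(f"spread at depth 0:  {spreads[0]:.3e}")
print(f"spread at depth 24: {spreads[-1]:.3e}")
```

Each layer replaces every token with a weighted average of all tokens, so the tokens can only move closer together. Nothing in this stack ever puts the differences back, which is the job the residual sum and the FFN do in a real block.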

8.5 Normalization

The other half of the scaffolding keeps the scale under control.

LayerNorm (Ba et al. 2016) does this, for each token separately: it takes the numbers in its vector, readjusts them to mean 0 and variance 1, and then applies a learned gain (\(\gamma\)) and shift (\(\beta\)).

What it does: it keeps the values from exploding or shrinking as they pass from layer to layer —it leaves them in a comfortable range.
\(\gamma\) and \(\beta\) = two learned knobs that let the model recover whatever scale suits it, in case “mean 0, variance 1” is too strict.

🧩 Analogy. An automatic volume knob: it keeps the signal neither saturated nor inaudible before moving on to the next stage.
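In code, LayerNorm is only a few lines. A minimal NumPy sketch, assuming a small `eps` for numerical safety (the value `1e-5` is a common choice, not something this chapter fixes) and starting the knobs at \(\gamma = 1\), \(\beta = 0\):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, i.e. along the last axis (d_model).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance ~1
    return gamma * x_hat + beta              # learned gain and shift

rng = np.random.default_rng(0)
# Four tokens whose values are deliberately off-center and large.
x = rng.normal(loc=3.0, scale=10.0, size=(4, 8))
gamma = np.ones(8)   # gain starts at 1
beta = np.zeros(8)   # shift starts at 0

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # per-token means, all close to 0
print(y.var(axis=-1))   # per-token variances, all close to 1
```

Each row (token) is normalized on its own, so a token with huge values and a token with tiny values both come out at the same scale.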

RMSNorm (Zhang and Sennrich 2019) is a simpler and cheaper version: it skips re-centering (it doesn’t subtract the mean) and only rescales by the root mean square, with a learned gain. It gives almost the same result and is faster; that’s why LLaMA and company use it. (Careful: the difference from LayerNorm is exactly that, dropping the mean, not “LayerNorm without β”.)
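The same idea as a NumPy sketch, again assuming a small `eps` (here `1e-6`, an illustrative value) and a gain that starts at 1. Notice that no mean is subtracted anywhere:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Root mean square per token; the mean is never subtracted.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one token, clearly not centered
gamma = np.ones(4)

y = rms_norm(x, gamma)
print(np.mean(y ** 2, axis=-1))  # mean square is close to 1
print(y.mean(axis=-1))           # the mean is NOT pulled to 0
```

The output has unit RMS but keeps its offset, which is the whole difference from LayerNorm.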

8.6 Where does the norm go? Pre-LN vs Post-LN

A small detail in the formula, huge in practice: where the normalization is placed relative to the residual sum.

Post-LN (original Transformer): normalizes after adding → \(x = \text{LN}(x + \text{Sublayer}(x))\). Trains with difficulty; needs warmup (raising the learning rate gradually at the start).
Pre-LN (from GPT-2 onward, today’s standard): normalizes the input of the sublayer → \(x = x + \text{Sublayer}(\text{LN}(x))\). The gradients are well-behaved from the start, so it trains stably and can do without the warmup stage (Xiong et al. 2020).

It’s not just “a variant”: the switch to Pre-LN is one of the reasons why enormous transformers are trained today without as many headaches.
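Side by side, the two placements differ by where one function call sits. A sketch under assumptions: `norm` is a LayerNorm without learned gain and shift, and `sublayer` is a hypothetical stand-in (a small tanh layer), not real attention or a real FFN.

```python
import numpy as np

def norm(x, eps=1e-5):
    # LayerNorm without gamma/beta, to keep the sketch short (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_step(x, sublayer):
    # Post-LN: add first, then normalize the whole residual stream.
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize only what the sublayer reads; the stream itself passes through.
    return x + sublayer(norm(x))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(8, 8))

def sublayer(h):
    # Hypothetical stand-in for attention or the FFN.
    return np.tanh(h @ w)

x = rng.normal(size=(4, 8))  # (n_tokens, d_model)
x_post = post_ln_step(x, sublayer)
x_pre = pre_ln_step(x, sublayer)
```

In `pre_ln_step` the old `x` reaches the output untouched, so the direct path along the residual stream never goes through a normalization. In `post_ln_step` every step re-normalizes the stream, old content included.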

8.7 The full block, in code

Here’s what a modern (Pre-LN) transformer block looks like, putting together everything from Ch. 4–7:

# x: (n_tokens, d_model) traveling along the residual stream
x = x + attention(norm1(x))  # blend between tokens  (+ residual sum)
x = x + ffn(norm2(x))        # process each token    (+ residual sum)

Two lines. Notice the pattern: normalize → sublayer → add, twice. Stack this block 12, 32, or 96 times and you have the body of a transformer.
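To make those two lines runnable, here is a self-contained NumPy sketch. Everything beyond the two-line pattern is an assumption made for illustration: a single attention head with a causal mask, a ReLU FFN, a LayerNorm without learned gain and shift, random weights, and the hypothetical helper `init_params`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization, without learned gamma/beta (simplification).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Single-head causal self-attention: each token blends itself and earlier tokens.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ p["wo"]

def ffn(x, p):
    # Position-wise feed-forward network with a ReLU (assumed activation).
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]

def block(x, p):
    x = x + attention(layer_norm(x), p)  # blend between tokens (+ residual sum)
    x = x + ffn(layer_norm(x), p)        # process each token   (+ residual sum)
    return x

def init_params(rng, d_model, d_ff, scale=0.02):
    # Hypothetical helper: small random weights for one block.
    shapes = {"wq": (d_model, d_model), "wk": (d_model, d_model),
              "wv": (d_model, d_model), "wo": (d_model, d_model),
              "w1": (d_model, d_ff), "w2": (d_ff, d_model)}
    return {name: rng.normal(scale=scale, size=s) for name, s in shapes.items()}

def run_stack(x, layers):
    for p in layers:
        x = block(x, p)
    return x

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_layers = 5, 16, 64, 12
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(n_tokens, d_model))
out = run_stack(x, layers)
print(out.shape)  # same shape as the input: the stream keeps its size
```

The shape never changes from block to block, which is what lets you stack as many as you like: every block reads from and writes to the same residual stream.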

8.8 Summary

The residual connection (x ← x + Sublayer(x)) makes each layer add a correction instead of replacing; it keeps a path for the gradient and creates the residual stream. Without it and the FFN, pure attention collapses to rank 1.
Normalization keeps the signal at a stable scale: LayerNorm (mean 0, var 1, + γ, β) and the simpler RMSNorm (RMS only, no re-centering).
Pre-LN (normalize before the sublayer) is today’s standard: it trains stably and can do without warmup, unlike the original Post-LN.
Together they’re the scaffolding that makes stacking dozens of blocks trainable.

Next (Chapter 9): we have attention, the FFN, and the scaffolding. What’s missing is a sense that attention, on its own, doesn’t have: order. Time for position information and RoPE.

8.9 Exercises

Add vs replace. Explain in your own words two reasons why the layer adds its output instead of substituting the vector.
LayerNorm vs RMSNorm. What is the single essential difference between the two?
Pre vs Post. Write the two formulas (Pre-LN and Post-LN) for a sublayer. Which needs warmup and which doesn’t?
Collapse. If you remove the residual connections and leave only attention, what happens to the token representations as depth increases?

References

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. https://arxiv.org/abs/1607.06450.

Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. “Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” ICML. https://arxiv.org/abs/2103.03404.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” CVPR. https://arxiv.org/abs/1512.03385.

Xiong, Ruibin et al. 2020. “On Layer Normalization in the Transformer Architecture.” ICML. https://arxiv.org/abs/2002.04745.

Zhang, Biao, and Rico Sennrich. 2019. “Root Mean Square Layer Normalization.” NeurIPS. https://arxiv.org/abs/1910.07467.