7 The feed-forward network (FFN)

Where we are. Attention (Ch. 4–5) makes each token gather information from the others. But gathering isn’t processing: each token now needs to think about what it has collected. That’s the job of the feed-forward network (FFN), the other half of each transformer block. It’s the piece with the most parameters in the model, and yet the one many tutorials dispatch in a single line.

7.1 The idea in one sentence

After attention, each token passes through a small two-layer network that transforms it: the same network for all tokens, applied to each one separately.

7.2 Key concepts and their role in the transformer

Before diving into the details, let’s define this chapter’s terms and what each one is for inside a transformer:

FFN (feed-forward network). Definition: a two-layer network (widen → nonlinearity → compress) applied to each token separately. In the transformer: it’s the “computation” or “thinking” step; while attention moves information between tokens, the FFN processes it within each one.
Position-wise processing. Definition: applying the same weights to each token independently. In the transformer: it guarantees that the FFN never passes information from one token to another; that work belongs exclusively to attention.
Expansion (intermediate width, \(d_{ff}\)). Definition: the middle layer is wider than the model vector, typically 4×. In the transformer: it makes room to extract many features in parallel before compressing the useful part back.
Activation / nonlinearity (\(\sigma\)). Definition: the function (ReLU, GELU, SwiGLU…) applied between the two layers. In the transformer: it’s indispensable; without it the two tables would collapse into one and the block couldn’t compute anything complex.
Gating (SwiGLU). Definition: a second projection that decides, element by element, how much to let through from the first. In the transformer: it acts as a learned filter over the activation; because of the extra table, models that use it shrink the width to ≈2.7× so as not to add parameters.
Key-value memory. Definition: a reading of the FFN in which the rows of \(W_1\) detect patterns (keys) and \(W_2\) stores what those patterns write (values). In the transformer: it turns the FFN into a store of associations “if I see this pattern → I write this.”
Localized knowledge (ROME). Definition: concrete facts live mostly in the middle FFNs and can be edited by touching those weights. In the transformer: it makes the FFN the model’s editable knowledge store.
Superposition / polysemanticity. Definition: the model packs more features than it has neurons available, so many neurons fire on unrelated things. In the transformer: it’s the reason “one neuron = one concept” is the exception, not the rule.

With these in mind, let’s see what role the FFN plays against attention.

7.3 What it’s for (its role in the transformer)

Here’s the key that organizes the whole block: attention and the FFN do opposite, complementary jobs.

Attention moves information between tokens (blends, routes): it’s the social part.
The FFN processes the information within each token, separately: it’s the individual part, the “computation” or “thinking” step.

And they alternate, block after block: blend (attention) → process (FFN) → blend → process… Without the FFN, the model would only know how to average and route, but not compute rich features from what it gathered.
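To make the alternation concrete, here is a minimal sketch of two stacked blocks. The class name `TinyBlock` and the sizes are hypothetical, chosen only for illustration, and normalization is deliberately left out to keep the two roles visible; a real block would include it.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical minimal block: attention (blend), then FFN (process)."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Blend: every token reads from the other tokens.
        blended, _ = self.attn(x, x, x, need_weights=False)
        x = x + blended
        # Process: every token is transformed on its own.
        x = x + self.ffn(x)
        return x

torch.manual_seed(0)
blocks = nn.Sequential(TinyBlock(), TinyBlock())  # blend -> process -> blend -> process
x = torch.randn(1, 5, 64)                         # 1 sequence, 5 tokens
out = blocks(x)
print(out.shape)                                  # torch.Size([1, 5, 64])
```

Each step adds its result back onto `x`, so the shape never changes and blocks can be stacked freely.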

🧩 Analogy. Attention is the team meeting: everyone shares notes and each takes away what’s relevant from the others. The FFN is each person going back to their desk, alone, to think it over and update their notebook. Meeting (attention) → solo reflection (FFN), over and over in each block.

7.4 The formula

It’s a two-layer network with an “activation” in the middle:

\[ \text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2 \tag{7.1}\]

What each piece does:

\(W_1\) = a learned table that widens the token vector, projecting it into a larger space (where more “features” computed at once can fit).
\(\sigma\) = the activation (the nonlinearity). It’s the indispensable piece: without it, \(W_1\) and \(W_2\) would collapse into a single table and the whole block would be one plain linear map, incapable of computing anything complex. The activation is what makes it possible to represent relationships that aren’t mere proportions.
\(W_2\) = a second table that compresses back to the original size (\(d_{model}\)), to return the result to the residual stream.
\(b_1, b_2\) = biases (a constant term that shifts each layer).

In one sentence: widen → apply the nonlinearity → compress.
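As a quick check of equation 7.1, the sketch below writes the FFN out by hand and compares it with a single linear layer. The tiny sizes and random weights are assumptions made only for the demonstration, and ReLU stands in for \(\sigma\).

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 4, 16          # toy sizes, chosen for illustration
f64 = torch.float64            # double precision keeps the comparison clean

W1, b1 = torch.randn(d_ff, d_model, dtype=f64), torch.randn(d_ff, dtype=f64)
W2, b2 = torch.randn(d_model, d_ff, dtype=f64), torch.randn(d_model, dtype=f64)
x = torch.randn(d_model, dtype=f64)

def ffn(x, sigma):
    # Equation 7.1: widen, apply the activation, compress.
    return W2 @ sigma(W1 @ x + b1) + b2

# One single layer built by multiplying the two tables together.
W_single = W2 @ W1
b_single = W2 @ b1 + b2
single_layer = W_single @ x + b_single

# With no activation (identity), the FFN equals that single layer.
collapses = torch.allclose(ffn(x, lambda h: h), single_layer)
print(collapses)               # True

# With ReLU in the middle, the shortcut no longer reproduces the output.
survives = not torch.allclose(ffn(x, torch.relu), single_layer)
print(survives)
```

The first comparison holds for any weights; the second depends on at least one hidden unit being cut off by ReLU, which is what happens in practice.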

7.5 The expansion: why wider in the middle?

The middle layer is wider than the model vector, typically 4× (in the original Transformer, 512 → 2048; in GPT-2, 768 → 3072). Why? The wide space gives room to extract many features in parallel before compressing the useful part back. Think of fanning out a range of possibilities, keeping what matters, and folding it back up.

Careful: the “4×” is a convention, not a law. Modern LLaMA-style models use ≈2.7× because their activation (SwiGLU) adds an extra table, and they shrink the width to keep the parameter count the same.
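The ≈2.7× figure falls out of simple arithmetic: three tables of width \(d_{ff}\) should cost about the same as two tables of width \(4\,d_{model}\), which gives \(d_{ff} \approx \tfrac{8}{3}\,d_{model}\). The sketch below works this out for \(d_{model}=4096\). The helper `swiglu_width` is hypothetical, and rounding up to a multiple of 256 is an assumption modeled on how LLaMA-style code commonly picks the width.

```python
def swiglu_width(d_model, multiple_of=256):
    """Hypothetical helper: width that keeps a 3-table FFN near 8 * d_model**2."""
    target = 8 * d_model // 3                      # floor of (8/3) * d_model
    # Round up to a hardware-friendly multiple (an assumption, see lead-in).
    return multiple_of * ((target + multiple_of - 1) // multiple_of)

d_model = 4096
d_ff_classic = 4 * d_model                         # classic 4x expansion
d_ff_gated = swiglu_width(d_model)

params_classic = 2 * d_model * d_ff_classic        # W1 and W2
params_gated = 3 * d_model * d_ff_gated            # gate, up and down tables

print(d_ff_gated, d_ff_gated / d_model)            # 11008 2.6875
print(params_classic, params_gated)                # 134217728 135266304
```

The two parameter counts differ by less than 1%, which is the whole point of shrinking the width.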

7.6 The activations: ReLU → GELU → SwiGLU

The activation \(\sigma\) has evolved over time, but its job is always the same: to inject the nonlinearity that makes complex computation possible.

ReLU (original): lets the positive part through, sharply cuts the negative to zero.
GELU (GPT-2/3, BERT): a smooth version of ReLU, without the abrupt cutoff.
SwiGLU (LLaMA, PaLM): adds a gate, that is, a second projection that decides, element by element, how much to let through from the first. It’s like a learned filter on top of the activation.
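The three activations can be compared side by side on a few values, and the gating idea fits in a handful of lines. This is a sketch, not any particular model’s code: the class name `SwiGLUFFN`, the attribute names, the sizes, and the choice to drop biases are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

h = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(torch.relu(h))   # negatives cut to exactly zero
print(F.gelu(h))       # smooth: small negative values survive
print(F.silu(h))       # SiLU/Swish, the activation used inside SwiGLU

class SwiGLUFFN(nn.Module):
    """Hypothetical gated FFN: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # the filter
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # the content
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # compress back

    def forward(self, x):
        gate = F.silu(self.w_gate(x))         # element-wise "how much to let through"
        return self.w_down(gate * self.w_up(x))

torch.manual_seed(0)
swiglu = SwiGLUFFN(d_model=64, d_ff=172)      # 172 is about 2.7 x 64
y = swiglu(torch.randn(5, 64))
print(y.shape)                                # torch.Size([5, 64])
```

Note the three `nn.Linear` tables instead of two: that is the extra table that motivates the narrower middle layer.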

7.7 A detail that’s often confused

✓ Verified — the FFN does NOT mix tokens

The FFN is applied to each token separately and with the same weights (“position-wise” in the paper). It never passes information from one token to another; that work is exclusive to attention. It’s a common mistake to think the FFN “combines words”: it doesn’t, it only processes each one in place.
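This claim is easy to test: change one token and see which outputs move. The sketch below uses toy sizes and random weights (assumptions made only for the experiment).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(5, d_model)              # 5 tokens
x_changed = x.clone()
x_changed[2] = torch.randn(d_model)      # overwrite token 2 only

with torch.no_grad():
    y, y_changed = ffn(x), ffn(x_changed)

# Which output rows differ between the two runs?
same_row = torch.isclose(y, y_changed).all(dim=1)
changed_rows = (~same_row).nonzero().flatten().tolist()
print(changed_rows)                      # [2]
```

Only the token that was edited changes. Running the same experiment through an attention layer would change every row, because there each token reads from the others.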

7.8 The FFN holds most of the parameters

Even though it looks like the humble component, the FFN concentrates about 2/3 of the parameters in each transformer block (because of that 4× width).

🔍 Going deeper — the 2/3 count

Per layer, attention uses ≈ \(4\,d_{model}^2\) parameters (the Q, K, V tables and the output one). The FFN uses \(2\,d_{model}\cdot d_{ff} = 2\,d_{model}\,(4\,d_{model}) = 8\,d_{model}^2\). So FFN/total \(= 8/(8+4) = 2/3\). (Approximate: it ignores embeddings and normalization, and it changes with the width and with GQA.)
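The same count can be done by letting PyTorch tally the weights. This sketch uses GPT-2 small’s sizes (768 and 3072) and, like the derivation above, ignores biases, embeddings and normalization; the four separate `nn.Linear` layers for attention are a simplification for counting, not how an attention module is usually written.

```python
import torch.nn as nn

d_model, d_ff = 768, 3072

# Attention: the Q, K, V and output tables, each d_model x d_model.
attn_tables = nn.ModuleList(
    [nn.Linear(d_model, d_model, bias=False) for _ in range(4)]
)
# FFN: W1 (widen) and W2 (compress).
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),
    nn.GELU(),
    nn.Linear(d_ff, d_model, bias=False),
)

def count(module):
    return sum(p.numel() for p in module.parameters())

attn_params = count(attn_tables)
ffn_params = count(ffn)
share = ffn_params / (attn_params + ffn_params)

print(attn_params)       # 2359296  (= 4 * 768**2)
print(ffn_params)        # 4718592  (= 8 * 768**2)
print(round(share, 4))   # 0.6667
```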

7.9 The fascinating part: the FFN is a “memory”

What does the FFN actually store? A beautiful answer: it works like a key-value memory (Geva et al. 2021). The rows of \(W_1\) act as keys that detect patterns in the input (in the lower layers, surface/syntactic patterns; in the upper ones, semantic), and \(W_2\) stores the values those patterns write toward the output. The FFN isn’t a jumble: it’s a store of associations “if I see this pattern → I write this.”
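The key-value reading is just equation 7.1 rewritten as a sum: each hidden unit \(i\) contributes its match strength times one column of \(W_2\). The sketch below checks that the two forms agree. The toy sizes, random weights, and ReLU as the activation are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 4, 6
f64 = torch.float64

W1, b1 = torch.randn(d_ff, d_model, dtype=f64), torch.randn(d_ff, dtype=f64)
W2, b2 = torch.randn(d_model, d_ff, dtype=f64), torch.randn(d_model, dtype=f64)
x = torch.randn(d_model, dtype=f64)

# Standard form: W2 * sigma(W1 x + b1) + b2
standard = W2 @ torch.relu(W1 @ x + b1) + b2

# Memory form: sum over hidden units of (match strength) * (value vector)
memory = b2.clone()
for i in range(d_ff):
    key_i = W1[i]                                # row i of W1: the pattern detector
    value_i = W2[:, i]                           # column i of W2: what gets written
    match_i = torch.relu(key_i @ x + b1[i])      # how strongly the key fires on x
    memory = memory + match_i * value_i

agree = torch.allclose(standard, memory)
print(agree)                                     # True
```

Keys whose match is zero write nothing, so for any given token only part of the “memory” is read.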

🧠 Curiosity — facts live in the FFN

If this is a memory, can it be edited? Yes. Facts of the type “the capital of France is Paris” are localized mostly in the FFNs of the middle layers, and you can change a specific fact by directly editing those weights. That’s what the ROME method does (Meng et al. 2022). The FFN is, to a large degree, the model’s editable knowledge store.
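To get a feel for what “editing a weight to change a fact” means, here is the simplest possible rank-one edit on a toy \(W_2\). This is not the ROME algorithm itself, which is more involved (it locates the layer with causal tracing and accounts for the statistics of other keys); it is a simplified sketch under stated assumptions, with random vectors standing in for a real key and a real value.

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 4, 6
f64 = torch.float64

W2 = torch.randn(d_model, d_ff, dtype=f64)   # toy "value" table
k = torch.randn(d_ff, dtype=f64)             # hidden pattern standing in for a subject
v_new = torch.randn(d_model, dtype=f64)      # what we want that pattern to write

# Rank-one update: add the missing difference, only along direction k.
W2_edited = W2 + torch.outer(v_new - W2 @ k, k) / (k @ k)

# The edited table now writes v_new for key k.
fact_changed = torch.allclose(W2_edited @ k, v_new)
print(fact_changed)                          # True

# A key orthogonal to k is left untouched by the edit.
other = torch.randn(d_ff, dtype=f64)
other = other - (other @ k) / (k @ k) * k    # remove the component along k
other_kept = torch.allclose(W2_edited @ other, W2 @ other)
print(other_kept)                            # True
```

Real keys are never perfectly orthogonal to each other, which is one reason practical editing methods need more care than this.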

7.10 And are the “neurons” interpretable?

It would be nice if each neuron of the wide layer represented one clean concept. In practice, many are polysemantic (they fire on unrelated things), due to a phenomenon called superposition: the model packs more features than it has neurons available. Some neurons are interpretable, but it’s best not to overstate it: “one neuron = one concept” is the exception, not the rule. (We return to this in Part VII, interpretability.)

7.11 In code

import torch
import torch.nn as nn

d_model, d_ff = 768, 3072          # GPT-2 small: 4× expansion
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),      # W1 and b1: widens 768 -> 3072
    nn.GELU(),                     # the nonlinearity
    nn.Linear(d_ff, d_model),      # W2 and b2: compresses 3072 -> 768
)
x = torch.randn(5, d_model)        # 5 tokens, one per row
print(ffn(x).shape)                # torch.Size([5, 768]): same size, ready for the residual

Notice: it comes in and goes out at the same size (\(d_{model}\)), because its output is added to the residual stream (Ch. 3). And it’s applied to all 5 tokens alike, each on its own.

7.12 Summary

The FFN is a two-layer network (widen → nonlinearity → compress) that processes each token separately.
Job: while attention moves info between tokens, the FFN processes it within each one; the block alternates blend→process. Without its nonlinearity, there’d be no complex computation.
The middle layer is ~4× wider (convention; SwiGLU uses ≈2.7×).
The FFN concentrates ~2/3 of the parameters.
It works as a key-value memory (Geva et al. 2021) and is where facts live (editable with ROME; Meng et al. 2022). Its neurons tend to be polysemantic (superposition).

Next chapter: we now have attention (blend) and the FFN (process). What’s missing is the scaffolding that keeps everything standing as we stack many layers: residual connections and normalization.

7.13 Exercises

The nonlinearity. Explain why, if you remove \(\sigma\), the two layers \(W_1\) and \(W_2\) are equivalent to a single one. What does the model lose?
The count. With \(d_{model}=1024\) and a 4× expansion, how many parameters does \(W_1\) have? And \(W_2\)? (Ignore biases.)
Division of labor. For each task, say whether it’s done by attention or the FFN: (a) connecting “she” with its referent; (b) deciding that a token activates the concept “capital of a country”; (c) averaging information from distant tokens.
Editing a fact. If facts live in the middle FFNs, why would it make sense to edit there and not in the embedding layer? (Think about what each one stores.)

References

Geva, Mor, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. “Transformer Feed-Forward Layers Are Key-Value Memories.” EMNLP. https://arxiv.org/abs/2012.14913.

Meng, Kevin, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. “Locating and Editing Factual Associations in GPT.” NeurIPS. https://arxiv.org/abs/2202.05262.