38  A primer on mechanistic interpretability

Where we are. This chapter opens Part VII, which takes up the book’s underlying question: what does a transformer really do inside? Mechanistic interpretability (MI) tries to reverse-engineer the algorithms the network learned, turning opaque weights into understandable mechanisms backed by causal evidence. Below are its concepts (residual stream, QK/OV circuits, superposition), its methods (logit lens, activation patching) and, true to this book, its honest limits, which set up the next chapter.

38.1 The idea in one sentence

Mechanistic interpretability is about decompiling the network (recovering from the weights the readable “algorithm” it runs) and, above all, demonstrating that algorithm with causal interventions, not correlations.

38.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:

  • Mechanistic interpretability (MI). Definition: reverse-engineering the network’s internal circuits. In the transformer: it turns opaque weights into verifiable mechanisms —the opposite of treating it as a black box.
  • Circuit. Definition: a subset of components that, together, implement a specific algorithm. In the transformer: the unit MI seeks to isolate (e.g. the induction head).
  • Residual stream (as a channel). Definition: the shared “lane” that each component reads from and writes to (Ch. 3). In the transformer: the medium through which heads and MLPs communicate across layers.
  • QK vs OV circuit. Definition: QK decides where to attend; OV, what to write. In the transformer: it decomposes each head into “who it looks at” and “what it contributes”.
  • Superposition. Definition: storing more features than dimensions in nearly orthogonal directions. In the transformer: it causes polysemanticity (one neuron fires for disparate things) → the big obstacle.
  • Sparse autoencoder (SAE). Definition: an overcomplete, sparse dictionary that decomposes activations into monosemantic features. In the transformer: it tries to undo superposition and yield interpretable features.
  • Logit lens. Definition: projecting the intermediate activation through the output matrix to read the model’s “bet” at each layer. In the transformer: it shows how the prediction forms with depth.
  • Activation patching (causal tracing). Definition: transplanting an activation between a “clean” run and a “corrupted” one and measuring the effect. In the transformer: it isolates which component is causally responsible for a behavior.

With this in hand, let’s open the box.

38.3 What MI is (and isn’t)

Mechanistic interpretability is reverse engineering: from the weights of a trained network, recovering the algorithm it runs —like decompiling a binary to recover the readable source code. Its hallmark relative to other ways of “explaining”:

  • Vs. black-box / behavioral: that approach only looks at input→output behavior; MI opens the model.
  • Vs. attribution/saliency (attention maps, gradients): these are correlational and can mislead. This is the “attention ≠ explanation” point of Chs. 4 and 13. MI demands causal intervention, not correlation.

🧩 Analogy — decompiling a program. The weights are the compiled binary (unreadable); the circuit MI recovers is the readable source code. Doing MI is sitting down to decompile that binary until you understand which algorithm it implements.

38.4 The foundational framework: circuits in the residual stream

The canonical text (Elhage et al. 2021) gave us the vocabulary. Its ideas, which connect with Chs. 3 and 5:

  • The residual stream is a communication channel (Ch. 3): each head and each MLP reads from subspaces of the stream and writes to others. It’s not just “memory”; it’s the bus over which components pass information to one another.
  • QK vs OV circuit: an attention head decomposes into two nearly independent computations —the QK circuit decides where it looks (the attention pattern) and the OV circuit decides what it writes into the stream when it looks there.
  • Composition: heads chain across layers (one writes something that another, higher up, reads). The canonical example is the induction head (Ch. 5): it needs at least two layers, because a previous-token head feeds the induction head, and it is argued to be a key mechanism behind in-context learning (Olsson et al. 2022).

🧩 Analogy — the shared whiteboard. The residual stream is a whiteboard (or a bus) that all components share: each head and each MLP reads notes others left and writes its own. The circuit is the chain of who writes what so that, in the end, the computation comes out.
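To make the QK/OV split concrete, here is a minimal sketch of a single causal attention head in plain Python. The sizes and weight values are made up for illustration (they are not taken from Elhage et al. 2021), and the row-vector convention used here is an assumption of this sketch. It computes the head the usual way and then again through the two merged matrices, W_QK (where to look) and W_OV (what to write), and the two routes agree.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(scores):
    # Position i may only attend to positions j <= i.
    return [softmax(row[: i + 1]) + [0.0] * (len(row) - i - 1)
            for i, row in enumerate(scores)]

# Toy sizes (assumed): 3 tokens, d_model = 4, d_head = 2.
X = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5],
     [0.5, 0.5, 1.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5], [0.5, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_O = [[0.5, 0.0, 0.0, 1.0], [0.0, 0.5, 1.0, 0.0]]
scale = 1.0 / math.sqrt(2)  # 1 / sqrt(d_head)

# Route 1: the usual head computation via queries, keys and values.
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
scores = [[s * scale for s in row] for row in matmul(Q, transpose(K))]
A = causal_attention(scores)
head_out = matmul(matmul(A, V), W_O)

# Route 2: the circuit view. Two d_model x d_model matrices replace four.
W_QK = matmul(W_Q, transpose(W_K))  # decides where each token looks
W_OV = matmul(W_V, W_O)             # decides what gets written to the stream
scores_qk = [[s * scale for s in row]
             for row in matmul(matmul(X, W_QK), transpose(X))]
A_qk = causal_attention(scores_qk)
head_out_circuit = matmul(A_qk, matmul(X, W_OV))

max_gap = max(abs(a - b)
              for ra, rb in zip(head_out, head_out_circuit)
              for a, b in zip(ra, rb))
print("largest difference between the two routes:", max_gap)
```

The attention pattern depends on the weights only through W_QK, and the written vector only through W_OV, which is why the two can be studied separately.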

38.5 Superposition: why neurons aren’t interpretable

If we open a neuron expecting “the dog neuron”, we’re in for a letdown: it fires for disparate things. The reason is superposition (Elhage et al. 2022): the network represents more features than dimensions by placing them in nearly orthogonal directions that overlap a little. That produces polysemanticity (one neuron = many concepts), the big obstacle to interpreting at the neuron level.

🧩 Analogy — the 50-slot drawer. Superposition is putting 100 labeled objects in a 50-slot drawer, tilting them so they overlap but can still be told apart. You gain capacity, but there’s no longer “one object per slot” → opening a slot (a neuron) gives you a mix.
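The drawer can be simulated in a few lines. The sketch below is a toy illustration, not the experimental setup of Elhage et al. 2022: it assumes random unit directions stand in for learned features, and it packs 100 of them into 50 dimensions. It measures how much the directions overlap, checks that a single active feature can still be read back, and counts how many features load on one coordinate (one “neuron”).

```python
import math
import random

random.seed(0)
d, n = 50, 100  # 50 slots (dimensions), 100 objects (features)

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumption: random directions as a stand-in for learned feature directions.
features = [unit([random.gauss(0, 1) for _ in range(d)]) for _ in range(n)]

# Interference: the features are only *nearly* orthogonal.
max_overlap = max(abs(dot(features[i], features[j]))
                  for i in range(n) for j in range(i + 1, n))

# Activate one feature and read every feature back with a dot product.
active = 7
activation = list(features[active])
readout = [dot(activation, f) for f in features]
recovered = max(range(n), key=readout.__getitem__)

# Polysemanticity: neuron 0 (coordinate 0) carries weight from many features.
neuron0_features = [i for i, f in enumerate(features) if abs(f[0]) > 0.1]

print(f"largest overlap between two features: {max_overlap:.2f}")
print(f"active feature {active}, recovered feature {recovered}")
print(f"features with a sizeable loading on neuron 0: {len(neuron0_features)}")
```

With several features active at once the interference terms add up, which is why superposition relies on features being sparse.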

The most promising approach so far is sparse autoencoders (SAEs) (Cunningham et al. 2023; Bricken et al. 2023): you train an overcomplete, sparse dictionary that decomposes the activation into many features that aim to be monosemantic, undoing superposition. Scaled to a frontier model (Templeton et al. 2024), this extracted millions of features in Claude 3 Sonnet, including the famous “Golden Gate Bridge” feature: when the feature is clamped to a high value, the model behaves as if it were the bridge (causal control of a feature).
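As a rough sketch of what an SAE computes, the following shows a forward pass and loss in the general style of Bricken et al. 2023, heavily simplified. The weights are random and untrained, and the sizes and the L1 coefficient are arbitrary assumptions. The point is only the shape of the computation: a ReLU encoder into a dictionary wider than the activation, a linear decoder back, and a loss that trades reconstruction against sparsity.

```python
import random

random.seed(0)
d_model, d_dict = 4, 16  # overcomplete: more dictionary latents than dimensions
l1_coeff = 0.01          # sparsity penalty weight (assumed value)

def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Untrained parameters, for illustration only.
W_enc = rand_matrix(d_dict, d_model)
b_enc = [0.0] * d_dict
W_dec = rand_matrix(d_model, d_dict)
b_dec = [0.0] * d_model

def sae_forward(x):
    """Return (latent activations f, reconstruction x_hat)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps latents non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x):
    """Return (total loss, reconstruction error, L1 sparsity term)."""
    f, x_hat = sae_forward(x)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(f)  # L1 norm, since f >= 0 after the ReLU
    return recon + l1_coeff * sparsity, recon, sparsity

x = [1.0, -0.5, 0.25, 0.0]  # a made-up activation vector
f, x_hat = sae_forward(x)
total, recon, sparsity = sae_loss(x)
print(f"{sum(1 for v in f if v > 0)} of {d_dict} latents active, loss = {total:.3f}")
```

Training would adjust the four parameter arrays to minimize this loss over many activations; each decoder column then plays the role of one candidate feature direction.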

Warning ⚠ Contested: do SAEs find the “true” features?

SAEs are promising but debated. Two known problems: feature splitting (a single concept is split into many latents to cut the sparsity cost) and absorption (a seemingly “monosemantic” latent stops firing on cases that a more specific latent “steals”) (Chanin et al. 2024). Still open: how much the remaining reconstruction error matters, and whether SAEs recover the model’s true features or just a convenient decomposition.

38.6 The methods: seeing and, above all, intervening

  • Logit lens (nostalgebraist 2020): projects the residual-stream activation at an intermediate layer through the output matrix (unembedding) to read the model’s “current bet” at that layer. It reveals that the prediction forms gradually with depth. The tuned lens (Belrose et al. 2023) refines it with a learned affine map per layer, which makes the readings more reliable than the raw version.
  • Activation patching / causal tracing: the central causal method. You run a clean pass and a corrupted one, transplant an activation from one to the other and measure how much the output changes → you locate which component is causally responsible. With this technique, ROME (Meng, Bau, et al. 2022) localized factual knowledge in the MLPs of intermediate layers (and then edited it; MEMIT (Meng, Sen Sharma, et al. 2022) scaled it to thousands of facts). The canonical full-circuit example is indirect object identification (IOI) in GPT-2 (Wang et al. 2023): 26 heads in 7 classes, evaluated by faithfulness/completeness/minimality.
  • Probing: training a linear classifier on the activations to see whether certain information is present. Its important limit: correlation ≠ use —a probe shows that the information is decodable, not that the model uses it. That’s why MI prefers the causal methods.
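The logit lens from the first bullet can be sketched on a toy residual stream. Everything here is hypothetical: a three-word vocabulary, an identity unembedding matrix, and hand-written layer outputs. Real implementations also apply the final layer norm before unembedding, which this sketch omits.

```python
import math

# Hypothetical toy setup: 3-token vocabulary, d_model = 3.
VOCAB = ["Paris", "London", "Rome"]
W_U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # unembedding
EMBEDDING = [0.0, 0.5, 0.2]           # residual stream before layer 1
LAYER_WRITES = [[0.3, 0.0, 0.0],      # what each layer adds to the stream
                [0.5, -0.2, 0.0],
                [0.8, -0.2, 0.1]]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(embedding, layer_writes, unembed, vocab):
    """Unembed the residual stream after every layer, not just the last."""
    resid = list(embedding)
    readings = []
    for layer, write in enumerate(layer_writes, start=1):
        resid = [r + w for r, w in zip(resid, write)]
        logits = [sum(u * r for u, r in zip(row, resid)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        readings.append((layer, vocab[best], probs[best]))
    return readings

readings = logit_lens(EMBEDDING, LAYER_WRITES, W_U, VOCAB)
for layer, token, prob in readings:
    print(f"after layer {layer}: top token = {token} (p = {prob:.2f})")
```

In this toy stream the early reading favors a different token from the final one, and confidence in the final answer grows with depth, which is the qualitative picture the logit lens is used to expose.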

🧩 Analogy — the transplant. Activation patching is like transplanting an organ between two patients to see which one carries the symptom: if moving that component changes the result, that was the one responsible. It’s a causal test, not a hunch.
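The transplant can be written down for a toy model. The two “components” below are hypothetical stand-ins (plain functions named head and mlp), and the model output is just their sum, read as a logit difference. The sketch caches activations from the clean run, patches one of them at a time into the corrupted run, and reports how much of the clean/corrupted gap each patch restores.

```python
# Hypothetical toy components: each reads the input and writes one number.
def head(x):
    return 2.0 * x[0]

def mlp(x):
    return 0.5 * x[1]

COMPONENTS = {"head": head, "mlp": mlp}

def run(x, patch=None):
    """Run the toy model; `patch` maps a component name to a forced activation."""
    patch = patch or {}
    cache = {}
    for name, fn in COMPONENTS.items():
        cache[name] = patch[name] if name in patch else fn(x)
    output = sum(cache.values())  # stands in for a logit difference
    return output, cache

clean_x = [1.0, 1.0]
corrupted_x = [-1.0, 0.0]

clean_out, clean_cache = run(clean_x)
corrupted_out, _ = run(corrupted_x)

def patching_effect(name):
    """Fraction of the clean-vs-corrupted gap restored by patching one component."""
    patched_out, _ = run(corrupted_x, patch={name: clean_cache[name]})
    return (patched_out - corrupted_out) / (clean_out - corrupted_out)

effects = {name: patching_effect(name) for name in COMPONENTS}
for name, effect in effects.items():
    print(f"patching {name}: restores {effect:.0%} of the clean behavior")
```

Because this toy model is purely additive, the effects sum to one. In a real network components interact, so effects need not add up, and the result also depends on the direction of the patch (clean into corrupted, as here, or the reverse).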

38.7 What MI can and can’t claim (honest)

Here, true to the book, are the caveats, which set up the next chapter:

  • It’s mostly about small models / narrow behaviors. Scaling clean circuits to frontier models is not solved.
  • Cherry-picking and incompleteness: published circuits are usually partial; and the streetlight effect lurks —we study what current tools illuminate, not necessarily what matters.
  • Activation patching itself has subtleties: different metrics (logit difference vs probability vs KL) and ways of “corrupting” can give different conclusions (Zhang and Nanda 2023); there’s no single agreed protocol.
  • SAEs may not recover the true features (above).
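The point about metrics in the third bullet can be seen with made-up numbers. The logits below are invented for illustration and are not results from Zhang and Nanda 2023. Two hypothetical patches are scored with three common metrics: logit difference, probability of the correct token, and KL divergence from the clean distribution.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_diff(logits, correct=0, wrong=1):
    return logits[correct] - logits[wrong]

def prob_correct(logits, correct=0):
    return softmax(logits)[correct]

def kl_from_clean(clean_logits, logits):
    """KL(clean || patched) over the full output distribution."""
    p, q = softmax(clean_logits), softmax(logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Invented logits over three tokens: [correct, wrong, unrelated].
clean = [3.0, 0.0, 0.0]
patch_a = [2.0, 0.0, 0.0]  # hypothetical result of patching component A
patch_b = [3.0, 0.5, 4.0]  # hypothetical result of patching component B

for name, logits in [("A", patch_a), ("B", patch_b)]:
    print(f"patch {name}: logit diff = {logit_diff(logits):.2f}, "
          f"p(correct) = {prob_correct(logits):.2f}, "
          f"KL from clean = {kl_from_clean(clean, logits):.2f}")
```

Here logit difference ranks patch B higher, while probability and KL favor patch A, because B also boosts an unrelated token that logit difference ignores. Which component looks “responsible” can therefore depend on the metric chosen.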

In one honest sentence: MI offers mechanistic hypotheses with causal evidence about specific behaviors, not a complete explanation of the model.

38.8 Bridge to our work (honest)

Our γ / attention-over-distance and the grokking CKA re-rise (Ch. 24) are neighbors of interpretability: they measure internal structure (per-head γ, inter-layer CKA). But the honest asymmetry is worth stating: they are descriptive/correlational measures of aggregate structure, not mechanistic circuits. MI decompiles a specific algorithm with causal intervention; we measure global geometry/statistics without isolating a circuit. They are complementary lenses (structure-level vs algorithm-level), not the same thing.

Note 🧪 Try it: tafagent

tafagent operates at the structure level (per-head γ, regime, sinks), not the circuit level (it doesn’t do activation patching or SAEs). It’s a diagnostic of the geometry of a model’s attention —useful as a first screen before fine-grained MI, but not a substitute for it: it tells you how attention is distributed, not what algorithm generates it.

38.9 Summary

  • MI = decompiling the network into circuits, with causal intervention (not correlation) —as opposed to black box and to saliency (“attention ≠ explanation”).
  • Framework (Elhage et al. 2021): residual stream as channel (Ch. 3); QK (where to look) vs OV (what to write); composition → the induction head as canonical circuit (Ch. 5).
  • Superposition (Elhage et al. 2022): more features than dimensions → polysemanticity; SAEs (Cunningham et al. 2023; Templeton et al. 2024) try to undo it (“Golden Gate” feature). Contested: splitting/absorption (Chanin et al. 2024).
  • Methods: logit lens (the per-layer bet) / tuned lens; activation patching (causal: ROME, IOI (Wang et al. 2023)); probing (present ≠ used).
  • Honest: small models, partial circuits, patching metrics without consensus (Zhang and Nanda 2023); causal hypotheses about specific behaviors, not the whole model.
  • Bridge: our γ/CKA = aggregate structure (descriptive), not circuits —a lens complementary to MI.

Next chapter: with MI as the yardstick for “what’s real”, we reach the chapter that is most our own, Verified vs Folklore vs Numerology: the field’s myths and our formula audit, with Lean receipts and data.

38.10 Exercises

  1. Causal vs correlational. Why does MI insist on intervention and distrust attention maps as “explanation”?
  2. QK vs OV. What does a head’s QK circuit decide and what does the OV decide? Why separate “where it looks” from “what it writes”?
  3. Superposition. Use the 50-slot drawer to explain why a neuron fires for disparate concepts. What do SAEs try to do?
  4. Patching. Describe activation patching and why it’s causal. How does it differ from a linear probe (probing)?
  5. Honesty. Cite two reasons why a published “circuit” might not tell the model’s whole story.
  6. Structure vs circuit. Why is our per-head γ not the same as a mechanistic circuit?

References

Belrose, Nora, Igor Ostrovsky, Lev McKinney, et al. 2023. Eliciting Latent Predictions from Transformers with the Tuned Lens. https://arxiv.org/abs/2303.08112.
Bricken, Trenton, Adly Templeton, et al. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread (Anthropic). https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Chanin, David et al. 2024. A Is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. https://arxiv.org/abs/2409.14507.
Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse Autoencoders Find Highly Interpretable Features in Language Models. https://arxiv.org/abs/2309.08600.
Elhage, Nelson et al. 2021. “A Mathematical Framework for Transformer Circuits.” Transformer Circuits Thread (Anthropic). https://transformer-circuits.pub/2021/framework/index.html.
Elhage, Nelson et al. 2022. “Toy Models of Superposition.” Transformer Circuits Thread (Anthropic). https://arxiv.org/abs/2209.10652.
Meng, Kevin, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. “Locating and Editing Factual Associations in GPT.” NeurIPS. https://arxiv.org/abs/2202.05262.
Meng, Kevin, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2022. Mass-Editing Memory in a Transformer. https://arxiv.org/abs/2210.07229.
nostalgebraist. 2020. Interpreting GPT: The Logit Lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
Olsson, Catherine et al. 2022. “In-Context Learning and Induction Heads.” Transformer Circuits Thread (Anthropic). https://arxiv.org/abs/2209.11895.
Templeton, Adly et al. 2024. “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.” Transformer Circuits Thread (Anthropic). https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.” ICLR. https://arxiv.org/abs/2211.00593.
Zhang, Fred, and Neel Nanda. 2023. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. https://arxiv.org/abs/2309.16042.