How Transformers Attend
The complete field guide and reference — from tokens to long context, every formula verified, measured and built from scratch

Preface
This is a complete field guide to how transformers attend. It starts from scratch (text, tokens, vectors) and climbs, in a single continuous book, to the research frontier of attention across distance: how attention decays, why long context breaks, how to compress a KV-cache, and what the 2026 literature does, and does not, establish about attention collapse.
It is written to be read in two ways. On the surface, every idea is introduced in plain language and with an everyday analogy before any formula, so that a curious reader with no prior background can follow the entire story. Underneath, every claim is backed by a program that runs, a measurement you can reproduce, or a formal proof.
How claims are flagged
Throughout the book, coloured boxes mark the status of each claim. They are a key, not a slogan:
- Derivable from first principles or measured first-hand, with the derivation or the data in plain sight.
- A popular claim that is unjustified or contested in the literature, always accompanied by the citation that disputes it.
- A numerical coincidence with no mechanism, or a published error that we correct.
- Optional rigour (derivations, proofs), marked “Going deeper”. It can be skipped without losing the thread; fold it away if you only want the intuition.
How to read it
- New to transformers? Read the prose and the analogies; skip the “Going deeper” boxes and the code. You will still understand every idea.
- Building with transformers? The second half (attention across distance, KV-cache, long context) and the formula table at the end are made for you.
- Want to tinker? Open the interactive companion, tafagent, and measure these quantities on the model of your choice as you read.