What it is. The lookup table for every important formula in the book, each one with what it does, where it is explained in depth, and —what sets us apart— its receipt: whether it is proved in Lean, backed by data, derived, a labelled analogy, or was corrected (erratum). No narrative here: this is for looking up and checking.
Receipt legend
| 📐 Lean |
Identity formally proved in Lean (algebraic consistency, not causality over the model) |
| 📊 Data |
Backed by reproducible data (atlas, experiments) |
| ✓ Derived |
Derived from first principles or a standard definition |
| ~ Analogy |
A useful analogy/lens, labelled as such (not a theorem) |
| ✗→✓ Erratum |
Had an error, now corrected (and, where applicable, re-proved) |
| 🟡 Provisional |
Claim not fully validated / awaiting replication |
(The philosophy behind this column is Ch. 38: honesty is demonstrated with receipts.)
Part I — Foundations
| \(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\big(\tfrac{QK^\top}{\sqrt{d_k}}\big)V\) |
Weighted mix: each token gathers info from the rest by affinity |
4 |
✓ Derived |
| \(1/\sqrt{d_k}\) (scaling) |
Keeps dot products from growing with \(d_k\) and saturating the softmax |
4 |
✓ Derived |
| \(\mathrm{MHA}=\mathrm{Concat}(\text{heads})\,W^O\) |
Several heads in parallel + output projection |
5 |
✓ Derived |
| \(\mathrm{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2\) |
Processes each token on its own (key-value memory) |
6 |
✓ Derived |
| SwiGLU: \((\mathrm{Swish}(xW)\odot xV)W_2\) |
Gated variant of the FFN (the modern standard) |
6 |
📊 Data |
| \(\mathrm{LayerNorm}(x)=\gamma\tfrac{x-\mu}{\sigma}+\beta\) |
Rescales the activation → stability |
7 |
✓ Derived |
| \(\mathrm{RMSNorm}(x)=\gamma\,\tfrac{x}{\sqrt{\overline{x^2}}}\) |
Like LayerNorm without centering (cheaper) |
7 |
✓ Derived |
| \(\omega_i=\theta^{-2i/d}\) (RoPE frequencies) |
Each pair of dimensions rotates at its own speed |
8 |
✓ Derived |
Part I — Training and inference
| \(\mathcal{L}=-\log p_\text{model}(\text{correct token})\) |
Cross-entropy: the training objective |
11 |
✓ Derived |
| \(\text{perplexity}=e^{\mathcal{L}}\) |
“How many tokens it hesitates among” (readable) |
11 |
✓ Derived |
| \(C\approx 6ND\) |
Compute cost ≈ 6 × parameters × tokens |
25 |
✓ Derived |
| \(L(N,D)=E+\tfrac{A}{N^\alpha}+\tfrac{B}{D^\beta}\) |
Scaling law: loss vs size and data |
25 |
📊 Data |
| \(\mathrm{softmax}(z/\tau)\) (temperature) |
\(\tau\) controls how “adventurous” the sampling is |
12, 29 |
✓ Derived |
Parts II–III — Our theory: attention over distance [EDGE]
| \(A(d)\propto d^{-\gamma}\) |
The decay law: mean attention vs distance |
15 |
📊 Data (R²>0.95, 30+ models) |
| \(\gamma_{\text{Padé}}=\dfrac{2\theta-T\sqrt2}{2\theta+T\sqrt2}\) |
Predicts γ from geometry (θ, T), without training |
15 |
📐 Lean (Cayley) · 📊 (median ~22% error, Phase A) |
| \(\lambda_i=2\pi\,\theta^{2i/d}\) |
Wavelength of each RoPE pair (aliasing) |
14 |
✓ Derived |
| \(n_{\text{active}}(d)=\tfrac{d_\text{head}}{2}\big(1-\log_\theta\tfrac{d}{2\pi}\big)\) |
How many pairs keep positional signal at distance \(d\) |
14 |
✓ Derived |
| \(T_{\text{cross}}=2\pi\sqrt\theta,\ \ T_{\text{max}}=2\pi\theta\) |
The two scales: starts to degrade / runs out |
14 |
✓ Derived |
| \(\gamma_{\text{obs}}=\gamma_{\text{geom}}+\gamma_{\text{train}}+\gamma_{\text{arch}}+\varepsilon\) |
Decomposes measured γ into its sources |
15 |
📊 Data · 🟡 (imprint axis provisional) |
| \(D_f\sim\varepsilon^{-1/(\gamma-1)}\) |
KV window needed, derived from γ (γ>1) |
20 |
🟡 Provisional (no benchmark yet) |
Part III — Physical lens [EDGE]
| \(Z=\mathrm{Li}_\gamma(e^{-\lambda})\) |
Partition function (polylogarithm) of attention |
21 |
✓ Derived |
| \(\text{Fisher}=C_V/\gamma^2\) |
Statistical sensitivity = thermal fluctuation |
22 |
📐 Lean |
| \(C_V(\gamma{=}1,N)=\dfrac{(\log N)^2}{12}\) |
Heat capacity at the boundary |
22 |
✗→✓ Erratum (was /4) · 📐 Lean |
| \(S_\gamma\sim\log N\) (area law) |
Attention entropy grows logarithmically |
22 |
📊 Data (56 models) |
| \(s=(\gamma-1)/2,\ \ \alpha=\gamma-1\) |
Fractional order / Lévy index from γ |
23 |
~ Analogy (labelled mapping) |
| \(\chi=1/|\gamma-1|\) |
Susceptibility near the γ=1 boundary |
21 |
✓ Derived |
| γ=1 ↔︎ Hagedorn temperature |
The phase boundary (Phase A / Phase B) |
21, 39 |
~ Analogy (our synthesis) |
| Imprint: slope \(\nu\approx-1/(2\pi)\) |
“Footprint” of the training data on γ |
15, 38 |
✗ Refuted on data (Lean holds; does not reproduce) |
Parts IV–V — Adaptation and use
| \(\mathcal{L}_R=-\log\sigma\big(r(x,y_w)-r(x,y_l)\big)\) |
Reward model (Bradley-Terry) from preferences |
27 |
✓ Derived |
| \(\max\ \mathbb{E}[r]-\beta\,\mathrm{KL}(\pi_\theta\|\pi_\text{ref})\) |
RLHF objective: maximize reward on a KL leash |
27 |
✓ Derived |
| \(\mathcal{L}_{\text{DPO}}=-\log\sigma\big(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)}-\beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\big)\) |
Optimizes preferences without RL or a reward model |
27 |
✓ Derived |
| NT-Xent: \(-\log\tfrac{e^{\mathrm{sim}(h,h^+)/\tau}}{\sum_j e^{\mathrm{sim}(h,h_j)/\tau}}\) |
Contrastive loss (pull positives close, push negatives away) |
26, 33 |
✓ Derived |
| CFG: \(\text{logits}_\text{without}+\gamma(\text{logits}_\text{with}-\text{logits}_\text{without})\) |
Classifier-free guidance: amplifies the prompt’s effect |
29 |
✓ Derived |
Part VI — Efficiency
| \(h=W_0x+\tfrac{\alpha}{r}BAx\) (LoRA) |
Low-rank delta; \(B{=}0\) at start; fusable |
28 |
📊 Data |
| \(x\approx S\,(x_q-Z)\) (quantization) |
Real↔︎integer map: scale \(S\) + zero-point \(Z\) |
35 |
✓ Derived |
| \(\big(\varphi(Q)\varphi(K)^\top\big)V=\varphi(Q)\big(\varphi(K)^\top V\big)\) |
Reassociation → linear attention O(n) |
34 |
✓ Derived |
| Attention cost \(=O(n^2 d)\) compute, \(O(n^2)\) memory |
Why attention is expensive at long context |
34 |
✓ Derived |
| Latency \(\approx \text{TTFT}+\text{TPOT}\times n_\text{tokens}\) |
Decomposes serving latency |
36 |
✓ Derived |
How to read this table
- To understand a formula term by term, go to the chapter listed: there each letter is defined, along with its role in the transformer and its analogy.
- The Receipt column tells you how much to trust it: a 📐 Lean entry is proved algebraically; 📊 Data, checked empirically; ~ Analogy, a useful lens but not a theorem; 🟡 and ✗ flag what is unvalidated or refuted —including our own cases—.
- The Lean identities link to the formal proofs (📐); the 📊 results link to the open data of Paper I (Zenodo). Nothing here asks you to take our word for it: it gives you where to check it.
Next reference (R2): the cookbook —step-by-step recipes for measuring γ, sizing the KV cache, and applying these formulas to your own model—.