42  R1 · Master formulary

What it is. The lookup table for every important formula in the book, each one with what it does, where it is explained in depth, and —what sets us apart— its receipt: whether it is proved in Lean, backed by data, derived, a labelled analogy, or was corrected (erratum). No narrative here: this is for looking up and checking.

42.1 Receipt legend

Mark Meaning
📐 Lean Identity formally proved in Lean (algebraic consistency, not causality over the model)
📊 Data Backed by reproducible data (atlas, experiments)
✓ Derived Derived from first principles or a standard definition
~ Analogy A useful analogy/lens, labelled as such (not a theorem)
✗→✓ Erratum Had an error, now corrected (and, where applicable, re-proved)
🟡 Provisional Claim not fully validated / awaiting replication

(The philosophy behind this column is Ch. 38: honesty is demonstrated with receipts.)

42.2 Part I — Foundations

Formula What it does Ch. Receipt
\(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\big(\tfrac{QK^\top}{\sqrt{d_k}}\big)V\) Weighted mix: each token gathers info from the rest by affinity 4 ✓ Derived
\(1/\sqrt{d_k}\) (scaling) Keeps dot products from growing with \(d_k\) and saturating the softmax 4 ✓ Derived
\(\mathrm{MHA}=\mathrm{Concat}(\text{heads})\,W^O\) Several heads in parallel + output projection 5 ✓ Derived
\(\mathrm{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2\) Processes each token on its own (key-value memory) 6 ✓ Derived
SwiGLU: \((\mathrm{Swish}(xW)\odot xV)W_2\) Gated variant of the FFN (the modern standard) 6 📊 Data
\(\mathrm{LayerNorm}(x)=\gamma\tfrac{x-\mu}{\sigma}+\beta\) Rescales the activation → stability 7 ✓ Derived
\(\mathrm{RMSNorm}(x)=\gamma\,\tfrac{x}{\sqrt{\overline{x^2}}}\) Like LayerNorm without centering (cheaper) 7 ✓ Derived
\(\omega_i=\theta^{-2i/d}\) (RoPE frequencies) Each pair of dimensions rotates at its own speed 8 ✓ Derived

42.3 Part I — Training and inference

Formula What it does Ch. Receipt
\(\mathcal{L}=-\log p_\text{model}(\text{correct token})\) Cross-entropy: the training objective 11 ✓ Derived
\(\text{perplexity}=e^{\mathcal{L}}\) “How many tokens it hesitates among” (readable) 11 ✓ Derived
\(C\approx 6ND\) Compute cost ≈ 6 × parameters × tokens 25 ✓ Derived
\(L(N,D)=E+\tfrac{A}{N^\alpha}+\tfrac{B}{D^\beta}\) Scaling law: loss vs size and data 25 📊 Data
\(\mathrm{softmax}(z/\tau)\) (temperature) \(\tau\) controls how “adventurous” the sampling is 12, 29 ✓ Derived

42.4 Parts II–III — Our theory: attention over distance [EDGE]

Formula What it does Ch. Receipt
\(A(d)\propto d^{-\gamma}\) The decay law: mean attention vs distance 15 📊 Data (R²>0.95, 30+ models)
\(\gamma_{\text{Padé}}=\dfrac{2\theta-T\sqrt2}{2\theta+T\sqrt2}\) Predicts γ from geometry (θ, T), without training 15 📐 Lean (Cayley) · 📊 (median ~22% error, Phase A)
\(\lambda_i=2\pi\,\theta^{2i/d}\) Wavelength of each RoPE pair (aliasing) 14 ✓ Derived
\(n_{\text{active}}(d)=\tfrac{d_\text{head}}{2}\big(1-\log_\theta\tfrac{d}{2\pi}\big)\) How many pairs keep positional signal at distance \(d\) 14 ✓ Derived
\(T_{\text{cross}}=2\pi\sqrt\theta,\ \ T_{\text{max}}=2\pi\theta\) The two scales: starts to degrade / runs out 14 ✓ Derived
\(\gamma_{\text{obs}}=\gamma_{\text{geom}}+\gamma_{\text{train}}+\gamma_{\text{arch}}+\varepsilon\) Decomposes measured γ into its sources 15 📊 Data · 🟡 (imprint axis provisional)
\(D_f\sim\varepsilon^{-1/(\gamma-1)}\) KV window needed, derived from γ (γ>1) 20 🟡 Provisional (no benchmark yet)

42.5 Part III — Physical lens [EDGE]

Formula What it does Ch. Receipt
\(Z=\mathrm{Li}_\gamma(e^{-\lambda})\) Partition function (polylogarithm) of attention 21 ✓ Derived
\(\text{Fisher}=C_V/\gamma^2\) Statistical sensitivity = thermal fluctuation 22 📐 Lean
\(C_V(\gamma{=}1,N)=\dfrac{(\log N)^2}{12}\) Heat capacity at the boundary 22 ✗→✓ Erratum (was /4) · 📐 Lean
\(S_\gamma\sim\log N\) (area law) Attention entropy grows logarithmically 22 📊 Data (56 models)
\(s=(\gamma-1)/2,\ \ \alpha=\gamma-1\) Fractional order / Lévy index from γ 23 ~ Analogy (labelled mapping)
\(\chi=1/|\gamma-1|\) Susceptibility near the γ=1 boundary 21 ✓ Derived
γ=1 ↔︎ Hagedorn temperature The phase boundary (Phase A / Phase B) 21, 39 ~ Analogy (our synthesis)
Imprint: slope \(\nu\approx-1/(2\pi)\) “Footprint” of the training data on γ 15, 38 ✗ Refuted on data (Lean holds; does not reproduce)

42.6 Parts IV–V — Adaptation and use

Formula What it does Ch. Receipt
\(\mathcal{L}_R=-\log\sigma\big(r(x,y_w)-r(x,y_l)\big)\) Reward model (Bradley-Terry) from preferences 27 ✓ Derived
\(\max\ \mathbb{E}[r]-\beta\,\mathrm{KL}(\pi_\theta\|\pi_\text{ref})\) RLHF objective: maximize reward on a KL leash 27 ✓ Derived
\(\mathcal{L}_{\text{DPO}}=-\log\sigma\big(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)}-\beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\big)\) Optimizes preferences without RL or a reward model 27 ✓ Derived
NT-Xent: \(-\log\tfrac{e^{\mathrm{sim}(h,h^+)/\tau}}{\sum_j e^{\mathrm{sim}(h,h_j)/\tau}}\) Contrastive loss (pull positives close, push negatives away) 26, 33 ✓ Derived
CFG: \(\text{logits}_\text{without}+\gamma(\text{logits}_\text{with}-\text{logits}_\text{without})\) Classifier-free guidance: amplifies the prompt’s effect 29 ✓ Derived

42.7 Part VI — Efficiency

Formula What it does Ch. Receipt
\(h=W_0x+\tfrac{\alpha}{r}BAx\) (LoRA) Low-rank delta; \(B{=}0\) at start; fusable 28 📊 Data
\(x\approx S\,(x_q-Z)\) (quantization) Real↔︎integer map: scale \(S\) + zero-point \(Z\) 35 ✓ Derived
\(\big(\varphi(Q)\varphi(K)^\top\big)V=\varphi(Q)\big(\varphi(K)^\top V\big)\) Reassociation → linear attention O(n) 34 ✓ Derived
Attention cost \(=O(n^2 d)\) compute, \(O(n^2)\) memory Why attention is expensive at long context 34 ✓ Derived
Latency \(\approx \text{TTFT}+\text{TPOT}\times n_\text{tokens}\) Decomposes serving latency 36 ✓ Derived

42.8 How to read this table

  • To understand a formula term by term, go to the chapter listed: there each letter is defined, along with its role in the transformer and its analogy.
  • The Receipt column tells you how much to trust it: a 📐 Lean entry is proved algebraically; 📊 Data, checked empirically; ~ Analogy, a useful lens but not a theorem; 🟡 and flag what is unvalidated or refuted —including our own cases—.
  • The Lean identities link to the formal proofs (📐); the 📊 results link to the open data of Paper I (Zenodo). Nothing here asks you to take our word for it: it gives you where to check it.

Next reference (R2): the cookbook —step-by-step recipes for measuring γ, sizing the KV cache, and applying these formulas to your own model—.