42 R1 · Master formulary

What it is. The lookup table for every important formula in the book, each one with what it does, where it is explained in depth, and —what sets us apart— its receipt: whether it is proved in Lean, backed by data, derived, a labelled analogy, or was corrected (erratum). No narrative here: this is for looking up and checking.

42.1 Receipt legend

Mark	Meaning
📐 Lean	Identity formally proved in Lean (algebraic consistency, not causality over the model)
📊 Data	Backed by reproducible data (atlas, experiments)
✓ Derived	Derived from first principles or a standard definition
~ Analogy	A useful analogy/lens, labelled as such (not a theorem)
✗→✓ Erratum	Had an error, now corrected (and, where applicable, re-proved)
🟡 Provisional	Claim not fully validated / awaiting replication

(The philosophy behind this column is Ch. 38: honesty is demonstrated with receipts.)

42.2 Part I — Foundations

Formula	What it does	Ch.	Receipt
\(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\big(\tfrac{QK^\top}{\sqrt{d_k}}\big)V\)	Weighted mix: each token gathers info from the rest by affinity	4	✓ Derived
\(1/\sqrt{d_k}\) (scaling)	Keeps dot products from growing with \(d_k\) and saturating the softmax	4	✓ Derived
\(\mathrm{MHA}=\mathrm{Concat}(\text{heads})\,W^O\)	Several heads in parallel + output projection	5	✓ Derived
\(\mathrm{FFN}(x)=\sigma(xW_1+b_1)W_2+b_2\)	Processes each token on its own (key-value memory)	6	✓ Derived
SwiGLU: \((\mathrm{Swish}(xW)\odot xV)W_2\)	Gated variant of the FFN (the modern standard)	6	📊 Data
\(\mathrm{LayerNorm}(x)=\gamma\tfrac{x-\mu}{\sigma}+\beta\)	Rescales the activation → stability	7	✓ Derived
\(\mathrm{RMSNorm}(x)=\gamma\,\tfrac{x}{\sqrt{\overline{x^2}}}\)	Like LayerNorm without centering (cheaper)	7	✓ Derived
\(\omega_i=\theta^{-2i/d}\) (RoPE frequencies)	Each pair of dimensions rotates at its own speed	8	✓ Derived

42.3 Part I — Training and inference

Formula	What it does	Ch.	Receipt
\(\mathcal{L}=-\log p_\text{model}(\text{correct token})\)	Cross-entropy: the training objective	11	✓ Derived
\(\text{perplexity}=e^{\mathcal{L}}\)	“How many tokens it hesitates among” (readable)	11	✓ Derived
\(C\approx 6ND\)	Compute cost ≈ 6 × parameters × tokens	25	✓ Derived
\(L(N,D)=E+\tfrac{A}{N^\alpha}+\tfrac{B}{D^\beta}\)	Scaling law: loss vs size and data	25	📊 Data
\(\mathrm{softmax}(z/\tau)\) (temperature)	\(\tau\) controls how “adventurous” the sampling is	12, 29	✓ Derived

42.4 Parts II–III — Our theory: attention over distance [EDGE]

Formula	What it does	Ch.	Receipt
\(A(d)\propto d^{-\gamma}\)	The decay law: mean attention vs distance	15	📊 Data (R²>0.95, 30+ models)
\(\gamma_{\text{Padé}}=\dfrac{2\theta-T\sqrt2}{2\theta+T\sqrt2}\)	Predicts γ from geometry (θ, T), without training	15	📐 Lean (Cayley) · 📊 (median ~22% error, Phase A)
\(\lambda_i=2\pi\,\theta^{2i/d}\)	Wavelength of each RoPE pair (aliasing)	14	✓ Derived
\(n_{\text{active}}(d)=\tfrac{d_\text{head}}{2}\big(1-\log_\theta\tfrac{d}{2\pi}\big)\)	How many pairs keep positional signal at distance \(d\)	14	✓ Derived
\(T_{\text{cross}}=2\pi\sqrt\theta,\ \ T_{\text{max}}=2\pi\theta\)	The two scales: starts to degrade / runs out	14	✓ Derived
\(\gamma_{\text{obs}}=\gamma_{\text{geom}}+\gamma_{\text{train}}+\gamma_{\text{arch}}+\varepsilon\)	Decomposes measured γ into its sources	15	📊 Data · 🟡 (imprint axis provisional)
\(D_f\sim\varepsilon^{-1/(\gamma-1)}\)	KV window needed, derived from γ (γ>1)	20	🟡 Provisional (no benchmark yet)

42.5 Part III — Physical lens [EDGE]

Formula	What it does	Ch.	Receipt
\(Z=\mathrm{Li}_\gamma(e^{-\lambda})\)	Partition function (polylogarithm) of attention	21	✓ Derived
\(\text{Fisher}=C_V/\gamma^2\)	Statistical sensitivity = thermal fluctuation	22	📐 Lean
\(C_V(\gamma{=}1,N)=\dfrac{(\log N)^2}{12}\)	Heat capacity at the boundary	22	✗→✓ Erratum (was /4) · 📐 Lean
\(S_\gamma\sim\log N\) (area law)	Attention entropy grows logarithmically	22	📊 Data (56 models)
\(s=(\gamma-1)/2,\ \ \alpha=\gamma-1\)	Fractional order / Lévy index from γ	23	~ Analogy (labelled mapping)
\(\chi=1/\|\gamma-1\|\)	Susceptibility near the γ=1 boundary	21	✓ Derived
γ=1 ↔︎ Hagedorn temperature	The phase boundary (Phase A / Phase B)	21, 39	~ Analogy (our synthesis)
Imprint: slope \(\nu\approx-1/(2\pi)\)	“Footprint” of the training data on γ	15, 38	✗ Refuted on data (Lean holds; does not reproduce)

42.6 Parts IV–V — Adaptation and use

Formula	What it does	Ch.	Receipt
\(\mathcal{L}_R=-\log\sigma\big(r(x,y_w)-r(x,y_l)\big)\)	Reward model (Bradley-Terry) from preferences	27	✓ Derived
\(\max\ \mathbb{E}[r]-\beta\,\mathrm{KL}(\pi_\theta\\|\pi_\text{ref})\)	RLHF objective: maximize reward on a KL leash	27	✓ Derived
\(\mathcal{L}_{\text{DPO}}=-\log\sigma\big(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)}-\beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\big)\)	Optimizes preferences without RL or a reward model	27	✓ Derived
NT-Xent: \(-\log\tfrac{e^{\mathrm{sim}(h,h^+)/\tau}}{\sum_j e^{\mathrm{sim}(h,h_j)/\tau}}\)	Contrastive loss (pull positives close, push negatives away)	26, 33	✓ Derived
CFG: \(\text{logits}_\text{without}+\gamma(\text{logits}_\text{with}-\text{logits}_\text{without})\)	Classifier-free guidance: amplifies the prompt’s effect	29	✓ Derived

42.7 Part VI — Efficiency

Formula	What it does	Ch.	Receipt
\(h=W_0x+\tfrac{\alpha}{r}BAx\) (LoRA)	Low-rank delta; \(B{=}0\) at start; fusable	28	📊 Data
\(x\approx S\,(x_q-Z)\) (quantization)	Real↔︎integer map: scale \(S\) + zero-point \(Z\)	35	✓ Derived
\(\big(\varphi(Q)\varphi(K)^\top\big)V=\varphi(Q)\big(\varphi(K)^\top V\big)\)	Reassociation → linear attention O(n)	34	✓ Derived
Attention cost \(=O(n^2 d)\) compute, \(O(n^2)\) memory	Why attention is expensive at long context	34	✓ Derived
Latency \(\approx \text{TTFT}+\text{TPOT}\times n_\text{tokens}\)	Decomposes serving latency	36	✓ Derived

42.8 How to read this table

To understand a formula term by term, go to the chapter listed: there each letter is defined, along with its role in the transformer and its analogy.
The Receipt column tells you how much to trust it: a 📐 Lean entry is proved algebraically; 📊 Data, checked empirically; ~ Analogy, a useful lens but not a theorem; 🟡 and ✗ flag what is unvalidated or refuted —including our own cases—.
The Lean identities link to the formal proofs (📐); the 📊 results link to the open data of Paper I (Zenodo). Nothing here asks you to take our word for it: it gives you where to check it.

Next reference (R2): the cookbook —step-by-step recipes for measuring γ, sizing the KV cache, and applying these formulas to your own model—.