39 Verified, folklore and numerology

Where we are. The chapter most our own —and the book’s ethical heart. With interpretability (Ch. 37) as the yardstick for “what’s real”, here we classify the claims of the field (and our own) into three buckets: verified, folklore and numerology. We don’t preach honesty: we demonstrate it —with receipts (Lean proofs, reproducible data) and by recounting our own errors and refutations.

39.1 The idea in one sentence

A formula can be derived and verified, be folklore (popular but unjustified) or numerology (a number that fits by chance, with no mechanism); this chapter learns to tell them apart —and applies the same yardstick, mercilessly, to our own work.

39.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for:

Verified / derived. Definition: a claim with a receipt —a formal proof (Lean) or reproducible data. In the transformer: what you can use with confidence.
Folklore. Definition: a popular but unjustified belief (or one already contradicted). In the transformer: “everyone says so” isn’t evidence; it usually comes from misreading a tool (e.g. attention maps).
Numerology. Definition: a number or formula that fits the data with no mechanism to explain it. In the transformer: fitting a curve ≠ understanding it; the risk of mistaking coincidence for law.
Erratum. Definition: an error of our own, flagged, corrected and re-proved. In the transformer: recounting it raises the book’s reliability, doesn’t lower it.
Receipt. Definition: the proof backing a claim —a proof in Lean (correct algebra) or a reproducible datum. In the transformer: what separates “we believe it” from “we’ve checked it”.
Instructive refutation. Definition: a hypothesis of ours that the data knocked down, which we teach as a refutation. In the transformer: the scientific method turned into pedagogy.

With this, first others’ myths; then, our own.

39.3 Why a chapter like this

The field is full of phrases that get repeated until they become “truth” without anyone asking for the receipt. And a manual that only recounts its successes isn’t trustworthy. Our stance (Ch. 0): honesty is demonstrated, not proclaimed. That means two uncomfortable commitments: (1) labeling folklore as folklore even when it’s popular, and (2) applying the same yardstick to our formulas, including the ones that failed.

🧩 Analogy — the auditor. It’s like auditing accounts: it’s not enough for them to “roughly add up”. Every line item needs its invoice (the receipt: a proof or a datum). With no invoice, the item is suspect —however round the figure may be.

39.4 Three myths of the field

1. “Softmax gives a probability.” (Folklore.) The softmax produces numbers that sum to 1, but that doesn’t make them calibrated probabilities of anything. In fact, there’s work that attributes the softmax’s success to an implicit regularization (of the Frobenius norm), not to its probabilistic reading (On the (Implicit) Regularization Behind Softmax Attention 2024). Treating attention weights as “probability of relevance” is reading too much into them.

2. “Attention explains the model’s decision.” (Folklore, and dangerous.) It’s the myth we already fought (Chs. 4, 13): a bright attention map over a token doesn’t prove that that token caused the output —information can flow through the residual stream, the values and the MLPs without passing through where attention “looks”. Attention is correlation, not causal explanation (Jain and Wallace 2019; Attention Weights Are Not Explanations 2023). (That’s why Ch. 37 insisted on causal intervention.)

3. “RoPE forces attention to decay with distance.” (Disputed.) It’s tempting —and we ourselves had to be careful not to fall for it (Ch. 14). Round and Round We Go (Barbero et al. 2024) showed that, with real queries/keys, there’s no guaranteed decay: models use the low frequencies to match content almost regardless of position. What we do honestly hold: the geometry bounds the achievable resolution; γ is a measured descriptor, not a decree.

39.5 Our own audit: the same yardstick for ourselves

Here’s the part that sets us apart. We apply the taxonomy to our own claims —with their real status, no makeup:

Table 39.1: Audit of our own formulas and results

Our claim	Status	Receipt
Fisher = C_V / γ²	✅ Verified (Lean)	Formal proof in Lean (📐)
γ_Padé = Cayley transform of (θ, T)	✅ Derived (Lean)	Algebraic identity proved
A(d) ∝ d^−γ (the power-law form)	✅ Derived + data	R²>0.95 across 46 measurements, 30+ models
γ ⊥ sink-mass	✅ Data (within-model)	θ-rescale 256×, sink flat ~0.38
C_V(γ=1) = (logN)²/12	✗→✅ Erratum	Was /4; corrected and re-proved in Lean
KV window D_f	🟡 Partial	Missing benchmark vs heuristics (RULER/LongBench)
γ predicts long-context headroom	🟡 Not validated	avenue-2 crashed (CUDA-OOM); no data
Imprint ν ≈ −1/(2π)	❌ Refuted on data	Proved in Lean, does not reproduce; wide CI
Forcing CKA→1 prevents grokking	❌ Refuted	2/3 forced models grok anyway (20k steps)

Three honest readings of this table:

The C_V erratum. In Paper I we claimed the heat capacity at γ=1 was (logN)²/4. It was a factor-of-3 error: the correct value is (logN)²/12, which two independent derivations arrive at and which we re-proved in Lean (Ch. 22). We recount it because it’s ours: a book that teaches where it went wrong is more reliable, not less.
The imprint ν, refuted by the data. We had an identity for the “sixth axis” (how much the data imprints its mark on γ), with slope ν≈−1/(2π). It’s proved in Lean… but does not reproduce on the data (wide confidence interval, doesn’t converge in Pythia-70M). We mark it as provisional/refuted on data: the algebra can be correct and the empirical claim false at the same time.
Forcing CKA does not cause grokking. We thought the inter-layer CKA re-rise (Ch. 24) might be a causal lever. The experiment knocked it down: 2 of 3 models with forced CKA grok just the same. We teach it as a worked refutation —a predictive signal, not a causal one.

⚠ The distinction that matters most: Lean proves algebra, not reality

A Lean receipt proves that the formulas add up (algebraic consistency), not that they describe a transformer causally. That’s why the imprint ν can be “proved in Lean” and fail on the data: they’re two different things. One thing is “the algebra is correct”; another, “this is what the model does”. Each claim needs its own kind of receipt: identities → Lean; empirical claims → reproducible data; lenses (thermodynamic, fractional) → they’re labeled as analogy, not as theorem.

🧩 Analogy — the blueprint and the building. A proof in Lean is like verifying that the blueprint is well drawn (the measurements are consistent). But a perfect blueprint doesn’t guarantee that the constructed building resembles it: for that you have to go measure it (the data). Mistaking “correct blueprint” for “building is like this” is the error the imprint ν taught us not to make.

39.6 How to use the book’s labeling system

None of this is rhetoric: it’s the badge system that runs through the manual. When you see a 📐 Verified in Lean box, there’s a formal proof behind it; a 📄 Our paper with a Zenodo link, there’s reproducible data; a ⚠ Honest, we’re warning you of a limit or a refutation. The rule for the reader (and for us): ask for the receipt. If a claim —ours or others’— comes with no proof, no datum and no mechanism, treat it as folklore until one shows up.

🧪 Try it — tafagent

tafagent includes a falsification panel (“anti-bullshit”): it doesn’t just compute γ, it puts the claims to the test on your model —it compares predicted γ_Padé vs measured γ, flags when the fit is poor (low R²), and points out when a result doesn’t hold up. It’s the same philosophy as this chapter, turned into a tool: it doesn’t ask you to believe us; it lets you check it.

39.7 Summary

Three buckets: verified/derived (with receipt), folklore (popular without justification) and numerology (fits with no mechanism) —plus the erratum (our own error, corrected).
Field myths: “softmax = probability” (folklore (On the (Implicit) Regularization Behind Softmax Attention 2024)), “attention = explanation” (dangerous folklore (Jain and Wallace 2019)), “RoPE forces decay” (disputed (Barbero et al. 2024)).
Our audit (the same yardstick): ✅ Fisher=C_V and γ_Padé (Lean), A(d)∝d^−γ and γ⊥sink (data); ✗→✅ erratum C_V /4→/12; 🟡 D_f and headroom (not validated); ❌ imprint ν (Lean yes, data no) and forcing CKA (refuted).
The key distinction: Lean proves algebra, not reality; each claim needs its receipt (proof / datum / “it’s an analogy”).
The reader’s rule: ask for the receipt; with no proof, datum or mechanism → folklore.

Next (Chapter 39): with the “what’s real” yardstick sharpened, we draw the map of the 2026 collapse landscape —the four frameworks (sinks, temperature, covariance-grokking, fractional), how they relate and what each leaves open.

39.8 Exercises

The three buckets. Define verified, folklore and numerology, and give an example of each (it can be the field’s or ours).
The receipt. What kind of receipt does a mathematical identity need versus an empirical claim? Why doesn’t the same one work for both?
Lean ≠ reality. Use the imprint ν case to explain how something can be “proved in Lean” and false on the data.
Erratum. Why does recounting the C_V error (factor of 3) make the book more reliable?
Myth. Take “attention explains the model’s decision” and argue, with Ch. 37’s material, why it’s folklore.
Apply it. You come across a new formula in a paper with a number that fits very well. What three questions do you ask before believing it?

References

Attention Weights Are Not Explanations: Residual, MLP and Value Bypasses. 2023. https://arxiv.org/abs/2312.01429.

Barbero, Federico, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. 2024. Round and Round We Go! What Makes Rotary Positional Encodings Useful? https://arxiv.org/abs/2410.06205.

Jain, Sarthak, and Byron C. Wallace. 2019. “Attention Is Not Explanation.” arXiv Preprint arXiv:1902.10186.

On the (Implicit) Regularization Behind Softmax Attention. 2024. https://arxiv.org/abs/2410.18613.