23  The thermodynamic dictionary

Where we are. Still in the physical lens (Part III). If γ=1 was the boundary (Ch. 21), here’s the dictionary that translates attention quantities into thermodynamics —temperature, heat capacity, Fisher information—. And, unlike the previous chapter, here there’s a piece that is formally proven (in Lean): the identity Fisher = C_V. We also tell, with no makeup, an error of ours that we corrected and re-proved.

23.1 The idea in one sentence

The attention distribution has a thermodynamic “dictionary” (temperature ↔︎ 1/γ, fluctuations ↔︎ heat capacity), and one of its identities —Fisher = C_V— is proven in Lean; another —the coefficient at γ=1— we got wrong and corrected.

23.2 Key concepts and their role in the transformer

Before we get into the detail, let’s define the terms of this chapter and what each one is for inside a transformer:

  • Thermodynamic dictionary. Definition: the table of correspondences that translates quantities of the attention distribution into thermal quantities. In the transformer: it gives physical intuition about attention —but it’s a lens, not literal physics—.
  • Effective temperature T. Definition: the thermal analog of the inverse exponent, T=1/γ. In the transformer: it measures how “hot” (spread) or “cold” (concentrated) the attention distribution is.
  • Heat capacity C_V. Definition: how much energy it costs to raise the temperature; equal to the size of the energy fluctuations (Var(E)/T²). In the transformer: it measures the sensitivity of the attention distribution to a change of the lever γ.
  • Fisher information. Definition: how abruptly a distribution changes when you move a parameter —high Fisher = a parameter easy to “pin down” from the data—. In the transformer: it quantifies how much the data give away the value of γ.
  • The Fisher = C_V identity. Definition: the result that statistical sensitivity and thermal fluctuation are the same thing (up to a factor γ²). In the transformer: it’s the piece that is formally proven in Lean for our distance distribution.
  • Entropy S and the area law. Definition: the entropy of the attention distribution, which grows logarithmically with length (S_γ ~ log N). In the transformer: it describes how the attention’s uncertainty is spread as the context grows.
  • Verification in Lean. Definition: a proof in the Lean proof assistant (Mathlib) that the algebra of an identity is correct. In the transformer: it’s a consistency “receipt” —it proves the formulas add up, not that they causally describe the model—.
  • Erratum. Definition: a published error that gets flagged, corrected, and re-proved. In the transformer: here, the C_V coefficient at γ=1 that we got wrong by a factor of 3 and rectified —honesty is demonstrated, not proclaimed—.

The underlying idea: move from the analogy to the exact identities, separating what’s only metaphor from what’s proven.

23.3 The dictionary (labeled analogy)

Since the softmax is Boltzmann (Ch. 4, 21), every statistical quantity of the attention distribution has a thermal analog:

Table 23.1: The attention ↔︎ thermodynamics dictionary
Attention Thermodynamics
1/γ effective temperature T
−log A(d) energy of the state at distance d
Z = Li_γ(e^−λ) partition function
attention entropy entropy S
sensitivity of the distribution heat capacity C_V

We insist (Ch. 21): it’s a lens. What’s valuable is that some of these correspondences aren’t just metaphor but exact, provable identities. Let’s look at the two most important.

23.4 Fisher = C_V (this one IS proven, in Lean)

Two concepts, in plain terms:

  • Fisher information: measures how abruptly a distribution changes when you move a parameter. High Fisher = the parameter is easy to “pin down” from the data.
  • Heat capacity C_V: how much energy it costs to raise the temperature; equal to the size of the energy fluctuations (Var(E)/T²).

The identity Fisher = C_V (up to a factor γ²) says something both profound and intuitive: the distribution’s statistical sensitivity and its thermal fluctuation are the same thing. It’s a classic result of statistical mechanics (the Fisher metric is the second derivative of the free energy), and for our distance distribution we’ve proven it formally:

Tip📐 Verified in Lean

The identity Fisher = C_V/γ² (zero residual) is proven in the Lean proof assistant (Mathlib), not just checked numerically. (The factor 1/γ² is no mystery: it appears when changing from the temperature T=1/γ to the variable γ —it’s the chain rule—.) It’s a receipt: the algebra is correct beyond any doubt. (Honest caveat: Lean proves the algebraic consistency of the identity —that the formulas add up—, not that it causally* describes transformers. One thing is “the algebra is correct”; another is “this is what the model does”.)*

23.5 The area law: entropy grows like log N

Another correspondence with empirical support: the entropy of the attention distribution grows logarithmically with length, S_γ ~ log N (measured across 56 models). In physics, “area laws” relate entropy to the boundary of a region; here, the log N form is consistent and citable.

23.6 Our own erratum (and why we tell it)

Here’s the chapter’s honesty showcase. In Paper I (Marín 2026) (§5.2) we claimed the heat capacity at the point γ=1 was:

\[ C_V(\gamma=1, N) \;=\; \frac{(\log N)^2}{4} \quad \textbf{[INCORRECT]} \]

(where N is the context length; the result says C_V grows like (log N)².)

It was an error by a factor of 3. The correct value, reached by two independent derivations (and which we verified in Lean), is:

\[ C_V(\gamma=1, N) \;\to\; \frac{(\log N)^2}{12} \quad \textbf{[correct]} \]

Caution✗ → ✓ Why we publish our error

We don’t hide this: we flag it as an erratum, correct it, and prove it. It’s exactly the book’s stance —honesty is demonstrated (with a Lean receipt), not proclaimed—. A manual that only tells its successes isn’t trustworthy; one that shows where it went wrong and how it fixed it, is. That the error was in our own paper makes it more valuable to tell, not less.

23.7 Summary

  • Attention has a thermodynamic dictionary (T=1/γ, Z, C_V, S) —a lens, not literal physics—.
  • Fisher = C_V/γ² is proven in Lean (📐): statistical sensitivity = thermal fluctuation. (Lean proves the algebra, not causality over transformers.)
  • Area law: entropy grows like log N (56 models).
  • Our own erratum (honest): C_V(γ=1) was (logN)²/4 → it’s (logN)²/12 (factor of 3), corrected and verified in Lean. We tell it because honesty is demonstrated.

Next (Chapter 23): another lens on the same phenomenon —attention as fractional transport (Lévy diffusion), with integration order (γ−1)/2—.

23.8 Exercises

  1. Fisher and C_V. Explain in one sentence what each one measures and why it makes sense that they’re “the same thing.”
  2. The Lean receipt. What exactly does a Lean proof of Fisher=C_V demonstrate, and what does it not demonstrate?
  3. The erratum. The value went from (logN)²/4 to (logN)²/12. By what factor was the error off?
  4. Honesty. Why does telling an error of our own make the manual more trustworthy, not less?
Tip📄 Our paper — data and details

The Fisher = C_V identity, the erratum of the correction, and the links to the Lean proofs are open: Predicting How Transformers Attend (Zenodo).

References

Marín, Carles. 2026. Predicting How Transformers Attend: Analytic Power-Law Theory, Phase Transitions, and Practical Compression Tools. https://zenodo.org/records/20314038.