40 The map of the 2026 collapse landscape

Where we are. The penultimate chapter of Part VII. Throughout the book we’ve seen, separately, the sinks (Ch. 17), the thermodynamic lens (Ch. 22), the fractional one (Ch. 23) and grokking (Ch. 24). As of 2026, these are four frameworks that try to explain how attention “collapses” or concentrates. Here we put them on a single map: what each explains, how they relate, and, using Ch. 38’s yardstick, where our γ genuinely connects them and where it’s our own speculation.

40.1 The idea in one sentence

There are four distinct lenses on attention concentration in 2026, each with its own “thermometer”; they are not four views of one proven phenomenon, and γ serves to situate a model relative to three of them, but calling it “the coordinate that unifies them” would be overclaiming.

40.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for:

Framework / lens. Definition: a theory with its own order parameter for describing attention concentration. In the transformer: each one measures a different object (tokens, energy, geometry, kernel).
Sinks. Definition: the attention mass that accumulates on a few low-information tokens. In the transformer: object = concentration on specific tokens (BOS).
Temperature / thermodynamics. Definition: reading the softmax as Boltzmann, with free energy and an effective temperature. In the transformer: object = global sharpness of the softmax + training dynamics.
Covariance / grokking. Definition: a parameter of the representational geometry that changes before generalization. In the transformer: object = representational geometry.
Fractional / Lévy. Definition: attention as anomalous diffusion with a fractional order α. In the transformer: object = a designed diffusion kernel.
Order parameter. Definition: the quantity that “betrays” a phase transition. In the transformer: each framework proposes its own, and that’s where the disagreement lies.
Measured descriptor vs designed operator. Definition: something measured in real models vs something built by hand. In the transformer: confusing the two is the error that runs through the whole field.

With this, let’s go through the four frameworks.

40.3 The four frameworks

1. Sinks / concentration. (The most solid empirically.) Attention mass collapses onto a few low-information tokens (typically the first). Several pieces fit together: the sink emerges in pretraining, tied to the softmax having to sum to 1 (Gu et al. 2025); its correlate is the massive activations, a few enormous coordinates of the residual stream (Sun et al. 2024); and it has been proved that the sink and the “compression valley” are the same phenomenon (Queipo-de-Llano et al. 2025). Its engineering payoff: keeping the sink token gives stable streaming (StreamingLLM; Xiao et al. 2024). (The “gravitational” reading of Zhang (2026) is more analogy than mechanism.) Explains: why models need sinks; stability at long context. Leaves open: the causal origin (softmax normalization vs massive activations) and the secondary sinks (On the Discrepancy of Secondary Attention Sinks in Large-Theta Models 2025).
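To make the “sink mass” order parameter concrete, here is a minimal sketch in Python with NumPy. It assumes that sink mass means the average fraction of attention that queries place on a chosen set of key positions (position 0 by default); the function names `sink_mass` and `causal_softmax`, and the toy boost of +4 on the first key, are our own illustrative choices, not definitions taken from the cited papers.

```python
import numpy as np

def sink_mass(attn, sink_positions=(0,)):
    """Mean fraction of attention that queries place on the given key positions.

    attn: array of shape (n_queries, n_keys); each row sums to 1.
    """
    attn = np.asarray(attn, dtype=float)
    return float(attn[:, list(sink_positions)].sum(axis=1).mean())

def causal_softmax(scores):
    """Row-wise softmax where query i may only attend to keys 0..i."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 64
scores = rng.normal(size=(n, n))          # toy scores, not from a real model
plain = causal_softmax(scores)

boosted_scores = scores.copy()
boosted_scores[:, 0] += 4.0               # toy sink: every query favors key 0
boosted = causal_softmax(boosted_scores)

print("sink mass, plain:  ", sink_mass(plain))
print("sink mass, boosted:", sink_mass(boosted))
```

On a real model the same function would be applied per head and per layer to the attention weights; the toy boost only shows that the statistic reacts to concentration on a specific token, which is the object this lens measures.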

2. Temperature / thermodynamics. Reads the softmax as a Boltzmann distribution: Kim (2026) proves that attention minimizes a free energy of Boltzmann form and defines an effective temperature, a partition function and fluctuation peaks that precede generalization. Explains: it supplies the vocabulary of phase transitions (Hagedorn-like). Honest caveat (derivation vs analogy): that the softmax minimizes the free energy is a derivation, but that “the transformer is a thermodynamic system with heat capacity and phases” is mostly analogy or isomorphism; the paper’s own word is isomorphism.
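The “derivation” half of that caveat can be checked numerically. The sketch below uses the textbook Boltzmann reading, which we assume here rather than take from Kim’s paper: energies are the negated attention scores, E_i = −s_i, and the free energy of a distribution p is F(p) = ⟨E⟩_p − T·H(p). Under those assumptions the softmax at temperature T should attain F = −T log Z and no other distribution should do better. The names `softmax` and `free_energy` are ours.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()                       # numerical stability
    w = np.exp(z)
    return w / w.sum()

def free_energy(p, scores, temperature=1.0):
    """F(p) = <E>_p - T * H(p), with energies E_i = -scores_i (assumed mapping)."""
    p = np.asarray(p, dtype=float)
    energy = -np.dot(p, scores)
    nz = p > 0                            # treat 0 * log(0) as 0
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return float(energy - temperature * entropy)

rng = np.random.default_rng(0)
scores = rng.normal(size=8)               # toy scores for one query
T = 0.7

p_star = softmax(scores, T)
f_star = free_energy(p_star, scores, T)

# Closed form: F(p*) = -T log Z, with Z = sum_i exp(s_i / T).
m = scores.max()
log_z = m / T + np.log(np.sum(np.exp((scores - m) / T)))
f_closed = -T * log_z

# Random competitor distributions on the simplex should never beat the softmax.
competitors = rng.dirichlet(np.ones(scores.size), size=1000)
f_others = np.array([free_energy(p, scores, T) for p in competitors])

print("F(softmax)      :", f_star)
print("-T log Z        :", f_closed)
print("best competitor :", f_others.min())
```

The gap F(p) − F(p*) equals T times the KL divergence from p to the softmax, which is why it is never negative. Nothing in this check says anything about heat capacity or phases of a trained model; that part remains the analogy.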

3. Covariance / grokking. A parameter of the representational geometry changes before the model generalizes: the spectral entropy collapse of the covariance (Khanh et al. 2026), the dimensional transition (Wang 2026) or the commutator defect (Xu 2026). Explains: when grokking will occur, as an early-warning signal. Leaves open: the mechanism, and which parameter is the fundamental one. Spectral entropy, effective dimension, commutator defect and Kim’s energy fluctuation are four competing early signals, with no winner. It’s the clearest evidence that the field has not converged.
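Two of those competing signals are easy to compute side by side. The sketch below uses common definitions that we assume for illustration, and the cited papers may define theirs differently: spectral entropy as the Shannon entropy of the normalized covariance eigenvalues, and effective dimension as the participation ratio (Σλ)² / Σλ². The synthetic “collapsed” data, with variance concentrated in two directions, is a toy stand-in for representations late in training.

```python
import numpy as np

def covariance_spectrum(reps):
    """Eigenvalues of the sample covariance of reps (n_samples, n_features)."""
    reps = np.asarray(reps, dtype=float)
    centered = reps - reps.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (reps.shape[0] - 1)
    eig = np.linalg.eigvalsh(cov)
    return np.clip(eig, 0.0, None)        # remove tiny negative round-off

def spectral_entropy(eig):
    """Shannon entropy of the normalized eigenvalue distribution."""
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def participation_ratio(eig):
    """Effective dimension (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

rng = np.random.default_rng(0)
n, d = 2000, 32

isotropic = rng.normal(size=(n, d))       # variance spread over all directions
scales = np.full(d, 0.05)
scales[:2] = 1.0                          # toy collapse onto 2 directions
collapsed = rng.normal(size=(n, d)) * scales

for name, reps in [("isotropic", isotropic), ("collapsed", collapsed)]:
    eig = covariance_spectrum(reps)
    print(name, "entropy:", spectral_entropy(eig),
          "effective dim:", participation_ratio(eig))
```

Both statistics drop together on this toy data, which illustrates why it is hard to say which one is fundamental: on a clean collapse they carry nearly the same information.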

4. Fractional / Lévy. Models the interaction between tokens as Lévy diffusion, with a fractional order α that controls the multiscale reach (Fractional Neural Attention, FNA; Qu et al. 2025). Explains: multiscale and long-range interaction in a single tunable operator. Leaves open (the honest key): FNA is a designed operator (α is set by hand), not a descriptor measured in existing models; mapping it onto observed attention requires assuming that the observed decay follows that kernel.

40.4 How do they relate? (and the uncomfortable question about γ)

The honest part first: they are not four views of one single, proven phenomenon. Each framework has its own order parameter and measures a different object:

Table 40.1: Each framework measures something different

Framework	Its order parameter	Object it measures
Sinks	sink mass / activation norm	concentration on specific tokens
Temperature	effective temperature / free energy	global sharpness + dynamics
Covariance	spectral entropy / effective dim.	geometry of representations
Fractional	fractional order α	designed diffusion kernel

Real overlaps (independent of our work): (a) sink = massive activation = compression valley is a proved identity (Queipo-de-Llano et al. 2025); (b) Kim’s energy fluctuation and the grokking parameters point to the same event (the onset of generalization). The field acknowledges these bridges.

40.4.1 Our γ: a connecting thread, but with surgical honesty

Is γ “the coordinate that unifies the four”? No, and saying so would be exactly the overclaim that Ch. 38 forbids. Let’s evaluate it bridge by bridge:

γ ↔︎ fractional (order (γ−1)/2): the strongest bridge, but it’s descriptor↔︎design. If the measured γ matches the tail of the fractional kernel, the algebra adds up. Legitimate as a cross-walk, but α is designed and γ is measured → “our descriptor matches their knob”, not an identity that the fractional field claims.
γ=1 ↔︎ Hagedorn boundary (temperature): analogy, our framing. Kim does not identify a static boundary at γ=1; his criticality is in the energy fluctuation during training. Equating the decay value γ=1 with a Hagedorn temperature is our interpretive layer. We mark it as speculation.
γ ⊥ sink-mass: here γ does NOT unify, it separates, and that’s the honest part. Our clean result (Ch. 17) says that sinks are a separate axis. That is consistent with the sinks field, where sinks live in the massive activations, a different object. We present it as “γ shows that sinks are orthogonal”, not “γ unifies the sinks”.
γ-rerise / CKA ↔︎ grokking: the weakest bridge, analogy/correlation. The covariance field’s parameters are representational; γ is a statistic of attention weights. A dynamic correlation is plausible, but it has no mechanism and competes with the covariance field’s own early signals, which are themselves unsettled.
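The first bridge, descriptor ↔︎ design, can be sketched as a round trip. The code below rests on two assumptions that we state loudly: that γ is the exponent of a power-law decay of attention with token distance r, and that the “order” (γ−1)/2 refers to the exponent s of a one-dimensional Lévy-type kernel whose tail goes as r^−(1+2s). We design a kernel with a known s, then measure γ from the resulting attention by a log-log fit and convert back. The function names and the choice of fitting range are hypothetical; this is not FNA’s actual operator.

```python
import numpy as np

def designed_power_law_attention(n, s):
    """Causal attention whose weights decay as r**-(1 + 2*s) in token distance r."""
    gamma = 1.0 + 2.0 * s
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]    # query index minus key index
    safe = np.maximum(dist, 1)            # distance 0 gets the weight of distance 1
    weights = np.where(dist >= 0, safe.astype(float) ** -gamma, 0.0)
    return weights / weights.sum(axis=1, keepdims=True)

def measure_gamma(attn, r_min=1, r_max=64):
    """Fit the decay exponent of mean attention vs distance on a log-log scale."""
    n = attn.shape[0]
    rows = np.arange(r_max, n)            # only queries that see every fitted distance
    dists = np.arange(r_min, r_max + 1)
    mean_w = np.array([attn[rows, rows - r].mean() for r in dists])
    slope, _ = np.polyfit(np.log(dists), np.log(mean_w), 1)
    return float(-slope)

s_designed = 0.25                         # the hand-set "knob"
attn = designed_power_law_attention(256, s_designed)

gamma_hat = measure_gamma(attn)           # the measured descriptor
s_implied = (gamma_hat - 1.0) / 2.0       # cross-walk back to the knob

print("designed s:", s_designed)
print("measured gamma:", gamma_hat)
print("implied order (gamma - 1) / 2:", s_implied)
```

The round trip closes here only because the attention was built from that kernel in the first place. On a trained model the same fit still returns a number, but reading it as a fractional order requires the extra assumption that the observed decay is that kernel, which is exactly why this bridge is a cross-walk and not an identity.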

⚠ The defensible claim (and the one that isn’t)

Overclaim: “γ is the coordinate that unifies the four frameworks.” Defensible: “γ is a measurable exponent that lets you situate a model relative to three of the lenses (fractional directly; temperature as analogy; grokking as correlation) and that shows the sinks are a fourth, orthogonal lens.” We mark γ=1↔︎Hagedorn and γ↔︎grokking explicitly as our synthesis/speculation, not as facts of the field.

🧩 Analogy — four maps of the same city. The four frameworks are like a street map, a temperature map, an elevation map and a transit map of the same city: they overlap at some points, but they measure different things. γ is like giving the GPS coordinates: it locates you on all of them, but it’s none of the four maps nor does it merge them into one.

40.5 What the landscape leaves open (2026, honest)

There’s no agreed first-principles theory. Almost everything is descriptive or analogical (thermo = “isomorphism”; gravitational = analogy; fractional = designed operator). The only rigorous local result (softmax = free-energy minimum) does not scale to a theory of the trained model.
The causal origin of sinks remains unresolved, and recent work even separates the massive activations from the sinks (The Spike, the Sparse and the Sink 2026), on top of the unexplained secondary sinks.
There’s no agreed grokking order parameter: four early signals compete.
Reproducibility and cross-model gaps: many 2026 results come from a single task (modular addition) or one model family. (Consistent with our own audit: a flagship claim, the imprint ν, did not reproduce on data, Ch. 38.)
The descriptor-measured vs operator-designed confusion runs through the whole field. γ’s value is that it is measured; but for that very reason it cannot, alone, give the causal mechanism.

🧪 Try it — tafagent

tafagent embodies γ’s honest role in this map: it gives you the measured γ and your model’s regime (that is, it situates you relative to the decay lenses) and, separately, the sink mass (η-regime), showing that it’s a different axis. It doesn’t sell you a “theory of everything”; it gives you the coordinates to place yourself in a landscape that has not yet converged.

40.6 Summary

Four frameworks, four different order parameters: sinks (concentration on tokens), temperature (free energy/softmax), covariance (geometry → grokking), fractional (designed kernel). They are not four views of one proven phenomenon.
Real field overlaps: sink = massive activation = compression valley (Queipo-de-Llano et al. 2025); temperature↔︎grokking point to the same event.
Our γ, honestly: strong connector with the fractional (order (γ−1)/2); our own analogy with Hagedorn (γ=1); separator of the sinks (γ⊥sink, a finding); weak correlation with grokking. Calling it “the unifying coordinate” is overclaiming; “an exponent that situates you relative to three lenses and separates the fourth” is the defensible part.
Open: no first-principles theory; causal origin of sinks; grokking parameter without consensus; cross-model reproducibility; descriptor-measured ≠ operator-designed.

Next (Chapter 41): we close Part VII, and the body of the book, with what no theory exempts: ethics, safety and limits (biases, hallucination, responsible use) and, honestly, what we still DON’T know.

40.7 Exercises

Four objects. Say which object each framework measures (sinks, temperature, covariance, fractional). Why does that imply they aren’t “the same thing”?
Real overlap. What two things did the field prove are the same phenomenon?
Honest γ. Classify γ’s four bridges (fractional, Hagedorn, sinks, grokking) into “strong connector”, “our own analogy”, “separator” and “weak correlation”.
Overclaim. Rewrite “γ unifies the four frameworks” in the defensible version.
The open questions. Cite two things the 2026 landscape leaves unresolved and why they matter.
Descriptor vs design. Explain the difference between FNA’s designed α and the measured γ; why does it matter for claiming a mechanism?

References

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks. 2026. https://arxiv.org/abs/2603.05498.

Gu, Xiangming et al. 2025. “When Attention Sink Emerges in Language Models: An Empirical View.” ICLR. https://arxiv.org/abs/2410.10781.

Khanh, Hoa, Trung, and Duc. 2026. Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation. https://arxiv.org/abs/2604.13123.

Kim, Gunn. 2026. Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics. https://arxiv.org/abs/2602.08216.

On the Discrepancy of Secondary Attention Sinks in Large-Theta Models. 2025. https://arxiv.org/abs/2512.22213.

Qu, Xiao, Cheng Ly, and Pulin Gong. 2025. Fractional Neural Attention for Efficient Multiscale Sequence Processing. https://arxiv.org/abs/2511.10208.

Queipo-de-Llano et al. 2025. Attention Sinks and Compression Valleys Are Two Sides of the Same Coin. https://arxiv.org/abs/2510.06477.

Sun, Mingjie, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. 2024. Massive Activations in Large Language Models. https://arxiv.org/abs/2402.17762.

Wang, P. 2026. Grokking as a Dimensional Phase Transition in Neural Networks. https://arxiv.org/abs/2604.04655.

Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. “Efficient Streaming Language Models with Attention Sinks.” ICLR. https://arxiv.org/abs/2309.17453.

Xu, Yongzhong. 2026. Early-Warning Signals of Grokking via Loss-Landscape Geometry. https://arxiv.org/abs/2602.16967.

Zhang. 2026. Attention’s Gravitational Field: A Power-Law Interpretation of Positional Correlation. https://arxiv.org/abs/2603.04805.