24 The fractional transport view
Where we are. The third physical lens, the most exotic in name but intuitive in idea: seeing attention as a transport of information, that is, as a diffusion. If the attention weight decays with distance like a power law, it behaves like a special kind of diffusion (Lévy), and γ can be read as a “fractional order.” As in the previous chapters, we separate what’s ours from what others already did, and we flag what’s interpretation.
24.1 The idea in one sentence
Attention “transports” information between tokens; if its weight falls like d^−γ, that transport is an anomalous (Lévy) diffusion, and γ can be read as the order of that process, with most models in a smoothing regime.
24.2 Key concepts and their role in the transformer
Before we get into the detail, let’s define the terms of this chapter and what each one is for inside a transformer:
- Transport / diffusion. Definition: the process by which something (heat, particles, information) spreads through a medium. In the transformer: the metaphor that attention “moves” information between positions in the sequence.
- Normal (Brownian) diffusion. Definition: spreading via many local steps; the territory grows slowly (∝√t) and the ordinary Laplacian governs it. In the transformer: the reference case (purely local attention) that we compare against.
- Lévy flight (anomalous diffusion). Definition: diffusion with rare, huge jumps whose lengths have a power-law tail, P(jump ℓ) ∝ ℓ^−(1+α), so it spreads faster than normal diffusion. In the transformer: what attention with a d^−γ tail looks like: lots of local mixing plus the occasional “jump” to a distant position.
- Lévy index α. Definition: the exponent of the jump tail, valid for α∈(0,2). In the transformer: it maps to the attention exponent as α = γ−1 (interpretation, not a theorem); outside that range the analogy stops holding.
- Fractional Laplacian (−Δ)^s. Definition: the “fractional-order” version of the diffusion operator that generates Lévy flights. In the transformer: the operator whose jump kernel has the same power-law form as the attention kernel.
- Fractional order s. Definition: a knob between identity, derivative, and integral: s=0 identity, s>0 differentiates (roughness), s<0 integrates (smoothing). In the transformer: it’s read from the measured γ as s = (γ−1)/2 and tells you what attention does.
- Smoothing vs. differentiation regime. Definition: the two sides of the knob: γ<1 → s<0 → smooth (long-range averaging), γ>1 → s>0 → differentiate. In the transformer: most trained models (γ<1) fall on the smoothing side, again with γ=1 as the crossover.
The underlying idea: a third lens, fractional transport, that once again places γ=1 at the center, as the zero order between smoothing and differentiating.
24.3 Normal vs. anomalous diffusion
🧩 Analogy. An ant explores with many small local steps: its territory grows slowly (∝√t). That’s normal diffusion (Brownian, Gaussian). An albatross forages with local loops… but every so often it launches a very long flight to another area; those rare, huge jumps dominate how far it gets. That’s a Lévy flight: anomalous diffusion, faster than normal.
The mathematical difference: normal diffusion uses the ordinary Laplacian; Lévy diffusion uses a fractional Laplacian (−Δ)^s. And Lévy’s signature is jumps with a power-law tail: P(jump of length ℓ) ∝ ℓ^−(1+α), with index α∈(0,2).
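To make the ant/albatross contrast concrete, here is a minimal sketch (for illustration only; the function names, the seed, and the choice α = 1 are assumptions, not something taken from the atlas). It draws Gaussian steps for the ant and power-law-tailed jumps for the albatross, then asks what share of the total distance travelled comes from the single largest step.

```python
import random

def gaussian_steps(n, rng):
    """Signed steps of normal (Brownian-like) diffusion: light Gaussian tails."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def levy_steps(n, alpha, rng, l_min=1.0):
    """Signed jumps with P(|jump| > l) = (l / l_min)^(-alpha),
    i.e. a jump-length density proportional to l^-(1 + alpha)."""
    steps = []
    for _ in range(n):
        u = 1.0 - rng.random()                # uniform in (0, 1], never zero
        length = l_min * u ** (-1.0 / alpha)  # inverse-transform sampling
        sign = 1.0 if rng.random() < 0.5 else -1.0
        steps.append(sign * length)
    return steps

def largest_step_share(steps):
    """Fraction of the total distance travelled that is due to the longest step."""
    lengths = [abs(s) for s in steps]
    return max(lengths) / sum(lengths)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
gauss_share = largest_step_share(gaussian_steps(n, rng))
levy_share = largest_step_share(levy_steps(n, alpha=1.0, rng=rng))

print(f"ant (Gaussian)   largest-step share: {gauss_share:.5f}")
print(f"albatross (Levy) largest-step share: {levy_share:.5f}")
```

For the ant no single step matters; for the albatross one rare flight should account for a visible fraction of the whole trip, which is the sense in which the huge jumps "dominate how far it gets."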
24.4 The connection to attention
Here’s the bridge: an attention kernel A(d) ∝ d^−γ is exactly that kind of power-law-tailed jump kernel. That is, attention “transports” information between positions like a Lévy diffusion: lots of local mixing, but with a tail that lets it “jump” far every so often (like the albatross). Matching exponents:
\[ \alpha = \gamma - 1, \qquad s = \frac{\alpha}{2} = \frac{\gamma-1}{2} \quad (\text{interpretation}) \]
where α is the Lévy index and s the order of the fractional Laplacian.
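A small sketch of what the power-law tail does to the attention budget, assuming a kernel normalized over a finite window of distances 1..max_dist (the window size, the cutoff, and the helper names are illustrative choices, not values from the atlas). It measures how much of the total weight lands beyond a given distance for a few values of γ.

```python
def power_law_kernel(gamma, max_dist):
    """Weights A(d) proportional to d^-gamma for d = 1..max_dist, normalized to sum to 1."""
    raw = [d ** (-gamma) for d in range(1, max_dist + 1)]
    z = sum(raw)
    return [w / z for w in raw]

def far_mass(weights, cutoff):
    """Total weight on distances strictly greater than `cutoff`
    (weights[0] corresponds to distance 1)."""
    return sum(weights[cutoff:])

max_dist, cutoff = 1024, 64
for gamma in (0.5, 1.0, 1.5):
    w = power_law_kernel(gamma, max_dist)
    print(f"gamma={gamma}: weight beyond d={cutoff} is {far_mass(w, cutoff):.3f}")
```

The smaller γ is, the more weight sits in the far tail; a larger γ concentrates the weight on nearby positions.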
24.5 The “fractional order” and the smoothing regime
What’s a fractional order? A knob between the identity and the derivative:
- order 0 = identity (do nothing);
- positive order = differentiate (roughness: amplifies the fine);
- negative order = integrate (smoothing: averages).
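One standard way to see this knob: in Fourier space the fractional Laplacian (−Δ)^s multiplies a component of frequency k by |k|^(2s). The sketch below simply evaluates that multiplier for a slow and a fast component (the helper name `gain` and the two frequencies are illustrative assumptions).

```python
def gain(k, s):
    """Fourier multiplier |k|^(2s) of the fractional Laplacian of order s (k != 0)."""
    return abs(k) ** (2.0 * s)

slow, fast = 2.0, 8.0  # a coarse and a fine component
for s in (-0.5, 0.0, 0.5):
    print(f"s={s:+.1f}: gain(slow)={gain(slow, s):.3f}, gain(fast)={gain(fast, s):.3f}")
```

At order 0 every component passes unchanged; a positive order boosts the fine component relative to the coarse one, and a negative order damps it.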
With s = (γ−1)/2, this traces the regimes we already know:
- γ < 1 → s < 0 → smoothing (integration, averaging).
- γ = 1 → s = 0 → identity (the crossover, γ=1 again).
- γ > 1 → s > 0 → differentiation (the other side).
And, recalling the atlas (Ch. 16), most trained models (γ<1) fall in the smoothing regime: their attention acts as a long-range average.
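The mapping from γ to a regime is short enough to write down. This is a minimal sketch with hypothetical helper names and a small tolerance around the crossover (the tolerance is our assumption; the γ values are illustrative, not measurements from the atlas).

```python
def fractional_order(gamma):
    """Interpretive mapping from the attention exponent to the order s = (gamma - 1) / 2."""
    return (gamma - 1.0) / 2.0

def regime(gamma, tol=1e-9):
    """Name the regime on either side of the gamma = 1 crossover."""
    s = fractional_order(gamma)
    if abs(s) <= tol:
        return "identity (crossover)"
    return "smoothing" if s < 0 else "differentiation"

for gamma in (0.5, 1.0, 1.5):
    print(f"gamma={gamma}: s={fractional_order(gamma):+.2f} -> {regime(gamma)}")
```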
- “Attention is fractional/Lévy” is NOT our novelty. Fractional Neural Attention (FNA) (Qu et al. 2025) already designs attention as a fractional-Laplacian operator with a hand-chosen order α. The framework is taken.
- What’s plausibly ours: reading the measured γ (from the atlas, tied to RoPE) as a fractional order, and placing trained models in the smoothing regime. This is a descriptive claim, not a new operator.
- Caveats: FNA decays by distance in feature space; we do it by position in the sequence. The analogy rests on the structural similarity of power-law kernels; it is not an identity. And α=γ−1, s=(γ−1)/2 are interpretive mappings, not theorems. Also, a real Lévy index requires α∈(0,2), which under this mapping means 1<γ<3: a large γ (γ≥3) would push α≥2, and γ≤1 gives α≤0, both outside the stable range → the analogy holds within a regime, not always.
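The range caveat can be checked mechanically. A minimal sketch, assuming the interpretive mapping α = γ−1 and the open interval (0, 2) for the Lévy index; the helper names and the sample γ values are hypothetical.

```python
def levy_index(gamma):
    """Interpretive mapping alpha = gamma - 1 (not a theorem)."""
    return gamma - 1.0

def in_stable_range(gamma):
    """True when the mapped index falls in the valid Levy range 0 < alpha < 2."""
    alpha = levy_index(gamma)
    return 0.0 < alpha < 2.0

for gamma in (0.5, 1.5, 3.5):
    status = "inside" if in_stable_range(gamma) else "outside"
    print(f"gamma={gamma}: alpha={levy_index(gamma):+.1f} is {status} (0, 2)")
```

Read this as a guard on the Lévy reading specifically; the order s = (γ−1)/2 is still defined for any γ.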
24.6 Why this lens adds something
Three lenses (phases, thermodynamics, fractional) and the same γ=1 at the center: it’s the phase boundary (Ch. 21), the point where Fisher/C_V behave (Ch. 22), and the order 0 that separates smoothing from differentiating (here). That three independent analogies point at the same threshold is precisely what makes the physical lens worth it: not because it’s elegant, but because it’s consistent.
tafagent gives you your model’s γ; from there you can directly read off its “fractional order” (γ−1)/2 and know whether it’s in the smoothing regime (γ<1) or not. Try it with a few models from the atlas: almost all will come out smoothing.
24.7 Summary
- Attention transports information; with A(d) ∝ d^−γ it does so like a Lévy diffusion (power-law-tailed jumps), not Brownian.
- γ reads as a fractional order s = (γ−1)/2: γ<1 = smoothing, γ=1 = identity (crossover), γ>1 = differentiation. Most models smooth.
- Honest: the fractional framework was designed by FNA (not ours); ours is reading the measured γ as an order. α=γ−1 and s=(γ−1)/2 are interpretation; the Lévy analogy holds for α∈(0,2).
- Consistency: the three lenses (phases, thermo, fractional) point at the same γ=1.
Next (Chapter 24): we close Part III with training dynamics and grokking, and here our own pilot paper enters, with its reach and its limits.
24.8 Exercises
- Ant vs albatross. What kind of diffusion is each, and which one resembles power-law-tailed attention?
- The order. If s = (γ−1)/2, what does the attention of a model with γ=0.7 do: smooth or differentiate? And one with γ=1.3?
- Honesty. What part of this lens is ours and what part FNA already did?
- The Lévy caveat. Why does the Lévy analogy “break” for very large γ? (Hint: the valid range of α.)
The fractional/Lévy lens, the ranges of validity, and its connection to γ are open: Predicting How Transformers Attend (Zenodo).