45  R4 · Measuring your own γ (and reproducing the atlas)

What it is. The reproducible procedure for measuring the decay exponent \(\gamma\) of any model, the quality controls that make it trustworthy, and where the open data lives so you can reproduce (or refute) our atlas. The book’s idea: we don’t ask you to take our word for it; we give you the means to check it.

45.1 The procedure, step by step

  1. Collect real attention. Run a representative batch of texts through the model and save the attention weights \(A_{ij}\) per head and layer. (In 🤗 Transformers: output_attentions=True.)
  2. Collapse by distance. For each distance \(d=|i-j|\), average the weight over all pairs at that distance (and over the batch) → the curve \(A(d)\).
  3. Fit the power law. In log-log axes, \(\log A(d)=c-\gamma\,\log d\) is a straight line; fit it by least squares. The slope with its sign flipped is γ.
  4. Save the R². It is your confidence measure (see below). Repeat per head (don’t mix heads with different behavior) and, if you want the model’s γ, aggregate afterwards.
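Step 1 can be sketched as follows. This is a minimal sketch, not the book's pipeline: the function names `collect_attention` and `average_attention`, the fixed length `n`, and the choice to skip texts shorter than `n` tokens are assumptions made for illustration. It relies on 🤗 Transformers returning one attention tensor per layer when `output_attentions=True` is passed; recent versions may also need `attn_implementation="eager"` at load time for the weights to be returned.

```python
import numpy as np

def average_attention(per_text):
    """Average a list of (layers, heads, n, n) arrays over texts."""
    return np.mean(np.stack(per_text, axis=0), axis=0)

def collect_attention(model_name, texts, n=128):
    """Return the mean attention over texts, shape (layers, heads, n, n).

    Hypothetical helper: model_name is any Hugging Face model id.
    """
    # Imported lazily so the averaging helper works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    # Assumption: a recent Transformers version; drop the keyword on older ones.
    model = AutoModel.from_pretrained(model_name, attn_implementation="eager")
    model.eval()

    per_text = []
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True, max_length=n)
        if enc["input_ids"].shape[1] < n:
            continue  # keep every matrix the same size; skip short texts
        with torch.no_grad():
            out = model(**enc, output_attentions=True)
        # out.attentions: one tensor of shape (1, heads, n, n) per layer
        layers = [a[0].float().cpu().numpy() for a in out.attentions]
        per_text.append(np.stack(layers, axis=0))
    if not per_text:
        raise ValueError("no text reached n tokens")
    return average_attention(per_text)
```

Processing one text at a time avoids padding tokens leaking into the average, at the cost of speed. Each `[layer, head]` slice of the result is one matrix \(A_{ij}\) ready for steps 2 to 4.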
# Sketch (not production code): from real attention to γ via a log-log fit.
import numpy as np
# A: mean attention matrix of ONE head (from output_attentions), shape (n, n), causal
n = A.shape[-1]
d = np.arange(1, n)                             # distances d = i - j, from 1 to n-1
# Step 2: collapse by distance (the k-th sub-diagonal holds every pair with i - j = k)
A_d = np.array([np.diagonal(A, offset=-k).mean() for k in d])
mask = A_d > 0                                  # log-log needs positives
x, y = np.log(d[mask]), np.log(A_d[mask])
slope, c = np.polyfit(x, y, 1)                  # Step 3: least-squares line
gamma = -slope                                  # A(d) ∝ d^(−γ)
# Step 4: R² of the fit = your reliability receipt
resid = y - (slope * x + c)
r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
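The single-head sketch above extends to step 4's advice: fit every head separately, then aggregate. The version below is a hedged illustration. The names `fit_gamma`, `gamma_field`, `model_gamma` and `synthetic_head` are hypothetical, the median is one possible aggregate (the text does not prescribe one), and the R² gate of 0.95 is borrowed from the controls in 45.2. The synthetic heads are unnormalized power laws used only to check that the fit recovers a known γ; they are not real attention.

```python
import numpy as np

def fit_gamma(A):
    """Fit A(d) ∝ d^(-γ) for one head; return (gamma, r2)."""
    n = A.shape[-1]
    d = np.arange(1, n)
    # Collapse by distance: the k-th sub-diagonal holds all pairs with i - j = k.
    A_d = np.array([np.diagonal(A, offset=-k).mean() for k in d])
    mask = A_d > 0                      # log-log needs positive values
    x, y = np.log(d[mask]), np.log(A_d[mask])
    slope, c = np.polyfit(x, y, 1)
    resid = y - (slope * x + c)
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return -slope, r2

def gamma_field(attn):
    """attn: (layers, heads, n, n) -> gamma and r2 arrays of shape (layers, heads)."""
    n_layers, n_heads = attn.shape[:2]
    gam = np.empty((n_layers, n_heads))
    r2 = np.empty((n_layers, n_heads))
    for l in range(n_layers):
        for h in range(n_heads):
            gam[l, h], r2[l, h] = fit_gamma(attn[l, h])
    return gam, r2

def model_gamma(gam, r2, r2_min=0.95):
    """Median γ over heads whose fit passes the R² gate, plus how many passed."""
    keep = r2 >= r2_min
    return float(np.median(gam[keep])), int(keep.sum())

def synthetic_head(n, gamma):
    """Synthetic causal head with weights exactly (i - j)^(-gamma); test signal only."""
    i, j = np.indices((n, n))
    dist = (i - j).astype(float)
    A = np.zeros((n, n))
    A[dist > 0] = dist[dist > 0] ** (-gamma)
    return A

# Usage on a 2-layer, 2-head synthetic stack with known exponents.
true_gamma = np.array([[0.5, 1.0], [1.5, 2.0]])
attn = np.array([[synthetic_head(32, g) for g in row] for row in true_gamma])
gam, r2 = gamma_field(attn)
summary, n_kept = model_gamma(gam, r2)
```

Reporting `n_kept` next to the summary keeps visible how many heads the single number actually rests on.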

45.2 The controls that make it trustworthy (don’t fool yourself)

  • R² before γ. With R² > 0.95 the γ is a good summary; at R² ≈ 0.85 it is coarser, so say so rather than hiding it. A γ with low R² is not comparable to one with high R².
  • Don’t compare raw γ across different models. That mixes θ + data + architecture (the decomposition of Ch. 15). The clean control is within-model: change only θ in the same model (Ch. 16-17).
  • Average carefully. γ varies by head and by depth (γ-field, Ch. 16); a single number per model is a summary, not the whole story.
  • Separate the sink. The concentration mass (sink) is an axis apart from γ (Ch. 17); don’t fold it into the tail fit.
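The last control can be illustrated with a small sketch. It assumes, for illustration only, that the sink sits in the first key position(s); Ch. 17 is the authority on how the sink is actually defined. The names `collapse_by_distance`, `fit_tail` and `sink_mass` are hypothetical, and the matrix is synthetic: an exact power law with γ = 1 plus an artificial constant added to column 0.

```python
import numpy as np

def collapse_by_distance(A, n_sink=0):
    """Mean weight per distance d = i - j, ignoring the first n_sink key columns."""
    n = A.shape[-1]
    i, j = np.indices(A.shape)
    dist = i - j
    valid = (dist > 0) & (j >= n_sink)          # causal pairs outside the sink
    sums = np.bincount(dist[valid], weights=A[valid], minlength=n)
    counts = np.bincount(dist[valid], minlength=n)
    d = np.nonzero(counts)[0]                   # distances that have at least one pair
    return d, sums[d] / counts[d]

def fit_tail(A, n_sink=0):
    """Log-log least-squares fit of the collapsed curve; returns gamma."""
    d, A_d = collapse_by_distance(A, n_sink)
    mask = A_d > 0
    slope, _ = np.polyfit(np.log(d[mask]), np.log(A_d[mask]), 1)
    return -slope

def sink_mass(A, n_sink=1):
    """Average weight that later queries put on the sink columns (a separate axis)."""
    return float(A[n_sink:, :n_sink].sum(axis=1).mean())

# Synthetic head: exact d^(-1) tail plus a constant sink on key position 0.
n = 64
i, j = np.indices((n, n))
dist = (i - j).astype(float)
A = np.zeros((n, n))
A[dist > 0] = dist[dist > 0] ** (-1.0)
A[1:, 0] += 5.0                                 # artificial sink, illustration only

gamma_raw = fit_tail(A, n_sink=0)               # sink folded into the tail fit
gamma_clean = fit_tail(A, n_sink=1)             # sink kept apart
mass = sink_mass(A, n_sink=1)
```

With the sink folded in, the long-distance end of \(A(d)\) is inflated and the fitted exponent drifts away from the true tail; excluding the sink column recovers it, and `mass` is reported on its own.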

45.3 Compare with the geometric prediction

Once you have measured γ, compare it with γ_Padé (Ch. 15, Recipe 2 in R2): if they differ, the gap is a signal of the γ_train + γ_arch terms of the decomposition, not a failure of the measurement. To be honest about its accuracy: the prediction gets the center right (median error ~22% in Phase A), not the exact value.
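The comparison itself is simple arithmetic, sketched below. Two things are assumptions: the error is taken relative to the measured value (the text does not say which denominator Phase A used), and the numbers are invented placeholders, not values from the atlas. The name `relative_gap` is hypothetical.

```python
import numpy as np

def relative_gap(gamma_obs, gamma_pade):
    """Relative gap |γ_obs − γ_Padé| / |γ_obs|, element-wise."""
    gamma_obs = np.asarray(gamma_obs, dtype=float)
    gamma_pade = np.asarray(gamma_pade, dtype=float)
    return np.abs(gamma_obs - gamma_pade) / np.abs(gamma_obs)

# Placeholder values for two imaginary models (NOT measured data).
gamma_obs = [1.0, 2.0]
gamma_pade = [0.8, 2.5]

gaps = relative_gap(gamma_obs, gamma_pade)
signed = np.asarray(gamma_obs) - np.asarray(gamma_pade)   # sign shows the direction
median_gap = float(np.median(gaps))
```

Keeping the signed difference alongside the relative gap shows whether the measured γ sits above or below the geometric prediction, which is the part the γ_train + γ_arch terms are meant to explain.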

45.4 Open data to reproduce (or refute)

  • The γ atlas (γ measured across 42 models from 4 families) and the within-model experiments (θ-rescale, γ⊥sink) are published as open data alongside Paper I (Zenodo 20314038): downloadable, reproducible, open to criticism.
  • The model panel and the tool’s apparatus are open at github.com/karlesmarin/tafagent-registry and github.com/karlesmarin/lean-taf (formal proofs). (The Hugging Face datasets exist for reproduction; check the exact repository handle before linking them.)
Warning⚠ Honest — what this validates and what it doesn’t

Measuring γ and its R² validates the form \(A(d)\propto d^{-\gamma}\) as a description (solid: R²>0.95 across 30+ models). It does not by itself validate the claims derived from γ that are still open: the D_f window (🟡 no benchmark) or the context headroom (🟡, avenue-2 crashed). Reproduce the measurement; treat the derived parts with their label (R1, Ch. 38).

Note🧪 Try it — tafagent

If you don’t want to set up the pipeline, tafagent does steps 1-4 for you from a model id or config.json: it gives you γ_observed (from real weights), γ_Padé (predicted), the , the regime, and the horizon. It is the “one-click” version of this recipe; the full manual is in R9.

Next reference (R9): the tafagent manual (the 7 modes, the Anti-Bullshit Pack, the recipes, and the TAF Card).