43  R2 · Cookbook

What it is. Step-by-step recipes for applying the book’s methods to your model: measure and predict γ, separate sink from decay, size the KV cache, extend context, choose a decoding strategy, and compress. Each recipe states what you need, the steps, the receipt (how to tell the result is trustworthy), and where it is explained in depth. (The full manual for the tool is in R9.)

43.1 Recipe 1 — Measure a model’s γ

Goal: obtain the decay exponent \(\gamma\) that summarizes how attention falls off with distance.

  1. Run a batch of texts through the model and record the attention weights of each head/layer.
  2. For each distance \(d\), average the attention weight → the curve \(A(d)\).
  3. Fit a line in log-log axes: \(\log A(d) = c - \gamma\,\log d\). The slope (sign flipped) is \(\gamma\).
  4. Receipt: look at the R² of the fit. With R² > 0.95 the γ is reliable; at R² ≈ 0.85 it is a coarser summary (note it). Average per head; don’t mix dissimilar heads.
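
Steps 2 to 4 can be sketched in a few lines, assuming you already have the averaged curve \(A(d)\) as arrays. The function name `fit_gamma` and the synthetic data are illustrative only (this is not tafagent's implementation); the fit is an ordinary least-squares line in log-log space.

```python
import numpy as np

def fit_gamma(distances, attention):
    """Least-squares fit of log A(d) = c - gamma * log d. Returns (gamma, c, r2)."""
    d = np.asarray(distances, dtype=float)
    a = np.asarray(attention, dtype=float)
    keep = (d > 0) & (a > 0)               # logarithms need strictly positive values
    x, y = np.log(d[keep]), np.log(a[keep])
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot             # the receipt: quality of the linear fit
    return float(-slope), float(intercept), r2

# Synthetic check (made-up data): a noiseless power law with gamma = 0.8
d = np.arange(1, 513, dtype=float)
A = 0.3 * d ** -0.8
gamma, c, r2 = fit_gamma(d, A)
```

On real heads the curve is noisy, so report `r2` next to `gamma` rather than `gamma` alone.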

Where: Ch. 13 (reading maps) and Ch. 15 (the law). Shortcut: Profile mode in tafagent.

43.2 Recipe 2 — Predict γ from geometry (no training)

Goal: estimate γ before measuring, using only RoPE’s base θ and the length \(T\).

  1. Take θ (a model hyperparameter, e.g. 10000) and the length \(T\) you are evaluating.
  2. Compute \(\displaystyle \gamma_{\text{Padé}}=\frac{2\theta-T\sqrt2}{2\theta+T\sqrt2}\).
  3. Honest receipt: this is the center, not the exact value (median error ~22% in Phase A). If the measured γ (Recipe 1) drifts away, the decomposition (γ_train + γ_arch) explains the rest.
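
The formula in step 2 is a one-liner; the sketch below just evaluates it. The function name `gamma_pade` and the lengths in the loop are illustrative choices, not values from the book.

```python
import math

def gamma_pade(theta: float, T: float) -> float:
    """gamma_Pade = (2*theta - T*sqrt(2)) / (2*theta + T*sqrt(2))."""
    x = T * math.sqrt(2.0)
    return (2.0 * theta - x) / (2.0 * theta + x)

# Illustrative sweep over lengths at theta = 10000
for T in (512, 2048, 8192):
    print(T, round(gamma_pade(10000.0, T), 3))
```

With θ = 10000 and T = 2048 this evaluates to about 0.747. The expression stays below 1 for any T > 0 and crosses zero at T = √2·θ, so read it as a prediction of the center only, as the receipt says.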

Where: Ch. 15. Formal receipt: the γ_Padé=Cayley identity is proved in Lean (📐, R1).

43.3 Recipe 3 — Separate concentration (sink) from decay (γ)

Goal: don’t conflate two things the literature mixes up.

  1. Measure two distinct signals: the sink mass (how much attention lands on the first tokens) and the γ (Recipe 1).
  2. Treat them as independent axes: a model can have high γ and little sink, or the other way round.
  3. Receipt: our within-model experiment (rescaling θ by 256×) moves γ (0.75→1.0) and leaves the sink flat (~0.38) → they are orthogonal (📊).
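
One simple way to get the sink signal of step 1 is sketched below, assuming a causal attention matrix whose rows sum to 1. The name `sink_mass`, the choice of how many leading tokens count as the sink (`n_sink`), and the rule of skipping the earliest queries are all assumptions made for illustration; they are not the definition of tafagent's peak_max_share.

```python
import numpy as np

def sink_mass(attn, n_sink=1, skip=None):
    """Mean share of attention that queries place on the first n_sink keys."""
    attn = np.asarray(attn, dtype=float)
    skip = n_sink if skip is None else skip
    rows = attn[skip:]                     # early queries can only see the sink tokens
    return float(rows[:, :n_sink].sum(axis=1).mean())

# Synthetic causal map (made-up): every query puts 0.38 on token 0,
# and spreads the remaining 0.62 evenly over the other visible tokens.
T = 16
attn = np.zeros((T, T))
attn[0, 0] = 1.0
for i in range(1, T):
    attn[i, 0] = 0.38
    attn[i, 1:i + 1] = 0.62 / i
sink = sink_mass(attn)
```

Compute this on the same attention maps you use for Recipe 1, and report it as a separate number from γ.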

Where: Ch. 17. Shortcut: tafagent reports peak_max_share (sink) and γ separately.

43.4 Recipe 4 — Size the KV-cache budget

Goal: estimate how much KV cache you really need at the target length.

  1. Measure γ (Recipe 1). If γ > 1 (Phase B, concentrates nearby), the cache is compressible.
  2. Estimate the window \(D_f \sim \varepsilon^{-1/(\gamma-1)}\), where \(\varepsilon\) is the attention-mass tolerance you accept losing.
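  3. Compute the memory: KV bytes ≈ batch × D_f × 2 × layers × d_model × bytes_per_number. The factor 2 counts keys and values; with grouped-query attention, replace d_model by n_kv_heads × head_dim.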
  4. Honest receipt: \(D_f\) is a derived rule, 🟡 not yet validated on benchmarks (RULER/LongBench). Use it as a starting estimate, not a guarantee.
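
Steps 2 and 3 as a minimal sketch. The relation \(D_f \sim \varepsilon^{-1/(\gamma-1)}\) is a scaling law, so the code drops any constant prefactor (an assumption). The function names and the model shape in the example (32 layers, d_model 4096, 2 bytes per number) are hypothetical.

```python
def window_df(gamma: float, eps: float) -> float:
    """Window estimate D_f ~ eps**(-1/(gamma-1)), with the prefactor taken as 1."""
    if gamma <= 1.0:
        raise ValueError("the window estimate only applies for gamma > 1 (Phase B)")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be a tolerance strictly between 0 and 1")
    return eps ** (-1.0 / (gamma - 1.0))

def kv_bytes(batch: int, tokens: int, layers: int, d_model: int,
             bytes_per_number: int = 2) -> int:
    """KV bytes = batch * tokens * 2 (keys and values) * layers * d_model * bytes."""
    return batch * tokens * 2 * layers * d_model * bytes_per_number

# Hypothetical example: gamma = 1.5, accept losing 1% of the attention mass
tokens = round(window_df(1.5, 0.01))
budget = kv_bytes(batch=1, tokens=tokens, layers=32, d_model=4096, bytes_per_number=2)
```

For these inputs the window is about 10^4 tokens and the budget about 5.2 GB. Note how steep the exponent is: as γ approaches 1 from above, the window grows without bound, so small errors in γ move the estimate a lot.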

Where: Ch. 20 (D_f) and Ch. 36 (serving). Shortcut: tafagent computes the KV memory.

43.5 Recipe 5 — Extend a model’s context

Goal: make a model work beyond its training length.

  1. Remember: RoPE’s wavelengths depend on θ; rescaling θ stretches them.
  2. Use NTK-aware / YaRN to rescale and, if you can, a short fine-tune to the new length.
  3. Receipt / caveat: more context is not used uniformly (“lost in the middle”, Ch. 31): put what matters at the edges. And our own γ-based headroom rule is 🟡 unvalidated (avenue-2 crashed) — don’t present it as established.
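
To make step 1 concrete, the sketch below lists RoPE's wavelengths for a given θ and applies the commonly used NTK-aware base rescaling, θ' = θ · s^(d/(d−2)), where s is the context extension factor and d the head dimension. Function names and the example values are illustrative; YaRN adds per-frequency interpolation and an attention temperature on top of this and is not reproduced here.

```python
import math

def rope_wavelengths(theta: float, head_dim: int) -> list[float]:
    """Wavelength in tokens of each RoPE frequency pair: 2*pi*theta**(2i/d)."""
    return [2.0 * math.pi * theta ** (2.0 * i / head_dim)
            for i in range(head_dim // 2)]

def ntk_scaled_theta(theta: float, scale: float, head_dim: int) -> float:
    """NTK-aware base rescaling: theta' = theta * scale**(d / (d - 2))."""
    return theta * scale ** (head_dim / (head_dim - 2))

# Illustrative values: theta = 10000, head_dim = 128, 4x longer context
old = rope_wavelengths(10000.0, 128)
new = rope_wavelengths(ntk_scaled_theta(10000.0, 4.0, 128), 128)
stretch_slowest = new[-1] / old[-1]   # longest wavelength grows by the scale factor
stretch_fastest = new[0] / old[0]     # shortest wavelength is left untouched
```

The exponent d/(d−2) is chosen so that the slowest frequency is stretched by exactly s while the fastest is unchanged, which is why local detail tends to survive the rescaling better than with uniform position interpolation.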

Where: Ch. 14 (scales), Ch. 19 (long context).

43.6 Recipe 6 — Choose a decoding strategy

Goal: get greedy/sampling right for the task.

  1. Closed task (code, factual, translation)? → greedy/beam (precision).
  2. Open/creative task? → sampling: temp ≈ 0.7–1.0, top-p ≈ 0.9 (or min-p if you raise the temperature).
  3. Need guaranteed format (JSON)? → constrained decoding, but remember: it guarantees form, not truth (Ch. 29).
  4. Receipt: evaluate fidelity, not just fluency; beware of the biases of the LLM-as-judge.
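
The choices in steps 1 and 2 differ only in how the next-token distribution is filtered before drawing. Below is a minimal NumPy sketch of greedy, temperature plus top-p, and min-p selection for a single step; `sample_token` and its defaults are hypothetical and do not mirror any particular library's API.

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, min_p=None, rng=None):
    """Pick one token id. temperature <= 0 means greedy; min_p, if set, replaces top_p."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    if temperature <= 0.0:
        return int(np.argmax(logits))                  # greedy decoding
    z = logits / temperature
    probs = np.exp(z - z.max())                        # numerically stable softmax
    probs /= probs.sum()
    if min_p is not None:
        keep = probs >= min_p * probs.max()            # threshold relative to the top token
    else:
        order = np.argsort(-probs)
        cum = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(cum, top_p)) + 1  # smallest prefix with mass >= top_p
        keep = np.zeros(probs.shape, dtype=bool)
        keep[order[:n_keep]] = True
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()                         # renormalize over the kept tokens
    return int(rng.choice(len(filtered), p=filtered))

# Illustrative call with made-up logits
token = sample_token([2.0, 1.0, 0.5, -1.0], rng=np.random.default_rng(0))
```

Min-p scales its cutoff with the confidence of the top token, which is why it is the usual companion to higher temperatures.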

Where: Ch. 12 (basics) and Ch. 29 (in depth).

43.7 Recipe 7 — Decide how to compress a model

Goal: shrink it without losing (much) quality.

  1. First choice: quantization. 8 bits ≈ lossless; 4 bits nearly so (GPTQ/AWQ). Below ~3-4 bits, cliff (math/code suffer).
  2. Can you train? → distillation (small student imitates the teacher).
  3. Change of shape? → pruning (structured if you want real speedup on a normal GPU).
  4. Receipt: don’t evaluate with perplexity alone. Compression harms the rare stuff (the long tail), so measure downstream tasks too.
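
To see why precision falls off a cliff at low bit widths, the toy sketch below applies plain round-to-nearest quantization with a single absmax scale to a random matrix and measures the reconstruction error. This is deliberately naive: it is not GPTQ or AWQ, the weights are synthetic, and weight error is not a quality metric (the receipt above still applies).

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization with one absmax scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                           # dequantized values, same shape as w

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))                # synthetic stand-in for a weight matrix
errors = {
    bits: float(np.sqrt(np.mean((w - fake_quantize(w, bits)) ** 2)))
    for bits in (8, 4, 3, 2)
}
```

Each bit removed roughly doubles the step size, so the error grows geometrically as the bit width shrinks. Methods such as GPTQ and AWQ exist precisely to do better than this baseline at 4 bits.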

Where: Ch. 35.

43.8 Recipe 8 — Estimate the cost of serving

Goal: anticipate throughput/latency/memory before deploying.

  1. Identify the bottleneck: decode (bandwidth-bound), not prefill.
  2. The KV cache limits how many requests fit → use Recipe 4 for the budget.
  3. Levers: continuous batching (throughput), PagedAttention (memory), speculative (latency), quantization (capacity).
  4. Honest receipt: the “X× faster” claims in papers are workload-specific; measure on yours.
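
Steps 1 and 2 can be turned into a back-of-envelope estimate, assuming decode is purely memory-bandwidth-bound: each step must read the weights once plus the KV cache of every active sequence. The function names and all hardware numbers below are hypothetical placeholders, and the model ignores compute, kernel overheads, and activation memory.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                        batch: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if memory traffic is the only cost."""
    bytes_moved = weight_bytes + batch * kv_bytes_per_seq
    return bytes_moved / bandwidth_bytes_per_s

def max_concurrent_requests(gpu_bytes: float, weight_bytes: float,
                            kv_bytes_per_seq: float) -> int:
    """How many sequences fit once the weights are resident."""
    return max(0, int((gpu_bytes - weight_bytes) // kv_bytes_per_seq))

# Hypothetical setup: 14 GB of weights, 2 GB of KV per sequence,
# an 80 GB accelerator with 1 TB/s of memory bandwidth.
step = decode_step_seconds(14e9, 2e9, batch=1, bandwidth_bytes_per_s=1e12)
fits = max_concurrent_requests(80e9, 14e9, 2e9)
```

The weight read is shared across the batch, which is why batching raises throughput until the KV term dominates. Treat the result as a floor on latency and check it against a measurement on your own workload.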

Where: Ch. 34 and Ch. 36.

43.9 Recipe 0 — How NOT to fool yourself (the meta-recipe)

Before believing any number, yours or a paper’s:

  1. Ask for the receipt: is there a proof (📐), data (📊), or just a claim? No receipt → folklore (Ch. 38).
  2. R² before γ: a γ with low R² is a poor summary; say so.
  3. Don’t compare raw γ across different models: it mixes θ + data + architecture; the clean control is within-model (Ch. 16).
  4. Form ≠ truth; fluency ≠ correctness; Lean ≠ reality: each claim needs its kind of receipt.

Note. 🧪 Try it: tafagent

Almost all of these recipes have a shortcut in tafagent (Profile mode for γ and regime, KV computation, YaRN planner, falsification panel). The full manual for the tool (every mode and recipes X-1..X-23) is in R9.

Next reference (R3): the glossary, with every term in the book defined at a glance.