43 R2 · Cookbook
What it is. Step-by-step recipes for applying the book’s methods to your model: measure and predict γ, separate sink from decay, size the KV cache, extend context, choose a decoding strategy, and compress. Each recipe states what you need, the steps, the receipt (how to tell the result is trustworthy), and where it is explained in depth. (The full manual for the tafagent tool is in R9.)
43.1 Recipe 1 — Measure a model’s γ
Goal: obtain the decay exponent \(\gamma\) that summarizes how attention falls off with distance.
- Run a batch of texts through the model and record the attention weights of each head/layer.
- For each distance \(d\), average the attention weight → the curve \(A(d)\).
- Fit a line in log-log axes: \(\log A(d) = c - \gamma\,\log d\). The slope (sign flipped) is \(\gamma\).
- Receipt: look at the R² of the fit. With R² > 0.95 the γ is reliable; at R² ≈ 0.85 it is a coarser summary (note it). Average per head; don’t mix dissimilar heads.
Where: Ch. 13 (reading maps) and Ch. 15 (the law). Shortcut: Profile mode in tafagent.
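The steps of Recipe 1 can be sketched in plain Python. This is a minimal sketch, not tafagent's implementation: `mean_attention_by_distance` and `fit_gamma` are hypothetical helper names, and the code assumes `attn` is one head's causal attention matrix (row = query position, column = key position) stored as nested lists. Distance 0 (a token attending to itself) is left out because \(\log 0\) is undefined.

```python
import math

def mean_attention_by_distance(attn):
    """Average attention weight at each query-key distance d >= 1.

    attn[i][j] is the weight query i puts on key j (causal: j <= i).
    Returns a dict {d: A(d)}.
    """
    sums, counts = {}, {}
    for i, row in enumerate(attn):
        for j in range(i):          # j < i, so d = i - j >= 1
            d = i - j
            sums[d] = sums.get(d, 0.0) + row[j]
            counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] for d in sums}

def fit_gamma(curve):
    """Least-squares fit of log A(d) = c - gamma * log d.

    Returns (gamma, r_squared). Points with A(d) <= 0 are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log(d), math.log(a)) for d, a in curve.items() if a > 0]
    if len(pts) < 2:
        raise ValueError("need at least two distances with A(d) > 0")
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    slope = sxy / sxx
    # For a straight-line fit, R^2 equals the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else math.nan
    return -slope, r2               # gamma is the slope with the sign flipped

# Synthetic check: a curve that is an exact power law with exponent 1.2.
synthetic = {d: d ** -1.2 for d in range(1, 201)}
gamma, r2 = fit_gamma(synthetic)
print(f"gamma = {gamma:.3f}, R^2 = {r2:.4f}")
```

Run it once per head and keep the R² next to each γ, so that a coarse fit is flagged rather than averaged away.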
43.2 Recipe 2 — Predict γ from geometry (no training)
Goal: estimate γ before measuring, using only RoPE’s base θ and the length \(T\).
- Take θ (a model hyperparameter, e.g. 10000) and the length \(T\) you are evaluating.
- Compute \(\displaystyle \gamma_{\text{Padé}}=\frac{2\theta-T\sqrt2}{2\theta+T\sqrt2}\).
- Honest receipt: this is the center, not the exact value (median error ~22% in Phase A). If the measured γ (Recipe 1) drifts away from it, the decomposition (γ_train + γ_arch) explains the rest.
Where: Ch. 15. Formal receipt: the γ_Padé=Cayley identity is proved in Lean (📐, R1).
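The formula of Recipe 2 is a one-liner; the sketch below only evaluates it. The function name `gamma_pade` and the lengths in the loop are illustrative assumptions. Dividing numerator and denominator by \(2\theta\) gives the equivalent form \((1-x)/(1+x)\) with \(x = T/(\theta\sqrt2)\), which is plain algebra and is checked numerically in the last lines.

```python
import math

def gamma_pade(theta, T):
    """gamma_Pade = (2*theta - T*sqrt(2)) / (2*theta + T*sqrt(2))."""
    x = T * math.sqrt(2.0)
    return (2.0 * theta - x) / (2.0 * theta + x)

theta = 10000                       # RoPE base, as in the example above
for T in (512, 2048, 8192):         # illustrative evaluation lengths
    print(T, round(gamma_pade(theta, T), 3))

# Same value written as (1 - x) / (1 + x) with x = T / (theta * sqrt(2)).
x = 2048 / (theta * math.sqrt(2.0))
assert math.isclose(gamma_pade(theta, 2048), (1 - x) / (1 + x))
```

Treat the output as the center of the estimate and compare it with the measured γ from Recipe 1.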
43.3 Recipe 3 — Separate concentration (sink) from decay (γ)
Goal: don’t conflate two things the literature mixes up.
- Measure two distinct signals: the sink mass (how much attention lands on the first tokens) and the γ (Recipe 1).
- Treat them as independent axes: a model can have high γ and little sink, or the other way round.
- Receipt: our within-model experiment (rescaling θ by 256×) moves γ (0.75→1.0) and leaves the sink flat (~0.38) → they are orthogonal (📊).
Where: Ch. 17. Shortcut: tafagent reports peak_max_share (sink) and γ separately.
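One simple way to get the sink signal of Recipe 3 is sketched below. It is not tafagent's `peak_max_share`, whose exact definition is not given in this recipe: `sink_mass` is a hypothetical helper, and the default of 4 sink tokens is an assumption you should tune. The input is the same kind of per-head causal attention matrix (nested lists, rows summing to 1) used for γ in Recipe 1.

```python
def sink_mass(attn, n_sink=4):
    """Mean attention mass that queries place on the first n_sink keys.

    Queries at positions < n_sink are skipped: under a causal mask they can
    only see the first tokens, so they would inflate the average.
    """
    rows = attn[n_sink:]
    if not rows:
        raise ValueError("sequence must be longer than n_sink")
    return sum(sum(row[:n_sink]) for row in rows) / len(rows)

# Tiny hand-made example with a single sink token.
attn = [
    [1.0],
    [0.5, 0.5],
    [0.6, 0.2, 0.2],
    [0.7, 0.1, 0.1, 0.1],
]
print(sink_mass(attn, n_sink=1))
```

Report this number next to γ instead of folding both into a single "locality" score.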
43.4 Recipe 4 — Size the KV-cache budget
Goal: estimate how much KV cache you really need at the target length.
- Measure γ (Recipe 1). If γ > 1 (Phase B, concentrates nearby), the cache is compressible.
- Estimate the window \(D_f \sim \varepsilon^{-1/(\gamma-1)}\), where \(\varepsilon\) is the attention-mass tolerance you accept losing.
- Compute the memory: KV bytes ≈ batch × D_f × 2 × layers × d_model × bytes_per_number.
- Honest receipt: \(D_f\) is a derived rule, 🟡 not yet validated on benchmarks (RULER/LongBench). Use it as a starting estimate, not a guarantee.
Where: Ch. 20 (D_f) and Ch. 36 (serving). Shortcut: tafagent computes the KV memory.
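The arithmetic of Recipe 4 can be sketched as follows. The function names and the model shape (32 layers, d_model = 4096, 16-bit numbers) are hypothetical, and the sketch takes the unknown prefactor in \(D_f \sim \varepsilon^{-1/(\gamma-1)}\) to be 1, which is an assumption and not something the rule states.

```python
import math

def window_tokens(gamma, eps):
    """D_f ~ eps ** (-1 / (gamma - 1)), with the prefactor assumed to be 1."""
    if gamma <= 1.0:
        raise ValueError("the rule only applies for gamma > 1")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    return eps ** (-1.0 / (gamma - 1.0))

def kv_cache_bytes(batch, tokens, layers, d_model, bytes_per_number):
    """batch x tokens x 2 (keys and values) x layers x d_model x bytes."""
    return batch * tokens * 2 * layers * d_model * bytes_per_number

# Hypothetical model shape and tolerance.
d_f = window_tokens(gamma=1.5, eps=0.01)
budget = kv_cache_bytes(batch=1, tokens=math.ceil(d_f), layers=32,
                        d_model=4096, bytes_per_number=2)
print(f"D_f ~ {d_f:.0f} tokens, KV cache ~ {budget / 2**30:.2f} GiB")
```

For models with grouped-query attention the keys and values are narrower than d_model (number of KV heads × head dimension), so substitute that width to avoid overestimating the budget.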
43.5 Recipe 5 — Extend a model’s context
Goal: make a model work beyond its training length.
- Remember: RoPE’s wavelengths depend on θ; rescaling θ stretches them.
- Use NTK-aware / YaRN to rescale and, if you can, a short fine-tune to the new length.
- Receipt / caveat: more context is not used uniformly (“lost in the middle”, Ch. 31): put what matters at the edges. And our own γ-based headroom rule is 🟡 unvalidated (avenue-2 crashed) — don’t present it as established.
Where: Ch. 14 (scales), Ch. 19 (long context).
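To see what "rescaling θ stretches the wavelengths" means in numbers, the sketch below computes RoPE wavelengths before and after an NTK-aware base rescaling, \(\theta' = \theta\, s^{d/(d-2)}\). It covers only that base change; YaRN adds per-frequency interpolation and an attention temperature that are not shown here. The head dimension and scale factor are hypothetical values.

```python
import math

def rope_wavelengths(theta, head_dim):
    """Wavelength in tokens of each RoPE frequency pair:
    lambda_i = 2*pi * theta ** (2*i / head_dim), i = 0 .. head_dim/2 - 1."""
    return [2.0 * math.pi * theta ** (2.0 * i / head_dim)
            for i in range(head_dim // 2)]

def ntk_scaled_theta(theta, scale, head_dim):
    """NTK-aware base rescaling: theta' = theta * scale ** (d / (d - 2))."""
    return theta * scale ** (head_dim / (head_dim - 2))

theta, head_dim, scale = 10000.0, 128, 4.0   # hypothetical values
before = rope_wavelengths(theta, head_dim)
after = rope_wavelengths(ntk_scaled_theta(theta, scale, head_dim), head_dim)
print(f"shortest: {before[0]:.2f} -> {after[0]:.2f} tokens")
print(f"longest:  {before[-1]:.0f} -> {after[-1]:.0f} tokens")
```

With this choice of exponent the longest wavelength grows by exactly the scale factor while the shortest stays put, which is why local detail survives the extension.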
43.6 Recipe 6 — Choose a decoding strategy
Goal: get greedy/sampling right for the task.
- Closed task (code, factual, translation)? → greedy/beam (precision).
- Open/creative task? → sampling: temp ≈ 0.7–1.0, top-p ≈ 0.9 (or min-p if you raise the temperature).
- Need guaranteed format (JSON)? → constrained decoding, but remember: it guarantees form, not truth (Ch. 29).
- Receipt: evaluate fidelity, not just fluency; beware of the biases of the LLM-as-judge.
Where: Ch. 12 (basics) and Ch. 29 (in depth).
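The knobs named in Recipe 6 (greedy, temperature, top-p, min-p) fit in a few lines of plain Python. This is a minimal sketch over a single list of logits, with hypothetical function names; a real inference stack applies the same filters to tensors and may order them differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def decode_step(logits, temperature=0.0, top_p=1.0, rng=random):
    """Pick one token id. temperature == 0 means greedy (argmax)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]                       # toy vocabulary of 4
print("greedy:", decode_step(logits))
print("sampled:", decode_step(logits, temperature=0.8, top_p=0.9,
                              rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps a sampling run reproducible, which helps when you compare settings on the same prompts.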
43.7 Recipe 7 — Decide how to compress a model
Goal: shrink it without losing (much) quality.
- First choice: quantization. 8 bits ≈ lossless; 4 bits nearly so (GPTQ/AWQ). Below ~3-4 bits, cliff (math/code suffer).
- Can you train? → distillation (small student imitates the teacher).
- Change of shape? → pruning (structured if you want a real speedup on a normal GPU).
- Receipt: don’t evaluate with perplexity alone: compression harms the rare stuff (the long tail), so measure downstream tasks.
Where: Ch. 35.
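To get a feel for why error grows as the bit width shrinks, the sketch below applies the simplest scheme, symmetric per-tensor round-to-nearest, to synthetic Gaussian weights. It is not GPTQ or AWQ, which choose the rounding more carefully, and weight error is not task quality: the receipt above still applies. All names and the weight distribution are hypothetical.

```python
import math
import random

def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns the dequantized weights, i.e. the values the model would
    actually compute with after quantization.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a symmetric grid")
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)              # all-zero tensor: nothing to do
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]   # synthetic "layer"
errors = {bits: rms_error(weights, quantize_rtn(weights, bits))
          for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits} bits: RMS error {err:.6f}")
```

Because a single outlier sets the scale for the whole tensor here, practical methods use per-channel or per-group scales; the sketch keeps one scale only to stay short.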
43.8 Recipe 8 — Estimate the cost of serving
Goal: anticipate throughput/latency/memory before deploying.
- Identify the bottleneck: it is decode (memory-bandwidth-bound), not prefill (compute-bound).
- The KV cache limits how many requests fit → use Recipe 4 for the budget.
- Levers: continuous batching (throughput), PagedAttention (memory), speculative decoding (latency), quantization (capacity).
- Honest receipt: the “X× faster” claims in papers are workload-specific; measure on yours.
Where: Ch. 34 and Ch. 36.
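A back-of-envelope version of Recipe 8, assuming decode is purely bandwidth-bound: each decode step streams the weights once and the KV cache of every request in the batch once, so the step time is at least bytes moved divided by memory bandwidth. The function names and every hardware and model number below are hypothetical, and the result is a bound, not a prediction.

```python
def max_concurrent_requests(gpu_bytes, weight_bytes, kv_bytes_per_request):
    """Requests that fit once the weights are resident.

    Ignores activations and allocator overhead, so it is optimistic.
    """
    free = gpu_bytes - weight_bytes
    return max(0, free // kv_bytes_per_request)

def decode_step_seconds(weight_bytes, kv_bytes_in_batch, bandwidth_bytes_per_s):
    """Bandwidth-bound lower bound on the duration of one decode step."""
    return (weight_bytes + kv_bytes_in_batch) / bandwidth_bytes_per_s

# Hypothetical setup: 80 GB of memory, 2 TB/s of bandwidth, a 7e9-parameter
# model in 16-bit weights, and 2 GiB of KV cache per request (Recipe 4).
GPU_BYTES = 80 * 10**9
BANDWIDTH = 2 * 10**12
WEIGHT_BYTES = 14 * 10**9
KV_PER_REQUEST = 2 * 2**30

n = max_concurrent_requests(GPU_BYTES, WEIGHT_BYTES, KV_PER_REQUEST)
step = decode_step_seconds(WEIGHT_BYTES, n * KV_PER_REQUEST, BANDWIDTH)
print(f"{n} requests fit; >= {step * 1e3:.1f} ms per decode step; "
      f"<= {n / step:.0f} tokens/s across the batch")
```

The two functions pull in opposite directions: a smaller KV budget per request lets more requests fit and also shortens each step, which is why Recipe 4 feeds directly into this estimate.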
43.9 Recipe 0 — How NOT to fool yourself (the meta-recipe)
Before believing any number (yours or a paper’s):
- Ask for the receipt: is there a proof (📐), data (📊), or just a claim? No receipt → folklore (Ch. 38).
- R² before γ: a γ with low R² is a poor summary; say so.
- Don’t compare raw γ across different models: it mixes θ + data + architecture; the clean control is within-model (Ch. 16).
- Form ≠ truth; fluency ≠ correctness; Lean ≠ reality: each claim needs its kind of receipt.
Almost all of these recipes have a shortcut in tafagent (Profile mode for γ and regime, KV computation, YaRN planner, falsification panel). The full manual for the tool (every mode and recipes X-1..X-23) is in R9.
Next reference (R3): the glossary (every term in the book, defined at a glance).