20  Long-context extension

Where we are. One of the star uses of all the RoPE theory: making a model trained up to length T work beyond T without retraining it from scratch. This chapter explains how (Position Interpolation, NTK-aware, YaRN, LongRoPE…), connects it to our γ, and, true to the book, is very clear about what these methods don’t achieve. An honest spoiler: “extended to 1M” is a claim to audit.

20.1 The idea in one sentence

Because the model collapses when, beyond its training length, it meets RoPE angles it never saw, the fix isn’t to extrapolate but to remap the long positions into the range of angles it already knows, almost always by rescaling the base θ.

20.2 Key concepts and their role in the transformer

Before we look at the methods, let’s define the terms of this chapter and what each one is for inside a transformer:

  • Training length (T). Definition: the maximum sequence length the model saw while training. In the transformer: it fixes the range of “known” RoPE angles; beyond T the model is in unseen territory.
  • Out-of-distribution (OOD) angles. Definition: the RoPE rotations that appear at positions > T and that the model never observed. In the transformer: they make the attention scores blow up and the softmax destabilize → text degeneration.
  • Extrapolate vs. remap. Definition: extrapolating means letting the model face new angles; remapping means compressing the long positions into the angles it has already seen. In the transformer: almost every solution that works remaps, it doesn’t extrapolate.
  • Rescaling the base θ. Definition: changing the constant that fixes RoPE’s frequencies. In the transformer: it’s the common degree of freedom across PI, NTK, YaRN, and LongRoPE; moving θ shifts which positions fall on which angles.
  • NTK-aware / YaRN. Definition: non-uniform rescalings of θ that preserve the high frequencies (the local) and stretch the low ones (the far); YaRN adds a per-wavelength ramp and a temperature. In the transformer: they’re the practical standard for extending context with little or no finetuning.
  • “Lost in the middle”. Definition: the model’s tendency to use what’s at the beginning and end of the context well, but lose what’s in the middle. In the transformer: this is why the usable length is shorter than the nominal one; a “128K” is rarely 128K effective.
  • Retrieval audit (passkey / needle). Definition: testing whether the model retrieves specific information at full length, not just whether perplexity drops. In the transformer: it’s the honest test of a “long context”; perplexity can stay low while retrieval has already collapsed.
  • The α=0 rule (ours). Definition: choosing the θ that leaves the extended model at a target γ at the new length. In the transformer: it turns extension into a decision derived from the geometry (γ depends on θ) instead of a factor chosen by eye.

With that clear, let’s get to the concrete problem and the methods.

20.3 The problem

A model trained up to T has only seen positions 0…T, i.e. certain RoPE angles. Beyond T, those angles fall in an out-of-distribution regime: the model never saw them, the attention scores blow up, and the softmax destabilizes → degeneration (incoherent text, the “needle” is lost). The idea common to almost all solutions: don’t invent new angles, compress the long positions into the angles already seen.
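To make “unseen angles” concrete, here is a minimal sketch in Python (NumPy). It assumes the standard RoPE frequency schedule base^(-2i/d) with base 10000 and a head dimension of 128; those values and the function names are illustrative assumptions, not something this chapter fixes. A frequency pair whose wavelength is longer than T never completes a full turn during training, so positions beyond T push it onto angles the model has not seen.

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def unseen_dims(head_dim: int, train_len: int, base: float = 10000.0) -> np.ndarray:
    """Boolean mask of frequency pairs whose wavelength exceeds train_len.

    Those pairs cover only part of the circle during training, so any
    position beyond train_len lands them on an angle that was never observed.
    """
    wavelength = 2 * np.pi / rope_inv_freq(head_dim, base)
    return wavelength > train_len

# Assumed example values: head_dim=128, T=4096.
mask = unseen_dims(head_dim=128, train_len=4096)
print(f"{mask.sum()} of {mask.size} frequency pairs never complete a full turn within T")
```

The high-frequency pairs wrap around many times inside T, so they are not the problem; it is the slow, low-frequency pairs that go out of distribution, which is why the methods below treat the two ends of the spectrum differently.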

20.4 The methods

  • Position Interpolation (PI) (Chen et al. 2023): linearly compresses the positions (divide by L/T, where L is the new length you want to extend to and T the training length) so they fall in [0,T]. Trivial, but it loses fine local resolution and needs a little finetuning.
  • NTK-aware (bloc97, 2023; formalized in YaRN): rescales θ non-uniformly: it preserves the high frequencies (the local) and stretches the low ones (the far). It often works without finetuning for small factors (2–4×). Dynamic NTK recomputes the factor based on the current length (so it doesn’t penalize short contexts).
  • YaRN (Peng et al. 2023): refines NTK with a per-wavelength ramp (it doesn’t interpolate the highs, does the lows, smooths the middle) + an attention temperature. The practical standard: a few finetuning steps.
  • LongRoPE (Ding et al. 2024): an evolutionary search of per-dimension rescalings, in stages, up to ~2M tokens.
  • Self-Extend (Jin et al. 2024): training-free, grouped + neighbor attention; inference-time changes only.
  • LongRoPE2 (Shang et al. 2025): the honest correction of LongRoPE —it acknowledges that the low frequencies were undertrained— with a needle-guided search and mixed training.

Table 20.1: Context-extension methods

Method      | Idea                         | Finetuning?        | Typical reach
PI          | compress positions ÷(L/T)    | Yes (~1k steps)    | ~8–32k
NTK-aware   | non-uniform θ rescale        | Often no (small s) | ~2–4×
YaRN        | per-wavelength ramp + temp.  | Few steps          | ~64–128k
LongRoPE    | per-dim evolutionary search  | Yes (stages)       | up to ~2M
Self-Extend | grouped + neighbor attention | No                 | modest
LongRoPE2   | needle-driven + mixed        | Yes                | ~128k near-lossless
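The first three rows of the table can be sketched as three ways of producing RoPE inverse frequencies. This is a minimal sketch, not a reference implementation: the function names are ours, the defaults (base 10000, ramp thresholds 1 and 32, the 0.1·ln(s)+1 attention factor) follow the values reported for LLaMA-style models in the YaRN paper, and `dynamic_scale` is a simplified assumption about how Dynamic NTK picks its factor. Here s = L/T is the extension factor.

```python
import math
import numpy as np

def inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies, highest frequency first."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def pi_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    """Position Interpolation: dividing positions by s is the same as
    dividing every frequency by s (uniform compression)."""
    return inv_freq(head_dim, base) / scale

def ntk_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    """NTK-aware: enlarge the base so the lowest frequency is divided by s
    while the highest frequency is left untouched."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return inv_freq(head_dim, new_base)

def yarn_inv_freq(head_dim: int, scale: float, train_len: int, base: float = 10000.0,
                  slow: float = 1.0, fast: float = 32.0) -> np.ndarray:
    """YaRN-style ramp: keep pairs that rotate many times within T,
    interpolate pairs that rotate less than once, blend in between."""
    f = inv_freq(head_dim, base)
    rotations = train_len * f / (2 * np.pi)          # full turns completed within T
    keep = np.clip((rotations - slow) / (fast - slow), 0.0, 1.0)
    return keep * f + (1.0 - keep) * f / scale

def yarn_attention_factor(scale: float) -> float:
    """Attention temperature correction reported in the YaRN paper."""
    return 0.1 * math.log(scale) + 1.0

def dynamic_scale(current_len: int, train_len: int) -> float:
    """Simplified Dynamic NTK factor (assumption): no rescaling until T is exceeded."""
    return max(1.0, current_len / train_len)

f, s = inv_freq(128), 4.0
ntk, yarn = ntk_inv_freq(128, s), yarn_inv_freq(128, s, train_len=4096)
print("highest frequency ratio (NTK, YaRN):", ntk[0] / f[0], yarn[0] / f[0])
print("lowest frequency ratio  (NTK, YaRN):", ntk[-1] / f[-1], yarn[-1] / f[-1])
```

The design difference shows up at the two ends of the spectrum: PI shrinks everything by s, including the fast pairs that encode local order, whereas NTK-aware and YaRN leave the fastest pair alone and apply the full factor only to the slowest one.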

20.5 The honest reality

Warning ⚠ “Extended to 1M” is a claim to audit

These methods help, but they’re not magic:

  1. They trade short-context quality for long-context capacity (that’s why YaRN/LongRoPE2 add temperature or mixed training to mitigate it).
  2. The usable length is shorter than the nominal one. The “lost in the middle” phenomenon (Liu et al. 2023): the model uses what’s at the beginning and end well, but lets what’s in the middle slip away, even in “long-context” models. A nominal “128K” is rarely 128K effective.
  3. Large extensions (≫4×) almost always need some finetuning; only the modest ones really work training-free.

→ How to audit: faced with an “extended to 1M,” ask for a full-length retrieval evaluation (passkey / needle-in-haystack), not just perplexity. Perplexity can stay low while retrieval has already collapsed.
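A passkey audit of this kind can be sketched in a few lines. This is a minimal harness under stated assumptions: `generate` is a hypothetical callable (prompt in, completion out) that you would wire to your own model, the filler text and prompt wording are ours, and length is counted in filler sentences rather than tokens. Sweeping the depth is what exposes a “lost in the middle” dip.

```python
import random
from typing import Callable, Dict, Sequence

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def build_passkey_prompt(n_filler: int, depth: float, passkey: str) -> str:
    """Bury the passkey at a relative depth in [0, 1] inside a filler haystack."""
    needle = f"The pass key is {passkey}. Remember it. "
    cut = int(round(depth * n_filler))
    return FILLER * cut + needle + FILLER * (n_filler - cut) + "What is the pass key?"

def passkey_accuracy(generate: Callable[[str], str], n_filler: int,
                     depths: Sequence[float], trials: int = 5,
                     seed: int = 0) -> Dict[float, float]:
    """Retrieval accuracy per depth; `generate` is a hypothetical model wrapper."""
    rng = random.Random(seed)
    results: Dict[float, float] = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            key = f"{rng.randrange(10**5):05d}"      # fresh 5-digit key per trial
            answer = generate(build_passkey_prompt(n_filler, depth, key))
            hits += key in answer                    # True counts as 1
        results[depth] = hits / trials
    return results
```

To use it, call `passkey_accuracy(my_generate, n_filler, depths=[0.0, 0.25, 0.5, 0.75, 1.0])` with `n_filler` chosen so the prompt actually reaches the claimed length in your tokenizer, and compare the per-depth accuracies rather than a single average.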

20.6 Our angle: extending in terms of γ

There’s a thread tying all of this to Part II. All the methods act on the same degree of freedom —the base θ— and, since our γ depends on θ (γ_Padé, Ch. 15), rescaling θ shifts γ predictably. That lets us frame extension backwards: instead of picking a factor by eye, pick the θ that leaves the extended model at a target γ (our “α=0 rule”: fix θ_design for a desired γ at the new length).
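Framed this way, choosing θ is a one-dimensional root-finding problem. The sketch below assumes only what the text states, that γ moves monotonically with θ, and takes the γ function as an argument. It does not reproduce γ_Padé from Ch. 15: `toy_gamma` is a made-up monotone stand-in used purely to exercise the solver, and all names and default bounds are hypothetical.

```python
import math
from typing import Callable

def solve_theta_for_gamma(gamma_of_theta: Callable[[float], float], target: float,
                          lo: float = 1e3, hi: float = 1e9,
                          tol: float = 1e-6, max_iter: int = 200) -> float:
    """Bisection in log-space for the theta at which gamma_of_theta hits target.

    Assumes gamma_of_theta is monotone on [lo, hi] (either direction).
    """
    g_lo, g_hi = gamma_of_theta(lo), gamma_of_theta(hi)
    if not (min(g_lo, g_hi) <= target <= max(g_lo, g_hi)):
        raise ValueError("target gamma is not bracketed by [lo, hi]")
    increasing = g_hi > g_lo
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)                 # geometric midpoint: theta spans decades
        g = gamma_of_theta(mid)
        if abs(g - target) < tol:
            return mid
        if (g < target) == increasing:
            lo = mid                             # need a larger theta
        else:
            hi = mid                             # need a smaller theta
    return math.sqrt(lo * hi)

def toy_gamma(theta: float, length: int = 32768) -> float:
    """Placeholder only: NOT the book's gamma_Pade, just a monotone function of theta."""
    return theta / (theta + length)

theta_design = solve_theta_for_gamma(toy_gamma, target=0.5)
print(f"theta_design for the toy gamma: {theta_design:.1f}")
```

Swapping `toy_gamma` for the real γ(θ, L) at the new length L is the only change needed; the solver itself makes no claim about whether the resulting θ retrieves well, which is the part still pending validation.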

Caution ✗ Honest: our validation is incomplete

The geometry behind this rule is solid (rescaling θ moves γ monotonically and predictably). But our empirical passkey validation came out incomplete: the experiment crashed for lack of memory (CUDA-OOM) before covering the lengths that would discriminate between conditions; we only confirmed that the native model retrieves within its training length (as expected). So we present the γ rule as derived and geometrically motivated, NOT as validated. We don’t claim it beats YaRN; we leave it as a hypothesis pending an experiment that reproduces cleanly. It’s exactly the kind of thing this book refuses to sell as fact.

Note 🧪 Try it: tafagent

tafagent has a YaRN/RoPE extension planner: give it the model and the target length L, and it hands you the rope_scaling block ready to paste, the γ before and after, the d_horizon at that length, and a verdict (HEALTHY / USABLE-WITH-CARE / NEEDS-FINETUNE / DEGRADES). It’s this theory turned into a practical tool.

20.7 Summary

  • Beyond T, RoPE gives OOD angles → collapse. The cure: remap positions (rescale θ), don’t extrapolate.
  • PI (compresses, needs finetuning), NTK-aware (non-uniform, sometimes finetuning-free), YaRN (ramp + temperature, the standard), LongRoPE (~2M, evolutionary), Self-Extend (training-free), LongRoPE2 (2025, near-lossless).
  • Honest: nominal ≠ usable (“lost in the middle”); large extensions need finetuning; audit with passkey, not perplexity.
  • Ours: since γ depends on θ, extension can be fixed by a target γ (the α=0 rule). Solid geometry; our own validation incomplete (OOM) → hypothesis, not fact.

Next (Chapter 21): the other practical use of γ, and the one that closes Part II: compressing the KV-cache with a window derived from γ, with no parameters to tune.

20.8 Exercises

  1. Why it fails. Why does a model trained at 4k produce gibberish at 16k if you do nothing? (Think about unseen RoPE angles.)
  2. NTK vs PI. What does NTK-aware do with the high frequencies that PI doesn’t respect?
  3. Audit. A model claims “1M context.” What evaluation do you ask for to believe it, and why isn’t perplexity enough?
  4. Honesty. Why do we present our γ rule as “not validated” instead of claiming it works?

References

Chen, Shouyuan, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending Context Window of Large Language Models via Positional Interpolation. https://arxiv.org/abs/2306.15595.
Ding, Yiran et al. 2024. “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” ICML. https://arxiv.org/abs/2402.13753.
Jin, Hongye et al. 2024. LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. https://arxiv.org/abs/2401.01325.
Liu, Nelson F. et al. 2023. “Lost in the Middle: How Language Models Use Long Contexts.” TACL. https://arxiv.org/abs/2307.03172.
Peng, Bowen, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071.
Shang, Ning et al. 2025. LongRoPE2: Near-Lossless LLM Context Window Scaling. https://arxiv.org/abs/2502.20082.