9  Positional information and RoPE

Where we are. We already have the full machinery: attention, multiple heads, the FFN and the scaffolding. But it’s missing a basic sense: order. As it stands, attention doesn’t care what position each word is in. This chapter explains how order gets fed into it, and in particular RoPE, the method used by almost every model today. It’s also the hinge of the book: the geometry of RoPE is where our attention-decay law comes from in Part II.

9.1 The idea in one sentence

Because attention doesn’t distinguish word order, positional information is added to it; the modern method (RoPE) does this by rotating vectors according to their position, so that attention perceives the relative distance between tokens.

9.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:

  • Permutation invariance. Definition: attention without position treats its input as a set; if you shuffle the words, each word gets the same output as before, just in the shuffled order (strictly speaking, permutation equivariance). In the transformer: it’s the blind spot that forces us to inject position; without position, the model would see text as a bag of words, not a sequence.
  • Positional encoding. Definition: the order information that’s added so that attention can tell each token’s position apart. In the transformer: it gives the model the sense of order it lacks on its own.
  • Absolute position (sinusoidal / learned). Definition: assigning each position a “fingerprint” (either a fixed pattern of sines/cosines or a learned vector) that’s added to the embedding. In the transformer: the first solution; the learned variant fixes a maximum length and doesn’t extrapolate beyond it.
  • Relative position. Definition: encoding the distance between tokens (i−j) instead of their absolute position (Shaw et al. 2018). In the transformer: it reflects that what almost always matters is “how far apart they are”, not “being at position 5”.
  • RoPE (Rotary Position Embedding). Definition: instead of adding position, it rotates each query and key by an angle proportional to its position (Su et al. 2021). In the transformer: today’s standard; it injects relative position directly into the attention score without touching the embeddings.
  • Base θ (theta). Definition: the parameter (default 10000) that sets the rotation frequencies \(\omega_i = \theta^{-2i/d}\), where i is the index of the dimension pair and d is the head dimension. In the transformer: it controls the range of speeds: low-index pairs rotate fast (short reach), high-index ones rotate slowly (long reach). Rescaling it (NTK-aware, YaRN) extends the context.
  • Relative position in the score. Definition: the dot product of two rotated vectors depends only on the relative position (m − n); the absolute ones cancel out. In the transformer: it’s the property that makes RoPE elegant and “free”.
  • ALiBi. Definition: an alternative that doesn’t rotate, but instead subtracts from the score a penalty proportional to the distance (Press et al. 2022). In the transformer: it biases toward the recent and extrapolates very well to longer sequences.
  • Attention decay (A(d) ∝ d^−γ). Definition: the contested idea that attention decays with the distance between tokens. In the transformer: it’s the hinge into Part II, where we measure and predict it as a power law from the geometry of RoPE.

With this in mind, let’s see why attention needs order to be injected into it.

9.3 What it’s for (its role in the transformer)

The attention from Ch. 4 has a surprising blind spot: it’s permutation invariant. Its scores depend only on the content of the tokens, so if you shuffle the words, each word receives exactly the same output as before; the outputs simply come out in the shuffled order. To pure attention, “the dog bites the man” and “the man bites the dog” are indistinguishable.

Job of positional encoding: to inject the order that attention doesn’t see. Without it, the model would understand a text as a bag of words, not a sequence.
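The blind spot can be checked numerically. The following is a minimal sketch, not the attention of Ch. 4 in full: it assumes a single head with q = k = v = x (no learned projections, no causal mask), uses made-up three-dimensional embeddings, and is written in plain Python rather than PyTorch so that it runs without dependencies.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X):
    """Single-head self-attention with q = k = v = x and no positional signal."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return out

# Toy embeddings (made-up numbers, for illustration only).
emb = {"dog": [1.0, 0.0, 0.5], "bites": [0.0, 1.0, -0.5], "man": [0.3, 0.7, 1.0]}

a = attention([emb[w] for w in ["dog", "bites", "man"]])
b = attention([emb[w] for w in ["man", "bites", "dog"]])

def close(u, v):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(u, v))

# "dog" is first in one sentence and last in the other, yet its output is the same.
same_dog = close(a[0], b[2])
same_man = close(a[2], b[0])
print(same_dog, same_man)
```

Each word ends up with the same vector in both sentences; nothing in the output records who bit whom.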

9.4 The ways to give position (a small evolution)

  • Absolute sinusoidal (original Transformer): each position is assigned a fixed “fingerprint” of sine/cosine waves of different frequencies, which is added to the embedding. It isn’t learned.
  • Learned absolute (GPT-2, BERT): a learned vector for each position. Simple, but it fixes a maximum length and doesn’t extrapolate beyond it.
  • Relative (Shaw et al. 2018): encodes the distance between tokens (i−j) instead of the absolute position, because what almost always matters is “how far apart they are”, not “being at position 5”.
  • RoPE (Su et al. 2021): the idea that won out. We’ll look at it in detail.
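To make the first item of this list concrete, here is a sketch of the sinusoidal fingerprint of the original Transformer: sine on the even dimensions and cosine on the odd ones, with frequencies set by a base of 10000. The function name, the four-dimensional embedding and its values are illustrative assumptions.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Fixed position fingerprint: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

token_embedding = [0.1, -0.2, 0.3, 0.4]   # made-up values
pe = sinusoidal_encoding(pos=5, d_model=4)

# Absolute schemes *add* the fingerprint to the embedding before attention.
x = [e + p for e, p in zip(token_embedding, pe)]
print(x)
```

A learned absolute encoding would replace `sinusoidal_encoding` with a lookup into a trained table that has one row per position, which is why it cannot go beyond the table’s length.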

9.5 RoPE: encoding position by rotating

RoPE (Rotary Position Embedding) does something elegant: instead of adding position, it rotates each query and key vector by an angle proportional to its position.

🧩 Analogy: the hands of a clock. Imagine each word carries a hand, and its position turns that hand by a certain angle. When two words are compared, what matters is the angle between their hands, and that angle depends only on how many positions separate them, not on where they are in absolute terms. (And there isn’t one hand, but many turning at different speeds.)

That image is literal. Because of how rotations work, the dot product of two rotated vectors depends only on the relative position (m − n): the absolute positions cancel out. So RoPE gets something very useful for free: it injects relative position directly into the attention score, without adding anything to the embeddings.
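To see why the absolute positions cancel, it is enough to look at a single pair of dimensions. A short sketch, writing \(R(\alpha)\) for the 2×2 rotation by angle \(\alpha\) and \(\omega\) for the frequency of that pair (notation introduced here for illustration), with the query at position m and the key at position n:

\[
R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix},
\qquad
R(\alpha)^{\top} R(\beta) = R(\beta - \alpha)
\]

\[
\langle R(m\omega)\,q,\; R(n\omega)\,k \rangle
= q^{\top} R(m\omega)^{\top} R(n\omega)\,k
= q^{\top} R\big((n-m)\,\omega\big)\,k
\]

The positions m and n appear only through their difference. The full attention score is the sum of this expression over all pairs, each with its own frequency.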

The details worth keeping in mind:

  • Dimensions are rotated in pairs, each pair at a different speed.
  • The speed is set by the base θ (theta, default 10000), via the frequencies \(\omega_i = \theta^{-2i/d}\), where i numbers the pair of dimensions (0, 1, 2…) and d is the head dimension.
  • What θ does: it controls the range of speeds. Low-index pairs rotate fast (capturing short distances); high-index ones rotate slow (capturing long distances).
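  • RoPE only touches the queries and keys (q, k), never the values.

import torch

d, base = 64, 10000                      # head dimension and RoPE base θ
two_i = torch.arange(0, d, 2)            # 0, 2, 4, …: the exponent numerator 2i for pair i
inv_freq = 1.0 / (base ** (two_i / d))   # ω_i = θ^(-2i/d): rotation speed of each pair
m = 5                                    # an example position
angles = m * inv_freq                    # at position m, pair i is rotated by the angle m·ω_i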
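The rotation itself, and the “relative for free” property, can be sketched in a few lines. This version is plain Python (no PyTorch) and treats each pair of dimensions as a complex number, so that rotating by an angle is a multiplication by e^(i·angle). The pairing of adjacent dimensions is an assumption made here; real implementations differ in which dimensions they pair. The vectors q and k are made-up values.

```python
import cmath
import math

def rope(x, pos, base=10000.0):
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by the angle pos * omega_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        omega = base ** (-2 * i / d)            # frequency of pair i
        z = complex(x[2 * i], x[2 * i + 1])     # view the pair as a complex number
        z *= cmath.exp(1j * pos * omega)        # rotate it by pos * omega
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.3]   # made-up query
k = [1.5, 0.2, -0.7, 1.1]   # made-up key

s1 = dot(rope(q, 3), rope(k, 1))        # positions 3 and 1:     m - n = 2
s2 = dot(rope(q, 102), rope(k, 100))    # positions 102 and 100: m - n = 2
s3 = dot(rope(q, 3), rope(k, 2))        # positions 3 and 2:     m - n = 1

same_offset = math.isclose(s1, s2, abs_tol=1e-9)
other_offset = math.isclose(s1, s3, abs_tol=1e-9)
print(same_offset, other_offset)
```

Shifting both tokens by 99 positions leaves the score unchanged, while changing the gap between them changes it. The values never pass through `rope`, in line with the last bullet above.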

RoPE is the standard today: LLaMA, GPT-NeoX, Qwen, Gemma… all use it.

9.6 A contrast: ALiBi

Not everyone rotates. ALiBi (Press et al. 2022) takes another route: it doesn’t add position to the vectors, but instead subtracts from the score a penalty proportional to the distance (each head with its own slope). The result: a bias toward the recent and very good extrapolation to sequences longer than those seen in training. It’s an elegant alternative worth knowing.
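As a sketch of what that penalty looks like, the snippet below builds the bias for one head. It uses the geometric slopes described in the ALiBi paper for a power-of-two number of heads, and it folds the causal mask into the same matrix as −inf entries, which is a presentation choice made here. The function names are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the scores: -slope * (i - j) for key j <= query i, else masked."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)           # one slope per head
bias = alibi_bias(4, slopes[0])    # bias matrix of the first head, 4 tokens
for row in bias:
    print(row)
```

The bias is added to the q·k scores before the softmax, so tokens further back are penalized more; heads with a small slope look further back than heads with a large one.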

9.7 The bridge to Part II: decay (and an honest dispute)

Su et al. (2021) claimed that RoPE makes attention decay as distance increases, which is desirable in principle. But this is contested: work such as HoPE (Chen et al. 2024) argues that this long-range decay is neither guaranteed nor desirable at scale (LLMs need to retrieve distant information; in practice they show U-shaped patterns, not monotonic decay).

🧭 Note: here our territory begins

This is exactly the gap our Part II tackles. Instead of claiming “RoPE decays” or “doesn’t decay”, we measure and predict it: decay follows a power law A(d) ∝ d^−γ, where A(d) is the mean attention between tokens at distance d (here d is the distance between tokens, not the head dimension from before), and the exponent γ can be computed from the geometry of RoPE (the base θ, the frequencies, the aliasing). From there come concrete tools: compressing the model’s memory and extending its context. RoPE is, therefore, the hinge between the fundamentals and what you’ll only find here.
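A power law is a straight line in log-log coordinates: log A(d) = −γ · log d + c. One common way to read an exponent off a curve is therefore a least-squares fit of log A against log d. The sketch below does this on a synthetic curve with a known exponent; it is an illustration of the idea, not necessarily the procedure used in Part II, and the data are generated, not measured from any model.

```python
import math

def fit_power_law(distances, attention):
    """Least-squares slope of log A against log d; returns the exponent gamma."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(a) for a in attention]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx

# Synthetic curve with a known exponent (not a measurement).
true_gamma = 0.8
dist = list(range(1, 513))
A = [2.0 * d ** -true_gamma for d in dist]

gamma = fit_power_law(dist, A)
print(gamma)
```

On real attention maps the points would scatter around the line, and the quality of the fit is itself evidence for or against the power-law description.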

A related note: since RoPE’s frequencies depend on θ, rescaling θ or the frequencies derived from it (NTK-aware, YaRN) lets a model work beyond its training length. We’ll see this in Part II.
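The basic effect of a larger base can be seen from the wavelengths, that is, the number of positions each pair needs to complete one full turn (2π / ω_i). This is a sketch of that effect only, not of NTK-aware scaling or YaRN themselves; the second base value is chosen arbitrarily for illustration.

```python
import math

def wavelengths(d, base):
    """Positions needed for each pair to complete one full turn: 2*pi / omega_i."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

d = 64
default = wavelengths(d, 10_000)
larger = wavelengths(d, 500_000)    # illustrative larger base

# The fastest pair is unchanged (omega_0 = 1); every other pair turns more slowly.
print(default[0] == larger[0], larger[-1] / default[-1])
```

Slower rotations mean that positions far beyond the original range still map to angles the slow pairs have not yet wrapped around.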

9.8 Summary

  • Attention is blind to order (permutation invariant); positional encoding injects order.
  • Evolution: sinusoidal → learned → relative → RoPE.
  • RoPE rotates q/k by an angle ∝ position; the dot product depends only on relative position (the absolute angles cancel out). Base θ=10000; low-index pairs rotate fast (short reach), high ones slow (long reach). It doesn’t touch the values.
  • ALiBi is the alternative (distance penalty, good extrapolation).
  • RoPE’s decay is contested, and it’s exactly what our Part II measures and predicts as a power law A(d) ∝ d^−γ.

Next (Chapter 10): we now have all the pieces and a sense of order. Time to build the complete transformer block and see what happens when we stack it in depth.

9.9 Exercises

  1. Order blindness. Explain why, without position, attention gives the same result for “the dog bites the man” and “the man bites the dog”.
  2. Relative for free. Why does the dot product of two RoPE-rotated vectors depend only on the difference in positions and not on the absolute positions? (Hint: think about the angle between two clock hands.)
  3. θ and reach. If you raise the base θ, what happens to the rotation speeds and, therefore, to the range of distances the model distinguishes well?
  4. q, k, and v? To which vectors does RoPE apply its rotation, and to which not? Why does that make sense?

References

Chen, Lv, Luan, Wang, and Liu. 2024. “HoPE: A Novel Positional Encoding Without Long-Term Decay.” https://arxiv.org/abs/2410.21216.
Press, Ofir, Noah A. Smith, and Mike Lewis. 2022. “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” ICLR. https://arxiv.org/abs/2108.12409.
Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. 2018. “Self-Attention with Relative Position Representations.” NAACL. https://arxiv.org/abs/1803.02155.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” https://arxiv.org/abs/2104.09864.