19  A taxonomy of attention mechanisms

Where we are. So far we’ve talked about dense attention (every token with every other). But its O(n²) cost has spawned a whole family of variants that try to cut either the compute or the memory. This chapter is the comparative map —classics plus the latest from 2024–2026— with something almost nobody states plainly: what actually won and what stayed an attempt. It’s your reference for not getting lost in the acronyms.

19.1 The idea in one sentence

Nearly every attention variant attacks one of two costs —the O(n²) compute or the O(n) KV-cache memory— and it pays to know which one each attacks, and at what price.

19.2 Key concepts and their role in the transformer

Before we get to the map, let’s define the acronyms and families of this chapter and what each one is for inside a transformer:

  • Dense attention (full attention). Definition: every token is compared with all the others, cost O(n²). In the transformer: it’s the original attention; the reference pattern that every variant tries to make cheaper without losing quality.
  • Sparse attention. Definition: each token attends to only a subset of the others (a local window, spaced-out positions, a few global ones). In the transformer: it trims compute, but can miss dependencies outside the pattern; it can be fixed (by hand) or learnable (NSA).
  • Linear / kernelized attention. Definition: rewrites softmax(QKᵀ)V as φ(Q)(φ(K)ᵀV) to avoid the n×n matrix → cost O(n). In the transformer: trades exactness for speed; at scale it hasn’t dethroned dense attention.
  • GQA / MQA / MLA. Definition: schemes that store fewer distinct keys/values (groups share K/V, a single K/V, or a compressed latent cache). In the transformer: they attack the KV-cache memory, not the compute; they remain full, exact attention.
  • FlashAttention. Definition: an algorithm that computes exact attention without ever materializing the n×n matrix (it tiles and recomputes it). In the transformer: the same attention, better computed; it’s the engine training almost everything today.
  • KV-cache. Definition: the memory that stores the keys and values already computed so they aren’t recomputed during generation. In the transformer: it grows O(n) with the context and is the bottleneck that GQA/MLA aim to relieve (Ch. 20).
  • SSM / Mamba. Definition: state-space models with linear-time recurrence, no attention matrix. In the transformer: they aren’t “cheap attention” but another family; they gain ground mostly as hybrids (Jamba).
  • Exact vs. approximate. Definition: an exact variant gives the same result as dense attention; an approximate one sacrifices some fidelity. In the transformer: it’s the axis that decides whether a technique “gives away” quality or not —the distinction most often confused.

With the vocabulary pinned down, let’s turn to the problem they all attack and then to the comparative map.

19.3 The underlying problem

Dense attention compares every token with all the others: O(n²) in time and memory. And during generation, the KV-cache grows O(n) (Ch. 12). Doubling the context quadruples the attention compute and doubles the cache memory. Everything that follows attacks one of those two fronts.

19.4 The families

1. Fixed-pattern sparse (approximate). Each token attends only to a predefined subset: a local window, spaced positions, a few global tokens. Cheap, but it can miss dependencies outside the pattern. Examples: Sparse Transformer (Child et al. 2019), Longformer (window + global, Beltagy et al. 2020), BigBird (Zaheer et al. 2020), and Mistral’s sliding window (the receptive field grows with depth).

2. Linear / kernelized — O(n) (approximate). They rewrite softmax(QKᵀ)V as φ(Q)(φ(K)ᵀV), avoiding the n×n matrix → linear cost. Linear Transformers (Katharopoulos et al. 2020), Performer (Choromanski et al. 2021). They trade exactness for speed.

3. Low-rank (approximate). Linformer (Wang et al. 2020) projects K and V down to a low dimension, assuming the attention matrix is low-rank → ≈O(n).

4. Head sharing — reduces the KV, not the compute (EXACT). They keep full attention but store fewer distinct K/V: MQA (a single K/V, Shazeer 2019), GQA (groups share K/V; standard in Llama-2/3), and MLA (DeepSeek-AI 2024), which compresses the KV cache into a latent vector (~93% less memory).

5. Exact but efficient — FlashAttention. Here comes the most important nuance of the chapter. FlashAttention (Dao et al. 2022; Dao 2023; Shah et al. 2024) computes exact attention —bit-for-bit identical to dense, with no approximation— but never materializes the n×n matrix in memory: it tiles it (tiling) and recomputes it, exploiting the GPU’s memory hierarchy. It’s the engine behind almost all training today.

6. Recent (2025): learnable sparsity. Native Sparse Attention (NSA) (Yuan and (DeepSeek) 2025): instead of a hand-crafted fixed pattern, the sparsity is learned end to end and aligned to the hardware. It promises to match full attention with a big speedup on long contexts.

7. Beyond attention. Mamba (Gu and Dao 2023): a state-space model with linear-time recurrence, no attention. Jamba (Lieber and (AI21) 2024): a hybrid that interleaves Transformer and Mamba layers. RetNet, RWKV: linear-RNN-style alternatives.

19.5 The table

Table 19.1: Attention mechanisms and alternatives
Mechanism Idea Cost vs O(n²) Exact/Approx. Used in
Full attention everyone with everyone O(n²) Exact original Transformer
Sparse Transformer fixed/strided pattern O(n√n) Approx. GPT-3 (layers)
Longformer window + global O(n·w) Approx. long documents
BigBird random+window+global O(n) Approx. long-form QA
Sliding-Window local window per layer O(n·w) Approx. Mistral
Linear / Performer kernel φ(Q)(φ(K)ᵀV) O(n) Approx. research
Linformer low-rank projection O(n·k) Approx. research
MQA 1 K/V head O(n²), ↓KV Exact PaLM
GQA K/V groups O(n²), ↓KV Exact Llama-2/3
MLA compressed latent KV O(n²), ↓↓KV (~93%) Exact DeepSeek-V2
FlashAttention tiling, no n×n matrix O(n²) compute, O(n) memory Exact almost all training
NSA (2025) learnable hierarchical sparse sub-O(n²) Approx. (trained) DeepSeek
Mamba recurrent SSM O(n) no attention Mamba, hybrids
Jamba Transformer+Mamba+MoE mixed mixed Jamba

(Table notation: n = sequence length; w = local window size; k = reduced projection dimension; φ = the “kernel function” that rewrites attention.)

19.6 The honest verdict: what actually won

After years of variants, what really dominates is surprising for how un-exotic it is:

  • FlashAttention (exact, efficient) dominates training. It sacrifices no quality: it’s the same attention, better computed.
  • GQA and MLA dominate KV reduction at inference —they attack memory, not compute—.
  • The sliding window survives in some production models (Mistral).
  • The classic linear and sparse approximations (Performer, Linformer…) did NOT dethrone full attention at the frontier, despite years of attempts: the quality loss isn’t worth it at scale.
  • The most serious challenger is SSMs/Mamba, and they almost always gain ground as hybrids (Jamba), not as a pure replacement.
  • NSA (2025) is the bet on learnable sparsity; promising, but the evidence is from the team itself and broad independent validation is still missing (honest).
Warning⚠ Five confusions to avoid
  1. FlashAttention is EXACT, not approximate — it only reorganizes the compute in memory. (The most common error of the chapter.)
  2. GQA/MQA/MLA reduce the KV cache, NOT the O(n²) compute — they’re still full attention.
  3. “Linear” doesn’t imply “better”: it trades exactness for cost; at scale it hasn’t won.
  4. Mamba has no attention matrix — it’s not “cheap attention,” it’s another family (stateful recurrence).
  5. Fixed sparsity (hand-crafted pattern) ≠ learnable sparsity (NSA, the 2025 trend).

19.7 Connection to our Part II

All of this is the engineering answer to the problem we’re studying: the cost of attention over distance. KV reduction (GQA/MLA) is exactly where our D_f window (the number of KV tokens you really need to keep, computed from γ) comes into play (Ch. 20): instead of picking the KV budget by eye or by heuristic, you derive it from the decay exponent.

Note🧪 Try it — tafagent

tafagent detects whether a model uses a sliding window (its η-regime flags the SWA case) and computes the KV memory at your target length. Handy for anticipating the real cost before serving a model.

19.8 Summary

  • Two costs to attack: O(n²) compute and O(n) KV memory.
  • Approximate variants (fixed sparse, linear, low-rank): cheaper, but lose some quality → they haven’t won at scale.
  • Exact, memory-saving variants: FlashAttention (compute, training) and GQA/MQA/MLA (KV, inference) — this is what dominates today.
  • Mamba/SSMs are the serious challenger, especially as a hybrid (Jamba); NSA
    1. bets on learnable sparsity (with validation still pending).
  • Watch out: FlashAttention is exact; GQA reduces KV, not compute; “linear” ≠ better; Mamba is another family.

Next (Chapter 19): one of the star uses of all this theory — extending a model’s context beyond its training length (NTK/YaRN and the γ-based rule).

19.9 Exercises

  1. Exact vs approximate. Why doesn’t FlashAttention appear as “approximate” in the table if it’s “efficient”? What exactly does it reorganize?
  2. KV vs compute. A colleague says “we use GQA to lower the O(n²) cost.” Correct them: what does GQA actually reduce?
  3. What won. Why do you think linear attention, cheaper in theory, didn’t replace full attention at scale?
  4. Another family. How does Mamba differ from “cheaper attention”?

References

Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. https://arxiv.org/abs/2004.05150.
Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509.
Choromanski, Krzysztof et al. 2021. “Rethinking Attention with Performers.” ICLR. https://arxiv.org/abs/2009.14794.
Dao, Tri. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. https://arxiv.org/abs/2307.08691.
Dao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS. https://arxiv.org/abs/2205.14135.
DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. https://arxiv.org/abs/2405.04434.
Gu, Albert, and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752.
Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” ICML. https://arxiv.org/abs/2006.16236.
Lieber, Opher, and others (AI21). 2024. Jamba: A Hybrid Transformer-Mamba Language Model. https://arxiv.org/abs/2403.19887.
Shah, Jay et al. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. https://arxiv.org/abs/2407.08608.
Shazeer, Noam. 2019. Fast Transformer Decoding: One Write-Head Is All You Need. https://arxiv.org/abs/1911.02150.
Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-Attention with Linear Complexity. https://arxiv.org/abs/2006.04768.
Yuan, Jingyang, and others (DeepSeek). 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. https://arxiv.org/abs/2502.11089.
Zaheer, Manzil et al. 2020. “Big Bird: Transformers for Longer Sequences.” NeurIPS. https://arxiv.org/abs/2007.14062.