44  R3 · Glossary

What it is. Every important term in the book, defined at a glance, with the chapter where it is explained in depth. For quick lookup; the developed definitions and their analogies are in the chapters.

44.1 A

  • Massive activation. A few coordinates of the residual stream with enormous values (~10⁵× the typical magnitude); the cause of sinks. (Ch. 17)
  • Activation patching (causal tracing). Transplanting an activation between a clean pass and a corrupted one to isolate which component is causally responsible. (Ch. 37)
  • Adapter. Small bottleneck module inserted into each layer; precursor of LoRA. Adds latency. (Ch. 28)
  • Alignment. Making the model follow instructions and conform to what we prefer (SFT + RLHF/DPO). (Ch. 27)
  • Aliasing. When a RoPE pair completes a full turn and can no longer distinguish distances → loss of positional resolution. (Ch. 14)
  • Hallucination. Output that is fluent, confident, and false (or unsupported). Partly intrinsic to next-token prediction. (Ch. 40)
  • ANN (approximate nearest neighbors). Finding the closest vectors without comparing against all of them (HNSW, FAISS). (Ch. 31)
  • Attention. Weighted mix in which each token gathers information from the others by affinity (Q·K). (Ch. 4)
  • Linear attention. Replacing the softmax with a kernel that lets you reassociate the computation → O(n) cost; loses recall. (Ch. 34)
  • γ atlas. γ measured across 42 models from 4 families: a cross-architecture coordinate. (Ch. 16)
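The Attention entry above can be made concrete with a minimal sketch in plain Python. It assumes a single head, no causal mask, and no learned projections; the names `softmax` and `attention` are illustrative, not taken from the chapters. It also applies the 1/√d_k scaling defined under E.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Affinity of this query with every key (Q·K), scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```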

44.2 B

  • Continuous batching. Grouping requests at the iteration level (admit/release at each step) → keeps the GPU full. (Ch. 36)
  • Beam search. Decoding that keeps several hypotheses; good for closed-ended tasks. (Ch. 12)
  • BPE (Byte-Pair Encoding). Tokenization that merges the most frequent symbol pairs. (Ch. 2)
  • Bradley-Terry. The pairwise preference model behind the reward model’s loss: the probability that one response beats another is the sigmoid of their reward difference. (Ch. 27)
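As an illustration of the BPE entry above, here is a sketch of a single merge step on a toy corpus. The corpus, the word-frequency representation, and the function names `most_frequent_pair` and `merge_pair` are assumptions made for the example; production tokenizers add byte-level handling and pre-tokenization that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): each word is a tuple of symbols mapped to its count.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(corpus)        # the pair seen most often
corpus_after = merge_pair(corpus, best)  # vocabulary after one merge
```

Repeating the two calls until a target vocabulary size is reached gives the usual training loop.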

44.3 C

  • Head. One of the parallel attentions in multi-head; tends to specialize. (Ch. 5)
  • Induction head. A head that completes patterns of the form “A B … A → B”; a candidate mechanism for in-context learning. (Ch. 5, 30)
  • KV cache. Memory of already-computed keys/values so they aren’t recomputed when generating. (Ch. 12, 20)
  • Inter-layer CKA. Representational similarity between layers; its re-ascent predicts grokking (our result). (Ch. 24)
  • Chain-of-Thought (CoT). Asking for intermediate reasoning steps; emergent with scale. (Ch. 30)
  • Chunking. Splitting documents into fragments before indexing them for RAG. (Ch. 31)
  • CLIP. Image and text encoders trained contrastively into a shared space → zero-shot. (Ch. 33)
  • Residual connection. Adding the input to the output of a block; keeps the gradient alive. (Ch. 7)
  • Contrastive (learning). Pulling positive pairs together and pushing negatives apart in the embedding space. (Ch. 26)
  • Quantization. Representing weights/activations with fewer bits (scale + zero-point). (Ch. 35)
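The "scale + zero-point" in the Quantization entry can be sketched as follows. This assumes an asymmetric, per-tensor scheme over unsigned integers; the names `quantize` and `dequantize` are illustrative, and real libraries differ in rounding and range-calibration details.

```python
def quantize(values, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1] using a scale and a zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is represented exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [scale * (qi - zero_point) for qi in q]

weights = [-0.8, 0.0, 0.3, 1.2]
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the precision given up in exchange for fewer bits.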

44.4 D

  • D_f (KV window). The number of KV tokens you really need to keep, derived from γ. (Ch. 20)
  • Decode. Token-by-token generation phase; bandwidth-bound; the serving bottleneck. (Ch. 36)
  • Constrained decoding. Masking invalid tokens + renormalizing → guaranteed format (not truth). (Ch. 29)
  • Speculative decoding. A small draft model proposes tokens and the large model verifies them in parallel (same output distribution, faster). (Ch. 29)
  • Distillation. Training a small student to imitate the soft targets of a large teacher. (Ch. 35)
  • DPO. Direct preference optimization: optimizes the RLHF objective directly on preference pairs, without an RL loop or an explicit reward model. (Ch. 27)
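The "masking + renormalizing" step of the Constrained decoding entry can be sketched like this. The function name `constrained_probs` and the idea of passing the valid set as token ids are assumptions for the example; in practice the allowed set would come from a grammar or schema at each step.

```python
import math

def constrained_probs(logits, allowed):
    """Softmax restricted to the allowed token ids; every other token gets probability 0."""
    m = max(logits[i] for i in allowed)
    exps = {i: math.exp(logits[i] - m) for i in allowed}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

# Token 3 has the highest logit but is not valid in this (hypothetical) format.
probs = constrained_probs([2.0, 1.0, 0.0, 5.0], allowed={0, 1})
```

The format is guaranteed because invalid tokens can never be sampled; nothing here checks whether the content is true.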

44.5 E

  • Embedding. Vector representing the meaning of a token or text. (Ch. 3, 26)
  • Encoder / decoder. Bidirectional encoder (understand) / causal decoder (generate). (Ch. 10)
  • Erratum. An error of our own, flagged, corrected, and re-proved (e.g. C_V /4→/12). (Ch. 22, 38)
  • 1/√d_k scaling. Factor that keeps dot products from saturating the softmax. (Ch. 4)

44.6 F

  • FFN (feed-forward). Network that processes each token separately; acts as a key-value memory. (Ch. 6)
  • FlashAttention. Exact attention without materializing the n×n matrix (tiling + online softmax). (Ch. 34)
  • Residual stream. The shared “lane” that each component reads from and writes to; the communication channel. (Ch. 3, 37)
  • Folklore. A popular but unjustified belief (or one already contradicted). (Ch. 38)
  • Fractional (order). A lens that reads attention as Lévy diffusion of order (γ−1)/2. (Ch. 23)

44.7 G

  • γ (gamma). Decay exponent of attention with distance: \(A(d)\propto d^{-\gamma}\). (Ch. 15)
  • γ_Padé. Prediction of γ from geometry (θ, T), without training. (Ch. 15)
  • GQA / MQA / MLA. Sharing K/V across heads (GQA, MQA) or compressing them into a low-rank latent (MLA) to shrink the cache (not the compute). (Ch. 18)
  • Grokking. Late, sudden generalization after memorizing. (Ch. 24)
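Since \(A(d)\propto d^{-\gamma}\) becomes a straight line in log-log coordinates, one generic way to read off γ from a measured attention profile is a least-squares slope. The sketch below is only that generic fit, on synthetic data; it is not claimed to be the measurement protocol of Ch. 15 or R4, and `fit_gamma` is a hypothetical name.

```python
import math

def fit_gamma(distances, attention):
    """Least-squares slope of log A against log d; returns gamma = -slope."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(a) for a in attention]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic profile that follows an exact power law with gamma = 1.5.
distances = list(range(1, 51))
profile = [3.0 * d ** -1.5 for d in distances]
gamma_hat = fit_gamma(distances, profile)
```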

44.8 H

  • Hagedorn (γ=1). The phase boundary between looking far (Phase A) and concentrating (Phase B). (Ch. 21)
  • HBM / SRAM. The GPU’s large, slow / tiny, fast memory; the traffic between them dominates the cost of attention. (Ch. 34)

44.9 I — L

  • In-context learning (ICL). Learning a task from examples in the prompt, without updating weights. (Ch. 30)
  • Jailbreak. A prompt that bypasses the safety training. (Ch. 40)
  • LayerNorm / RMSNorm. Normalization that rescales the activation → stability. (Ch. 7)
  • Scaling laws. Loss drops predictably with size, data, and compute (Kaplan, Chinchilla). (Ch. 11, 25)
  • Logit lens. Projecting an intermediate activation through the output (unembedding) layer to read the model’s “bet” layer by layer. (Ch. 37)
  • LoRA / QLoRA. Low-rank weight delta (\(B{=}0\) at start, mergeable into the base weights) / the same delta over a 4-bit quantized base. (Ch. 28)
  • “Lost in the middle”. Models make the poorest use of information placed in the middle of the context. (Ch. 31)
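To illustrate the LoRA entry, the sketch below shows why \(B{=}0\) leaves the model unchanged at the start and how the delta is fused into the base weight afterwards. Shapes are assumed to be W: d_out×d_in, B: d_out×r, A: r×d_in; `scale` is a stand-in for the usual α/r factor, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w, b, a, scale=1.0):
    """Return the fused weight W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen base weight (2x3)
a = [[1.0, 0.0, 1.0]]                    # rank-1 factor A (1x3)
b_init = [[0.0], [0.0]]                  # B starts at zero (2x1)
b_trained = [[1.0], [2.0]]               # made-up values after training

at_start = merge_lora(w, b_init, a)      # identical to w
fused = merge_lora(w, b_trained, a)      # single matrix, no extra latency
```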

44.10 M — N

  • Causal mask. Prevents a token from looking at future ones (generation). (Ch. 9, 10)
  • Memory-bound. Limited by moving data, not by computing (the case of attention). (Ch. 34)
  • Multi-head. Several attentions in parallel + output projection \(W^O\). (Ch. 5)
  • Numerology. A formula that fits the data with no mechanism to explain it. (Ch. 38)

44.11 O — P

  • Online softmax. Computing the softmax block by block with a running max and sum. (Ch. 34)
  • PagedAttention. Managing the KV cache in virtual-memory-style blocks → near-zero waste. (Ch. 34, 36)
  • PEFT. Efficient fine-tuning: freeze the base and train few parameters. (Ch. 28)
  • Perplexity. \(e^{\text{loss}}\), with the loss taken as the mean cross-entropy in nats: “how many tokens the model hesitates among”. (Ch. 11)
  • Pruning. Removing weights (structured = blocks; unstructured = scattered). (Ch. 35)
  • Polysemanticity. One neuron firing for many disparate concepts (caused by superposition). (Ch. 37)
  • Prefill. Phase that processes the whole prompt in parallel; compute-bound; sets the TTFT. (Ch. 36)
  • Prompt injection. Hostile instructions smuggled in as data; worse in agents. (Ch. 30, 40)
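The "running max and sum" of the Online softmax entry can be sketched in a few lines. This version only accumulates the normalization statistics block by block and then normalizes; FlashAttention additionally rescales a running output, which is left out. The name `online_softmax` and the list-of-blocks input are assumptions, and blocks are assumed non-empty.

```python
import math

def online_softmax(blocks):
    """Softmax over the concatenation of `blocks`, gathering statistics one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(x - new_max) for x in block))
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for block in blocks for x in block]

def reference_softmax(xs):
    """Ordinary two-pass softmax, used only to check the result."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

blocks = [[1.0, 2.0], [3.0], [0.5, 4.0]]
streamed = online_softmax(blocks)
direct = reference_softmax([x for block in blocks for x in block])
```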

44.12 Q — R

  • QK / OV (circuits). QK decides where to attend; OV decides what to write. (Ch. 37)
  • RAG. Retrieving evidence and conditioning generation on it (non-parametric memory). (Ch. 31)
  • ReAct. The agent loop: Thought → Action → Observation. (Ch. 32)
  • Receipt. The proof backing a claim: Lean (algebra) or data. (Ch. 38)
  • RLHF. Aligning with a reward model + PPO, on a KL leash. (Ch. 27)
  • RoPE. Rotary positional encoding: rotates pairs of dimensions according to position. (Ch. 8)
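The RoPE entry can be illustrated with a short sketch that rotates adjacent pairs of dimensions and checks the property that makes it useful: the query-key dot product depends only on the relative offset. Pairing adjacent dimensions and using a base of 10000 are assumptions here (some implementations pair dimension i with i + d/2), and `rope` and `dot` are illustrative names.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near_start = dot(rope(q, 5), rope(k, 3))    # offset 2
further_on = dot(rope(q, 12), rope(k, 10))  # same offset 2, shifted by 7
```

The two scores agree up to floating-point error, since only the difference in positions enters the rotation between q and k.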

44.13 S — Z

  • SAE (sparse autoencoder). A dictionary that decomposes activations into monosemantic features; undoes superposition. (Ch. 37)
  • Self-consistency. Sampling several reasoning chains and voting for the majority. (Ch. 30)
  • Sink (attention sink). Attention mass that piles up on a few low-information tokens (e.g. BOS). (Ch. 17)
  • Superposition. Storing more features than dimensions in nearly orthogonal directions. (Ch. 37)
  • Temperature (τ). Divides the logits by τ before the softmax: low = cautious, high = adventurous. (Ch. 12, 29)
  • Token. Integer identifier of a chunk of text (subword). (Ch. 2)
  • TTFT / TPOT. Time to first token (prefill) / time per output token (decode). (Ch. 36)
  • Verified / derived. A claim with a receipt (formal proof or reproducible data). (Ch. 38)
  • ViT (Vision Transformer). Transformer over image patches treated as tokens. (Ch. 33)
  • Z (partition function). \(Z=\mathrm{Li}_\gamma(e^{-\lambda})\), in the thermodynamic lens. (Ch. 21)
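A small sketch of the Temperature entry: dividing the logits by τ before the softmax sharpens the distribution when τ is low and flattens it when τ is high. The function name `softmax_with_temperature` and the sample logits are made up for the example.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau must be positive."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cautious = softmax_with_temperature(logits, tau=0.5)     # mass concentrates on the top token
adventurous = softmax_with_temperature(logits, tau=2.0)  # mass spreads out
```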

Next reference (R4): how to measure your own γ and the open datasets on Hugging Face for reproducing it.