44 R3 · Glossary
What it is. Every important term in the book, defined at a glance, with the chapter where it is explained in depth. Use it for quick lookup; the full definitions and their analogies are in the chapters.
44.1 A
- Massive activation. A few coordinates of the residual stream with enormous values (roughly 10⁵× the typical magnitude); the cause of sinks. (Ch. 17)
- Activation patching (causal tracing). Transplanting an activation between a clean pass and a corrupted one to isolate which component is causally responsible. (Ch. 37)
- Adapter. Small bottleneck module inserted into each layer; precursor of LoRA. Adds latency. (Ch. 28)
- Alignment. Making the model follow instructions and conform to what we prefer (SFT + RLHF/DPO). (Ch. 27)
- Aliasing. When a RoPE pair completes a full turn and can no longer distinguish distances → loss of positional resolution. (Ch. 14)
- Hallucination. Output that is fluent, confident, and false (or unsupported). Partly intrinsic to next-token prediction. (Ch. 40)
- ANN (approximate nearest neighbors). Finding the closest vectors without comparing against all of them (HNSW, FAISS). (Ch. 31)
- Attention. Weighted mix in which each token gathers information from the others by affinity (Q·K). (Ch. 4)
- Linear attention. Replacing the softmax with a kernel that lets you reassociate the computation → O(n) cost; loses recall. (Ch. 34)
- γ atlas. γ measured across 42 models from 4 families: a cross-architecture coordinate. (Ch. 16)
44.2 B
- Continuous batching. Grouping requests at the iteration level (admit/release at each step) → keeps the GPU full. (Ch. 36)
- Beam search. Decoding that keeps several hypotheses; good for closed tasks. (Ch. 12)
- BPE (Byte-Pair Encoding). Tokenization that merges the most frequent symbol pairs. (Ch. 2)
- Bradley-Terry. The pairwise preference model behind the reward model’s loss: it scores responses so that the ones humans prefer rank higher. (Ch. 27)
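To make the BPE entry concrete, here is a minimal sketch of a single merge step: count adjacent symbol pairs, pick the most frequent, and fuse it everywhere. The toy corpus and the helper names (`most_frequent_pair`, `merge_pair`) are illustrative assumptions, not the tokenizer of any particular library; a real trainer repeats this step until it reaches the target vocabulary size.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

pair = most_frequent_pair(corpus)        # ('u', 'g'), seen 20 times
corpus_after = merge_pair(corpus, pair)  # 'u' and 'g' now travel as 'ug'
```

Each merge adds one symbol to the vocabulary, so the number of merges is what sets the vocabulary size.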
44.3 C
- Head. One of the parallel attentions in multi-head; tends to specialize. (Ch. 5)
- Induction head. A head that completes patterns of the form “A B … A → B”: having seen A followed by B earlier, it predicts B when A reappears. A candidate mechanism for in-context learning. (Ch. 5, 30)
- KV cache. Memory of already-computed keys/values so they aren’t recomputed when generating. (Ch. 12, 20)
- Inter-layer CKA. Representational similarity between layers; its re-ascent predicts grokking (our result). (Ch. 24)
- Chain-of-Thought (CoT). Asking for intermediate reasoning steps; emergent with scale. (Ch. 30)
- Chunking. Splitting documents into fragments before indexing them for RAG. (Ch. 31)
- CLIP. Image and text encoders trained contrastively into a shared space → zero-shot. (Ch. 33)
- Residual connection. Adding the input to the output of a block; keeps the gradient alive. (Ch. 7)
- Contrastive (learning). Pulling positive pairs together and pushing negatives apart in the embedding space. (Ch. 26)
- Quantization. Representing weights/activations with fewer bits (scale + zero-point). (Ch. 35)
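As an illustration of the "scale + zero-point" in the Quantization entry, the sketch below applies plain affine quantization to a handful of weights and maps them back. It assumes per-tensor, unsigned 8-bit quantization with round-to-nearest; production schemes (per-channel scales, 4-bit formats, calibration) differ in the details.

```python
def quantize(values, num_bits=8):
    """Affine quantization: q = round(x / scale) + zero_point, clamped to range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must contain 0 so that 0.0 maps to an exact integer
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Inverse map: x is approximately scale * (q - zero_point)."""
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.5]  # hypothetical values
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip is lossy: each value can move by up to about half a quantization step (`scale / 2`), which is the price paid for the smaller representation.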
44.4 D
- D_f (KV window). The number of KV tokens you really need to keep, derived from γ. (Ch. 20)
- Decode. Token-by-token generation phase; memory-bandwidth-bound; the serving bottleneck. (Ch. 36)
- Constrained decoding. Masking invalid tokens + renormalizing → guaranteed format (not truth). (Ch. 29)
- Speculative decoding. A draft proposes tokens and the large model verifies them in parallel (same output, faster). (Ch. 29)
- Distillation. Training a small student to imitate the soft targets of a large teacher. (Ch. 35)
- DPO. Direct preference optimization: the RLHF objective without an RL loop or an explicit reward model. (Ch. 27)
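The "mask + renormalize" step in the Constrained decoding entry can be sketched in a few lines. The four-token vocabulary and the `allowed` set are hypothetical; in practice the set of valid tokens would come from a grammar or schema state, and it is assumed here to be non-empty.

```python
import math

def constrained_probs(logits, allowed):
    """Mask every token outside `allowed`, then renormalize with a softmax."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    m = max(masked)                           # finite as long as `allowed` is non-empty
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary of 4 tokens; only tokens 1 and 3 are valid right now
logits = [2.0, 1.0, 0.5, 1.0]
probs = constrained_probs(logits, allowed={1, 3})
```

Token 0 had the highest logit but gets probability zero: the mask guarantees a valid format, and says nothing about whether the surviving tokens are true.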
44.5 E
- Embedding. Vector representing the meaning of a token or text. (Ch. 3, 26)
- Encoder / decoder. Bidirectional encoder (understand) / causal decoder (generate). (Ch. 10)
- Erratum. An error of our own, flagged, corrected, and re-proved (e.g. C_V /4→/12). (Ch. 22, 38)
- 1/√d_k scaling. Factor that keeps dot products from saturating the softmax. (Ch. 4)
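A quick numerical check of the 1/√d_k entry: for queries and keys with independent unit-variance components, the dot product has standard deviation about √d_k, and dividing by √d_k brings it back to about 1. The Gaussian inputs, the seed, and `d_k = 64` are assumptions chosen only for the demonstration.

```python
import math
import random

random.seed(0)
d_k = 64

def dot(q, k):
    return sum(a * b for a, b in zip(q, k))

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

# Draw random query/key pairs and record their raw dot products
raw = []
for _ in range(2000):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    raw.append(dot(q, k))

scaled = [s / math.sqrt(d_k) for s in raw]
raw_std, scaled_std = std(raw), std(scaled)  # about sqrt(64) = 8, and about 1
```

Without the factor, logits this spread out push the softmax toward a near one-hot output, where gradients become very small.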
44.6 F
- FFN (feed-forward). Network that processes each token separately; key-value memory. (Ch. 6)
- FlashAttention. Exact attention without materializing the n×n matrix (tiling + online softmax). (Ch. 34)
- Residual stream. The shared “lane” that each component reads from and writes to; the communication channel. (Ch. 3, 37)
- Folklore. A popular but unjustified belief (or one already contradicted). (Ch. 38)
- Fractional (order). A lens that reads attention as Lévy diffusion of order (γ−1)/2. (Ch. 23)
44.7 G
- γ (gamma). Decay exponent of attention with distance: \(A(d)\propto d^{-\gamma}\). (Ch. 15)
- γ_Padé. Prediction of γ from geometry (θ, T), without training. (Ch. 15)
- GQA / MQA / MLA. Sharing K/V across heads (GQA, MQA) or compressing them into a low-rank latent (MLA) to shrink the cache (not the compute). (Ch. 18)
- Grokking. Late, sudden generalization after memorizing. (Ch. 24)
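Because \(A(d)\propto d^{-\gamma}\) becomes a straight line in log-log coordinates, \(\log A(d) = -\gamma \log d + c\), one simple way to read off γ from a measured profile is a least-squares slope. The sketch below is only that generic fit on synthetic data; it is an assumption for illustration, not the measurement procedure used in the book, and the function name `fit_gamma` is hypothetical.

```python
import math

def fit_gamma(distances, attention):
    """Least-squares slope of log A against log d; gamma is minus that slope."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(a) for a in attention]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic, noise-free profile with a known exponent (gamma = 1.2)
distances = list(range(1, 65))
attention = [0.3 * d ** -1.2 for d in distances]
gamma_hat = fit_gamma(distances, attention)
```

On real attention maps the profile is noisy and the power law holds only over a range of distances, so the choice of fitting window matters.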
44.8 H
- Hagedorn (γ=1). The phase boundary between looking far (Phase A) and concentrating (Phase B). (Ch. 21)
- HBM / SRAM. The GPU’s large, slow / tiny, fast memory; their traffic dominates attention. (Ch. 34)
44.9 I — L
- In-context learning (ICL). Learning a task from examples in the prompt, without updating weights. (Ch. 30)
- Jailbreak. A prompt that bypasses the safety training. (Ch. 40)
- LayerNorm / RMSNorm. Normalization that rescales the activation → stability. (Ch. 7)
- Scaling laws. Loss drops predictably with size, data, and compute (Kaplan, Chinchilla). (Ch. 11, 25)
- Logit lens. Projecting an intermediate activation through the output to read the model’s “bet” layer by layer. (Ch. 37)
- LoRA / QLoRA. Low-rank delta (\(B{=}0\) at start, fusable) / over a 4-bit base. (Ch. 28)
- “Lost in the middle”. Models make the poorest use of information placed in the middle of the context. (Ch. 31)
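The two properties named in the LoRA entry, \(B{=}0\) at start and "fusable", can be checked on a tiny example. This is a sketch with hand-written matrices and hypothetical values; it omits the scaling factor that LoRA implementations usually apply to the delta.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen base weight W (2x3) and a rank-1 adapter: B (2x1) times A (1x3)
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A = [[0.5, -0.5, 1.0]]
B = [[0.0], [0.0]]                    # B = 0 at start, so the delta B @ A is zero

W_start = add(W, matmul(B, A))        # equal to W: the model starts unchanged

B = [[0.1], [-0.2]]                   # pretend training moved B (hypothetical values)
W_fused = add(W, matmul(B, A))        # fuse: fold the delta into one dense matrix

x = [[1.0], [1.0], [1.0]]             # column input
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))  # base path + low-rank path
y_fused = matmul(W_fused, x)          # same result from a single matrix
```

Once fused, the layer is an ordinary dense matrix again, so inference pays no extra cost for the adapter.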
44.10 M — N
- Causal mask. Prevents a token from looking at future ones (generation). (Ch. 9, 10)
- Memory-bound. Limited by moving data, not by computing (the case of attention). (Ch. 34)
- Multi-head. Several attentions in parallel + output projection \(W^O\). (Ch. 5)
- Numerology. A formula that fits the data with no mechanism to explain it. (Ch. 38)
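For the Causal mask entry, a minimal sketch: each row of scores is normalized over positions up to and including its own, and later positions get weight zero. The score matrix is hypothetical; real implementations usually add a large negative value to the masked logits instead of slicing, with the same effect.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i only sees positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # drop the future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        padding = [0.0] * (len(row) - i - 1)   # future positions get zero weight
        weights.append([e / total for e in exps] + padding)
    return weights

# Hypothetical 3x3 score matrix; the large 9.0 entries sit in the masked future
scores = [[0.0, 9.0, 9.0],
          [1.0, 1.0, 9.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
```

The large scores above the diagonal have no influence on the result, which is exactly what keeps generation from peeking ahead.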
44.11 O — P
- Online softmax. Computing the softmax block by block with a running max and sum. (Ch. 34)
- PagedAttention. Managing the KV cache in virtual-memory-style blocks → near-zero waste. (Ch. 34, 36)
- PEFT. Efficient fine-tuning: freeze the base and train few parameters. (Ch. 28)
- Perplexity. \(e^{\text{loss}}\): “how many tokens the model hesitates among”. (Ch. 11)
- Pruning. Removing weights (structured = blocks; unstructured = scattered). (Ch. 35)
- Polysemanticity. One neuron firing for many disparate concepts (caused by superposition). (Ch. 37)
- Prefill. Phase that processes the whole prompt in parallel; compute-bound; sets the TTFT. (Ch. 36)
- Prompt injection. Hostile instructions smuggled in as data; worse in agents. (Ch. 30, 40)
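The "running max and sum" of the Online softmax entry can be sketched as follows. The blocks are hypothetical and assumed non-empty; the point is that the streamed result matches the ordinary softmax over the concatenated input while only two scalars are carried between blocks.

```python
import math

def online_softmax(blocks):
    """Softmax over the concatenation of `blocks`, using one running max and sum."""
    running_max, running_sum = -math.inf, 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new reference max, then add this block
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for block in blocks for x in block]

def softmax(xs):
    """Reference implementation that sees all values at once."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

blocks = [[1.0, 3.0], [2.0, 10.0], [0.5]]
streamed = online_softmax(blocks)
reference = softmax([x for block in blocks for x in block])
```

The rescaling by `exp(running_max - new_max)` is the key step: it corrects everything accumulated so far whenever a later block raises the maximum.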
44.12 Q — R
- QK / OV (circuits). QK decides where to attend; OV decides what to write. (Ch. 37)
- RAG. Retrieving evidence and conditioning generation on it (non-parametric memory). (Ch. 31)
- ReAct. The agent loop: Thought → Action → Observation. (Ch. 32)
- Receipt. The proof backing a claim: Lean (algebra) or data. (Ch. 38)
- RLHF. Aligning with a reward model + PPO, on a KL leash. (Ch. 27)
- RoPE. Rotary positional encoding: rotates pairs of dimensions according to position. (Ch. 8)
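To illustrate the RoPE entry, the sketch below rotates consecutive pairs of dimensions by an angle proportional to the position and checks that the query-key score depends only on the offset between positions. The interleaved pairing and the base of 10000 are assumptions (a common convention); some implementations pair dimension i with i + d/2 instead.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (x0, x1), (x2, x3), ... by position * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # rotation frequency of this pair
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional query and key
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves each pair's dot product unchanged.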
44.13 S — Z
- SAE (sparse autoencoder). A dictionary that decomposes activations into monosemantic features; undoes superposition. (Ch. 37)
- Self-consistency. Sampling several reasoning chains and voting for the majority. (Ch. 30)
- Sink (attention sink). Attention mass that piles up on a few low-information tokens (BOS). (Ch. 17)
- Superposition. Storing more features than dimensions in nearly orthogonal directions. (Ch. 37)
- Temperature (τ). Divides the logits by τ before the softmax: low = cautious, high = adventurous. (Ch. 12, 29)
- Token. Integer identifier of a chunk of text (subword). (Ch. 2)
- TTFT / TPOT. Time to first token (prefill) / time per output token (decode). (Ch. 36)
- Verified / derived. A claim with a receipt (formal proof or reproducible data). (Ch. 38)
- ViT (Vision Transformer). Transformer over image patches treated as tokens. (Ch. 33)
- Z (partition function). \(Z=\mathrm{Li}_\gamma(e^{-\lambda})\), in the thermodynamic lens. (Ch. 21)
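A short sketch of the Temperature entry: the same logits passed through the softmax at three values of τ. The logits are hypothetical; τ is assumed strictly positive.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Divide the logits by tau, then apply a numerically stable softmax."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, tau=0.5)     # sharper: mass on the top token
neutral = softmax_with_temperature(logits, tau=1.0)
hot = softmax_with_temperature(logits, tau=2.0)      # flatter: more mass on the tail
```

Low τ concentrates probability on the top token (cautious); high τ spreads it toward the less likely tokens (adventurous). The ranking of the tokens never changes.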
Next reference (R4): how to measure your own γ and the open datasets on Hugging Face for reproducing it.