44 R3 · Glossary

What it is. Every important term in the book, defined at a glance, with the chapter where it is explained in depth. For quick lookup; the developed definitions and their analogies are in the chapters.

44.1 A

Massive activation. A few coordinates of the residual stream with enormous values (~10⁵×); the cause of sinks. (Ch. 17)
Activation patching (causal tracing). Transplanting an activation between a clean pass and a corrupted one to isolate which component is causally responsible. (Ch. 37)
Adapter. Small bottleneck module inserted into each layer; precursor of LoRA. Adds latency. (Ch. 28)
Alignment. Making the model follow instructions and conform to what we prefer (SFT + RLHF/DPO). (Ch. 27)
Aliasing. When a RoPE pair completes a full turn and can no longer distinguish distances → loss of positional resolution. (Ch. 14)
Hallucination. Output that is fluent, confident, and false (or unsupported). Partly intrinsic to next-token prediction. (Ch. 40)
ANN (approximate nearest neighbors). Finding the closest vectors without comparing against all of them (HNSW, FAISS). (Ch. 31)
Attention. Weighted mix in which each token gathers information from the others by affinity (Q·K). (Ch. 4)
Linear attention. Replacing the softmax with a kernel that lets you reassociate the computation → O(n) cost; loses recall. (Ch. 34)
γ atlas. γ measured across 42 models from 4 families: a cross-architecture coordinate. (Ch. 16)

44.2 B

Continuous batching. Grouping requests at the iteration level (admit/release at each step) → keeps the GPU full. (Ch. 36)
Beam search. Decoding that keeps several hypotheses; good for closed tasks. (Ch. 12)
BPE (Byte-Pair Encoding). Tokenization that merges the most frequent symbol pairs. (Ch. 2)
Bradley-Terry. The reward model’s loss: ranks responses according to human preferences. (Ch. 27)

44.3 C

Head. One of the parallel attentions in multi-head; tends to specialize. (Ch. 5)
Induction head. A head that completes patterns “A…A→B”; a candidate mechanism for in-context learning. (Ch. 5, 30)
KV cache. Memory of already-computed keys/values so they aren’t recomputed when generating. (Ch. 12, 20)
Inter-layer CKA. Representational similarity between layers; its re-ascent predicts grokking (our result). (Ch. 24)
Chain-of-Thought (CoT). Asking for intermediate reasoning steps; emergent with scale. (Ch. 30)
Chunking. Splitting documents into fragments before indexing them for RAG. (Ch. 31)
CLIP. Image and text encoders trained contrastively into a shared space → zero-shot. (Ch. 33)
Residual connection. Adding the input to the output of a block; keeps the gradient alive. (Ch. 7)
Contrastive (learning). Pulling positive pairs together and pushing negatives apart in the embedding space. (Ch. 26)
Quantization. Representing weights/activations with fewer bits (scale + zero-point). (Ch. 35)

44.4 D

D_f (KV window). The number of KV tokens you really need to keep, derived from γ. (Ch. 20)
Decode. Token-by-token generation phase; bandwidth-bound; the serving bottleneck. (Ch. 36)
Constrained decoding. Masking invalid tokens + renormalizing → guaranteed format (not truth). (Ch. 29)
Speculative decoding. A draft proposes tokens and the large model verifies them in parallel (same output, faster). (Ch. 29)
Distillation. Training a small student to imitate the soft targets of a large teacher. (Ch. 35)
DPO. Direct preference optimization: the RLHF objective without RL or a reward model. (Ch. 27)

44.5 E

Embedding. Vector representing the meaning of a token or text. (Ch. 3, 26)
Encoder / decoder. Bidirectional encoder (understand) / causal decoder (generate). (Ch. 10)
Erratum. An error of our own, flagged, corrected, and re-proved (e.g. C_V /4→/12). (Ch. 22, 38)
1/√d_k scaling. Factor that keeps dot products from saturating the softmax. (Ch. 4)

44.6 F

FFN (feed-forward). Network that processes each token separately; key-value memory. (Ch. 6)
FlashAttention. Exact attention without materializing the n×n matrix (tiling + online softmax). (Ch. 34)
Residual stream. The shared “lane” that each component reads from and writes to; the communication channel. (Ch. 3, 37)
Folklore. A popular but unjustified belief (or one already contradicted). (Ch. 38)
Fractional (order). A lens that reads attention as Lévy diffusion of order (γ−1)/2. (Ch. 23)

44.7 G

γ (gamma). Decay exponent of attention with distance: \(A(d)\propto d^{-\gamma}\). (Ch. 15)
γ_Padé. Prediction of γ from geometry (θ, T), without training. (Ch. 15)
GQA / MQA / MLA. Sharing K/V across heads to shrink the cache (not the compute). (Ch. 18)
Grokking. Late, sudden generalization after memorizing. (Ch. 24)

44.8 H

Hagedorn (γ=1). The phase boundary between looking far (Phase A) and concentrating (Phase B). (Ch. 21)
HBM / SRAM. The GPU’s large, slow / tiny, fast memory; their traffic dominates attention. (Ch. 34)

44.9 I — L

In-context learning (ICL). Learning a task from examples in the prompt, without updating weights. (Ch. 30)
Jailbreak. A prompt that bypasses the safety training. (Ch. 40)
LayerNorm / RMSNorm. Normalization that rescales the activation → stability. (Ch. 7)
Scaling laws. Loss drops predictably with size, data, and compute (Kaplan, Chinchilla). (Ch. 11, 25)
Logit lens. Projecting an intermediate activation through the output to read the model’s “bet” layer by layer. (Ch. 37)
LoRA / QLoRA. Low-rank delta (\(B{=}0\) at start, fusable) / over a 4-bit base. (Ch. 28)
“Lost in the middle”. Models use information in the middle of the context worst. (Ch. 31)

44.10 M — N

Causal mask. Prevents a token from looking at future ones (generation). (Ch. 9, 10)
Memory-bound. Limited by moving data, not by computing (the case of attention). (Ch. 34)
Multi-head. Several attentions in parallel + output projection \(W^O\). (Ch. 5)
Numerology. A formula that fits the data with no mechanism to explain it. (Ch. 38)

44.11 O — P

Online softmax. Computing the softmax block by block with a running max and sum. (Ch. 34)
PagedAttention. Managing the KV cache in virtual-memory-style blocks → zero waste. (Ch. 34, 36)
PEFT. Efficient fine-tuning: freeze the base and train few parameters. (Ch. 28)
Perplexity. \(e^{\text{loss}}\): “how many tokens the model hesitates among”. (Ch. 11)
Pruning. Removing weights (structured = blocks; unstructured = scattered). (Ch. 35)
Polysemanticity. One neuron firing for many disparate concepts (caused by superposition). (Ch. 37)
Prefill. Phase that processes the whole prompt in parallel; compute-bound; sets the TTFT. (Ch. 36)
Prompt injection. Hostile instructions smuggled in as data; worse in agents. (Ch. 30, 40)

44.12 Q — R

QK / OV (circuits). QK decides where to attend; OV decides what to write. (Ch. 37)
RAG. Retrieving evidence and conditioning generation on it (non-parametric memory). (Ch. 31)
ReAct. The agent loop: Thought → Action → Observation. (Ch. 32)
Receipt. The proof backing a claim: Lean (algebra) or data. (Ch. 38)
RLHF. Aligning with a reward model + PPO, on a KL leash. (Ch. 27)
RoPE. Rotary positional encoding: rotates pairs of dimensions according to position. (Ch. 8)

44.13 S — Z

SAE (sparse autoencoder). A dictionary that decomposes activations into monosemantic features; undoes superposition. (Ch. 37)
Self-consistency. Sampling several reasoning chains and voting for the majority. (Ch. 30)
Sink (attention sink). Attention mass that piles up on a few low-information tokens (BOS). (Ch. 17)
Superposition. Storing more features than dimensions in nearly orthogonal directions. (Ch. 37)
Temperature (τ). Rescales the logits before the softmax: low = cautious, high = adventurous. (Ch. 12, 29)
Token. Integer identifier of a chunk of text (subword). (Ch. 2)
TTFT / TPOT. Time to first token (prefill) / time per output token (decode). (Ch. 36)
Verified / derived. A claim with a receipt (formal proof or reproducible data). (Ch. 38)
ViT (Vision Transformer). Transformer over image patches treated as tokens. (Ch. 33)
Z (partition function). \(Z=\mathrm{Li}_\gamma(e^{-\lambda})\), in the thermodynamic lens. (Ch. 21)

Next reference (R4): how to measure your own γ and the open datasets on Hugging Face for reproducing it.