36 Quantization, distillation and pruning
Where we are. Still on efficiency (Part VI). Ch. 34 made attention cheaper; here we make the whole model cheaper: smaller and cheaper to run without losing (much) quality. We cover three families that combine: quantization (fewer bits), distillation (training a small student) and pruning (removing weights), each with its mechanics and its honest trade-offs.
36.1 The idea in one sentence
A trained model can be shrunk along three orthogonal routes —use fewer bits per weight, teach a small model to imitate a big one, or remove redundant weights— always trading some quality for size, speed and cost.
36.2 Key concepts and their role in the transformer
Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:
- Quantization. Definition: representing weights/activations with fewer bits (FP32 → INT8 → 4 bits). In the transformer: it cuts memory and bandwidth —the most widely used route to make a large model fit and run fast.
- Scale and zero-point. Definition: the two numbers of the map \(x \approx S\,(x_q - Z)\) that translate between the real and the integer values. In the transformer: they define how each weight is rounded to few bits without losing its range.
- PTQ vs QAT. Definition: quantizing after training (cheap) vs simulating quantization during training (better at very low bits, expensive). In the transformer: PTQ is the norm; QAT, for extreme cases.
- Activation outliers. Definition: a few dimensions with enormous values that break naive quantization. In the transformer: they appear at scale (~6.7B parameters) and force you to handle them separately (LLM.int8(), SmoothQuant).
- Distillation. Definition: training a small student model to imitate the distribution of a large teacher. In the transformer: it transfers capability to a model that’s cheaper to serve.
- Soft targets and temperature. Definition: the teacher’s full probability distribution (not just the answer), softened with a temperature \(T\). In the transformer: it’s the “dark knowledge” that teaches more than the hard label.
- Pruning (structured vs unstructured). Definition: removing individual weights (unstructured) or whole blocks —heads, neurons, layers— (structured). In the transformer: the structured kind actually speeds things up on ordinary hardware; the unstructured kind needs specific support.
- Sparsity. Definition: the fraction of weights set to zero. In the transformer: more sparsity = lighter model, until quality drops.
With this in hand, let’s go family by family.
36.3 Why compress
A trained model is expensive along four axes: memory (it takes up tens of GB), bandwidth (moving those weights), energy and latency/cost. Compressing it aims to make it fit (on a single GPU, on a phone) and to make serving cheap. The three families are orthogonal and composable: quantization is the “first coat” (the cheapest); distillation requires a training budget; pruning changes the shape of the model.
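To put rough numbers on the memory axis, here is a back-of-the-envelope sketch. The 7B-parameter model is a hypothetical example and the function name is ours; it counts weights only and ignores activations, the KV cache and any quantization metadata.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits/weight: {weight_memory_gb(7e9, bits):.1f} GB")
```

Halving the bits halves the footprint, which is why quantization is usually the first lever to reach for.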
36.4 Quantization: fewer bits per weight
Quantizing approximates a real value by scale × integer: \(x \approx S\,(x_q - Z)\), where \(S\) (the scale) sets the step size and \(Z\) (the zero-point) marks which integer represents the real zero (Jacob et al. 2018). It can be applied per tensor (one \(S\) for everything), per channel (one per row/column) or per group (blocks of, say, 128 weights, each with its own scale). The finer the granularity, the more faithful the result and the more metadata to store (Nagel et al. 2021).
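The map \(x \approx S\,(x_q - Z)\) can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names, the unsigned integer grid, the min/max calibration and the toy weights are all our assumptions. It also compares per-tensor with per-group quantization at 4 bits.

```python
def affine_params(values, n_bits=8):
    """Scale S and zero-point Z for an unsigned n-bit grid covering [min, max]."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # keep real 0 in range
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, n_bits=8):
    qmax = 2 ** n_bits - 1
    return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    return [scale * (x - zero_point) for x in q]  # x ~ S * (x_q - Z)

def roundtrip_error(values, group_size, n_bits=4):
    """Mean absolute error when each group gets its own (S, Z)."""
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        s, z = affine_params(group, n_bits)
        restored = dequantize(quantize(group, s, z, n_bits), s, z)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(values)

# Toy weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, -0.01, 2.0, -1.5, 1.0, -2.0]
per_tensor = roundtrip_error(weights, group_size=len(weights))
per_group = roundtrip_error(weights, group_size=4)
print(f"per-tensor error: {per_tensor:.4f}  per-group error: {per_group:.4f}")
```

With a single scale, the large weights set the step size and the small ones all collapse to zero; with one scale per group of four, the small weights get their own finer grid. The price is one extra \((S, Z)\) pair per group.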
The key method distinction is when you quantize: PTQ (post-training quantization) quantizes an already-trained model —cheap, no retraining— and QAT (quantization-aware training) simulates quantization during training —better at very low bits, but expensive.
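What “simulating quantization during training” means can be shown on a one-weight toy problem. The sketch below assumes the common straight-through trick (round in the forward pass, pass the gradient through the rounding unchanged in the backward pass); the text above does not commit to a specific QAT recipe, and all names and numbers here are illustrative.

```python
def fake_quant(w, scale):
    """Quantize then dequantize: snap w to the nearest multiple of `scale`."""
    return scale * round(w / scale)

# Fit y = w * x while the forward pass only ever sees the quantized weight.
x, y = 2.0, 1.5          # one training example; the ideal weight is 0.75
scale, lr = 0.25, 0.05   # quantization step and learning rate (toy values)
w = 0.1                  # latent full-precision weight kept by the optimizer

for _ in range(200):
    w_q = fake_quant(w, scale)
    grad_wq = 2 * (w_q * x - y) * x  # d(loss)/d(w_q) for squared error
    w -= lr * grad_wq                # straight-through: treat d(w_q)/d(w) as 1

print("latent weight:", w, "quantized weight:", fake_quant(w, scale))
```

The optimizer keeps moving the latent weight until its rounded version solves the task. PTQ, by contrast, would round once after training and accept whatever error results.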
🧩 Analogy — 16 pencils instead of a million shades. Quantizing is describing colors with 16 colored pencils instead of a million shades: for almost everything it looks just as good and weighs far less. The scale is how far apart the shades in your box are; the zero-point, which of them is the “white”.
The problem at scale: outliers. In large LLMs, a few activation dimensions have gigantic values that, when quantized crudely, dominate the range and wreck the precision of the rest. The modern solutions (all PTQ):
- LLM.int8() (Dettmers et al. 2022): detects those outlier dimensions and computes them separately in 16 bits, while the remaining >99.9% goes in 8 bits → INT8 with no loss up to 175B parameters. (The outliers emerge above ~6.7B parameters.)
- SmoothQuant (Xiao et al. 2023): since activations are hard to quantize but weights aren’t, it migrates the difficulty from activations to weights with an equivalent transformation → W8A8 (weights and activations in 8 bits) with almost no loss.
- GPTQ (Frantar et al. 2023): one-shot quantization to 3-4 bits using second-order (Hessian) information to correct the error; quantizes a 175B in ~4 GPU hours.
- AWQ (Lin et al. 2024): identifies the ~1% of “salient” weights (by the activation, not the weight) and protects them with a rescaling —avoiding mixed precision.
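The “equivalent transformation” behind SmoothQuant in the list above can be sketched as a per-channel rescaling: divide each activation channel by a factor and multiply the matching weight row by the same factor, so the product is unchanged. The formula for the factor and the value alpha = 0.5 follow our reading of the paper; the function names and the toy matrices are our assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, alpha=0.5):
    """X: tokens x channels, W: channels x outputs. Returns X / s, s * W and s."""
    n_ch = len(W)
    s = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in X)  # assumed non-zero in this sketch
        w_max = max(abs(v) for v in W[j])        # assumed non-zero in this sketch
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    W_s = [[s[j] * v for v in W[j]] for j in range(n_ch)]
    return X_s, W_s, s

# Toy case: channel 1 of the activations is an outlier channel.
X = [[0.5, 40.0], [-0.3, -50.0]]
W = [[0.2, -0.1], [0.04, 0.02]]
X_s, W_s, s = smooth(X, W)

Y, Y_s = matmul(X, W), matmul(X_s, W_s)
max_before = max(abs(v) for row in X for v in row)
max_after = max(abs(v) for row in X_s for v in row)
print("activation range before/after:", max_before, max_after)
```

The output of the layer is the same, but the activation range is much narrower and the weights absorb part of the spread, which makes both sides easier to quantize to 8 bits.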
🧩 Analogy — the voices that distort. Outliers are like a few very loud voices in a choir: if you record them with the same mic as everyone else, they saturate the whole mix. You have to give them a separate high-fidelity track (the 16 bits of LLM.int8) and record the rest normally.
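The “separate track” idea can be sketched on a single dot product: quantize everything to INT8 with one scale, or pull the outlier dimensions out and compute them in full precision. This is a simplified illustration in the spirit of LLM.int8(), not its actual kernel; the symmetric absmax quantizer, the threshold of 6.0 and the toy vectors are our assumptions.

```python
def absmax_quantize(values, n_bits=8):
    """Symmetric quantization: one scale from the largest magnitude, Z = 0."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(x, w):
    xq, sx = absmax_quantize(x)
    wq, sw = absmax_quantize(w)
    return sx * sw * sum(a * b for a, b in zip(xq, wq))  # integer sum, rescaled

def mixed_dot(x, w, threshold=6.0):
    """Outlier dimensions in full precision, everything else in INT8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    rest = [i for i in range(len(x)) if i not in outliers]
    high = sum(x[i] * w[i] for i in outliers)
    low = int8_dot([x[i] for i in rest], [w[i] for i in rest]) if rest else 0.0
    return high + low

# Toy activations with one outlier dimension (index 4).
x = [0.2, -0.3, 0.1, 0.4, 60.0, -0.25]
w = [0.5, -0.4, 0.3, 0.2, 0.01, -0.6]

exact = sum(a * b for a, b in zip(x, w))
naive = int8_dot(x, w)
mixed = mixed_dot(x, w)
print(f"exact {exact:.4f}  naive INT8 {naive:.4f}  mixed {mixed:.4f}")
```

In the naive version the outlier stretches the scale so far that most of the ordinary activations round to 0 or ±1; once it is handled separately, the remaining values use the full 8-bit grid.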
The bit frontier: 8 bits ≈ no loss; 4 bits almost no loss with good methods (protecting the ~1%); below ~3-4 bits there’s a “cliff” that disproportionately damages math and code.
BitNet b1.58 (Ma et al. 2024) takes each weight to ternary {−1, 0, 1} (≈1.58 bits) and claims to match an FP16 model of the same size. Two honest caveats: (1) it matches FP16 only from ~3B parameters up; (2) it’s not post-training quantization but QAT from scratch (you have to retrain the model with that constraint). A promising result, but from a single team and still without broad independent replication at large scale.
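Where does “1.58 bits” come from, and what does a ternary weight look like? The sketch below rounds weights to {−1, 0, 1} after dividing by their mean absolute value, which is our reading of the quantizer described in the paper; the function name and the toy weights are assumptions. Remember that in BitNet this rounding is applied during training, so running it on an already-trained model would not reproduce the reported results.

```python
import math

def ternarize(weights, eps=1e-8):
    """Scale by the mean absolute value, then round and clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

ternary, scale = ternarize([0.9, -0.05, 0.4, -1.3, 0.02, -0.6])
bits_per_weight = math.log2(3)  # information content of a three-valued symbol

print(ternary, f"scale={scale:.3f}", f"bits/weight={bits_per_weight:.3f}")
```

Three possible values carry \(\log_2 3 \approx 1.58\) bits, hence the name.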
36.5 Distillation: teaching a small student
Distillation trains a small student model to imitate not just the (large) teacher’s answer but its full probability distribution, the soft targets, softened with a temperature \(T\) that flattens the distribution so the small probabilities of unlikely options become visible to the loss (Hinton et al. 2015). That “dark knowledge” (the teacher seeing “probably cat, maybe dog, definitely not car”) teaches more than the hard label “cat”.
(Loss detail: since the gradients of the soft targets scale as \(1/T^2\), they’re multiplied by \(T^2\) when combined with the hard label, so both weigh similarly.)
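The loss described above can be sketched in plain Python. The mixing weight `alpha`, the function names and the example logits (cat, dog, car) are our assumptions; the structure (a KL term on temperature-softened distributions scaled by \(T^2\), plus ordinary cross-entropy on the hard label) follows the description in the text.

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Ordinary cross-entropy against the hard label, at T = 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * T * T * soft + (1 - alpha) * hard

teacher_logits = [5.0, 3.0, -2.0]  # "probably cat, maybe dog, definitely not car"
student_logits = [2.0, 2.5, 0.0]

p_t1 = softmax(teacher_logits, T=1.0)
p_t4 = softmax(teacher_logits, T=4.0)
loss = distillation_loss(student_logits, teacher_logits, label=0, T=4.0)
print("teacher at T=1:", [round(p, 3) for p in p_t1])
print("teacher at T=4:", [round(p, 3) for p in p_t4])
print("loss:", loss)
```

At \(T = 1\) the teacher puts almost all its mass on “cat”; at \(T = 4\) the “dog” and “car” probabilities become large enough for the student to learn how the teacher ranks the wrong answers.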
🧩 Analogy — the teacher who teaches the nuance. Instead of telling the apprentice only the final answer (“it’s a cat”), the teacher conveys the nuance: “this is almost certainly a cat, could be mistaken for a dog, never for a car”. That apportioning of doubt is what really teaches generalization.
The canonical case is DistilBERT (Sanh et al. 2019): 40% smaller, 60% faster and retains ~97% of BERT’s capability, with a triple loss (language modeling + distillation + cosine distance). TinyBERT (Jiao et al. 2020) goes further and also distills the attention matrices and intermediate states (7.5× smaller). For generation, sequence-level distillation (Kim and Rush 2016) trains the student on the teacher’s outputs, which is the basis for “distilling” a large LLM into a small one. An honest caveat: beware the loose sense of “distilling” as simply training on a better model’s outputs; as we saw in Ch. 27 (Gudibande et al. 2023), that copies the style, not the capability.
36.6 Pruning: removing what’s redundant
Pruning removes weights to shrink the model. Unstructured pruning zeroes out individual weights: it reaches high sparsity but needs special hardware to accelerate. Structured pruning removes whole blocks (heads, neurons, layers) and is faster on ordinary hardware.
The conceptual backdrop is the lottery ticket hypothesis (Frankle and Carbin 2019): inside a dense network there are subnetworks (“winning tickets”) that, trained alone, match the original —i.e. a large fraction of the weights are redundant. In LLMs, modern one-shot pruning exploits this: SparseGPT (Frantar and Alistarh 2023) prunes ≥50% of the weights of a 175B without retraining and with minimal loss; Wanda (Sun et al. 2024) does something simpler —pruning by |weight| × ‖input activation‖— also without retraining.
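The Wanda criterion, |weight| × ‖input activation‖, is simple enough to sketch directly. This is a toy illustration under our own assumptions (function name, list-of-lists layout, scores compared within each output row, a two-sample calibration batch), not the authors' code.

```python
import math

def wanda_prune(W, X, sparsity=0.5):
    """W: output neurons x input features. X: calibration samples x input features."""
    n_in = len(W[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(sample[j] ** 2 for sample in X)) for j in range(n_in)]
    n_drop = int(n_in * sparsity)
    pruned = []
    for row in W:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

W = [[0.9, 0.1, -0.5, 0.2]]
X = [[0.1, 5.0, 0.2, 1.0],
     [0.1, 4.0, 0.1, 1.0]]  # feature 1 is consistently large

print(wanda_prune(W, X))  # keeps the weights on features 1 and 3
```

Plain magnitude pruning would have kept 0.9 and −0.5; weighting by the activation norm keeps the small weight that multiplies a large input instead.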
And it connects with Ch. 5: you can prune whole attention heads. Michel et al. (2019) removed many with hardly any loss; Voita et al. (2019) pruned 38 of 48 encoder heads with only −0.15 BLEU.
🧩 Analogy — pruning the tree. Pruning is cutting the dead branches off a tree: it ends up lighter but it’s still the same tree. The structured kind removes whole branches (you notice when you lift it); the unstructured kind removes scattered leaves all over (it weighs less, but carrying it costs about the same if you don’t have the right cart).
Setting scattered weights to zero rarely speeds things up on an ordinary GPU: the hardware can’t “skip” randomly scattered zeros. That’s why SparseGPT and Wanda also support 2:4 patterns (two of every four consecutive weights set to zero), which recent NVIDIA GPUs do accelerate. The moral: structured pruning usually wins in practice; the unstructured kind pays off only with hardware support.
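A 2:4 pattern is easy to state in code: in every consecutive group of four weights, exactly two survive. The sketch below picks the survivors by magnitude, which is our simplifying assumption; SparseGPT and Wanda would use their own importance scores within each group.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = set(sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Because the zeros land in a fixed, predictable pattern, the hardware can store only the two surviving values per group plus their positions, which is what makes the speed-up possible.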
36.7 How they combine, and the honest trade-off
The typical order of “reach”: quantization (cheapest and most used) → distillation (if you can train) → pruning (a change of shape). And they compose: SparseGPT combines with quantization; QLoRA (Ch. 28) is a 4-bit base + LoRA. The universal, honest rule: all compression trades some quality for size/speed; the art is finding the knee of the curve.
Don’t rely on perplexity alone: compression disproportionately damages rare behaviors (math, code, long-tail cases). Also measure downstream tasks, robustness and infrequent cases before signing off on a compressed version.
36.8 Bridge to our theme (brief and honest)
Quantization and pruning are content-agnostic reductions (smaller weights, fewer weights). Our γ-based D_f window (Ch. 20) reduces a different object (the KV cache, not the weights) in a content-aware way: it’s an orthogonal axis, not a rival method. And head pruning is consistent with our observation that different heads decay differently (their per-head γ): if some heads matter more than others, it is plausible that some can be removed. We note the consistency without claiming causality.
tafagent works on an axis complementary to weight compression: it computes the KV budget (cache memory) from γ (Ch. 20). If you quantize and/or prune a model so it fits, also measure its γ to know how much cache you’ll need when serving it at long context —the two levers add up.
36.9 Summary
- Three composable families: quantization (fewer bits), distillation (small student), pruning (remove weights). All compression trades quality for size/speed.
- Quantization: map \(x\approx S(x_q-Z)\); PTQ (cheap) vs QAT (better at low bits). The challenge at scale is the outliers → LLM.int8 (16 bits aside), SmoothQuant (migrate to weights), GPTQ (3-4 bits, Hessian), AWQ (protect the ~1%). 8 bits ≈ no loss; 4 bits almost; sub-4-bit contested (BitNet, QAT from scratch).
- Distillation (Hinton et al. 2015): the student imitates the teacher’s soft targets (temperature \(T\), “dark knowledge”). DistilBERT (40% smaller, 97% of quality). Honest: imitating outputs copies style, not capability.
- Pruning: structured (speeds up on normal HW) vs unstructured (needs support); SparseGPT/Wanda prune ~50% one-shot; head pruning (Ch. 5). Scattered sparsity doesn’t always speed up (hence the 2:4 pattern).
- Evaluate: not just perplexity — compression damages the rare stuff (math/code).
- Bridge: compression = content-agnostic (weights); D_f (γ) = content-aware (KV cache), an orthogonal axis.
Next (Chapter 37): with the model now efficient and compressed, we close Part VI with serving and deployment: throughput, continuous batching, KV-cache paging, latency vs cost.
36.10 Exercises
- The map. In \(x \approx S\,(x_q - Z)\), what role do \(S\) and \(Z\) play? What does per-group quantization gain over per-tensor?
- PTQ vs QAT. When is the cost of QAT worth it over the convenience of PTQ?
- Outliers. Why do a few dimensions break naive quantization, and how does LLM.int8() solve it?
- Soft targets. What does the teacher’s full distribution teach that the hard label doesn’t? What is the temperature \(T\) for?
- Pruning. Why does structured pruning usually speed things up more than unstructured on an ordinary GPU?
- Honesty. Why isn’t perplexity enough to evaluate a compressed model?