13 Inference and decoding
Where we are. We have a trained model (Ch. 11). But a model doesn’t “blurt out” a whole answer: it writes it word by word. This chapter explains how text is generated —choosing the next token, temperature, top-k/top-p— and the KV-cache, the piece that makes generation fast… and that grows until it becomes a bottleneck. With it we close the fundamentals (Part I) and open the door to our Part II.
13.1 The idea in one sentence
Generating is a loop: the model predicts the next token, one is chosen, it’s added to the text, and we start over —token by token until we’re done—.
13.2 Key concepts and their role in the transformer
Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:
- Autoregressive loop. Definition: predict → choose → add → repeat, token by token. In the transformer: it’s how a model “writes”; each choice is conditioned on everything before it.
- Logits. Definition: the raw scores the model assigns to each vocabulary token before the softmax. In the transformer: the starting point of every decoding strategy.
- Greedy / beam search. Definition: always taking the most probable token (greedy) or keeping the B most probable sequences (beam). In the transformer: deterministic and good at closed tasks, but repetitive in open-ended text.
- Sampling. Definition: choosing at random according to the distribution. In the transformer: it adds variety, but pure sampling sometimes fishes out rare tokens from the tail.
- Temperature. Definition: divides the logits before the softmax,
softmax(logits/T). In the transformer: it regulates how much the model “dares”; T<1 cautious, T>1 creative, T→0 greedy. - Top-k. Definition: sampling only among the k most probable tokens. In the transformer: it cuts the tail with a fixed k, whether the model is confident or hesitating.
- Top-p / nucleus. Definition: sampling from the smallest set whose cumulative probability reaches p. In the transformer: the number of candidates adapts to the model’s confidence; it’s today’s standard.
- KV-cache. Definition: a store of the keys (K) and values (V) of past tokens. In the transformer: it makes generation fast (it doesn’t recompute the past), but it grows with length → a memory bottleneck.
With these pieces understood, the rest of the chapter is watching them work together at each step of generation.
13.3 The autoregressive loop
This is, literally, what “writing” means for a model:
- Look at everything written so far → predict the distribution of the next token.
- Choose a token from that distribution (the strategies come next).
- Add it to the text.
- Repeat, until an end token or a length limit.
Each choice is conditioned on everything before it. That’s why a model can write coherently: each word “sees” the ones it has already placed.
13.4 From logits to a choice: the decoding strategies
The model gives some scores (logits) → probabilities. Which token do we take?
- Greedy: always the most probable. Deterministic, but repetitive and bland, and it can fall into loops. (Subtle: choosing the most probable at each step does not give the globally most probable sequence —it’s a local optimum—.)
- Beam search: keeps the B most probable partial sequences and at the end keeps the best one. It does well on tasks with a “correct” answer (translate, summarize), but in open-ended text it comes out just as soft and repetitive.
- Sampling: chooses at random according to the distribution → variety. But pure sampling sometimes fishes out rare tokens from the “tail” and lets incoherence slip through.
The next two ideas fix exactly that.
13.5 Temperature
Temperature regulates how much the model “dares”. It’s applied to the logits before the softmax:
\[ \text{softmax}(\text{logits} / T) \]
What T does:
- T < 1 → sharpens the distribution: more weight on the dominant token (cautious, confident).
- T > 1 → flattens it: spreads across more options (creative, risky).
- T → 0 → equivalent to greedy (always the most probable).
🧩 Analogy. Low T = a prudent writer who always uses the obvious word; high T = a bold one who tries unexpected words (sometimes brilliant, sometimes incoherent). (Careful: temperature is applied to the logits, not to the already-normalized probabilities.)
13.6 Top-k and top-p (nucleus)
Instead of sampling from the whole vocabulary, we restrict to the good candidates:
- Top-k (Fan et al. 2018): samples only among the k most probable tokens. It sticks with a fixed k, regardless of whether the model is confident or hesitating.
- Top-p / nucleus (Holtzman et al. 2020): samples from the smallest set of tokens whose cumulative probability reaches p (e.g. 0.92). The number of candidates adapts: few when the model is confident, more when it hesitates.
Why do they win? They cut the unreliable tail (better than pure sampling) and avoid the repetition/degeneration of greedy and beam search. It’s the sweet spot between coherence and variety —which is why top-p is the most used today—.
(There are also patches against repetition: the repetition penalty lowers the logit of what’s already been said, and no-repeat-ngram forbids repeating groups of words.)
import torch, torch.nn.functional as F
def next_token(logits, T=0.8, p=0.92):
logits = logits / T # temperature
probs = F.softmax(logits, dim=-1)
order = probs.argsort(descending=True) # nucleus (top-p):
cum = probs[order].cumsum(0)
cut = (cum >= p).nonzero()[0].item() + 1 # the minimal "nucleus"
chosen = order[:cut]
return chosen[torch.multinomial(probs[chosen], 1)] # sample from the nucleus13.7 The KV-cache: the bridge to Part II
Here’s the piece that connects to everything coming next. Without it, generating token number n would force us to recompute attention over all the previous ones at each step —a huge waste—.
The solution: the KV-cache stores the keys (K) and values (V) of past tokens. That way, each new token only computes its own and attends to what’s already stored.
- What it’s for: it makes generation fast (it doesn’t recompute the past).
- The problem: the cache grows linearly with length × layers × heads → it becomes the main memory bottleneck in long context.
🧩 Analogy. Instead of rereading the whole page from the start before writing each word, you keep some notes of what you’ve read (the K/V) and just consult them —but those notes take up more and more room on the desk—.
That endlessly growing cache is exactly what we tackle in Part II: from γ we derive a KV-cache compression window (which tokens to keep and which to discard, with no parameters to tune), and we’ll see how GQA/MQA shrink it by sharing K/V across heads. The “KV-cache problem” is one of the gateways into our unique territory.
13.8 Summary
- Generating = autoregressive loop: predict → choose → add → repeat, token by token.
- Greedy (repetitive), beam (good at closed tasks), sampling (varied but sometimes incoherent).
- Temperature
softmax(logits/T): T<1 cautious, T>1 creative, T→0 greedy. - Top-k (fixed k) and top-p/nucleus (adaptive) cut the unreliable tail → the sweet spot; top-p is the standard.
- The KV-cache trades memory for speed: it speeds up generation but grows with length → a bottleneck that Part II attacks (D_f window, GQA/MQA).
- Decoding is inference (model already trained), not training.
End of Part I. You now understand a complete transformer: from tokens to generation, piece by piece. Next (Part II): we enter our turf —how a model attends across distance, the γ decay law, and the tools that come out of it—.
13.9 Exercises
- Greedy. Give an intuitive example of why choosing the most probable word at each step may not give the most probable sentence.
- Temperature. What T would you use for (a) a reliable code assistant; (b) a creative idea generator? Why?
- Adaptive top-p. Explain why top-p considers fewer candidates when the model is confident and more when it hesitates. What advantage does it have over top-k?
- KV-cache. Why does generation get slower and consume more memory as the text grows? What does the cache trade off?