30 Text generation in depth
Where we are. This opens Part V (using the model). In Ch. 12 we covered the basics of decoding: greedy, beam, temperature, top-k, top-p and the KV-cache. Here we go deeper and more hands-on: how to pick a strategy, the newer sampling methods, how to control the output (repetition, length, guidance), how to enforce a format (JSON, grammars), how to speed things up (speculative decoding) and how to evaluate generated text, pitfalls and all.
30.1 The idea in one sentence
The quality of what a model generates is not just a matter of the model, but of how you decode it: the same network can sound flat and repetitive or creative and coherent depending on the decoding strategy you choose.
30.2 Key concepts and their role in the transformer
Before diving in, we define this chapter’s terms and what each one is for inside a transformer:
- Decoding. Definition: the procedure that turns the probabilities the model produces into actual tokens, one by one. In the transformer: it is the stage that runs after the network's forward pass; the same network can sound flat or creative depending on how you decode.
- Greedy / beam (maximization). Definition: at each step pick the most likely token (greedy), or keep several candidate sequences to approximate the most likely whole sequence (beam). In the transformer: it gives precision on closed tasks, but on open-ended text it degenerates into repetition.
- Sampling (temperature, top-k, top-p). Definition: draw the next token at random from the distribution, trimmed by temperature/top-k/top-p. In the transformer: it provides the diversity human-like text needs; the filters prune the tail of unlikely tokens.
- Min-p. Definition: a cutoff threshold relative to the most likely token. In the transformer: it adapts to the model’s confidence → it holds up at high temperatures without going off the rails.
- Control (repetition, length). Definition: levers that lower the logits of already-seen tokens or decide when to stop. In the transformer: they break the self-reinforcing loop of repetition and pin down the output length.
- Constrained decoding. Definition: at each step, mask out the tokens that would violate a schema (JSON, regex, grammar) and renormalize. In the transformer: it guarantees the shape of the output (not its truth).
- Speculative decoding. Definition: a draft model proposes several tokens and the big one verifies them in parallel. In the transformer: it speeds up inference without changing the output distribution.
- Evaluation (reference / LLM-as-judge / faithfulness). Definition: the families of metrics for judging generated text. In the transformer: they measure different axes —overlap, preference, truthfulness— and none is perfect.
With that in mind, let’s go step by step.
30.3 Decoding is deciding: probability ≠ quality
The first, counterintuitive lesson: searching for the most probable sequence does not give the best text in open-ended generation. The landmark work (Holtzman et al. 2020) showed that maximization (greedy, beam) on open-ended text produces output that is flat, repetitive and degenerate: the model gets stuck in loops. The finding that explains it: human text does NOT follow the maximum-probability path; people are “surprising”, and text that is highly probable at every step sounds artificial.
Hence the central decision, depending on the task:
- Deterministic (greedy / beam): for closed tasks —translation, code, data, factual questions— where there is (almost) one correct answer and you want precision.
- Stochastic (sampling): for open-ended/creative tasks —writing, conversing, brainstorming— where diversity matters and repetition kills.
The most striking thing in (Holtzman et al. 2020) is this: if you plot the probability the model assigns to each word of human-written text, it oscillates —it rises and falls—; it does not stay glued to the top. Natural language mixes the expected with the unexpected. That is why a decoder that only chases the most probable produces something too predictable to seem human. Repetition degeneration, moreover, is a self-reinforcing loop: repeating a token raises its future probability.
30.4 Sampling in depth
A reminder (Ch. 12): temperature \(\tau\) rescales the logits \(z\) before the softmax, \(\mathrm{softmax}(z/\tau)\): with \(\tau\to 0\) it tends toward greedy (always the most probable) and with \(\tau\to\infty\) toward uniform (pure chance); top-k keeps the \(k\) most likely tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches \(p\). On top of that, methods more recent than what Ch. 12 covered:
- Min-p (Nguyen et al. 2025): the threshold is relative to the most likely token (p_base × prob_of_top). When the model is confident (one token dominates), it trims aggressively; when it hesitates, it leaves more options. It is more robust to high temperature: it lets the model be creative without going off the rails.
- Typical sampling (Meister et al. 2023): it samples among tokens whose information content is close to the expected surprise (the conditional entropy). It discards both the too obvious and the too weird.
- Eta / epsilon (Hewitt et al. 2022): it views the model as “true distribution + smoothing”; truncating is un-smoothing. η uses a threshold that depends on entropy (more candidates when the model is genuinely uncertain, fewer when it is confident); the authors report that it breaks out of repetition better than top-p.
- Contrastive search (Su et al. 2022): at each step it balances the model’s confidence against a degeneration penalty (similarity to what was already generated) → coherent and non-repetitive, without the noise of sampling.
- Contrastive decoding (Li et al. 2022): it scores with log p(expert) − log p(amateur) (a strong model minus a weak one); the contrast cancels the failures (repetition, blandness) that the amateur also commits.
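The min-p rule from the list above is short enough to sketch directly. The snippet below is a minimal NumPy illustration under stated assumptions: the function names softmax and min_p_filter, the value p_base = 0.1 and the two toy logit vectors are made up for the example and do not come from any particular library.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def min_p_filter(probs, p_base=0.1):
    """Keep tokens with prob >= p_base * max(probs), then renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = probs >= p_base * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Hypothetical next-token logits: one dominant token vs. four near-ties.
confident = softmax([8.0, 2.0, 1.5, 1.0, 0.5])
hesitant = softmax([2.0, 1.9, 1.8, 1.7, -3.0])

kept_confident = int((min_p_filter(confident) > 0).sum())
kept_hesitant = int((min_p_filter(hesitant) > 0).sum())
print(kept_confident, kept_hesitant)
```

With the same p_base, the confident distribution keeps far fewer candidates than the hesitant one, which is the adaptive behaviour described above. A top-p cutoff, by contrast, targets a fixed cumulative mass regardless of how peaked the distribution is.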
Practical recipe: typical values are temp ≈ 0.7–1.0 and top-p ≈ 0.9 (or top-k ≈ 40). Watch out: these filters combine, and the order matters (temperature → top-k → top-p is the usual order over a single step). Lower the temperature / go greedy for factual; raise the temperature + min-p/top-p for creative.
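The filter order in the recipe can be sketched for a single decoding step. This is a minimal NumPy illustration, not any library's implementation: the names filtered_distribution and sample_next_token, the default values and the toy logits are assumptions for the example, and top-p is applied to the distribution renormalized after top-k.

```python
import numpy as np

def filtered_distribution(logits, temperature=0.8, top_k=40, top_p=0.9):
    """Next-token probabilities after temperature -> top-k -> top-p.

    Assumes temperature > 0; for greedy decoding use argmax instead.
    """
    logits = np.asarray(logits, dtype=float)
    # 1) Temperature: rescale logits, then softmax.
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    # 2) Top-k: keep the k most likely token ids (sorted, most likely first).
    order = np.argsort(-probs)
    if top_k is not None and top_k < len(order):
        order = order[:top_k]
    # 3) Top-p: smallest prefix of the survivors whose renormalized
    #    cumulative mass reaches top_p.
    kept = probs[order] / probs[order].sum()
    cutoff = int(np.searchsorted(np.cumsum(kept), top_p)) + 1
    order = order[:cutoff]
    # Renormalize over the final candidate set; everything else gets 0.
    out = np.zeros_like(probs)
    out[order] = probs[order] / probs[order].sum()
    return out

def sample_next_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filtered_distribution(logits, **kwargs)
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = [5.0, 4.0, 0.0, -1.0, -2.0]   # hypothetical 5-token vocabulary
token = sample_next_token(toy_logits, rng, temperature=0.7, top_k=3, top_p=0.9)
```

Swapping steps 2 and 3 generally changes the result, because each filter renormalizes what the next one sees; that is the sense in which the order matters.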
30.5 Controlling generation
Beyond how you sample, there are levers to steer the output:
- Repetition. The repetition penalty (Keskar et al. 2019) divides the logits of already-seen tokens by a factor >1; no-repeat-ngram forbids repeating any n-gram; the frequency penalty (scales with the number of occurrences) and the presence penalty (flat, once seen) are the ones in the APIs. All of them fight Holtzman’s self-reinforcing loop.
- Length. max/min new tokens; the length penalty in beam (without it, beam prefers short sequences); handling the end-of-sequence token (EOS): suppress it to force a minimum length, or bias toward it to finish sooner.
- Classifier-free guidance (CFG) for text (Sanchez et al. 2023): borrowed from diffusion models. It amplifies the effect of the prompt by contrasting the logits with and without the condition:
\[ \text{logits} = \text{logits}_{\text{uncond}} + \gamma\,\big(\text{logits}_{\text{cond}} - \text{logits}_{\text{uncond}}\big) \]
Term by term: logits_cond = with the prompt; logits_uncond = without it (unconditional); \(\gamma\) = the guidance weight (how much to exaggerate the difference). \(\gamma=1\) is normal generation (the unconditional terms cancel and only logits_cond remains); \(\gamma>1\) pushes the text to “stay on topic” with the prompt.
- Others: logit bias (force or veto specific tokens) and stopping criteria (stop strings, EOS, max tokens).
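Three of the levers above act directly on the logits of one step, so they can be sketched as small functions. This is an illustration under stated assumptions, not a library API: all function and parameter names are made up for the example. The repetition penalty follows the division described above but multiplies negative logits instead, a common implementation convention, since dividing a negative logit would raise it. The frequency/presence form (subtract frequency × count + presence) is the one usually documented for hosted APIs.

```python
from collections import Counter
import numpy as np

def repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-seen tokens less likely (penalty > 1)."""
    out = np.array(logits, dtype=float)
    for t in set(generated_ids):
        # Divide positive logits, multiply negative ones: both lower the score.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def frequency_presence_penalty(logits, generated_ids, frequency=0.5, presence=0.5):
    """Subtract a per-occurrence term plus a flat term for any seen token."""
    out = np.array(logits, dtype=float)
    for t, count in Counter(generated_ids).items():
        out[t] -= frequency * count + presence
    return out

def cfg_logits(logits_with_prompt, logits_without_prompt, gamma=1.5):
    """Text CFG: unconditional + gamma * (conditional - unconditional)."""
    cond = np.asarray(logits_with_prompt, dtype=float)
    uncond = np.asarray(logits_without_prompt, dtype=float)
    return uncond + gamma * (cond - uncond)

# Hypothetical 3-token vocabulary; token 0 was generated twice, token 2 once.
history = [0, 0, 2]
step_logits = [1.0, 1.0, 1.0]
penalized = frequency_presence_penalty(step_logits, history)
guided = cfg_logits([2.0, 0.0, 1.0], [1.0, 1.0, 1.0], gamma=2.0)
```

In this toy case the token seen twice is pushed down more than the token seen once, and the guided logits move further from the unconditional ones in the direction the prompt already favoured.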
30.6 Constrained decoding: enforcing a valid format
Sometimes you need the output to obey a schema (JSON, a regex, a grammar). The idea is simple and powerful: at every step, you zero out the probability of tokens that would violate the constraint and renormalize over the valid ones. That way the output is guaranteed to fit.
- Outlines (Willard and Louf 2023): it compiles the regex/grammar into a finite-state machine and precomputes the mask of valid tokens per state → near-zero cost, regex- or JSON-schema-guided generation.
- Grammar-constrained decoding (Geng et al. 2023): formal grammars (llama.cpp’s GBNF style). “JSON mode” and function calling are productized cases of this.
🧩 Analogy. It is an autocomplete that physically cannot type an invalid character —like a form field that only accepts digits—.
Constraints ensure shape, not truth: a perfectly valid JSON can contain an incorrect answer. And there is an open debate: Let Me Speak Freely? (Tam et al. 2024) found that constraining the format can degrade reasoning (JSON mode helps with classification but hurts reasoning tasks). Others counter that careful schema design recovers it. Treat it as a trade-off, not as settled: constrain when you need flawless parsing, but measure whether you lose content quality.
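The “mask + renormalize” step can be sketched on a toy vocabulary. This is an illustration only: the vocabulary, the hard-coded constraint (“one integer token, then end of sequence”) and the names allowed_mask and constrained_probs are hypothetical. Real systems such as Outlines derive the mask from a compiled regex or grammar rather than from hand-written rules.

```python
import numpy as np

VOCAB = ["yes", "no", "maybe", "7", "42", "<eos>"]   # hypothetical vocabulary

def allowed_mask(prefix):
    """Toy constraint: exactly one integer token, then <eos>."""
    if not prefix:
        return np.array([tok.isdigit() for tok in VOCAB])
    return np.array([tok == "<eos>" for tok in VOCAB])

def constrained_probs(logits, mask):
    """Zero out forbidden tokens and renormalize over the allowed ones."""
    if not mask.any():
        raise ValueError("constraint leaves no valid token")
    masked = np.where(mask, np.asarray(logits, dtype=float), -np.inf)
    p = np.exp(masked - masked.max())    # exp(-inf) = 0 for masked tokens
    return p / p.sum()

# The unconstrained model would prefer "yes"; the mask forbids it.
step_logits = np.array([5.0, 4.0, 3.0, 1.0, 2.0, 0.0])
probs = constrained_probs(step_logits, allowed_mask(prefix=[]))
best = VOCAB[int(np.argmax(probs))]
```

The relative preference among the allowed tokens is unchanged; only the forbidden mass is removed. The output is guaranteed to be an integer, but nothing checks that it is the right integer.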
30.7 Speculative decoding: the same output, faster
Generation is sequential (one token after another), and that is where the latency bottleneck lives. Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) attacks it without changing the result: a small draft model proposes several tokens, and the big model scores them in parallel in a single pass. Each proposed token is accepted or rejected with a probability that compares the two models; the accepted prefix is kept and, at the first rejection, that token is resampled from a corrected distribution. The output distribution is identical to that of the big model alone; it is just faster. It is one of the most widely used lossless speedups in inference.
🧩 Analogy. A junior drafts a sentence and the expert reviews it at a glance, keeping the correct part: the same final words, in much less time.
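The accept/reject rule behind that guarantee can be sketched with toy distributions. This is a minimal illustration of the rejection-sampling scheme described by Leviathan et al. and Chen et al., under stated assumptions: target_probs and draft_probs are hypothetical callables that return a probability vector for a given prefix, and the “parallel” verification pass is simulated with a plain loop.

```python
import numpy as np

def speculative_step(prefix, target_probs, draft_probs, k, rng):
    """Return the tokens to append after one draft/verify round (1 to k+1)."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted, q_list, ctx = [], [], list(prefix)
    for _ in range(k):
        q = np.asarray(draft_probs(ctx), dtype=float)
        x = int(rng.choice(len(q), p=q))
        drafted.append(x)
        q_list.append(q)
        ctx.append(x)
    # 2) The target model scores all k+1 positions
    #    (one batched forward pass in a real system; a loop here).
    p_list = [np.asarray(target_probs(list(prefix) + drafted[:i]), dtype=float)
              for i in range(k + 1)]
    # 3) Accept each drafted token with probability min(1, p/q).
    accepted = []
    for i, x in enumerate(drafted):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            continue
        # First rejection: resample from the normalized residual max(0, p - q).
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        accepted.append(int(rng.choice(len(residual), p=residual)))
        return accepted
    # Every draft token was accepted: take one extra token from the target.
    accepted.append(int(rng.choice(len(p_list[k]), p=p_list[k])))
    return accepted

# Toy, context-independent models over a 3-token vocabulary.
P_TARGET = np.array([0.5, 0.3, 0.2])
Q_DRAFT = np.array([0.2, 0.2, 0.6])
rng = np.random.default_rng(0)
new_tokens = speculative_step([], lambda ctx: P_TARGET, lambda ctx: Q_DRAFT, k=3, rng=rng)
```

For any token x, the chance of emitting it is min(p, q) from acceptance plus max(p − q, 0) from the residual, which sums to p(x). That is why the draft model affects only the speed, not the distribution.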
30.8 Evaluating generated text
How do you know whether an output is good? It is hard, because in open-ended generation there is no single reference. Three families:
- Reference-based (overlap with a “correct” text): BLEU (Papineni et al. 2002) (n-gram precision, translation), ROUGE (Lin 2004) (n-gram recall, summarization), BERTScore (Zhang et al. 2020) (embedding similarity, captures paraphrase). Caveat: BLEU/ROUGE correlate weakly with human judgment on open-ended text.
- Model-based (LLM-as-judge): G-Eval (Liu et al. 2023) (GPT-4 scores the output after a chain of reasoning) and MT-Bench / Chatbot Arena (Zheng et al. 2023) (pairwise comparison, Elo ratings from human votes). Caveat: LLM judges have biases, namely position bias (they favor the first answer), verbosity bias (they favor the longer one) and self-preference (they favor their own model family).
- Faithfulness / hallucination: this is a separate axis from fluency, because an answer can sound perfect and be false.
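A toy n-gram overlap shows why reference-based metrics struggle with paraphrase. The sketch below computes only the clipped n-gram precision at the core of BLEU and the n-gram recall at the core of ROUGE-N. It is not a full implementation of either metric (no brevity penalty, no averaging over n), and the function names and sentences are made up for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram matches."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum((cand & ref).values())          # counts clipped by the reference
    precision = matched / max(sum(cand.values()), 1)   # BLEU-style
    recall = matched / max(sum(ref.values()), 1)       # ROUGE-style
    return precision, recall

reference = "the cat sat on the mat"
paraphrase = "a feline rested on the rug"

exact_p, exact_r = overlap(reference, reference, n=1)
para_p1, para_r1 = overlap(paraphrase, reference, n=1)
para_p2, para_r2 = overlap(paraphrase, reference, n=2)
```

The paraphrase means roughly the same thing, yet it shares only two unigrams and one bigram with the reference, so both scores drop sharply. Embedding-based metrics such as BERTScore are meant to soften exactly this effect.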
The decoding strategy does not change the model’s attention geometry; but the length of what you generate does drive up the cost of the KV-cache. tafagent computes, from γ (Ch. 15-20), the KV budget at the target length: useful for anticipating how much memory a long generation will cost before you launch it.
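As a rough cross-check on any such budget, the usual back-of-the-envelope estimate for a dense-attention KV-cache can be written in a few lines. This is the generic textbook formula, not tafagent's method, and it does not use γ; the function name and the model configuration in the example are hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size: keys and values for every layer and position."""
    # The factor 2 counts one key tensor and one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical config: 32 layers, 32 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
at_4k = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2**20, "MiB per token;", at_4k / 2**30, "GiB at 4096 tokens")
```

The cost grows linearly with the generated length, and it shrinks in proportion if the model shares KV heads across query heads (fewer n_kv_heads).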
30.9 Summary
- Probability ≠ quality (Holtzman et al. 2020): maximizing (greedy/beam) degenerates on open-ended text; human text does not follow the probability ridge → sample.
- When: deterministic for closed/factual/code; sampling for open-ended/creative.
- New sampling: min-p (threshold relative to the top, robust to high temperature), typical, eta/epsilon, contrastive search/decoding. Recipe: temp≈0.7–1.0, top-p≈0.9; they combine and the order matters.
- Control: repetition/frequency/presence penalty, no-repeat-ngram; length and EOS; text CFG (uncond + γ·(cond − uncond)); logit bias.
- Constrained: mask invalid tokens + renormalize (Outlines/grammars, JSON mode) → guaranteed shape; but shape ≠ truth, and it can hurt reasoning (Tam et al. 2024).
- Speculative: draft proposes, big one verifies in parallel → same output, faster.
- Evaluate: BLEU/ROUGE/BERTScore (reference, weak correlation), LLM-as-judge (G-Eval, MT-Bench, with position/verbosity/self-preference biases); faithfulness is a separate axis.
Next (Chapter 31): without touching the weights, the prompt itself programs the model: prompting, in-context learning, chain-of-thought and the induction heads that make it possible.
30.10 Exercises
- Probability ≠ quality. Why do greedy/beam degenerate in open-ended generation? What did Holtzman show about human text?
- Min-p vs top-p. How does the min-p threshold differ from the top-p one, and why does min-p hold up better at high temperatures?
- Text CFG. In logits_uncond + γ·(logits_cond − logits_uncond), what does \(\gamma\) do? What happens with \(\gamma=1\)?
- Constrained. Explain the “mask + renormalize” mechanism. Why does a valid JSON not guarantee a correct answer?
- Speculative. Why does speculative decoding produce exactly the same distribution as the big model alone?
- LLM judges. Cite two biases of using an LLM as an evaluator.