30  Text generation in depth

Where we are. This chapter opens Part V (using the model). Ch. 12 covered the basics of decoding: greedy, beam, temperature, top-k, top-p and the KV-cache. Here we go deeper and more hands-on: how to pick a strategy, the newer sampling methods, how to control the output (repetition, length, guidance), how to enforce a format (JSON, grammars), how to speed things up (speculative decoding) and how to evaluate generated text, pitfalls included.

30.1 The idea in one sentence

The quality of what a model generates is not just a matter of the model, but of how you decode it: the same network can sound flat and repetitive or creative and coherent depending on the decoding strategy you choose.

30.2 Key concepts and their role in the transformer

Before diving in, we define this chapter’s terms and what each one is for inside a transformer:

  • Decoding. Definition: the procedure that turns the probabilities the model produces into actual tokens, one by one. In the transformer: it is the step that runs after the network’s forward pass, not a layer of the network itself; the same network can sound flat or creative depending on how you decode.
  • Greedy / beam (maximization). Definition: at each step pick the most likely token (greedy), or search for the most likely whole sequence (beam, which only approximates it). In the transformer: it gives precision on closed tasks, but on open-ended text it degenerates into repetition.
  • Sampling (temperature, top-k, top-p). Definition: draw the next token at random from the distribution, trimmed by temperature/top-k/top-p. In the transformer: it provides the diversity human-like text needs; the filters prune the tail of unlikely tokens.
  • Min-p. Definition: a cutoff threshold relative to the most likely token. In the transformer: it adapts to the model’s confidence → it holds up at high temperatures without going off the rails.
  • Control (repetition, length). Definition: levers that lower the logits of already-seen tokens or decide when to stop. In the transformer: they break the self-reinforcing loop of repetition and pin down the output length.
  • Constrained decoding. Definition: at each step, mask out the tokens that would violate a schema (JSON, regex, grammar) and renormalize. In the transformer: it guarantees the shape of the output (not its truth).
  • Speculative decoding. Definition: a draft model proposes several tokens and the big one verifies them in parallel. In the transformer: it speeds up inference without changing the output distribution.
  • Evaluation (reference / LLM-as-judge / faithfulness). Definition: the families of metrics for judging generated text. In the transformer: they measure different axes (overlap, preference, truthfulness) and none is perfect.

With that in mind, let’s go step by step.

30.3 Decoding is deciding: probability ≠ quality

The first, counterintuitive lesson: searching for the most probable sequence does not give the best text in open-ended generation. The landmark work (Holtzman et al. 2020) showed it: maximization (greedy, beam) on open-ended text produces output that is flat, repetitive and degenerate, with the model getting stuck in loops. The finding that explains it: human text does NOT follow the maximum-probability path; people are “surprising”, and text that is highly probable at every step sounds artificial.

Hence the central decision, depending on the task:

  • Deterministic (greedy / beam): for closed tasks (translation, code, data, factual questions) where there is (almost) one correct answer and you want precision.
  • Stochastic (sampling): for open-ended/creative tasks (writing, conversing, brainstorming) where diversity matters and repetition kills.

Note 🧠 Curiosity: human text lives away from the probability ridge

The most striking thing in (Holtzman et al. 2020) is this: if you plot the probability the model assigns to each word of human-written text, it oscillates, rising and falling, rather than staying glued to the top. Natural language mixes the expected with the unexpected. That is why a decoder that only chases the most probable token produces something too predictable to seem human. Repetition degeneration, moreover, is a self-reinforcing loop: repeating a token raises its future probability.

30.4 Sampling in depth

A reminder (Ch. 12): temperature \(\tau\) rescales the logits before the softmax, softmax(z/τ): as \(\tau\to 0\) it tends toward greedy (always the most probable) and as \(\tau\to\infty\) toward uniform (pure chance). Top-k keeps the \(k\) most probable tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches at least \(p\). On top of that, there are methods more recent than what Ch. 12 covered:

  • Min-p (Nguyen et al. 2025): the threshold is relative to the most likely token: a token survives only if its probability is at least p_base × prob_of_top. When the model is confident (one token dominates), it trims aggressively; when it hesitates, it leaves more options. It is more robust to high temperature: it lets the model be creative without going off the rails.
  • Typical sampling (Meister et al. 2023): it samples among tokens whose information content is close to the expected surprise (the conditional entropy). It discards both the too obvious and the too weird.
  • Eta / epsilon (Hewitt et al. 2022): it views the model as “true distribution + smoothing”; truncating is un-smoothing. Epsilon sampling uses a fixed probability threshold; η uses a threshold that depends on entropy (more candidates when the model is genuinely uncertain, fewer when it is confident), which the authors report is better than top-p at breaking out of repetition.
  • Contrastive search (Su et al. 2022): at each step it balances the model’s confidence against a degeneration penalty (similarity to what was already generated) → coherent and non-repetitive, without the noise of sampling.
  • Contrastive decoding (Li et al. 2022): it scores with log p(expert) − log p(amateur) (a strong model minus a weak one); the contrast cancels the failures (repetition, blandness) that the amateur also commits.
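To make the min-p rule from the list above concrete, here is a minimal sketch in plain Python. It assumes the filter receives an already-normalized probability vector; the function names, the toy logits and the value p_base = 0.1 are illustrative choices, not taken from the paper or from any library.

```python
import math

def softmax(logits):
    """Turn logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, p_base=0.1):
    """Keep tokens with probability >= p_base * max(probs), then renormalize."""
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # never zero: the top token always survives
    return [p / total for p in kept]

# Toy logits (hypothetical): one dominant token vs. three near-ties.
confident = softmax([8.0, 2.0, 1.0, 0.5])
hesitant = softmax([2.0, 1.8, 1.6, -2.0])

n_confident = sum(1 for p in min_p_filter(confident) if p > 0)
n_hesitant = sum(1 for p in min_p_filter(hesitant) if p > 0)
print(n_confident, n_hesitant)
```

With the same p_base, the confident distribution keeps a single candidate while the hesitant one keeps three: the cutoff moves with the height of the top token, which is the adaptive behaviour the bullet describes.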

Practical recipe: typical values are temp ≈ 0.7–1.0 and top-p ≈ 0.9 (or top-k ≈ 40). Watch out: these filters combine, and the order matters (temperature → top-k → top-p is the usual order over a single step). Lower the temperature / go greedy for factual; raise the temperature + min-p/top-p for creative.
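The order temperature → top-k → top-p can be sketched as a single filtering step. This is a simplified illustration under stated assumptions: logits are a plain list, top-p is applied to the distribution renormalized over the top-k survivors, and the names filter_step and sample are hypothetical rather than any library's API.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_step(logits, temperature=0.8, top_k=40, top_p=0.9):
    """Apply temperature, then top-k, then top-p; return {token_id: prob}."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    # 1) Temperature rescales the logits before the softmax.
    probs = softmax([z / temperature for z in logits])
    # 2) Top-k keeps the k most probable token ids.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in order)
    # 3) Top-p keeps the smallest prefix whose cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}

def sample(dist, rng):
    """Draw one token id from the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

rng = random.Random(0)
dist = filter_step([2.0, 1.0, 0.5, -1.0, -3.0], temperature=1.0, top_k=3, top_p=0.8)
next_token = sample(dist, rng)
print(sorted(dist), next_token)
```

In this toy case top-k first cuts five candidates to three, and top-p then keeps only the two that together reach 80% of the remaining mass. Swapping the order of the filters can change which tokens survive, which is why the order matters.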

30.5 Controlling generation

Beyond how you sample, there are levers to steer the output:

  • Repetition. The repetition penalty (Keskar et al. 2019) divides the logits of already-seen tokens by a factor >1 (implementations typically multiply negative logits by the factor instead, so that the penalty always lowers the score); no-repeat-ngram forbids repeating any n-gram; the frequency penalty (scales with the number of occurrences) and the presence penalty (flat, once seen) are the ones in the APIs. All of them fight Holtzman’s self-reinforcing loop.
  • Length. max/min new tokens; the length penalty in beam (without it, beam prefers short sequences); handling the end-of-sequence token (EOS): suppress it to force a minimum length, or bias toward it to finish sooner.
  • Classifier-free guidance (CFG) for text (Sanchez et al. 2023): borrowed from diffusion models. It amplifies the effect of the prompt by contrasting the logits with and without the condition:

\[ \text{logits} = \text{logits}_{\text{sin}} + \gamma\,\big(\text{logits}_{\text{con}} - \text{logits}_{\text{sin}}\big) \]

Term by term: logits_con = the logits with the prompt (conditional); logits_sin = the logits without it (unconditional); \(\gamma\) = the guidance weight (how much to exaggerate the difference). \(\gamma=1\) recovers logits_con, that is, normal generation; \(\gamma>1\) pushes the text to “stay on topic” with the prompt.

  • Others: logit bias (force or veto specific tokens) and stopping criteria (stop strings, EOS, max tokens).
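The levers in this section all act on the logits before sampling, so they are easy to sketch as small functions. The code below is an illustration, not any library's implementation: the function names are hypothetical, the sign handling in repetition_penalty is a common implementation choice, and the additive frequency/presence form is one widely used formulation that is assumed here.

```python
from collections import Counter

def repetition_penalty(logits, seen_ids, penalty=1.2):
    """Lower the logits of tokens that already appeared in the output."""
    out = list(logits)
    for i in set(seen_ids):
        # Dividing a negative logit would raise it, so negatives are multiplied.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def frequency_presence_penalty(logits, seen_ids, frequency=0.5, presence=0.5):
    """Subtract a per-occurrence term plus a flat term for any seen token."""
    out = list(logits)
    for i, count in Counter(seen_ids).items():
        out[i] -= frequency * count + presence
    return out

def cfg_logits(logits_con, logits_sin, gamma=1.5):
    """Classifier-free guidance: sin + gamma * (con - sin), element by element."""
    return [s + gamma * (c - s) for c, s in zip(logits_con, logits_sin)]

logits = [2.0, -1.0, 0.5]
seen = [0, 1, 1]  # token 0 seen once, token 1 seen twice, token 2 never

print(repetition_penalty(logits, seen, penalty=2.0))   # [1.0, -2.0, 0.5]
print(frequency_presence_penalty(logits, seen))        # [1.0, -2.5, 0.5]
print(cfg_logits([2.0, 0.0], [1.0, 1.0], gamma=1.0))   # [2.0, 0.0]
print(cfg_logits([2.0, 0.0], [1.0, 1.0], gamma=2.0))   # [3.0, -1.0]
```

The last two lines show the role of γ: at γ = 1 the result is exactly logits_con, and at γ = 2 the gap between the conditional and unconditional logits is doubled.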

30.6 Constrained decoding: enforcing a valid format

Sometimes you need the output to obey a schema (JSON, a regex, a grammar). The idea is simple and powerful: at every step, you zero out the probability of tokens that would violate the constraint and renormalize over the valid ones. That way the output is guaranteed to fit.

  • Outlines (Willard and Louf 2023): it compiles the regex/grammar into a finite-state machine and precomputes the mask of valid tokens per state, so regex- or JSON-schema-guided generation adds almost no cost per step.
  • Grammar-constrained decoding (Geng et al. 2023): formal grammars (llama.cpp’s GBNF style). “JSON mode” and function calling are productized cases of this.
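The “mask + renormalize” step can be shown on a toy vocabulary. This is a sketch under simplifying assumptions: the constraint is checked by calling a predicate on every token at every step (real systems such as Outlines precompute the masks), and VOCAB, is_allowed and constrained_probs are hypothetical names invented for the example.

```python
import math

VOCAB = ["7", "a", "42", "{", "x1", "0"]  # hypothetical toy vocabulary

def is_allowed(prefix, token):
    """Toy constraint: the whole output must consist of digits only."""
    return (prefix + token).isdigit()

def constrained_probs(logits, prefix):
    """Mask tokens that would violate the constraint, then renormalize."""
    masked = [z if is_allowed(prefix, tok) else -math.inf
              for z, tok in zip(logits, VOCAB)]
    valid = [z for z in masked if z != -math.inf]
    if not valid:
        raise ValueError("no token satisfies the constraint at this step")
    m = max(valid)
    exps = [math.exp(z - m) for z in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# The model strongly prefers "a" (logit 5.0), but "a" is not a digit.
probs = constrained_probs([1.0, 5.0, 0.5, 3.0, 2.0, 0.0], prefix="")
best = VOCAB[max(range(len(probs)), key=lambda i: probs[i])]
print(best, [round(p, 3) for p in probs])
```

The token the model liked most is removed and the remaining mass is shared among "7", "42" and "0". The output is guaranteed to be digits, but nothing in the mask checks whether it is the right number.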

🧩 Analogy. It is an autocomplete that physically cannot type an invalid character —like a form field that only accepts digits—.

Warning ⚠ Honest: format does not guarantee content

Constraints ensure shape, not truth: a perfectly valid JSON can contain an incorrect answer. And there is an open debate: Let Me Speak Freely? (Tam et al. 2024) found that constraining the format can degrade reasoning (JSON mode helps with classification but hurts reasoning tasks). Others counter that careful schema design recovers it. Treat it as a trade-off, not as settled: constrain when you need flawless parsing, but measure whether you lose content quality.

30.7 Speculative decoding: the same output, faster

Generation is sequential (one token after another), and that is where the latency bottleneck lives. Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) attacks it without changing the result: a small draft model proposes several tokens, and the big model scores them in parallel in a single pass. Each draft token then passes or fails an accept/reject test that compares the two models’ probabilities; the accepted prefix is kept, and at the first rejection the token is resampled from a corrected distribution. The output distribution is identical to that of the big model alone; it just arrives faster. It is one of the most widely used lossless speedups in inference.
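The verification step can be sketched as follows, based on the accept/reject rule of speculative sampling: accept a draft token with probability min(1, p/q), and on rejection resample from the normalized positive part of p − q. The sketch assumes the per-position distributions are handed in as plain lists (in a real system they come from the two models' forward passes); verify and residual are hypothetical names.

```python
import random

def residual(p, q):
    """Normalized max(0, p - q): where to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def verify(draft_tokens, q_dists, p_dists, rng):
    """Check draft tokens against the big model.

    q_dists[i] / p_dists[i]: draft / big-model distribution at draft position i.
    p_dists carries one extra entry, used for a bonus token if all drafts pass.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the draft token
        else:
            ids = list(range(len(p)))
            out.append(rng.choices(ids, weights=residual(p, q), k=1)[0])
            return out       # stop at the first rejection
    ids = list(range(len(p_dists[-1])))
    out.append(rng.choices(ids, weights=p_dists[-1], k=1)[0])
    return out

# Empirical check on a 2-token vocabulary: the draft is badly calibrated,
# yet the emitted token should follow the big model's distribution p.
rng = random.Random(0)
p, q = [0.8, 0.2], [0.5, 0.5]
n, hits = 20000, 0
for _ in range(n):
    draft = rng.choices([0, 1], weights=q, k=1)[0]
    hits += verify([draft], [q], [p, p], rng)[0] == 0
freq_token0 = hits / n
print(round(freq_token0, 3))
```

The arithmetic behind the check: token 0 is drafted half the time and always accepted (0.5); token 1 is drafted half the time and rejected with probability 0.6, in which case the residual always yields token 0 (0.3). Together that gives 0.8, exactly p, which is why the speedup is lossless.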

🧩 Analogy. A junior drafts a sentence and the expert reviews it at a glance, keeping the correct part: the same final words, in much less time.

30.8 Evaluating generated text

How do you know whether an output is good? It is hard, because in open-ended generation there is no single reference. Three families:

  • Reference-based (overlap with a “correct” text): BLEU (Papineni et al. 2002) (n-gram precision, translation), ROUGE (Lin 2004) (n-gram recall, summarization), BERTScore (Zhang et al. 2020) (embedding similarity, captures paraphrase). Honest: BLEU/ROUGE correlate weakly with human judgment on open-ended text.
  • Model-based (LLM-as-judge): G-Eval (Liu et al. 2023) (GPT-4 scores the text using chain-of-thought reasoning) and MT-Bench / Chatbot Arena (Zheng et al. 2023) (pairwise comparison; Chatbot Arena ranks models with Elo ratings from human votes). Honest: LLM judges have biases: position bias (they favor the first answer), verbosity bias (they favor the longer one) and self-preference (they favor their own model family).
  • Faithfulness / hallucination: this is a separate axis from fluency; an answer can sound perfect and be false.
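To see why overlap metrics struggle with open-ended text, here is a minimal sketch of the clipped n-gram counting that sits at the core of BLEU (precision) and ROUGE-N (recall). It deliberately omits everything else (BLEU's brevity penalty and averaging over n-gram orders, stemming, multiple references), and the whitespace tokenization and the name overlap are simplifying assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram matches."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum((cand & ref).values())  # & clips each count to the minimum
    precision = matched / max(1, sum(cand.values()))
    recall = matched / max(1, sum(ref.values()))
    return precision, recall

reference = "the cat sat on the mat"
print(overlap("the the the cat", reference))             # (0.75, 0.5)
print(overlap("a feline rested on the rug", reference))  # low, despite same meaning
```

The degenerate candidate scores higher than the faithful paraphrase, because only surface n-grams are counted. That is the gap embedding-based metrics such as BERTScore try to close.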

Note 🧪 Try it: tafagent

The decoding strategy does not change the model’s attention geometry; but the length of what you generate does drive up the cost of the KV-cache. tafagent computes, from γ (Ch. 15-20), the KV budget at the target length: useful for anticipating how much memory a long generation will cost before you launch it.
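For a back-of-the-envelope feel of how length drives KV-cache memory, the standard count for a dense cache is two tensors (keys and values) per layer, each of size KV heads × head dimension per token. The sketch below is not tafagent's computation, which works from γ; it is a plain arithmetic estimate, and the model shape used (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Estimate the size of a dense KV-cache in bytes."""
    # 2 = one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape, 16-bit values, one sequence of 8192 tokens.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
gib = size / 2**30
print(size, gib)
```

The cost is linear in the sequence length and in the batch size: doubling the target length of a generation doubles the cache, whatever decoding strategy produced the tokens.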

30.9 Summary

  • Probability ≠ quality (Holtzman et al. 2020): maximizing (greedy/beam) degenerates on open-ended text; human text does not follow the probability ridge → sample.
  • When: deterministic for closed/factual/code; sampling for open-ended/creative.
  • New sampling: min-p (threshold relative to the top, robust to high temperature), typical, eta/epsilon, contrastive search/decoding. Recipe: temp≈0.7–1.0, top-p≈0.9; they combine and the order matters.
  • Control: repetition/frequency/presence penalty, no-repeat-ngram; length and EOS; text CFG (sin + γ·(con − sin)); logit bias.
  • Constrained: mask invalid tokens + renormalize (Outlines/grammars, JSON mode) → guaranteed shape; but shape ≠ truth, and it can hurt reasoning (Tam et al. 2024).
  • Speculative: draft proposes, big one verifies in parallel → same output, faster.
  • Evaluate: BLEU/ROUGE/BERTScore (reference-based; BLEU/ROUGE correlate weakly with human judgment), LLM-as-judge (G-Eval, MT-Bench, with position, verbosity and self-preference biases); faithfulness is a separate axis.

Next (Chapter 31): without touching the weights, the prompt itself programs the model, through prompting, in-context learning, chain-of-thought and the induction heads that make it possible.

30.10 Exercises

  1. Probability ≠ quality. Why do greedy/beam degenerate in open-ended generation? What did Holtzman show about human text?
  2. Min-p vs top-p. How does the min-p threshold differ from the top-p one, and why does min-p hold up better at high temperatures?
  3. Text CFG. In logits_sin + γ·(logits_con − logits_sin), what does \(\gamma\) do? What happens with \(\gamma=1\)?
  4. Constrained. Explain the “mask + renormalize” mechanism. Why does a valid JSON not guarantee a correct answer?
  5. Speculative. Why does speculative decoding produce exactly the same distribution as the big model alone?
  6. LLM judges. Cite two biases of using an LLM as an evaluator.

References

Chen, Charlie, Sebastian Borgeaud, Geoffrey Irving, et al. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. https://arxiv.org/abs/2302.01318.
Geng, Saibo, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. “Grammar-Constrained Decoding for Structured NLP Tasks Without Finetuning.” EMNLP. https://arxiv.org/abs/2305.13971.
Hewitt, John, Christopher D. Manning, and Percy Liang. 2022. “Truncation Sampling as Language Model Desmoothing.” EMNLP Findings. https://arxiv.org/abs/2210.15191.
Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” ICLR. https://arxiv.org/abs/1904.09751.
Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. https://arxiv.org/abs/1909.05858.
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2023. “Fast Inference from Transformers via Speculative Decoding.” ICML. https://arxiv.org/abs/2211.17192.
Li, Xiang Lisa, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, et al. 2022. Contrastive Decoding: Open-Ended Text Generation as Optimization. https://arxiv.org/abs/2210.15097.
Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL Workshop (Text Summarization Branches Out).
Liu, Yang, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. “G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment.” EMNLP. https://arxiv.org/abs/2303.16634.
Meister, Clara, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. “Locally Typical Sampling.” TACL. https://arxiv.org/abs/2202.00666.
Nguyen, Minh, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. 2025. “Turning up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs.” ICLR. https://arxiv.org/abs/2407.01082.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: A Method for Automatic Evaluation of Machine Translation.” ACL.
Sanchez, Guillaume, Honglu Fan, Alexander Spangher, et al. 2023. Stay on Topic with Classifier-Free Guidance. https://arxiv.org/abs/2306.17806.
Su, Yixuan, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022. “A Contrastive Framework for Neural Text Generation.” NeurIPS. https://arxiv.org/abs/2202.06417.
Tam, Zhi Rui et al. 2024. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. https://arxiv.org/abs/2408.02442.
Willard, Brandon T., and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. https://arxiv.org/abs/2307.09702.
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. “BERTScore: Evaluating Text Generation with BERT.” ICLR. https://arxiv.org/abs/1904.09675.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS. https://arxiv.org/abs/2306.05685.