31  Prompting and in-context learning

Where we are. We are still in Part V (using the model). The previous chapter dealt with how the model decodes; this one deals with what you put in the prompt. We will see the most astonishing phenomenon of LLMs —learning from a few examples in the context, without touching a single weight—, what actually makes it work (with a bridge to interpretability: induction heads), the reasoning techniques (chain-of-thought, self-consistency, tree-of-thoughts) and the honest limits of prompting.

31.1 The idea in one sentence

An LLM can learn a task from a few examples in its own context, at inference time and with no training step whatsoever —but prompting elicits capabilities the model already has, it does not add knowledge—.

31.2 Key concepts and their role in the transformer

Before diving in, we define this chapter’s terms and what each one is for inside a transformer:

  • In-context learning (ICL). Definition: solving a task from examples in the prompt itself, without updating weights. In the transformer: learning at inference time; it elicits capabilities, it does not add them.
  • Zero / one / few-shot. Definition: giving zero, one or several examples in the context. In the transformer: it tunes how much “demonstration” fits in the window to pin down the task.
  • Induction heads. Definition: attention heads that complete patterns “[A][B]…[A]→[B]”. In the transformer: the candidate mechanism behind ICL —they match the prefix and copy what came after it—.
  • Chain-of-Thought (CoT). Definition: asking the model to write the intermediate steps before answering. In the transformer: it unfolds the computation across tokens → better multi-step reasoning (emergent with scale).
  • Self-consistency. Definition: sampling several reasoning chains and voting for the majority. In the transformer: it cancels the random errors of a single chain.
  • Tree-of-Thoughts (ToT). Definition: exploring a tree of steps with search and backtracking. In the transformer: it adds planning/search to linear generation.
  • Prompt brittleness (order / format). Definition: the output’s sensitivity to cosmetic changes in the prompt. In the transformer: it warns that much of a “gain” may be format luck, not capability.
  • Prompt injection. Definition: hostile instructions slipped in as if they were data. In the transformer: it exploits the fact that the LLM does not separate data from instructions.

With that in mind, let’s take it piece by piece.

31.3 In-context learning: learning without retraining

GPT-3 (Brown et al. 2020) uncovered it: by giving it a few examples of a task in the prompt, the model solves it without any weight update —“tasks and demonstrations are specified purely as text”—. Depending on how many examples you give:

  • Zero-shot: only the task description, no examples.
  • One-shot: one example.
  • Few-shot: several (as many as fit in the context window).

The astonishing part is that this is learning at inference time: the model does not change; it just “figures out”, within a single pass, what you are asking it to do.

🧩 Analogy — the sticky note, not the course. It is like handing a smart new employee a couple of worked examples on a Post-it instead of sending them off to a training course. You do not rewire their brain (no weight update): you show them the pattern and they apply it on the fly.

31.4 What makes ICL work (the honest and surprising part)

Here comes the part that breaks intuition. You would assume ICL works because the model learns the correct input→label mapping from the examples. Well, not quite:

Warning⚠ Correct labels matter less than you think

Min et al. (Min et al. 2022) showed that replacing the example labels with RANDOM labels barely lowers performance (a 0-5 % drop, across 12 models including GPT-3). What really matters is the format, the label space and the distribution of the input text —not the correct pairing—. In other words: the examples teach the model what kind of task it is and how to answer, more than the answer itself. (Honest caveat: Yoo et al. (Yoo et al. 2022) counter that this depends on the setup —with terser templates or smaller models, correct labels do matter—; take it as nuanced, not absolute.)

31.4.1 The bridge to interpretability: induction heads

Is there an identifiable mechanism behind ICL? The strongest candidate is induction heads (Olsson et al. 2022), which already showed up in Ch. 5. They are attention heads that implement a pattern-completion rule:

“[A][B] … [A] → [B]”: “the last time I saw A, B followed; now I see A again → I predict B”.

It is built from two pieces (from the circuits framework (Elhage et al. 2021)): a previous token head + the induction head, combining prefix matching (attending to where the current token appeared before) and copying (raising the logit of what came after).

🧩 Analogy — the pattern-matching autocomplete. An induction head is like an autocomplete with memory: it remembers “last time I saw this, that came next” and copies it. That simple rule, repeated, explains a good part of why the model “gets” a pattern from your examples.

Note🧠 Curiosity — they emerge abruptly, in a phase change

The fascinating part (and it connects with our Ch. 24, grokking): induction heads form abruptly, in a phase change during training that coincides with a sharp jump in ICL ability —visible as a “bump” in the loss curve— (Olsson et al. 2022). The authors’ own honesty: the case that induction heads explain ICL is “stronger for small models than for large ones” (for the large ones the evidence is circumstantial).

And what is ICL, deep down? It is not settled. Several explanations coexist, partly reconcilable: implicit Bayesian inference —inferring a latent concept common to the examples— (Xie et al. 2022); implicit gradient descent —a linear attention layer is equivalent to a gradient step, “mesa-optimization”— (Oswald et al. 2023); and the evidence that transformers learn entire function classes in-context (linear, trees…) (Garg et al. 2022). Honest: the last two have been tested mostly in linear/toy settings, not on frontier LLMs.

31.5 Chain-of-Thought: asking it to reason out loud

For multi-step tasks (math, logic), one simple technique boosts performance: asking the model to show the intermediate steps before the answer.

  • Chain-of-Thought (CoT) (Wei et al. 2022): include the step-by-step reasoning in the examples. On GSM8K (math problems) with PaLM-540B, it went from 17.9 % (normal prompting) to 56.9 %. Honest key point: it is a capability that emerges with scale (~100B parameters); on small models it does not help —they produce fluent but illogical chains—.
  • Zero-shot CoT (Kojima et al. 2022): you do not even need examples; just add “Let’s think step by step”. On GSM8K it rose from 10.4 % to 40.7 % with that phrase alone.

🧩 Analogy — showing your work. CoT is like a math exam: instead of blurting out the result off the top of your head, you write out the working. Breaking the problem into steps makes each step easier and reduces “mental arithmetic” errors.

  • Self-consistency (Wang et al. 2023): instead of a single chain (greedy), sample several distinct reasoning chains and vote for the majority answer. A big improvement over plain CoT (+17.9 % on GSM8K with PaLM-540B).
  • Tree-of-Thoughts (ToT) (Yao, Yu, et al. 2023): it explores a tree of reasoning steps with search, self-evaluation and backtracking. On the Game of 24, GPT-4 went from 4 % (CoT) to 74 % (ToT). It shines on planning/search tasks.

(One variant, ReAct (Yao, Zhao, et al. 2023), interleaves reasoning with actions —using tools—; it is the bridge to Ch. 32, agents.)

🧩 Analogy — the majority vote. Self-consistency is asking several people to solve the problem separately and keeping the answer the majority agrees on: each one’s random errors cancel out, and the correct answer tends to recur.

Warning⚠ Honest — the written reasoning may not be the real one

CoT helps mostly on mathematical/symbolic/multi-step tasks, not on everything. And there is a faithfulness debate: the reasoning the model writes may not reflect what actually leads it to the answer. Turpin et al. (Turpin et al. 2023) biased the prompt (e.g. making the correct answer always “(A)”); the model shifted its answer toward the bias but did not mention it in its chain —it rationalized it—, with accuracy drops of up to 36 %. It is an active debate: CoT can be a post-hoc rationalization, not the trace of the real computation.

31.6 Prompt engineering: the practical (and the brittle)

Prompting works, but it is surprisingly brittle:

  • The order of the examples matters enormously (Lu et al. 2022): simply reordering the same few-shot examples can swing performance between near-SOTA and chance (on SST-2, from >85 % to ~50 %). And it does not go away with scale.
  • The format matters (Sclar et al. 2024): cosmetic changes (separators, casing, spaces) cause up to 76 points of accuracy difference. Much of the reported “gain” may be format luck, not capability.
  • Best practices: clear instructions, delimiters/structure, system/user roles.
  • (Soft prompts —prefixes learned by gradient— were covered in Ch. 28; here we are talking about text prompts.)

31.7 The limits of prompting

  • Brittleness (order, format, wording): it makes reliable evaluation hard and overestimates capabilities.
  • The ceiling: prompting elicits what the model learned in pretraining, it does not add knowledge —the superficial alignment hypothesis of Ch. 27 (Zhou et al. 2023)—.
  • Security: hostile prompts can hijack the application (prompt injection (Greshake et al. 2023)) because the LLM blurs the line between data and instructions —we will see this in the security chapter—.
  • When to use what: prompting to elicit capabilities and prototype; fine-tuning (Ch. 28) to lock in a format/style or to specialize; RAG (Ch. 31) to inject external or up-to-date knowledge the model does not have.
Note🧪 Try it — tafagent

ICL ability is born, according to (Olsson et al. 2022), in a phase change where induction heads form. tafagent includes an induction-head phase detector (based on \(\Delta\gamma\)): it tells you whether a model has already crossed that transition —that is, whether you can expect few-shot to work well— without having to retrain it.

31.8 Summary

  • ICL (Brown et al. 2020): learning from examples in the context without updating weights (zero/one/few-shot).
  • What drives it (honest): correct labels matter less than the format and the distribution (Min et al. 2022) (nuanced by (Yoo et al. 2022)); candidate mechanism = induction heads (the “[A][B]…[A]→[B]” rule, born in a phase change) (Olsson et al. 2022); what ICL is is not settled (Bayes / implicit GD / function classes).
  • Reasoning: CoT (emergent ~100B; 17.9→56.9 % GSM8K), zero-shot CoT (“let’s think step by step”), self-consistency (majority vote), ToT (search with backtracking). Honest: CoT can be unfaithful (rationalization) (Turpin et al. 2023).
  • Practice: brittle to order (Lu et al. 2022) and to format (Sclar et al. 2024).
  • Limits: it elicits, it does not add knowledge (the pretraining ceiling); prompt injection; prompting vs fine-tune vs RAG.

Next (Chapter 31): when the model needs knowledge it does not have (current facts, private documents), the prompt is not enough: you have to retrieve it and hand it over. We enter embeddings, semantic search and RAG.

31.9 Exercises

  1. No gradients. What does it mean that ICL happens “at inference time”? How does it differ from fine-tuning?
  2. Random labels. What did Min et al. demonstrate by shuffling the example labels? What three things do matter, then?
  3. Induction heads. Explain the “[A][B]…[A]→[B]” rule and why it is a good candidate for the ICL mechanism. For which models is the evidence stronger?
  4. CoT and scale. Why does chain-of-thought not help a small model? What is self-consistency and why does it improve things?
  5. Brittleness. Give two ways the same prompt can yield very different results without changing its content.
  6. Faithfulness. What did Turpin et al. show about the relationship between the written reasoning and the real answer?

References

Brown, Tom B. et al. 2020. “Language Models Are Few-Shot Learners (GPT-3).” NeurIPS. https://arxiv.org/abs/2005.14165.
Elhage, Nelson et al. 2021. “A Mathematical Framework for Transformer Circuits.” Transformer Circuits Thread (Anthropic). https://transformer-circuits.pub/2021/framework/index.html.
Garg, Shivam, Dimitris Tsipras, Percy Liang, and Gregory Valiant. 2022. “What Can Transformers Learn in-Context? A Case Study of Simple Function Classes.” NeurIPS. https://arxiv.org/abs/2208.01066.
Greshake, Kai, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. https://arxiv.org/abs/2302.12173.
Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. “Large Language Models Are Zero-Shot Reasoners.” NeurIPS. https://arxiv.org/abs/2205.11916.
Lu, Yao, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.” ACL. https://arxiv.org/abs/2104.08786.
Min, Sewon, Xinxi Lyu, Ari Holtzman, et al. 2022. “Rethinking the Role of Demonstrations: What Makes in-Context Learning Work?” EMNLP. https://arxiv.org/abs/2202.12837.
Olsson, Catherine et al. 2022. “In-Context Learning and Induction Heads.” Transformer Circuits Thread (Anthropic). https://arxiv.org/abs/2209.11895.
Oswald, Johannes von et al. 2023. “Transformers Learn in-Context by Gradient Descent.” ICML. https://arxiv.org/abs/2212.07677.
Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design.” ICLR. https://arxiv.org/abs/2310.11324.
Turpin, Miles, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” NeurIPS. https://arxiv.org/abs/2305.04388.
Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR. https://arxiv.org/abs/2203.11171.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS. https://arxiv.org/abs/2201.11903.
Xie, Sang Michael, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. “An Explanation of in-Context Learning as Implicit Bayesian Inference.” ICLR. https://arxiv.org/abs/2111.02080.
Yao, Shunyu, Dian Yu, Jeffrey Zhao, et al. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS. https://arxiv.org/abs/2305.10601.
Yao, Shunyu, Jeffrey Zhao, Dian Yu, et al. 2023. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR. https://arxiv.org/abs/2210.03629.
Yoo, Kang Min et al. 2022. “Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations.” EMNLP. https://arxiv.org/abs/2205.12685.
Zhou, Chunting et al. 2023. “LIMA: Less Is More for Alignment.” NeurIPS. https://arxiv.org/abs/2305.11206.