28 Instruction tuning and alignment

Where we are. In Ch. 26 we adapted a model to a task (classify, represent). But a base model, however smart, doesn’t know how to obey: it completes text rather than responding to what you ask. This chapter covers how it is taught to follow instructions (SFT) and to align with what we prefer (RLHF and DPO), with each formula explained term by term and the failure modes told plainly. (LoRA/PEFT, the cheap way to do the tuning, is covered in Ch. 29.)

28.1 The idea in one sentence

A base model knows an enormous amount but only “keeps writing”; aligning it is teaching it, in two steps, to actually answer the question (with examples) and then to have good judgment (learning which answers we prefer).

28.2 Key concepts and their role in the transformer

Before getting into the details, we define this chapter’s terms and what each one is for inside a transformer:

Alignment gap. Definition: the distance between what the base model does (complete text) and what we want (obey). In the transformer: pre-training gives knowledge, but not behavior that is useful, obedient, and harmless.
SFT (instruction tuning). Definition: tune with (instruction, desired response) pairs. In the transformer: it teaches format and the act of obeying; it surfaces capabilities, it doesn’t inject new knowledge.
Reward model (Bradley-Terry). Definition: a judge \(r_\phi(x,y)\) trained from human comparisons. In the transformer: it turns thousands of scattered human judgments into an automatic, continuous signal.
RLHF / PPO. Definition: optimize the model (the policy) against that judge with reinforcement learning. In the transformer: it maximizes the judge’s score to shape the behavior —but it’s expensive: four models in memory.
KL leash (\(\beta\)). Definition: a penalty for straying from the reference model. In the transformer: it prevents reward hacking and keeps the model talking sensibly.
Reward hacking / Goodhart’s law. Definition: exploiting the judge —an imperfect proxy— to get a high score with degenerate outputs. In the transformer: it’s the reason the KL leash exists.
DPO. Definition: optimize the preference pairs directly, without a reward model or RL (an implicit reward). In the transformer: the same goal as RLHF, but as a simple supervised loss —dominant in open models.
Alignment tax and failure modes. Definition: aligning can degrade capabilities and produce sycophancy, reward hacking, or less diversity. In the transformer: the cost and risks of this style/behavior layer.

With those terms in hand, let’s get to the details.

28.3 The alignment gap

Pre-training (Ch. 25) optimizes one single thing: predicting the next token over internet text. That’s why a base model completes text according to the distribution of the web rather than attending to your intent. If you ask it to “explain the moon to a child”, it might continue with more similar questions instead of answering, because that’s what usually comes next in its corpus. Pre-training gives knowledge and capability; it doesn’t give the behavior of being useful, obedient, and harmless. Closing that gap is alignment.

The canonical recipe has three stages (InstructGPT; Ouyang et al. 2022):

1. SFT: tune with demonstrations (instruction → response) written by humans.
2. Reward model: train a judge from human comparisons.
3. RL (PPO): optimize the model against that judge, with a leash that holds it back.

Today stage 3 can be replaced by DPO, which fuses reward + RL into a single loss. Let’s look at them.

🧩 Analogy — the sage who learns to answer and to have manners. Imagine someone who has read half a library: they know an enormous amount, but only know how to keep writing. A tutor teaches them with examples to answer the question they’re asked and not ramble (SFT). Then they learn good judgment with feedback: someone rates their answers and tells them which was better, over and over, until they internalize the judgment (RLHF/DPO).

28.4 Stage 1 — SFT (instruction tuning)

Supervised fine-tuning is the same “predict the next token” from Ch. 11, but over (instruction, desired response) pairs. It teaches the model two things: the format of a good response and the act of obeying the instruction.
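A minimal sketch of that loss, in plain Python. It assumes we already have the probability the model assigned to each correct token, and it averages the cross-entropy over the response tokens only. Masking out the instruction tokens is a common practice assumed here for illustration, not something this chapter prescribes; `sft_loss` and the numbers are hypothetical.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean next-token cross-entropy over response tokens only.

    token_probs[i] : probability the model gave to the correct i-th token
    is_response[i] : True if token i belongs to the desired response
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Hypothetical example: 3 instruction tokens followed by 4 response tokens.
mask = [False, False, False, True, True, True, True]
before = sft_loss([0.20, 0.10, 0.30, 0.50, 0.25, 0.80, 0.90], mask)
after = sft_loss([0.20, 0.10, 0.30, 0.90, 0.90, 0.95, 0.99], mask)

print(f"loss before: {before:.4f}  loss after: {after:.4f}")
```

The instruction tokens still condition every prediction; in this sketch they simply do not contribute to the loss, so only the response probabilities move it.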

Its power was seen with FLAN (Wei et al. 2022): tuning a model over 60+ tasks phrased as instructions makes it generalize zero-shot to unseen task types, and it beat zero-shot GPT-3 175B on 20 of 25 datasets. Scaling the number of tasks (to ~1,800) and the model size, and adding chain-of-thought data, improves it further (Chung et al. 2022).

🧠 Curiosity — 1,000 examples are (almost) enough

LIMA (Zhou et al. 2023) tuned a 65B model with exactly 1,000 carefully curated examples, without RLHF, and stayed competitive against models tuned with far more elaborate pipelines. Hence the superficial alignment hypothesis: “a model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users”. Translation: aligning doesn’t teach new things; above all, it surfaces and shapes what the model already knew.

That marks out what SFT does (elicit capabilities, fix style and format) and what it can’t do: inject knowledge the model doesn’t have. Using SFT to imitate the style of a stronger model fools the human evaluator but doesn’t close the real capability gap (Gudibande et al. 2023). Honesty: an answer that sounds expert is not an expert answer.

28.5 Stage 2 — the reward model (learning what we prefer)

How do we teach it “good judgment” if the quality of an answer isn’t a single label? We don’t ask for absolute scores, but comparisons: a human is shown two answers to the same question and says which they prefer. With those pairs a reward model \(r_\phi(x,y)\) is trained that assigns a number (the “score”) to each answer. The loss is Bradley-Terry:

\[ \mathcal{L}_R = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\,\log\sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big)\big] \]

Term by term:

\(x\) = the prompt (the instruction); \(y_w\) = the response the human preferred (the winner); \(y_l\) = the dispreferred response (the loser) from the same pair.
\(r_\phi(x,y)\) = the scalar reward the judge assigns to response \(y\). It’s what we’re learning (the weights \(\phi\)).
\(r_\phi(x,y_w) - r_\phi(x,y_l)\) = how much the preferred one wins over the rejected one. The loss wants this difference to be large and positive.
\(\sigma\) (sigmoid) = squashes that difference into a probability: it models “P(the human prefers \(y_w\))”. Minimizing the loss = making the judge order the answers the way the human does.

Its function: the reward model turns thousands of scattered human judgments into an automatic, continuous signal that can guide training without a human at every step.
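The Bradley-Terry loss above can be sketched in a few lines. This is a toy, not a real reward model: instead of a network \(r_\phi\), it keeps one learnable score per response in a hypothetical `scores` table and updates it by gradient descent on three hand-written comparisons. The learning rate and the number of passes are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(r_w, r_l):
    """-log sigma(r_w - r_l) for a single comparison."""
    return -math.log(sigmoid(r_w - r_l))

def bt_grad(r_w, r_l):
    """Gradient of the loss with respect to (r_w, r_l)."""
    p = sigmoid(r_w - r_l)          # modelled P(human prefers y_w)
    return -(1.0 - p), (1.0 - p)

# Toy "reward model": one scalar score per response, all starting at 0.
scores = {"a": 0.0, "b": 0.0, "c": 0.0}
pairs = [("a", "b"), ("b", "c"), ("a", "c")]   # (winner, loser)
lr = 0.5

for _ in range(200):
    for w, l in pairs:
        g_w, g_l = bt_grad(scores[w], scores[l])
        scores[w] -= lr * g_w       # pushes the winner's score up
        scores[l] -= lr * g_l       # pushes the loser's score down

mean_loss = sum(bt_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(scores, f"mean loss: {mean_loss:.4f}")
```

Only score differences enter the loss, so the absolute level of the scores is arbitrary; what the judge learns is an ordering.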

28.6 Stage 3 — PPO with a KL leash (optimizing against the judge)

With the judge ready, we tune the model (the policy \(\pi_\theta\)) so it maximizes the score —but with a crucial restraint. The objective of RLHF is:

\[ \max_{\pi_\theta}\ \mathbb{E}_{x,\,y\sim\pi_\theta}\big[r_\phi(x,y)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\big] \]

Term by term:

\(\mathbb{E}[r_\phi(x,y)]\) = the average score the judge gives to the answers the model generates. Raising it is the objective.
\(\pi_{\mathrm{ref}}\) = the frozen reference model (the SFT from stage 1).
\(\mathbb{D}_{\mathrm{KL}}[\pi_\theta \| \pi_{\mathrm{ref}}]\) = how much the model has strayed from that reference (a “distance” between distributions).
\(\beta\) = the strength of the leash: high \(\beta\) = very tied to the reference; low \(\beta\) = free to chase the score.

Why the KL leash: the judge is an imperfect proxy. Without restraint, the model cheats the judge (reward hacking): it finds degenerate outputs that nonetheless get sky-high scores. This is Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure”. The degradation is systematic enough that it has its own scaling laws (Gao et al. 2023). The KL term keeps the model close to talking sensibly.
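To see the leash at work, here is a toy version of the objective over just three possible responses. Everything in it is made up for illustration: the response names, the reference probabilities, and the rewards (the judge is assumed to wrongly love the gibberish). The helper `optimal_policy` uses the known closed-form maximizer of this objective, \(\pi^*(y)\propto\pi_{\mathrm{ref}}(y)\exp(r(y)/\beta)\), rather than running PPO.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def objective(policy, pi_ref, rewards, beta):
    """Expected reward minus beta times the KL from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, pi_ref)

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

responses = ["sensible", "better", "gibberish"]
pi_ref = [0.600, 0.399, 0.001]     # the reference almost never writes gibberish
rewards = [1.0, 2.0, 10.0]         # but the flawed judge scores it highest

loose = optimal_policy(pi_ref, rewards, beta=0.1)   # weak leash
tight = optimal_policy(pi_ref, rewards, beta=5.0)   # strong leash

for name, p_loose, p_tight in zip(responses, loose, tight):
    print(f"{name:10s} beta=0.1: {p_loose:.4f}   beta=5.0: {p_tight:.4f}")
```

With the weak leash almost all probability mass moves to the gibberish response; with the strong leash the policy stays close to the reference and only shifts modestly toward the higher-scoring sensible answer.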

🧩 Analogy — the coach and the leash. The reward model is a coach who scores each answer. The KL penalty is a leash that keeps the student from earning points by spouting gibberish that fools the coach: it keeps it close to talking like the reference model (sensibly).

⚠ Why PPO is heavy

PPO is on-policy reinforcement learning: it optimizes over samples that the model itself generates against a non-differentiable signal. The cost: you have to keep four models in memory at once —the policy, the frozen reference, the reward model, and a value network (critic). It’s complex, expensive, and unstable. That’s exactly what the alternatives attack.
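A back-of-the-envelope sketch of that cost, counting weights only. The assumptions are loud ones: all four models are taken to be the same size (7B parameters, a hypothetical figure), stored at 2 bytes per parameter, and optimizer states, gradients, and activations are ignored, so real footprints are larger. The two-model line is the policy plus frozen reference that a preference loss with a reference model would need.

```python
def weights_gb(n_params, n_models=1, bytes_per_param=2):
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_models * n_params * bytes_per_param / 1e9

N = 7e9  # hypothetical model size; not a figure from this chapter

one_model = weights_gb(N)
ppo_setup = weights_gb(N, n_models=4)   # policy + reference + reward model + critic
two_models = weights_gb(N, n_models=2)  # policy + reference only

print(f"single model: {one_model:.0f} GB")
print(f"PPO (4 models): {ppo_setup:.0f} GB")
print(f"policy + reference: {two_models:.0f} GB")
```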

28.7 DPO — skipping the coach

Direct Preference Optimization (Rafailov et al. 2023) had a beautiful idea: you can optimize the same objective as RLHF directly over the preference pairs, without training a reward model and without an RL loop.

The mathematical trick: the optimum of the RLHF objective with a KL leash has a closed form, which lets you rewrite the reward as a function of the policy itself. Substituting it into Bradley-Terry, the reward model cancels out and a simple classification loss remains:

\[ \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big] \]

Term by term:

\(\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\) = how much more probable the model makes response \(y\) relative to the reference. This is, in fact, an implicit reward: \(\hat r(x,y)=\beta\log\tfrac{\pi_\theta}{\pi_{\mathrm{ref}}}\). The model itself is, secretly, its own reward model.
The loss raises the probability of the preferred \(y_w\) and lowers that of the rejected \(y_l\) —it pushes the implicit reward of \(y_w\) above that of \(y_l\), just like Bradley-Terry, but with no separate judge.
\(\beta\) = the same role as in PPO: how far it can stray from the reference.
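The closed form mentioned above can be written out. What follows is a sketch of the derivation in Rafailov et al. (2023), using the same symbols as the other formulas in this chapter; \(\pi^*\) (the optimal policy) and \(Z(x)\) (a normalizer) are extra notation introduced only here.

The maximizer of the KL-constrained objective is the reference distribution reweighted by the exponentiated reward:

\[ \pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big), \qquad Z(x)=\sum_{y}\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big) \]

Taking logarithms and solving for the reward:

\[ r(x,y) = \beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x) \]

Bradley-Terry only looks at the difference of rewards for the same prompt \(x\), so the intractable term \(\beta\log Z(x)\) cancels:

\[ r(x,y_w)-r(x,y_l) = \beta\log\frac{\pi^*(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi^*(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} \]

Replacing \(\pi^*\) with the trainable \(\pi_\theta\) and putting this difference inside \(-\log\sigma(\cdot)\) gives exactly \(\mathcal{L}_{\mathrm{DPO}}\).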

Why it became dominant in open models (Zephyr, Tunstall et al. 2023, among others): it’s a supervised, offline loss, with no reward-model training, no sampling during training, and no RL loop; in the authors’ words, “stable, performant, and computationally lightweight”.
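As a sketch of how little machinery the loss needs, here it is for a single preference pair in plain Python. The inputs are assumed to be sequence log-probabilities \(\log\pi(y\mid x)\) already computed by the policy and by the frozen reference; the function name `dpo_loss`, the value of `beta`, and the numbers are all hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Returns the loss and the two implicit rewards beta * log(pi_theta / pi_ref).
    """
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l), reward_w, reward_l

# Hypothetical log-probabilities under the frozen reference model.
ref_w, ref_l = -12.0, -10.0

# At the start the policy equals the reference: both implicit rewards are 0.
start_loss, _, _ = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Later the policy has raised y_w and lowered y_l relative to the reference.
later_loss, r_w, r_l = dpo_loss(-11.0, -11.5, ref_w, ref_l)

print(f"start: {start_loss:.4f}  later: {later_loss:.4f}  "
      f"implicit rewards: {r_w:+.3f} vs {r_l:+.3f}")
```

Note that the loss depends on how each response moved relative to the reference, not on which response is more probable in absolute terms: in the example the rejected response is still about as likely as the preferred one, yet the loss has already dropped.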

🧩 Analogy — no coach. Instead of hiring a separate judge and then playing against it, the student learns directly from many comparisons “this answer was liked more than that one”, raising the probability of the preferred one and lowering that of the other.

⚠ Honest — the DPO vs PPO debate isn’t settled

DPO is simpler, but it isn’t unanimously “better”:

Pro-PPO: careful studies find that PPO beats DPO in hard cases (DPO can exploit out-of-distribution responses) (Xu et al. 2024), and that the hierarchy of levers is preference data quality > algorithm > RM quality (Ivison et al. 2024).
Pro-DPO/on-policy: what matters most may be using on-policy data (generated by the model itself), more than the specific algorithm (Tajwar et al. 2024); iterative/online DPO narrows the gap.

Conciliatory reading: PPO has a higher ceiling but costs more; DPO is simpler and often “enough”; and preference data matters more than the choice of method.

28.8 What changes (and what breaks) in the model

It’s a style/behavior layer, not a knowledge one (superficial alignment, LIMA). The knowing comes from pre-training.
Alignment tax: aligning can degrade raw capabilities on benchmarks. InstructGPT mitigates it with PPO-ptx (mixing in gradients from the pre-training distribution), achieving “minimal regressions” (Ouyang et al. 2022).
Failure modes (honest):
- Sycophancy: the model tends to tell you what you want to hear; the human preference data causes it, because we sometimes prefer the pleasant answer over the correct one (Sharma et al. 2023; Perez et al. 2022).
- Reward hacking (already seen): exploiting the judge.
- Less diversity: RLHF reduces the variety of outputs versus SFT —it’s a trade-off: it improves out-of-distribution generalization but narrows the range (Kirk et al. 2024).

28.9 The 2024-2026 landscape (briefly)

RLHF and DPO are the load-bearing ideas; the rest are variants that cut cost or change the signal:

RLAIF / Constitutional AI (Bai et al. 2022; Lee et al. 2023): the feedback is given by another AI guided by a written “constitution”, instead of humans.
Preference variants: IPO (Azar et al. 2023), KTO (a binary signal without pairs) (Ethayarajh et al. 2024), ORPO (SFT + preference in one step, no reference) (Hong et al. 2024), SimPO (reference-free reward) (Meng et al. 2024).
GRPO (Shao et al. 2024): a PPO variant that drops the critic and estimates the advantage from a group of sampled outputs —the RL engine of DeepSeek-R1 (DeepSeek-AI 2025) for eliciting reasoning.
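The group-based advantage in GRPO can be sketched as follows: sample several outputs for the same prompt, score them, and standardize each reward against the group’s mean and standard deviation, which stands in for the critic’s baseline. The choice of population standard deviation and the small `eps` guard are assumptions of this sketch, and the rewards are made up.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of outputs for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)          # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: 4 sampled answers, scored 1 if correct and 0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)

# If every sample gets the same reward, there is nothing to prefer.
flat = group_advantages([0.5, 0.5, 0.5])
print(flat)
```

Outputs above the group average get a positive advantage and those below get a negative one, with no value network involved.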

🧪 Try it — tafagent

tafagent profiles an already-aligned model and compares it with its base by γ and regime: since alignment is above all a style layer (it doesn’t change knowledge), you’ll see that its attention profile across distance (Ch. 15-20) barely moves relative to the base —an empirical check of the superficial alignment hypothesis.

28.10 Summary

Alignment gap: the base model completes text, it doesn’t obey. Recipe: SFT → reward → RL (PPO), or SFT → DPO.
SFT: tune with (instruction, response); it teaches format and obedience, generalizes (FLAN); ~1,000 examples can be enough (LIMA) → aligning surfaces what the model already knows; it doesn’t teach new knowledge.
RLHF: a reward model (Bradley-Terry: −log σ(r_w − r_l)) learns what we prefer; PPO maximizes that score with a KL leash (−β·KL) that avoids reward hacking. Expensive: 4 models in memory.
DPO: rewrites the reward as β·log(π_θ/π_ref) and optimizes the pairs directly —no RM, no RL. Dominant in open models. Honest: open DPO vs PPO debate; the data rules.
What changes: style/behavior (not knowledge); beware alignment tax, sycophancy, reward hacking, and less diversity.

Next (Chapter 29): all this is expensive if you retrain the whole model. PEFT (LoRA/QLoRA, adapters) achieves almost the same thing by touching a tiny fraction of the weights.

28.11 Exercises

The gap. Why might a well-trained base model not respond to an instruction even though it “knows” the answer? Which stage fixes it?
Bradley-Terry. In −log σ(r(x,y_w) − r(x,y_l)), what does the loss push to happen between \(y_w\) and \(y_l\)? Why are comparisons used and not absolute scores?
The leash. What would happen without the −β·KL term in the PPO objective? Explain reward hacking with Goodhart’s law.
DPO. Which two components of classic RLHF does DPO eliminate, and how? What is the “implicit reward”?
Honesty. Cite two alignment failure modes and why the human preference data can cause them.

References

Azar, Mohammad Gheshlaghi et al. 2023. A General Theoretical Paradigm to Understand Learning from Human Preferences. https://arxiv.org/abs/2310.12036.

Bai, Yuntao et al. 2022. Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073.

Chung, Hyung Won et al. 2022. Scaling Instruction-Finetuned Language Models. https://arxiv.org/abs/2210.11416.

DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948.

Ethayarajh, Kawin et al. 2024. KTO: Model Alignment as Prospect Theoretic Optimization. https://arxiv.org/abs/2402.01306.

Gao, Leo, John Schulman, and Jacob Hilton. 2023. Scaling Laws for Reward Model Overoptimization. https://arxiv.org/abs/2210.10760.

Gudibande, Arnav et al. 2023. The False Promise of Imitating Proprietary LLMs. https://arxiv.org/abs/2305.15717.

Hong, Jiwoo, Noah Lee, and James Thorne. 2024. ORPO: Monolithic Preference Optimization Without Reference Model. https://arxiv.org/abs/2403.07691.

Ivison, Hamish et al. 2024. “Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.” NeurIPS. https://arxiv.org/abs/2406.09279.

Kirk, Robert et al. 2024. “Understanding the Effects of RLHF on LLM Generalisation and Diversity.” ICLR. https://arxiv.org/abs/2310.06452.

Lee, Harrison et al. 2023. RLAIF Vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. https://arxiv.org/abs/2309.00267.

Meng, Yu, Mengzhou Xia, and Danqi Chen. 2024. SimPO: Simple Preference Optimization with a Reference-Free Reward. https://arxiv.org/abs/2405.14734.

Ouyang, Long et al. 2022. Training Language Models to Follow Instructions with Human Feedback. https://arxiv.org/abs/2203.02155.

Perez, Ethan et al. 2022. Discovering Language Model Behaviors with Model-Written Evaluations. https://arxiv.org/abs/2212.09251.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS. https://arxiv.org/abs/2305.18290.

Shao, Zhihong et al. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300.

Sharma, Mrinank et al. 2023. Towards Understanding Sycophancy in Language Models. https://arxiv.org/abs/2310.13548.

Tajwar, Fahim et al. 2024. “Preference Fine-Tuning of LLMs Should Leverage Suboptimal, on-Policy Data.” ICML. https://arxiv.org/abs/2404.14367.

Tunstall, Lewis et al. 2023. Zephyr: Direct Distillation of LM Alignment. https://arxiv.org/abs/2310.16944.

Wei, Jason et al. 2022. “Finetuned Language Models Are Zero-Shot Learners.” ICLR. https://arxiv.org/abs/2109.01652.

Xu, Shusheng et al. 2024. “Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study.” ICML. https://arxiv.org/abs/2404.10719.

Zhou, Chunting et al. 2023. “LIMA: Less Is More for Alignment.” NeurIPS. https://arxiv.org/abs/2305.11206.