29 Parameter-efficient fine-tuning (PEFT)
Where we are. In Chs. 26-27 we tuned models (classification, SFT, alignment) assuming we touched all the weights. For a model with billions of parameters that is hugely expensive: a whole copy per task. This chapter closes Part IV with the trick that changed everything in practice: PEFT, adapting the model by touching a tiny fraction of the weights, with LoRA and QLoRA as the protagonists, explained term by term.
29.1 The idea in one sentence
Instead of retraining the whole model for each task, you freeze the base and train a tiny handful of new parameters: you get almost the same quality for a fraction of the cost, plus a checkpoint of a few megabytes.
29.2 Key concepts and their role in the transformer
Before getting into the details, we define this chapter’s terms and what each one is for inside a transformer:
- PEFT. Definition: tune by freezing the base and training only a tiny fraction of the weights. In the transformer: it avoids a whole copy per task, lowers training memory, and reduces forgetting.
- Adapters. Definition: small bottleneck modules inserted inside each layer. In the transformer: the precursor that worked; but being extra layers they add latency at inference.
- LoRA (low-rank delta). Definition: learn \(\Delta W = BA\), the product of two thin matrices of rank \(r\). In the transformer: since the fine-tuning update lives in a low intrinsic dimension, a cheap delta suffices instead of a dense matrix.
- Rank \(r\) and scale \(\alpha/r\). Definition: the delta’s capacity and the magnitude with which it’s added. In the transformer: the quality/size knob —turn it up on hard tasks like code or math.
- Fusion (zero latency). Definition: add \(BA\) into the base weight after training (\(W=W_0+\tfrac{\alpha}{r}BA\)). In the transformer: the fused model has the same shape as the original → no added latency, unlike adapters.
- QLoRA. Definition: LoRA over a base quantized to 4 bits (NF4 + double quantization + paged optimizers). In the transformer: it lets you tune a giant on a single GPU; the gradients flow through the 4-bit weights into the 16-bit adapters.
- Soft prompts (prefix / prompt tuning). Definition: learned input vectors (not text), which touch no weight. In the transformer: they adapt the model in an ultralight way and shine at large scale (one prefix per task).
With those terms in hand, let’s get to the details.
29.3 The problem: tuning the whole thing is hugely expensive
Tuning all the weights has three costs that explode with size:
- Storage: a copy of the full model for each task. Ten tasks = ten full-size models.
- Training memory: the optimizer (Adam) stores two states per parameter (momentum and variance), plus the gradients, plus the activations. Rule of thumb: full tuning with Adam costs ~12-16 bytes per parameter (weights + gradients + states), versus ~2 bytes/parameter for just the weights at inference.
- Forgetting: touching all the weights risks catastrophic forgetting (Ch. 26).
PEFT (parameter-efficient fine-tuning) attacks all three: it freezes the base and trains only a few new or selected parameters (Lialin et al. 2023). Since the optimizer states only exist for that trainable fraction, memory drops sharply; and you save/send tiny deltas, not whole models.
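To make the rule of thumb concrete, here is a rough back-of-the-envelope sketch. The byte counts (16-bit weights and gradients, two 32-bit Adam states), the 7B model size, and the 0.1% trainable fraction are illustrative assumptions, not measurements, and activations are ignored.

```python
def training_bytes(n_params, trainable_fraction=1.0,
                   weight_bytes=2, grad_bytes=2, adam_state_bytes=8):
    """Rough training-memory estimate in bytes, ignoring activations.

    Assumptions (illustrative): 16-bit weights and gradients, and two
    32-bit Adam states (momentum + variance) per trainable parameter.
    The PEFT parameters' own weights are negligible and left out.
    """
    n_trainable = n_params * trainable_fraction
    weights = n_params * weight_bytes        # every weight must stay loaded
    grads = n_trainable * grad_bytes         # gradients only for trainable params
    states = n_trainable * adam_state_bytes  # Adam states only for trainable params
    return weights + grads + states

GB = 1e9
n = 7e9  # hypothetical 7B-parameter model
full = training_bytes(n) / GB
peft = training_bytes(n, trainable_fraction=0.001) / GB
print(f"full tuning: {full:.1f} GB, PEFT (0.1% trainable): {peft:.2f} GB")
```

Under these assumptions full tuning pays 12 bytes per parameter, while the PEFT run stays close to the 2 bytes per parameter of the frozen weights.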
29.4 Adapters: the precursor
The first idea that worked well was adapters (Houlsby et al. 2019): small bottleneck modules inserted inside each layer of the transformer (one after attention, another after the FFN). Their structure: project down to a small dimension r → non-linearity → project back up to d, with a residual connection. They’re initialized near the identity so as not to perturb the network at the start. Only the adapters train; the base stays frozen.
They worked very well (within 0.4% of full tuning on GLUE while adding only ~3.6% of parameters per task), but they have a key flaw: they are extra, sequential layers, so they add latency at every inference. That’s exactly what LoRA eliminates.
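The adapter structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Houlsby et al.’s code: the sizes, the ReLU non-linearity, and the zero-initialized up-projection (one simple way to start at the identity) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # hypothetical hidden size and bottleneck size

# Near-identity init (an assumption of this sketch): small down-projection,
# zero up-projection, so the adapter initially adds nothing to the residual.
W_down = rng.normal(scale=1e-3, size=(r, d))
W_up = np.zeros((d, r))

def adapter(h):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    z = np.maximum(W_down @ h, 0.0)  # ReLU, chosen here for simplicity
    return h + W_up @ z              # residual connection

h = rng.normal(size=d)
out = adapter(h)                     # equals h at initialization
n_adapter_params = W_down.size + W_up.size
```

Because of the non-linearity between the two projections, these extra matrix products can’t be folded into a neighboring weight matrix, so they run at every forward pass.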
29.5 LoRA: the low-rank delta
LoRA (Hu et al. 2022) starts from a deep observation: the update a model needs to specialize in a task is “simple”, in the sense that it has low rank. It builds on the finding that fine-tuning lives in a very low intrinsic dimension (Aghajanyan et al. 2020): RoBERTa reaches about 90% of its full-tuning performance on MRPC when only ~200 parameters, randomly projected back into the full weight space, are optimized. If the update is low-rank, you don’t need a huge dense matrix to represent it: the product of two thin matrices suffices.
Instead of modifying the weight \(W_0\), LoRA freezes it and learns an add-on \(\Delta W = BA\):
\[ h = W_0\,x \;+\; \frac{\alpha}{r}\,B\,A\,x \]
Term by term:
- \(W_0 \in \mathbb{R}^{d\times k}\) = the pre-trained, frozen weight (it receives no gradient). It keeps doing its job intact.
- \(A \in \mathbb{R}^{r\times k}\) = the down-projection: it compresses the input into a tiny space of dimension r. It’s initialized with Gaussian noise.
- \(B \in \mathbb{R}^{d\times r}\) = the up-projection: it expands back to d. It’s initialized to zero.
- \(r\) = the rank, the key knob, with \(r \ll \min(d,k)\) (typical: 4-64). It’s the delta’s “capacity”: more r = more expressive but more parameters.
- \(\alpha\) = a scaling factor; the ratio \(\alpha/r\) regulates the magnitude of the add-on (it reduces the need to retune the learning rate when you change r).
Why \(B=0\) at the start matters: with \(B=0\), the add-on \(BA=0\), so the model starts exactly at the pre-trained one —no perturbation, stable training. From there, only \(A\) and \(B\) learn; \(W_0\) isn’t touched.
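The formula and the \(B=0\) initialization can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation; the sizes, the value of \(\alpha\), and the Gaussian scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 48, 4, 8  # hypothetical sizes, with r << min(d, k)

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # down-projection, Gaussian init
B = np.zeros((d, r))                     # up-projection, zero init

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); the dense d x k delta is never built."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
h_init = lora_forward(x, W0, A, B, alpha, r)   # identical to W0 @ x while B == 0

B_trained = rng.normal(scale=0.01, size=(d, r))  # stand-in for a trained B
h_later = lora_forward(x, W0, A, B_trained, alpha, r)
```

One consequence worth noticing: on the very first step only \(B\) gets a non-zero gradient, because the gradient with respect to \(A\) is multiplied by \(B^\top = 0\).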
Its function: \(BA\) is a specialized, cheap delta. On GPT-3 (175B), LoRA cut the trainable parameters ~10,000× and GPU memory ~3×, matching or exceeding the quality of full tuning.
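Where the saving comes from is easy to see for a single matrix: a dense \(d\times k\) update has \(dk\) entries, while the pair \(B, A\) has only \(r(d+k)\). The sketch below assumes a square projection of width 12288 (the hidden size of GPT-3 175B) and \(r=8\); both are illustrative choices.

```python
def lora_params(d, k, r):
    """Trainable parameters of one LoRA pair: B is d x r, A is r x k."""
    return r * (d + k)

d = k = 12288  # assumed square projection (GPT-3 175B hidden size)
r = 8          # illustrative rank
dense = d * k
low_rank = lora_params(d, k, r)
print(dense, low_rank, dense // low_rank)
```

This is the per-matrix ratio only. The reduction measured against the whole model is larger, because most weight matrices get no delta at all.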
After training, the delta can be fused into the base weight: \(W = W_0 + \frac{\alpha}{r}BA\). The resulting model has exactly the same shape as the original → no added latency at inference. And it can be un-fused to switch adapters. (Adapters, being extra layers, don’t allow this.) In practice LoRA is applied mostly to the attention (Q, V) projections, sometimes to all four (Q,K,V,O) and to the FFN.
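Fusing and un-fusing are each a single matrix addition or subtraction. A minimal NumPy sketch, with made-up sizes and random matrices standing in for a trained adapter:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 16, 24, 4, 8  # hypothetical sizes
W0 = rng.normal(size=(d, k))   # frozen base weight
A = rng.normal(size=(r, k))    # stand-in for a trained down-projection
B = rng.normal(size=(d, r))    # stand-in for a trained up-projection
scale = alpha / r

W_merged = W0 + scale * (B @ A)              # fuse: same shape as W0

x = rng.normal(size=k)
h_unmerged = W0 @ x + scale * (B @ (A @ x))  # two paths at inference
h_merged = W_merged @ x                      # one matmul, same result

W_restored = W_merged - scale * (B @ A)      # un-fuse to swap adapters
```

In floating point, un-fusing recovers \(W_0\) only up to rounding error, so keeping a clean copy of the base is the safer option when you swap adapters many times.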
🧩 Analogy — sticky notes in a textbook. Instead of rewriting the whole book for each subject (full tuning), you stick a few thin sticky notes (the delta \(BA\), low-rank) that you can add or remove at will. The original book (\(W_0\)) stays intact, and saving “the subject” is saving just the notes.
29.6 QLoRA: tuning a giant on a single GPU
LoRA already saves the optimizer states, but the base model still takes up memory (you have to keep it loaded for the forward). QLoRA (Dettmers et al. 2023) attacks that: it trains LoRA adapters on top of a base model quantized to 4 bits. Three pieces:
- NF4 (4-bit NormalFloat): a quantization data type that is information-theoretically optimal for weights with a normal distribution, which pre-trained weights approximately follow.
- Double quantization: quantizing the quantization constants too, squeezing out more memory.
- Paged optimizers: they use the GPU’s unified memory to absorb the memory spikes of gradient checkpointing (avoiding OOM).
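How much double quantization saves can be worked out by hand. The sketch below assumes the block sizes described in the QLoRA paper (one constant per 64 weights, and one second-level constant per 256 first-level constants); treat those numbers as assumptions of this example.

```python
def constant_bits_per_param(block=64, const_bits=32):
    """Overhead of one 32-bit scaling constant per block of weights."""
    return const_bits / block

def double_quant_bits_per_param(block=64, block2=256,
                                const_bits=8, const2_bits=32):
    """8-bit first-level constants, plus one 32-bit constant per block2 of them."""
    return const_bits / block + const2_bits / (block * block2)

plain = constant_bits_per_param()   # bits/param without double quantization
dq = double_quant_bits_per_param()  # bits/param with double quantization

n = 65e9                            # 65B-parameter base
base_gb = n * (4 + dq) / 8 / 1e9    # 4-bit weights plus constants, in GB
print(plain, round(dq, 3), round(base_gb, 1))
```

Under these assumptions the constants drop from 0.5 to about 0.127 bits per parameter, and a 65B base takes roughly 33.5 GB, which leaves room on a 48 GB card for adapters, optimizer states, and activations.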
The fine idea: the gradients backpropagate through the frozen 4-bit weights into the 16-bit LoRA adapters. The 4-bit base is never updated; it is only dequantized on the fly to compute the forward and backward passes. The result was spectacular: tuning a 65B model on a single 48 GB GPU while preserving the quality of 16-bit tuning.
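The mechanics can be sketched with a toy blockwise quantizer. Important caveat: the 16-level codebook below is built from evenly spaced normal quantiles as a simplified stand-in; it is not the actual NF4 table, and the sizes and function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_codebook(levels=16):
    """16 values at evenly spaced normal quantiles, rescaled to [-1, 1].
    Simplified stand-in for NF4 (the real table is built differently)."""
    probs = (np.arange(levels) + 0.5) / levels
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return q / np.abs(q).max()

def quantize(w, codebook, block=64):
    """Blockwise absmax quantization: 4-bit codes plus one scale per block."""
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    dist = np.abs(blocks[..., None] / scales[..., None] - codebook)
    return dist.argmin(axis=-1).astype(np.uint8), scales

def dequantize(codes, scales, codebook):
    """Look up each code and rescale by its block constant."""
    return codebook[codes] * scales

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 64))  # hypothetical frozen weight
cb = normal_codebook()
codes, scales = quantize(W, cb)           # this is what is stored
W_hat = dequantize(codes, scales, cb).reshape(W.shape)  # used only to compute

# The LoRA pair stays in higher precision and is the only thing trained.
A = rng.normal(scale=0.01, size=(4, 64))
B = np.zeros((8, 4))
x = rng.normal(size=64)
h = W_hat @ x + B @ (A @ x)
```

The codes and scales never receive an update; gradients pass through the dequantized matrix on their way to \(A\) and \(B\).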
With QLoRA, the Guanaco family reached 99.3% of ChatGPT’s level on the Vicuna benchmark training 24 hours on a single GPU. It democratized fine-tuning of large models: what once required a cluster suddenly fit on a consumer card.
🧩 Analogy — compress the book and write with good ink. QLoRA compresses the book to 4 bits so it fits on your desk, and then writes the sticky notes in full-precision ink (the 16-bit adapters over the 4-bit base).
29.7 Soft prompts: prefix and prompt tuning
A different family touches no weight: it learns input vectors.
- Prefix tuning (Li and Liang 2021): it prepends trainable continuous vectors (“virtual tokens”) to the keys and values in each layer; the base stays frozen and the real tokens “attend” to that prefix as if it were context. ~0.1% of parameters.
- Prompt tuning (Lester et al. 2021): even simpler —trainable soft-prompt vectors only at the input. Its key finding: it becomes competitive with full tuning as the model’s scale grows (it matches it toward ~10B parameters). P-tuning v2 (Liu et al. 2021) reintroduces prompts at every layer so it works on small models too.
(Important nuance: these soft prompts are learned continuous vectors, which don’t correspond to any real token. Not to be confused with hard prompts, discrete text you write by hand, which we’ll see in Ch. 30.)
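Prompt tuning is simple enough to sketch directly: the learned vectors are concatenated in front of the real token embeddings, and nothing else changes. The vocabulary size, width, and number of virtual tokens below are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_virtual = 100, 16, 5  # hypothetical sizes

E = rng.normal(size=(vocab, d))   # frozen embedding table
# The only trainable tensor: n_virtual continuous vectors, tied to no token.
soft_prompt = rng.normal(scale=0.02, size=(n_virtual, d))

def embed_with_soft_prompt(token_ids):
    """Prepend the learned vectors to the embeddings of the real tokens."""
    tokens = E[token_ids]                                  # (seq_len, d)
    return np.concatenate([soft_prompt, tokens], axis=0)   # (n_virtual + seq_len, d)

ids = np.array([3, 14, 15])
x = embed_with_soft_prompt(ids)   # what the frozen transformer receives
n_trainable = soft_prompt.size
```

Prefix tuning follows the same pattern, except that the learned vectors are concatenated to the keys and values of every layer instead of to the input embeddings.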
29.8 When to use each one
- LoRA / QLoRA = today’s standard. Quality ≈ full tuning, fusible (zero latency), tiny checkpoints, and they let you swap adapters per task over a single loaded base —the “one base + many small adapters” pattern, key to serving many tasks at once.
- Adapters: historic; superseded by LoRA above all because of latency.
- Prompt / prefix tuning: ultralight; they shine at large scale and for serving very many tasks (one prefix per task, dirt cheap to store).
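The “one base + many small adapters” pattern can be illustrated with a toy single-layer example. This is only a sketch of the idea, not how any serving system implements it; the task names and sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 2, 4   # hypothetical sizes
W0 = rng.normal(size=(d, k))    # single shared, frozen base weight

# One small (A, B) pair per task; the task names are made up.
adapters = {
    task: (rng.normal(size=(r, k)), rng.normal(size=(d, r)))
    for task in ("summarize", "translate")
}

def forward(x, task=None):
    """Run the shared base, adding the selected task's low-rank delta if any."""
    h = W0 @ x
    if task is not None:
        A, B = adapters[task]
        h = h + (alpha / r) * (B @ (A @ x))
    return h

x = rng.normal(size=k)
h_base = forward(x)
h_sum = forward(x, "summarize")
h_tr = forward(x, "translate")
adapter_size = sum(m.size for m in adapters["summarize"])
```

Switching tasks is a dictionary lookup, and each extra task costs \(r(d+k)\) numbers instead of another \(dk\).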
A careful study, LoRA Learns Less and Forgets Less (Biderman et al. 2024), tempers the optimism: on hard tasks like code and math, LoRA (at standard low rank) underperforms full tuning —which learns perturbations of rank 10-100× larger. But LoRA forgets less (it preserves out-of-domain capabilities better). Lessons: (1) the rank \(r\) is a quality/size knob —turn it up on demanding tasks; (2) there’s a learn-vs-forget trade-off. 2024 refinements like DoRA (Liu et al. 2024) (decomposing the weight into magnitude and direction) close part of that gap.
tafagent profiles a model with LoRA already fused (\(W_0 + \tfrac{\alpha}{r}BA\)): since the delta is low-rank and is added to the base weight, you’ll see that the attention profile across distance (γ, regime; Ch. 15-20) barely moves relative to the base, which is consistent with the adaptation being a thin layer, not a rewrite. Useful for confirming that your fine-tuning didn’t break the model’s long-range behavior.
And the 🔌 Adapter Reality mode acts before merging: paste the adapter_config.json of a published adapter and, from the config alone (no inference), it checks two things. First, compatibility with your base: match, target_modules, \(\alpha/r\) scaling, embedding resize. Second, the forgetting evidence for its rank, taken from the literature, including LoRA Learns Less and Forgets Less (Biderman et al. 2024), and shown as a Δ (percentage-point) range with its sources, never a fabricated per-adapter number.
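To give a feel for what a config-only check can look like, here is a small sketch. It is not the tool’s implementation. It assumes the file uses the field names `base_model_name_or_path`, `r`, `lora_alpha`, and `target_modules`; the function name, base name, and module names are hypothetical.

```python
import json

def check_adapter_config(config_text, expected_base, known_modules):
    """Inspect an adapter config without loading any weights (illustrative)."""
    cfg = json.loads(config_text)
    unknown = sorted(set(cfg["target_modules"]) - set(known_modules))
    return {
        "base_match": cfg.get("base_model_name_or_path") == expected_base,
        "unknown_target_modules": unknown,            # modules the base lacks
        "scaling": cfg["lora_alpha"] / cfg["r"],      # the alpha / r factor
    }

# Hypothetical adapter config and base description.
example = json.dumps({
    "base_model_name_or_path": "example-org/base-7b",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
})
report = check_adapter_config(
    example,
    expected_base="example-org/base-7b",
    known_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
print(report)
```

A check like this only covers the compatibility half; the forgetting evidence comes from the literature, not from the config.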
29.9 Summary
- Why PEFT: full tuning = one copy per task + ~12-16 bytes/parameter in memory + forgetting risk. PEFT freezes the base and trains a tiny fraction.
- Adapters (Houlsby et al. 2019): bottleneck modules inside each layer (within 0.4% of full tuning on GLUE, +3.6% params per task), but they add latency.
- LoRA (Hu et al. 2022): a low-rank delta \(\Delta W=BA\) (with \(B=0\) at the start); ~10,000× fewer params on GPT-3, fusible → zero latency. Motivated by the low intrinsic dimension of fine-tuning.
- QLoRA (Dettmers et al. 2023): LoRA over a base quantized to 4 bits (NF4 + double quantization + paged optimizers) → 65B on a 48 GB GPU; the gradients flow through the 4 bits up to the 16-bit adapters.
- Soft prompts: prefix tuning (K/V per layer) and prompt tuning (input only, competitive at large scale) — learned vectors, not text.
- Today: LoRA/QLoRA is the default (fusible, small checkpoints, per-task hot-swap). The honest caveat: LoRA can fall short on hard tasks (low rank), though it forgets less.
Next (Part V): with the model trained, adapted, and aligned, we move on to using it: generation in depth, prompting and in-context learning, RAG, agents, and multimodal.
29.10 Exercises
- The saving. Why does PEFT reduce training memory so much? (Hint: for how many parameters do the optimizer states exist?)
- LoRA term by term. In \(h=W_0x+\frac{\alpha}{r}BAx\), what is \(W_0\), what do \(A\) and \(B\) do, and why is \(B\) initialized to zero?
- Zero latency. Explain why LoRA adds no latency at inference and an adapter does.
- QLoRA. What is quantized to 4 bits and what is trained in 16 bits? Why is the 4-bit base “never updated”?
- Honesty. According to LoRA Learns Less and Forgets Less, on what type of tasks does LoRA fall short and what, on the other hand, does it do better than full tuning?