27 Fine-tuning for classification and representation

Where we are. We already have a powerful base model (Ch. 25). We rarely use it “as is”: we adapt it to a task, such as classifying emails, measuring whether two sentences mean the same thing, or generating embeddings for search. This chapter covers the two classic adaptations: classifying (attach a head and tune) and representing (producing good text vectors with contrastive learning). Instruction fine-tuning and RLHF go in Ch. 27, and LoRA/PEFT in Ch. 28; neither is covered here.

27.1 The idea in one sentence

A pre-trained model is a generalist that already understands language. Adapting it means either giving it a short course for a concrete task (adding a small piece and re-tuning) or reorganizing its representation space so that similar things end up close.

🧩 Analogy: the generalist who takes a course. The base model is someone with very broad general knowledge (it knows language in general). Fine-tuning is a short, focused course in a trade (detecting spam, sentiment). It doesn’t relearn to read from scratch: it adapts what it already knows, fast and with little new data. If you push it too hard with the narrow course, it forgets general skills (this is catastrophic forgetting, covered in 27.4.1).

27.2 Key concepts and their role in the transformer

Before getting into the details, we define this chapter’s terms and what each one is for inside a transformer:

Transfer (pre-train → adapt). Definition: reuse a general language model and re-tune it to a task. In the transformer: the knowledge from pre-training is reused, so a few labeled examples are enough.
Classification head. Definition: a linear layer (\(W\in\mathbb{R}^{K\times H}\)) plus softmax over the summary vector. In the transformer: it translates the internal representations into a class answer; it’s the only thing born from scratch when tuning.
Pooling ([CLS] vs mean-pooling). Definition: how the sequence is condensed into a single sentence vector. In the transformer: the raw [CLS] is a poor embedding; averaging the tokens works better.
Freeze vs tune. Definition: leave the base fixed (train only the head) or update all the weights. In the transformer: cheap and forgetting-free versus more precise but more expensive.
Catastrophic forgetting. Definition: the new task’s updates overwrite what was learned before. In the transformer: the risk of tuning all the weights; mitigated by unfreezing gradually.
Anisotropy / cosine similarity. Definition: the vectors piling up in a narrow cone, with “similarity” measured by the cosine of the angle. In the transformer: it explains why the un-tuned [CLS] discriminates poorly —“everything looks like everything”.
Bi-encoder vs cross-encoder (Sentence-BERT). Definition: encoding each sentence separately (reusable vectors) versus judging the pair together. In the transformer: the bi-encoder makes semantic search scalable (65 h → 5 s).
Contrastive learning. Definition: pull positive pairs together and push negatives apart (the NT-Xent loss, with temperature \(\tau\) and in-batch negatives). In the transformer: it reorganizes the representation space so that similar things end up close (alignment + uniformity).

With those terms in hand, let’s get to the details.

27.3 The paradigm: pre-train and then adapt

The dominant recipe in NLP is pre-train → adapt. One of its first clear articulations was ULMFiT (Howard and Ruder 2018): train a general language model and then tune it to a classification task, with tricks to avoid breaking what was learned (per-layer learning rates, gradual unfreezing). The eye-opening result: with 100 labeled examples, ULMFiT matched a model trained from scratch with 100× more data. That’s the magic of transfer: the knowledge from pre-training gets reused.

27.4 Classify: attach a head and tune

The canonical pattern was set by BERT (Devlin et al. 2019). The idea is simple:

You pass the text through the model and take a summary representation of the sequence. In BERT, that summary is the final state of the special token [CLS] (a vector \(C \in \mathbb{R}^{H}\), with \(H\) = hidden dimension).
You plug a classification head on top: a linear layer with weights \(W \in \mathbb{R}^{K \times H}\) (with \(K\) = number of classes) that converts that summary vector into class logits, and a softmax on top.
You tune all the weights (model + head) end to end with the labeled data.

Let’s take the head term by term, because it’s the new piece:

\(C \in \mathbb{R}^{H}\) = the sentence summary vector (what the model “understood” from the whole sequence, condensed into \(H\) numbers).
\(W \in \mathbb{R}^{K \times H}\) = the head. Each of its \(K\) rows is a “detector” of a class; multiplying \(W\,C\) gives you \(K\) scores, one per class. It’s the only thing born from scratch in fine-tuning; everything else came pre-trained.
The softmax converts those \(K\) scores into probabilities that sum to 1.

Its function in the system: the head is the translator between the model’s internal language (vectors of \(H\) dimensions) and your task’s concrete question (“spam or not?”, “positive, neutral, or negative?”). Without it, the model gives you a rich representation but not an answer; un-tuned, that representation isn’t oriented toward your task.
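The head described above can be sketched in a few lines. This is a minimal illustration in plain Python, not BERT's actual implementation: the toy values of \(C\) and \(W\) are hypothetical (with \(H = 4\) and \(K = 3\)), the bias term is omitted to match the text, and the helper names `softmax` and `classify` are ours.

```python
import math

def softmax(scores):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(W, C):
    """Apply a K x H head W to a summary vector C of length H."""
    logits = [sum(w * c for w, c in zip(row, C)) for row in W]
    return logits, softmax(logits)

# Hypothetical toy numbers: H = 4 hidden dimensions, K = 3 classes.
C = [0.5, -1.0, 2.0, 0.0]
W = [
    [1.0, 0.0, 0.0, 0.0],  # "detector" row for class 0
    [0.0, 1.0, 0.0, 0.0],  # "detector" row for class 1
    [0.0, 0.0, 1.0, 0.0],  # "detector" row for class 2
]

logits, probs = classify(W, C)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(logits, probs, predicted)
```

Each row of `W` produces one score, so the shape of the output is fixed by \(K\), not by the length of the text.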

27.4.1 Freeze or tune: two modes

You have two ways to use the base model:

Feature extraction (frozen backbone): you leave the model’s weights fixed and train only the head over the representations it produces. Cheap (you can precompute and cache the representations), more resistant to overfitting with little data, and it forgets nothing.
Full fine-tuning: you update all the weights. More precise —especially with more labeled data or a domain shift— but more expensive and with more risk of overfitting and forgetting.
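To make the frozen mode concrete, here is a minimal sketch that trains only a head on cached vectors. It rests on several assumptions: the `cached` vectors stand in for representations a frozen encoder would have produced (they are made-up numbers), the task is binary so the head is a single logistic unit rather than a \(K\)-way softmax, and `train_head` and `predict` are illustrative names.

```python
import math

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression head on fixed (cached) features."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the cross-entropy w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return 1 if the head's logit is positive, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hypothetical cached sentence vectors from a frozen encoder (H = 2).
cached = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
labels = [1, 1, 0, 0]

w, b = train_head(cached, labels)
preds = [predict(w, b, x) for x in cached]
print(preds)
```

Only `w` and `b` change during training. The encoder is never called inside the loop, which is why the representations can be computed once and cached.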

How much do you gain by tuning everything? In the BERT paper’s own ablation on named-entity recognition, the best frozen-backbone method (feeding the concatenated top four hidden layers to a separate tagger) came within ~0.3 F1 points of full fine-tuning. That is: BERT works well both ways; tuning everything gains a bit, in exchange for more cost.

⚠ Catastrophic forgetting

When you tune a model on a new task, its updates can overwrite what was learned before —it suddenly loses prior abilities. This is catastrophic forgetting (McCloskey and Cohen, 1989; review by French, 1999). That’s why ULMFiT unfreezes the layers gradually and small learning rates are used: to adapt without erasing. It’s the risk of “pushing the narrow course too hard” from the analogy.
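The two ULMFiT tricks mentioned above can be sketched as simple schedules. This is an illustration under assumptions: the decay factor 2.6 is the value reported in the ULMFiT paper, while the layer count, the top learning rate, and the function names are hypothetical.

```python
def layer_learning_rates(top_lr, num_layers, decay=2.6):
    """Per-layer learning rates: index 0 is the lowest layer, the last is the top.

    Each layer below the top gets the rate of the layer above divided by `decay`,
    so the layers closest to the input change the least.
    """
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]

def trainable_layers(epoch, num_layers):
    """Gradual unfreezing: epoch 0 trains only the top layer, then one more per epoch."""
    first = max(0, num_layers - 1 - epoch)
    return list(range(first, num_layers))

rates = layer_learning_rates(top_lr=0.01, num_layers=4)
schedule = [trainable_layers(epoch, num_layers=4) for epoch in range(5)]
print(rates)
print(schedule)
```

Both schedules protect the lower layers, which hold the most general knowledge: they are unfrozen last and updated with the smallest steps.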

27.5 Represent: why the raw `[CLS]` is a poor embedding

Here comes an important twist. If you want a sentence vector to compare meanings (search, cluster), you’re tempted to use the [CLS] of an un-tuned BERT… and it’s a bad idea. The reason is geometric:

Representation degeneration (Gao et al. 2019): when training by maximum likelihood, the embeddings tend to crowd into a narrow cone of the space, which limits their representational power.
Anisotropy (Ethayarajh 2019): the contextual representations of BERT, ELMo, and GPT-2 don’t spread out across the space; they live in a narrow cone, especially in the upper layers, where even random words have high cosine similarity.

(Two bridge terms: cosine similarity = the cosine of the angle between two vectors, 1 = same direction, 0 = perpendicular; it’s how we measure “similarity”. Anisotropy = the vectors nearly all pointing the same way, instead of spreading out — then “everything looks like everything” and cosine stops discriminating.)
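A small numerical sketch of both bridge terms. The vectors are random toy data, not real BERT states, and the shared `offset` is a crude stand-in for a narrow cone that we introduce only for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine over all distinct pairs: near 0 when spread out, near 1 in a cone."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

rng = random.Random(0)
dim = 16

# Spread-out vectors: random directions in all dimensions.
spread = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

# The same vectors plus a large shared offset: they now all point roughly the same way.
offset = [5.0] * dim
cone = [[x + o for x, o in zip(v, offset)] for v in spread]

spread_score = mean_pairwise_cosine(spread)
cone_score = mean_pairwise_cosine(cone)
print(round(spread_score, 3), round(cone_score, 3))
```

In the cone, unrelated vectors score almost as high as identical ones, so ranking by cosine carries little information.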

The consequence is measurable: in the Sentence-BERT results on semantic textual similarity (Spearman correlation with human judgments, ×100), averaging GloVe vectors (old, non-contextual) scores ~61, while BERT’s raw [CLS] vector scores ~29. The raw [CLS] is, by far, the worst. Averaging BERT’s token vectors (mean-pooling) scores ~55 in the same table: still below GloVe, but far better than the [CLS]. So, if you’re not going to tune, prefer mean-pooling over the [CLS].
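The two pooling strategies differ only in which token states they keep. A minimal sketch with made-up token vectors (the numbers, the mask, and the function names are hypothetical); the attention mask marks real tokens with 1 and padding with 0.

```python
def cls_pool(token_vectors):
    """Use the state of the first token ([CLS]) as the sentence vector."""
    return token_vectors[0]

def mean_pool(token_vectors, attention_mask):
    """Average the token states, skipping padded positions (mask == 0)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

# Hypothetical final-layer states for 4 positions; the last one is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vector = cls_pool(tokens)
mean_vector = mean_pool(tokens, mask)
print(cls_vector, mean_vector)
```

Leaving padding out of the average matters in practice: otherwise the sentence vector would depend on how long the other sentences in the batch are.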

27.6 Sentence-BERT: sentence vectors that actually work

The supervised solution is Sentence-BERT (Reimers and Gurevych 2019): tune BERT with a siamese architecture (a bi-encoder) so it produces sentence vectors comparable directly by cosine. The key is understanding why plain BERT applied to pairs isn’t enough:

A cross-encoder feeds the two sentences together and judges their relation. It’s precise, but to find the most similar pair in a collection of \(n = 10{,}000\) sentences you’d have to score every pair: \(n(n-1)/2 \approx 50\) million forward passes, about 65 hours. It doesn’t scale to search.
A bi-encoder encodes each sentence separately into a reusable vector. Comparing is then computing cosines between precomputed vectors: those same 65 hours become ~5 seconds.

What each one is for: the cross-encoder is for reranking a few candidates with maximum precision; the bi-encoder is for searching/clustering at large scale, because its vectors are computed once and reused. (This sets up Ch. 31, RAG.)
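The scaling argument is just counting, and the bi-encoder search is a nearest-neighbor lookup over stored vectors. A minimal sketch follows; the tiny 2-dimensional `index` is made-up data standing in for precomputed sentence vectors, and the function names are ours.

```python
import math

def cross_encoder_passes(n):
    """Model passes needed to score every pair of n sentences jointly."""
    return n * (n - 1) // 2

def bi_encoder_passes(n):
    """Model passes needed to encode n sentences once each."""
    return n

def normalize(v):
    """Scale a vector to unit length so a dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query, index, k=1):
    """Return the positions of the k stored vectors most similar to the query."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, doc)), i) for i, doc in enumerate(index)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

pair_passes = cross_encoder_passes(10_000)
single_passes = bi_encoder_passes(10_000)

# Hypothetical precomputed sentence vectors, normalized once and reused for every query.
index = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
hits = search([0.9, 0.1], index, k=2)
print(pair_passes, single_passes, hits)
```

The expensive model runs only in the encoding step. Every later comparison is a cheap dot product, which is where the drop from hours to seconds comes from.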

27.7 Contrastive learning: pull similar things close, push different ones away

How do you train a good embedding space? With contrastive learning: the idea is to pull positive pairs together and push negatives apart.

🧩 Analogy — organizing a library. You place the books so that the similar ones (same topic, same author) end up together and the different ones far apart. Once organized, “find me books like this one” is just grabbing the neighbors on the shelf, without re-reading them all. (And it’s good for them to fill the whole library, not pile up in a corner — exactly the fix for anisotropy.)

The modern loss function (SimCLR’s NT-Xent (Chen et al. 2020)) is:

\[ \mathcal{L}_i = -\log \frac{\exp\!\big(\operatorname{sim}(h_i, h_i^{+})/\tau\big)}{\sum_{j} \exp\!\big(\operatorname{sim}(h_i, h_j)/\tau\big)} \]

Term by term:

\(h_i\) = the anchor: the vector of the sentence we’re looking at.
\(h_i^{+}\) = its positive: a paired sentence (related, or an altered “view” of the same one).
\(h_j\) = all the candidates in the batch (including the positive and the rest, which act as negatives).
\(\operatorname{sim}\) = the cosine similarity between normalized vectors.
\(\tau\) (temperature) = a control over “how much to punish hard negatives”: low \(\tau\) → the loss obsesses over the closest negatives and pushes them hard; high \(\tau\) → it treats them all equally.
The numerator measures how close the anchor is to its positive; the denominator, how close it is to everyone. The loss is low only if the positive is, by far, the most similar to the anchor. That’s “pull similar things close and push the rest away”, written in math.
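The loss can be computed directly from its definition. The sketch below is an illustration with hand-picked 2-dimensional vectors, not SimCLR’s code. `negative_shares` is a helper we add to show the temperature effect: it returns the softmax weight of each negative among the negatives, which is proportional to how hard the loss pushes that negative away.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))

def nt_xent(anchor, positive, negatives, tau=0.1):
    """-log( exp(sim(anchor, positive)/tau) / sum over all candidates )."""
    pos = cosine(anchor, positive) / tau
    neg = [cosine(anchor, n) / tau for n in negatives]
    return log_sum_exp([pos] + neg) - pos

def negative_shares(anchor, negatives, tau):
    """Relative weight of each negative in the denominator (sums to 1)."""
    scaled = [cosine(anchor, n) / tau for n in negatives]
    log_z = log_sum_exp(scaled)
    return [math.exp(s - log_z) for s in scaled]

anchor = [1.0, 0.0]
close = [0.9, 0.1]
hard_negative = [0.8, 0.6]  # cosine 0.8 with the anchor
easy_negative = [0.0, 1.0]  # cosine 0.0 with the anchor

good = nt_xent(anchor, close, [hard_negative, easy_negative])
bad = nt_xent(anchor, easy_negative, [close, hard_negative])
sharp = negative_shares(anchor, [hard_negative, easy_negative], tau=0.05)
flat = negative_shares(anchor, [hard_negative, easy_negative], tau=10.0)
print(good, bad, sharp, flat)
```

With a low \(\tau\) almost all the weight lands on the hard negative; with a high \(\tau\) the two negatives are treated nearly alike.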

(Lineage honesty: this form with cosine + temperature is SimCLR’s NT-Xent (2020). The original “InfoNCE” (Oord et al. 2018) scored pairs with a log-bilinear function, in effect a dot product through a learned matrix, with no temperature. It’s worth not confusing them, even though today “contrastive” usually refers to this version.)

A trick that makes it cheap: in-batch negatives. For each anchor, the other sentences in the same batch are already encoded → they come free as negatives.
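In code, in-batch negatives amount to building one similarity matrix per batch and treating its diagonal as the correct answers. The sketch below assumes a batch of anchor/positive pairs where row \(i\) of `positives` belongs to row \(i\) of `anchors`; the vectors are made up, and this variant (each anchor scored against every positive in the batch) is one common formulation rather than SimCLR’s exact one.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_loss(anchors, positives, tau=0.1):
    """Average contrastive loss where positives[j], j != i, act as negatives for anchors[i]."""
    n = len(anchors)
    losses = []
    for i in range(n):
        # Row i of the similarity matrix: anchor i against every positive in the batch.
        row = [cosine(anchors[i], positives[j]) / tau for j in range(n)]
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])  # the correct column is the diagonal entry
    return sum(losses) / n

anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
positives = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]

matched = in_batch_loss(anchors, positives)
shuffled = in_batch_loss(anchors, positives[1:] + positives[:1])
print(matched, shuffled)
```

A batch of \(n\) pairs gives each anchor \(n - 1\) negatives without encoding anything extra, which is why larger batches tend to help.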

🧠 Curiosity — the same text, twice, as a positive pair (SimCSE)

Where do you get “similar pairs” without labels? SimCSE (Gao et al. 2021) had an idea that sounds almost like a joke: pass the same sentence through the model twice, with two different dropout masks (the random noise the model already applies when training). The two slightly different outputs form a positive pair; the other sentences in the batch are the negatives. It works surprisingly well, and if you remove the dropout, the representations collapse. Its supervised version uses entailment (NLI) pairs as positives and contradictions as hard negatives, reaching ~81.6 average Spearman correlation on semantic textual similarity with BERT-base (about +2.2 points over the previous best result).
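The mechanism can be imitated on a single vector. This is a toy sketch only: real SimCSE relies on the dropout inside the transformer’s layers, whereas here we apply dropout directly to a made-up vector. The rate 0.1 matches the default the paper reports using; the function names and the seed are hypothetical.

```python
import random

def dropout(vector, p, rng):
    """Inverted dropout: zero each entry with probability p and rescale the rest."""
    if p == 0.0:
        return list(vector)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vector]

def two_views(vector, p=0.1, seed=0):
    """Pass the same input through dropout twice, drawing a fresh mask each time."""
    rng = random.Random(seed)
    return dropout(vector, p, rng), dropout(vector, p, rng)

sentence_vector = [1.0] * 64  # stand-in for one encoded sentence

view_a, view_b = two_views(sentence_vector, p=0.1)
same_a, same_b = two_views(sentence_vector, p=0.0)
print(view_a != view_b, same_a == same_b)
```

With dropout on, the two views differ slightly and give the contrastive loss something to align. With dropout off they are identical, so the positive pair carries no signal.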

27.7.1 What a good space does well: alignment and uniformity

Two properties measure the quality of a contrastive space (Wang and Isola 2020):

Alignment: positive pairs end up close (similar books, neighbors).
Uniformity: the vectors spread out over the whole sphere, preserving the maximum of information (books filling the whole library, not piled up).

The contrastive loss optimizes both at once, and SimCSE showed that contrastive training reshapes BERT’s anisotropic space into a more uniform one, closing the circle with the narrow-cone problem.
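Both properties have simple numerical definitions in Wang and Isola (2020): alignment is the mean distance between positive pairs raised to a power \(\alpha\), and uniformity is the log of the mean Gaussian potential \(\exp(-t\,\lVert u - v\rVert^2)\) over all pairs. Lower is better for both. The sketch below uses \(\alpha = 2\) and \(t = 2\), common choices in that paper, on made-up points on the unit circle.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(positive_pairs, alpha=2):
    """Mean distance^alpha between the two members of each positive pair."""
    return sum(sq_dist(u, v) ** (alpha / 2) for u, v in positive_pairs) / len(positive_pairs)

def uniformity(vectors, t=2):
    """Log of the mean Gaussian potential over all pairs of (unit-norm) vectors."""
    pairs = list(combinations(vectors, 2))
    return math.log(sum(math.exp(-t * sq_dist(u, v)) for u, v in pairs) / len(pairs))

def on_circle(angles):
    """Unit vectors at the given angles (radians)."""
    return [[math.cos(a), math.sin(a)] for a in angles]

spread = on_circle([0.0, math.pi / 2, math.pi, 3 * math.pi / 2])  # fills the circle
cone = on_circle([0.0, 0.1, 0.2, 0.3])                            # piled up in one corner

spread_uniformity = uniformity(spread)
cone_uniformity = uniformity(cone)
tight_alignment = alignment([(cone[0], cone[1])])
loose_alignment = alignment([(spread[0], spread[2])])
print(spread_uniformity, cone_uniformity, tight_alignment, loose_alignment)
```

The cone scores well on alignment and badly on uniformity. A good embedding space needs positives close together while the collection as a whole still fills the sphere.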

27.8 Modern embedding models (bridge to RAG)

This machinery is the basis of today’s semantic search engines:

DPR (Karpukhin et al. 2020): a bi-encoder for question answering, trained with in-batch negatives + one hard negative mined by BM25. It beat a classic search engine (BM25) by 9 to 19 percentage points (absolute) in top-20 passage retrieval accuracy.
E5 (Wang et al. 2022): weakly supervised contrastive pre-training over text pairs; the first model to beat BM25 on the BEIR benchmark without labeled data.
MTEB (Muennighoff et al. 2022): the standard embedding leaderboard — 8 task types, 58 datasets, 112 languages — the way these models are compared today.

The common recipe: large batches (many in-batch negatives) + hard negatives. We’ll pick it up again in Ch. 31 (RAG).

27.9 When to tune and when to use off-the-shelf embeddings

Tune when you have a fixed task with enough labeled data and you want maximum precision or to adapt to a concrete domain.
Use frozen embeddings (off-the-shelf) for search/clustering, when many uses share the same vector space, when there are few labels, or when compute/serving cost matters (compute the vector once, reuse it).

🧪 Try it — tafagent

tafagent profiles the base model you’re going to adapt: its γ (reach), regime, and KV budget. Useful before tuning to know which attention behavior you start from —if you’re going to use it as an embedding encoder for long texts, its decay profile (Ch. 15-20) tells you how far it really “sees”.

27.10 Summary

Paradigm: pre-train → adapt (ULMFiT); the base knowledge is reused (with 100 labeled examples it matched training from scratch on 100× more data).
Classify: add a linear head (\(W \in \mathbb{R}^{K\times H}\)) over the [CLS] summary and tune. Frozen (head only, cheap, no forgetting) vs full (more precise, ~0.3 F1 more in BERT) — mind catastrophic forgetting.
Represent: the raw [CLS] is a poor embedding (anisotropy/narrow cone); mean-pooling works better; Sentence-BERT (bi-encoder) gives vectors reusable by cosine (65 h → 5 s versus the cross-encoder).
Contrastive: pull positives close, push negatives away (the NT-Xent loss, with cosine + temperature \(\tau\)); in-batch negatives for free; alignment + uniformity.
SimCSE: the same text with two dropouts = a positive pair (unsupervised); NLI with hard negatives (supervised).
Modern: DPR, E5, evaluated on MTEB — the basis of semantic search (→ RAG, Ch. 31).

Next (Chapter 27): tuning to classify is one thing; getting a model to follow instructions and align with what we want is another. We get into SFT, RLHF, and DPO.

27.11 Exercises

The head. In the head \(W \in \mathbb{R}^{K\times H}\), what do \(K\) and \(H\) represent? Why is it “the only thing born from scratch” when tuning?
Freeze vs tune. You have 200 labeled examples and little compute. Which mode would you choose and why? And if you had 2 million?
The raw [CLS]. Explain with the word “anisotropy” why the [CLS] of an un-tuned BERT is a poor sentence vector.
Bi vs cross. Why does a bi-encoder scale to searching 10,000 sentences when a cross-encoder doesn’t? What does the bi-encoder precompute?
Temperature. In the contrastive loss, what happens to the treatment of the hard negatives if you lower \(\tau\) a lot?
SimCSE. How does SimCSE build a positive pair without labels, and what happens if you remove the dropout?

References

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. “A Simple Framework for Contrastive Learning of Visual Representations.” ICML. https://arxiv.org/abs/2002.05709.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” NAACL. https://arxiv.org/abs/1810.04805.

Ethayarajh, Kawin. 2019. “How Contextual Are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.” EMNLP. https://arxiv.org/abs/1909.00512.

Gao, Jun, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Representation Degeneration Problem in Training Natural Language Generation Models. https://arxiv.org/abs/1907.12009.

Gao, Tianyu, Xingcheng Yao, and Danqi Chen. 2021. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” EMNLP. https://arxiv.org/abs/2104.08821.

Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-Tuning for Text Classification.” ACL. https://arxiv.org/abs/1801.06146.

Karpukhin, Vladimir et al. 2020. “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP. https://arxiv.org/abs/2004.04906.

Muennighoff, Niklas, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive Text Embedding Benchmark. https://arxiv.org/abs/2210.07316.

Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. https://arxiv.org/abs/1807.03748.

Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” EMNLP. https://arxiv.org/abs/1908.10084.

Wang, Liang, Nan Yang, Xiaolong Huang, et al. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-Training (E5). https://arxiv.org/abs/2212.03533.

Wang, Tongzhou, and Phillip Isola. 2020. “Understanding Contrastive Representation Learning Through Alignment and Uniformity on the Hypersphere.” ICML. https://arxiv.org/abs/2005.10242.
