32  Embeddings, semantic search and RAG

Where we are. In Ch. 30 we saw that prompting elicits what the model knows but does not add knowledge. What if the model needs facts it does not have, such as recent data or private documents? The answer is to retrieve them and put them in the context: retrieval-augmented generation (RAG). This chapter builds on the embeddings and bi/cross-encoders of Ch. 26 to assemble the full system (retrieve → read → generate) and to evaluate it honestly.

32.1 The idea in one sentence

Instead of relying on what the model memorized, you find the relevant evidence and hand it over in the context so that the model answers from that evidence, as in an open-book exam.

🧩 Analogy — the open-book exam. Instead of memorizing the whole syllabus, the student looks up the relevant page and answers from it. RAG does the same: it does not ask the model to know everything by heart, but to consult a base of documents and ground its answer in what it finds.

32.2 Key concepts and their role in the transformer

Before diving in, we define this chapter’s terms and what each one is for inside a transformer:

  • Embedding. Definition: a vector that represents the meaning of a text. In the transformer: it is the common currency of semantic search, because similar texts land close together in the vector space.
  • Dense vs sparse retrieval (BM25). Definition: searching by meaning (vectors) or by word overlap (BM25). In the transformer: dense captures paraphrase; BM25 catches exact terms → the hybrid usually wins.
  • ANN (approximate nearest neighbors). Definition: finding the nearest vectors without comparing against all of them (HNSW, FAISS). In the transformer: it makes search viable at the scale of millions, sacrificing a bit of recall for speed.
  • Reranking (bi- vs cross-encoder). Definition: retrieve many candidates cheaply and reorder the top with an expensive, precise model. In the transformer: the cross-encoder looks at query and document together → more precision where it matters.
  • Chunking. Definition: splitting documents into fragments before indexing them. In the transformer: a focused fragment embeds better and fits in the context.
  • RAG. Definition: retrieve evidence and condition generation on it. In the transformer: it adds a queryable non-parametric memory alongside the parametric one in the weights.
  • Parametric vs non-parametric memory. Definition: what is in the weights vs what is in an external index. In the transformer: the non-parametric one updates without retraining, which is the motivation for RAG.
  • Faithfulness (groundedness). Definition: the degree to which the answer derives from the retrieved documents. In the transformer: a separate axis from fluency. RAG reduces hallucination but does not eliminate it.
  • “Lost in the middle”. Definition: models use information at the ends of the context better than information in the middle. In the transformer: it dictates placing the relevant fragments at the beginning or the end.

With that in mind, let’s assemble the system.

32.3 Why RAG exists

Four problems that pretraining alone does not solve:

  • Knowledge cutoff: the weights freeze what was seen at training time; they know nothing that happened afterwards.
  • Private/changing data: your company’s documents, today’s prices… are not in the weights.
  • Hallucination: when it does not know, the model fluently makes things up.
  • Prompting does not add knowledge (the Ch. 30 ceiling): it reorganizes what is already there.

The key distinction (Lewis et al. 2020): the model is parametric memory (what is in the weights); the document index the retriever accesses is non-parametric memory (what is outside, queryable and updatable). RAG conditions generation on retrieved evidence instead of reciting from the weights.

32.4 The canonical pipeline

The term was coined by Lewis et al. (2020): a dense retriever (DPR style, Ch. 26) + a seq2seq generator (BART), so that the retrieved documents condition the output. Two variants: RAG-Sequence (conditions the whole answer on the same retrieved document, then marginalizes over the top-k documents) and RAG-Token (can draw on a different document for each token). Precursors worth remembering: REALM (Guu et al. 2020) (retrieval during pretraining) and kNN-LM (Khandelwal et al. 2020) (interpolates the LM with a nearest-neighbor datastore, which helps on rare, long-tail cases).
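
The two variants, and the kNN-LM interpolation, can be written compactly. This is a sketch in notation introduced here for illustration: x is the input, y the output of N tokens, z a retrieved document, p_η the retriever, p_θ the generator, and λ an interpolation weight. Check the cited papers for the exact statements.

```latex
% RAG-Sequence: a single document z conditions the whole output,
% and the marginalization over documents happens at the sequence level.
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: each token marginalizes over the documents separately,
% so different tokens can lean on different documents.
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\,
  p_\theta(y_i \mid x, z, y_{1:i-1})

% kNN-LM: interpolate the LM's next-token distribution with one
% built from nearest neighbors in the datastore.
p(y \mid x) = \lambda\, p_{\text{kNN}}(y \mid x) + (1 - \lambda)\, p_{\text{LM}}(y \mid x)
```

The only difference between the first two lines is the order of the sum and the product.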

Today’s practical pattern is “retrieve-then-read”, usually with retriever and generator frozen:

  1. Index the corpus (chunk + embed each fragment).
  2. Embed the query from the user.
  3. Retrieve the top-k most similar fragments.
  4. Insert them into the prompt.
  5. Generate the answer conditioned on that evidence.
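
The five steps above can be sketched end to end. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the function names are hypothetical, and step 5 (calling a generator) is left out because it depends on the LLM you use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    # Step 1: embed every fragment once, offline.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    # Steps 2-3: embed the query and keep the k most similar fragments.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, fragments):
    # Step 4: insert the numbered evidence into the prompt.
    evidence = "\n".join(f"[{i + 1}] {frag}" for i, frag in enumerate(fragments))
    return (
        "Answer using only the evidence below.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "HNSW is a graph index for approximate nearest neighbor search.",
    "BM25 scores documents by weighted word overlap.",
    "Chunking splits documents into fragments before indexing.",
]
index = build_index(chunks)
question = "How does BM25 score documents?"
top = retrieve(index, question, k=2)
prompt = build_prompt(question, top)
# Step 5 would pass `prompt` to a generator; it is omitted here.
print(prompt)
```

Numbering the fragments in the prompt is one simple way to let the answer cite its sources.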

32.5 The retrieval half (building on Ch. 26)

32.5.1 Dense vs sparse, and hybrid

  • Dense: semantic embeddings (DPR, E5 from Ch. 26). It retrieves by meaning, even when the exact words do not match.
  • Sparse/lexical — BM25 (Robertson and Zaragoza 2009): scores by weighted word overlap (saturated frequency + term rarity + length). It remains a strong baseline, especially for proper nouns, codes and exact terms.
  • Hybrid: combining both usually wins. Dense brings semantics; sparse catches the exact and rare matches that dense misses.
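
To make "saturated frequency + term rarity + length" concrete, here is a small BM25 scorer. It is a sketch under assumptions: the tokenizer is a naive regex, the IDF uses the non-negative "+1" variant, and `k1=1.5`, `b=0.75` are commonly used defaults rather than values taken from this chapter.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1.0 - b + b * len(toks) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Term rarity: rare terms get a larger weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Saturated frequency: repeated occurrences add less and less.
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "error code E1042 means the disk is full",
    "the disk drive stores data on the disk",
    "restart the service to clear the error",
]
scores = bm25_scores("E1042 disk", docs)
print(scores)
```

The first document wins because it contains the rare exact code `E1042`; the second mentions "disk" twice but gains less than double from the repetition.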

🧩 Analogy — search by meaning, not by title. Dense retrieval is a librarian who understands what you mean and brings you books on the topic, even if you do not use their exact words; BM25 is the keyword catalog that only matches literal strings. Each one succeeds where the other fails → which is why they are combined.
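
One simple way to combine the two rankings is reciprocal rank fusion (RRF). The chapter does not prescribe a fusion method, so treat this as one common option among several; the constant `k=60` is a conventional default and the document ids are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each list contributes 1 / (k + rank) to every document it contains,
    so a document ranked well by several retrievers rises to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

dense_ranking = ["d3", "d1", "d2"]   # hypothetical output of a dense retriever
bm25_ranking = ["d1", "d4", "d3"]    # hypothetical output of BM25
hybrid = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(hybrid)
```

RRF only needs ranks, not scores, which avoids having to put cosine similarities and BM25 scores on a common scale.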

32.5.2 Vector databases: approximate search (ANN)

Comparing the query against all the vectors costs O(N) per query, which becomes too slow at the scale of millions of vectors. The solution is approximate nearest neighbors (ANN): you sacrifice a bit of recall for a large gain in speed. Two common pieces:

  • HNSW (Malkov and Yashunin 2016): a layered navigable graph through which the search “hops” greedily toward the query’s neighbors, at a cost that grows roughly logarithmically with the number of vectors. It is a default index in many vector databases.
  • FAISS (Johnson et al. 2017): a library with IVF (partitions the space into cells and searches only the nearby ones) and PQ (compresses the vectors so they fit in memory).

The shared idea: structures that avoid comparing against everything.
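
The IVF idea (partition the space into cells, search only the nearby ones) fits in a few lines. This is a didactic sketch, not how FAISS is implemented: the centroids are simply sampled from the data instead of being trained with k-means, and the names `build_ivf`, `ivf_search` and `nprobe` are chosen here for illustration.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, n_cells, seed=0):
    """Assign every vector to the cell of its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_cells)  # assumption: sampled, not k-means
    cells = [[] for _ in range(n_cells)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_cells), key=lambda c: dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, nprobe=2):
    """Compare the query only against vectors in the nprobe closest cells."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    best = min(candidates, key=lambda idx: dist(query, vectors[idx]))
    return best, len(candidates)

def exact_search(query, vectors):
    """Brute force: compare against every vector."""
    return min(range(len(vectors)), key=lambda idx: dist(query, vectors[idx]))

rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids, cells = build_ivf(vectors, n_cells=20)
query = [rng.gauss(0, 1) for _ in range(8)]

approx_idx, n_compared = ivf_search(query, vectors, centroids, cells, nprobe=4)
exact_idx = exact_search(query, vectors)
print(f"compared {n_compared} of {len(vectors)} vectors; "
      f"same answer as exact search: {approx_idx == exact_idx}")
```

Raising `nprobe` trades speed back for recall; probing every cell reproduces exact search.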

32.5.3 Reranking: the glance and the close read

This ties back to bi- vs cross-encoder (Ch. 26). A two-stage pattern: retrieve many candidates cheaply with a bi-encoder (precomputed vectors), and then reorder the top with an expensive cross-encoder (query + document together → more precise). Examples: monoBERT (Nogueira et al. 2019) and monoT5 (Nogueira et al. 2020) (cross-encoder rerankers); ColBERT (Khattab and Zaharia 2020) (late interaction — a middle ground: per-token vectors, precomputable, with a cheap comparison at query time).

🧩 Analogy — grab 50 and read 5. Reranking is like pulling 50 candidate books off the shelf at a quick glance (bi-encoder) and then carefully reading the summary of the best 5 (cross-encoder) to keep the truly relevant ones.
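
The two-stage pattern can be sketched with toy scorers. Both scoring functions below are stand-ins invented for this example: `cheap_score` plays the role of a bi-encoder (it compares independent representations), and `expensive_score` plays the role of a cross-encoder (it looks at query and document together). Real systems would call trained models in their place.

```python
import re

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cheap_score(query, doc):
    # Bi-encoder stand-in: overlap between independently computed word sets.
    return len(set(tokens(query)) & set(tokens(doc)))

def expensive_score(query, doc):
    # Cross-encoder stand-in: reads both texts jointly and rewards
    # query word pairs that also appear adjacent in the document.
    q, d = tokens(query), tokens(doc)
    doc_bigrams = set(zip(d, d[1:]))
    bigram_hits = sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)
    return cheap_score(query, doc) + 2 * bigram_hits

def retrieve_and_rerank(query, docs, n_candidates=2, n_final=1):
    # Stage 1: cheap scoring over the whole corpus, keep a short list.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: expensive scoring only on the short list.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:n_final]

docs = [
    "Rates of interest in river bank erosion vary by season.",
    "The bank raised interest rates on savings accounts.",
    "Fishing from the river bank is popular in summer.",
]
query = "bank interest rates"
first_stage_top = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[0]
best = retrieve_and_rerank(query, docs)
print("first stage top:", first_stage_top)
print("after reranking:", best[0])
```

The expensive scorer runs on `n_candidates` documents instead of the whole corpus, which is the reason for splitting the work in two stages.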

32.6 Chunking and indexing

  • Why chunk: the context is limited, and a focused fragment embeds better than a whole document (which dilutes the topic).
  • Size and overlap: small fragments = precise retrieval but they lose context; large ones = more context but a blurry embedding and noise. The overlap keeps ideas from being cut mid-sentence. (These are heuristics, not theorems: it is a trade-off.)
  • “Lost in the middle” (Liu et al. 2023): models use information at the beginning or the end of the context better than information in the middle (a U-shaped curve). Implication for RAG: avoid a huge k, and place the most relevant fragments at the ends.
  • Metadata filtering: narrow by date, source or permissions before/alongside the search.
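
The “lost in the middle” advice in the list above translates into a small reordering step before the fragments are inserted into the prompt. This is one possible heuristic, not a method from the cited paper: it assumes the fragments arrive sorted best-first and alternates them between the front and the back.

```python
def order_for_context(ranked_fragments):
    """Place the best fragments at both ends and the weakest in the middle.

    Input: fragments sorted from most to least relevant.
    """
    front, back = [], []
    for i, fragment in enumerate(ranked_fragments):
        # Even positions fill the front, odd positions fill the back.
        (front if i % 2 == 0 else back).append(fragment)
    return front + back[::-1]

ordered = order_for_context(["r1", "r2", "r3", "r4", "r5"])
print(ordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Here `r1` opens the context, `r2` closes it, and the least relevant fragment `r5` ends up in the middle.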

🧩 Analogy — index cards. Chunking is tearing the book into index cards so you can pull out only the relevant card instead of loading the whole book: easier to find and easier to fit in the context.
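
A sliding window over words is the simplest way to cut the cards. The sketch below is deliberately naive: it splits on whitespace and ignores sentence and section boundaries, and the function name and default sizes are chosen for the example only.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(text, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two words of the previous one, so an idea that straddles a boundary still appears whole in at least one fragment.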

32.7 Variants and advances

  • Fusion-in-Decoder (FiD) (Izacard and Grave 2021): it encodes each passage separately and fuses them in the decoder → it scales to many passages without the quadratic cost of concatenating them.
  • REPLUG (Shi et al. 2023): retrieval for black-box (frozen) LMs; optionally, train the retriever using the LM itself as a signal.
  • Self-RAG (Asai et al. 2024): the model decides when and what to retrieve and critiques its own outputs and the passages (self-evaluation).
  • FLARE (Jiang et al. 2023): iterative, active retrieval, in contrast to single-shot RAG: it anticipates the next sentence and, if it is unsure, retrieves and regenerates.
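
The FiD point about quadratic cost can be checked with back-of-the-envelope arithmetic. This sketch counts only token-pair interactions in encoder self-attention and ignores constants, layers, heads and the decoder; the passage count and length are illustrative numbers, not figures from this chapter.

```python
def concat_attention_cost(n_passages, passage_len):
    # One encoder pass over all passages concatenated:
    # quadratic in the total length.
    return (n_passages * passage_len) ** 2

def fid_encoder_attention_cost(n_passages, passage_len):
    # One encoder pass per passage:
    # quadratic in passage length, linear in the number of passages.
    return n_passages * passage_len ** 2

n, length = 100, 250
ratio = concat_attention_cost(n, length) / fid_encoder_attention_cost(n, length)
print(ratio)  # 100.0
```

Under this simplified count, the saving factor equals the number of passages.
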

⚠ Honest — does long context kill RAG?

With 100k-1M+ token windows, is it enough to dump in everything and forget about retrieval? The debate is open, but the current consensus is that they are complementary: (1) “lost in the middle” shows that more context is not used equally; (2) cost/latency grow with the context; (3) RAG gives attribution (where the answer comes from) and fresh data. RAG selects what to put in a window that still costs something.

32.8 Evaluating RAG (honest)

Two axes, and it is worth not confusing them:

  • Retrieval quality (classic IR metrics): Recall@k (is the good document among the top k?), MRR (the mean, over queries, of the reciprocal of the rank at which the first relevant result appears) and nDCG (rewards having the relevant items high in the ranking).
  • Generation quality + FAITHFULNESS (groundedness): does the answer derive from the retrieved documents? RAGAS (Es et al. 2024) evaluates it without a reference using an LLM judge on three axes: faithfulness (support in the context → catches hallucinations), answer relevance and context relevance.
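
The three retrieval metrics are short enough to write out. This is a minimal sketch with made-up document ids: `recall_at_k` returns the fraction of relevant documents found in the top k (which reduces to a 0/1 hit when there is a single relevant document), and `ndcg_at_k` uses the common log2 discount.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mrr(all_ranked, all_relevant):
    """Mean reciprocal rank over a set of queries."""
    values = [reciprocal_rank(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(values) / len(values)

def ndcg_at_k(ranked, gains, k):
    """nDCG@k; `gains` maps doc id -> graded relevance (missing ids count as 0)."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Two hypothetical queries, each with one relevant document, "d1".
rankings = [["d2", "d1", "d3"], ["d1", "d2", "d3"]]
relevant = [{"d1"}, {"d1"}]

print(recall_at_k(rankings[0], relevant[0], k=1))   # 0.0
print(recall_at_k(rankings[0], relevant[0], k=2))   # 1.0
print(mrr(rankings, relevant))                      # 0.75
print(ndcg_at_k(rankings[1], {"d1": 1}, k=3))       # 1.0
```

These metrics score the retriever only; they say nothing about whether the generated answer is faithful to what was retrieved.
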

⚠ Honest — RAG reduces hallucination, it does not eliminate it

Three cautions: (1) the model can hallucinate even with correct retrieval, by ignoring or contradicting the context; that is why the faithfulness metric exists. (2) Retrieval errors propagate: if you bring in garbage, the generator builds on garbage. (3) Context poisoning: erroneous, contradictory or malicious documents degrade or hijack the answer. RAG is a big improvement but not a guarantee of truth.

🧪 Try it — tafagent

The “lost in the middle” effect is, at bottom, a phenomenon of attention across distance, which is precisely the subject of Part II. tafagent gives you the γ and the effective horizon of a model: how much it really “sees” across the context. This is useful for deciding how many fragments it makes sense to retrieve (a short-horizon model will not exploit a huge top-k) and where to place them.

32.9 Summary

  • RAG (Lewis et al. 2020): retrieve evidence and condition generation on it, combining parametric memory (weights) with non-parametric memory (a queryable index).
  • Pipeline: index → embed query → retrieve top-k → insert → generate (retrieve-then-read). Precursors: REALM, kNN-LM.
  • Retrieval: dense (semantic) + BM25 (lexical) → hybrid wins; ANN (HNSW, FAISS) to scale; reranking bi→cross (ColBERT in between).
  • Chunking: chunk for precision; mind size/overlap and “lost in the middle” (relevant fragments at the ends).
  • Variants: FiD, REPLUG (black box), Self-RAG, FLARE (active). Long context vs RAG: complementary (open debate).
  • Evaluate: retrieval (Recall@k/MRR/nDCG) and generation + faithfulness (RAGAS). Honest: RAG reduces, does not eliminate hallucination; retrieval errors propagate.

Next (Chapter 33): if, instead of just retrieving text, the model can act (use tools, call APIs, plan over several steps), we enter the world of agents.

32.10 Exercises

  1. Parametric vs not. Explain the difference between parametric and non-parametric memory. Which one do you update without retraining?
  2. Dense vs BM25. Give a case where BM25 beats dense retrieval and another the other way around. Why is the hybrid usually better?
  3. ANN. Why is exact kNN not used at the scale of millions of vectors? What is sacrificed with HNSW/FAISS?
  4. Reranking. Why retrieve with a bi-encoder and reorder with a cross-encoder, rather than using the cross-encoder for everything?
  5. Lost in the middle. If you retrieve 10 fragments, where do you place the most relevant ones and why?
  6. Honesty. Cite two reasons RAG can give a wrong answer even though retrieval works.

References

Asai, Akari, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. “Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection.” ICLR. https://arxiv.org/abs/2310.11511.
Es, Shahul, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2024. “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” EACL (Demo). https://arxiv.org/abs/2309.15217.
Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. “REALM: Retrieval-Augmented Language Model Pre-Training.” ICML. https://arxiv.org/abs/2002.08909.
Izacard, Gautier, and Edouard Grave. 2021. “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.” EACL. https://arxiv.org/abs/2007.01282.
Jiang, Zhengbao, Frank F. Xu, Luyu Gao, et al. 2023. “Active Retrieval Augmented Generation.” EMNLP. https://arxiv.org/abs/2305.06983.
Johnson, Jeff, Matthijs Douze, and Hervé Jégou. 2017. Billion-Scale Similarity Search with GPUs. https://arxiv.org/abs/1702.08734.
Khandelwal, Urvashi, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. “Generalization Through Memorization: Nearest Neighbor Language Models.” ICLR. https://arxiv.org/abs/1911.00172.
Khattab, Omar, and Matei Zaharia. 2020. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR. https://arxiv.org/abs/2004.12832.
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS. https://arxiv.org/abs/2005.11401.
Liu, Nelson F., Kevin Lin, John Hewitt, et al. 2023. “Lost in the Middle: How Language Models Use Long Contexts.” TACL. https://arxiv.org/abs/2307.03172.
Malkov, Yu A., and D. A. Yashunin. 2016. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. https://arxiv.org/abs/1603.09320.
Nogueira, Rodrigo, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. “Document Ranking with a Pretrained Sequence-to-Sequence Model.” Findings of EMNLP. https://arxiv.org/abs/2003.06713.
Nogueira, Rodrigo, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-Stage Document Ranking with BERT. https://arxiv.org/abs/1910.14424.
Robertson, Stephen, and Hugo Zaragoza. 2009. “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval 3 (4): 333–89. https://doi.org/10.1561/1500000019.
Shi, Weijia, Sewon Min, Michihiro Yasunaga, et al. 2023. REPLUG: Retrieval-Augmented Black-Box Language Models. https://arxiv.org/abs/2301.12652.