4 Embeddings and the residual stream

Where we are. In the previous chapter we turned text into tokens: whole numbers. But a number like 15496 tells the model nothing about meaning: 15496 is no “closer” to 15497 than it is to 42. This chapter takes the second step: turning each token into a vector (its embedding) that does capture relationships, and placing it on the residual stream, the “shared memory” along which information travels inside the model. By the end you’ll understand why models use vectors, how they capture meaning, and what the real communication channel between layers actually is.

4.1 The idea in one sentence

Each token is replaced by a learned vector of numbers (its embedding), and all those vectors travel along a common channel, the residual stream, where every layer reads and keeps adding information without erasing what came before.

4.2 Key concepts and their role in the transformer

Before diving into the details, let’s define this chapter’s terms and what each one is for inside a transformer:

Embedding. Definition: the learned vector of numbers that represents a token. In the transformer: it turns a meaningless integer into something with distance and direction, which is what the model can actually process.
Embedding matrix (E). Definition: the big table (vocabulary × d_model) with one row per token. In the transformer: the token ID selects its row; this is where all the input embeddings live.
d_model. Definition: the dimension of each vector and, therefore, the “width” of the residual stream. In the transformer: it sets how much distinguishing capacity the model has; wider = finer detail at the cost of more compute.
Meaning as direction. Definition: the idea that relationships (gender, plural…) live in directions of the space, not on individual axes (king − man + woman ≈ queen). In the transformer: this is the geometry an integer can’t provide, and what makes vectors useful.
Static vs. contextual. Definition: the input embedding is fixed per token; after passing through the layers, it becomes contextualized. In the transformer: “bank” in “river bank” and in “central bank” starts from the same vector, but the layers pull the two apart; contextualization doesn’t live in E, but in what gets written on top of it.
Residual stream. Definition: the common channel (a “shared whiteboard”) along which each token’s vector travels. In the transformer: the embedding initializes it and every layer reads and ADDS (doesn’t erase), so earlier information persists and can be combined later on.
Residual connection. Definition: the sum \(x \leftarrow x + f(x)\) that adds each block’s contribution instead of replacing it. In the transformer: this is what makes it possible to train very deep models without the signal getting lost.
Position. Definition: order information that is injected alongside the embedding, because attention on its own ignores token order (shuffling the input tokens just shuffles the outputs the same way). In the transformer: without it the model couldn’t tell “dog bites man” from “man bites dog”; modern models inject it with RoPE (Ch. 8).

With these in hand, let’s flesh them out.

4.3 What an embedding is

How do we get from an integer to something with meaning? The model keeps a big table, the embedding matrix E, of shape (vocabulary_size × d_model): one row per token in the vocabulary, and each row is a vector of d_model numbers. The token ID simply selects its row:

import torch, torch.nn as nn

vocab, d_model = 50257, 768          # e.g. GPT-2
E = nn.Embedding(vocab, d_model)     # the matrix (it's learned)

ids = torch.tensor([15496, 995, 0])  # "Hello", " world", "!" in GPT-2's vocabulary
vecs = E(ids)                        # (3, 768): one vector per token
print(vecs.shape)                    # torch.Size([3, 768])

Mathematically this is equivalent to multiplying a one-hot vector (all zeros except a single 1 at the token’s position) by E, but in practice it’s just “go to row number id”.
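To make that equivalence concrete, here is a minimal sketch with toy sizes (a vocabulary of 10 and d_model of 4, chosen only for readability and not taken from any real model). It compares the row lookup against an explicit one-hot matrix product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model = 10, 4                  # toy sizes (assumption, for readability)
E = nn.Embedding(vocab, d_model)

ids = torch.tensor([3, 7, 3])
lookup = E(ids)                         # "go to row number id"

# One-hot route: each row is all zeros except a 1 at the token's position
one_hot = F.one_hot(ids, num_classes=vocab).float()   # (3, 10)
matmul = one_hot @ E.weight             # (3, 10) @ (10, 4) -> (3, 4)

print(torch.allclose(lookup, matmul))   # True
```

The lookup is preferred in practice simply because it avoids building and multiplying a mostly-zero matrix.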

4.4 Why a vector and not the integer

Wouldn’t the token’s number be enough? No, because integers have no notion of similarity: 47 is no “more similar” to 48 than to 9000. A vector, by contrast, lives in a space that does have distance and direction: tokens with similar meanings can end up close together, and relationships (gender, plural, verb tense…) can correspond to directions. That geometry is what the model needs and an integer can’t give.
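A small sketch can make the contrast tangible. The token IDs and the three-dimensional vectors below are invented by hand for illustration (they do not come from a trained model); the point is only that ID distance says nothing, while vector similarity can.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical token IDs: "cat" and "car" happen to be neighbours, "dog" is far away
ids = {"cat": 47, "car": 48, "dog": 9000}

# Hand-picked toy vectors (NOT learned): cat and dog point in a similar direction
vecs = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(abs(ids["cat"] - ids["car"]))            # 1: "close" as integers, unrelated in meaning
print(round(cosine(vecs["cat"], vecs["dog"]), 2))
print(round(cosine(vecs["cat"], vecs["car"]), 2))
print(cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["car"]))  # True
```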

4.5 How they are learned

Embeddings are not handcrafted or downloaded: they start with random values and are adjusted alongside the whole model during training, by backpropagation, like any other weight. The semantic structure emerges on its own from training to predict text.
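As a rough illustration of “adjusted like any other weight”, the sketch below runs a single gradient step on a toy embedding table. The squared-error loss toward zero is a stand-in objective chosen for brevity (a real model would use next-token prediction), and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model = 10, 4                         # toy sizes (assumption)
E = nn.Embedding(vocab, d_model)               # starts with random values
opt = torch.optim.SGD(E.parameters(), lr=0.1)
before = E.weight.detach().clone()

ids = torch.tensor([2, 5])                     # the only tokens in this "batch"
loss = (E(ids) ** 2).mean()                    # stand-in loss, not language modeling
loss.backward()                                # gradients reach the rows that were used
opt.step()

# Which rows of E moved after one step?
changed = (E.weight.detach() != before).any(dim=1)
print(changed.nonzero().flatten().tolist())    # [2, 5]
```

Only the rows for tokens that appeared in the batch receive a gradient, which is one reason rare tokens tend to end up with less refined embeddings.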

4.6 Meaning lives in directions

The intuition predates transformers. The distributional hypothesis (Firth 1957) sums it up: “you shall know a word by the company it keeps”; that is, words that appear in similar contexts tend to mean similar things. Methods like word2vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014) brought this to vectors, with the now-famous example:

Figure 4.1: A 2D sketch of the famous word2vec example. The “royalty” relationship (orange) is a *direction* that is nearly identical when going from man→king and from woman→queen; likewise the “gender” direction (blue). That’s why `king − man + woman ≈ queen`. In a real model the space has hundreds of dimensions; this is just a visual intuition.

\[ \text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen}) \]

⚠ Careful — it’s approximate

The analogy is illustrative, not an exact law: ≈ is not =, and in practice it doesn’t work perfectly for every word. And beware: individual dimensions are not interpretable one by one; meaning lives in directions of the space, not on specific axes. A dimension being “the gender one” is the toy exception, not the rule.
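The arithmetic itself can be sketched in a few lines. The vectors below are built by hand so that the analogy nearly holds (they are not word2vec output, and the “apple” distractor is likewise invented); the query words are skipped when searching for the nearest neighbour, which is the usual convention in analogy tests.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings (assumption): woman = man + g, king = man + r, queen ~ man + g + r,
# where g and r are oblique directions, not single axes.
emb = {
    "man":   [1.0, 0.2, 0.5],
    "woman": [1.3, 0.8, 0.3],
    "king":  [0.6, 0.7, 1.2],
    "queen": [0.95, 1.25, 1.0],   # slightly off the exact sum, so the match is only approximate
    "apple": [-0.8, 0.1, 0.3],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping a, b and c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

print(analogy("king", "man", "woman"))   # queen
```

Note that no single coordinate here is “the gender one”: the relationship is carried by a direction that mixes all three axes.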

4.7 The subtlety almost everyone confuses: static vs. contextual

There’s an important subtlety here. word2vec vectors are static: the word “bank” has a single vector, no matter the sentence. A transformer’s input embeddings (the rows of E) are also static per token.

The key difference: in a transformer, that static vector becomes contextualized as it passes through the layers. After attention, the vector for “bank” in “river bank” already differs from the one in “central bank”. Contextualization isn’t in E; it’s in what the layers keep writing on top, and that brings us to the residual stream.
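Here is a minimal sketch of that effect, assuming a toy vocabulary with made-up token IDs and a single untrained attention layer (random weights, no positional information). Even so, the vector at the “bank” position comes out different depending on its neighbour.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
RIVER, BANK, CENTRAL = 0, 1, 2                 # hypothetical token IDs
E = nn.Embedding(3, d_model)                   # static: one row per token
attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

def contextualize(ids):
    x = E(torch.tensor([ids]))                 # (1, seq, d_model)
    out, _ = attn(x, x, x)                     # each token looks at the others
    return x + out                             # written on top of the embedding

with torch.no_grad():
    bank_after_river = contextualize([RIVER, BANK])[0, 1]
    bank_after_central = contextualize([CENTRAL, BANK])[0, 1]

# Same input row of E in both sentences, different vector after one layer
print(torch.allclose(bank_after_river, bank_after_central))   # False
```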

4.8 The residual stream: the shared whiteboard

Where does all this information travel and accumulate? On the residual stream, the concept that best explains how a transformer works on the inside (Elhage et al. 2021).

🧩 Analogy — a conveyor belt with a shared whiteboard. Each token gets on the belt with an initial note: its embedding. At each station (layer), the workers —the attention and the feed-forward network— read what’s on the whiteboard and add their own notes, without erasing anything. At the end of the belt, a reader translates the accumulated whiteboard into the prediction. Nobody has their own memory: everything is shared on that whiteboard.

In formulas, each block adds its contribution to the traveling vector (hence “residual”):

\[ x \leftarrow x + \text{attention}(x), \qquad x \leftarrow x + \text{FFN}(x) \]

Three ideas worth holding onto:

The embedding initializes the stream (it’s the starting value).
Each layer reads and adds (doesn’t replace), so earlier information persists and can be combined later on.
The final value of the stream is “un-embedded” (projected back to the vocabulary) to produce the prediction (Chapter 12).

This additive writing comes from the residual connections of ResNets (He et al. 2016), and it’s exactly what makes it possible to train very deep models without the signal getting lost.
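The two update rules above can be sketched as a small stack of blocks. This is a simplified illustration, not a faithful model: normalization layers are left out (as in the formulas), the weights are random, and the sizes and the 4×d_model hidden width of the feed-forward network are conventional choices assumed here. The `writes` list is a hypothetical helper that records what each sublayer adds.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers, seq = 8, 3, 5               # toy sizes (assumption)

class Block(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, writes):
        a, _ = self.attn(x, x, x)
        x = x + a                              # x <- x + attention(x)
        writes.append(a)
        f = self.ffn(x)
        x = x + f                              # x <- x + FFN(x)
        writes.append(f)
        return x

blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
x0 = torch.randn(1, seq, d_model)              # stands in for the embeddings

with torch.no_grad():
    x, writes = x0, []
    for block in blocks:
        x = block(x, writes)

print(len(writes))                                          # 6
print(torch.allclose(x, x0 + sum(writes), atol=1e-5))       # True
```

The last line is the whiteboard picture in code: the final stream is the initial embedding plus every note written along the way.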

🔍 Going deeper — weight tying

Many models share the input embedding matrix with the output one (the matrix that turns the final vector into logits over the vocabulary): it’s the same E used at both ends (Press and Wolf 2017). The upsides: fewer parameters and, often, better performance. It makes sense: if E[i] represents token i at the input, it can also serve to score it at the output.
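In PyTorch, tying is commonly done by pointing the output layer’s weight at the embedding’s weight. The sketch below uses toy sizes and the name `unembed` for the output projection; both are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model = 1000, 64                      # toy sizes (assumption)
E = nn.Embedding(vocab, d_model)               # weight shape: (vocab, d_model)
unembed = nn.Linear(d_model, vocab, bias=False)  # weight shape: (vocab, d_model)

unembed.weight = E.weight                      # tie: both modules share one Parameter

x_final = torch.randn(3, d_model)              # stand-in for the final stream values
logits = unembed(x_final)                      # (3, vocab): one score per token

print(unembed.weight is E.weight)              # True
print(torch.allclose(logits, x_final @ E.weight.T))   # True
```

Each logit is the dot product between the final vector and one row of E, which is why a row that represents token i at the input can also score it at the output.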

4.9 How big is `d_model`?

d_model (the dimension of each vector and, therefore, the “width” of the residual stream) has grown along with the models:

Table 4.1: Model dimension across well-known architectures

Model	`d_model`
Original Transformer (2017)	512
GPT-2 small / BERT-base	768
LLaMA-2-7B	4096
GPT-3 (175B)	12288

More d_model = more capacity to make fine distinctions (and a wider residual stream), but also more parameters and more compute. It’s another trade-off, like the vocabulary size from the previous chapter.
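To get a feel for the cost side of that trade-off, the sketch below counts the parameters of the embedding matrix alone for each d_model in Table 4.1. It holds the vocabulary fixed at GPT-2’s 50257 for every row purely so the numbers are comparable; the real models do not all share that vocabulary.

```python
VOCAB = 50257   # GPT-2's vocabulary size, reused for every row (assumption, for comparison only)

D_MODELS = {
    "Original Transformer (2017)": 512,
    "GPT-2 small / BERT-base": 768,
    "LLaMA-2-7B": 4096,
    "GPT-3 (175B)": 12288,
}

def embedding_params(vocab, d_model):
    """Number of entries in E: one row of d_model numbers per token."""
    return vocab * d_model

for name, d in D_MODELS.items():
    print(f"{name:28s} d_model={d:5d}  E has {embedding_params(VOCAB, d):,} parameters")
```

The count grows linearly with d_model; the layers themselves, whose weight matrices scale roughly with d_model squared, grow faster still.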

4.10 And where does position come in?

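A detail we close here and open fully later on: as it stands, attention doesn’t know the order of the tokens. Shuffle the input and each token’s output stays the same, just shuffled along with it (strictly this is “permutation-equivariant”, though it is often loosely called “permutation-invariant”). For order to matter, position information has to be injected. The original Transformer added sinusoidal position vectors to each token’s embedding; GPT-2 and BERT added learned ones; many modern models use RoPE, which instead rotates the queries and keys inside attention, and which we’ll cover in detail in Chapter 8.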
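That order-blindness is easy to check on a toy example. The sketch below uses one untrained attention layer and random stand-ins for both the token embeddings and the position vectors (the latter mimic learned, GPT-2-style positions added to the embedding; they are not trained here).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq = 8, 4
attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

tok = torch.randn(1, seq, d_model)             # stand-in token embeddings
pos = torch.randn(1, seq, d_model)             # stand-in position vectors, one per slot
perm = torch.tensor([2, 0, 3, 1])              # a reordering of the four tokens

with torch.no_grad():
    # Without positions: shuffling the input only shuffles the output
    out, _ = attn(tok, tok, tok)
    shuffled = tok[:, perm]
    out_shuffled, _ = attn(shuffled, shuffled, shuffled)
    same_without_pos = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)

    # With positions: tokens move, slots stay put, so the result really changes
    x = tok + pos
    x_shuffled = tok[:, perm] + pos
    out_p, _ = attn(x, x, x)
    out_p_shuffled, _ = attn(x_shuffled, x_shuffled, x_shuffled)
    same_with_pos = torch.allclose(out_p[:, perm], out_p_shuffled, atol=1e-5)

print(same_without_pos, same_with_pos)         # True False
```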

4.11 Summary

An embedding is the learned vector that represents a token; it lives in the matrix E (vocabulary × d_model), and the token ID selects its row.
We use vectors (not the integer) because they provide distance and direction: meaning lives in directions (king − man ≈ queen − woman), not on individual axes.
Input embeddings are static per token; the layers contextualize them afterward.
The residual stream is the common channel: the embedding initializes it, every layer reads and ADDS (doesn’t erase), and the final value is un-embedded to predict.
d_model (the width of the stream) ranges from 512 to 12288 across the models in Table 4.1: more capacity in exchange for more compute.

Next (Chapter 5): we now have the vectors on the stream. Time for the heart of the book: how attention lets each token look at the others and blend their information.

4.12 Exercises

The E table. If the vocabulary has 50,000 tokens and d_model = 768, how many numbers (parameters) does the embedding matrix have? And with d_model = 4096?
Static vs. contextual. Explain in your own words why “bank” has one input embedding but different representations in deep layers depending on the sentence.
Additive. On the residual stream, x ← x + FFN(x). What’s the advantage of adding rather than replacing x? (Hint: what happens to the information from earlier layers?)
Directions. If vec(Paris) − vec(France) + vec(Spain) ≈ vec(?), what word would you expect, and why?

References

Elhage, Nelson et al. 2021. “A Mathematical Framework for Transformer Circuits.” Transformer Circuits Thread (Anthropic). https://transformer-circuits.pub/2021/framework/index.html.

Firth, John R. 1957. “A Synopsis of Linguistic Theory, 1930–1955.” In Studies in Linguistic Analysis. Blackwell.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” CVPR. https://arxiv.org/abs/1512.03385.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” ICLR Workshop. https://arxiv.org/abs/1301.3781.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” EMNLP.

Press, Ofir, and Lior Wolf. 2017. “Using the Output Embedding to Improve Language Models.” EACL. https://arxiv.org/abs/1608.05859.