10 The complete block and depth

Where we are. We have all the pieces: attention (Ch. 4), multi-head (5), FFN (6), the residual+norm scaffolding (7) and the sense of order (8). Time to assemble them into a complete block, stack that block many times, and follow a token’s journey from start to finish. By the end of this chapter you’ll have built an entire transformer, which closes the fundamentals Part.

10.1 The idea in one sentence

A transformer is, almost entirely, a stack of structurally identical blocks: each block mixes (attention) and then processes (FFN), wrapped in its residual+norm scaffolding, and that block is repeated N times.

10.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:

Block (Pre-LN). Definition: the unit that repeats, made of two sublayers with the pattern normalize → sublayer → add. In the transformer: it’s the basic brick; stacking it N times is, almost entirely, the model.
Attention sublayer (mix). Definition: the operation that lets each token look at the others and pull in information. In the transformer: the block’s first line; it moves information between positions.
FFN sublayer (process). Definition: a network that transforms each token separately. In the transformer: the block’s second line; each token “thinks” about what it just gathered, without looking at its neighbors.
Residual stream. Definition: each token’s vector that flows through the whole model, to which each sublayer adds its contribution. In the transformer: the through-line; the embeddings initialize it, the blocks edit it, and its final value is translated into the output.
Causal mask. Definition: a filter that sets the attention scores of future positions to −∞ before the softmax, so their weight comes out as 0. In the transformer: what prevents “cheating” by looking at the future; it turns the model into a generator.
Depth (stacking layers). Definition: the number of stacked blocks (12, 32, 96…). In the transformer: it lets the model compose simple patterns into abstract concepts; that’s why certain abilities only emerge with depth.
Unembedding (output head). Definition: the final projection of each token’s vector back to the vocabulary. In the transformer: it translates the internal state into a prediction (logits over the possible words).

With these pieces, the rest of the chapter is just watching them fit into a block and repeating it.

10.3 The block, assembled

Each block does exactly two things, the two you already know, in this order (Pre-LN style, the modern one):

\[ x \leftarrow x + \text{Attention}(\text{Norm}(x)) \]
\[ x \leftarrow x + \text{FFN}(\text{Norm}(x)) \]

What each line does:

Line 1 — mix: normalize, let each token look at the others (multi-head attention) and add the result to the residual stream.
Line 2 — process: normalize, each token thinks on its own (FFN) and adds the result.

And that’s it. That pattern —normalize → sublayer → add, twice— is the entire block. The model’s intelligence isn’t in a sophisticated block, but in repeating this simple block many times over a good information stream.
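To make the difference between "mix" and "process" concrete, here is a small sketch. The sizes and the perturbation are arbitrary choices for illustration, not values from the chapter. It changes only the last token and checks which sublayer lets that change reach the other tokens. No mask is applied here, so attention is free to look in every direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_tokens = 8, 4          # toy sizes, chosen only for this demo

# Stand-ins for the two sublayers of a block
ffn = nn.Sequential(nn.Linear(d_model, 16), nn.GELU(), nn.Linear(16, d_model))
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

x = torch.randn(1, n_tokens, d_model)
x_changed = x.clone()
x_changed[0, 3] += 1.0            # perturb the last token only

with torch.no_grad():
    # FFN: do the first three tokens notice the change in the fourth?
    ffn_same = torch.allclose(ffn(x)[0, :3], ffn(x_changed)[0, :3])
    # Attention (unmasked): same question
    a_before = attn(x, x, x)[0]
    a_after = attn(x_changed, x_changed, x_changed)[0]
    attn_same = torch.allclose(a_before[0, :3], a_after[0, :3])

print("FFN output of other tokens unchanged:      ", ffn_same)
print("Attention output of other tokens unchanged:", attn_same)
```

The FFN treats each position independently, so the other tokens are untouched; attention is the only place in the block where information crosses between positions.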

10.4 The causal mask: what makes generation possible

There’s a detail that decides whether the model understands or generates: what each token can look at. In a generative model (GPT-style), each token can only attend to itself and the previous ones, never the future. That’s achieved with a causal mask: before the softmax, future positions are set to −∞, so their weight ends up at 0.

What’s it for? So the model learns to predict the next word without “cheating” by looking at the answer. If it could see the future, predicting would be trivial and it wouldn’t learn anything useful. (In a comprehension model like BERT there’s no causal mask: each token sees everything; we’ll look at this in the next chapter.)

import torch
n = 5                                  # 5 tokens
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
print(mask)   # upper triangle = -inf (the future, blocked)
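
A quick way to see that −∞ turns into weight 0 is to push the mask through a softmax. The sketch below assumes all raw attention scores are equal (zeros), a made-up case chosen so the result is easy to read: each token spreads its attention evenly over itself and the tokens before it.

```python
import torch

n = 4                                           # 4 tokens
scores = torch.zeros(n, n)                      # assumed: every raw score is equal
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

# Add the mask to the scores, then normalize each row with softmax
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row i holds the weights token i gives to every position:
# nonzero up to and including position i, exactly 0 afterwards.
```

Every row still sums to 1, so masking does not remove attention; it redistributes it over the positions the token is allowed to see.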

10.5 Stacking in depth: why so many layers?

A transformer doesn’t have one block: it has many (GPT-2 small: 12; LLaMA-2-7B: 32; the largest, 80–100+). Why? Because depth lets the model compose:

The low layers detect surface patterns (syntax, the previous token).
The middle and high layers combine that into increasingly abstract concepts (references, semantics, facts).

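Each block reads from the residual stream what the previous ones left and adds its contribution. That’s why certain abilities only appear with depth: the induction heads from Ch. 5, for example, need at least two layers to form (a head in an earlier layer notes which token came before each position, and a head in a later layer uses that note to find the earlier occurrence of the pattern and copy what followed it).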

🧩 Analogy. A single pass is a shallow reading. Stacking blocks is rereading the text many times, and on each reread understanding one more layer of meaning: from loose words, to sentences, to ideas.
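
The idea that each block only adds to the residual stream can be checked numerically. In the sketch below, toy layers (a norm followed by a linear map) stand in for real blocks; that substitution is an assumption made to keep the example short. The final stream should equal the initial vectors plus the sum of every layer's contribution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers = 8, 3

# Toy stand-ins for blocks: normalize, then transform (not real attention/FFN)
layers = [nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model))
          for _ in range(n_layers)]

x0 = torch.randn(1, 5, d_model)    # initial residual stream: 5 tokens
x = x0
contributions = []
with torch.no_grad():
    for layer in layers:
        delta = layer(x)           # reads what previous layers left in the stream
        contributions.append(delta)
        x = x + delta              # adds its own contribution

total = x0 + sum(contributions)
print(torch.allclose(x, total, atol=1e-6))
```

Nothing is ever overwritten: a later layer sees the embeddings and everything earlier layers wrote, all summed in the same vector.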

10.6 A token’s full journey

Putting the whole fundamentals Part together, this is the path from start to finish:

Tokenize (Ch. 2): text → numbers.
Embeddings + position (Ch. 3, 8): each token → a vector, placed in the residual stream, with its order information.
N blocks (Ch. 4–7): mix (attention) → process (FFN), over and over, refining each token’s vector.
Final norm + “unembedding” (Ch. 12): each token’s final vector is projected back to the vocabulary to give a prediction (e.g., the next word).

The residual stream is the thread that runs through everything: the embeddings initialize it, the blocks edit it, and its final value is translated into the output.
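
The last step of the journey, from final vector to prediction, can be sketched in isolation. Everything here is a stand-in: the four-word vocabulary is hypothetical, the "final" vectors are random, and the weights are untrained, so the predicted word is meaningless. The point is only the shapes and the order of operations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["the", "cat", "sat", "down"]   # hypothetical toy vocabulary
d_model = 8

norm_f = nn.LayerNorm(d_model)            # final norm
head = nn.Linear(d_model, len(vocab))     # "unembedding"

# Stand-in for the residual stream after N blocks: 1 sequence, 3 tokens
x = torch.randn(1, 3, d_model)

with torch.no_grad():
    logits = head(norm_f(x))                       # (1, 3, vocab size)
    probs = torch.softmax(logits[0, -1], dim=-1)   # distribution after the last token

next_id = int(probs.argmax())
print(logits.shape)       # one row of logits per input position
print(vocab[next_id])     # the most likely next word under these random weights
```

Each position gets its own row of logits; when generating, only the row for the last position is used to pick the next token.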

10.7 A minimal transformer, in code

Everything above fits in a few readable lines:

import torch, torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn   = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                   nn.Linear(d_ff, d_model))
    def forward(self, x, mask):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=mask)[0]   # mix
        x = x + self.ffn(self.norm2(x))                 # process
        return x

class GPTmini(nn.Module):
    def __init__(self, vocab, d_model=768, n_heads=12, d_ff=3072, n_layers=12, ctx=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)         # token embeddings
        self.pos = nn.Embedding(ctx, d_model)           # position (learned, simple)
        self.blocks = nn.ModuleList([Block(d_model, n_heads, d_ff) for _ in range(n_layers)])
        self.norm_f  = nn.LayerNorm(d_model)
        self.head    = nn.Linear(d_model, vocab)        # "unembedding" -> logits

    def forward(self, ids):
        n = ids.shape[1]
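        pos_ids = torch.arange(n, device=ids.device)    # positions 0..n-1, on the same device as ids
        x = self.tok(ids) + self.pos(pos_ids)           # init the residual stream
        mask = torch.triu(torch.full((n, n), float('-inf'), device=ids.device), diagonal=1)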
        for b in self.blocks:
            x = b(x, mask)                              # N blocks
        return self.head(self.norm_f(x))                # logits over the vocabulary

That is —in essence— a GPT. The rest (training it, generating text) is what comes in Ch. 11–12. The important thing: you already understand every line.
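
As a rough sanity check on the default sizes in GPTmini, the arithmetic below counts its learnable parameters. The vocabulary size is not fixed by the code above; 50,257 (GPT-2's vocabulary) is assumed here purely for illustration. The count also assumes the PyTorch defaults used in the listing: biases on every linear layer, and an output head that is not shared with the token embeddings.

```python
# Sizes: vocab is an assumption (GPT-2's); the rest are GPTmini's defaults
vocab, d_model, d_ff, n_layers, ctx = 50257, 768, 3072, 12, 1024

layer_norm = 2 * d_model                                  # scale + shift
attention = (3 * d_model * d_model + 3 * d_model) \
          + (d_model * d_model + d_model)                 # Q,K,V projection + output projection
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # expand + contract
per_block = 2 * layer_norm + attention + ffn

embeddings = vocab * d_model + ctx * d_model              # tokens + positions
head = d_model * vocab + vocab                            # unembedding, with bias
total = embeddings + n_layers * per_block + layer_norm + head

print(f"per block: {per_block:,}")
print(f"blocks:    {n_layers * per_block:,}")
print(f"total:     {total:,}")
```

Under these assumptions, a large share of the parameters outside the blocks sits in the two vocabulary-sized matrices (token embeddings and output head), which is one reason real implementations often share them.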

10.8 Summary

The transformer is a stack of identical blocks; each block = normalize → attention → add, then normalize → FFN → add.
The causal mask (blocking the future) is what turns the model into a generator; without it (BERT), each token sees everything → comprehension.
Depth lets the model compose simple patterns into abstract concepts; that’s why there are 12, 32 or 96 layers, and why certain abilities emerge with depth.
The full journey: tokenize → embeddings+position → N blocks → final norm → unembedding → prediction. The residual stream is the through-line.

Next chapter: we’ve built a decoder (causal, generative). But by changing the mask and the connection, three families of models emerge: BERT, GPT and T5. Those are the architectures.

10.9 Exercises

The pattern. Write, from memory, the two lines that make up a Pre-LN block. What does each one do?
The mask. Why must a generative model hide the future during training? What would happen if it didn’t?
Depth. Give an example (from Ch. 5) of an ability that needs more than one layer to exist, and explain why.
The path. Order these steps: unembedding · embeddings · tokenize · N blocks · add position.