26 Pre-training at scale

Where we are. In Ch. 11 we saw how a transformer learns (predict the next token, gradient, Adam, warmup). Now we move up to industrial scale: how many parameters and how much data to use, where that data comes from and how it’s cleaned, how to spread the compute across thousands of GPUs without it blowing up, and why the “optimal” recipe of three years ago has already changed. This is the chapter that separates “I trained a little model” from “I trained a frontier model”.

26.1 The idea in one sentence

With a fixed compute budget, a model’s quality is governed by predictable power laws: the art is in splitting that budget between size and data —and, since 2023, in remembering that training is only half the bill; the other half is serving the model millions of times.

26.2 Key concepts and their role in the transformer

Before getting into the details, we define this chapter’s terms and what each one is for inside a transformer:

Scaling laws. Definition: power-law formulas that predict loss as a function of size and data. In the transformer: they let you know how good a model will be before you spend the money to train it —they turn training into engineering.
Compute (\(C\approx 6ND\)). Definition: the real budget, in total operations, about 6 per parameter per token. In the transformer: it’s what actually gets split between size (\(N\)) and data (\(D\)); once \(C\) is fixed, it’s all about deciding the split.
Chinchilla (~20 tokens/parameter). Definition: the optimal compute split, growing brain and data in step. In the transformer: it tells you how many tokens to train per parameter so as not to leave the model undertrained.
Inference-aware (overtraining). Definition: taking into account the cost of serving, not just of training. In the transformer: it justifies training small models with far more tokens (Llama-3), because they then get served trillions of times.
Data curation. Definition: deduplicating, filtering, and classifying the corpus by quality. In the transformer: the quality of the corpus shapes what the model learns as much as raw quantity.
Parallelism (data / tensor / pipeline / ZeRO-FSDP). Definition: spreading the work across many GPUs. In the transformer: a frontier model doesn’t fit in one GPU; these strategies, combined, make it trainable.
Mixed precision and gradient checkpointing. Definition: training in bf16 and recomputing activations instead of storing them. In the transformer: two almost universal techniques at scale: the first saves memory while staying numerically stable, the second trades extra compute for memory.
Stability (loss spikes, critical batch, μP). Definition: techniques to keep a months-long run from breaking. In the transformer: they let you rewind after a spike, ramp the batch size, and transfer hyperparameters from a small model to the large one.

With those terms in hand, let’s get to the details.

26.3 Scaling laws, for real

In Ch. 11 we previewed that loss drops predictably as the model grows. Here we write it down. The functional form that Chinchilla fit (Hoffmann et al. 2022) captures almost everything:

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]

Let’s go term by term, because each one tells a story:

\(L\) = the loss (cross-entropy from Ch. 11). Lower is better.
\(N\) = number of the model’s parameters (its “brain size”).
\(D\) = number of training tokens (how much it “has studied”).
\(E\) = the irreducible entropy: the floor no model can go below because language has genuine randomness. Even if \(N\) and \(D\) were infinite, you’d be stuck at \(E\).
\(A/N^{\alpha}\) = what you lose by having a small brain. Grow the model (\(N\uparrow\)) and this term drops, but slowly, because the exponent \(\alpha\approx0.34\) is small: it takes a lot more size to gain a little.
\(B/D^{\beta}\) = what you lose by having studied little. More data (\(D\uparrow\)) and it drops, also slowly (\(\beta\approx0.28\)).

The function says something liberating: you can predict how good a model will be before you spend the money to train it, by fitting \(A,B,\alpha,\beta,E\) on small models and extrapolating. That’s what turned training from an art into engineering.
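To make the formula concrete, here is a minimal Python sketch that simply evaluates it. The constants are the rounded values commonly quoted for the Chinchilla parametric fit (\(E\approx1.69\), \(A\approx406.4\), \(B\approx410.7\), with \(\alpha\) and \(\beta\) as above); treat them as illustrative assumptions rather than settled numbers. The function name `chinchilla_loss` is hypothetical.

```python
# Illustrative constants: rounded values commonly quoted for the
# Chinchilla parametric fit. They are an assumption of this sketch.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    base = chinchilla_loss(1e9, 20e9)         # 1B params, 20B tokens
    bigger = chinchilla_loss(10e9, 20e9)      # 10x the parameters
    more_data = chinchilla_loss(1e9, 200e9)   # 10x the tokens
    print(f"base={base:.3f}  10x params={bigger:.3f}  10x data={more_data:.3f}")
```

Multiplying either \(N\) or \(D\) by 10 lowers the predicted loss only modestly, which is the "small exponent" effect described above, and no input ever pushes the result below \(E\).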

26.3.1 Kaplan vs Chinchilla: how to split the budget

The real budget isn’t \(N\) or \(D\) separately, but the compute \(C\), and there’s a rule of thumb that links them: \(C \approx 6ND\) (training each token costs about 6 operations per parameter). Given \(C\), how much goes to brain and how much to study?

Kaplan (Kaplan et al. 2020) answered: almost all to brain. Their optimal split was \(N_{\rm opt}\propto C^{0.73}\), \(D_{\rm opt}\propto C^{0.27}\): when you multiply compute by 10, the model grows ~5.4× but the data only ~1.86×. Conclusion: huge models, relatively little data.
Chinchilla 2022 (Hoffmann et al. 2022) corrected it: you should grow brain and study in step, \(N_{\rm opt}\propto C^{0.5}\), \(D_{\rm opt}\propto C^{0.5}\) —hence the rule ~20 tokens per parameter. The proof was conclusive: Chinchilla (70B parameters, 1.4 trillion tokens) beat Gopher (280B) at equal compute, while being 4× smaller. The giants of the era were undertrained: too much brain, too little study.
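The two splits are easy to check numerically. The sketch below computes the growth factors for a 10× compute increase under each pair of exponents, and sizes a model from a compute budget using \(C\approx6ND\) together with the ~20 tokens/parameter rule. The function names are hypothetical, and the 20:1 ratio is hard-coded as a default, not derived.

```python
import math


def growth_factors(compute_multiplier: float, n_exp: float, d_exp: float):
    """Growth of N and D when compute is multiplied, for N ∝ C**n_exp, D ∝ C**d_exp."""
    return compute_multiplier**n_exp, compute_multiplier**d_exp


def chinchilla_sizing(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


if __name__ == "__main__":
    print("Kaplan     10x compute:", growth_factors(10, 0.73, 0.27))
    print("Chinchilla 10x compute:", growth_factors(10, 0.50, 0.50))
    budget = 6 * 70e9 * 1.4e12  # Chinchilla's own budget via C ≈ 6ND
    n, d = chinchilla_sizing(budget)
    print(f"C={budget:.2e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Feeding in Chinchilla's own budget (70B parameters × 1.4 trillion tokens × 6) returns the same 70B / 1.4T pair, since that model sits exactly on the 20:1 line.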

🧩 Analogy — the study budget. You have a fixed budget to pass an exam and you split it between brain size (parameters) and study hours (tokens). Kaplan’s advice was “build yourself an enormous brain and skim the syllabus”. Chinchilla replied: a genius who barely studied is a waste; with a fixed budget you pass better by growing brain and study in parallel —about 20 pages read per unit of brain. The middling but well-read crammer beats the genius who never opened the book.

⚠ Honest — Chinchilla itself is disputed (but the rule holds)

In 2024, a replication (Besiroglu et al. 2024) reconstructed the Chinchilla data and found that the specific published fit (the coefficients of its “Approach 3”) is inconsistent with its other two methods and implies ~70 tokens/parameter instead of 20, with suspiciously narrow confidence intervals. But their re-fit restores the ~50/50 split, and the ~20:1 rule survives and comes out reinforced. Honest moral: what’s in doubt is a few coefficients, not the underlying conclusion.
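Why can a few coefficients move the answer so much? Because the split follows directly from the exponents. As a sketch (a standard constrained minimization, not a new result), minimize the reducible part of the loss from 26.3 subject to \(C = 6ND\):

\[ \min_{N,D}\; \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \quad \text{subject to} \quad 6ND = C \quad\Longrightarrow\quad \alpha A\,N^{-\alpha} = \beta B\,D^{-\beta} \]

Substituting \(D = C/(6N)\) and solving for \(N\):

\[ N_{\rm opt} = G\left(\frac{C}{6}\right)^{a}, \qquad D_{\rm opt} = G^{-1}\left(\frac{C}{6}\right)^{b}, \qquad a = \frac{\beta}{\alpha+\beta}, \quad b = \frac{\alpha}{\alpha+\beta}, \quad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)} \]

With the rounded exponents quoted earlier (\(\alpha\approx0.34\), \(\beta\approx0.28\)) this gives \(a\approx0.45\) and \(b\approx0.55\): close to an even split, but not exactly one. Small shifts in the fitted \(\alpha\), \(\beta\), \(A\), and \(B\) move both the split and the implied tokens-per-parameter ratio, which is why the dispute is about coefficients.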

26.4 The modern twist: overtraining on purpose

Here’s what Ch. 11 didn’t tell, because it’s recent. The Chinchilla rule optimizes only training compute; it’s blind to inference cost. And a deployed model isn’t trained once and done: it answers millions of requests afterward. The lifetime bill is training + inference × usage volume.

That’s why frontier practice has shifted: people train smaller models on far more tokens than Chinchilla would say.

LLaMA (Touvron et al. 2023) already argued it explicitly: Chinchilla “ignores the inference budget… a smaller model trained longer will ultimately be cheaper to serve”.
Llama-3 8B (Grattafiori et al. 2024) was trained with ~15 trillion tokens: about 1,875 tokens per parameter, roughly 94× the 20:1 rule. Its own “Chinchilla-optimal” point would be ~200B tokens, but performance kept improving far beyond that.
“Beyond Chinchilla-Optimal” (Sardana et al. 2024) formalized it: counting inference demand, the optimal model is smaller and more trained. In their example, to match the quality of a 13B with 2 trillion tokens of expected inference, it’s better to train a 7B with more data, for ~17% less total compute.

The twist in the analogy. Chinchilla’s budget was only the exam preparation. But if the student then has to answer questions for years (serve millions of queries), a smaller brain is cheaper to operate every day. So it pays to over-study a small brain (15 trillion tokens for an 8B), even if it’s “inefficient” for the exam alone: you pay more up front to save a lifetime of work.
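The "lifetime bill" can be put into a few lines of arithmetic. This sketch keeps the \(C\approx6ND\) rule for training and assumes about 2 operations per parameter per served token for inference (forward pass only), a common approximation that is not stated in this chapter. All function names are hypothetical.

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    """Training cost with the C ≈ 6*N*D rule of thumb."""
    return 6.0 * n_params * train_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Serving cost, ASSUMING ~2 FLOPs per parameter per token (forward only)."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference over the model's deployed life."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, served_tokens)


def tokens_per_param(n_params: float, train_tokens: float) -> float:
    return train_tokens / n_params


if __name__ == "__main__":
    n, d = 8e9, 15e12  # Llama-3 8B figures quoted in the text
    ratio = tokens_per_param(n, d)
    print(f"tokens/param = {ratio:.0f}  ({ratio / 20:.1f}x the 20:1 rule)")
    print(f"training FLOPs           = {training_flops(n, d):.2e}")
    print(f"lifetime FLOPs, 45T served = {lifetime_flops(n, d, 45e12):.2e}")
```

Under these two approximations, inference cost catches up with training cost once the model has served about three times as many tokens as it was trained on. Past that point, every parameter removed from the model saves more at serving time than the extra training tokens cost.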

26.5 Data rules: curation and quality

Once the how-much is fixed, the what remains. And here the recent consensus is clear: data quality rivals —or beats— raw quantity. The standard pipeline is dedup → heuristic filtering → quality classifiers.

Deduplication matters (Lee et al. 2021): removing repetitions cuts verbatim memorization ~10× and trains as well or better in fewer steps. (In the C4 corpus there was a 61-word sentence repeated 60,000 times.)
Reference corpora. The Pile (~800 GB, 22 sources (Gao et al. 2020)); RefinedWeb (Penedo et al. 2023), which showed that well-filtered, deduplicated web alone can beat curated corpora; and FineWeb (15 trillion tokens (Penedo et al. 2024)), with its FineWeb-Edu subset filtered by “educational quality”.
The mixture matters (DoReMi (Xie et al. 2023)): reweighting domains with a small proxy model improved average few-shot accuracy by ~6.5 points and reached the baseline’s accuracy in 2.6× fewer training steps.
Quality > quantity (Textbooks Are All You Need (Gunasekar et al. 2023)): phi-1 (1.3B parameters), trained on ~7B tokens of “textbook quality” data (about 6B filtered from the web plus 1B synthetic), matched much larger models on code.
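The first two stages of the pipeline (dedup → heuristic filtering) can be sketched with the standard library alone. This is a toy illustration under stated assumptions: it does exact deduplication only (no fuzzy matching), the thresholds `min_words` and `min_alpha_ratio` are made-up values, and the third stage, a learned quality classifier, is left out. All names are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_heuristics(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Cheap filter: enough words, and mostly letters (thresholds are hypothetical)."""
    if len(text.split()) < min_words:
        return False
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) >= min_alpha_ratio


def curate(docs):
    """dedup -> heuristic filter; a quality classifier would come after this."""
    return [d for d in exact_dedup(docs) if passes_heuristics(d)]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat and looked out the window.",
        "the cat   sat on the mat and looked out the window.",
        "$$$ 1234 !!! ### 5678 ??? 000",
        "Too short.",
        "Gradient descent updates parameters in the direction that lowers the loss.",
    ]
    for doc in curate(corpus):
        print(doc)
```

Real pipelines add near-duplicate detection and work at a very different scale, but the order of operations is the same: remove repeats first, then spend filtering effort only on what remains.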

⚠ Honest — the data wall and the “quality” labels

Two nuances: (1) Projections (Villalobos et al. 2024) estimate that the stock of high-quality public human text could run out between ~2026 and 2032 (the “data wall”), which pushes toward synthetic and multimodal data. (2) “Textbook quality” (phi) or “educational quality” (FineWeb-Edu) are not objective standards: they are labels defined by a classifier. They work, but it’s worth naming them for what they are.

26.6 How the work gets divided: parallelism

A frontier model doesn’t fit in a single GPU —not the model, not its optimizer states, not the activations. It’s split in four ways that combine:

Table 26.1: Parallelism strategies for training at scale

Strategy	What it splits	Inter-GPU traffic	When it’s used
Data	the batch, model replicated	gradient all-reduce/step	whenever the model fits in 1 GPU
Tensor (Megatron)	the math inside each layer	high (every fwd/bwd)	within one node (NVLink)
Pipeline (GPipe)	the layer stack into stages	low (only boundary activations)	across nodes
ZeRO/FSDP	the states (optimizer→grad→params)	medium	to remove memory redundancy

In plain terms:

Data parallelism: each GPU holds a full copy of the model and processes a part of the batch; after the backward pass and before each optimizer step, they average their gradients (all-reduce). The simplest, but on its own it doesn’t allow models larger than one GPU.
Tensor parallelism (Megatron-LM (Shoeybi et al. 2019)): it splits the matrices inside each layer (the MLP’s, the attention heads’) across GPUs. It talks a lot among them → only worthwhile within a node with fast interconnect.
Pipeline parallelism (GPipe (Huang et al. 2018)): it cuts the layer stack into stages that live on different GPUs, like an assembly line. To keep GPUs from going idle (the “bubble”), it chops the batch into micro-batches that overlap. It talks little → good across nodes.
ZeRO / FSDP (Rajbhandari et al. 2019): instead of replicating the states on each GPU (pure waste), it shards them: stage 1 shards the optimizer states, stage 2 also the gradients, stage 3 also the parameters (which are reassembled on the fly). PyTorch’s FSDP is essentially this idea.
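The effect of the three ZeRO stages is easiest to see as per-GPU memory arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for the weights, 2 for the gradients, and 12 for the optimizer states (an fp32 copy of the weights plus Adam's two moments), which is the accounting used in the ZeRO paper. Activations are not counted, and the function name is hypothetical.

```python
def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU memory (GB = 1e9 bytes) for model states under ZeRO stage 0-3.

    ASSUMES mixed-precision Adam: 2 B/param weights, 2 B/param gradients,
    12 B/param optimizer states. Activations are not included.
    """
    params = 2.0 * n_params
    grads = 2.0 * n_params
    optim = 12.0 * n_params
    if stage >= 1:
        optim /= n_gpus   # stage 1: shard the optimizer states
    if stage >= 2:
        grads /= n_gpus   # stage 2: also shard the gradients
    if stage >= 3:
        params /= n_gpus  # stage 3: also shard the parameters
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_memory_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:6.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs, plain data parallelism needs about 120 GB of model states per GPU, while stage 3 brings that down to under 2 GB. The price is the extra communication needed to reassemble parameters on the fly.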

Frontier models combine the first three in 3D parallelism (e.g. Megatron-Turing NLG at 530B (Smith et al. 2022): tensor within the node, pipeline across nodes, data on top). And two almost universal savings: bf16 mixed precision (16 bits with fp32’s dynamic range, so it stays stable without the loss-scaling tricks that fp16 needs (Micikevicius et al. 2017)) and gradient checkpointing (Chen et al. 2016), which stores only some activations and recomputes the rest in backprop, trading ~30% more compute for a big memory saving.
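The "fp32's dynamic range" claim can be checked from the bit layouts alone: fp32 has 8 exponent bits and 23 mantissa bits, fp16 has 5 and 10, bf16 has 8 and 7. A small sketch (the helper names are hypothetical) computes the largest finite value and the relative precision step of each format:

```python
FORMATS = {
    # name: (exponent bits, mantissa bits)
    "fp32": (8, 23),
    "fp16": (5, 10),
    "bf16": (8, 7),
}


def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest
    # usable unbiased exponent equals the bias.
    return (2.0 - 2.0 ** (-mantissa_bits)) * 2.0**bias


def epsilon(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** (-mantissa_bits)


if __name__ == "__main__":
    for name, (e_bits, m_bits) in FORMATS.items():
        print(f"{name}: max={max_finite(e_bits, m_bits):.4e}  eps={epsilon(m_bits):.2e}")
```

bf16 reaches almost the same maximum as fp32 (about 3.4e38), whereas fp16 tops out at 65504, which is why fp16 training needs loss scaling to keep small gradients from underflowing. The trade is precision: bf16 keeps only 7 mantissa bits.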

26.7 Stability at scale: when training breaks

Training for months on thousands of GPUs introduces a new enemy: instability. Three key pieces Ch. 11 didn’t need:

Loss spikes. In the training of PaLM-540B (Chowdhery et al. 2022) there were ~20 loss spikes despite gradient clipping. The fix that worked: rewind to a checkpoint ~100 steps earlier and skip ~200-500 batches. Since re-feeding those batches did not reproduce the spike, they deduced it wasn’t the data’s fault, but the batch × model-state interaction. Additional mitigations (Wortsman et al. 2023): qk-layernorm and z-loss curb the logit growth that triggers the instability.
Critical batch size (McCandlish et al. 2018): there’s a batch size above which growing it no longer speeds things up. It’s predicted by the “gradient noise scale”, which grows as the loss drops; that’s why the batch is ramped during training (GPT-3 went from ~32k tokens up to the full batch over the first few billion tokens).
μP / μTransfer (Yang et al. 2022): under this parametrization, the optimal hyperparameters (above all the learning rate) stay stable as you change the width of the model. That lets you tune on a small model and transfer without re-tuning to the large one (validated on Cerebras-GPT (Dey et al. 2023)).
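The rewind-and-skip recipe for loss spikes can be sketched as two small functions: one that flags a spike, one that decides where to restart and which batches to skip. This is an illustration, not PaLM's actual tooling: the detection rule (loss above a multiple of a running mean), every threshold, and all names are hypothetical, and it assumes one batch per step.

```python
from dataclasses import dataclass


@dataclass
class RewindPlan:
    restart_step: int  # checkpoint step to reload
    skip_from: int     # first batch index to skip
    skip_until: int    # first batch index to train on again


def detect_spike(losses, window: int = 50, factor: float = 1.5):
    """Index of the first loss exceeding factor x the mean of the previous window."""
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            return i
    return None


def plan_rewind(spike_step: int, checkpoint_every: int = 100,
                rewind_steps: int = 100, skip_batches: int = 300) -> RewindPlan:
    """Restart from the last checkpoint at least rewind_steps before the spike,
    then skip skip_batches batches starting from that point."""
    target = max(0, spike_step - rewind_steps)
    restart = (target // checkpoint_every) * checkpoint_every
    return RewindPlan(restart, restart, restart + skip_batches)


if __name__ == "__main__":
    history = [2.0] * 200 + [5.0] + [2.0] * 10
    spike = detect_spike(history)
    print("spike at step", spike)
    print(plan_rewind(spike))
```

The skipped range has to cover the step where the spike happened, otherwise the run would walk straight back into the same batches from a nearly identical model state.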

🧠 Curiosity — tuning a giant without touching it

The μP result sounds like magic: with a “proxy” model of 13 million parameters they found hyperparameters that, transferred blind, beat the published BERT-large (350M) numbers, at a total tuning cost equivalent to pretraining BERT-large once. That said, honesty: the transfer is clean across width, but fragile in depth and in token horizon; there it’s still open research.

26.8 Mini-case: training from scratch

Putting it all together, a modern pre-training recipe fits in a list:

Curate the corpus: collect → deduplicate (exact + fuzzy) → filter → classify by quality → fix the domain mixture.
Size it: choose \(N\) and \(D\). If the model is going to be served a lot, overtrain (more tokens than the 20:1 rule, Llama-3 style).
Tune cheaply: find the learning rate with μP on a small proxy and transfer.
Distribute: data + tensor + pipeline + ZeRO as it fits; bf16 + checkpointing.
Train stably: AdamW (β₂≈0.95), warmup + cosine decay with the cycle matched to the run’s real length, ramped batch, clipping, frequent checkpoints to rewind after a spike.
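Two items from the last step, the warmup + cosine schedule and the ramped batch, are simple enough to write down. The sketch below is one common way to implement them, not a prescribed recipe: the linear warmup shape, the `min_lr_ratio` floor, the linear batch ramp, and all default values are illustrative assumptions, and the names are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then one cosine cycle ending at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


def batch_tokens_at(tokens_seen: float, start: int = 32_000,
                    full: int = 3_200_000, ramp_tokens: float = 4e9) -> int:
    """Linear batch-size ramp (in tokens) over the first ramp_tokens of training."""
    frac = min(1.0, tokens_seen / ramp_tokens)
    return int(start + frac * (full - start))


if __name__ == "__main__":
    total, warmup, peak = 10_000, 500, 3e-4
    for s in (0, 499, 500, 5_250, 10_000):
        print(f"step {s:>6}: lr={lr_at(s, total, peak, warmup):.2e}")
    for t in (0, 2e9, 4e9, 1e12):
        print(f"tokens {t:.0e}: batch={batch_tokens_at(t)}")
```

Passing the run's real length as `total_steps` is what "the cycle matched to the run's real length" means in practice: the cosine reaches its floor exactly when training ends, not earlier or later.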

🧪 Try it — tafagent

tafagent doesn’t train models, but the inference-aware logic of this chapter is exactly where it helps afterward: on a checkpoint that’s already trained, it computes the KV budget (the cache memory it will cost to serve that model at the target length, via γ; Ch. 20). If you’re going to overtrain a small model to serve it a lot, seeing its inference cost ahead of time is exactly the decision this chapter asks you to make.

26.9 Summary

Scaling laws: \(L(N,D)=E+A/N^{\alpha}+B/D^{\beta}\) — loss drops predictably with brain (\(N\)) and study (\(D\)), with an irreducible floor \(E\).
Kaplan vs Chinchilla: Kaplan said “huge model, little data”; Chinchilla corrected it to ~20 tokens/parameter (Chinchilla-70B beat Gopher-280B). The 2024 replication disputes the coefficients, not the rule.
Modern twist (inference-aware): Chinchilla ignores the cost of serving; that’s why small models are overtrained (Llama-3 8B with 15 trillion tokens) — cheaper over the model’s lifetime.
Data: quality ≳ quantity (dedup, filtering, classifiers, mixture); mind the data wall (~2026-2032) and the fact that “quality” is a classifier label.
Parallelism: data / tensor (Megatron) / pipeline (GPipe) / ZeRO-FSDP, combined in 3D; bf16 + gradient checkpointing.
Stability: loss spikes (rewind + skip batches), critical batch (ramp), μP (tune small → transfer to large).

Next (Chapter 27): we now have a powerful base model. How do we specialize it for a concrete task (classification, embeddings, retrieval) without retraining it from scratch? We get into fine-tuning.

26.10 Exercises

The terms. In \(L(N,D)=E+A/N^{\alpha}+B/D^{\beta}\), what does \(E\) represent and why does no model, however large, go below it?
The split. You have 10× more compute. According to Chinchilla (\(N_{\rm opt}\propto C^{0.5}\), \(D_{\rm opt}\propto C^{0.5}\)), how much do you grow the model and how much the data? How does this differ from Kaplan?
Inference-aware. A colleague cites Chinchilla to justify a large “optimal” model. What cost does that rule ignore, and why does Llama-3 train an 8B with 15 trillion tokens?
Parallelism. Why does tensor parallelism stay within a node and pipeline parallelism get used across nodes? Think about how much the GPUs “talk”.
Stability. In PaLM, re-feeding the skipped batches didn’t reproduce the loss spike. What does that say about the cause of the spikes?

References

Besiroglu, Tamay et al. 2024. Chinchilla Scaling: A Replication Attempt. https://arxiv.org/abs/2404.10102.

Chen, Tianqi et al. 2016. Training Deep Nets with Sublinear Memory Cost. https://arxiv.org/abs/1604.06174.

Chowdhery, Aakanksha et al. 2022. PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311.

Dey, Nolan et al. 2023. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. https://arxiv.org/abs/2304.03208.

Gao, Leo et al. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. https://arxiv.org/abs/2101.00027.

Grattafiori, Aaron et al. 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.

Gunasekar, Suriya et al. 2023. Textbooks Are All You Need. https://arxiv.org/abs/2306.11644.

Hoffmann, Jordan et al. 2022. Training Compute-Optimal Large Language Models (Chinchilla). https://arxiv.org/abs/2203.15556.

Huang, Yanping et al. 2018. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. https://arxiv.org/abs/1811.06965.

Kaplan, Jared et al. 2020. Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361.

Lee, Katherine et al. 2021. Deduplicating Training Data Makes Language Models Better. https://arxiv.org/abs/2107.06499.

McCandlish, Sam et al. 2018. An Empirical Model of Large-Batch Training. https://arxiv.org/abs/1812.06162.

Micikevicius, Paulius et al. 2017. Mixed Precision Training. https://arxiv.org/abs/1710.03740.

Penedo, Guilherme et al. 2023. The RefinedWeb Dataset for Falcon LLM. https://arxiv.org/abs/2306.01116.

Penedo, Guilherme et al. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. https://arxiv.org/abs/2406.17557.

Rajbhandari, Samyam et al. 2019. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. https://arxiv.org/abs/1910.02054.

Sardana, Nikhil et al. 2024. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. ICML. https://arxiv.org/abs/2401.00448.

Shoeybi, Mohammad et al. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053.

Smith, Shaden et al. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B. https://arxiv.org/abs/2201.11990.

Touvron, Hugo et al. 2023. LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971.

Villalobos, Pablo et al. 2024. Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. https://arxiv.org/abs/2211.04325.

Wortsman, Mitchell et al. 2023. Small-Scale Proxies for Large-Scale Transformer Training Instabilities. https://arxiv.org/abs/2309.14322.

Xie, Sang Michael et al. 2023. DoReMi: Optimizing Data Mixtures Speeds up Language Model Pretraining. https://arxiv.org/abs/2305.10429.

Yang, Greg et al. 2022. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. https://arxiv.org/abs/2203.03466.