37 Serving and deployment

Where we are. This closes Part VI. The model is already trained (Part IV), compressed (Ch. 35), and equipped with efficient attention (Ch. 34). What’s left is the last and most practical step: putting it into production. Serving an LLM has its own physics: two phases with opposite bottlenecks, a cache that limits how many users fit, and a constant tension between speed for one user and throughput for all. This is the systems chapter.

37.1 The idea in one sentence

Serving an LLM well comes down to keeping the GPU busy despite generation being inherently sequential: the art is in batching many requests and managing the KV cache, balancing throughput (total tokens/sec) against latency (speed for each user).

37.2 Key concepts and their role in the transformer

Before we dig in, let’s define this chapter’s terms and what each one is for inside a transformer:

Prefill. Definition: processing the whole prompt at once (in parallel). In the transformer: it’s compute-bound (it saturates the GPU) and sets the TTFT.
Decode. Definition: generating the answer token by token. In the transformer: it’s bandwidth-bound (it rereads the KV cache every step) → the serving bottleneck.
TTFT (time-to-first-token). Definition: how long until you see the first word. In the transformer: dominated by prefill; key in chat.
TPOT / ITL. Definition: time per output token. In the transformer: dominated by decode; it’s the perceived “speed”.
Throughput vs goodput. Definition: total tokens/sec vs requests/sec that meet an SLO (latency target). In the transformer: maximizing raw throughput can violate each user’s latency.
Continuous batching. Definition: grouping requests at the iteration level (not a fixed batch), adding and removing requests every step. In the transformer: it keeps the GPU full even though responses finish at different moments.
KV cache (in serving). Definition: the memory of already-computed keys/values (Ch. 20). In the transformer: it grows with the batch and the length and competes with the weights for HBM → it limits how many requests fit.
Disaggregated / chunked prefill. Definition: separating or chunking the prefill so it doesn’t stall the decodes in flight. In the transformer: it meets the TTFT and TPOT targets at the same time.

With this in hand, let’s open the server’s box.

37.3 The two phases of inference (and why they clash)

Generating with an LLM has two very distinct phases, and understanding that they have opposite bottlenecks is the key to the whole chapter:

Prefill: processing the entire prompt. Since all the input tokens are known, it’s a massive, parallel matrix-matrix multiplication that saturates the GPU → it’s compute-bound.
Decode: generating the answer one token at a time. Each new token needs to reread the states (K, V) of all the previous ones → it’s a low-intensity matrix-vector operation that underutilizes the compute and is bandwidth-bound: what matters is bringing data from HBM (weights + KV cache), not computing.
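
A back-of-envelope sketch makes the contrast concrete. For one square weight matrix in fp16, it counts the FLOPs done per byte of weights read (the arithmetic intensity) and compares that against the GPU's ridge point, peak FLOP/s divided by HBM bandwidth. The hardware numbers (300 TFLOP/s, 2 TB/s), the layer width and the function names are illustrative assumptions, not measurements, and the model deliberately ignores activation and KV-cache traffic.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one d_model x d_model linear layer."""
    flops = 2 * n_tokens * d_model * d_model             # one multiply + one add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model  # the weights are read once per pass
    return flops / weight_bytes


def bound(n_tokens: int, d_model: int, peak_flops: float, hbm_bandwidth: float) -> str:
    """Classify a pass as compute-bound or bandwidth-bound against the ridge point."""
    ridge = peak_flops / hbm_bandwidth  # FLOPs the GPU can do per byte it can fetch
    if arithmetic_intensity(n_tokens, d_model) >= ridge:
        return "compute-bound"
    return "bandwidth-bound"


# Assumed hardware figures, chosen only for illustration.
PEAK_FLOPS = 300e12     # FLOP/s
HBM_BANDWIDTH = 2e12    # bytes/s

print("prefill, 2048 prompt tokens:", bound(2048, 4096, PEAK_FLOPS, HBM_BANDWIDTH))
print("decode, 1 new token:        ", bound(1, 4096, PEAK_FLOPS, HBM_BANDWIDTH))
```

Under these assumptions the 2048-token prompt does 2048 FLOPs per weight byte, far above the ridge point of 150, while a single decode token does 1. Batching b decode requests raises the weight intensity to roughly b, which is why batching is the main lever in the rest of the chapter.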

Decode is the serving bottleneck, and it’s exactly where the KV cache (Ch. 20) gets reread on every step. Hence the central tension: putting in more requests at once raises total throughput, but slows down each user’s token-by-token generation.

🧩 Analogy — reading the question vs writing the answer. Prefill is reading the whole question at a glance (fast, in parallel). Decode is writing the answer word by word, rereading your notes at each word (slow, sequential). The server gets stuck writing, not reading.

37.4 The metrics that matter

There’s no single number. Serving is measured along several axes that pull in different directions:

Throughput: output tokens per second summed over all users.
TTFT (time-to-first-token): how long the user waits to see the first word; dominated by prefill.
TPOT / ITL (time per output token / inter-token latency): the perceived speed; dominated by decode.
Total latency ≈ TTFT + TPOT × (number of tokens).
Goodput: the requests/sec the system sustains while meeting an SLO (e.g. P90 TTFT < 200 ms and TPOT < 50 ms). It’s more honest than raw throughput: you can have lots of throughput and little goodput if users miss their latency target (Zhong et al. 2024).

That’s why you optimize differently depending on the case: a chat prioritizes low TTFT and TPOT; a batch (offline) job maximizes raw throughput.
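
To see how throughput and goodput can diverge, here is a minimal sketch that applies the latency formula and counts SLO-compliant requests over a measurement window. The `RequestTrace` type, the function names and the four traces are hypothetical. Counting individual requests that meet both targets is a simplification of the percentile-based SLO (for example P90) mentioned above.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    ttft_s: float       # time to first token, in seconds
    tpot_s: float       # mean time per output token, in seconds
    output_tokens: int  # tokens generated for this request


def total_latency(r: RequestTrace) -> float:
    # Total latency ≈ TTFT + TPOT × (number of tokens)
    return r.ttft_s + r.tpot_s * r.output_tokens


def throughput_and_goodput(traces, window_s, ttft_slo_s, tpot_slo_s):
    """Return (output tokens/sec, SLO-compliant requests/sec) over the window."""
    tokens = sum(r.output_tokens for r in traces)
    compliant = sum(
        1 for r in traces if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return tokens / window_s, compliant / window_s


# Hypothetical traces observed during a 10-second window.
traces = [
    RequestTrace(ttft_s=0.15, tpot_s=0.040, output_tokens=100),
    RequestTrace(ttft_s=0.18, tpot_s=0.045, output_tokens=200),
    RequestTrace(ttft_s=0.90, tpot_s=0.030, output_tokens=300),  # misses the TTFT target
    RequestTrace(ttft_s=0.10, tpot_s=0.080, output_tokens=400),  # misses the TPOT target
]

throughput, goodput = throughput_and_goodput(
    traces, window_s=10.0, ttft_slo_s=0.2, tpot_slo_s=0.05
)
print(f"first request latency: {total_latency(traces[0]):.2f} s")
print(f"throughput: {throughput:.0f} tokens/s, goodput: {goodput:.1f} requests/s")
```

In this made-up window the two non-compliant requests produce 700 of the 1000 tokens, so raw throughput looks healthy while only half of the requests meet their latency targets.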

🧩 Analogy — the first course and the pace. TTFT is how long the waiter takes to bring your first course; TPOT is the interval between courses. An impatient diner wants both short; a buffet (batch) just wants to serve the most plates per hour to the whole room.

37.5 Continuous batching: the big win

The problem with the static batch: if you group N requests and wait for all of them to finish, the batch isn’t released until the longest one completes. Since output lengths vary unpredictably, the slots of requests that have already finished sit idle while still occupying the GPU.

Continuous batching (also called iteration-level or in-flight batching) fixes this by scheduling at the token/iteration level (Yu et al. 2022): after each step, it removes the finished requests and admits new ones, keeping the GPU saturated. It’s the biggest throughput jump in modern serving. Orca combines it with selective batching: attention, whose inputs have a different length per sequence, is handled separately. vLLM and TensorRT-LLM both implement continuous batching.
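
A toy simulation shows the mechanism. It assumes a GPU with a fixed number of slots, that one decode step costs one time unit no matter how many slots are busy, and that all requests are already waiting; the function names and the request lengths are made up for illustration. It is a sketch of the scheduling idea, not of how Orca or vLLM are implemented.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batch_steps(lengths, slots):
    """Steps needed when freed slots are refilled before every iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slot.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; those on their last token leave.
        running = [r - 1 for r in running if r > 1]
    return steps


varied = [10, 200, 15, 20, 180, 5, 30, 25]  # output lengths with high variance
uniform = [50] * 8                          # output lengths that are all equal

for name, lengths in (("varied", varied), ("uniform", uniform)):
    s = static_batch_steps(lengths, slots=4)
    c = continuous_batch_steps(lengths, slots=4)
    print(f"{name}: static={s} steps, continuous={c} steps, ratio={s / c:.2f}x")
```

With the varied lengths the static schedule needs 380 steps against 200 for the continuous one; with equal lengths both need 100, so the gain comes entirely from the spread in output lengths.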

🧩 Analogy — the shared taxi. The static batch is a van that won’t start the next trip until it’s completely emptied. Continuous batching is a shared taxi that picks up and drops off passengers nonstop: as soon as one gets off, another gets on, and the car is never half empty.

⚠ Honest — the “23×” doesn’t port

The speedup figures for continuous batching (up to ~23× over a naive static batch have been reported) are highly workload-dependent: that maximum only shows up with high variance in response lengths; with similar-length responses, all systems converge to ~1×. Treat it as “up to X× under the reported conditions”, never as a guarantee.

37.6 Managing the KV cache when serving

The KV cache is the binding constraint of serving: it grows linearly with batch size and sequence length, and competes with the weights for HBM. How much KV fits decides how many requests you can batch, and therefore your throughput.
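
The arithmetic behind that constraint fits in a few lines. The sketch below uses the standard per-token KV size (keys plus values, for every layer and KV head) and divides the HBM left after the weights by the cache one request needs. The model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 80 GB of HBM, the 16 GB of weights and the 8 GB reserve are assumptions picked for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added by one token of one request."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(hbm_bytes, weight_bytes, reserve_bytes, seq_len, per_token_bytes):
    """How many requests of length seq_len fit in the HBM left after the weights."""
    free = hbm_bytes - weight_bytes - reserve_bytes
    if free <= 0:
        return 0
    return free // (seq_len * per_token_bytes)


# Assumed configuration, for illustration only.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
HBM_BYTES = 80 * 10**9
WEIGHT_BYTES = 16 * 10**9
RESERVE_BYTES = 8 * 10**9  # activations, workspace, allocator slack

print("KV bytes per token:", per_token)
for seq_len in (4096, 32768):
    n = max_concurrent_requests(HBM_BYTES, WEIGHT_BYTES, RESERVE_BYTES, seq_len, per_token)
    print(f"seq_len={seq_len}: up to {n} concurrent requests")
```

Under these assumptions each token costs 128 KiB of cache, so 104 requests fit at 4096 tokens but only 13 at 32768: capacity falls linearly as sequences grow.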

This is where PagedAttention / vLLM (Kwon et al. 2023) comes in (its mechanics are covered in Ch. 34). Its impact in serving: by storing the cache in fixed-size blocks, the way an operating system pages virtual memory, it nearly eliminates fragmentation (almost zero waste) → much larger batches fit → more throughput (up to 2-4× under the reported conditions, at equal latency). It also allows prefix sharing: if many requests share the same system prompt or few-shot examples, their KV blocks are physically shared (automatic prefix caching) instead of recomputed.
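
Both effects can be counted with simple block arithmetic. The sketch below compares the unused token slots of a contiguous reservation (every request reserves the maximum length up front, a simplified baseline) with block-granular allocation, then counts physical blocks with and without a shared prefix. The block size of 16 tokens, the request lengths and the function names are assumptions; this is bookkeeping only, not vLLM's allocator.

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def contiguous_waste(actual_lens, max_len):
    """Unused token slots when every request reserves max_len slots up front."""
    return sum(max_len - n for n in actual_lens)


def paged_waste(actual_lens, block_tokens):
    """Unused token slots when each request holds only the blocks it has touched."""
    return sum(ceil_div(n, block_tokens) * block_tokens - n for n in actual_lens)


def physical_blocks(prefix_len, suffix_lens, block_tokens, share_prefix):
    """Blocks needed for requests that all start with the same prefix."""
    if not share_prefix:
        return sum(ceil_div(prefix_len + s, block_tokens) for s in suffix_lens)
    shared = prefix_len // block_tokens   # full prefix blocks, stored once
    leftover = prefix_len % block_tokens  # tokens of a partial prefix block, counted per request
    return shared + sum(ceil_div(leftover + s, block_tokens) for s in suffix_lens)


BLOCK_TOKENS = 16                 # assumed block size
lens = [100, 350, 37, 800]        # hypothetical current sequence lengths
print("contiguous waste:", contiguous_waste(lens, max_len=2048), "token slots")
print("paged waste:     ", paged_waste(lens, BLOCK_TOKENS), "token slots")

suffixes = [30, 45, 10, 60]       # per-request tokens after a 512-token system prompt
print("blocks, no sharing:  ", physical_blocks(512, suffixes, BLOCK_TOKENS, share_prefix=False))
print("blocks, with sharing:", physical_blocks(512, suffixes, BLOCK_TOKENS, share_prefix=True))
```

In this example paging cuts the waste from 6905 token slots to 25 (always less than one block per request), and sharing the 512-token prefix cuts the block count from 138 to 42.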

37.7 Disaggregated and chunked prefill (2024)

A subtle problem: prefill (compute-bound) and decode (bandwidth-bound) interfere when they share a GPU. A long prefill monopolizes the GPU and stalls other users’ in-flight decodes, which spikes their TPOT. Two modern solutions:

Disaggregation — DistServe (Zhong et al. 2024): put prefill and decode on separate GPUs/instances, each phase optimized on its own side, eliminating the interference (at the cost of transferring the KV cache between them). It optimizes goodput under TTFT and TPOT SLOs (up to 7.4× more requests under the reported conditions).
Chunked prefill — Sarathi-Serve (Agrawal et al. 2024): chunk the long prefill into fragments and interleave them with the decodes in flight in the same batch, without pausing them (stall-free batching) → it balances TTFT and TPOT on the same GPUs.

These are two strategies for the same interference problem: spatial separation versus temporal interleaving.
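
The temporal-interleaving idea can be sketched as a per-iteration token budget: every in-flight decode gets its one token first, and whatever budget remains is filled with the next chunk of the pending prefill. The function name, the budget of 512 tokens and the workload below are hypothetical, and the sketch assumes the set of decodes stays fixed while the prefill is processed. It illustrates the principle only; the Sarathi-Serve scheduler handles far more cases.

```python
def schedule_chunked_prefill(prefill_tokens, n_decodes, token_budget):
    """Return one (decode_tokens, prefill_chunk) pair per iteration."""
    plan = []
    remaining = prefill_tokens
    while remaining > 0:
        decode_tokens = min(n_decodes, token_budget)         # decodes are served first
        chunk = min(remaining, token_budget - decode_tokens)  # prefill fills the rest
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill chunks")
        plan.append((decode_tokens, chunk))
        remaining -= chunk
    return plan


# A 2000-token prompt arrives while 24 requests are decoding.
plan = schedule_chunked_prefill(prefill_tokens=2000, n_decodes=24, token_budget=512)
for i, (decodes, chunk) in enumerate(plan, start=1):
    print(f"iteration {i}: {decodes} decode tokens + {chunk} prefill tokens")
```

Here the prompt is absorbed over five iterations (four chunks of 488 tokens and one of 48), and all 24 decodes advance by one token in each of them instead of waiting behind a single 2000-token prefill. A smaller budget protects TPOT at the cost of a longer TTFT for the new request.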

37.8 Other levers (that you already know)

Serving composes almost everything above:

Speculative decoding (Ch. 29) → lowers latency (same result, faster).
Quantization (Ch. 35) → bigger models fit, or more batch in the same HBM.
Tensor/pipeline parallelism (Ch. 25) → models that don’t fit on one GPU.
Multi-LoRA — S-LoRA (Sheng et al. 2023): one base model + thousands of adapters in host memory, swapped on the fly, batching requests for different adapters together (Ch. 28).

37.9 Compilation and kernels (to situate)

A cross-cutting layer of efficiency: graph compilation and kernel fusion combine several operations into a single optimized kernel → less launch overhead and fewer trips to memory. Tools: torch.compile (PyTorch), TensorRT-LLM (NVIDIA: fused kernels + in-flight batching + paged KV + quantization) and ONNX Runtime (cross-platform operator fusion). No need to go into detail: the idea is to generate kernels that move less data, consistent with decode being bandwidth-bound.

37.10 The decision framework (honest)

Deploying is navigating a latency–cost–throughput triangle (with quality in the background). In practice:

You have a latency budget and you maximize throughput within it.
The batch size is the main lever: more batch → more throughput, more latency per user.
Cost is dominated by decode + KV-cache memory, so almost all optimizations attack exactly that.
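
A crude model of that trade-off: if decode is bandwidth-bound, one step takes roughly the time to read the weights once plus every request's KV cache, so TPOT grows with the batch while tokens/sec grows too. The sketch finds the largest batch that stays inside a TPOT budget. All figures (16 GB of weights, 0.3 GB of KV per request, 2 TB/s of HBM bandwidth, a 40 ms budget) and the function names are assumptions; the model ignores compute limits, prefill interference and HBM capacity, which must be checked separately.

```python
def tpot_seconds(batch, weight_bytes, kv_bytes_per_request, hbm_bandwidth):
    """Bandwidth-bound estimate: each step reads the weights once plus every request's KV."""
    return (weight_bytes + batch * kv_bytes_per_request) / hbm_bandwidth


def tokens_per_second(batch, weight_bytes, kv_bytes_per_request, hbm_bandwidth):
    """Each step emits one token per request in the batch."""
    return batch / tpot_seconds(batch, weight_bytes, kv_bytes_per_request, hbm_bandwidth)


def largest_batch_within_budget(tpot_budget_s, weight_bytes, kv_bytes_per_request,
                                hbm_bandwidth, max_batch=4096):
    """Largest batch whose estimated TPOT still meets the budget (0 if none does)."""
    best = 0
    for batch in range(1, max_batch + 1):
        if tpot_seconds(batch, weight_bytes, kv_bytes_per_request, hbm_bandwidth) > tpot_budget_s:
            break  # TPOT only grows with the batch, so we can stop here
        best = batch
    return best


# Assumed figures, for illustration only.
WEIGHT_BYTES = 16 * 10**9
KV_BYTES_PER_REQUEST = 3 * 10**8
HBM_BANDWIDTH = 2 * 10**12

best = largest_batch_within_budget(0.040, WEIGHT_BYTES, KV_BYTES_PER_REQUEST, HBM_BANDWIDTH)
print("largest batch within a 40 ms TPOT budget:", best)
for b in (1, best):
    tps = tokens_per_second(b, WEIGHT_BYTES, KV_BYTES_PER_REQUEST, HBM_BANDWIDTH)
    print(f"batch={b}: about {tps:.0f} tokens/s")
```

The shape matters more than the numbers: throughput rises steeply with the batch because the weight read is amortized, while each user's TPOT climbs toward the budget. Any real sizing should replace this estimate with measurements on the target workload.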

⚠ Honest — benchmarks don’t transfer

Almost all the serving “X× faster/cheaper” claims (continuous batching, PagedAttention, DistServe, S-LoRA…) are highly workload-specific: they depend on the baseline, the GPU and its interconnect, the model size and the input/output lengths. Read them as “up to X× under the paper’s conditions”, and measure on your own workload before believing any curve.

37.11 Bridge to our theme (brief and honest)

The decode bottleneck is the KV cache (it’s reread from HBM every step), and that cache is the binding constraint on how many requests fit in the batch. Our D_f window derived from γ (Ch. 20) bounds how much KV to retain per request → reducing the effective cache length means more requests in the same HBM, that is, it touches the serving constraint directly. (Honest: this is an observation of the book’s own; the serving sources don’t talk about γ or D_f.)

🧪 Try it — tafagent

tafagent computes the KV budget from γ (Ch. 20): how much cache your model really needs at the target length. Since that cache is what limits the batch (and therefore throughput), knowing it in advance helps you size the service, that is, how many concurrent users fit on your GPU.

37.12 Summary

Two phases: prefill (compute-bound, sets TTFT) and decode (bandwidth-bound, sets TPOT) → decode is the bottleneck, and it’s where the KV cache gets reread.
Metrics: throughput vs goodput (meeting an SLO); latency ≈ TTFT + TPOT × tokens. Chat optimizes latency; batch, throughput.
Continuous batching (Yu et al. 2022): scheduling per iteration (adding/removing requests each step) keeps the GPU full → a big throughput jump.
KV cache: the binding constraint; PagedAttention (Kwon et al. 2023) brings almost zero waste + prefix sharing → more batch, more throughput.
Disaggregated (DistServe) or chunked (Sarathi-Serve) prefill: keep a long prefill from stalling the decodes → they meet TTFT and TPOT at once.
Composes: speculative (latency), quantization (capacity), parallelism (size), multi-LoRA (S-LoRA); compilation/kernel fusion.
Honest: the “X×” are workload-specific —measure on yours.

Next (Part VII): we leave the engineering behind and return to the book’s underlying question (what is really happening inside?) with mechanistic interpretability, our formula audit (verified/folklore/numerology), the map of the 2026 landscape, and ethics.

37.13 Exercises

Two phases. Why is prefill compute-bound and decode bandwidth-bound? Which is the serving bottleneck and why?
Metrics. Define TTFT and TPOT and say which phase dominates each. Why is goodput more honest than throughput?
Continuous batching. What does the static batch waste and how does the continuous one avoid it? Why isn’t the “23×” a guarantee?
KV cache. Why does the KV cache limit how many requests fit in the batch? What two things does PagedAttention bring to serving?
Interference. How does a long prefill stall the decodes in flight, and how do DistServe and Sarathi-Serve solve it?
Honesty. Why might a serving paper’s “5× faster” not hold up in your deployment?

References

Agrawal, Amey, Nitin Kedia, Ashish Panwar, et al. 2024. “Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve.” OSDI. https://arxiv.org/abs/2403.02310.

Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, et al. 2023. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP. https://arxiv.org/abs/2309.06180.

Sheng, Ying, Shiyi Cao, Dacheng Li, et al. 2023. “S-LoRA: Serving Thousands of Concurrent LoRA Adapters.” arXiv. https://arxiv.org/abs/2311.03285.

Yu, Gyeong-In, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI. https://www.usenix.org/conference/osdi22/presentation/yu.

Zhong, Yinmin, Shengyu Liu, Junda Chen, et al. 2024. “DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving.” OSDI. https://arxiv.org/abs/2401.09670.