Part IV — Training and Deployment

Inference

How generation actually works. Autoregressive decoding, temperature, sampling strategies, context windows, and the KV cache.

One Token at a Time

When a language model generates text, it doesn't produce the whole response at once. It generates one token at a time, and each new token depends on every token that came before it. This is called autoregressive generation: the model's own output becomes part of its input for the next step.

Here's what happens concretely when you send a prompt to a model:

  1. The prompt is tokenized into a sequence of tokens.
  2. The entire sequence passes through the transformer (all layers, all attention heads).
  3. The final layer outputs a vector of logits -- one number per token in the vocabulary. A typical vocabulary has 32,000 to 128,000 tokens.
  4. The logits are converted to a probability distribution via softmax.
  5. A token is selected from this distribution (more on how below).
  6. That token is appended to the sequence, and steps 2-5 repeat.
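The loop above can be sketched in a few lines of Python. Here `model` is a stand-in for the transformer forward pass -- a hypothetical function mapping a token sequence to one logit per vocabulary entry; everything else follows the numbered steps directly.

```python
import math
import random

def softmax(logits):
    # Steps 3-4: convert raw logits to a probability distribution.
    # Subtracting the max before exponentiating avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, prompt_tokens, eos_token, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                  # step 2: one forward pass
        probs = softmax(logits)                 # step 4
        next_token = random.choices(range(len(probs)), weights=probs)[0]  # step 5
        tokens.append(next_token)               # step 6: output becomes input
        if next_token == eos_token:             # stop on end-of-sequence
            break
    return tokens
```

Each iteration runs the full model once, which is exactly why a 1,000-token response costs 1,000 sequential forward passes.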

This loop continues until the model produces a special end-of-sequence token or hits a maximum length. Every response you've ever gotten from a language model was generated this way -- one token after another, each conditioned on all previous tokens.

[Figure: Autoregressive generation. Step 1: the prompt "The capital of France is" goes through a transformer forward pass and a softmax over the vocabulary; "Paris" is selected. Step 2: the input is now "The capital of France is Paris," and the loop repeats for the next token.]

This architecture means that generating a sequence of length N requires N forward passes through the transformer. Each forward pass is relatively cheap compared to a training step (no backpropagation, no gradient computation), but they add up -- a 1,000-token response requires 1,000 sequential forward passes. This sequential nature is the fundamental bottleneck of autoregressive generation, and most of the engineering tricks in this chapter exist to make it faster.

From Logits to Tokens: The Sampling Decision

After each forward pass, the model outputs a vector of logits -- raw, unnormalized scores. A positive logit means the model thinks the token is likely; a negative logit means unlikely. These are converted to probabilities by the softmax function:

P(token_i) = exp(logit_i) / sum(exp(logit_j) for all j in vocabulary)

This gives a proper probability distribution: all values between 0 and 1, summing to 1. For the prompt "The capital of France is," the distribution might put 0.85 on "Paris," 0.03 on "the," 0.02 on "a," and spread the remaining 0.10 across thousands of other tokens.
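As a quick check of the formula, here is the softmax applied to a few hypothetical logits (the numbers are made up for illustration, echoing the "Paris" example):

```python
import math

def softmax(logits):
    # P(token_i) = exp(logit_i) / sum_j exp(logit_j),
    # with a max-shift for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens after "The capital of France is"
logits = [5.0, 1.6, 1.2, 0.5]   # "Paris", "the", "a", ...
probs = softmax(logits)
```

Whatever the raw scores, the output is always a proper distribution: every value lands in (0, 1) and the values sum to 1.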

The question is: how do you pick the next token from this distribution? This is where decoding strategies come in, and they have a major impact on output quality.

Greedy decoding

The simplest strategy: always pick the token with the highest probability. This is greedy decoding. It's deterministic -- the same prompt always produces the same output. It's fast. And for factual questions, it often gives the best answer.

The problem with greedy decoding is that it produces dull, repetitive text for open-ended generation. It gravitates toward the most common continuations, avoids creative language, and frequently gets stuck in loops ("The the the the..."). This happens because the locally optimal choice at each step doesn't always lead to the globally best sequence.

Sampling

The alternative: sample randomly from the probability distribution. This introduces randomness -- different runs produce different outputs. A token with probability 0.3 will be selected 30% of the time; a token with probability 0.05, 5% of the time. This produces more diverse, creative text, but it also introduces the risk of sampling low-probability tokens that derail the generation ("The capital of France is Tuesday").
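The difference between the two strategies is easy to see with a toy distribution (the probabilities here are illustrative):

```python
import random

probs = {"Paris": 0.85, "the": 0.03, "a": 0.02, "Tuesday": 0.0001}

# Greedy: always pick the argmax -- deterministic, same output every run.
greedy = max(probs, key=probs.get)

# Sampling: draw in proportion to probability -- even "Tuesday" has
# a tiny (here 0.01%) chance of derailing the sentence.
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights)[0]
```

Greedy always returns "Paris" here; sampling usually does, but not always -- which is the whole trade-off.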

The practical strategies are all about controlling this trade-off between diversity and coherence.

Temperature

Temperature is a scalar that divides the logits before softmax:

P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T) for all j)

The effect is intuitive once you see it: at T = 1, the distribution is unchanged. As T drops below 1, the distribution sharpens -- high-probability tokens gain even more mass, and as T approaches 0 the behavior converges to greedy decoding. As T rises above 1, the distribution flattens, giving low-probability tokens a better chance of being sampled.

Temperature doesn't change which token is most likely -- it changes how much more likely it is than the alternatives. This makes it a useful control knob: lower temperature for factual tasks where you want the model's best guess, higher temperature for creative tasks where you want variety.

Key idea: Temperature controls the entropy of the output distribution. If you've worked with Thompson Sampling, the analogy is direct: your bandit's Beta distributions become sharper (more confident) as you accumulate evidence. Temperature does the same thing artificially -- T < 1 acts as if the model is more confident than it actually is, T > 1 acts as if it's less confident.
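A small sketch makes the effect concrete: the same logits, pushed through the temperature-scaled softmax from the formula above, sharpen at T < 1 and flatten at T > 1 (the logits are made up for illustration).

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T before the softmax, per the formula above.
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.5)  # sharper: top token gains mass
base = softmax_with_temperature(logits, 1.0)  # unchanged distribution
hot  = softmax_with_temperature(logits, 2.0)  # flatter: mass spreads out
```

Note that the argmax is the same in all three cases -- temperature never changes which token is most likely, only by how much.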

Top-k Sampling

Top-k sampling takes a different approach to controlling randomness: instead of adjusting probabilities, it restricts the candidate set. After computing the full probability distribution, only the k most likely tokens are kept. All other tokens are set to probability zero. The remaining probabilities are renormalized (scaled up so they sum to 1), and sampling proceeds from this reduced set.

With k = 50, the model can only choose among its top 50 candidates. This prevents the catastrophic low-probability samples that can derail generation ("Tuesday" as the capital of France) while preserving some diversity within the high-probability candidates.

The limitation of top-k is that k is fixed regardless of context. Sometimes the model is very confident and only 5 tokens are reasonable. Sometimes it's uncertain and 500 tokens are plausible. A fixed k is too restrictive in the second case and too permissive in the first.
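A minimal sketch of the filtering step (the distribution is made up, and a real implementation would operate on the full vocabulary):

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, zero out the rest,
    # and renormalize so the survivors sum to 1.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
reduced = top_k_filter(probs, 2)  # only the top two candidates survive
```

Sampling then proceeds from `reduced`, so the low-probability tail can never be selected.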

Top-p (Nucleus) Sampling

Top-p sampling, introduced by Holtzman et al. (2019), solves this problem elegantly.1 Instead of fixing the number of candidates, it fixes the cumulative probability mass. You sort tokens by probability from highest to lowest, then include tokens until their cumulative probability reaches p.

With p = 0.9, you keep the smallest set of top tokens whose probabilities sum to at least 0.9. If the model puts 0.95 on a single token, that token alone is the candidate set; if probability is spread thinly, dozens or hundreds of tokens may be included.

This adapts automatically to the model's confidence. When the model is sure, the candidate set is small. When it's uncertain, the candidate set expands. This produces noticeably better text than top-k for the same level of diversity.

In practice, most API providers use top-p sampling as the default (OpenAI, Anthropic, and others all expose a top_p parameter). Temperature and top-p are often combined: temperature adjusts the shape of the distribution, then top-p trims the tails.

The KV Cache

We said that generating N tokens requires N forward passes. But there's a crucial optimization that avoids redundant computation: the KV cache (key-value cache).

Recall from the attention mechanism (covered in earlier chapters): for each token, the model computes three vectors -- a query (Q), a key (K), and a value (V). Attention works by comparing the current token's query against the keys of all previous tokens, then taking a weighted sum of the corresponding values.

The critical observation: when generating token N+1, the keys and values for tokens 1 through N are exactly the same as they were when generating token N. The model is only adding one new token -- the previous tokens haven't changed. So there's no need to recompute their keys and values.

[Figure: KV cache, avoiding redundant computation. Without the cache, step 5 recomputes K and V for all five tokens (O(n) per step, O(n^2) total). With the cache, K and V for tokens 1-4 are retrieved without recomputation, and only the new token's K and V are computed (O(1) per step, O(n) total). For a 1,000-token generation, roughly 500x less K/V computation.]

The KV cache stores the key and value vectors for all previously processed tokens. At each generation step, only the new token's keys and values are computed and appended to the cache. The attention computation uses the cached keys and values for previous tokens and the freshly computed ones for the new token.
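A toy version of this, for a single layer and a single attention head (vectors are plain Python lists; a real implementation uses tensors across all layers and heads):

```python
import math

class KVCache:
    # Toy single-layer, single-head cache: lists of key and value vectors.
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        # Only the NEW token's key and value are computed each step;
        # everything already stored is reused untouched.
        self.keys.append(key)
        self.values.append(value)

def attend(query, cache):
    # Dot-product attention: the new token's query against every cached key,
    # then a softmax-weighted sum of the cached values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(cache.values[0])
    return [sum(w * v[i] for w, v in zip(weights, cache.values))
            for i in range(dim)]
```

Each generation step does one `append` and one `attend`; the cached keys and values are never recomputed.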

Without the KV cache, generating a sequence of length N would require recomputing keys and values for the full sequence at every step -- O(N) work per step, O(N^2) total. With the KV cache, each step computes keys and values only for the new token -- O(1) new computation per step (the attention lookup against the cached entries is still O(N)), giving O(N) total new computation. This is the difference between generation being painfully slow and being merely slow.

The trade-off: the KV cache consumes memory. For each layer and each attention head, you store a key vector and a value vector for every token in the context. For a model with 32 layers, 32 heads, and 128-dimensional head vectors, the KV cache for a 4,096-token context is about 2 GB in fp16. For a 128K-token context, that's about 64 GB -- sometimes more than the model weights themselves. This is why long-context models need so much memory at inference time, and it's the main practical constraint on context window length.
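The arithmetic is just multiplying out the dimensions. A sketch, assuming 2 bytes per value (fp16) and both K and V stored per head:

```python
def kv_cache_bytes(layers, heads, head_dim, context_len, bytes_per_value=2):
    # Per token, per layer: one key vector and one value vector per head.
    values_per_token = layers * heads * head_dim * 2  # the 2 covers K and V
    return values_per_token * context_len * bytes_per_value

size_4k = kv_cache_bytes(32, 32, 128, 4096)       # 2 GiB
size_128k = kv_cache_bytes(32, 32, 128, 131072)   # 64 GiB
```

The size grows linearly with context length, which is why the 128K figure is exactly 32x the 4K figure.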

Context Windows

The context window is the maximum number of tokens the model can process in a single forward pass. It's determined by two factors: the positional encoding scheme, which is trained on sequences up to a certain length (most modern open-source models use rotary position embeddings, or RoPE2), and the memory and compute cost of attention, dominated at long contexts by the KV cache.

Context windows have grown rapidly:

Model              Year   Context window
GPT-2              2019   1,024 tokens
GPT-3              2020   2,048 tokens
GPT-4 (initial)    2023   8,192 tokens
Claude 3           2024   200,000 tokens
Gemini 1.5 Pro     2024   1,000,000 tokens
Claude Opus 4      2025   1,000,000 tokens

The jump from 2K to 1M tokens in five years is remarkable, but it came with significant engineering. Techniques like FlashAttention (Dao et al., 2022) make attention computation more memory-efficient by restructuring it to exploit GPU memory hierarchy.3 Multi-query attention and grouped-query attention reduce the KV cache size by sharing key and value heads across multiple query heads.4 Without these optimizations, million-token context windows would be impractical.

The Economics: Inference vs. Training

Training a frontier model costs tens to hundreds of millions of dollars. A single inference call costs a fraction of a cent. But consider the scale: a popular model might serve billions of inference requests per month. Over the model's lifetime, total inference cost often exceeds training cost by a large margin.

This asymmetry drives a massive engineering effort around inference efficiency: quantization to shrink the weights, batching to amortize each forward pass across many requests, optimized attention kernels, careful KV cache management, and decoding tricks that squeeze more tokens out of each forward pass.

Speculative Decoding

A particularly clever trick for reducing latency is speculative decoding (Leviathan et al., 2022; Chen et al., 2023).5 The idea exploits the fact that autoregressive generation is bottlenecked by sequential steps, not by the computation within each step. A smaller, faster model can often predict what the larger model will say.

The process:

  1. A small draft model generates k tokens quickly (e.g., k = 5).
  2. The large target model processes all k tokens in a single forward pass (this is parallel, not sequential, because the tokens are already determined).
  3. The target model's predictions are compared against the draft tokens. Any token where the target agrees is accepted. At the first disagreement, the target model's prediction is used and the remaining draft tokens are discarded.

If the draft model's predictions are frequently correct (which they are for straightforward text), you effectively generate multiple tokens per forward pass of the large model. In practice, speculative decoding achieves 2-3x speedup without any change in output quality -- the final output is statistically identical to what the target model would have produced on its own.
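The verification step can be sketched as follows. This is a simplified greedy-agreement version -- the papers use a rejection-sampling rule, which is what guarantees the exact output distribution -- and the token IDs are made up:

```python
def speculative_step(draft_tokens, target_predictions):
    # draft_tokens: k tokens proposed by the small draft model.
    # target_predictions: the large model's own prediction at each of the
    # k positions, obtained from ONE parallel forward pass.
    accepted = []
    for drafted, predicted in zip(draft_tokens, target_predictions):
        if drafted == predicted:
            accepted.append(drafted)    # agreement: keep the draft token
        else:
            accepted.append(predicted)  # first disagreement: substitute the
            break                       # target's token, discard the rest
    return accepted
```

If the draft gets the first three of five tokens right, one large-model forward pass yields three accepted tokens plus the corrected fourth -- instead of the single token a plain autoregressive step would produce.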

Putting It Together

Inference is where the rubber meets the road. The architecture, training, and fine-tuning from previous chapters all converge into this single operation: a forward pass through the transformer, repeated token by token, producing the text you read.

The key takeaways:

  - Generation is autoregressive: one forward pass per token, each token conditioned on everything before it.
  - Decoding strategies -- temperature, top-k, top-p -- control the trade-off between diversity and coherence.
  - The KV cache trades memory for speed; its size is the main practical constraint on context length.
  - Speculative decoding attacks the sequential bottleneck, yielding several tokens per large-model forward pass without changing output quality.

Key idea: When you interact with a language model, you're watching autoregressive generation in real time. The model isn't "thinking" and then "responding" -- it's generating each token based on probability distributions shaped by the entire preceding context. The streaming effect you see in chat interfaces (text appearing word by word) isn't a UI trick. That's literally how the model produces its output.
Next: Chapter 16 — The Communities. Open source vs. closed source, the research ecosystem, deployment platforms, and the people building all of this.

1 Holtzman et al. (2019), "The Curious Case of Neural Text Degeneration." ICLR 2020. Introduced nucleus (top-p) sampling and demonstrated that pure sampling and top-k both produce degenerate text for different reasons. Top-p adapts the candidate set size to the model's confidence at each step.

2 Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing 2024. RoPE encodes position through rotation in the complex plane, naturally extending to longer sequences. Most modern open-source models (LLaMA, Mistral, Qwen) use RoPE.

3 Dao et al. (2022), "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Restructures the attention computation to minimize reads and writes to GPU high-bandwidth memory, achieving 2-4x speedup with no approximation.

4 Ainslie et al. (2023), "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. Grouped-query attention shares key-value heads across groups of query heads, reducing the KV cache size by 4-8x with minimal quality loss. Used in LLaMA 2 70B, Mistral, and most subsequent models.

5 Leviathan et al. (2022), "Fast Inference from Transformers via Speculative Decoding." ICML 2023. Chen et al. (2023), "Accelerating Large Language Model Decoding with Speculative Sampling." Independently proposed using a smaller draft model to speculatively generate multiple tokens, verified in parallel by the target model. Achieves 2-3x latency reduction with identical output distribution.