Part IV — Training and Deployment

Fine-Tuning

What changes in the weights. Supervised fine-tuning, reinforcement learning from human feedback, instruction tuning, and alignment.

The Two-Phase Training Paradigm

Modern language models are built in two distinct phases, and confusing them is one of the most common mistakes people make when talking about AI.

Pre-training is the first phase. A model reads an enormous corpus of text -- hundreds of billions to trillions of tokens -- and learns to predict the next token. This is where the model acquires its knowledge of language, facts, reasoning patterns, and code. Pre-training is extraordinarily expensive: GPT-4 reportedly cost over $100 million in compute. It runs for weeks or months on thousands of GPUs. What comes out is a base model (sometimes called a foundation model) that is impressively capable but also deeply weird -- it will happily continue any text you give it, including completing hate speech, generating fabricated citations, or roleplaying as a malicious actor. The base model has no concept of being helpful. It's a text completion engine.

Fine-tuning is the second phase. Starting from the pre-trained weights, you train the model further on a much smaller, carefully curated dataset designed to shape its behavior. This is where the model learns to answer questions instead of just continuing text, to refuse harmful requests, and to follow instructions. Fine-tuning is comparatively cheap -- typically a few hundred to a few thousand GPU-hours rather than millions.

Key idea: Pre-training teaches the model what language is. Fine-tuning teaches it how to behave. The base model is the raw capability; fine-tuning is the steering.

This two-phase structure is why fine-tuning matters so much. You don't need to train a model from scratch to get it to do what you want. You take something that already understands language at a deep level and adjust it for your specific purpose. The economics of this are transformative: a capability that would cost $100 million to build from scratch might cost on the order of $10,000 as a fine-tuning run.

Supervised Fine-Tuning (SFT)

The simplest form of fine-tuning is supervised fine-tuning: you create a dataset of (input, desired output) pairs and train the model on them using the same cross-entropy loss used in pre-training. The model sees a prompt, and its weights are adjusted, token by token, to make the desired response more likely given that prompt.

For a chat model, the training examples look like conversations:
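
A minimal, hypothetical example in a common "messages" style (the field names and the `<|role|>` chat template here are illustrative, not any specific vendor's format):

```python
# Hypothetical SFT training example in a "messages" format. The field
# names and the <|role|> template are illustrative, not a real vendor's.
example = {
    "messages": [
        {"role": "user", "content": "What causes tides?"},
        {"role": "assistant",
         "content": "Tides are caused mainly by the gravitational pull "
                    "of the Moon, and to a lesser degree the Sun."},
    ]
}

def render(example):
    """Flatten the conversation into a single training string.

    In practice the loss is usually masked so that only the assistant's
    tokens are predicted; the prompt tokens contribute no loss.
    """
    return "".join(f"<|{m['role']}|>{m['content']}"
                   for m in example["messages"])

text = render(example)
```

The rendered string is what the model actually trains on; the chat structure exists only in how the text is templated and how the loss is masked.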

The magic of SFT is how little data it takes. While pre-training uses trillions of tokens, SFT datasets are typically in the tens of thousands of examples. The original InstructGPT paper used about 13,000 demonstration examples for the SFT phase.1 This is sufficient because the model already knows the facts and language patterns -- you're just teaching it the format of a helpful response.

What physically changes

A common misconception is that fine-tuning adds new layers or modules to the model. In standard full fine-tuning, every weight in the model shifts slightly. The same weight matrices that were trained during pre-training receive gradient updates during fine-tuning. No new parameters are created. The model architecture is identical before and after -- what changes is the values stored in those matrices.

The gradients during fine-tuning are typically much smaller than during pre-training, and the learning rate is set much lower (often 10x to 100x smaller). This means the weights shift gently from their pre-trained values rather than being overwritten. Think of it as nudging rather than rebuilding.

Reinforcement Learning from Human Feedback (RLHF)

SFT gets you a model that follows the format of helpful responses, but it has a fundamental limitation: it only teaches the model to imitate the demonstrations. If the demonstrations don't cover a situation, the model has no signal about what to do. More importantly, there are many situations where it's much easier for a human to judge which response is better than to write the ideal response from scratch.

This insight led to RLHF -- reinforcement learning from human feedback -- which became the dominant alignment technique starting with the InstructGPT paper from Ouyang et al. in 2022.1 RLHF has three stages:

[Figure: The RLHF pipeline]

  1. SFT: train on human-written demonstrations (~13K examples).
  2. Reward model: train on human preferences (~33K comparisons). A human sees two model outputs for the same prompt and picks the better one; the reward model learns to predict which response humans would prefer, outputting a scalar score.
  3. PPO: optimize the policy against the reward model. The model generates a response, the reward model scores it, and PPO updates the model to increase reward -- with a penalty that prevents it from drifting too far from the SFT model (the KL divergence constraint).

Stage 1 is the SFT step described above. Stage 2 trains a separate model -- the reward model -- on human preference data. Labelers see two model outputs for the same prompt and indicate which one is better. The reward model learns to predict which response a human would prefer, outputting a single scalar score. Stage 3 uses Proximal Policy Optimization (PPO), a reinforcement learning algorithm, to optimize the language model (the "policy") to generate responses that score highly according to the reward model.
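
The reward model is trained to score the preferred response above the rejected one. A minimal sketch of that comparison loss, assuming scalar scores (this Bradley-Terry-style form matches the InstructGPT setup, but the code itself is illustrative):

```python
import math

def reward_model_loss(score_preferred, score_rejected):
    """Comparison loss for one labeled pair (sketch).

    The reward model assigns a scalar score to each response; this loss
    pushes the preferred response's score above the rejected one's.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the two scores are equal the loss is log 2; it shrinks toward zero as the preferred response's score pulls ahead.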

The crucial detail in Stage 3 is the KL divergence penalty. Without it, the model would quickly learn to exploit the reward model -- generating weird, high-scoring outputs that don't actually correspond to good responses. The KL penalty forces the model to stay close to its SFT starting point, only making adjustments that improve reward without radically changing its behavior. This is directly analogous to regularization in supervised learning: constrain the optimization so it generalizes rather than overfits.
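
One common way to write the Stage 3 reward is the reward-model score minus a scaled KL term. The single-sample KL estimate and the value of beta below are simplifying assumptions for illustration:

```python
def penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """KL-penalized reward for one sampled response (sketch).

    rm_score: scalar score from the reward model.
    logprob_policy / logprob_ref: summed log-probabilities of the response
    under the current policy and the frozen SFT reference model.
    beta: strength of the KL penalty (illustrative value).
    """
    kl_estimate = logprob_policy - logprob_ref  # single-sample KL estimate
    return rm_score - beta * kl_estimate

# A response the reward model loves (score 2.0) but the SFT model finds
# very unlikely (log-prob gap of 20 nats) ends up with no net reward:
r = penalized_reward(rm_score=2.0, logprob_policy=-10.0, logprob_ref=-30.0)  # 0.0
```

The further the policy drifts from the SFT model on a given response, the more the penalty eats into the reward-model score, which is exactly what blocks reward hacking.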

If you've built a Thompson Sampling bandit, the conceptual link here is direct. In your bandit, the reward signal comes from observed outcomes. In RLHF, the reward signal comes from a learned model of human preferences. Both are using reward to shape policy -- the bandit updates arm probabilities, RLHF updates the weights of the entire language model. The scale is different by many orders of magnitude, but the structure is the same.

Constitutional AI: AI Feedback Instead of Human Feedback

Anthropic introduced Constitutional AI (CAI) in 2022 as an alternative to RLHF that reduces reliance on human labelers.2 The key insight is that you can use the AI itself to generate preference data, guided by a set of principles (the "constitution").

The process works in two phases:

  1. Critique and revision. The model generates a response, then is asked to critique its own response according to a constitutional principle (e.g., "Is this response harmful?"). It then revises the response to address its own critique. This generates improved training data without human labelers.
  2. RL from AI feedback (RLAIF). Instead of humans comparing outputs, the AI itself compares outputs using the constitutional principles. This preference data trains a reward model, and the rest proceeds like RLHF.
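
The critique-and-revision phase can be sketched as a loop over the model's own outputs. The `generate` function and the prompt templates below are hypothetical stand-ins, not Anthropic's actual implementation:

```python
# Sketch of critique-and-revision. `generate` is a hypothetical stand-in
# for a call to the model; the prompt wording is illustrative only.
PRINCIPLE = "Identify any ways in which this response is harmful or misleading."

def critique_and_revise(question, generate, rounds=1):
    response = generate(question)
    for _ in range(rounds):
        critique = generate(
            f"Response: {response}\n{PRINCIPLE}\nCritique:")
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to fully address the critique:")
    return response  # becomes an SFT target with no human labeler involved
```

Each round costs two extra model calls (one critique, one revision), and the final revised response joins the fine-tuning dataset.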

The advantage is scalability -- human labeling is expensive and slow. The risk is that the AI's judgment is only as good as the principles and the model's ability to apply them. In practice, CAI produces models that are competitive with RLHF on helpfulness and often superior on harmlessness, precisely because the constitution can encode nuanced principles that are hard to communicate to individual labelers.

Direct Preference Optimization (DPO)

In 2023, Rafailov et al. proposed DPO -- a method that eliminates the reward model entirely.3 Their key insight was mathematical: the optimal policy under the RLHF objective (maximize reward minus KL penalty) has a closed-form relationship with the reward function. This means you can reparameterize the problem so that instead of training a reward model and then doing RL, you directly optimize the language model on the preference data.

Concretely, DPO takes the same preference pairs used to train a reward model (response A is better than response B for this prompt) and uses them to directly adjust the language model's weights. The loss function increases the probability of the preferred response and decreases the probability of the dispreferred one, with the same KL constraint baked into the math.
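
The DPO loss for a single preference pair can be written directly. This sketch follows the published form, with beta playing the same role as the KL coefficient in RLHF:

```python
import math

def dpo_loss(lp_w, lp_l, ref_lp_w, ref_lp_l, beta=0.1):
    """DPO loss for one preference pair (sketch).

    lp_w / lp_l: log-prob of the preferred / dispreferred response under
    the model being trained; ref_lp_* under the frozen reference model.
    beta: the implicit KL-penalty strength (illustrative value).
    """
    margin = beta * ((lp_w - ref_lp_w) - (lp_l - ref_lp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Minimizing this loss raises the preferred response's probability relative to the reference model and lowers the dispreferred one's -- no sampling, no reward model, just gradient descent on pairs.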

Why this matters practically:

  1. No separate reward model to train, serve, or keep in sync with the policy.
  2. No reinforcement learning loop: PPO is notoriously sensitive to hyperparameters, while DPO trains like ordinary supervised learning.
  3. The same preference data, with fewer moving parts and substantially less compute and engineering effort.

DPO has become widely adopted since its publication, though RLHF remains common at frontier labs where the additional complexity is manageable and the flexibility of a separate reward model is valuable.

Instruction Tuning

Not all fine-tuning is about alignment. Instruction tuning is a specific form of SFT focused on teaching a model to follow instructions across a wide range of tasks.

The paradigm was established by two efforts. Google's FLAN (Finetuned Language Net, 2022) took a pre-trained model and fine-tuned it on over 1,800 tasks phrased as natural language instructions.4 OpenAI's InstructGPT (2022) combined SFT with RLHF to create a model that follows instructions while also being aligned.1

The striking finding from instruction tuning research is generalization. A model fine-tuned on a diverse set of tasks expressed as instructions becomes better at following instructions it has never seen before. The model isn't memorizing task-specific behaviors -- it's learning the meta-skill of instruction following. This is why you can ask ChatGPT to do things that aren't in any training example and get reasonable results: the model has learned the pattern of "read an instruction and do what it says."

Alignment: Helpful, Harmless, Honest

Alignment is the broader project of making models behave in accordance with human values and intentions. The phrase "helpful, harmless, and honest" (sometimes called HHH) captures the three core objectives:5

  1. Helpful: follow the user's instructions and genuinely assist with the task at hand.
  2. Harmless: avoid causing harm, including refusing requests for dangerous assistance.
  3. Honest: give accurate information and acknowledge uncertainty rather than fabricating.

These objectives can conflict. A model that is maximally helpful will assist with harmful requests. A model that is maximally harmless will refuse everything. Alignment is the engineering problem of navigating these trade-offs, and fine-tuning -- through SFT, RLHF, DPO, or constitutional methods -- is the primary mechanism for doing it.

It's worth noting that alignment is not the same as safety, though they overlap. Safety is about preventing catastrophic outcomes. Alignment is about making the model do what you actually want. A perfectly aligned model that is being used by someone with harmful intentions is safe in one sense (it's doing what the user wants) and unsafe in another (the outcomes are harmful). This distinction drives much of the ongoing debate in the field.

Catastrophic Forgetting

Fine-tuning isn't free. The main risk is catastrophic forgetting: the model loses capabilities it had before fine-tuning. This happens because the gradient updates that improve performance on the fine-tuning task can degrade performance on tasks that were well-handled by the pre-trained model.

The mechanism is straightforward. The pre-trained weights encode information about many tasks simultaneously. When you fine-tune on a specific task with a relatively small dataset, the optimization doesn't know about the other tasks -- it just adjusts weights to minimize loss on the fine-tuning data. If the adjustments that improve the fine-tuning task happen to hurt other capabilities, those capabilities degrade.

Mitigations include:

  1. Lower learning rates, so fine-tuning nudges the weights rather than overwriting them.
  2. Replay: mixing pre-training-style data into the fine-tuning batches so the gradients continue to support the original capabilities.
  3. Regularization toward the original weights, such as a KL penalty against the pre-fine-tuning model.
  4. Parameter-efficient methods such as LoRA (Chapter 14), which freeze the original weights entirely.
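
One widely used mitigation, replay (mixing pre-training-style data back into fine-tuning batches), can be sketched in a few lines; the batch size and replay fraction below are arbitrary illustrative choices:

```python
import random

def mixed_batch(finetune_data, pretrain_data, batch_size=8, replay_frac=0.25):
    """Replay mixing (sketch): reserve a slice of each fine-tuning batch
    for pre-training-style examples so gradients keep supporting old
    skills. Batch size and fraction are illustrative, not tuned values."""
    n_replay = int(batch_size * replay_frac)
    batch = random.sample(finetune_data, batch_size - n_replay)
    batch += random.sample(pretrain_data, n_replay)
    random.shuffle(batch)
    return batch
```

Because every batch still contains some general-purpose data, the optimizer never sees a loss landscape in which the old capabilities are irrelevant.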

The Fine-Tuning Landscape

Putting it all together, the post-pretraining pipeline for a modern language model typically looks like:

Step  Method                        Purpose                                       Data scale
1     Pre-training                  Learn language, knowledge, reasoning          Trillions of tokens
2     Supervised fine-tuning        Learn response format, instruction following  10K-100K examples
3     RLHF / DPO / CAI              Align behavior with human preferences         30K-100K comparisons
4     Domain-specific fine-tuning   Specialize for particular use cases           Varies

Steps 2-4 collectively take a base model that can autocomplete any text and turn it into an assistant that follows instructions, refuses harmful requests, and performs well on specific tasks. The remarkable thing is how much behavior can be shaped by fine-tuning on relatively small amounts of data -- the pre-training provides the foundation, and fine-tuning steers it.

This also explains why fine-tuning is now the most accessible entry point for customizing AI models. You don't need to pre-train from scratch. You don't even need to fine-tune all the weights. Chapter 14 covers the methods that make this practical even on consumer hardware.

Next: Chapter 14 — LoRA and Efficient Methods. Low-rank adaptation, adapters, quantization. Why you don't need to retrain everything, and what's actually happening mathematically.

1 Ouyang et al. (2022), "Training language models to follow instructions with human feedback." NeurIPS 35. The InstructGPT paper. Describes the full SFT + reward model + PPO pipeline. The SFT phase used approximately 13,000 human-written demonstrations; the reward model training used approximately 33,000 comparisons.

2 Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback." Anthropic. Introduced the idea of using AI-generated critiques and preferences guided by a set of constitutional principles, reducing dependence on human preference labelers.

3 Rafailov et al. (2023), "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." NeurIPS 36. Showed that the RLHF objective can be reparameterized to eliminate the reward model, turning the alignment problem into a standard supervised learning problem on preference pairs.

4 Chung et al. (2022), "Scaling Instruction-Finetuned Language Models." Google. The FLAN-T5 / FLAN-PaLM paper, demonstrating that instruction tuning on a diverse mixture of 1,800+ tasks improves performance on unseen tasks and that this effect scales with model size.

5 Askell et al. (2021), "A General Language Assistant as a Laboratory for Alignment." Anthropic. Introduced the HHH (helpful, harmless, honest) framework as a practical operationalization of AI alignment objectives.