Part IV — How LLMs Work

Pre-training

Next-token prediction, the data, and what happens when you scale both far beyond what anyone thought was useful.

The Objective

The core of pre-training is almost comically simple to state. Given a sequence of tokens, predict the next one. That's it. The entire foundation of GPT, Claude, Llama, and every other modern large language model rests on this single objective: causal language modeling, which is a formal way of saying "read left to right, guess what comes next."

More precisely: given a sequence of tokens [t1, t2, ..., tn], the model produces a probability distribution over its entire vocabulary for token tn+1. It assigns a probability to every possible next token — "the" might get 0.12, "a" might get 0.07, "defenestration" might get 0.000003 — and the training signal comes from comparing that distribution against what actually came next in the training data.

The loss function is cross-entropy loss over the vocabulary. If the model assigns probability p to the correct next token, the loss for that prediction is −log(p). When the model is confident and correct (assigns high probability to what actually came next), the loss is small. When it's wrong or uncertain, the loss is large. Averaged across billions of predictions, this single number — the cross-entropy loss — is the objective the model minimizes during training.
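The loss computation above can be sketched in a few lines. This is a minimal illustration, not production training code: a softmax turns raw model scores (logits) into a probability distribution, and the loss is −log(p) of the token that actually occurred.

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: raw scores over the vocabulary, shape (vocab_size,)
    target_id: index of the token that actually came next
    """
    # Softmax turns raw scores into a probability distribution.
    # Subtracting the max is a standard numerical-stability trick.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Loss is -log(probability assigned to the true next token).
    return -np.log(probs[target_id])

# Toy vocabulary of 5 tokens; the model strongly favors token 2.
logits = np.array([1.0, 0.5, 4.0, 0.1, -1.0])
print(next_token_loss(logits, target_id=2))  # confident and correct: small loss
print(next_token_loss(logits, target_id=4))  # wrong: large loss
```

When the favored token is correct the loss is small; when the data disagrees with the model's confidence, the loss is large, exactly as described above.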

[Figure: Causal language modeling. The input "The cat sat on the" feeds a Transformer with billions of parameters, which must predict the next token. The output is a probability distribution over the entire vocabulary (~50,000–100,000 tokens): P(mat) = 0.32, P(floor) = 0.11, P(table) = 0.08, and so on.]


What makes this objective powerful is that it's self-supervised. The training data doesn't need human labels. Any text corpus is automatically a source of training examples: every sequence of tokens contains a prediction target at every position. A single book with 100,000 tokens yields roughly 100,000 training examples. A dataset of trillions of tokens yields trillions of training examples, each one free.
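The "every position is a training example" claim is easy to make concrete. The sketch below (illustrative only; real pipelines work on token IDs in fixed-length batches) slides over a sequence and emits one (context, target) pair per position:

```python
def training_examples(tokens, context_len=4):
    """Every position in a token sequence is a free training example:
    the tokens before it are the input, the token itself is the target."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_len):i]
        examples.append((context, tokens[i]))
    return examples

tokens = ["The", "cat", "sat", "on", "the", "mat"]
for context, target in training_examples(tokens):
    print(context, "->", target)
# A sequence of n tokens yields n-1 (context, target) pairs.
```

No labels are required: the corpus itself supplies both the inputs and the targets.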

Key idea: The pre-training objective is simple, but that simplicity is the point. Next-token prediction is a task so general that learning to do it well requires the model to internalize grammar, facts, reasoning patterns, style, and structure. The objective doesn't specify any of these things. They emerge because they're useful for prediction.

The Data

Pre-training data for modern LLMs is drawn from a mix of sources, each contributing different kinds of knowledge and language. The goal is breadth: the model should encounter as many domains, registers, languages, and reasoning styles as possible.

Major sources

Source What it contributes Scale
Common Crawl Web pages — broad coverage of topics, styles, and languages. Noisy: includes ads, spam, boilerplate, duplicate content. Petabytes raw; hundreds of billions of tokens after filtering
Books Long-form coherent text. Sustained argument, narrative structure, domain depth. Project Gutenberg and Books3 are commonly cited sources. Tens of billions of tokens
Wikipedia Encyclopedic factual content, structured and well-edited. Strong signal for factual recall and citation-like patterns. ~4 billion tokens (English); more across languages
Code GitHub, Stack Overflow. Formal syntax, logical structure, function composition. Models trained on code show improved reasoning on non-code tasks.1 Hundreds of billions of tokens
Scientific papers arXiv, PubMed, Semantic Scholar. Technical reasoning, mathematical notation, domain-specific terminology. Tens of billions of tokens
Forums and Q&A Reddit, Stack Exchange. Conversational tone, question-answer structure, multi-turn reasoning. Tens of billions of tokens

The total pre-training corpus for frontier models is on the order of trillions of tokens. Llama 2 (Meta, 2023) was trained on 2 trillion tokens.2 Llama 3 (2024) scaled to approximately 15 trillion tokens.3 These are not small numbers. For reference, the entire English Wikipedia is roughly 4 billion tokens. A trillion-token dataset is like reading Wikipedia 250 times over, except the rest is web pages, books, code, and everything else.

Data quality

Raw web data is garbage. Anyone who has scraped the internet knows this. Common Crawl contains exact duplicates, near-duplicates, SEO spam, auto-generated text, pornography, personal information, malware payloads, and vast amounts of text that is technically English but carries no useful signal.

The filtering pipeline — exact and near-duplicate removal, language identification, quality classifiers, and stripping of personal information — is a significant engineering effort in its own right.4

Key idea: Pre-training data is not "the internet." It's a heavily curated, filtered, deduplicated, and rebalanced subset of text from many sources. The curation decisions — what to include, what to exclude, how much weight to give each source — shape the model's knowledge and behavior as much as the architecture does.
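To make the curation step concrete, here is a toy sketch of two common filters: exact deduplication by content hash, and crude quality heuristics. The thresholds and heuristics are invented for illustration; real pipelines (such as the one described in the RefinedWeb paper) tune dozens of such rules empirically.

```python
import hashlib
import re

def clean_corpus(docs, min_words=50, max_symbol_ratio=0.3):
    """Toy pre-training filter: exact dedup plus simple quality heuristics.
    Thresholds are illustrative, not taken from any real pipeline."""
    seen = set()
    kept = []
    for doc in docs:
        # Exact deduplication: hash the whitespace-normalized text.
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Heuristic 1: very short documents carry little signal.
        if len(doc.split()) < min_words:
            continue
        # Heuristic 2: a high ratio of non-alphanumeric characters
        # often indicates markup residue or boilerplate.
        symbols = len(re.findall(r"[^A-Za-z0-9\s]", doc))
        if symbols / max(len(doc), 1) > max_symbol_ratio:
            continue
        kept.append(doc)
    return kept
```

Even this toy version shows why curation shapes the model: every threshold is a decision about what kinds of text the model will never see.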

The Scale

To appreciate what pre-training involves, consider the compute required. Training a frontier model is not something you do on a workstation, a university cluster, or even a modest cloud setup. It is an industrial operation.

GPT-3 (175 billion parameters) required approximately 3,640 petaflop-days of compute to train — equivalent to running a petaflop-class system for ten years, or a thousand such systems for 3.6 days.5 In practice, OpenAI used a cluster of thousands of Nvidia V100 GPUs. The estimated cost at 2020 cloud GPU prices was in the range of $4.6 million for a single training run.6

The numbers have only grown. Meta's Llama 3 405B model used a cluster of 16,384 Nvidia H100 GPUs and trained for approximately 54 days.3 GPT-4's training costs have been estimated at over $100 million, though OpenAI has not confirmed specifics. The energy consumption of a single frontier training run is measured in gigawatt-hours — comparable to the annual electricity consumption of a small town.
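The compute figures above can be roughly reproduced from first principles. A widely used rule of thumb from the scaling-law literature is that training a dense Transformer costs about 6 FLOPs per parameter per token (forward plus backward pass). Plugging in GPT-3's published numbers recovers a figure close to the 3,640 petaflop-days cited above:

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per token
    (forward + backward pass) for a dense Transformer."""
    return 6 * n_params * n_tokens

# GPT-3-scale run: 175B parameters on ~300B training tokens.
flops = training_flops(175e9, 300e9)
pf_days = flops / (1e15 * 86400)  # petaflop-days: 10^15 FLOP/s for one day
print(f"{flops:.2e} FLOPs ~= {pf_days:.0f} petaflop-days")  # ~3,646 PF-days
```

The close agreement with the published estimate is why the 6·N·D approximation is the standard back-of-envelope tool for sizing training runs.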

The GPT Progression

Model Year Parameters Layers / hidden dim Training data Compute
GPT-1 2018 117M 12 / 768 ~5B tokens (BookCorpus) 8 GPUs, ~30 days
GPT-2 2019 1.5B 48 / 1600 ~40B tokens (WebText) ~256 GPUs (estimated)
GPT-3 2020 175B 96 / 12288 ~300B tokens (filtered Common Crawl+) ~10,000 V100 GPUs, ~$4.6M
GPT-4 2023 Undisclosed MoE architecture (rumored) ~13T tokens (estimated) ~25,000 A100 GPUs (est.), >$100M

GPT-4 details are officially undisclosed; estimates are drawn from reporting and technical analysis.

The progression from GPT-1 to GPT-3 spans three orders of magnitude in parameters (117 million to 175 billion) and about two orders of magnitude in training data. Each jump brought qualitative changes in capability that were not predicted by extrapolating from the previous model. GPT-1 could complete sentences. GPT-2 could generate coherent paragraphs. GPT-3 could perform tasks it was never explicitly trained for.

What Emerges at Scale

The most consequential finding of the pre-training era is that large language models develop capabilities that nobody designed into them. These capabilities were not predicted from the architecture, not specified in the training objective, and not present in smaller models trained the same way. They appeared as the models got bigger.

In-context learning

GPT-3's most striking capability was in-context learning: the ability to perform a task by being shown examples in the prompt, without any parameter updates. Give GPT-3 a few examples of English-to-French translation, and it translates. Give it examples of sentiment classification, and it classifies. Give it examples of a made-up task — say, reversing words — and it often picks up the pattern. This is few-shot learning, and smaller models trained with the exact same objective simply couldn't do it.5

In-context learning was not engineered. Nothing in the training objective says "learn to learn from examples in the prompt." The model was trained only to predict the next token. But predicting the next token in a diverse corpus means recognizing that certain contexts establish patterns — a list of input-output pairs followed by a new input strongly predicts an output that follows the established pattern. The model learned to detect and follow patterns in context because doing so reduced its prediction loss.
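The pattern the model exploits is visible in the prompt format itself. The sketch below builds a few-shot prompt in the style of the GPT-3 paper's translation examples; the "Input:"/"Output:" labels are one common convention, not a requirement:

```python
def few_shot_prompt(examples, query):
    """Format (input, output) pairs as a few-shot prompt. The model
    continues the established pattern with no parameter updates."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    # The trailing "Output:" invites the model to complete the pattern.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "mint",
)
print(prompt)
```

A list of input-output pairs followed by a bare "Output:" makes the pattern-following completion overwhelmingly the most probable continuation, which is exactly why next-token prediction produces few-shot behavior.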

Chain-of-thought reasoning

At sufficient scale, models can be prompted to "think step by step" and produce intermediate reasoning before an answer. This chain-of-thought behavior, first systematically studied by Wei et al. (2022),7 dramatically improves performance on arithmetic, logic, and commonsense reasoning tasks. Crucially, it's an emergent capability: the same prompting technique applied to smaller models produces no improvement. There appears to be a threshold — somewhere around 100 billion parameters — below which chain-of-thought prompting doesn't help and above which it does.

Zero-shot task performance

Perhaps most remarkably, large models can perform tasks with no examples at all — just a natural language instruction. "Translate this sentence to French." "Summarize this paragraph." "Write a Python function that..." This zero-shot capability emerges from the model having internalized enough task structure from its training data that it can generalize to novel instructions.

Key idea: The emergent capabilities of large language models — in-context learning, chain-of-thought reasoning, zero-shot task transfer — were not designed. They were discovered. They arose from a simple training objective applied at sufficient scale to sufficient data. This has been described as the most surprising empirical finding in modern AI, and it remains poorly understood theoretically.

Scaling Laws

If larger models produce better results, the natural question is: how much better? And: what's the most efficient way to get there?

In January 2020, Jared Kaplan and colleagues at OpenAI published a paper that gave the field its first precise answers.8 They showed that the performance of language models — measured as cross-entropy loss on held-out test data — follows power-law relationships with three variables:

The relationships are strikingly smooth. Over many orders of magnitude, the test loss decreases as a power law:

L(N) ∝ N^(−0.076)     L(D) ∝ D^(−0.095)     L(C) ∝ C^(−0.050)

The exponents are small — meaning you need to increase each variable by a lot to get a meaningful improvement. Roughly: a 10x increase in model size yields about a 16% reduction in loss. A 10x increase in data yields about a 20% reduction. These are diminishing returns, but they're predictable, and they don't plateau within the ranges that have been tested.
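The percentage figures follow directly from the power-law exponents. Under L ∝ x^(−a), scaling x by a factor k multiplies the loss by k^(−a):

```python
def loss_reduction(scale_factor, exponent):
    """Fractional loss reduction from scaling a variable by `scale_factor`
    under a power law L proportional to x^(-exponent)."""
    return 1 - scale_factor ** (-exponent)

# Kaplan et al. exponents for model size (N) and data (D):
print(f"10x params: {loss_reduction(10, 0.076):.0%} lower loss")  # ~16%
print(f"10x data:   {loss_reduction(10, 0.095):.0%} lower loss")  # ~20%
```

Small exponents mean each order of magnitude buys only a modest loss reduction, which is exactly why frontier training budgets grow so quickly.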

[Figure: Scaling laws — test loss vs. compute on log-log axes, spanning roughly 10^−2 to 10^8 PF-days and losses from ~4.0 down to ~2.5. The relationship L(C) ∝ C^(−0.050) appears as a straight line; a power law is always a straight line on log-log axes. Lower loss (better performance) is toward the bottom.]

The Kaplan scaling laws had an immediate practical implication: for a fixed compute budget, it's more efficient to train a larger model for fewer steps than a smaller model for more steps. This is counterintuitive — it suggests you should build the biggest model you can afford and stop training before it fully converges — but the data supported it clearly.

The Chinchilla correction

Two years later, Hoffmann et al. at DeepMind published a correction that reshaped the field.9 Their paper, known as the "Chinchilla paper" after their model, argued that Kaplan et al. had underweighted the importance of training data relative to model size. The Kaplan laws suggested making models as large as possible for a given compute budget. Chinchilla showed that the optimal balance was roughly equal scaling of parameters and training tokens.

The specific finding: for compute-optimal training, a model should be trained on approximately 20 tokens per parameter. By this rule, a 70 billion parameter model should see about 1.4 trillion tokens. Chinchilla itself was a 70 billion parameter model trained on 1.4 trillion tokens, and it outperformed the 280 billion parameter Gopher model that had been trained on only 300 billion tokens — a model four times larger but fed five times less data.
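Combining the ~20 tokens-per-parameter rule with the 6·N·D compute approximation gives a simple recipe for splitting a compute budget. This sketch reproduces Chinchilla's own configuration from its implied budget (the 6·N·D cost model is a standard approximation, not taken verbatim from the paper):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a compute budget between parameters N and tokens D,
    assuming training cost C ~= 6*N*D and D = tokens_per_param * N."""
    # Solve 6 * N * (tokens_per_param * N) = C  ->  N = sqrt(C / (6*tpp))
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: 70B parameters on 1.4T tokens -> C ~= 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"N ~= {n/1e9:.0f}B params, D ~= {d/1e12:.1f}T tokens")
```

Running the same budget through Kaplan-style allocation would instead produce a much larger model trained on far fewer tokens — the undertrained regime Chinchilla identified.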

This had immediate consequences. The field had been racing to build the largest possible models (the "bigger is better" era). Chinchilla showed that many existing models were undertrained — they had more parameters than their training data could support. The efficient path was not just "make it bigger" but "make it bigger and feed it proportionally more data."

Prescription Kaplan et al. (2020) Chinchilla / Hoffmann et al. (2022)
When you 10x compute... Increase model size ~5.5x, data ~1.8x Increase model size ~3.2x, data ~3.2x
Optimal tokens-per-parameter Not explicitly stated (implies data matters less) ~20 tokens per parameter
Practical effect Build the biggest model you can Balance model size with training data

Later work by the Llama team at Meta took this even further. Llama 1 (2023) trained a 65 billion parameter model on 1.4 trillion tokens — right at the Chinchilla-optimal ratio. But Llama 2 pushed to 2 trillion tokens for the same 70B model size, and Llama 3 went to 15 trillion tokens for a 70B model, far beyond what Chinchilla prescribed.2,3 Why? Because Chinchilla optimizes for a fixed compute budget. If your goal is to produce the best possible model at a given size — perhaps because you need a model small enough to deploy on limited hardware — it pays to overtrain it on more data than the compute-optimal ratio suggests. The loss keeps decreasing, just less efficiently per FLOP.

The Bitter Lesson

In 2019, Rich Sutton — one of the founders of modern reinforcement learning — published a short essay called "The Bitter Lesson."10 His argument: across the 70-year history of AI research, general methods that leverage computation have consistently won over methods that try to encode human knowledge. Chess programs built on brute-force search beat chess programs built on grandmaster heuristics. Speech recognition systems based on statistical models beat systems built on linguistic rules. Machine translation systems trained on parallel text beat systems designed by linguists.

The "bitter" part is that researchers keep making the same mistake. They build clever, domain-specific systems that work well at small scale, and then general-purpose methods — with enough compute — blow past them. The lesson keeps being learned and then forgotten.

Pre-training is perhaps the purest example of the bitter lesson in action. The next-token prediction objective is not clever. It encodes no linguistic knowledge, no rules of reasoning, no understanding of the world. It's a general statistical objective applied with enormous compute to enormous data. And it produces systems that can translate, summarize, reason, write code, and answer questions — none of which were specified in the objective.

Sutton's conclusion: the two methods that scale with computation are search and learning. Everything else — hand-crafted features, domain-specific architectures, expert-designed rules — provides diminishing returns relative to simply using more compute for search and learning. The history of AI bears this out, and the scaling laws formalize it: make the model bigger, give it more data, spend more compute, and performance improves along a smooth, predictable curve.

This doesn't mean engineering doesn't matter. The Transformer architecture (Chapter 9) was a crucial innovation — not because it encoded linguistic knowledge, but because it was more parallelizable and therefore could leverage computation more efficiently than RNNs. The bitter lesson isn't that architecture doesn't matter. It's that the architectures that win are the ones that make the best use of compute.

The Compute Reality

Pre-training a frontier model requires infrastructure at an industrial scale. The key resource is GPU-hours — specifically, hours on Nvidia's data-center GPUs (A100, H100, and their successors). A single frontier run consumes millions of GPU-hours, spread across clusters of tens of thousands of accelerators running for weeks.

The cost and infrastructure requirements create a natural concentration of pre-training capability. As of 2025, the organizations that can train frontier models from scratch include OpenAI, Anthropic, Google DeepMind, Meta, Mistral, and a small number of Chinese labs (ByteDance, Alibaba, Baidu). The gap between what's needed for frontier pre-training and what an academic lab or startup can afford continues to grow.

Pre-training vs. Fine-tuning

Pre-training creates a general-purpose foundation. The model that comes out of pre-training is a powerful text predictor, but it's not immediately useful as an assistant, a code generator, or anything task-specific. It has absorbed the patterns and knowledge of its training data, but it has no notion of "follow the user's instructions" or "be helpful and honest."

That's where fine-tuning comes in — covered in detail in Chapter 13. The distinction matters:

Pre-training Fine-tuning
Objective Predict the next token Follow instructions, be helpful, align with values
Data Trillions of tokens, broad and unstructured Thousands to millions of curated examples
Compute Millions of GPU-hours Hundreds to thousands of GPU-hours
What changes All parameters, from random initialization All or some parameters, from pre-trained weights
Result A general text predictor A task-specific or instruction-following model

An analogy: pre-training is like educating a person broadly — years of reading, observing, absorbing how the world works, building a general understanding. Fine-tuning is like job training — teaching them the specific behaviors, norms, and skills their role requires. The education doesn't specify the job, but without it, the job training has nothing to build on. A fine-tuned model that was never pre-trained would be useless. A pre-trained model that was never fine-tuned can be powerful but is unrefined — it might complete your sentence with the rest of a Wikipedia article, or continue your question with another question, rather than answering it.

The asymmetry in compute cost is important. Pre-training is the expensive part — millions of dollars, weeks of training, trillions of tokens. Fine-tuning is comparatively cheap — thousands of dollars, hours to days, thousands of examples. This asymmetry is what enables the current ecosystem: a few large organizations pre-train foundation models, and the rest of the field fine-tunes them for specific applications.

Why Next-Token Prediction Works

There's a lingering puzzle: why does predicting the next token produce something that looks like understanding? The model is never told what a fact is, what an argument is, or what it means for something to be true. It's just trying to minimize prediction error on text. How does that produce a system that can answer questions, write code, and reason about novel problems?

One way to think about it: predicting the next token in a sufficiently diverse corpus is equivalent to compressing that corpus, and compression requires understanding. If the text says "The capital of France is," predicting "Paris" requires factual knowledge. If the text is a proof, predicting the next step requires logical reasoning. If the text is a conversation, predicting the next utterance requires modeling intent, context, and social convention. Every domain of human knowledge and every mode of reasoning, insofar as they appear in text, become prediction targets.

This doesn't mean the model "understands" in any deep philosophical sense — that question is open and probably not well-posed. What it means is that the statistical objective of next-token prediction, applied at scale, creates internal representations that capture an enormous amount of structure about language, knowledge, and reasoning. Whether those representations constitute understanding or merely simulate it is a question for philosophers. For engineering purposes, the capabilities are real and useful.

Ilya Sutskever, co-founder of OpenAI, put it concisely: "If you predict the next token well enough, you must have a world model inside." Whether that "must" is philosophically justified is debatable. But empirically, models that predict the next token better also perform better on tasks that seem to require world knowledge, spatial reasoning, and causal inference. The correlation is strong enough to drive hundreds of billions of dollars in investment.


Pre-training takes a pile of text and a simple objective — predict what comes next — and, given enough parameters and enough compute, produces something that looks like knowledge. The scaling laws tell us this process is predictable and hasn't plateaued. The emergent capabilities tell us it produces surprises along the way. The bitter lesson tells us this has happened before, across the entire history of AI.

But the model that comes out of pre-training is a statistical object: billions of floating-point numbers organized into weight matrices. What do those numbers actually represent? What is the model's "knowledge," and in what sense does it "know" anything?

Next: Chapter 12 — What the Model "Knows." Embeddings, latent space, what parameters represent. Why it's compression, not growth. What it means for a model to encode knowledge in 175 billion floating-point numbers — and what gets lost along the way.

1 Chen et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374. The Codex paper showed that code-trained models improved not only on programming tasks but also exhibited stronger structured reasoning on non-code benchmarks.

2 Touvron et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Meta AI.

3 Dubey et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Meta AI. Reports training on approximately 15 trillion tokens using a cluster of 16,384 H100 GPUs.

4 Penedo et al. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Alone." arXiv:2306.01116. Technology Innovation Institute.

5 Brown et al. (2020). "Language Models are Few-Shot Learners." arXiv:2005.14165. OpenAI. The GPT-3 paper, which demonstrated few-shot in-context learning and reported the 175B parameter count, training data composition, and compute requirements.

6 Li et al. (2020) estimated GPT-3 training costs at approximately $4.6M based on cloud GPU pricing. Actual costs to OpenAI may differ due to custom hardware arrangements.

7 Wei et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903. Google Research. Demonstrated that prompting models to produce intermediate reasoning steps significantly improves performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect appearing only in models above ~100B parameters.

8 Kaplan et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. OpenAI. Established power-law relationships between model performance and model size, dataset size, and compute.

9 Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556. DeepMind. The "Chinchilla" paper, which revised the Kaplan scaling laws and showed that contemporary models were significantly undertrained relative to their size.

10 Sutton, Rich (2019). "The Bitter Lesson." Published on Sutton's personal website, March 13, 2019. An informal essay arguing that general methods leveraging computation have historically outperformed methods leveraging human knowledge.