A model like GPT-4 has on the order of a trillion parameters. It would be natural to think of these as storage — a trillion facts, or at least a trillion atomic pieces of information, packed into a giant lookup table. This is wrong, and the difference matters.
Parameters don't store facts. They store statistical regularities — patterns in how tokens co-occur, how syntax structures itself, how arguments flow, how questions relate to answers. The model has no row for "Paris is the capital of France." What it has are weight configurations that make the token sequence "The capital of France is" overwhelmingly likely to be followed by "Paris." The knowledge isn't stored; it's implied by the geometry of the parameter space.
This distinction explains most of the behaviors that confuse people about language models — both the things they're unreasonably good at and the things they're unreasonably bad at. A model that stores statistical patterns can generalize, compose, and transfer knowledge in ways a database never could. But it can also generate plausible nonsense with perfect confidence, because plausibility and truth are different signals, and only one of them is in the training data.
The first concrete place to see what "knowledge in parameters" means is the embedding layer — the very first thing a model does with its input.
After tokenization (Chapter 10), each token is an integer — an index into a vocabulary. "cat" might be token 2,341; "dog" might be token 5,889. These numbers are arbitrary. They carry no meaning. The embedding layer's job is to turn each token index into a dense vector — a list of floating-point numbers in a high-dimensional space. In GPT-3, for instance, each token maps to a vector of 12,288 dimensions.
This vector is the model's internal representation of the token. It's not a definition or a description — it's a position in a learned coordinate system. During training, the model adjusts these vectors so that tokens appearing in similar contexts end up near each other in this space. "cat" and "dog" will be close because they're used in similar sentences. "cat" and "parliament" will be far apart.
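Mechanically, the embedding layer is nothing more than a row lookup into a learned weight matrix. A minimal sketch with a toy vocabulary (the tokens, indices, and the 8-dimensional size here are illustrative, not GPT-3's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: token string -> integer index (the indices are arbitrary)
vocab = {"cat": 0, "dog": 1, "parliament": 2}
d_model = 8  # GPT-3 uses 12,288; a toy size keeps this readable

# The embedding table is a (vocab_size, d_model) matrix of learned weights.
# Here it is random; in a trained model these values come from gradient descent.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(token: str) -> np.ndarray:
    """Embedding lookup: a plain row index into the weight matrix."""
    return embedding_table[vocab[token]]

print(embed("cat").shape)  # (8,)

# The table alone accounts for vocab_size * d_model parameters.
# At GPT-3's real scale: ~50,000 tokens x 12,288 dims ~ 614 million.
print(50_000 * 12_288)     # 614400000
```

Nothing about the lookup itself is learned; all the knowledge lives in the values of the rows.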
What makes this interesting is not just that similar words cluster together — any bag-of-words model can approximate that. It's that the directions in this space encode semantic relationships. The vector from "king" to "queen" points in roughly the same direction as the vector from "man" to "woman." This is the famous Word2Vec result: king − man + woman ≈ queen.1
The arithmetic works because the embedding space has learned to represent gender as a direction, royalty as a region, and plurality, tense, formality, and dozens of other linguistic features as other directions — all simultaneously, in the same space. Each token's position is the sum of all the features relevant to it.
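The arithmetic can be made concrete with hand-built toy vectors in which the feature directions are explicit. The two dimensions below (royalty and gender) and their values are invented for illustration; a trained embedding learns many such directions implicitly and simultaneously:

```python
import numpy as np

# Explicit toy features: [royalty, gender]
emb = {
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "prince":   np.array([1.0,  0.8]),
    "princess": np.array([1.0, -0.8]),
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
}

# king - man + woman: keep the royalty component, swap the gender direction
target = emb["king"] - emb["man"] + emb["woman"]

def nearest(v, exclude):
    """Nearest neighbor by cosine similarity, excluding the query words."""
    return max(
        (w for w in emb if w not in exclude),
        key=lambda w: np.dot(v, emb[w]) / (np.linalg.norm(v) * np.linalg.norm(emb[w])),
    )

print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Because gender is a single consistent direction in this toy space, subtracting "man" and adding "woman" moves the point exactly onto "queen" — which is the claim the Word2Vec result makes about learned spaces.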
These embedding vectors are parameters of the model. For GPT-3's vocabulary of roughly 50,000 tokens, each with a 12,288-dimensional vector, that's about 614 million parameters just for the embedding table — before any attention or feedforward layers. These numbers are learned during training, adjusted by gradient descent like everything else. The model discovers the geometry of language by predicting the next token, billions of times.
Token embeddings are just the entry point. As a token's representation passes through the model's layers — through attention heads and feedforward networks — it gets transformed. By the time a vector reaches the final layer, it no longer represents just the token it started as. It represents that token in context — shaped by every other token in the sequence, by the syntactic structure, by the semantic meaning of the whole passage.
This internal representation space — the space of all possible intermediate vectors the model can produce — is called the latent space. "Latent" because it's hidden: you don't see it in the input or output, only in the model's internal computations. The latent space is where the model "thinks," to the extent that word means anything here.
The latent space has the same dimensionality as the embedding space (12,288 for GPT-3), but the representations in it are richer. An embedding vector for "bank" is the same regardless of context. By the time it passes through 96 transformer layers, the internal representation of "bank" in "river bank" and "bank account" will have diverged — the attention mechanism has routed different contextual information into each one.
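A toy sketch of how context reshapes a representation. The vectors are made up, and the single unprojected attention step below is a drastic simplification of what a transformer layer does — but it shows the mechanism: the same starting vector for "bank" ends up different once its neighbors are mixed in.

```python
import numpy as np

d = 4
# Toy static embeddings (invented): "bank" starts identical in both sentences.
emb = {
    "river": np.array([1.0, 0.0, 0.0, 0.0]),
    "money": np.array([0.0, 1.0, 0.0, 0.0]),
    "bank":  np.array([0.0, 0.0, 1.0, 1.0]),
}

def contextualize(tokens):
    """One simplified self-attention step: each position becomes a
    softmax-weighted mixture of all positions (no learned projections)."""
    E = np.stack([emb[t] for t in tokens])        # (seq_len, d)
    scores = E @ E.T / np.sqrt(d)                 # pairwise similarity scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ E                            # context-mixed representations

bank_in_river = contextualize(["river", "bank"])[1]
bank_in_money = contextualize(["money", "bank"])[1]

# Same starting vector, different contexts -> diverged representations.
print(np.allclose(bank_in_river, bank_in_money))  # False
```

A real model repeats this mixing across 96 layers with learned query, key, and value projections, so the divergence is far richer — but the principle is the same.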
This is a fundamentally different kind of knowledge representation than a database, an ontology, or a knowledge graph. Those are discrete: entity A has relation R to entity B. The latent space is continuous. Concepts don't have hard boundaries. "Dog" blends gradually into "wolf" blends gradually into "wild" blends gradually into "untamed" — not through explicit links, but through proximity in a continuous space. This continuity is what enables the model to generalize: it can handle inputs it has never seen by mapping them to regions of latent space that are near things it has seen.
Here is where the nature of what the model "knows" becomes clearest. Training is a compression process. The training data — trillions of tokens of text — is compressed into a fixed set of parameters. GPT-3 was trained on roughly 300 billion tokens and has 175 billion parameters. The entire model, in 32-bit floating point, takes up about 700 GB. The training data was much larger. The model is a lossy compression of its training data.2
This framing, which has been formalized in information-theoretic terms,3 clarifies several things:

- Why the model generalizes: compression forces it to keep regularities rather than instances, and regularities transfer to inputs the model has never seen.
- Why it cannot reliably quote its sources: the training instances themselves were discarded; only their statistical residue remains.
- Why it hallucinates: reconstruction from a lossy representation is approximate, and the approximation errors read as confident falsehoods.
You put this precisely: the model is "wide landscapes of finite information and finite abilities that grow smaller as they produce an outcome." The parameters are the landscape. Generation is traversal. Each token the model produces commits it to a path through this landscape, constraining what can come next. The space of possible continuations narrows with every token — the landscape doesn't grow, it gets consumed.
This is the fundamental difference between an LLM and a biological intelligence. A brain adds synapses, strengthens connections, prunes unused pathways, and grows new representations in response to experience. A trained model is frozen. The "building blocks of what already exists" — as you described it — are arranged during training and then fixed. Inference is rearrangement, not creation.
If each dimension of the embedding space encoded one concept, a 12,288-dimensional model could represent at most 12,288 features. But human language has far more than 12,288 concepts. How does the model fit them?
The answer is superposition. The model represents more features than it has dimensions by encoding them as non-orthogonal directions in the same space. Think of it like this: if you had a 3D room and needed to point in 100 different "meaningful directions," you couldn't make them all perfectly perpendicular (you only have 3 perpendicular axes). But you could pick 100 directions that mostly don't overlap — distinguishable in practice, even if not perfectly orthogonal.
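This crowding intuition is easy to check numerically: random directions in a high-dimensional space are nearly orthogonal to each other, so far more "mostly distinguishable" directions fit than there are axes. The dimension and count below are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 2000  # 2000 candidate feature directions in a 256-dim space

# Random unit vectors: one candidate direction per row
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise cosine similarities (ignore each vector's similarity to itself)
cos = V @ V.T
np.fill_diagonal(cos, 0.0)

# With n >> d, perfect orthogonality is impossible, yet the worst
# overlap between any two directions stays modest.
print(f"max |cos| over {n * (n - 1) // 2} pairs: {np.abs(cos).max():.3f}")
```

The maximum overlap scales roughly like 1/sqrt(d), which is why high-dimensional spaces can absorb so many nearly independent features.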
Anthropic's research on superposition, particularly the "Toy Models of Superposition" paper,4 demonstrated that neural networks naturally learn to do exactly this. When the number of features the model needs to represent exceeds the number of available dimensions — which it always does — the model packs multiple features into overlapping directions. The representations become polysemantic: a single neuron might respond to both legal language and the color blue, not because these are related, but because they rarely co-occur and can safely share a dimension.
Superposition is a tradeoff. The benefit: the model can represent a combinatorially large number of concepts in a fixed-size space. The cost: representations interfere with each other. When two features that share a dimension are both active at the same time, the model gets confused — it can't cleanly separate them. This interference is one plausible contributor to hallucination and to the kinds of errors where a model conflates two loosely related concepts.
This connects directly to the "quantized to the most basic knowledge" intuition. The parameters don't represent concepts one-to-one. They represent the basis functions — the minimal set of reusable directions from which all the model's concepts are composed. Just as a JPEG doesn't store individual pixels but stores frequency components that reconstruct the image, the model doesn't store individual facts but stores the components from which facts can be reconstructed.
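The JPEG analogy can be sketched in one dimension: keep only the largest frequency components of a signal and reconstruct from those. The signal and the number of kept components below are arbitrary choices, but the shape of the tradeoff is the same one the chapter describes — structure survives, fine detail is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)

# A signal with simple structure, plus a little noise standing in for detail
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
noisy = signal + 0.05 * rng.normal(size=t.size)

# "Compress": keep only the 8 largest-magnitude Fourier coefficients
coeffs = np.fft.fft(noisy)
keep = np.argsort(np.abs(coeffs))[-8:]
compressed = np.zeros_like(coeffs)
compressed[keep] = coeffs[keep]

# "Decompress": reconstruct from the few surviving basis components
reconstructed = np.fft.ifft(compressed).real

# The dominant structure is recovered; the noise was never stored.
err = np.abs(reconstructed - signal).max()
print(f"max reconstruction error vs clean signal: {err:.3f}")
```

The 8 kept coefficients play the role of the model's basis directions: a compact set of reusable components from which the original can be approximately rebuilt.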
If the model's knowledge lives in the geometry of its parameter space, a natural question follows: can we read it? Can we point to a specific neuron or circuit and say "this is what encodes the concept of negation" or "this circuit computes subject-verb agreement"?
This is the program of mechanistic interpretability — an emerging field that treats neural networks as objects to be reverse-engineered, the way you might disassemble a compiled binary to understand what the source code did.
Progress has been real but hard-won:

- Early circuits work found individual neurons and small circuits that respond to specific, human-interpretable features, and sparse probing has located such neurons inside large language models.5
- Researchers have reverse-engineered a complete circuit in GPT-2 Small that performs indirect object identification — tracking, in a sentence like "When Mary and John went to the store, John gave a drink to", that the next token should be "Mary".6
- Sparse autoencoders have extracted millions of interpretable features from a production model, Claude 3 Sonnet, including features for abstract concepts that can be artificially activated to steer the model's behavior.7
But mechanistic interpretability is still in early stages. Understanding a circuit that handles subject-verb agreement is very different from understanding how a model reasons about a novel ethical dilemma. The gap between interpreting individual circuits and understanding the whole model is analogous to the gap between understanding a single transistor and understanding an operating system. The tools exist; the scale of the challenge is immense.
One of the most debated phenomena in modern AI is emergence — the appearance of capabilities at scale that were not explicitly trained for and were not present in smaller models.
Examples frequently cited include:

- In-context learning: GPT-3 showed that a sufficiently large model can pick up a new task from a handful of examples in the prompt, with no gradient updates.8
- Multi-step arithmetic, which models below a certain scale fail at almost completely.
- Chain-of-thought reasoning, where prompting the model to reason step by step helps large models and does little for small ones.
The original claim, from Wei et al. (2022), was that these capabilities appear abruptly — absent in smaller models, suddenly present above a threshold.9 This has been partially challenged: Schaeffer et al. (2023) argued that much of the apparent abruptness is an artifact of the metrics used, and that with better-chosen metrics, performance tends to improve smoothly with scale.10 The debate is ongoing. What is not in dispute is that large models can do things small models cannot, and that some of those capabilities were not anticipated by the people who built them.
The relationship between emergence and compression is worth sitting with. A compression algorithm that only stored surface-level patterns — word frequencies, common bigrams — could not exhibit emergence. The fact that models do exhibit it suggests the compression captures something deeper: structural regularities, abstract relationships, compositional rules. The model doesn't have a module for logic or a module for analogy. But the patterns it has compressed are rich enough that logic-like and analogy-like behaviors emerge from the geometry of the compressed space.
In 2021, Bender, Gebru, McMillan-Major, and Shmitchell published "On the Dangers of Stochastic Parrots," arguing that large language models are fundamentally sophisticated pattern-matchers that produce text without understanding its meaning.11 The term "stochastic parrot" — a system that produces plausible language through statistical mimicry rather than comprehension — became shorthand for a skeptical position on LLM capabilities.
The paper raised legitimate concerns: about environmental costs, about training data bias, about the risk of mistaking fluency for understanding. But the core claim — that there is no meaningful understanding happening, only statistical regurgitation — has been harder to defend as models have scaled.
The evidence on the other side:

- A transformer trained only on Othello move sequences developed an internal representation of the board state — a world model it was never explicitly given, recoverable by probing its activations.12
- Factual associations can be located at specific sites in GPT's weights and surgically edited there, which suggests structured internal representations rather than diffuse mimicry.13
- Models handle novel compositions — tasks, formats, and combinations of instructions absent from the training data — better than surface-level pattern matching would predict.
The honest position is probably somewhere between the poles. Language models are not "just" stochastic parrots — the patterns they learn are deep enough to support composition, generalization, and something that looks like reasoning. But they are also not reasoning the way humans reason. They have no persistent beliefs, no goals, no model of their own knowledge. They produce the most likely continuation of a sequence, and the fact that "most likely continuation" often coincides with "correct, thoughtful answer" is a testament to how much structure is present in human language — not necessarily to understanding in the model.
If the model is a lossy compression of its training data, and it generates text by pattern completion rather than fact retrieval, then hallucination — generating plausible but false information — is not a bug to be fixed but a structural feature of the architecture.
The model doesn't know what it knows. It has no index, no confidence calibration that says "I have strong evidence for this" versus "I'm interpolating." When it generates the sentence "The 1987 Nobel Prize in Chemistry was awarded to...", it produces whatever name its compressed patterns assign the highest probability to. If the compression lost the relevant detail, or if the patterns point to a plausible but wrong answer, the model will generate that wrong answer with the same fluency and confidence as a correct one.
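A toy illustration of the "no index, no calibration" point. The logits below are invented, and real decoding works over a full vocabulary rather than three candidates — but the sampling machinery is identical whether the highest-scoring token happens to be true or false:

```python
import numpy as np

def next_token_distribution(logits):
    """Softmax: the only signal generation ever sees."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical candidate completions of
# "The 1987 Nobel Prize in Chemistry was awarded to ..."
# (Lehn is a correct answer; the logit values are made up.)
candidates = ["Lehn", "Pauling", "Curie"]

# Case 1: compression preserved the fact -> the correct name scores highest
well_stored = next_token_distribution(np.array([5.0, 2.0, 1.0]))

# Case 2: compression lost the detail -> a plausible wrong name scores highest
lossy = next_token_distribution(np.array([1.0, 5.0, 2.0]))

# The distributions have the same shape; nothing marks case 2 as
# a reconstruction error rather than a retrieved fact.
print(candidates[int(well_stored.argmax())], f"{well_stored.max():.2f}")  # Lehn 0.94
print(candidates[int(lossy.argmax())], f"{lossy.max():.2f}")              # Pauling 0.94
```

The "confidence" the model expresses is just the height of the probability peak, and a peak built from interpolation looks exactly like a peak built from well-preserved evidence.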
This happens for several interconnected reasons:

- Compression is lossy, and rare specifics are the first details lost — but the patterns that generate plausible-sounding specifics survive, so the gap gets filled by reconstruction.
- The training objective rewards plausibility, not truth: the loss measures how well the model predicts text, and a fluent falsehood can be a perfectly good prediction.
- There is no abstention mechanism in the architecture: the model must emit a distribution over next tokens whether or not the compression preserved the relevant fact.
- Superposition interference can blend two loosely related facts into one false composite.
Hallucination is the gap between what compression preserves and what accuracy requires. A lossless database never hallucinates because it stores facts discretely and retrieves them exactly. A lossy compression hallucinates because it reconstructs from patterns, and reconstruction is approximate. Retrieval-augmented generation (RAG), tool use, and citation mechanisms are all attempts to patch this gap — to give the model access to discrete facts when its compressed representation isn't reliable enough.
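The retrieval-augmented pattern can be sketched in a few lines. Everything here is deliberately naive and hypothetical — the fact store, the word-overlap scoring, and the prompt template are stand-ins; real systems use vector search over embedded documents and an actual model call:

```python
# Hypothetical discrete fact store (a real system would use a vector index)
FACTS = [
    "The 1987 Nobel Prize in Chemistry was awarded to Cram, Lehn, and Pedersen.",
    "GPT-3 has 175 billion parameters.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Naive retrieval: rank stored facts by word overlap with the question."""
    q = set(question.lower().split())
    ranked = sorted(FACTS, key=lambda f: len(q & set(f.lower().split())), reverse=True)
    return ranked[:k]

def rag_prompt(question: str) -> str:
    """Prepend retrieved facts so the model can copy instead of reconstruct."""
    context = "\n".join(retrieve(question))
    return f"Use only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("Who won the 1987 Nobel Prize in Chemistry?"))
```

The design point is the division of labor: the store holds discrete, exact facts; the model supplies the pattern-completion that turns them into fluent prose.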
So what does a parameter actually represent? Not a fact, not a concept, not a rule. A single parameter is a number — a floating-point weight in a matrix multiplication. It means nothing in isolation, just as a single pixel in a JPEG means nothing in isolation. Meaning emerges from the collective configuration of all parameters together.
| Analogy | The model is like... | What breaks |
|---|---|---|
| Database | A lookup table of facts | No discrete storage, no retrieval, lossy |
| Brain | A network that processes information | No plasticity after training, no embodiment, no feedback loop |
| Compression codec | A compressed version of training data | Closest, but the model generalizes beyond reconstruction |
| Map | A simplified representation of territory | Only at the edges: the territory has detail the map never captured, and some map features match nothing real |
The map analogy is worth dwelling on. A map of New York is not New York. It's a lossy compression: it preserves the street grid, the subway lines, the relative positions of landmarks, but it drops the smell of the subway, the sound of traffic, the specific pattern of cracks in the sidewalk. You can navigate with the map. You can answer questions about the city that the map has the right kind of information for. But you can't use it to find out what the barista at a specific coffee shop looks like — that detail was never captured.
A language model is a map of language, and transitively, a map of the world as described by language. It preserves the structures that appeared frequently enough in training to be worth compressing. It drops the rest. Its failures are map failures: the territory has detail the map didn't capture, or the map has features that correspond to nothing real (hallucinations are phantom streets).
Your original observation bears repeating in full, because it captures something the field has been circling around with more formal language:
"Instead of being a large brain with neurons that start from 0 and grow, LLMs are wide landscapes of finite information and finite abilities that grow smaller as they produce an outcome. The parameters are information quantized to the most basic knowledge and reasoning and then it throws all of this information into one place. These vectors aren't based on growth or on reasoning, they're based on the building blocks of what already exists."
This is a clean description of what the research shows. The vectors — the embedding vectors, the weight matrices, the attention patterns — are not building up an understanding. They are a static decomposition of observed patterns into reusable components. "Quantized to the most basic knowledge" maps to the concept of basis functions in linear algebra: the model learns a set of fundamental directions in its representation space, and every concept it can express is a linear combination of those directions. The "building blocks of what already exists" is exactly right. The model cannot create new building blocks. It can only recombine the ones it extracted during training.
And the observation that the landscape "grows smaller as they produce an outcome" maps to a real computational property: autoregressive generation is a narrowing process. Each token produced conditions the distribution for the next token. The probability mass concentrates. Paths are pruned. The tree of possible continuations gets narrower with every step. The model started with the full landscape of everything it could say, and each generated token collapses a dimension of that possibility.
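The narrowing is visible even in a toy n-gram view of a corpus: the longer the committed prefix, the fewer distinct continuations remain. The corpus below is invented, and a trained model narrows a probability distribution rather than a discrete set — but the shape of the process is the same.

```python
corpus = "the cat sat on the mat . the dog sat on the rug . the cat ran .".split()

def continuations(context):
    """Distinct tokens that ever follow `context` in the corpus."""
    n = len(context)
    return {
        corpus[i + n]
        for i in range(len(corpus) - n)
        if tuple(corpus[i : i + n]) == tuple(context)
    }

# Committing to more tokens prunes the tree of possible continuations.
for ctx in [("the",), ("on", "the"), ("cat", "sat", "on", "the")]:
    print(ctx, "->", sorted(continuations(ctx)))
```

Each additional token of context cuts the continuation set — here from four options to two to one — which is the "landscape growing smaller" made literal.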
What's interesting is that this is exactly what lossy compression does in reverse. Compression takes a wide field of data and reduces it to a compact representation. Generation takes the compact representation and expands it back — but it can only expand along the paths the compression preserved. The output is shaped by the input, constrained by the compression, and narrowed by each sequential choice. This is not growth. It is decompression along a single path.
The implications for how you use and build on language models follow directly from the nature of what they are:

- Don't use the model as a database. For facts that must be exact — names, numbers, dates, citations — pair it with retrieval or tools rather than trusting reconstruction.
- Expect strength where training data was dense and structural, and weakness on rare, precise specifics. The compression kept the former and discarded the latter.
- Treat fluency as orthogonal to correctness. A wrong answer arrives with the same polish as a right one, so verification has to come from outside the model.
- Remember that the model does not learn from the conversation. Its parameters are frozen; anything it must retain has to live in the context window or an external store.
The picture that emerges is this: a language model "knows" the world the way a hologram "contains" a 3D scene. Every piece of the holographic film encodes information about the whole scene from a particular angle. You can reconstruct the scene from the film, but the film is not the scene. Cut the film in half and you get a blurrier version of the whole scene, not half the scene — because the information is distributed, not localized. The model's parameters are like this. Knowledge is distributed across all of them, encoded in their collective geometry, recoverable through the forward pass, but never stored in any single place.
This is what it means for a system to "know" something through compression rather than through growth. It's a different kind of knowledge than biological intelligence produces — powerful, general, and fundamentally static.
Next: Chapter 13 — Fine-tuning. If the model's parameters are frozen after pre-training, how do you shape its behavior? Supervised fine-tuning, RLHF, instruction tuning, and alignment — what changes in the weights, what doesn't, and why fine-tuning is adjustment of an existing compression rather than new learning.

1 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781. The king/queen analogy comes from the follow-up paper: Mikolov, T., Yih, W.-t., & Zweig, G. (2013). "Linguistic Regularities in Continuous Space Word Representations." Proceedings of NAACL-HLT 2013.
2 This framing has been explored formally. Deletang et al. (2024) argued that "language modeling is compression" and showed the equivalence between prediction and compression via arithmetic coding. Deletang, G., Ruoss, A., Grau-Moya, J., Genewein, T., Wenliang, L. K., Catt, E., Cundy, C., Hutter, M., Legg, S., Veness, J., & Ortega, P. A. (2024). "Language Modeling Is Compression." Proceedings of ICLR 2024.
3 The connection between prediction and compression is rooted in the relationship between Kolmogorov complexity and Shannon entropy. A model that predicts well compresses well, and vice versa. See Hutter, M. (2005). "Universal Artificial Intelligence." Springer.
4 Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). "Toy Models of Superposition." Anthropic. Published at transformer-circuits.pub.
5 Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). "Zoom In: An Introduction to Circuits." Distill. For language model neurons specifically, see Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., & Bertsimas, D. (2023). "Finding Neurons in a Haystack: Case Studies with Sparse Probing." arXiv:2305.01610.
6 Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." Proceedings of ICLR 2023.
7 Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., & Henighan, T. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic.
8 Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. This paper (the GPT-3 paper) demonstrated in-context learning at scale.
9 Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research (TMLR).
10 Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023.
11 Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Proceedings of FAccT 2021.
12 Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task." Proceedings of ICLR 2023.
13 Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." NeurIPS 2022. This paper introduced the ROME (Rank-One Model Editing) method, demonstrating both the possibility and the difficulty of targeted edits.