The first three chapters traced intelligence from biology to the earliest AI models. From here, the conversation shifts to the actual machinery. Neural networks are mathematical objects. Training them is an optimization problem. Evaluating them is a statistical question. The gap between "a neuron computes a weighted sum" and "here is a system that writes poetry" is bridged entirely by the math in this chapter.
None of this is arbitrary. Every concept here was pulled into AI because it solved a specific problem: how to represent data, how to measure similarity, how to quantify error, how to improve a model incrementally, how to decide what "better" even means. The math came first. The AI is built on top of it.
A vector is an ordered list of numbers. That's all it is structurally. What makes it powerful is what it represents.
In AI, a vector is the standard representation for almost anything the model needs to work with. A word, a sentence, an image, a user's preferences, the internal state of a neuron — all of these get encoded as vectors. When people talk about "embeddings," they mean vectors: a word like "king" gets mapped to something like [0.21, -0.94, 0.43, ...] with hundreds or thousands of dimensions. Each dimension captures some learned feature of meaning.
A vector in n dimensions is written as:

v = (v₁, v₂, ..., vₙ)
In two or three dimensions, you can picture a vector as an arrow pointing somewhere in space. In 768 dimensions (the embedding size of BERT), you obviously can't visualize it — but the geometry still works. Vectors can be added, scaled, and compared, and those operations retain their geometric meaning regardless of dimensionality.
The dot product of two vectors is the operation that measures their similarity. Given two vectors a and b, each with n components:

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
That's a dot product — it's asking: how much do these two vectors point in the same direction? If both vectors point the same way, the dot product is large and positive. If they're perpendicular (unrelated), it's zero. If they point in opposite directions, it's large and negative.
The dot product is everywhere in AI. In a neural network, each neuron computes a dot product between its weight vector and its input vector — that's the "weighted sum" from Chapter 3's perceptron. In attention mechanisms (the core of transformers, Chapter 9), the model computes dot products between query and key vectors to determine which parts of the input to focus on. In recommendation systems and search engines, cosine similarity — a normalized version of the dot product — measures how closely two items match.
The connection between the dot product and the angle between vectors comes from the geometric identity:

a · b = ‖a‖ ‖b‖ cos(θ)
Cosine similarity normalizes this by dividing out the magnitudes, leaving just cos(θ) — a value between -1 and 1 that measures direction regardless of length. This matters when you care about what two vectors represent, not how large they are. Two documents about the same topic should be similar even if one is three times longer.
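Both the raw dot product and its normalized form are one-liners in NumPy. A minimal sketch (the function name `cosine_similarity` is illustrative, not from any particular library):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of magnitudes — cos(theta)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # perpendicular to a

print(np.dot(a, b))             # 2.0 — positive: pointing the same way
print(cosine_similarity(a, b))  # 1.0 — the length difference divides out
print(cosine_similarity(a, c))  # 0.0 — perpendicular, unrelated
```

Dividing out the magnitudes is exactly why two documents about the same topic score as similar even when one is three times longer.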
A matrix is a rectangular grid of numbers. An m × n matrix has m rows and n columns. If a vector is a single list, a matrix is a table — and the reason matrices matter is that they represent transformations.
When you multiply a matrix W by a vector x, you get a new vector y:

y = Wx, where yᵢ = Σⱼ Wᵢⱼ xⱼ
Each row of W computes a dot product with x. So the output vector y has one entry per row of W, and each entry is a weighted combination of the inputs. This is exactly what a layer of neurons does: each neuron takes the same input vector, applies its own weights (one row of the matrix), and produces one output. A layer with 512 neurons receiving 768 inputs is a 512 × 768 matrix multiplication.
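This correspondence is easy to check directly. A sketch in NumPy, using the layer sizes from the text (the weights here are random, as they would be before training):

```python
import numpy as np

rng = np.random.default_rng(0)

# A layer of 512 neurons receiving 768 inputs is a 512 x 768 weight matrix.
W = rng.normal(size=(512, 768))
x = rng.normal(size=768)      # one input vector, e.g. an embedding

y = W @ x                     # the whole layer in one matrix-vector product
print(y.shape)                # (512,) — one output per neuron

# Entry i of y is exactly the dot product of row i of W with x.
assert np.allclose(y[0], np.dot(W[0], x))
```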
Matrix-matrix multiplication follows the same logic. If A is m × n and B is n × p, then C = AB is m × p, where each entry is:

Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ
The inner dimensions must match — the number of columns in A must equal the number of rows in B. This is a hard constraint, not a suggestion. Dimension mismatches are one of the most common errors when building neural networks, and the fix is always structural: reshape your data or adjust your layer sizes until the matrices are compatible.
The reason matrices matter so much in AI is that the weight matrix in a neural network layer is a learned transformation. Before training, it's random — it maps inputs to meaningless outputs. During training, gradient descent adjusts the values so that the transformation does something useful: maps a word embedding to a prediction, maps an image patch to a feature vector, maps a query to a set of attention scores.
A matrix doesn't just scale inputs — it can rotate, stretch, compress, and project them. A 768-dimensional embedding passing through a 128 × 768 weight matrix gets projected down to 128 dimensions. Information is compressed. What gets kept and what gets discarded is determined by what the training process found useful. That's what learning is, expressed as linear algebra: the network learns which projections of the input space carry the information that matters for the task.
An eigenvector of a matrix A is a vector v that, when A is applied to it, only gets scaled — it doesn't change direction. The scaling factor λ is the eigenvalue:

Av = λv, with v ≠ 0
Why does this matter? Because eigenvectors tell you what a matrix transformation actually does in its simplest form. A diagonalizable matrix, however complex its apparent effect, is really just stretching or compressing space along a set of independent axes — its eigenvectors. The eigenvalues tell you how much stretching happens along each axis.
In AI, eigenvalues and eigenvectors show up most directly in Principal Component Analysis (PCA) — a technique for dimensionality reduction. PCA computes the eigenvectors of the data's covariance matrix. The eigenvector with the largest eigenvalue points in the direction of greatest variance in the data — the axis along which the data is most spread out. The second eigenvector points in the direction of second-greatest variance, and so on. By projecting your data onto the top k eigenvectors, you get a k-dimensional representation that captures the most important structure while discarding noise.1
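PCA can be sketched in a few lines of NumPy on synthetic 2-D data (the data here is invented for illustration — most of its variance lies along one diagonal direction):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D data whose variance mostly lies along the direction (1, 0.5).
t = rng.normal(size=500)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrices, ascending eigenvalues

top = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
projected = Xc @ top                    # 1-D projection along the main axis
print(projected.shape)                  # (500,)
```

The top eigenvector recovers the diagonal direction the data was built along; projecting onto it keeps the signal and discards most of the noise.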
You'll also encounter eigenvalues in understanding training dynamics — whether a loss surface is easy or hard to optimize depends on the eigenvalue spectrum of its Hessian (the matrix of second derivatives). Large ratios between the biggest and smallest eigenvalues create "narrow valleys" that gradient descent struggles with.
Linear algebra tells you what a network computes. Calculus tells you how to make it compute something better. The entire training process — the mechanism by which a network goes from random weights to useful behavior — is a calculus problem.
A derivative measures how much a function's output changes when you nudge its input. If f(x) = x², then f′(x) = 2x: at x = 3, a tiny nudge in x changes f by roughly 6 times the nudge. The derivative is the slope of the function at that point — the instantaneous rate of change.
Neural networks don't have one input — they have millions of parameters. Each weight is a separate variable. A partial derivative measures how much the output changes when you nudge one weight while holding all others fixed:

∂L/∂wᵢ — the change in the loss L per tiny change in weight wᵢ, with every other weight frozen.
The collection of all partial derivatives — one for every parameter in the model — forms the gradient, written ∇L. The gradient is itself a vector, pointing in the direction of steepest increase in the loss function. It tells you: if you want the loss to go up as fast as possible, move this way. If you want the loss to go down, move in the opposite direction. That's gradient descent.
The chain rule is what makes training deep networks possible. It tells you how to compute the derivative of a composition of functions — and a neural network is exactly that: a composition of functions, one per layer.
If y = f(g(x)), then:

dy/dx = f′(g(x)) · g′(x)
In a three-layer network, the output is something like L = loss(f₃(f₂(f₁(x)))). To compute how the loss changes with respect to a weight in the first layer, the chain rule says: multiply the derivative of the loss with respect to layer 3's output, times the derivative of layer 3's output with respect to layer 2's output, times the derivative of layer 2's output with respect to layer 1's output, times the derivative of layer 1's output with respect to the weight you care about.
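The chain rule can be sanity-checked numerically. A toy sketch with an invented two-function composition (these functions are for illustration only):

```python
# Chain rule on y = f(g(x)), with f(u) = u**2 and g(x) = 3*x + 1.
# Analytically: dy/dx = f'(g(x)) * g'(x) = 2*(3*x + 1) * 3.

def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def dydx_chain(x):
    return 2 * g(x) * 3            # f'(g(x)) * g'(x)

# Numerical check: nudge x a little and watch the output move.
x, h = 2.0, 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(dydx_chain(x))               # 42.0
print(numeric)                     # also ~42.0
```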
This chained multiplication, applied systematically from the output layer back to the input layer, is backpropagation. It was popularized by Rumelhart, Hinton, and Williams in 1986, and it remains the algorithm that trains virtually every neural network today.2 Without the chain rule, there would be no way to efficiently compute how each of millions of weights should change. Chapter 6 will cover backpropagation in full mechanical detail.
Once you've computed the gradient — the vector of partial derivatives telling you which direction increases the loss — the update step is straightforward. Move each weight a small step in the opposite direction:

wᵢ ← wᵢ − η · ∂L/∂wᵢ
In vector form, updating all weights at once:

w ← w − η∇L
The learning rate η controls the step size. Too large and you overshoot — the loss bounces around or diverges. Too small and training crawls, or stalls on flat regions before reaching a good solution. Choosing a good learning rate (and scheduling it — shrinking it over time) is one of the most important practical decisions in training a neural network.
The diagram shows a simplified picture — one weight, one smooth curve. Real loss surfaces have millions of dimensions and complex topology: saddle points, flat plateaus, sharp ravines. But the principle is the same. At each step, compute the gradient, step opposite to it, repeat. The variants (SGD with momentum, Adam, AdaGrad) all modify this basic recipe to handle the challenging geometry of real loss surfaces. Chapter 6 covers these in detail.
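The one-weight case takes only a few lines. A sketch with a hypothetical loss L(w) = (w − 5)², whose gradient is 2(w − 5) and whose minimum sits at w = 5:

```python
# Gradient descent on a single-weight loss L(w) = (w - 5)**2.

def grad(w):
    return 2 * (w - 5)   # dL/dw

w = 0.0        # starting point
eta = 0.1      # learning rate
for _ in range(100):
    w -= eta * grad(w)   # step opposite to the gradient

print(w)       # close to 5.0
```

Each step shrinks the distance to the minimum by a constant factor (1 − 2η); try η = 1.1 to watch the overshoot-and-diverge failure mode from the text.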
Neural networks produce outputs — but what do those outputs mean? When a language model says the next word is "cat" with probability 0.73, what does that number represent? When we say a model has "learned" the distribution of English text, what distribution? The language for all of this is probability theory.
A probability distribution assigns a number between 0 and 1 to each possible outcome such that all the numbers sum to 1. For discrete outcomes (like the next word from a vocabulary of 50,000 tokens), the model produces 50,000 numbers that sum to 1 — that's the predicted distribution over the vocabulary.
Conditional probability — the probability of one event given that another has occurred — is written:

P(A | B) = P(A ∩ B) / P(B)
Language models are fundamentally conditional probability machines. When GPT generates text, it's computing P(next word | all previous words) at each step. The entire model is an estimator of this conditional distribution.
Bayes' theorem tells you how to update a belief when you get new evidence. It follows directly from the definition of conditional probability:

P(H | D) = P(D | H) · P(H) / P(D)
In words: the probability of a hypothesis H after seeing data D (the posterior) depends on how likely the data would be if the hypothesis were true (the likelihood), weighted by how plausible the hypothesis was before seeing the data (the prior), normalized by the overall probability of the data (the evidence).
Bayes' theorem is the formal framework for learning from evidence, and it runs through AI in several forms. Bayesian inference underpins probabilistic graphical models. The "prior" and "posterior" language shows up when discussing pre-training and fine-tuning: the pre-trained model is a prior belief about language, and fine-tuning updates that belief with task-specific evidence. Thompson Sampling — the bandit algorithm in your retrieval-weight experiment — is explicitly Bayesian: it maintains a Beta distribution as a prior over each arm's success rate and updates it with each observed outcome.3
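A minimal Thompson Sampling sketch for two Bernoulli arms (the success rates here are invented; Python's `random.betavariate` draws from the Beta posterior):

```python
import random

random.seed(0)

# Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
# success rate. At every step, draw one sample per posterior and pull the
# arm whose sample is highest — exploration and exploitation in one rule.
true_rates = [0.3, 0.7]   # hidden from the algorithm
wins = [0, 0]
losses = [0, 0]

for _ in range(2000):
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(2)]
    arm = draws.index(max(draws))
    if random.random() < true_rates[arm]:
        wins[arm] += 1    # Bayesian update: posterior sharpens toward the truth
    else:
        losses[arm] += 1

pulls = [wins[i] + losses[i] for i in range(2)]
print(pulls)   # the better arm (index 1) collects most of the pulls
```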
Two distributions appear constantly in AI:
The Bernoulli distribution models a single binary outcome — heads or tails, spam or not spam, clicked or didn't click. It has one parameter p, the probability of success:

P(X = 1) = p,  P(X = 0) = 1 − p
The Gaussian (normal) distribution is the bell curve. It's parameterized by its mean μ (center) and variance σ² (spread):

f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
Gaussians show up everywhere: weight initialization (weights are typically drawn from a Gaussian), the reparameterization trick in variational autoencoders, noise injection for regularization, the central limit theorem (which says that averages of many random variables converge to a Gaussian regardless of the underlying distribution — explaining why Gaussians model so many natural phenomena).
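The central limit theorem is easy to see empirically. A quick sketch (the sample sizes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)

# Averages of 50 uniform draws, repeated 10,000 times. The uniform
# distribution is flat, but the distribution of its averages is bell-shaped:
# centered at 0.5, with standard deviation sqrt(1/12) / sqrt(50) ≈ 0.041.
means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

print(means.mean())   # ~0.5
print(means.std())    # ~0.041
```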
Here's the question that connects probability to training: given a dataset and a model with adjustable parameters, what parameter values make the data most probable?
Maximum likelihood estimation (MLE) says: find the parameters θ that maximize the probability of the observed data.

θ* = argmaxθ ∏ᵢ P(xᵢ | θ)
The product is taken because we assume the data points are independent — so the joint probability is the product of individual probabilities. In practice, we take the logarithm (turning products into sums, which is numerically more stable) and minimize the negative log-likelihood instead of maximizing the likelihood. This gives us:

L(θ) = −Σᵢ log P(xᵢ | θ)
This is what training a neural network actually is. When you train a language model on a corpus of text, you're adjusting its parameters to maximize the likelihood of the training data — to make the model assign high probability to the sequences that actually occurred. When the training loss goes down, it means the model is getting better at predicting its training data. The loss function is the negative log-likelihood.
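MLE shows up concretely even on the simplest possible model, a Bernoulli coin. A sketch (the data is invented, and a grid search stands in for gradient-based optimization):

```python
import math

# Invented data: 7 heads in 10 flips. The MLE should land on p = 0.7.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def neg_log_likelihood(p):
    return -sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Scan candidate values of p; the NLL is minimized at the empirical rate.
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=neg_log_likelihood)
print(best)   # 0.7
```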
Information theory was created by Claude Shannon in 1948 to solve a specific engineering problem: how to transmit messages efficiently over noisy channels.4 It turns out to provide exactly the mathematical framework AI needs for measuring how much a model knows, how far its predictions are from reality, and what the optimal loss function should be.
Entropy measures the average amount of surprise — or equivalently, the average amount of information — in a probability distribution. For a discrete distribution with outcomes {x₁, ..., xₙ}:

H(P) = −Σᵢ p(xᵢ) log₂ p(xᵢ)
The intuition: if an event is very likely (probability near 1), it's not surprising when it happens, so it carries little information. If an event is very unlikely, it's highly surprising and carries a lot of information. Entropy is the weighted average of these surprises.
A fair coin has entropy of 1 bit — you need one bit to encode the outcome. A biased coin (say, 99% heads) has entropy near 0 — the outcome is almost certain, so it carries almost no information. A uniform distribution over 50,000 words has very high entropy. A well-trained language model that's confident the next word is "the" has low entropy for that prediction.
Entropy sets a floor on how well any model can predict a sequence. If English text has an entropy of about 1.3 bits per character (Shannon's original estimate), no model can predict it with fewer than 1.3 bits of surprise per character on average. When we report a language model's perplexity — a standard evaluation metric — it's just 2^H, two raised to the entropy in bits. Lower perplexity means the model is less surprised by the test data.
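Entropy and perplexity take only a few lines to compute. A sketch (the helper name `entropy_bits` is made up):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: -sum of p * log2(p), skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))       # 1.0 — a fair coin costs one bit
print(entropy_bits([0.99, 0.01]))     # ~0.08 — almost no surprise left
print(2 ** entropy_bits([0.5, 0.5]))  # 2.0 — perplexity of the fair coin
```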
If entropy measures surprise under the true distribution, cross-entropy measures surprise when you use a model's distribution Q to encode events that actually come from a true distribution P:

H(P, Q) = −Σₓ p(x) log q(x)
Cross-entropy is always greater than or equal to entropy: H(P, Q) ≥ H(P). The only way they're equal is if Q = P — if your model perfectly matches reality.
Cross-entropy loss is the standard loss function for classification in neural networks. When you train a model to predict the next word (or classify an image, or decide if an email is spam), you're minimizing the cross-entropy between the true distribution (one-hot: the correct answer has probability 1, everything else has probability 0) and the model's predicted distribution. This is exactly the negative log-likelihood of the correct answer:

L = −log q(correct answer)
If the model assigns probability 0.9 to the correct answer, the loss is −log(0.9) ≈ 0.105. If it assigns probability 0.01, the loss is −log(0.01) ≈ 4.6. The loss penalizes confident wrong answers much more heavily than slightly uncertain correct ones. This is the right behavior — you want the model to be punished severely for saying "I'm almost certain" and being wrong.
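The numbers above are just the negative log of the predicted probability. A sketch (the function name is illustrative):

```python
import math

def cross_entropy_loss(prob_of_correct):
    """Cross-entropy against a one-hot target: -log of the correct class's probability."""
    return -math.log(prob_of_correct)

print(cross_entropy_loss(0.9))    # ~0.105 — confident and right: tiny penalty
print(cross_entropy_loss(0.01))   # ~4.605 — confident and wrong: huge penalty
```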
Kullback-Leibler divergence (KL divergence) measures how different two probability distributions are. It's the "extra" surprise incurred by using distribution Q instead of the true distribution P:

DKL(P || Q) = Σₓ p(x) log (p(x) / q(x))
This is just the difference between cross-entropy and entropy: DKL(P || Q) = H(P, Q) − H(P). Since entropy H(P) is a constant that doesn't depend on the model, minimizing cross-entropy is equivalent to minimizing KL divergence. Training a neural network by minimizing cross-entropy is the same as finding the model distribution Q that's closest to the true distribution P, in the KL sense.
KL divergence is not symmetric: DKL(P || Q) ≠ DKL(Q || P) in general. This asymmetry matters. DKL(P || Q) penalizes Q for assigning low probability where P has high probability — it punishes the model for missing things that are likely. DKL(Q || P) penalizes Q for assigning high probability where P has low probability — it punishes the model for hallucinating things that are unlikely. This distinction shows up in variational inference and in understanding different failure modes of generative models.5
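The asymmetry is easy to verify numerically. A sketch with two invented distributions:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum of p * log(p / q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]   # invented "true" distribution
q = [0.4, 0.4, 0.2]   # invented model distribution

print(kl_divergence(p, q))   # forward KL
print(kl_divergence(q, p))   # reverse KL — a different number
print(kl_divergence(p, p))   # 0.0 — identical distributions, no extra surprise
```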
These four branches of mathematics — linear algebra, calculus, probability, and information theory — are not independent topics that happen to be used in AI. They interlock into a single, coherent picture of what training a neural network actually means.
| Branch | What it provides | Where it appears |
|---|---|---|
| Linear algebra | The representation and transformation of data | Embeddings, weight matrices, attention scores, every forward pass |
| Calculus | The ability to compute how to improve | Gradients, backpropagation, the training loop |
| Probability | The framework for reasoning under uncertainty | Model outputs, training objectives (MLE), Bayesian methods |
| Information theory | The measure of what's learned and what's lost | Loss functions (cross-entropy), evaluation (perplexity), regularization (KL terms) |
The forward pass is linear algebra: matrix multiplications transforming input vectors into output vectors. The loss function is information theory: cross-entropy measuring the gap between predictions and reality. The backward pass is calculus: the chain rule computing gradients through the network. The training objective is probability: maximum likelihood estimation finding the parameters that best explain the data. Remove any one of these and the system doesn't work.
This isn't a coincidence. Neural networks were designed by people who understood these tools and built the architecture to exploit them. The dot product appears in attention because it's the natural measure of similarity. Cross-entropy is the loss function because it's the information-theoretically optimal way to train a probabilistic model. Gradient descent works because the chain rule makes it computationally tractable. The math isn't decoration — it's the load-bearing structure.
Next: Chapter 5 — Neurons to Networks. Now that the mathematical tools are on the table, we'll use them to build the actual components: a single artificial neuron, activation functions, layers, and the mechanics of how a deep network transforms input into output.

1 PCA was developed by Karl Pearson in 1901 and independently by Harold Hotelling in 1933. It remains one of the most widely used dimensionality reduction techniques, though modern approaches like t-SNE (van der Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018) handle nonlinear structure better for visualization.
2 Rumelhart, Hinton, and Williams, "Learning representations by back-propagating errors," Nature, 1986. The chain rule itself was not new — the contribution was demonstrating that it could train multi-layer networks effectively. Earlier formulations of backpropagation include work by Linnainmaa (1970) and Werbos (1974).
3 Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, 1933. The algorithm samples from the posterior distribution over each arm's reward probability and selects the arm with the highest sample — a simple but elegant approach to the exploration-exploitation tradeoff.
4 Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, 1948. This paper founded the field of information theory. Shannon's entropy formula is engraved on his tombstone at MIT.
5 The asymmetry of KL divergence produces different behaviors in generative modeling. Minimizing DKL(P || Q) (the "forward KL") produces mode-covering behavior — Q tries to cover all of P, even at the cost of spreading probability mass too widely. Minimizing DKL(Q || P) (the "reverse KL") produces mode-seeking behavior — Q concentrates on the highest-probability regions of P, potentially missing entire modes. This distinction is discussed in depth by Bishop (2006), Pattern Recognition and Machine Learning, Chapter 10.