Part III — Building Artificial Neural Networks

The Training Loop

Forward pass, loss functions, backpropagation, gradient descent, and why a network that can't learn is just a random number generator.

The Big Picture

Chapter 5 built the network: inputs, weights, biases, activation functions, layers. A feedforward neural network takes an input, multiplies through weight matrices, applies nonlinearities, and produces an output. But at initialization, those weights are random. The network's output is noise. Training is the process of adjusting every weight in the network so that the output becomes useful.

The training loop is how that happens, and it follows the same structure in virtually every neural network ever trained:

[Diagram: the training loop. Initialize (random weights) → Forward pass (input → prediction) → Compute loss (how wrong?) → Backward pass (backpropagation) → Update weights (gradient descent) → Repeat.]

One iteration of this loop processes one batch of data. One epoch = enough iterations to see every training example once. Training = many epochs until the loss converges.

That's it. Initialize, forward, loss, backward, update, repeat. Every neural network you've heard of — GPT, DALL-E, AlphaFold — trains with this same loop. The differences are in the architecture (what the forward pass computes), the loss function (what "wrong" means), and the optimizer (how weights get updated). The loop itself is universal.
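In code, the loop really is just a few lines. Here is a minimal sketch in plain Python — fitting the toy function y = 2x with a single weight, squared error, and plain gradient descent. All the values (data, learning rate, epoch count) are illustrative:

```python
import random

# Minimal sketch of the universal loop: one weight w, squared-error
# loss, manual gradient. Toy data for y = 2x; all values illustrative.
random.seed(0)
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = random.uniform(-1, 1)   # 1. initialize: random weight
lr = 0.05                   # learning rate (eta)

for epoch in range(200):    # 5. repeat
    for x, y in data:
        y_hat = w * x                  # 2. forward pass
        loss = (y - y_hat) ** 2        # 3. compute loss (squared error)
        grad = -2 * (y - y_hat) * x    # 4. backward pass: dL/dw
        w -= lr * grad                 #    update: step downhill

print(round(w, 3))  # converges toward 2.0
```

Every training run you will ever see — GPT included — is this skeleton with a bigger forward pass, a different loss, and a smarter update rule.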

The Forward Pass

The forward pass is the network doing its job: taking an input and producing an output. Data flows in one direction — from the input layer, through hidden layers, to the output layer. At each layer, three things happen:

  1. The input vector is multiplied by the weight matrix: z = Wx + b
  2. An activation function is applied: a = f(z)
  3. The activated output becomes the input to the next layer

For a network with two hidden layers, the full forward pass looks like this:

Forward pass — two hidden layers
z(1) = W(1)x + b(1)
a(1) = f(z(1))
z(2) = W(2)a(1) + b(2)
a(2) = f(z(2))
ŷ = g(W(3)a(2) + b(3))

Here f is the hidden layer activation (typically ReLU), g is the output activation (softmax for classification, linear for regression), and ŷ ("y-hat") is the network's prediction. The superscripts index layers, not exponents. Each W is a matrix, each b is a bias vector.

Nothing about the forward pass is learned. It's pure arithmetic — matrix multiplications and elementwise nonlinearities. The intelligence, such as it is, lives entirely in the values of W and b. The forward pass just executes whatever function those weights define. At initialization, that function is garbage. After training, it's (hopefully) useful.
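To make "pure arithmetic" concrete, here is a one-hidden-layer forward pass in plain Python (lists instead of a matrix library; the weight values are hypothetical):

```python
import math

def matvec(W, x):
    """z = Wx: one dot product per row of W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def add(z, b):
    return [z_i + b_i for z_i, b_i in zip(z, b)]

def relu(z):                       # hidden activation f
    return [max(0.0, z_i) for z_i in z]

def softmax(z):                    # output activation g (classification)
    m = max(z)                     # subtract max for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    s = sum(exps)
    return [e / s for e in exps]

# Nothing here is learned -- the weights are fixed numbers.
x  = [1.0, -2.0]
W1 = [[0.5, -0.3], [0.8, 0.2]];  b1 = [0.1, 0.0]
W2 = [[1.0, -1.0], [-0.5, 0.7]]; b2 = [0.0, 0.2]

a1    = relu(add(matvec(W1, x), b1))        # a(1) = f(W(1)x + b(1))
y_hat = softmax(add(matvec(W2, a1), b2))    # ŷ = g(W(2)a(1) + b(2))
print(y_hat)  # a probability distribution: entries sum to 1
```

Swap in different numbers for W1 and W2 and you get a different function — that, and only that, is what training changes.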

Loss Functions: Measuring How Wrong You Are

Once the forward pass produces a prediction ŷ, you need a way to measure how far it is from the true answer y. That measurement is the loss function (also called the cost function or objective function). The loss is a single number — a scalar — that quantifies the network's error on a given example. The entire point of training is to make this number smaller.

The choice of loss function depends on what the network is doing.

Mean Squared Error (MSE) — Regression

When the network is predicting a continuous value — a house price, a temperature, a stock return — the natural error measure is: how far off was the prediction?

Mean Squared Error
L = (1/n) Σi (yi − ŷi)²

For a single example, this is just (y − ŷ)². Over n examples, you average. The squaring does two important things: it makes all errors positive (an overestimate and an underestimate of the same magnitude are equally bad), and it penalizes large errors disproportionately (an error of 10 contributes 100 to the loss, not 10). This makes the network prioritize reducing its worst predictions.
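A three-line sketch makes the disproportionate penalty visible — one error of 10 contributes as much loss as a hundred errors of 1:

```python
def mse(ys, y_hats):
    """Mean squared error over n examples."""
    n = len(ys)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / n

# One error of 10 dominates ten errors of 1.
print(mse([10.0], [0.0]))           # 100.0
print(mse([1.0] * 10, [0.0] * 10))  # 1.0
```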

Cross-Entropy Loss — Classification

When the network is classifying — is this image a cat or a dog? what's the next word? — the output is a probability distribution over classes. The appropriate loss here is cross-entropy, which measures how different the predicted distribution is from the true distribution.

For binary classification (two classes):

Binary Cross-Entropy
L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

For multi-class classification with C classes:

Categorical Cross-Entropy
L = −Σc yc log(ŷc)      (sum over classes c = 1 … C)

The true label y is a one-hot vector (all zeros except a 1 at the correct class), so this sum collapses to just −log(ŷcorrect). If the network assigns probability 0.9 to the correct class, the loss is −log(0.9) ≈ 0.105. If it assigns probability 0.1, the loss is −log(0.1) ≈ 2.303. The loss grows sharply as confidence in the correct answer drops — which is exactly the behavior you want. A classifier that's confidently wrong should be punished harshly.
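Because the one-hot label collapses the sum, the implementation is a single logarithm. The numbers from the paragraph above, in code:

```python
import math

def cross_entropy(true_class, probs):
    """Categorical cross-entropy with a one-hot label: -log p(correct)."""
    return -math.log(probs[true_class])

# Confident and right: small loss.
print(round(cross_entropy(0, [0.9, 0.1]), 3))   # 0.105
# Confident and wrong: large loss.
print(round(cross_entropy(0, [0.1, 0.9]), 3))   # 2.303
```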

Cross-entropy comes from information theory (Chapter 4). It measures the number of extra bits needed to encode data from the true distribution using the predicted distribution. When the two distributions match perfectly, cross-entropy equals the entropy of the true distribution — the theoretical minimum. Any mismatch adds extra bits, which is the loss.

Backpropagation: The Chain Rule, Applied Recursively

The forward pass gives you a prediction. The loss function tells you how wrong it is. Now you need to answer a harder question: how should each weight change to reduce the loss?

This is a calculus problem. You need the partial derivative of the loss with respect to every weight in the network — the gradient. The gradient tells you the direction and magnitude of the steepest increase in the loss for each weight. To reduce the loss, you move each weight in the opposite direction.

For the output layer, this is straightforward. But for hidden layers, the relationship between a weight and the loss is indirect — a weight in layer 1 affects the activations in layer 1, which affect the pre-activations in layer 2, which affect the activations in layer 2, which affect the loss. To compute the gradient, you need to propagate the error signal backward through the chain of operations. This is backpropagation — the chain rule of calculus applied systematically from the output layer back to the input layer.

The Chain Rule, Concretely

The chain rule states that if y = f(g(x)), then dy/dx = f'(g(x)) · g'(x). You multiply the derivatives of each function in the chain. Backpropagation is just this rule applied to a deep composition of functions.

Let's work through it for a concrete 2-layer network with one neuron per layer, so the notation stays clean. The network computes:

Simple 2-layer network
z1 = w1x + b1
a1 = σ(z1)      [sigmoid activation]
z2 = w2a1 + b2
ŷ = σ(z2)
L = −[y log(ŷ) + (1−y) log(1−ŷ)]

We want ∂L/∂w1 — how does changing the first weight affect the loss? Trace the dependency chain: w1 → z1 → a1 → z2 → ŷ → L. The chain rule gives:

Chain rule — full expansion
∂L/∂w1 = (∂L/∂ŷ) · (∂ŷ/∂z2) · (∂z2/∂a1) · (∂a1/∂z1) · (∂z1/∂w1)

Now compute each term:

Each partial derivative
∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ) = (ŷ − y) / [ŷ(1−ŷ)]
∂ŷ/∂z2 = σ(z2)(1 − σ(z2)) = ŷ(1−ŷ)
∂z2/∂a1 = w2
∂a1/∂z1 = σ(z1)(1 − σ(z1)) = a1(1−a1)
∂z1/∂w1 = x

Multiply them together:

Result
∂L/∂w1 = (ŷ − y) · w2 · a1(1−a1) · x

Notice that the ŷ(1−ŷ) from the sigmoid derivative cancels with the denominator from the cross-entropy derivative. This is one reason sigmoid + cross-entropy was historically popular — the math simplifies cleanly. With ReLU, the derivative is even simpler: 1 if z > 0, and 0 otherwise.

The same calculation for ∂L/∂w2 is shorter — the chain has fewer links:

Gradient for the output weight
∂L/∂w2 = (ŷ − y) · a1

This is the key pattern. Once you compute ∂L/∂z2 = (ŷ − y), you can reuse it when computing the gradient for w1. The error signal at layer 2 gets multiplied by w2 and the local activation derivative to get the error signal at layer 1. Each layer receives the error from the layer above, multiplies by its own local derivatives, and passes the result backward. That's backpropagation.
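The derivation above can be checked in a few lines. This sketch implements the toy two-neuron network, computes the analytic gradients from the chain-rule result, and verifies them against numerical differentiation (the parameter values are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x, y):
    """The toy network: sigmoid -> sigmoid -> binary cross-entropy."""
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    y_hat = sigmoid(w2 * a1 + b2)
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return a1, y_hat, loss

# Hypothetical parameter values for the check.
w1, b1, w2, b2, x, y = 0.6, -0.1, 1.3, 0.2, 0.8, 1.0
a1, y_hat, _ = forward(w1, b1, w2, b2, x, y)

# Analytic gradients from the chain-rule derivation above.
dL_dw2 = (y_hat - y) * a1
dL_dw1 = (y_hat - y) * w2 * a1 * (1 - a1) * x

# Numerical check: perturb each weight, measure the loss change.
eps = 1e-6
num_dw1 = (forward(w1 + eps, b1, w2, b2, x, y)[2]
           - forward(w1 - eps, b1, w2, b2, x, y)[2]) / (2 * eps)
num_dw2 = (forward(w1, b1, w2 + eps, b2, x, y)[2]
           - forward(w1, b1, w2 - eps, b2, x, y)[2]) / (2 * eps)

print(abs(dL_dw1 - num_dw1) < 1e-6)  # True: analytic matches numerical
print(abs(dL_dw2 - num_dw2) < 1e-6)  # True
```

This analytic-vs-numerical comparison is exactly the "gradient check" practitioners use to debug hand-written backpropagation.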

[Diagram: computational graph, forward and backward. Forward: x → z1 = w1x + b1 → a1 = σ(z1) → z2 = w2a1 + b2 → ŷ = σ(z2) → L. Backward: the error signal flows right to left, each node multiplying by its local derivative — ∂L/∂ŷ · ∂ŷ/∂z2 = ŷ − y, then × w2, then × σ′(z1) — giving ∂L/∂w2 = (ŷ − y) · a1 and ∂L/∂w1 = (ŷ − y) · w2 · a1(1−a1) · x.]

Why Backpropagation Was the Unlock

The chain rule itself is from first-year calculus. What made backpropagation transformative was the realization that you could compute gradients for all weights in one backward pass by reusing intermediate results.

Without backpropagation, the naive approach is numerical differentiation: perturb each weight by a tiny amount, run the forward pass again, and measure how the loss changed. For a network with N weights, that requires N+1 forward passes per training step. A modern network has billions of weights. This is intractable.

Backpropagation computes the gradient for every weight using exactly one forward pass and one backward pass — regardless of how many weights the network has. The backward pass is roughly the same computational cost as the forward pass. So the cost of computing all gradients is about 2x the cost of a single prediction. For a network with a billion weights, backprop is roughly a billion times more efficient than numerical differentiation.

The algorithm was described independently several times — by Seppo Linnainmaa (1970) in the context of automatic differentiation, by Paul Werbos (1974) in his PhD thesis, and most influentially by David Rumelhart, Geoffrey Hinton, and Ronald Williams (1986), who demonstrated it training multi-layer networks and brought it to widespread attention in the neural network community.1

Key idea: Backpropagation is not a learning algorithm. It's a gradient computation algorithm. It tells you the slope of the loss with respect to every weight. What you do with that gradient — how you actually update the weights — is the job of the optimizer. Backprop computes the direction; the optimizer decides how far to step.

Gradient Descent: Walking Downhill

Once backpropagation gives you the gradient of the loss with respect to every weight, you need to use that information to update the weights. The simplest approach: move each weight a small step in the direction that reduces the loss. Since the gradient points toward steepest ascent, you subtract it:

Gradient Descent Update Rule
w ← w − η · (∂L/∂w)

Here η (eta) is the learning rate — a scalar that controls the step size. This is the most important hyperparameter in neural network training.

Think of the loss function as a landscape — a surface where every point is defined by the values of all the weights, and the height is the loss. Gradient descent is walking downhill on this surface. The gradient is the slope under your feet. You take a step in the steepest downhill direction, check the slope again, take another step. Repeat until you reach a valley.
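The update rule on a one-dimensional surface, as a sketch — a quadratic bowl L(w) = (w − 3)² whose minimum sits at w = 3:

```python
# Gradient descent on L(w) = (w - 3)^2.  dL/dw = 2(w - 3).
w, eta = 0.0, 0.1
for step in range(100):
    grad = 2 * (w - 3)
    w -= eta * grad          # w <- w - eta * dL/dw
print(round(w, 4))           # close to 3.0: the minimum
```

Each step shrinks the distance to the minimum by a fixed factor (here 1 − 2η = 0.8), which is why the loss curve flattens as training proceeds.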

[Diagram: loss landscape, 1-D cross-section — loss plotted against a weight value, showing the random-init starting point, a local minimum, and the global minimum. A learning rate that is too high overshoots and diverges; one that is too low makes slow progress and can get stuck in a local minimum.]

The Learning Rate

The learning rate η is a balancing act. Too high, and the steps overshoot the minimum — the loss oscillates or diverges outright. Too low, and training crawls — progress is slow, and the optimizer can get stuck in shallow local minima it would otherwise step over.

In practice, nobody uses a fixed learning rate for an entire training run. Learning rate schedules reduce the rate over time — you start with larger steps to make fast initial progress, then shrink the steps as you approach the minimum for finer convergence. Common schedules include step decay (halve the rate every N epochs), cosine annealing (smoothly decrease following a cosine curve), and warmup + decay (start small, ramp up, then decrease). The specifics matter less than the principle: adaptive step sizes outperform fixed ones.

Vanilla SGD

Stochastic Gradient Descent applies the basic update rule but computes the gradient on a random subset of the data rather than the full dataset (more on batching below). The update is noisy — any individual mini-batch gives a rough estimate of the true gradient — but the noise actually helps. It prevents the optimizer from settling too precisely into sharp local minima that don't generalize well, and it adds an implicit regularization effect.

The problem with vanilla SGD is that it treats every weight and every direction equally. The learning rate is one number applied uniformly. In reality, different weights may need very different step sizes — some are in flat regions of the loss surface (need bigger steps), others are in steep narrow valleys (need smaller steps).

SGD with Momentum

The first improvement: add momentum. Instead of stepping only in the direction of the current gradient, maintain a running average of past gradients and step in that direction:

SGD with Momentum
v ← β v + η (∂L/∂w)
w ← w − v

The velocity v accumulates gradients over time. β (typically 0.9) controls how much history to keep. This is like a ball rolling downhill — it builds up speed in consistent directions and dampens oscillation in directions where the gradient keeps flipping sign. If the gradient has been pointing the same way for 10 steps, momentum makes you go faster. If it's been oscillating, momentum smooths it out.
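A minimal sketch of the momentum update on the same quadratic bowl L(w) = (w − 3)², with illustrative values of η and β:

```python
# SGD with momentum on L(w) = (w - 3)^2.  dL/dw = 2(w - 3).
w, v = 0.0, 0.0
eta, beta = 0.05, 0.9
for step in range(200):
    grad = 2 * (w - 3)
    v = beta * v + eta * grad   # velocity: running average of gradients
    w -= v                      # step in the accumulated direction
print(round(w, 3))              # close to 3.0
```

Note the characteristic behavior: the weight overshoots the minimum (the "ball" has speed), then the oscillation damps out as the velocity averages over sign-flipping gradients.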

RMSProp

RMSProp (Root Mean Square Propagation), proposed by Geoffrey Hinton in an unpublished lecture in 2012, takes a different approach: adapt the learning rate per weight based on the recent magnitude of its gradients.2

RMSProp
s ← β s + (1−β)(∂L/∂w)²
w ← w − η · (∂L/∂w) / (√s + ε)

s is a running average of squared gradients. Dividing by √s normalizes the update: weights with consistently large gradients get smaller effective learning rates (preventing overshooting), and weights with consistently small gradients get larger effective learning rates (allowing them to make progress). ε (typically 10⁻⁸) prevents division by zero.

Adam

Adam (Adaptive Moment Estimation) combines both ideas — momentum and per-weight adaptive rates.3 It maintains two running statistics for each weight:

Adam (Kingma & Ba, 2014)
m ← β1 m + (1−β1)(∂L/∂w)      [first moment: momentum]
v ← β2 v + (1−β2)(∂L/∂w)²     [second moment: RMSProp]
m̂ = m / (1 − β1^t)      [bias correction]
v̂ = v / (1 − β2^t)      [bias correction]
w ← w − η · m̂ / (√v̂ + ε)

m is the first moment — the exponential moving average of the gradient (momentum). v is the second moment — the exponential moving average of the squared gradient (per-weight scaling). The bias correction terms (dividing by 1 − β^t, where t is the step count) compensate for the fact that m and v are initialized to zero and are biased toward zero in the first few steps. β1 = 0.9 and β2 = 0.999 are the standard defaults; they work well across a wide range of problems.

Adam is the default optimizer for most modern neural network training. It's not always the best — SGD with momentum sometimes finds better minima, especially for image classification tasks — but Adam is robust, requires less tuning, and converges reliably. When you don't know what to use, use Adam.
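The update equations above, as a minimal scalar sketch on the same quadratic bowl L(w) = (w − 3)² — an illustration, not a library implementation:

```python
import math

# Scalar Adam on L(w) = (w - 3)^2, following the equations above.
w = 0.0
m, v, t = 0.0, 0.0, 0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for _ in range(500):
    t += 1
    grad = 2 * (w - 3)
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (scaling)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 2))  # close to 3.0
```

Notice that early steps have size roughly η regardless of the raw gradient magnitude — the √v̂ denominator normalizes it away — which is part of why Adam needs so little learning-rate tuning.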

Optimizer What it does When to use
Vanilla SGD Constant learning rate, no memory Almost never on its own
SGD + Momentum Accumulates velocity in consistent directions Vision tasks, when you can afford careful tuning
RMSProp Per-weight adaptive learning rate RNNs (historically); largely superseded by Adam
Adam Momentum + adaptive rates + bias correction Default choice. LLMs, transformers, most tasks
AdamW Adam with decoupled weight decay Current best practice for transformers4

Batching: How Much Data Per Step

In principle, you could compute the gradient on every training example, average them, and take one step. That's full-batch gradient descent. The gradient would be precise — a perfect estimate of the true gradient of the loss over the entire dataset. But for a dataset with millions of examples, computing the full-batch gradient for one step would be extremely expensive. And you'd have to do this thousands of times to converge.

At the other extreme, you could compute the gradient on a single example and update immediately. That's stochastic gradient descent in the original sense — each step is based on one randomly selected example. The gradient is a very noisy estimate of the true gradient, but you get to update the weights after every single example. The noise means the path is erratic, but over many steps, the average direction is correct.

The practical middle ground is mini-batch gradient descent: compute the gradient on a small random subset (a "batch") of the data, typically 32 to 512 examples, and update. This gives you a gradient that's noisy but not wildly so — a reasonable estimate that lets you update frequently.

Key idea: Mini-batch training wins for three reasons. (1) It's computationally efficient — modern GPUs are designed for parallel matrix operations, and a batch of 256 examples can be processed nearly as fast as a single example. (2) The gradient noise acts as implicit regularization, helping the model avoid sharp minima that don't generalize. (3) You get to update the weights many times per pass through the data, so convergence is faster in wall-clock time even if each individual step is less precise than a full-batch step.
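The batching logic itself is simple: shuffle once per epoch, then slice. A sketch with a hypothetical dataset of 1,000 examples and batch size 32:

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled fixed-size slices; last batch may be smaller."""
    indices = list(range(len(data)))
    rng.shuffle(indices)                      # new order every epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(1000))                      # stand-in for 1,000 examples
rng = random.Random(0)
batches = list(minibatches(data, 32, rng))
print(len(batches))             # 32 batches: 31 full + 1 of 8 examples
```

One full pass over `batches` is one epoch: every example seen exactly once, in a fresh random order.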

Terminology to keep straight: a batch (or mini-batch) is the subset of examples used for one gradient estimate; an iteration (or step) is one forward-loss-backward-update cycle on one batch; an epoch is enough iterations to see every training example once.

A typical training run lasts tens to hundreds of epochs. The loss generally decreases quickly in the early epochs and then plateaus, with diminishing returns on each additional epoch. When to stop is its own question — addressed next.

Overfitting: Memorization vs. Learning

Training loss going down is necessary but not sufficient. The real question is whether the network's performance generalizes — whether it does well on data it has never seen.

A network with enough capacity (enough weights relative to the amount of training data) can memorize the training set exactly. Given sufficient parameters, the network can learn a unique mapping for every training example without extracting any general pattern. The training loss drops to near zero. But on new data, the network is useless — it learned the noise and idiosyncrasies of the training set, not the underlying structure.

This is overfitting, and it's the central tension in machine learning: you want the model complex enough to capture real patterns, but not so complex that it fits the noise.

The diagnostic tool is simple: track two loss curves during training.

Training vs. Validation Loss Epochs Loss Training Validation Overfitting begins Both losses decreasing = learning Gap widening = memorizing generalization gap

The training loss is computed on the data the model is learning from. It almost always decreases. The validation loss is computed on a held-out set the model never trains on. When validation loss stops decreasing and starts rising while training loss continues to fall, the model has started overfitting — it's memorizing the training data instead of learning transferable patterns.

The gap between training and validation loss is called the generalization gap. Some gap is normal — a model will always perform slightly better on data it was trained on. But a large and growing gap means the model's complexity is being used to fit noise rather than signal.

Regularization: Keeping the Model Honest

Regularization is any technique that constrains the model to prevent overfitting. The idea: give the model less room to memorize, forcing it to learn simpler, more general patterns.

Dropout

Dropout, introduced by Nitish Srivastava et al. in 2014 (following an idea by Geoffrey Hinton), randomly sets a fraction of neuron outputs to zero during each training step.5 A typical dropout rate is 0.5 for hidden layers — meaning half the neurons are randomly silenced on each forward pass.

This forces the network to be redundant. No single neuron can become a "hero" that the entire network depends on, because it might be dropped on any given step. The network has to distribute knowledge across neurons, building multiple overlapping representations. At inference time, all neurons are active, with outputs scaled to compensate for the larger number of active units, and the model effectively averages over the many sub-networks seen during training.

Dropout is also interpretable as training an ensemble of many different subnetworks (each with a different random subset of neurons active) and averaging their predictions at test time. Ensembles are a well-known technique for reducing variance, and dropout achieves a similar effect at the cost of roughly zero additional computation.
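A sketch of the common "inverted" dropout variant, which does the rescaling at training time (dividing survivors by 1 − p) so that inference needs no adjustment at all:

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during
    training, scaling survivors by 1/(1-p) to keep the expected value
    of each output unchanged."""
    if not training:
        return list(activations)       # inference: all neurons active
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
a = [1.0] * 10000
dropped = dropout(a, p=0.5, rng=rng)
# About half the outputs are zeroed; survivors are scaled to 2.0,
# so the mean output stays close to 1.0.
print(round(sum(dropped) / len(dropped), 1))
```

Each training step thus samples a different random sub-network — the ensemble interpretation made literal.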

Weight Decay (L2 Regularization)

Weight decay adds a penalty to the loss function proportional to the squared magnitude of the weights:

L2 Regularization
Ltotal = Ldata + λ Σ wi²

λ is the regularization strength. This penalty discourages any individual weight from growing large. Large weights mean the network is relying heavily on specific features — which is often a sign of overfitting to training-data-specific patterns. By penalizing large weights, the model is pushed toward simpler functions with smaller, more evenly distributed weights.
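The penalty term itself is a one-liner. A sketch with hypothetical weights and a stand-in data loss:

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam * sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

data_loss = 0.42                          # stand-in for L_data
weights = [0.5, -1.2, 0.03, 2.0]
total = data_loss + l2_penalty(weights, lam=0.01)
print(round(total, 4))   # 0.4769: the 2.0 weight dominates the penalty
```

Because the penalty is quadratic, the single large weight (2.0) contributes far more than the three small ones combined — exactly the pressure toward evenly distributed weights described above.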

In Adam, weight decay is slightly more nuanced. The original Adam paper applied L2 regularization by adding the penalty to the gradient, but Loshchilov and Hutter (2019) showed this interacts poorly with Adam's adaptive learning rates. Their fix — AdamW — decouples weight decay from the gradient, applying it directly to the weights instead.4 AdamW is now the standard optimizer for training transformers.

Early Stopping

The simplest regularization of all: stop training when validation loss stops improving. Monitor the validation loss after each epoch. If it hasn't improved for some number of epochs (the "patience"), stop and use the model weights from the epoch with the lowest validation loss.

Early stopping is free — it doesn't change the model or the training procedure. It just prevents you from continuing to train past the point of diminishing returns. In the training-vs-validation loss diagram above, you'd stop training roughly where the validation curve bottoms out.
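The patience logic in a few lines, applied to a hypothetical validation-loss curve that improves and then overfits:

```python
def early_stop(val_losses, patience):
    """Return (best_epoch, best_loss); stop scanning once the loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, val_losses[0]
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss     # patience exhausted
    return best_epoch, best_loss

# Hypothetical validation curve: improves, then starts overfitting.
curve = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.63, 0.70]
print(early_stop(curve, patience=3))  # (3, 0.6): keep epoch 3's weights
```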

Other Techniques

A few more worth noting without deep dives: batch normalization6 normalizes each layer's inputs across the mini-batch, stabilizing and accelerating training (with a mild regularizing side effect); data augmentation expands the training set with label-preserving transformations of existing examples; gradient clipping caps the magnitude of gradients to prevent a single bad batch from destabilizing the weights.

The Full Loop, Revisited

Putting it all together, a complete training run looks like this:

  1. Initialize weights — typically small random values drawn from a careful distribution (Xavier or He initialization, calibrated to the layer width and activation function).
  2. For each epoch (one pass through the data):
    1. Shuffle the training data.
    2. For each mini-batch:
      1. Forward pass: Input → layers → prediction.
      2. Compute loss: Compare prediction to ground truth.
      3. Backward pass: Backpropagate to get gradients for all weights.
      4. Update weights: Apply the optimizer (Adam, SGD, etc.).
    3. Compute validation loss on the held-out set.
    4. Check stopping criteria (early stopping, target loss, max epochs).
  3. Return the best model — the weights from the epoch with the lowest validation loss.

That's the complete algorithm. Everything else in training — batch normalization, dropout, learning rate schedules, gradient clipping, mixed-precision arithmetic — is optimization and regularization layered on top of this fundamental loop. The loop itself has remained essentially unchanged since Rumelhart, Hinton, and Williams demonstrated it in 1986.

Key idea: The training loop is gradient descent guided by backpropagation. It is not intelligent search. It is not evolutionary. It does not reason about what the weights should be. It computes a local slope and takes a step downhill. The fact that this procedure — repeated billions of times on billions of examples — produces systems that can write poetry, diagnose diseases, and play chess at superhuman level is one of the most surprising empirical results in the history of science. Nothing in the theory guarantees it should work this well. But it does.
Next: Chapter 7 — GPUs and the Hardware Revolution. Why parallel matrix math on graphics cards was the unlock. The CUDA moment. Why CPUs couldn't do it. What changed when Nvidia accidentally enabled deep learning.

1 Rumelhart, Hinton, and Williams (1986), "Learning representations by back-propagating errors." Nature 323:533-536. This paper didn't invent backpropagation but demonstrated its effectiveness on multi-layer networks, making it the standard training method. Werbos described the algorithm in his 1974 Harvard PhD thesis. Linnainmaa's 1970 master's thesis at the University of Helsinki formalized reverse-mode automatic differentiation, which is the mathematical framework underlying backpropagation.

2 Hinton proposed RMSProp in Lecture 6e of his Coursera course "Neural Networks for Machine Learning" (2012). It was never formally published in a paper, which makes it one of the most cited unpublished results in the field.

3 Kingma and Ba (2014), "Adam: A Method for Stochastic Optimization." Published at ICLR 2015. arXiv:1412.6980. One of the most cited papers in deep learning — it has over 150,000 citations as of 2025.

4 Loshchilov and Hutter (2019), "Decoupled Weight Decay Regularization." Published at ICLR 2019. arXiv:1711.05101. Introduced AdamW, which is now the default optimizer for training large language models including GPT and similar architectures.

5 Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014), "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15:1929-1958.

6 Ioffe and Szegedy (2015), "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Proceedings of the 32nd International Conference on Machine Learning (ICML). arXiv:1502.03167.