Chapter 2 described the biological neuron in detail: dendrites collect thousands of incoming signals, the soma integrates them, and if the total exceeds a threshold, the neuron fires an action potential down the axon. The artificial neuron is a deliberate compression of this process into a single equation.
An artificial neuron takes a vector of inputs x = [x1, x2, ..., xn], multiplies each input by a corresponding weight wi, sums the results, adds a bias term b, and passes the total through an activation function f:

y = f(w · x + b) = f(w1x1 + w2x2 + ... + wnxn + b)
That's it. Every artificial neuron in every neural network — from the perceptron in 1957 to the neurons in GPT-4 — computes some version of this. The differences between architectures are about how neurons are connected, how many there are, and what activation function f is used. The core computation is a weighted sum followed by a nonlinear transformation.
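The whole computation fits in a few lines. A minimal sketch in NumPy (the example values and the helper name `neuron` are illustrative):

```python
import numpy as np

def neuron(x, w, b, f):
    """A single artificial neuron: weighted sum, plus bias, through activation f."""
    z = np.dot(w, x) + b       # affine transformation: w . x + b
    return f(z)                # nonlinear activation

relu = lambda z: np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.4])   # weights
b = 0.1                          # bias
y = neuron(x, w, b, relu)        # z = -0.5, so ReLU outputs 0.0
```

Swapping out f is the only thing that distinguishes the activation variants discussed below; the weighted sum is identical in every case.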
In the notation from Chapter 4, the weighted sum w · x + b is a dot product plus a scalar — this is an affine transformation. It maps the input vector to a single number. The activation function then applies a nonlinear squashing or thresholding operation to that number.
The artificial neuron preserves three essential features of its biological counterpart:

- **Weighted connections.** Each input carries a weight, analogous to the strength of a synapse: some inputs count for more than others.
- **Integration.** The inputs are summed into a single value, as the soma integrates the signals arriving on its dendrites.
- **A nonlinear response.** The activation function determines how strongly the neuron "fires," echoing the biological neuron's threshold behavior.
The simplifications matter because they define the boundaries of what artificial networks can and cannot do without architectural intervention:

- **No temporal dynamics.** A biological neuron emits spikes over time; an artificial neuron outputs a single static number per forward pass.
- **No dendritic computation.** Real dendrites perform substantial nonlinear processing of their own (accurately modeling a single cortical neuron requires a 5-8 layer deep network1); the artificial neuron reduces all of its inputs to one linear sum.
- **No neurochemical context.** Neuromodulators and the surrounding chemical environment have no counterpart; the artificial neuron's behavior is fully determined by its weights, bias, and activation function.
The artificial neuron, then, is a radical simplification: it captures the computational core (weighted sum + threshold) while discarding the temporal dynamics, the dendritic computation, and the neurochemical context. That simplification is what makes the math tractable, and tractable math is what makes training possible.
The activation function f is what makes a neural network more than a system of linear equations. Without it, stacking layers of neurons would be pointless — the composition of linear functions is itself linear, so any multi-layer linear network collapses to a single equivalent linear transformation. The activation function breaks this linearity and gives networks their representational power.
The sigmoid function was the default activation for decades, from the 1980s through the early 2010s:

σ(z) = 1 / (1 + e^(-z))
It squashes any input to a value between 0 and 1, with a smooth S-shaped curve centered at z = 0. When z is very negative, the output approaches 0. When z is very positive, the output approaches 1. The biological motivation is clear: it mimics a neuron's firing rate, transitioning smoothly from "not firing" to "firing at maximum rate."
The sigmoid has two problems that made it fall out of favor for hidden layers. First, saturation: for large positive or negative inputs, the gradient is nearly zero. During backpropagation (Chapter 6), gradients get multiplied together across layers, so near-zero gradients in early layers mean those layers learn extremely slowly. This is the vanishing gradient problem, and it was the main obstacle to training deep networks for years. Second, non-zero-centered output: because the sigmoid always outputs positive values, the gradients on weights are always the same sign within a layer, which creates inefficient zig-zagging during optimization.
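The saturation problem is easy to see numerically. A small sketch using the sigmoid's derivative, σ'(z) = σ(z)(1 - σ(z)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)    # maximal at z = 0, where it equals 0.25

print(sigmoid_grad(0.0))    # 0.25: the best case
print(sigmoid_grad(10.0))   # ~4.5e-5: the gradient has effectively vanished
```

Multiply a few such near-zero gradients together across layers and the product shrinks toward zero, which is the vanishing gradient problem in miniature.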
The hyperbolic tangent function is a rescaled sigmoid:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) = 2σ(2z) - 1
It maps inputs to the range (-1, 1) instead of (0, 1), which solves the zero-centering problem — outputs can be positive or negative. It still saturates at extreme values, so the vanishing gradient problem persists, but it's generally preferred over the sigmoid for hidden layers when a saturating activation is needed.
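The "rescaled sigmoid" relationship is exact, tanh(z) = 2σ(2z) - 1, and can be checked directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is the sigmoid stretched vertically to (-1, 1) and recentered at zero
z = np.linspace(-5.0, 5.0, 101)
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```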
The Rectified Linear Unit is the modern default for hidden layers, and its simplicity is almost comically disproportionate to its impact:

ReLU(z) = max(0, z)
If the input is positive, pass it through unchanged. If it's negative, output zero. That's it.
ReLU solved the vanishing gradient problem by construction: for any positive input, the gradient is exactly 1, so gradients flow through ReLU layers without shrinking. This was the key enabler for training networks deeper than a few layers. Krizhevsky, Sutskever, and Hinton used ReLU in AlexNet (2012), which won the ImageNet competition by a large margin and is generally considered the moment deep learning became the dominant paradigm in computer vision.2
ReLU has its own failure mode: dying neurons. If a neuron's weights shift such that the weighted sum is negative for every input in the training set, the output is always zero, the gradient is always zero, and the neuron stops learning permanently. Variants like Leaky ReLU (which outputs a small fraction of negative inputs, e.g., 0.01z for z < 0) and GELU (Gaussian Error Linear Unit, used in transformers) address this, but standard ReLU remains the default starting point.
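ReLU and Leaky ReLU differ only on negative inputs. A quick sketch (the 0.01 slope is the conventional default, not a requirement):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # pass a small fraction of negative inputs through instead of zeroing them,
    # so the gradient is never exactly zero and neurons cannot "die"
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negatives become 0.0; positives pass through unchanged
print(leaky_relu(z))  # negatives become -0.02 and -0.005 instead of 0.0
```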
Softmax isn't used in hidden layers — it's an output-layer activation for classification tasks. Given a vector of raw outputs (called logits) from the final layer, softmax converts them into a probability distribution:

softmax(z)i = e^(zi) / Σj e^(zj)
Each output becomes a positive number, and all outputs sum to 1. The exponential amplifies differences between logits — the largest logit gets the largest probability, and the gap is exaggerated. If you have a network classifying images into 1,000 categories, the final layer has 1,000 neurons, and softmax turns their raw outputs into 1,000 probabilities that sum to 1.
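A minimal softmax sketch. Subtracting the maximum logit before exponentiating is the standard numerical-stability trick; it leaves the result mathematically unchanged:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # shift so the largest logit is 0 (avoids overflow)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a valid probability distribution: all entries positive, summing to 1,
# with the largest logit getting the largest probability
```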
Softmax shows up again in the transformer's attention mechanism (Chapter 9), where it converts raw attention scores into a probability distribution over tokens.
This point is worth spending time on because it's the single most important thing to understand about neural network architecture.
A linear function maps inputs to outputs via multiplication and addition: y = Wx + b. If you compose two linear functions — feed the output of one into the input of another — you get another linear function. Stack a hundred linear layers and the entire network is equivalent to a single matrix multiplication:

y = W'x + b'

where W' = W100 W99 ... W1 and b' is some combined bias. A hundred layers of computation, collapsed into one. All that architecture achieves nothing that a single layer couldn't.
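The collapse is easy to verify numerically. A sketch with two random linear layers (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Two linear layers with no activation between them...
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear layer: W' = W2 W1 and b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

assert np.allclose(two_layers, one_layer)
```

Insert a ReLU between the two layers and the equivalence breaks: no single (W', b') reproduces the composed function.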
Nonlinear activation functions break this collapse. Once you apply a sigmoid or ReLU between layers, the composition is no longer reducible to a single linear operation. Each layer can carve the input space in a new way, creating decision boundaries that are curves rather than lines, surfaces rather than planes, and — in higher dimensions — arbitrarily complex manifolds rather than hyperplanes.
A single neuron computes a single weighted sum and applies an activation function. It can learn a linear decision boundary (or a slightly curved one, depending on the activation). To learn complex functions, you need to compose neurons into layers and stack layers into networks.
A feedforward neural network (also called a multi-layer perceptron, or MLP) has three types of layers:

- **Input layer:** holds the raw input values (pixels, token embeddings, measurements); it performs no computation.
- **Hidden layers:** one or more intermediate layers of neurons, each computing weighted sums and applying an activation function.
- **Output layer:** produces the network's final answer, such as class probabilities via softmax.
When every neuron in one layer connects to every neuron in the next, the layers are called fully connected or dense. The total number of parameters (weights + biases) in a fully connected layer with m inputs and n outputs is m × n + n — one weight per connection, plus one bias per output neuron. A layer with 784 inputs and 256 hidden neurons has 200,960 parameters. Scale this across many layers, each with hundreds or thousands of neurons, and you see why modern models have millions or billions of parameters.
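The parameter count is one line of arithmetic:

```python
def dense_params(m, n):
    """Parameter count of a fully connected layer with m inputs and n outputs."""
    return m * n + n   # one weight per connection, plus one bias per output neuron

print(dense_params(784, 256))   # 200960, the example above
```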
In a trained network, each layer learns to extract progressively more abstract features from the input. This is easiest to see in image recognition networks, where researchers have directly visualized what each layer responds to:

- Early layers respond to edges, gradients, and simple color contrasts.
- Middle layers respond to textures and simple shapes assembled from those edges.
- Later layers respond to object parts, such as eyes, wheels, or windows.
- The final layers respond to whole objects, such as faces or cars.
This hierarchical feature learning is not programmed — it emerges from the training process. Nobody tells the network "layer 1 should detect edges." The network discovers, through gradient descent (Chapter 6), that detecting edges first and then combining them is the most efficient way to reduce its error. The architecture (many layers of nonlinear transformations) enables this hierarchical decomposition; the training algorithm discovers it.
A "deep" neural network is simply one with multiple hidden layers — typically more than two. The term "deep learning" refers to training deep networks effectively. This sounds trivial, but it wasn't: training networks deeper than two or three layers was essentially impossible before about 2006, because gradients either vanished (sigmoid activations) or exploded (weights growing without bound) during backpropagation through many layers.
Several breakthroughs made deep learning practical:

- **ReLU activations**, whose gradient of exactly 1 for positive inputs lets gradients pass through many layers without vanishing.
- **Principled weight initialization** (Xavier and He, covered later in this chapter), which keeps signal variance stable from layer to layer.
- **Batch normalization**, which re-normalizes activations during training so each layer receives inputs at a consistent scale.3
- **Residual connections**, which give gradients a path that bypasses layers entirely and made networks hundreds of layers deep trainable.4
- **GPUs and large datasets**, which made training at this scale computationally and statistically feasible.
The practical advantage of depth is compositional efficiency. A deep network can represent certain functions exponentially more efficiently than a shallow one.
Consider a function that depends on hierarchical structure — recognizing a face, for instance. A shallow network (one hidden layer) would need to learn every possible combination of pixel values that constitutes a face. A deep network can learn the hierarchy: edges compose into features, features compose into parts, parts compose into faces. Each layer reuses the representations learned by the layer below it, so the total number of parameters needed grows polynomially with depth rather than exponentially with the input dimension.
A concrete way to think about it: a function that a 10-layer network can represent with 10,000 parameters might require a 1-layer network with millions of parameters to approximate to the same accuracy. Depth doesn't give you new capabilities in theory (the universal approximation theorem, below, says a single hidden layer is sufficient), but it gives you those capabilities practically — with feasible numbers of parameters and reasonable training time.
The universal approximation theorem states that a feedforward network with a single hidden layer containing a sufficient number of neurons, with any "squashing" activation function (sigmoid, tanh, etc.), can approximate any continuous function on a compact subset of R^n to arbitrary accuracy.5
The theorem was proven independently by George Cybenko (1989) for sigmoid activations and by Kurt Hornik (1991) in a more general form covering any bounded, non-constant activation function. Later work by Leshno et al. (1993) extended it to non-polynomial activation functions, including ReLU.
What the theorem says: a sufficiently wide single-hidden-layer network is a universal function approximator. Given enough neurons in the hidden layer, it can learn any continuous input-output mapping.
What the theorem does not say:

- How many hidden neurons are needed. "Sufficient" can mean astronomically many.
- That any training algorithm will actually find the approximating weights.
- That the network will generalize beyond the data it was trained on.
- Anything about efficiency of representation, which is exactly where depth earns its keep.
The theorem is analogous to the fact that any continuous curve can be approximated by a polynomial of sufficiently high degree (the Weierstrass approximation theorem from analysis). That's true but not practically useful — you wouldn't approximate a complex function with a million-degree polynomial. Similarly, you could approximate a complex function with a single, enormously wide hidden layer, but deep networks achieve the same result with far fewer parameters and better generalization.
The universal approximation theorem established that neural networks are theoretically powerful enough to learn anything expressible as a continuous function. The research that followed — on depth, architecture, and training algorithms — was about making that theoretical power practically accessible.
Before a network can learn, its weights need starting values. The choice of initial weights has a surprisingly large effect on whether training succeeds at all. Initialize too large, and signals explode as they propagate through layers — activations saturate, gradients blow up. Initialize too small, and signals shrink toward zero — activations flatten, gradients vanish. Both scenarios prevent learning.
The goal of a good initialization scheme is to keep the variance of activations roughly constant across layers. If each layer preserves the scale of its inputs, then signals can propagate forward and gradients can propagate backward without growing or shrinking, regardless of network depth.
Xavier Glorot and Yoshua Bengio analyzed the variance of activations in networks with symmetric activations (like tanh) and showed that weights should be initialized from a distribution with variance:6

Var(W) = 2 / (nin + nout)
where nin is the number of inputs to the layer and nout is the number of outputs. In practice, this means drawing weights from a uniform or normal distribution scaled by the fan-in and fan-out. The intuition: layers with more connections need smaller individual weights to keep the total signal in the right range.
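A sketch of Xavier initialization using a normal distribution (the function name is illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Glorot/Bengio: draw weights with variance 2 / (n_in + n_out)."""
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

W = xavier_init(784, 256)
# the empirical variance of W lands near 2 / (784 + 256) ≈ 0.00192
```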
When ReLU replaced sigmoid/tanh as the standard activation, Xavier initialization stopped working well. ReLU zeroes out roughly half its inputs (the negative ones), which effectively halves the variance at each layer. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun derived a corrected initialization:7

Var(W) = 2 / nin
This is the Xavier principle with a factor-of-two correction (computed from fan-in alone) to account for ReLU's asymmetry. He initialization is the default for ReLU networks and is used in essentially all modern deep learning frameworks.
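The payoff is visible in a simple simulation: push a random input through many ReLU layers and check whether the signal's scale survives. A sketch (the width and depth are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He et al.: variance 2 / n_in, compensating for ReLU zeroing half the signal
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

x = rng.normal(size=512)
for _ in range(50):
    x = np.maximum(0.0, he_init(512, 512) @ x)

# after 50 layers the mean squared activation is still of order 1:
# it has neither vanished toward zero nor exploded
print(np.mean(x ** 2))
```

Repeat the experiment with a fixed small standard deviation (say 0.01) for every layer and the signal collapses toward zero within a few layers; that contrast is the whole argument for principled initialization.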
Weight initialization might seem like a minor detail, but its impact is hard to overstate. Before these principled initialization schemes, training deep networks required extensive hyperparameter tuning and often simply failed. With them, networks that are dozens or hundreds of layers deep train reliably. The underlying principle — maintaining signal variance through the network — is the same principle that motivates batch normalization, layer normalization, and residual connections. Scale stability is a recurring theme in deep learning.
Neural networks are powerful function approximators, but that power comes with constraints worth understanding clearly:

- **Interpolation, not extrapolation.** A network approximates the function implied by its training data; far outside that data, its outputs are unreliable.
- **Weak compositional generalization.** Standard architectures struggle to systematically recombine learned pieces in novel ways, even when they perform well on in-distribution test sets.8
- **Correlation, not causation.** Networks learn statistical associations; on their own, they cannot reason about interventions or counterfactuals.9
Understanding these limits isn't about dismissing neural networks — they're the most powerful learning systems humans have built. It's about knowing where the boundaries are so you can architect systems that compensate. Your Thompson Sampling bandit, for instance, handles exploration/exploitation tradeoffs that a neural network alone wouldn't address — it operates at a different level of abstraction, deciding which retrieval strategy to use rather than implementing the strategy itself.
At this point in the guide, you have the full static picture of a neural network — everything it is before it learns anything:
| Component | What it does | Where it comes from |
|---|---|---|
| Weights | Scale inputs to each neuron, determining which features matter | Initialized randomly (Xavier/He), then adjusted by training |
| Biases | Shift the activation threshold of each neuron | Typically initialized to zero |
| Activation functions | Introduce nonlinearity, enabling complex representations | Chosen by the designer (ReLU for hidden layers is the default) |
| Architecture | Number of layers, neurons per layer, connection pattern | Chosen by the designer, often guided by the problem structure |
| Parameters | The total set of learnable values (all weights + all biases) | Count determined by architecture; values determined by training |
A network with randomly initialized weights is a random function — it maps inputs to essentially arbitrary outputs. It has the capacity to represent the function you want, but it doesn't yet know which function that is. The gap between capacity and knowledge is bridged by training: showing the network examples, measuring its errors, and adjusting the weights to reduce those errors.
That process — the training loop — is where the network goes from a random function to a useful one. It's where weights find their values, where features emerge from noise, and where the network discovers the hierarchical representations that make it work. It's also where the calculus from Chapter 4 comes alive: the chain rule applied recursively through every layer, every neuron, every connection. That's the subject of Chapter 6.
Next: Chapter 6 — The Training Loop. Forward pass, loss functions, backpropagation (the chain rule applied recursively through the entire network), and gradient descent with its variants (SGD, momentum, Adam). How a random network becomes a useful one.

1 Beniaguev, Segev, and London, "Single cortical neurons as deep artificial neural networks," Neuron (2021). Showed that a deep neural network with 5-8 layers was needed to accurately predict the input-output behavior of a detailed biophysical model of a single pyramidal neuron.
2 Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (2012). Known as AlexNet. It achieved a top-5 error rate of 15.3% on ImageNet, compared to 26.2% for the second-place entry — a gap so large it effectively ended the debate about whether deep learning was viable for computer vision. Nair and Hinton (2010) had introduced ReLU for restricted Boltzmann machines slightly earlier.
3 Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015). Presented at ICML 2015.
4 He, Zhang, Ren, and Sun, "Deep Residual Learning for Image Recognition" (2015). Introduced ResNet, which won the ImageNet competition with a 152-layer network. The key idea — skip connections that allow gradients to bypass layers — remains fundamental to virtually all modern deep architectures including transformers.
5 Cybenko, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals and Systems, 2(4):303-314, 1989. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, 4(2):251-257, 1991. Leshno et al., "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, 6(6):861-867, 1993.
6 Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks," Proceedings of AISTATS, 2010.
7 He, Zhang, Ren, and Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," Proceedings of ICCV, 2015.
8 Lake and Baroni, "Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks," Proceedings of ICML, 2018. Showed that standard neural architectures fail at systematic compositional generalization even when they perform well on in-distribution test sets.
9 Pearl, The Book of Why: The New Science of Cause and Effect, Basic Books, 2018. See also Pearl and Mackenzie's argument that all current machine learning operates at the associational level of Pearl's "ladder of causation," unable to reason about interventions or counterfactuals.