By the end of Chapter 7, you had the hardware to train deep networks. But the network described in Chapters 5 and 6 — layers of fully connected neurons, each connected to every neuron in the next layer — has a fundamental problem: it treats every input as an unstructured bag of numbers. Feed it an image, and it doesn't know that pixel (3, 4) is next to pixel (3, 5). Feed it a sentence, and it doesn't know that word 7 came after word 6. All spatial and temporal structure is lost the moment data enters the network.
This chapter is about the architectures that solved that problem. Each one was built to exploit a specific kind of structure in data — spatial regularity, sequential dependence, compressed representation, adversarial dynamics. They were not discovered by searching some abstract space of possible designs. They were engineered, one at a time, by people who understood the structure of their data and designed networks to match it.
In 1959, David Hubel and Torsten Wiesel inserted electrodes into the primary visual cortex of anesthetized cats and projected simple shapes onto a screen. They discovered that individual neurons responded to specific features — edges at particular orientations — and only within a restricted region of the visual field they called the neuron's receptive field. Deeper in the visual cortex, neurons responded to increasingly complex patterns: corners, then shapes, then objects. This hierarchical, spatially local processing — simple features composing into complex ones — earned them the Nobel Prize in Physiology or Medicine in 1981.1
This is the biological insight behind convolutional networks. The visual cortex doesn't process an entire image at once with a single massive set of connections. It processes local patches, detects simple features, and composes them into complex representations through a hierarchy of layers. CNNs do exactly the same thing, in math.
The core operation is straightforward. Take a small matrix — a filter or kernel, typically 3x3 or 5x5 — and slide it across the input image. At each position, compute the element-wise product of the filter and the patch of the image it overlaps, then sum the results. That sum becomes one value in the output. The output matrix is called a feature map.
Formally, for a 2D input I and a kernel K of size m × n, the output feature map S at position (i, j) is:

S(i, j) = Σu Σv I(i + u, j + v) · K(u, v)

where u runs over the kernel's m rows and v over its n columns.
That's a dot product between the kernel and the local patch of the input. Recall from Chapter 4 that a dot product measures how much two vectors point in the same direction. Here it measures how much a local patch of the image matches the pattern encoded in the kernel. A horizontal edge detector kernel produces a high value when it slides over a horizontal edge, and near zero when it slides over a uniform region.
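A minimal numpy sketch makes the sliding dot product concrete. The `conv2d` helper and the edge-detector kernel below are illustrative, not a framework API:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; each output value is the dot
    product of the kernel with the patch it overlaps (valid positions only)."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

# A horizontal edge detector: positive weights on top, negative on the bottom.
edge_kernel = np.array([[ 1,  1,  1],
                        [ 0,  0,  0],
                        [-1, -1, -1]], dtype=float)

img = np.zeros((6, 6))
img[:3, :] = 1.0  # bright top half, dark bottom half: one horizontal edge

fmap = conv2d(img, edge_kernel)
# The feature map peaks (value 3) along the edge and is zero in uniform regions.
```

Strictly, this loop computes cross-correlation (no kernel flip), which is what deep learning libraries also do; since the kernels are learned, the distinction doesn't matter in practice.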
Two properties make this operation powerful for spatial data:

- Local connectivity. Each output value depends only on a small patch of the input, mirroring the receptive fields Hubel and Wiesel found: nearby pixels are processed together, distant ones are not.
- Parameter sharing. The same kernel is reused at every position, so a feature detector learned in one part of the image works everywhere. This slashes the parameter count and makes the output translation-equivariant: shift the input, and the feature map shifts with it.
After convolution, a pooling layer reduces the spatial dimensions by summarizing local regions. The most common variant, max pooling, takes the maximum value in each small window (typically 2x2). This does two things: it reduces the number of parameters in subsequent layers, and it introduces a degree of translation tolerance — small shifts in the input change the exact feature map values but often leave the pooled output unchanged.
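Max pooling is even simpler to sketch. The `max_pool` helper below is a hypothetical implementation that reshapes the feature map into non-overlapping 2x2 windows and keeps each window's maximum:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: split the feature map into
    size x size windows and keep each window's maximum value."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    windows = fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return windows.max(axis=(1, 3))

x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [1, 0, 7, 8]], dtype=float)

pooled = max_pool(x)  # [[4., 1.], [1., 8.]]
```

Note that moving the 4 anywhere within its 2x2 window would leave `pooled` unchanged: that is the translation tolerance described above.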
A typical CNN stacks these operations: convolution, activation (ReLU), pooling, repeat. Each layer's feature maps become the input to the next layer's kernels. The result is a hierarchy where early layers detect edges and textures, middle layers detect parts (eyes, wheels, corners), and deep layers detect whole objects or scenes. This is the same hierarchy Hubel and Wiesel found in the visual cortex, implemented in linear algebra.
LeNet-5 (Yann LeCun et al., 1998) was the architecture that proved CNNs worked in practice. It had two convolutional layers, two pooling layers, and three fully connected layers at the end — modest by today's standards, but it achieved state-of-the-art performance on handwritten digit recognition (MNIST) and was deployed by the US Postal Service for reading ZIP codes. The core ideas — learned kernels, pooling, hierarchical features — were in place by 1989 in LeCun's earlier work, but LeNet-5 was the mature version that got deployed.2
Then the field waited. For over a decade, CNNs were a known technique but not widely used. The hardware wasn't there (Chapter 7), and support vector machines dominated machine learning benchmarks through the 2000s.
AlexNet (Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012) was the architecture that changed the field. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a top-5 error rate of 15.3% — the runner-up, a hand-engineered feature pipeline, had 26.2%. That gap of nearly 11 points, achieved by a deep neural network trained on GPUs, was the signal that deep learning had crossed a threshold. AlexNet used five convolutional layers, three fully connected layers, ReLU activations, dropout for regularization, and critically, was trained on two Nvidia GTX 580 GPUs — one of the first demonstrations that GPU training was the path forward.3
VGGNet (Karen Simonyan and Andrew Zisserman, 2014) pushed the idea to its logical extreme: go deeper, but keep the architecture simple. VGG-16 and VGG-19 used only 3x3 kernels stacked repeatedly — 16 to 19 layers deep. The insight was that two stacked 3x3 convolutions have the same effective receptive field as one 5x5 convolution but with fewer parameters and more nonlinearity. VGGNet showed that depth itself was a powerful lever.4
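The parameter arithmetic behind that insight is quick to verify. The channel count C below is an illustrative choice, and biases are ignored:

```python
C = 64  # input and output channels at this stage (an illustrative choice)

# Parameters per conv layer = kernel_h * kernel_w * C_in * C_out (biases ignored).
params_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers: 18 * C**2
params_one_5x5 = 5 * 5 * C * C        # one 5x5 layer:          25 * C**2

# Same 5x5 effective receptive field, 28% fewer parameters, and an extra
# nonlinearity (ReLU) between the two 3x3 layers.
```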
But VGGNet also exposed a problem. Going deeper should, in principle, never hurt — a deeper network can always learn the identity function for its extra layers and match a shallower one. In practice, deeper networks trained worse. Gradients either vanished (became too small to update early layers) or exploded (became too large and destabilized training). Twenty layers worked. Fifty didn't.
ResNet (Kaiming He et al., 2015) solved this with skip connections — also called residual connections or shortcut connections. Instead of asking a layer to learn the desired output directly, ResNet asks it to learn the residual: the difference between the input and the desired output. The input is added back via a shortcut that bypasses the layer entirely. The block's output is:

y = F(x) + x
Here F(x) is whatever the layer computes, and x is the layer's input, passed through unchanged. If the optimal function for a layer is close to the identity, the layer only needs to learn F(x) ≈ 0, which is far easier than learning the identity itself. More importantly, the skip connection gives gradients a direct path backward through the network during backpropagation, bypassing the chain of multiplications that causes vanishing gradients.
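A toy residual block shows why the identity case is easy. The two-matrix form of F below is an illustrative assumption, not ResNet's exact block:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = F(x) + x: the shortcut adds the input back to whatever the
    layer computes, so the block learns a correction, not a replacement."""
    F = W2 @ relu(W1 @ x)  # the residual function the layer learns
    return F + x           # skip connection: addition, not replacement

x = np.arange(8, dtype=float)
W_zero = np.zeros((8, 8))

# With all-zero weights, F(x) = 0 and the block is exactly the identity.
y = residual_block(x, W_zero, W_zero)  # y == x
```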
ResNet made it possible to train networks with 152 layers — and won ILSVRC 2015 with a 3.57% top-5 error rate, surpassing human performance on the benchmark for the first time. The skip connection is one of the most consequential ideas in deep learning; it reappears in the transformer architecture (Chapter 9), in U-Net for image segmentation, and in virtually every deep network designed after 2015.5
| Architecture | Year | Depth | Key Innovation | ImageNet Top-5 Error |
|---|---|---|---|---|
| LeNet-5 | 1998 | 7 | Learned convolutional filters | N/A (MNIST) |
| AlexNet | 2012 | 8 | GPU training, ReLU, dropout | 15.3% |
| VGGNet | 2014 | 16-19 | Uniform 3x3 kernels, depth | 7.3% |
| GoogLeNet | 2014 | 22 | Inception modules (parallel filter sizes) | 6.7% |
| ResNet | 2015 | 152 | Skip connections | 3.57% |
CNNs solved spatial structure. But language, music, time series, and video are sequential — the meaning of a word depends on the words before it, and a stock price today depends on prices from previous days. A feedforward network (even a CNN) processes each input independently. It has no memory of what it saw before.
A recurrent neural network (RNN) addresses this by introducing a loop. At each time step t, the network receives the current input xt and the hidden state ht-1 from the previous time step. It produces a new hidden state ht and (optionally) an output yt:

ht = tanh(Whh ht-1 + Wxh xt + bh)
yt = Why ht + by
The hidden state ht is the network's memory. It's a fixed-size vector that gets updated at every time step, carrying forward a compressed summary of everything the network has seen so far. The weight matrices Whh, Wxh, and Why are shared across all time steps — the same parameters are reused at every position in the sequence, just like a CNN reuses its kernel at every position in an image.
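A single recurrent step is a few lines of numpy. The dimensions and the 0.1 weight scale here are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, Whh, Wxh, Why, bh, by):
    """One time step: fold the new input into the hidden state, emit an
    output. The same weight matrices are reused at every step."""
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t + bh)
    y_t = Why @ h_t + by
    return h_t, y_t

rng = np.random.default_rng(1)
d_h, d_x, d_y = 4, 3, 2
Whh = rng.standard_normal((d_h, d_h)) * 0.1
Wxh = rng.standard_normal((d_h, d_x)) * 0.1
Why = rng.standard_normal((d_y, d_h)) * 0.1
bh, by = np.zeros(d_h), np.zeros(d_y)

h = np.zeros(d_h)  # empty memory before the sequence starts
for x_t in rng.standard_normal((5, d_x)):  # a length-5 toy sequence
    h, y = rnn_step(x_t, h, Whh, Wxh, Why, bh, by)
```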
When you unroll an RNN through time, it becomes a very deep feedforward network — one layer per time step — with the same weights at every layer. Training uses backpropagation through time (BPTT): the standard backpropagation algorithm applied to the unrolled network. And this is where the problems start.
When gradients are backpropagated through many time steps, they pass through a chain of matrix multiplications — one per time step. If the weight matrix Whh has eigenvalues less than 1, the gradients shrink exponentially. If the eigenvalues are greater than 1, the gradients grow exponentially. In practice, for sequences longer than about 10-20 steps, the gradient signal either vanishes (early time steps receive negligible updates) or explodes (updates become numerically unstable).
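The effect is easy to demonstrate with scaled identity matrices standing in for Whh, a deliberately simplified stand-in that ignores the tanh derivative:

```python
import numpy as np

# Backpropagation through T time steps multiplies the gradient by (roughly)
# W_hh once per step. Scaled identity matrices make the effect transparent:
# every eigenvalue is 0.5 or 1.5.
T = 20
grad = np.ones(4)

W_small = 0.5 * np.eye(4)  # eigenvalues 0.5 < 1: the gradient vanishes
W_large = 1.5 * np.eye(4)  # eigenvalues 1.5 > 1: the gradient explodes

g_vanish, g_explode = grad.copy(), grad.copy()
for _ in range(T):
    g_vanish = W_small @ g_vanish
    g_explode = W_large @ g_explode

# After 20 steps: 0.5**20 is about 1e-6, while 1.5**20 is about 3300.
```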
This was identified formally by Sepp Hochreiter in his 1991 diploma thesis and later by Yoshua Bengio et al. in 1994.6 The consequence is devastating for language: a vanilla RNN can't learn that the subject of a sentence ("The cat that sat on the mat") determines the verb form 8 words later ("was" not "were"). The gradient signal from the verb to the subject has been multiplied by the weight matrix 8 times, and by then it's effectively zero.
Gradient clipping (capping the gradient norm to a maximum value) helps with explosion, but there's no simple fix for vanishing. The solution required a fundamentally different architecture.
The Long Short-Term Memory (LSTM) network, introduced by Sepp Hochreiter and Jurgen Schmidhuber in 1997, solved the vanishing gradient problem by redesigning the recurrent unit.7 Instead of a single hidden state updated by a tanh nonlinearity, the LSTM maintains a cell state — a separate memory vector that runs through time with minimal modification. Information is added to or removed from the cell state through learned gates.
Three gates control the flow of information:

- The forget gate ft decides what to discard from the cell state.
- The input gate it decides what new information to write into the cell state.
- The output gate ot decides how much of the cell state to expose as the hidden state ht.
The equations:

ft = σ(Wf · [ht-1, xt] + bf)
it = σ(Wi · [ht-1, xt] + bi)
C̃t = tanh(WC · [ht-1, xt] + bC)
Ct = ft ⊙ Ct-1 + it ⊙ C̃t
ot = σ(Wo · [ht-1, xt] + bo)
ht = ot ⊙ tanh(Ct)
Where σ is the sigmoid function, ⊙ is element-wise multiplication, and [ht-1, xt] means concatenating the two vectors.
The critical line is the cell state update: Ct = ft ⊙ Ct-1 + it ⊙ C̃t. The cell state is updated by addition, not by repeated multiplication through a weight matrix. This is the key insight. The forget gate can hold values close to 1, allowing information to flow through the cell state across many time steps with minimal degradation — the gradient has a highway to travel on. It's the same principle as ResNet's skip connection, discovered two decades earlier but for a different problem.
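The gate equations translate directly into numpy. The parameter layout below (one weight matrix per gate over the concatenated [ht-1, xt]) is one common convention; shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step, following the gate equations line by line."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate
    i = sigmoid(p["Wi"] @ z + p["bi"])         # input gate
    C_tilde = np.tanh(p["Wc"] @ z + p["bc"])   # candidate memory
    C = f * C_prev + i * C_tilde               # additive cell-state update
    o = sigmoid(p["Wo"] @ z + p["bo"])         # output gate
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
p = {W: rng.standard_normal((d_h, d_h + d_x)) * 0.1
     for W in ("Wf", "Wi", "Wc", "Wo")}
p.update({b: np.zeros(d_h) for b in ("bf", "bi", "bc", "bo")})

h, C = lstm_step(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h), p)
```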
LSTMs dominated sequence modeling from the late 2000s through 2017. They powered Google Translate (before the switch to transformers), speech recognition systems at Baidu and Google, and handwriting recognition. They were the first architecture to make deep learning work on sequential data at production scale.
The Gated Recurrent Unit (GRU), introduced by Kyunghyun Cho et al. in 2014, is a simplified version of the LSTM.8 It merges the cell state and hidden state into a single vector and uses two gates instead of three: an update gate, which controls how much of the previous hidden state to keep (playing the combined role of the LSTM's forget and input gates), and a reset gate, which controls how much of the previous state to use when computing the new candidate.
The GRU has fewer parameters than the LSTM and trains faster. In practice, the two perform comparably on most tasks — neither consistently dominates. The choice between them is often pragmatic: GRUs train faster, LSTMs have slightly more expressive capacity for very long sequences.
Autoencoders take a different approach entirely. Instead of classifying inputs or predicting sequences, an autoencoder learns to reconstruct its own input. The architecture has two halves: an encoder that compresses the input into a lower-dimensional representation (the latent vector or bottleneck), and a decoder that reconstructs the original input from that compressed representation.
The training objective is to minimize the difference between input and output — typically measured by mean squared error or binary cross-entropy. The network is forced to learn a compressed representation because the bottleneck has fewer dimensions than the input. If the input has 784 dimensions (a 28x28 image) and the bottleneck has 32, the encoder must learn to preserve the 32 most informative features while discarding the rest. The decoder must learn to reconstruct the full image from only those 32 values.
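A forward pass through an untrained autoencoder of exactly these sizes shows the shapes involved. The weights here are random, so the reconstruction is poor by construction; training would adjust them to shrink `mse`:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent = 784, 32  # a 28x28 image squeezed through a 32-dim bottleneck

# Untrained, randomly initialized encoder and decoder (one linear layer each).
W_enc = rng.standard_normal((d_latent, d_in)) * 0.01
W_dec = rng.standard_normal((d_in, d_latent)) * 0.01

x = rng.random(d_in)                 # a fake flattened image
z = np.maximum(0.0, W_enc @ x)       # encoder: 784 -> 32 (ReLU bottleneck)
x_hat = W_dec @ z                    # decoder: 32 -> 784
mse = np.mean((x - x_hat) ** 2)      # the reconstruction loss to minimize
```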
This is interesting for several reasons:

- Compression: the latent vector is a learned, task-specific form of dimensionality reduction, a nonlinear cousin of PCA.
- Anomaly detection: inputs unlike the training data reconstruct poorly, so reconstruction error can flag outliers.
- Denoising: train the network to reconstruct clean inputs from corrupted ones, and it learns robust features (the denoising autoencoder of Vincent et al.).9
Autoencoders by themselves are useful but not revolutionary. Their importance is conceptual: they introduced the idea that a network can learn a useful representation by being trained on a reconstruction task, without labeled data. This idea — learning representations from unlabeled data — is the foundation of self-supervised learning, which underpins modern LLMs.
The variational autoencoder (VAE), introduced by Diederik Kingma and Max Welling in 2013, extends the autoencoder by making the latent space probabilistic: the encoder outputs a mean and variance for each latent dimension, and the decoder reconstructs from a sample drawn from that distribution.10 This makes the latent space smooth and continuous — you can sample new points from it and decode them into plausible outputs. VAEs were one of the first practical generative models: networks that don't just analyze data but create new data.
In 2014, Ian Goodfellow and colleagues proposed a radically different approach to generative modeling. Instead of learning to compress and reconstruct data, a Generative Adversarial Network (GAN) pits two networks against each other:11

- The generator G takes a random noise vector z and transforms it into a synthetic data sample G(z).
- The discriminator D takes a sample (real or generated) and outputs the probability that it is real.
Both networks are trained simultaneously. The generator tries to fool the discriminator; the discriminator tries not to be fooled. Formally, they play a minimax game:

minG maxD V(D, G) = Ex∼pdata[log D(x)] + Ez∼pz[log(1 − D(G(z)))]
The first term rewards the discriminator for correctly identifying real data (outputting values close to 1). The second term rewards the discriminator for correctly rejecting fake data (outputting values close to 0 for generated samples), while the generator wants to maximize D(G(z)) — make the discriminator think the generated data is real.
At equilibrium — which is hard to reach in practice — the generator produces data indistinguishable from the real distribution, and the discriminator outputs 0.5 for everything (it can't tell the difference). The generator has learned the data distribution without ever seeing an explicit reconstruction loss.
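The value of the objective at that equilibrium is easy to compute. The `gan_value` helper below is illustrative, evaluating V(D, G) on small batches of discriminator outputs:

```python
import numpy as np

def gan_value(D_real, D_fake):
    """V(D, G): mean log D(x) over real samples plus mean log(1 - D(G(z)))
    over generated samples."""
    return np.mean(np.log(D_real)) + np.mean(np.log(1.0 - D_fake))

# A confident, correct discriminator scores high.
v_good_D = gan_value(np.array([0.99, 0.98]), np.array([0.01, 0.02]))

# At equilibrium D outputs 0.5 everywhere, giving V = -2 log 2 (about -1.386).
v_equilibrium = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```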
GANs produced stunning results — photorealistic face generation (StyleGAN, Karras et al., 2019), image-to-image translation (pix2pix, Isola et al., 2017), super-resolution, style transfer. But they also brought notorious training difficulties:

- Mode collapse: the generator discovers a few samples that reliably fool the discriminator and produces only those, ignoring most of the data distribution.
- Instability: the two networks must stay roughly balanced; if the discriminator becomes too strong, the generator's gradients vanish, and training oscillates or diverges.
- Evaluation: with no explicit likelihood or reconstruction loss, there is no single number that says how good the generator is; proxy metrics such as Inception Score and FID are imperfect.
GANs demonstrated something conceptually important: you can train a generative model without defining an explicit reconstruction loss — the discriminator is the loss function, and it's learned alongside the generator. This adversarial training principle has influenced many subsequent approaches, including some aspects of RLHF (reinforcement learning from human feedback) used in modern LLM alignment.
Each architecture in this chapter was built to solve a specific limitation:
| Architecture | Problem Solved | Limitation Remaining |
|---|---|---|
| Fully connected | Universal function approximation | No spatial/temporal structure; doesn't scale |
| CNN | Spatial patterns (parameter sharing, local receptive fields) | Fixed receptive field; no sequential reasoning |
| RNN | Sequential dependencies (hidden state carries memory) | Vanishing gradients; can't learn long-range dependencies |
| LSTM / GRU | Long-range dependencies (gated memory) | Sequential processing (can't parallelize); fixed-size memory |
| Autoencoder | Unsupervised feature learning; compression | Reconstruction objective limits generation quality |
| GAN | High-quality generation without explicit density | Training instability; mode collapse; hard to evaluate |
Notice the pattern. CNNs solved the structure problem for space. RNNs solved the structure problem for time. LSTMs and GRUs solved the memory problem within RNNs. Each solution introduced new constraints. CNNs have fixed, local receptive fields — they can't easily model global relationships across an image. LSTMs process sequences one step at a time — they can't be parallelized across the sequence length, which makes training on long sequences slow. And the fixed-size hidden state is a bottleneck: every piece of information about the entire sequence must be compressed into a single vector.
By 2016, these limitations were well understood. The field needed an architecture that could model dependencies across arbitrary distances in a sequence, with no sequential bottleneck, and with the ability to attend selectively to different parts of the input. What it needed was a mechanism where every element in a sequence could directly interact with every other element, in parallel, weighted by relevance.
That mechanism was attention. And the architecture built entirely on attention — with no convolution and no recurrence — would be called the transformer.
Next: Chapter 9 — The Transformer. Self-attention, multi-head attention, positional encoding, and the architecture from "Attention Is All You Need" (Vaswani et al., 2017) that unified sequence modeling and made modern LLMs possible.

1 Hubel, D.H. and Wiesel, T.N. (1962). "Receptive fields, binocular interaction and functional architecture in the cat's striate cortex." Journal of Physiology, 160(1), 106-154. They received the Nobel Prize in Physiology or Medicine in 1981 for their discoveries concerning information processing in the visual system.
2 LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11), 2278-2324. The earlier work establishing backpropagation-trained convolutional networks appeared in LeCun et al. (1989), "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1(4), 541-551.
3 Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25 (NeurIPS).
4 Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1409.1556.
5 He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. The paper reported 3.57% top-5 error on ImageNet, winning the ILSVRC 2015 competition.
6 Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universitat Munchen. Bengio, Y., Simard, P., and Frasconi, P. (1994). "Learning long-term dependencies with gradient descent is difficult." IEEE Transactions on Neural Networks, 5(2), 157-166.
7 Hochreiter, S. and Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780.
8 Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:1406.1078.
9 Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). "Extracting and composing robust features with denoising autoencoders." Proceedings of the 25th International Conference on Machine Learning (ICML), 1096-1103.
10 Kingma, D.P. and Welling, M. (2014). "Auto-Encoding Variational Bayes." Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1312.6114. Originally posted December 2013.
11 Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). "Generative Adversarial Nets." Advances in Neural Information Processing Systems 27 (NeurIPS).