Part III — Building Artificial Neural Networks

GPUs and the Hardware Revolution

Why the math worked for decades before the hardware caught up. Graphics cards, CUDA, and the accidental infrastructure of deep learning.

The Bottleneck Was Never the Math

By the mid-2000s, the core ideas of deep learning already existed. Backpropagation had been formalized in the 1980s. Convolutional networks had been demonstrated on handwritten digits. Recurrent networks had been proposed for sequences. The math was there. The algorithms were there. What was missing was the ability to run them at scale in any reasonable amount of time.

Training a neural network is dominated by one operation: matrix multiplication. A forward pass through a single layer multiplies an input vector by a weight matrix and adds a bias. A network with 10 layers does this 10 times. Backpropagation does it again in reverse. Training means repeating the entire process millions of times over millions of data points. Every step is a matrix multiply.
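The layer operation described above can be sketched in a few lines (a toy example with made-up 2x2 weights, not any real network):

```python
def linear_forward(W, x, b):
    """One layer's forward pass: y = W*x + b, a single matrix-vector multiply."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# A toy 3-layer pass: the entire computation is the same operation
# repeated, which is exactly what makes it parallelizable.
W = [[0.5, -0.25], [0.1, 0.3]]
b = [0.0, 0.1]
x = [1.0, 2.0]
for _ in range(3):
    x = linear_forward(W, x, b)
```

A real framework would batch many inputs into a matrix and apply a nonlinearity between layers, but the skeleton is the same: one matrix multiply per layer, repeated.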

To put numbers on it: training a modest image classifier in 2005 might require on the order of 10^14 floating-point operations. A modern large language model requires 10^23 to 10^25. The difference between "this takes a week" and "this takes a century" comes down to how many multiplications you can do per second, and whether you can do them in parallel.
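The week-versus-century arithmetic can be checked directly (the sustained-throughput figures here are illustrative assumptions, not measurements):

```python
def training_days(total_flops, sustained_flops_per_second):
    """Wall-clock days to complete a training run at a given sustained throughput."""
    seconds = total_flops / sustained_flops_per_second
    return seconds / 86400  # seconds per day

# Illustrative: 1e14 FLOPs on a CPU sustaining ~1 GFLOPS (assumed 2005-era figure)
cpu_days = training_days(1e14, 1e9)
# The same ratio holds at modern scale: 1e23 FLOPs needs a cluster
# sustaining ~1e18 FLOPS to finish in the same ~1.2 days.
cluster_days = training_days(1e23, 1e18)
```

The point of the exercise: total FLOPs grew by nine orders of magnitude, so sustained throughput had to grow by nine orders of magnitude just to keep training times constant.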

Why CPUs Can't Do It

A CPU is a general-purpose processor designed for serial tasks. Modern CPUs have somewhere between 4 and 128 cores (a high-end server chip like AMD's EPYC 9965 has 192, but that's extreme). Each core is a powerful, complex piece of silicon — it can do branch prediction, out-of-order execution, speculative execution, and complex control flow. It's optimized for the case where the next instruction depends on the result of the current one.

This design philosophy is excellent for running an operating system, compiling code, or executing a database query — workloads with lots of conditional branching and unpredictable control flow. But it's a terrible match for matrix multiplication, which is the opposite kind of workload: you're doing the same operation on thousands of independent data points simultaneously. There are no branches to predict. There are no dependencies between individual multiplications. You just need raw arithmetic throughput.

A CPU doing matrix multiplication is like hiring one brilliant lawyer to individually process ten thousand identical parking tickets. They'll do each one perfectly, but they'll do them one at a time.

What GPUs Are, and Why They're Different

A GPU — graphics processing unit — was designed to solve a completely different problem: rendering pixels on a screen. Every frame of a video game requires computing the color of millions of pixels, and each pixel's color is calculated independently from the others. The math involved is linear algebra: transforming 3D coordinates to 2D screen positions (matrix multiplications), calculating lighting (dot products), applying textures (interpolation). The key property is that every pixel can be computed in parallel.

So GPU designers took the opposite approach from CPU designers. Instead of a few powerful, complex cores, they built thousands of simple cores, each capable of doing basic arithmetic — add, multiply, fused multiply-add — but not much else. No sophisticated branch prediction, no deep instruction pipelines. Just raw math at massive scale.

[Figure: CPU vs GPU architecture. A CPU has a few powerful cores (4-128), each with its own ALU, FPU, and branch predictor, backed by a large shared L2/L3 cache optimized for latency — suited to serial, branching workloads. A GPU has thousands of simple cores fed by high-bandwidth memory optimized for throughput — suited to parallel, uniform workloads. For the same 1024x1024 matrix multiply: an 8-core CPU (2024, ~500 GFLOPS FP32) processes chunks sequentially, each core handling a block of rows; an RTX 4090 (2022, ~83,000 GFLOPS FP32) computes all elements simultaneously across 16,384 cores, each handling a few elements.]

This is the fundamental mismatch. A modern high-end CPU delivers somewhere around 500 GFLOPS (billions of floating-point operations per second) in single-precision (FP32) arithmetic. An Nvidia RTX 4090, a consumer GPU from 2022, delivers roughly 83 TFLOPS — about 166 times more raw arithmetic throughput. A datacenter GPU like the H100 delivers around 67 TFLOPS in standard FP32, and close to 990 TFLOPS in FP16 tensor-core operations — nearly 2,000 times the CPU's throughput.

The GPU doesn't achieve this by being "faster" in any individual operation. Its clock speed is actually lower than a CPU's. It achieves it by doing thousands of operations simultaneously. This style of computation is called SIMD — Single Instruction, Multiple Data — though modern GPUs use a slightly more flexible variant called SIMT (Single Instruction, Multiple Threads), where groups of threads (called warps on Nvidia hardware, 32 threads each) execute the same instruction in lockstep.

Key idea: Neural network training is almost entirely matrix multiplication, and matrix multiplication is almost entirely independent arithmetic operations that can be parallelized. GPUs have thousands of cores designed for exactly this pattern. The match is not coincidental — it's structural.
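The independence claim is easy to see in code: each output cell of a matrix product reads one row of A and one column of B and nothing else, so no cell waits on any other (a minimal sketch — a GPU would assign each cell to its own thread, here we just loop):

```python
def matmul_cell(A, B, i, j):
    # Cell (i, j) depends only on row i of A and column j of B.
    # There is no ordering between cells, so all can run in parallel.
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[matmul_cell(A, B, i, j) for j in range(2)] for i in range(2)]
```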

The Gaming Connection

It bears emphasizing that GPUs were not designed for AI. They were designed to render video games.

Rendering a 3D scene requires transforming every vertex of every polygon from 3D world coordinates to 2D screen coordinates — a matrix multiplication. Calculating how light bounces off a surface — a dot product. Applying a texture to a surface — interpolation across a grid. Blending transparency, computing shadows, anti-aliasing edges — all of it is linear algebra on grids of numbers.

The gaming industry's insatiable demand for better graphics drove GPU manufacturers — primarily Nvidia and ATI (later acquired by AMD) — to pack more and more parallel arithmetic units onto their chips, generation after generation. By the mid-2000s, a $300 gaming GPU had more raw floating-point throughput than a $10,000 CPU cluster. The hardware that AI researchers needed already existed. It was sitting in millions of gaming PCs. It just wasn't programmable for anything other than graphics.

CUDA: The Unlock

Before 2007, if you wanted to run non-graphics code on a GPU, you had to disguise your computation as a graphics operation — literally encoding your data as pixel colors and your algorithm as a shader program. Researchers actually did this. It was painful, limited, and fragile. Each new GPU driver update could break everything.1

In June 2007, Nvidia released CUDA — Compute Unified Device Architecture. CUDA is a programming framework that lets you write general-purpose code in a C-like language and run it directly on Nvidia GPU hardware. You write a function (called a kernel), specify how many parallel threads to launch, and the GPU schedules them across its thousands of cores.

It is difficult to overstate how much CUDA mattered. AMD had competitive GPU hardware throughout this period. Intel had massive R&D budgets. But Nvidia had a usable software ecosystem, and that ecosystem created a moat that persists to this day. Every major deep learning framework — TensorFlow, PyTorch, JAX — was built on CUDA first, and supporting other hardware has been a perpetual afterthought.

Key idea: The hardware advantage alone wasn't enough. CUDA turned GPUs from specialized graphics engines into general-purpose parallel computers. The software layer — the programming model, libraries, and developer tools — is what made the hardware usable for research. This is why Nvidia dominates AI computing despite not being the only company that can build fast chips.

AlexNet: The Proof

The moment the field changed has a precise date: September 30, 2012, the day AlexNet competed in that year's ImageNet Large Scale Visual Recognition Challenge (ILSVRC). AlexNet, a convolutional neural network built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton and trained on two Nvidia GTX 580 GPUs, won with a top-5 error rate of 15.3%, compared to 26.2% for the second-place entry — a non-neural method. The margin was unprecedented.2 The accompanying paper appeared at the NeurIPS conference (then called NIPS) later that year.

What made AlexNet significant wasn't any single architectural innovation. It used techniques that already existed: convolutional layers, ReLU activations, dropout regularization, data augmentation. What was new was the scale — 60 million parameters trained on 1.2 million images — and that scale was only possible because of GPUs. Training AlexNet took 5-6 days on two GPUs. On a CPU, the same training is estimated to have taken weeks to months.

AlexNet demonstrated three things simultaneously:

  1. Deep neural networks could dramatically outperform hand-engineered feature systems on real-world tasks.
  2. Scale — more data, bigger networks, more compute — mattered more than algorithmic cleverness.
  3. GPUs made that scale practical.

The result was a stampede. Within two years, virtually every competitive entry in ImageNet used deep learning trained on GPUs. Within five years, the same approach had been applied to speech recognition, machine translation, game playing, and protein structure prediction. The deep learning revolution was, at its core, a hardware revolution enabled by a software ecosystem.

The FLOPS Progression

Once the field realized that more compute meant better results, GPU performance became a direct input to research capability. The progression has been staggering.

GPU | Year | CUDA Cores | FP32 TFLOPS | Memory
GeForce 8800 GTX | 2006 | 128 | 0.5 | 768 MB GDDR3
GTX 580 | 2010 | 512 | 1.6 | 1.5 GB GDDR5
Tesla K40 | 2013 | 2,880 | 5.0 | 12 GB GDDR5
Tesla V100 | 2017 | 5,120 | 15.7 | 16-32 GB HBM2
A100 | 2020 | 6,912 | 19.5 | 40-80 GB HBM2e
H100 | 2022 | 16,896 | ~67 | 80 GB HBM3
B200 | 2024 | 18,432 | ~90 | 192 GB HBM3e

Note the memory column — that progression matters as much as the FLOPS, for reasons explored below. Also note that the table shows FP32 (single-precision) numbers. Modern AI training increasingly uses lower-precision formats — FP16, BF16, FP8, INT8 — where the same hardware delivers even higher throughput. The H100 delivers ~990 TFLOPS in FP16 with sparsity, and the Transformer Engine on H100 and B200 supports FP8 natively. Reduced precision works because neural network training is tolerant of rounding errors in a way that, say, scientific simulation is not.3
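The tolerance for rounding can be demonstrated with nothing but the standard library, which can round-trip a float through the IEEE 754 half-precision (FP16) format:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision (10-bit mantissa)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# FP16 keeps roughly 3 decimal digits of precision. Gradient-based training
# tolerates this level of rounding; many scientific simulations would not.
lossy = to_fp16(0.1)
```

The value that comes back is close to 0.1 but not equal to it — every FP16 arithmetic step introduces rounding at this magnitude, and training still converges.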

Google's TPUs

Not everyone decided that repurposed gaming hardware was the right answer. In 2016, Google announced the Tensor Processing Unit (TPU) — custom silicon designed from the ground up for neural network inference (and later, training).

TPUs are ASICs — Application-Specific Integrated Circuits. Unlike GPUs, which still retain some general-purpose flexibility, TPUs are architecturally committed to matrix multiplication and related tensor operations. The first-generation TPU was an inference-only accelerator deployed inside Google's data centers starting in 2015, running production workloads like Google Search ranking and Google Translate. Google later revealed that AlphaGo's match against Lee Sedol in 2016 used TPUs for inference during play.

The key differences from GPUs: TPUs organize their arithmetic units into systolic arrays — grids of multiply-accumulate units through which data flows in waves, so partial results pass directly between units instead of round-tripping through memory. They are built around reduced-precision arithmetic (Google introduced the bfloat16 format largely for TPU workloads), and they omit the graphics-specific hardware that GPUs still carry.

Google's TPU v4 pods (2022) can be configured with thousands of chips connected via high-speed interconnects, forming a supercomputer-scale system. Google has used these internally to train models like PaLM (540B parameters) and Gemini. The latest generation, TPU v5p (2023) and Trillium/TPU v6e (2024), continue to push throughput and interconnect bandwidth.

TPUs proved an important point: if you know exactly what computation you need, you can build hardware that does it far more efficiently than general-purpose hardware. The tradeoff is flexibility — TPUs are not good at anything other than tensor operations.

The Memory Wall

By roughly 2018, a counterintuitive problem emerged in deep learning hardware: compute was no longer the primary bottleneck. Memory was.

There are two dimensions to the memory problem:

Capacity: Can the model fit?

A model's parameters must be held in GPU memory (VRAM) during training. Each parameter in FP32 takes 4 bytes. A 7-billion-parameter model in FP32 requires 28 GB just for the weights — before accounting for gradients (another 28 GB), optimizer states (another 28-56 GB for Adam, which stores two additional values per parameter), and activations (variable, but often tens of GB).

This means training a 7B model in FP32 with Adam can require 100+ GB of memory. An H100 has 80 GB. You either distribute across multiple GPUs, use mixed precision to halve the memory, use gradient checkpointing to trade compute for memory, or use techniques like ZeRO (from DeepSpeed) that shard optimizer states across GPUs. For models at the 70B-400B scale, you're talking about clusters of hundreds or thousands of GPUs.
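The accounting in the two paragraphs above can be written out directly (a simplified sketch that deliberately ignores activations, which vary with batch size and sequence length):

```python
def training_memory_gb(n_params, bytes_per_value=4, optimizer_states=2):
    """Memory for weights + gradients + optimizer states, in GB (activations excluded)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value            # one gradient per parameter
    optimizer = n_params * bytes_per_value * optimizer_states  # Adam: 2 extra values/param
    return (weights + gradients + optimizer) / 1e9

# 7B parameters in FP32 with Adam: already past an H100's 80 GB
# before a single activation is stored.
fp32_adam = training_memory_gb(7e9)
```

Halving bytes_per_value to 2 (mixed precision) shows why FP16/BF16 training is as much a memory technique as a speed technique.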

Bandwidth: Can the data arrive fast enough?

Even when data fits in GPU memory, the cores can only compute if operands arrive fast enough, and in practice GPU cores often sit idle waiting on memory. A workload's ratio of arithmetic operations to bytes moved is called its arithmetic intensity; when that ratio is lower than the hardware's own ratio of compute throughput to memory bandwidth, the compute units starve and throughput is wasted.

Consider: an H100 can perform ~67 TFLOPS in FP32. To keep those ALUs busy, it needs to deliver operands at a rate that matches. If each FLOP requires reading 4 bytes from memory, you need 67 teraFLOPS * 4 bytes = 268 TB/s of memory bandwidth. The H100's HBM3 provides about 3.35 TB/s. That's roughly 80x less than what would be needed to keep the compute units saturated for a purely memory-bound operation. The hardware is designed around the assumption that workloads will have high arithmetic intensity — that each byte fetched from memory will be reused for many operations. Matrix multiplication, fortunately, has this property (an element is reused O(n) times in an n-dimensional matmul). But operations like element-wise additions, normalization, and attention score computation have low arithmetic intensity and often bottleneck on memory bandwidth.
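The back-of-envelope above, as code (using the chapter's approximate figures, so the outputs are estimates, not spec-sheet values):

```python
def required_bandwidth_tbs(tflops, bytes_per_flop):
    # 1 TFLOP/s * 1 byte/FLOP = 1 TB/s, so the units cancel directly.
    return tflops * bytes_per_flop

h100_need = required_bandwidth_tbs(67, 4)  # TB/s to feed FP32 units with fresh data
h100_hbm3 = 3.35                           # TB/s the H100's HBM3 actually delivers
shortfall = h100_need / h100_hbm3          # gap for a purely memory-bound workload
```

The ~80x shortfall is the quantitative statement of the memory wall: a memory-bound kernel on this hardware can use only about 1/80th of the available arithmetic capacity.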

Key idea: The "speed" of AI hardware is not a single number. Compute throughput (FLOPS) and memory bandwidth (TB/s) are both constraints, and different operations hit different ceilings. The memory wall — where you can compute faster than you can feed data to the compute units — is the defining hardware challenge of modern deep learning.

HBM: Stacking Memory to Feed the Beast

HBM — High Bandwidth Memory — is the memory technology that makes modern AI accelerators viable. It works by stacking multiple layers of DRAM (dynamic random-access memory) vertically and connecting them to the processor through a silicon interposer — a thin layer of silicon that provides thousands of tiny connections between the memory stacks and the GPU die.

Traditional GDDR memory (the kind used in gaming GPUs) connects to the GPU through a relatively narrow bus on a printed circuit board. HBM connects through thousands of microscopic wires running through silicon, delivering dramatically higher bandwidth in a smaller physical footprint.

Memory Type | Bandwidth (approx.) | Used In
GDDR6X | ~1 TB/s | RTX 4090 (consumer GPU)
HBM2e | ~2 TB/s | A100 (datacenter GPU)
HBM3 | ~3.35 TB/s | H100 (datacenter GPU)
HBM3e | ~8 TB/s | B200 (datacenter GPU)

HBM is manufactured primarily by SK Hynix, Samsung, and Micron. It is expensive — both to manufacture (the stacking and interposer process has lower yields than standard DRAM) and to package (the GPU die and HBM stacks must be co-packaged on the interposer). This is a significant part of why datacenter GPUs cost $25,000-$40,000 while consumer GPUs with similar core counts cost $1,500-$2,000. The HBM alone can account for a third of the total chip cost.

Apple Silicon: A Different Architecture

Apple's M-series chips — including the M4 in your Mac Mini — take a fundamentally different approach to the memory problem: unified memory architecture (UMA).

In a traditional PC, the CPU has its own RAM (system memory, typically DDR5) and the GPU has its own separate RAM (VRAM, typically GDDR6 or HBM). When you want to do GPU computation, data must first be copied from system memory to GPU memory over a PCIe bus — a relatively narrow connection. This copy takes time and the two memory pools are managed independently.

Apple's M-series chips put the CPU, GPU, and Neural Engine on a single piece of silicon (a System on a Chip, or SoC), and they all share a single pool of high-bandwidth LPDDR5 memory. There's no copy step. The GPU can read directly from the same memory the CPU is using. This eliminates the PCIe bottleneck entirely and means the full 24 GB (in your configuration) is available to both CPU and GPU workloads.

The M4's memory bandwidth is approximately 120 GB/s — roughly 28 times less than an H100's 3,350 GB/s. The M4's GPU has 10 cores, compared to the H100's 16,896 CUDA cores. In raw compute, the M4 delivers roughly 4-5 TFLOPS in FP32 on its GPU — about 15x less than an H100.

So where does Apple Silicon fit in the landscape? Not in large-scale training. Its strength is local inference and development, where unified memory lets models run on-device that would overflow a consumer GPU's separate VRAM.

Apple's Neural Engine — a fixed-function accelerator included on M-series chips — adds another ~38 TOPS (trillion operations per second) in INT8. It's designed for inference on specific operation patterns (convolutions, matrix multiplies) and is used by Core ML for on-device inference in apps. It's fast for its power budget but not programmable in the way a GPU is.

Why Nvidia Dominates

As of early 2026, Nvidia holds somewhere between 80-95% market share for AI training hardware (the exact number varies by how you count, but the dominance is not in dispute). This might seem strange for a company that makes GPUs — a product category where AMD has been a credible competitor for over two decades. The explanation is almost entirely about software.

The moat has several layers:

  1. CUDA (2007-). Nearly two decades of developer tooling, documentation, and community investment. Researchers who learned GPU programming learned CUDA. Codebases are written in CUDA. University courses teach CUDA. Switching costs are enormous.
  2. cuDNN (2014-). The deep learning primitives library that every major framework calls under the hood. cuDNN kernels are hand-optimized by Nvidia engineers for each GPU generation. When PyTorch calls torch.nn.Conv2d, it ultimately calls a cuDNN kernel. These optimizations are substantial — a naive GPU implementation of convolution might be 10-50x slower than cuDNN's implementation.
  3. Framework integration. PyTorch, TensorFlow, and JAX all have first-class Nvidia support. AMD's ROCm and Intel's oneAPI exist, but framework support is consistently behind, less tested, and more likely to have bugs or missing features.
  4. Networking. Nvidia acquired Mellanox in 2020 for $7 billion, giving them control of InfiniBand — the high-speed networking technology used to connect GPUs in datacenter clusters. Training large models requires thousands of GPUs communicating constantly. Nvidia now sells the complete stack: GPU + memory + network.
  5. NVLink and NVSwitch. Proprietary high-speed interconnects between GPUs within a node. NVLink 5 (in the Blackwell generation) provides 1.8 TB/s of bidirectional bandwidth between GPUs — far more than PCIe. Multi-GPU training performance depends heavily on inter-GPU bandwidth.

AMD's MI300X (2023) is a competitive chip on paper — 192 GB of HBM3, strong compute numbers. Intel's Gaudi accelerators exist. But hardware specs are only part of the story. The question is always: can I take my existing PyTorch code, change one line, and have it work? For Nvidia, the answer is usually yes. For everyone else, the answer is usually "mostly, but you'll spend days debugging."

This is a classic platform lock-in dynamic, similar to x86 in PCs or iOS in mobile. The best hardware doesn't always win. The best ecosystem does.

The Full Stack

To summarize the hardware landscape as it stands:

The AI hardware stack, from top to bottom:

  1. Application — your code, model definition, training script.
  2. Framework — PyTorch, TensorFlow, JAX: autograd, optimization.
  3. Libraries — cuDNN, cuBLAS, NCCL (multi-GPU comms), TensorRT.
  4. Runtime — CUDA: thread scheduling, memory management, kernel launch.
  5. Interconnect — NVLink, InfiniBand, PCIe: data movement between chips.
  6. Silicon — GPU cores + HBM (or TPU systolic arrays, or the Apple Neural Engine).

Nvidia controls or shapes every layer from the libraries down. Its moat isn't any single layer — it's vertical integration across the entire stack.

The critical insight is that performance in AI computing is not about any one component. It's about the interaction between compute (FLOPS), memory (capacity and bandwidth), interconnect (how fast chips talk to each other), software (how efficiently the hardware is utilized), and the ecosystem (whether your code just works). Nvidia's dominance comes from controlling or influencing nearly every layer simultaneously.

What This Means for Training at Scale

To ground all of this in a concrete example: Meta reported that training Llama 2 70B (2023) required roughly 1.7 million GPU-hours on A100 hardware. At cloud pricing of approximately $2-3/GPU-hour, that's on the order of $3-5 million in compute cost for a single training run — and large runs are often repeated multiple times due to instabilities, bugs, or hyperparameter tuning.4
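The cost arithmetic, as a sketch (the per-hour price here is an assumed midpoint of the chapter's $2-3 range; actual cloud pricing varies by provider and commitment):

```python
def run_cost_musd(gpu_hours, usd_per_gpu_hour):
    """Compute cost of a single training run, in millions of USD."""
    return gpu_hours * usd_per_gpu_hour / 1e6

# Llama 2 70B: 1,720,320 reported A100-hours at an assumed $2.50/GPU-hour
llama2_cost = run_cost_musd(1_720_320, 2.50)
```

And that figure covers one run; failed runs, ablations, and hyperparameter sweeps multiply it.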

Training GPT-4 (2023) is estimated to have cost $50-100 million in compute, though OpenAI has not disclosed exact figures. Models currently in training at frontier labs are likely costing several hundred million dollars per run.

This cost structure means that frontier-scale training is open only to organizations that can commit tens to hundreds of millions of dollars to compute, concentrating that capability in a handful of well-funded labs.


Summary

The deep learning revolution was a hardware revolution. The algorithms existed for decades. What changed was:

  1. GPUs — originally built for gaming — provided the parallel arithmetic throughput that neural network training requires.
  2. CUDA (2007) made GPUs programmable for general computation.
  3. AlexNet (2012) proved the concept — deep learning trained on GPUs could crush traditional methods.
  4. HBM addressed the memory bandwidth wall as models grew.
  5. TPUs showed that custom silicon could be even more efficient for specific workloads.
  6. Nvidia's ecosystem — CUDA, cuDNN, NVLink, InfiniBand — created a vertically integrated platform that competitors have struggled to match.

Your Mac Mini M4 sits in an interesting position in this landscape. Its unified memory architecture makes it genuinely capable for inference and development — you can run 14B parameter models locally in a way that would require a discrete GPU on a PC. For training, it's not competitive with datacenter hardware, but that's not its purpose. The hardware you have is well-suited for exactly the work you're doing: running local models, prototyping, and iterating fast.

The next chapter moves from the hardware that runs neural networks to the architectures that define them. Hardware determines what's computationally feasible. Architecture determines what the computation actually does.

Next: Chapter 8 — Architectures. CNNs for spatial patterns, RNNs and LSTMs for sequences and memory. Why each architecture was built, what problem it solved, and why they all eventually ran into the same wall that the transformer would break through.

1 Early GPGPU (General-Purpose computing on Graphics Processing Units) work includes Owens et al., "A Survey of General-Purpose Computation on Graphics Hardware" (2007), which documents the pre-CUDA era of encoding scientific computation as texture operations.

2 Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (NeurIPS 2012). The paper is sometimes cited as the beginning of the modern deep learning era. Sutskever went on to co-found OpenAI; Hinton shared the 2018 Turing Award with Yann LeCun and Yoshua Bengio.

3 Micikevicius et al., "Mixed Precision Training" (ICLR 2018), demonstrated that neural networks can be trained in FP16 with minimal accuracy loss, provided a few operations (loss scaling, master weights) remain in FP32. This effectively doubled training throughput on supported hardware.

4 Meta AI, "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023). The paper reports 1,720,320 GPU-hours on A100-80GB hardware for the 70B model. Cost estimates are based on publicly available cloud GPU pricing and vary by provider.