Every chapter in this guide so far has been about software: architectures, algorithms, training methods. But all of it runs on physical hardware, and the constraints of that hardware shape what's possible. The transformer architecture works because modern GPUs can execute the massive parallel matrix multiplications it requires. The scaling laws that drive frontier model development are, at their core, statements about what happens when you apply more compute. And "more compute" means more chips, faster chips, better memory — all of which depend on a supply chain that passes through a remarkably small number of companies.
Understanding the hardware stack isn't optional background. It's the foundation that determines who can train frontier models, how much it costs, and where the bottlenecks are.
AI hardware is a vertically integrated supply chain. Each layer depends on the one below it, and a disruption at any layer cascades upward.
Nvidia's dominance of AI compute is difficult to overstate. As of early 2025, Nvidia GPUs power an estimated 80-95% of all AI training and the majority of inference workloads. The company's market capitalization briefly exceeded $3 trillion in 2024, making it one of the most valuable companies on Earth. To understand why, you need to understand both the hardware and the software.
A GPU (Graphics Processing Unit) was originally designed for rendering graphics — computing the color of millions of pixels simultaneously. This requires thousands of small processing cores executing the same operation on different data. That architecture — SIMD (Single Instruction, Multiple Data) or more precisely SIMT (Single Instruction, Multiple Threads) in Nvidia's case — turns out to be exactly what you need for training neural networks.
Neural network training is dominated by matrix multiplications. A forward pass through a transformer layer is essentially a series of large matrix multiplies. Backpropagation to compute gradients is more matrix multiplies. These operations are embarrassingly parallel — each element of the output matrix can be computed independently. A CPU with 16-64 cores struggles with this. A GPU with thousands of cores thrives.
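The independence of each output element can be made concrete with a minimal pure-Python sketch (illustrative only — real frameworks dispatch this to hand-optimized GPU kernels):

```python
# Why matrix multiplication parallelizes so well: every output element
# is an independent dot product, so thousands of GPU cores can each
# compute one (or several) with no coordination between them.

def matmul(A, B):
    """Naive matrix multiply: C[i][j] depends only on row i of A and column j of B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):          # on a GPU, each (i, j) pair runs as its own thread
        for j in range(m):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```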
Modern Nvidia data center GPUs (the A100, H100, H200, and B200) add specialized hardware beyond general-purpose cores — most importantly Tensor Cores, dedicated units that accelerate mixed-precision matrix multiplication. The major generations:
| GPU | Year | Tensor TFLOPS (FP16) | HBM | Significance |
|---|---|---|---|---|
| V100 | 2017 | 125 | 32GB HBM2 | First GPU with Tensor Cores. Trained most GPT-2/BERT-era models. |
| A100 | 2020 | 312 | 80GB HBM2e | Workhorse of the GPT-3/LLaMA era. BF16 support. Multi-Instance GPU (MIG). |
| H100 | 2022 | 990 | 80GB HBM3 | Transformer Engine with FP8 support. ~3x A100 for LLM training. The most sought-after chip in AI history. |
| H200 | 2024 | 990 | 141GB HBM3E | Same compute as H100 with ~1.8x the memory and ~1.4x the bandwidth. Inference-optimized. |
| B200 | 2024-25 | ~2,250 | 192GB HBM3E | Blackwell architecture. Second-gen Transformer Engine with FP4 support. |
Each generation roughly doubles effective performance for AI workloads while increasing memory capacity. This is why the scaling laws work in practice: chip performance has kept pace with researchers' appetite for compute.
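A back-of-envelope estimate makes the compute-appetite concrete, using the common approximation that training requires ~6·N·D FLOPs (N parameters, D training tokens). The cluster size and utilization figure below are illustrative assumptions, not measured numbers:

```python
# Training-time estimate from the common ~6*N*D FLOPs rule of thumb.
# The 40% model-FLOPs-utilization (MFU) is an illustrative assumption;
# real runs vary with parallelism strategy and interconnect.

def training_days(params, tokens, gpus, tflops_per_gpu, mfu=0.4):
    total_flops = 6 * params * tokens
    effective = gpus * tflops_per_gpu * 1e12 * mfu  # sustained FLOP/s
    return total_flops / effective / 86_400          # seconds -> days

# Hypothetical run: a 70B-parameter model on 2T tokens across 1,024 H100s
days = training_days(params=70e9, tokens=2e12, gpus=1024, tflops_per_gpu=990)
print(f"{days:.0f} days")  # ~24 days
```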
Nvidia's hardware advantage is significant, but its real moat is software. CUDA (Compute Unified Device Architecture), released in 2006, is Nvidia's parallel computing platform. It provides a C/C++-based programming model for writing GPU kernels, a compiler and profiling toolchain, and a deep stack of optimized libraries — cuDNN for neural network primitives, cuBLAS for dense linear algebra, and NCCL for multi-GPU communication.
When a researcher writes `model.cuda()` in PyTorch, they're invoking an ecosystem that has been built and optimized for nearly two decades. Every layer of the AI software stack — from the training loop to the optimizer to the attention kernel — has been hand-optimized for CUDA. This creates enormous switching costs. Even if a competitor builds better hardware, the software ecosystem doesn't port automatically: developers have to rewrite, retest, and re-optimize their entire stack.
AMD's Instinct accelerators (MI250X, MI300X, MI325X) are competitive with Nvidia on raw hardware specifications. The MI300X, released in late 2023, offers 192GB of HBM3 memory — more than the H100 — and competitive compute throughput. On paper, it should be a viable alternative.
The gap is software. AMD's ROCm (Radeon Open Compute) platform is the CUDA equivalent, but it is years behind in maturity: fewer hand-tuned kernels, a smaller ecosystem of tested libraries and tooling, and a shorter track record in production — though PyTorch now ships official ROCm builds.
AMD is making progress. Meta has publicly invested in MI300X clusters for LLaMA training, providing validation. Microsoft has used AMD GPUs in Azure. But "works" and "works as well as CUDA with the same developer effort" are different things. For now, AMD is the viable second choice rather than an equal competitor.
Intel dominated computing for decades through its x86 CPU architecture, but it was late to AI accelerators. Its Gaudi series (developed by Habana Labs, acquired in 2019 for $2 billion) provides competitive performance-per-dollar for certain workloads, and its "Ponte Vecchio" GPU (Intel Data Center GPU Max) targets HPC and AI. But Intel's market share in AI training is minimal.
Intel's real challenge is that it's fighting on two fronts: against Nvidia for AI accelerators and against TSMC for chip fabrication (Intel is one of the few companies that both designs and fabricates chips, but its fabrication technology has fallen behind TSMC's). The Intel Foundry Services initiative is attempting to become a competitive foundry for external customers, but this is a multi-year, multi-billion-dollar bet with uncertain outcomes.
Taiwan Semiconductor Manufacturing Company (TSMC) is, by some analyses, the most strategically important company in the world. It fabricates chips for Nvidia, AMD, Apple, Google, Qualcomm, and hundreds of other companies. At the leading edge of chip fabrication (sub-5nm process nodes), TSMC has approximately 90% market share.
Some context on what "fabrication" means: chip companies like Nvidia design chips but don't manufacture them. They send their designs to TSMC, which operates fabrication plants (fabs) that cost $10-20 billion each to build. These fabs contain ASML extreme ultraviolet (EUV) lithography machines — the most complex machines ever built, costing roughly $200 million each (nearly $400 million for the newest high-NA models) — that pattern transistors at scales of a few nanometers onto silicon wafers.
Why this matters for AI: virtually every leading-edge AI accelerator — Nvidia's, AMD's, Google's — is fabricated by TSMC, so the industry's entire capacity to produce compute runs through a handful of fabs concentrated in Taiwan.
For large language models, memory is often the binding constraint, not compute. During inference (generating text from a trained model), the model's weights must be loaded from memory for every token generated. A 70-billion parameter model in FP16 requires ~140GB of memory just for the weights. If your GPU has 80GB of HBM, the model doesn't fit on one chip.
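The arithmetic behind that claim is simple enough to sketch (a rough sketch of weight memory alone — activations and KV cache add more on top):

```python
# Weight memory at various precisions, ignoring activations and KV cache.
# Shows why a 70B model in FP16 cannot fit on a single 80GB GPU.

def weight_gb(params, bytes_per_param):
    return params * bytes_per_param / 1e9

params = 70e9
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {weight_gb(params, nbytes):.0f} GB")
# FP16 -> 140 GB: doesn't fit on one 80GB H100. INT4 -> 35 GB: fits easily.
```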
HBM (High Bandwidth Memory) is a specialized type of memory used in AI accelerators. Unlike standard DRAM, HBM stacks multiple memory dies vertically and connects them with through-silicon vias (TSVs), providing dramatically higher bandwidth in a smaller footprint. Each generation has roughly doubled per-stack bandwidth: from HBM2 (~256 GB/s per stack) through HBM2e and HBM3 (~819 GB/s) to HBM3E (over 1 TB/s per stack).
The HBM market is dominated by three companies: SK Hynix (South Korea, ~50% share), Samsung (South Korea), and Micron (United States). SK Hynix has the technology lead — it was the first to mass-produce HBM3E and has a close relationship with Nvidia. HBM supply was a significant bottleneck in 2023-2024, with lead times stretching to 6-12 months.
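Bandwidth, not just capacity, sets the ceiling on inference speed: at batch size 1, every generated token must stream all the weights from HBM. A simplified roofline sketch (an upper bound under that assumption — real systems batch requests and overlap work):

```python
# Batch-1 decoding roofline: peak tokens/sec is roughly
# memory_bandwidth / bytes_of_weights, since each token re-reads all weights.
# Treat this as an upper-bound sketch, not a benchmark.

def max_tokens_per_sec(params, bytes_per_param, bandwidth_gb_s):
    weight_bytes = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# 70B model in FP16 on a GPU with ~3,350 GB/s of HBM bandwidth (H100-class)
print(f"{max_tokens_per_sec(70e9, 2, 3350):.0f} tokens/sec")  # ~24 tokens/sec
```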
Google's TPU (Tensor Processing Unit) is a custom ASIC (Application-Specific Integrated Circuit) designed specifically for neural network workloads. Unlike GPUs, which are general-purpose parallel processors adapted for AI, TPUs are built from the ground up for matrix multiplication and nothing else.
The TPU progression: v1 (2016) handled inference only; v2 (2017) added training support and HBM; v3 (2018) and v4 (2021) scaled up pod sizes and interconnect; v5e and v5p (2023) split the line into cost-optimized and performance-optimized variants; and Trillium (v6, 2024) continued the trajectory.
TPUs are programmed through JAX and TensorFlow rather than CUDA. This means code written for TPUs doesn't run on Nvidia GPUs and vice versa without porting. Google uses TPUs internally for training Gemini and other models, and offers them through Google Cloud.
The strategic logic: by building its own chips, Google reduces dependence on Nvidia and can optimize hardware for its specific workloads. The tradeoff is a smaller developer ecosystem. Most external researchers use Nvidia GPUs, so TPU-specific optimizations stay within Google's ecosystem.
Apple's M-series chips (M1, M2, M3, M4 and their Pro/Max/Ultra variants) take a different approach: unified memory architecture. Instead of separate CPU memory and GPU memory with a slow bus between them, Apple Silicon puts everything in a single pool of memory accessible by both CPU and GPU cores at full bandwidth.
For AI inference, this has a notable advantage: a MacBook Pro with an M4 Max has 128GB of unified memory accessible to the GPU. This is enough to run a 70B parameter model at reasonable speed — something that would require multiple Nvidia GPUs in a traditional setup. The inference isn't as fast as a dedicated data center GPU, but it works on a laptop with no cloud costs.
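A quick fit check shows why quantization is what makes this practical (the usable-memory fraction below is an illustrative assumption, not an Apple specification):

```python
# Rough fit check for unified memory: weights plus a working margin for
# KV cache and the OS. The 0.75 usable fraction is an assumption.

def fits(params, bytes_per_param, unified_gb, usable_fraction=0.75):
    return params * bytes_per_param / 1e9 <= unified_gb * usable_fraction

# 70B model on a 128GB M-series machine:
print(fits(70e9, 2, 128))    # FP16: 140 GB of weights -> False
print(fits(70e9, 0.5, 128))  # 4-bit quantized: ~35 GB -> True
```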
For training, Apple Silicon is not competitive. The GPU cores lack the specialized Tensor Core equivalent at the scale needed for training frontier models, and the total compute throughput is far below a data center GPU. Apple Silicon is an inference platform, not a training platform.
This matters for practitioners: running local inference with tools like llama.cpp or Ollama on an M-series Mac is a viable way to experiment with open-weight models without cloud costs — for example, a Mac mini running qwen2.5:14b through Ollama.
Putting the supply chain together:
| Layer | Key Players | What Could Go Wrong |
|---|---|---|
| Chip Design | Nvidia, AMD, Google, Intel | Design mistakes (rare, high stakes) |
| Fabrication | TSMC (~90%), Samsung | Geopolitical disruption, capacity limits, yield problems |
| EUV Equipment | ASML (100% monopoly) | Export controls, production bottlenecks |
| Memory | SK Hynix, Samsung, Micron | HBM supply constraints, technology transitions |
| Systems | Nvidia (DGX), Dell, Supermicro | Power/cooling capacity, networking bottlenecks |
| Cloud | AWS, Azure, GCP, CoreWeave | GPU availability, pricing, regional capacity |
One entity deserves special mention: ASML (Netherlands). ASML is the sole manufacturer of EUV lithography machines, which are required for fabricating chips at 7nm and below. Without ASML, TSMC cannot manufacture advanced chips. Without TSMC, Nvidia cannot produce GPUs. ASML is the deepest single point of failure in the entire AI supply chain. The company has no competitor at the EUV level — the technology required decades and tens of billions of dollars to develop, and the barriers to entry are effectively insurmountable in the near term.
Training frontier models requires staggering amounts of electricity. A cluster of 10,000 H100 GPUs draws approximately 7-10 megawatts of power — enough for a small town. Cooling these clusters requires additional energy. Data center operators are now the largest buyers of new power capacity in many regions, and some are investing in nuclear power (Microsoft's deal with Constellation Energy for Three Mile Island restart, Amazon's investment in small modular reactors) to secure long-term supply.
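The 7-10 megawatt figure follows directly from per-GPU power draw plus facility overhead (the overhead multiplier below is an illustrative assumption in the spirit of a PUE factor):

```python
# Rough cluster power draw: GPU TDP (~700W for an H100 SXM) plus an
# assumed 1.4x overhead for host CPUs, networking, and cooling.

def cluster_megawatts(gpus, watts_per_gpu=700, overhead=1.4):
    return gpus * watts_per_gpu * overhead / 1e6

print(f"{cluster_megawatts(10_000):.1f} MW")  # ~9.8 MW for 10,000 H100s
```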
This isn't a theoretical concern. New data center construction is being delayed in some regions because the local power grid cannot support the demand. The environmental impact is also significant: AI compute now accounts for a measurable fraction of global electricity consumption, and that fraction is growing rapidly.
The hardware stack tells you who controls the physical infrastructure of AI. But between the chips and the end user sit the hyperscalers — the companies that operate data centers at planetary scale and sell compute as a service. That's the subject of the next chapter: how AI gets deployed, served, and paid for at scale.