Every chapter in this guide so far has been about software: architectures, algorithms, training methods. But all of it runs on physical hardware, and the constraints of that hardware shape what's possible. The transformer architecture works because modern GPUs can execute the massive parallel matrix multiplications it requires. The scaling laws that drive frontier model development are, at their core, statements about what happens when you apply more compute. And "more compute" means more chips, faster chips, better memory — all of which depend on a supply chain that passes through a remarkably small number of companies.
Understanding the hardware stack isn't optional background. It's the foundation that determines who can train frontier models, how much it costs, and where the bottlenecks are.
AI hardware is a vertically integrated supply chain. Each layer depends on the one below it, and a disruption at any layer cascades upward.
Nvidia's dominance of AI compute is difficult to overstate. As of early 2025, Nvidia GPUs power an estimated 80-95% of all AI training and the majority of inference workloads. The company's market capitalization briefly exceeded $3 trillion in 2024, making it one of the most valuable companies on Earth. To understand why, you need to understand both the hardware and the software.
A GPU (Graphics Processing Unit) was originally designed for rendering graphics — computing the color of millions of pixels simultaneously. This requires thousands of small processing cores executing the same operation on different data. That architecture — SIMD (Single Instruction, Multiple Data) or more precisely SIMT (Single Instruction, Multiple Threads) in Nvidia's case — turns out to be exactly what you need for training neural networks.
Neural network training is dominated by matrix multiplications. A forward pass through a transformer layer is essentially a series of large matrix multiplies. Backpropagation to compute gradients is more matrix multiplies. These operations are embarrassingly parallel — each element of the output matrix can be computed independently. A CPU with 16-64 cores struggles with this. A GPU with thousands of cores thrives.
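The independence of each output element can be made concrete with a minimal pure-Python sketch (illustrative only — real frameworks dispatch this to hand-optimized GPU kernels):

```python
# Why matrix multiplication parallelizes so well: every output element
# is an independent dot product, so thousands of GPU cores can each
# compute one (or several) with no coordination between them.

def matmul(A, B):
    """Naive matrix multiply: C[i][j] depends only on row i of A and column j of B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):          # on a GPU, each (i, j) pair runs as its own thread
        for j in range(m):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```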
Modern Nvidia data center GPUs (the A100, H100, H200, and B200) add specialized hardware beyond general-purpose cores — most importantly Tensor Cores, dedicated units that accelerate mixed-precision matrix multiplication. The major generations:
| GPU | Year | Tensor TFLOPS (FP16) | HBM | Significance |
|---|---|---|---|---|
| V100 | 2017 | 125 | 32GB HBM2 | First GPU with Tensor Cores. Trained most GPT-2/BERT-era models. |
| A100 | 2020 | 312 | 80GB HBM2e | Workhorse of the GPT-3/LLaMA era. BF16 support. Multi-Instance GPU (MIG). |
| H100 | 2022 | 990 | 80GB HBM3 | Transformer Engine with FP8 support. ~3x A100 for LLM training. The most sought-after chip in AI history. |
| H200 | 2024 | 990 | 141GB HBM3E | Same compute as H100 with ~1.8x the memory and ~1.4x the bandwidth. Inference-optimized. |
| B200 | 2024-25 | ~2,250 | 192GB HBM3E | Blackwell architecture. Second-gen Transformer Engine with FP4 support. |
Each generation roughly doubles effective performance for AI workloads while increasing memory capacity. This is why the scaling laws work in practice: chip performance has kept pace with researchers' appetite for compute.
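A back-of-envelope estimate makes the compute-appetite concrete, using the common approximation that training requires ~6·N·D FLOPs (N parameters, D training tokens). The cluster size and utilization figure below are illustrative assumptions, not measured numbers:

```python
# Training-time estimate from the common ~6*N*D FLOPs rule of thumb.
# The 40% model-FLOPs-utilization (MFU) is an illustrative assumption;
# real runs vary with parallelism strategy and interconnect.

def training_days(params, tokens, gpus, tflops_per_gpu, mfu=0.4):
    total_flops = 6 * params * tokens
    effective = gpus * tflops_per_gpu * 1e12 * mfu  # sustained FLOP/s
    return total_flops / effective / 86_400          # seconds -> days

# Hypothetical run: a 70B-parameter model on 2T tokens across 1,024 H100s
days = training_days(params=70e9, tokens=2e12, gpus=1024, tflops_per_gpu=990)
print(f"{days:.0f} days")  # ~24 days
```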
Nvidia's hardware advantage is significant, but its real moat is software. CUDA (Compute Unified Device Architecture), released in 2006, is Nvidia's parallel computing platform. It provides a C/C++-based programming model for writing GPU kernels, a compiler and profiling toolchain, and a deep stack of optimized libraries — cuDNN for neural network primitives, cuBLAS for dense linear algebra, and NCCL for multi-GPU communication.
When a researcher writes `model.cuda()` in PyTorch, they're invoking an ecosystem that has been built and optimized for nearly two decades. Every layer of the AI software stack — from the training loop to the optimizer to the attention kernel — has been hand-optimized for CUDA. This creates enormous switching costs. Even if a competitor builds better hardware, the software ecosystem doesn't port automatically: developers have to rewrite, retest, and re-optimize their entire stack.
AMD's Instinct accelerators (MI250X, MI300X, MI325X) are competitive with Nvidia on raw hardware specifications. The MI300X, released in late 2023, offers 192GB of HBM3 memory — more than the H100 — and competitive compute throughput. On paper, it should be a viable alternative.
The gap is software. AMD's ROCm (Radeon Open Compute) platform is the CUDA equivalent, but it is years behind in maturity: fewer hand-tuned kernels, a smaller ecosystem of tested libraries and tooling, and a shorter track record in production — though PyTorch now ships official ROCm builds.
AMD is making progress. Meta has publicly invested in MI300X clusters for LLaMA training, providing validation. Microsoft has used AMD GPUs in Azure. But "works" and "works as well as CUDA with the same developer effort" are different things. For now, AMD is the viable second choice rather than an equal competitor.
Intel dominated computing for decades through its x86 CPU architecture, but it was late to AI accelerators. Its Gaudi series (developed by Habana Labs, acquired in 2019 for $2 billion) provides competitive performance-per-dollar for certain workloads, and its "Ponte Vecchio" GPU (Intel Data Center GPU Max) targets HPC and AI. But Intel's market share in AI training is minimal.
Intel's real challenge is that it's fighting on two fronts: against Nvidia for AI accelerators and against TSMC for chip fabrication (Intel is one of the few companies that both designs and fabricates chips, but its fabrication technology has fallen behind TSMC's). The Intel Foundry Services initiative is attempting to become a competitive foundry for external customers, but this is a multi-year, multi-billion-dollar bet with uncertain outcomes.
Taiwan Semiconductor Manufacturing Company (TSMC) is, by some analyses, the most strategically important company in the world. It fabricates chips for Nvidia, AMD, Apple, Google, Qualcomm, and hundreds of other companies. At the leading edge of chip fabrication (sub-5nm process nodes), TSMC has approximately 90% market share.
Some context on what "fabrication" means: chip companies like Nvidia design chips but don't manufacture them. They send their designs to TSMC, which operates fabrication plants (fabs) that cost $10-20 billion each to build. These fabs contain ASML extreme ultraviolet (EUV) lithography machines — the most complex machines ever built, costing roughly $200 million each (nearly $400 million for the newest high-NA models) — that pattern transistors at scales of a few nanometers onto silicon wafers.
Why this matters for AI: virtually every leading-edge AI accelerator — Nvidia's, AMD's, Google's — is fabricated by TSMC, so the industry's entire capacity to produce compute runs through a handful of fabs concentrated in Taiwan.
For large language models, memory is often the binding constraint, not compute. During inference (generating text from a trained model), the model's weights must be loaded from memory for every token generated. A 70-billion parameter model in FP16 requires ~140GB of memory just for the weights. If your GPU has 80GB of HBM, the model doesn't fit on one chip.
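The arithmetic behind that claim is simple enough to sketch (a rough sketch of weight memory alone — activations and KV cache add more on top):

```python
# Weight memory at various precisions, ignoring activations and KV cache.
# Shows why a 70B model in FP16 cannot fit on a single 80GB GPU.

def weight_gb(params, bytes_per_param):
    return params * bytes_per_param / 1e9

params = 70e9
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {weight_gb(params, nbytes):.0f} GB")
# FP16 -> 140 GB: doesn't fit on one 80GB H100. INT4 -> 35 GB: fits easily.
```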
HBM (High Bandwidth Memory) is a specialized type of memory used in AI accelerators. Unlike standard DRAM, HBM stacks multiple memory dies vertically and connects them with through-silicon vias (TSVs), providing dramatically higher bandwidth in a smaller footprint. Each generation has roughly doubled per-stack bandwidth: from HBM2 (~256 GB/s per stack) through HBM2e and HBM3 (~819 GB/s) to HBM3E (over 1 TB/s per stack).
The HBM market is dominated by three companies: SK Hynix (South Korea, ~50% share), Samsung (South Korea), and Micron (United States). SK Hynix has the technology lead — it was the first to mass-produce HBM3E and has a close relationship with Nvidia. HBM supply was a significant bottleneck in 2023-2024, with lead times stretching to 6-12 months.
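Bandwidth, not just capacity, sets the ceiling on inference speed: at batch size 1, every generated token must stream all the weights from HBM. A simplified roofline sketch (an upper bound under that assumption — real systems batch requests and overlap work):

```python
# Batch-1 decoding roofline: peak tokens/sec is roughly
# memory_bandwidth / bytes_of_weights, since each token re-reads all weights.
# Treat this as an upper-bound sketch, not a benchmark.

def max_tokens_per_sec(params, bytes_per_param, bandwidth_gb_s):
    weight_bytes = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# 70B model in FP16 on a GPU with ~3,350 GB/s of HBM bandwidth (H100-class)
print(f"{max_tokens_per_sec(70e9, 2, 3350):.0f} tokens/sec")  # ~24 tokens/sec
```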
Google's TPU (Tensor Processing Unit) is a custom ASIC (Application-Specific Integrated Circuit) designed specifically for neural network workloads. Unlike GPUs, which are general-purpose parallel processors adapted for AI, TPUs are built from the ground up for matrix multiplication and nothing else.
The TPU progression: v1 (2016) handled inference only; v2 (2017) added training support and HBM; v3 (2018) and v4 (2021) scaled up pod sizes and interconnect; v5e and v5p (2023) split the line into cost-optimized and performance-optimized variants; and Trillium (v6, 2024) continued the trajectory.
TPUs are programmed through JAX and TensorFlow rather than CUDA. This means code written for TPUs doesn't run on Nvidia GPUs and vice versa without porting. Google uses TPUs internally for training Gemini and other models, and offers them through Google Cloud.
The strategic logic: by building its own chips, Google reduces dependence on Nvidia and can optimize hardware for its specific workloads. The tradeoff is a smaller developer ecosystem. Most external researchers use Nvidia GPUs, so TPU-specific optimizations stay within Google's ecosystem.
Apple's M-series chips (M1, M2, M3, M4 and their Pro/Max/Ultra variants) take a different approach: unified memory architecture. Instead of separate CPU memory and GPU memory with a slow bus between them, Apple Silicon puts everything in a single pool of memory accessible by both CPU and GPU cores at full bandwidth.
For AI inference, this has a notable advantage: a MacBook Pro with an M4 Max has 128GB of unified memory accessible to the GPU. This is enough to run a 70B parameter model at reasonable speed — something that would require multiple Nvidia GPUs in a traditional setup. The inference isn't as fast as a dedicated data center GPU, but it works on a laptop with no cloud costs.
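A quick fit check shows why quantization is what makes this practical (the usable-memory fraction below is an illustrative assumption, not an Apple specification):

```python
# Rough fit check for unified memory: weights plus a working margin for
# KV cache and the OS. The 0.75 usable fraction is an assumption.

def fits(params, bytes_per_param, unified_gb, usable_fraction=0.75):
    return params * bytes_per_param / 1e9 <= unified_gb * usable_fraction

# 70B model on a 128GB M-series machine:
print(fits(70e9, 2, 128))    # FP16: 140 GB of weights -> False
print(fits(70e9, 0.5, 128))  # 4-bit quantized: ~35 GB -> True
```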
For training, Apple Silicon is not competitive. The GPU cores lack the specialized Tensor Core equivalent at the scale needed for training frontier models, and the total compute throughput is far below a data center GPU. Apple Silicon is an inference platform, not a training platform.
This matters for practitioners: running local inference with tools like llama.cpp or Ollama on an M-series Mac is a viable way to experiment with open-weight models without cloud costs — for example, a Mac mini running qwen2.5:14b through Ollama.
Putting the supply chain together:
| Layer | Key Players | What Could Go Wrong |
|---|---|---|
| Chip Design | Nvidia, AMD, Google, Intel | Design mistakes (rare, high stakes) |
| Fabrication | TSMC (~90%), Samsung | Geopolitical disruption, capacity limits, yield problems |
| EUV Equipment | ASML (100% monopoly) | Export controls, production bottlenecks |
| Memory | SK Hynix, Samsung, Micron | HBM supply constraints, technology transitions |
| Systems | Nvidia (DGX), Dell, Supermicro | Power/cooling capacity, networking bottlenecks |
| Cloud | AWS, Azure, GCP, CoreWeave | GPU availability, pricing, regional capacity |
One entity deserves special mention: ASML (Netherlands). ASML is the sole manufacturer of EUV lithography machines, which are required for fabricating chips at 7nm and below. Without ASML, TSMC cannot manufacture advanced chips. Without TSMC, Nvidia cannot produce GPUs. ASML is the deepest single point of failure in the entire AI supply chain. The company has no competitor at the EUV level — the technology required decades and tens of billions of dollars to develop, and the barriers to entry are effectively insurmountable in the near term.
Training frontier models requires staggering amounts of electricity. A cluster of 10,000 H100 GPUs draws approximately 7-10 megawatts of power — enough for a small town. Cooling these clusters requires additional energy. Data center operators are now the largest buyers of new power capacity in many regions, and some are investing in nuclear power (Microsoft's deal with Constellation Energy for Three Mile Island restart, Amazon's investment in small modular reactors) to secure long-term supply.
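The 7-10 megawatt figure follows directly from per-GPU power draw plus facility overhead (the overhead multiplier below is an illustrative assumption in the spirit of a PUE factor):

```python
# Rough cluster power draw: GPU TDP (~700W for an H100 SXM) plus an
# assumed 1.4x overhead for host CPUs, networking, and cooling.

def cluster_megawatts(gpus, watts_per_gpu=700, overhead=1.4):
    return gpus * watts_per_gpu * overhead / 1e6

print(f"{cluster_megawatts(10_000):.1f} MW")  # ~9.8 MW for 10,000 H100s
```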
This isn't a theoretical concern. New data center construction is being delayed in some regions because the local power grid cannot support the demand. The environmental impact is also significant: AI compute now accounts for a measurable fraction of global electricity consumption, and that fraction is growing rapidly.
The hardware stack tells you who controls the physical infrastructure of AI. But between the chips and the end user sit the hyperscalers — the companies that operate data centers at planetary scale and sell compute as a service. That's the subject of the next chapter: how AI gets deployed, served, and paid for at scale.