The term hyperscaler refers to companies that operate computing infrastructure at a scale where traditional approaches to data center design, networking, and operations break down and must be reinvented. There are effectively three: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Alibaba Cloud and Oracle Cloud occupy significant positions in specific markets (China and enterprise databases, respectively), but the first three dominate the global cloud infrastructure market with a combined ~65% share.1
The scale is difficult to grasp. AWS operates millions of servers across 33 geographic regions. Azure serves 95% of Fortune 500 companies. Google runs data centers that collectively use as much electricity as a medium-sized country. When you call an API from OpenAI, Anthropic, or any other AI provider, your request almost certainly hits one of these three clouds.
Understanding the hyperscalers matters because they are the primary distribution channel for AI. The labs build the models; the hyperscalers deliver them to users. The relationship between the two determines pricing, availability, performance, and ultimately who has access to AI capabilities.
AWS is the largest cloud provider by revenue (~32% market share). It was first to market in 2006 with EC2 (Elastic Compute Cloud), and its first-mover advantage persists. For AI specifically:
AWS's strategy is to be the platform where AI gets deployed, regardless of which model or framework you use. They don't have a frontier foundation model of their own (Titan is not competitive with GPT-4 or Claude), so they compete on infrastructure breadth and enterprise relationships.
Microsoft Azure is the second largest cloud provider (~23% market share), but arguably the most strategically positioned for AI because of its partnership with OpenAI. Key elements:
Google Cloud Platform (GCP) is the third largest (~11% market share) but uniquely positioned, because Google both operates a cloud and builds frontier models (Gemini) and custom hardware (TPUs).
There's a significant gap between "we trained a model" and "a user can call it at scale." Model serving — the infrastructure that turns a trained model into a reliable, fast, cost-efficient API endpoint — is its own engineering discipline.
When a user sends a prompt to Claude or ChatGPT, the response doesn't appear instantly. The model generates tokens one at a time, and each token requires a full forward pass through the model. For a 100-token response from a model with hundreds of billions of parameters, that's 100 sequential forward passes, each involving massive matrix multiplications. The engineering challenge is doing this fast enough that the user perceives a fluid streaming response while serving millions of simultaneous users.
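The sequential structure is the crux: each token depends on all the tokens before it, so the steps cannot be parallelized for a single request. A toy sketch of the decoding loop (the `forward_pass` here is a stand-in function, not any real model):

```python
def forward_pass(token_ids):
    """Stand-in for a full transformer forward pass over the sequence.
    In a real model this is billions of multiply-accumulates; here we
    just compute a deterministic 'next token' so the loop shape is clear."""
    return (sum(token_ids) * 31 + len(token_ids)) % 50000

def generate(prompt_ids, max_new_tokens=100):
    """Autoregressive decoding: each new token depends on every previous
    one, so 100 tokens means 100 sequential forward passes."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = forward_pass(ids)   # one full forward pass per token
        ids.append(next_id)
        yield next_id                 # stream to the user as soon as it's ready

tokens = list(generate([101, 2009, 2003], max_new_tokens=5))
print(len(tokens))
```

Because the output is a generator, tokens can be streamed to the user one at a time — which is why chat interfaces show text appearing incrementally rather than all at once.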
Key techniques:
| Framework | Origin | What It Does |
|---|---|---|
| vLLM | UC Berkeley | PagedAttention, continuous batching. The most widely adopted open-source LLM serving engine. |
| TGI (Text Generation Inference) | Hugging Face | Production-grade server for text generation. Good integration with Hugging Face model hub. |
| TensorRT-LLM | Nvidia | Nvidia's optimized inference engine. Highest performance on Nvidia hardware but Nvidia-specific. |
| llama.cpp | Community (Georgi Gerganov) | CPU and Apple Silicon inference in C/C++. Quantized models on consumer hardware. What Ollama builds on. |
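Continuous batching, the technique vLLM popularized, is worth seeing concretely. In static batching, a batch of requests must all finish before the next batch starts, so short requests wait on long ones. In continuous batching, a finished sequence frees its slot immediately. A toy scheduler (real engines schedule at the GPU-kernel level; this only models the slot accounting):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy scheduler illustrating continuous batching: finished sequences
    free their slot at once, and waiting requests join mid-flight rather
    than waiting for the whole batch to drain.
    `requests` maps request id -> number of tokens it needs."""
    waiting = deque(sorted(requests))
    active = {}              # request id -> tokens still to generate
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots before each decode step.
        while waiting and len(active) < max_batch:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        # One decode step generates one token for every active sequence.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed immediately for the next request
    return steps

# Five requests of uneven length share four slots:
print(continuous_batching({"a": 3, "b": 10, "c": 2, "d": 2, "e": 4}))
```

The short requests ("c", "d") finish early and their slots are reused for "e" while "b" is still generating — total decode steps equal the longest request, not the sum of batch rounds.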
How companies actually use AI is often different from the demo-stage projects that get media coverage. Enterprise adoption follows a fairly predictable progression:
Someone in the company starts using ChatGPT or Claude for their work. This happens informally, often without IT approval. The company then faces a choice: ban it (and watch people use it anyway on personal devices) or provide a sanctioned option. Most companies choose the latter, often through an enterprise agreement with OpenAI, Anthropic, or through their existing cloud provider (Azure OpenAI Service, AWS Bedrock).
The first production applications are typically internal: document summarization, code assistance, internal search over company knowledge bases (this is where RAG enters the picture), draft generation for reports or emails. These are low-risk because mistakes affect internal users, not customers.
External deployment requires more guardrails: content filtering, monitoring, fallback mechanisms, legal review. Customer support chatbots, product recommendations, and content generation are common first customer-facing applications. The key concern shifts from "does it work?" to "what happens when it fails?"
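The shape of those guardrails can be sketched as a wrapper around the raw model call. This is a minimal illustration — production systems use trained classifier models and policy engines, not keyword lists — but the three checkpoints (input filter, fallback, output filter) are the standard pattern:

```python
def guarded_completion(prompt, model_call, blocklist=("ssn", "password")):
    """Minimal sketch of customer-facing guardrails: input filtering,
    an error fallback, and output checking. `model_call` is any function
    that takes a prompt string and returns a response string."""
    # 1. Input filter: refuse prompts that trip a policy check.
    if any(term in prompt.lower() for term in blocklist):
        return {"status": "refused", "text": "I can't help with that request."}
    # 2. Call the model, degrading gracefully instead of surfacing errors.
    try:
        text = model_call(prompt)
    except Exception:
        return {"status": "fallback", "text": "Please try again or contact support."}
    # 3. Output filter: scan the response before it reaches the customer.
    if any(term in text.lower() for term in blocklist):
        return {"status": "filtered", "text": "[response withheld by content filter]"}
    return {"status": "ok", "text": text}

print(guarded_completion("reset my widget", lambda p: "Sure, here's how...")["status"])
```

Note that the fallback path answers the "what happens when it fails?" question directly: the customer sees a degraded-but-safe response, never a stack trace or an unfiltered model error.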
AI becomes embedded in critical business processes — not as a standalone tool but as a component of existing systems. Examples: automated underwriting in insurance, medical record summarization in healthcare, contract analysis in legal. This phase requires the deepest technical integration and the most careful evaluation.
Retrieval-Augmented Generation (RAG) deserves special mention here because it's the dominant pattern for enterprise AI deployment. The next chapter covers RAG in technical depth, but the enterprise context matters for understanding why it's so prevalent.
The fundamental problem: a foundation model knows what was in its training data, but enterprises need it to know their proprietary information — internal documents, product catalogs, customer records, compliance policies. You have three options:
RAG dominates because it's the pragmatic choice: it works with any foundation model, requires no model training expertise, and handles dynamic data. The engineering challenges are in the retrieval step — finding the right documents from potentially millions of candidates — which is exactly the problem the information retrieval community (Chapter 16) has been studying for decades. Your work on optimizing retrieval weights sits squarely in this space.
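A toy version of that retrieval step makes the problem shape concrete. Production systems use learned embeddings and approximate nearest-neighbor search rather than the bag-of-words scoring below, but the task is the same: rank candidates against a query, keep the top k, and paste them into the model's prompt.

```python
import math
from collections import Counter

def retrieve(query, documents, k=2):
    """Toy retriever: score each document by term overlap with the query,
    weighted by inverse document frequency so rare terms count more."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    def score(toks):
        return sum(math.log(n / df[t]) for t in query.lower().split() if t in toks)
    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [documents[i] for i in ranked[:k]]

docs = [
    "refund policy covers refunds within 30 days of purchase",
    "shipping times vary by region",
    "our office dog is named biscuit",
]
print(retrieve("what is the refund policy", docs, k=1))
```

The common query words ("what", "is", "the") contribute nothing because they match no document or every document; the rare terms ("refund", "policy") decide the ranking — the same intuition that TF-IDF and BM25 formalize.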
Not all AI deployment happens in the cloud. Some organizations run models on their own hardware, in their own data centers. The decision depends on several factors:
| Factor | Cloud | On-Premise |
|---|---|---|
| Data sensitivity | Data leaves your network (enterprise agreements mitigate this) | Data never leaves your network |
| Cost at scale | Usage-based; can be expensive at high volume | High upfront, lower marginal cost at volume |
| Flexibility | Scale up/down instantly | Fixed capacity; over-provisioning or under-provisioning |
| Latency | Network round-trip to cloud region | Local network latency only |
| Regulatory | Data residency requirements may constrain region choices | Full control over data location |
| Expertise required | Managed services reduce operational burden | Need GPU infrastructure expertise (rare, expensive) |
Industries with strict data regulations — finance, healthcare, government, defense — often prefer on-premise or private cloud deployment. The open-weight model movement (Chapter 17) enables this: you can download LLaMA or Mistral, deploy it on your own hardware, and no data ever leaves your network. This is one of the strongest practical arguments for open weights.
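Sizing the hardware for such a deployment starts with a back-of-envelope memory calculation: weight memory is parameter count times bytes per parameter, which is where quantization (as used by llama.cpp above) earns its keep. The figures below are illustrative, and real deployments need additional headroom for the KV cache and activations:

```python
def model_memory_gb(params_billions, bits_per_weight):
    """Back-of-envelope weight memory for serving a model on your own
    hardware: parameter count times bytes per parameter. Ignores KV cache
    and activation memory, which add a meaningful margin on top."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

# A 70B-parameter open-weight model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(70, bits):.0f} GB")
```

At 16-bit precision a 70B model needs ~140 GB just for weights — multiple data-center GPUs — while 4-bit quantization brings it to ~35 GB, within reach of a single high-memory accelerator or a well-equipped workstation. This arithmetic is much of why quantized open-weight models changed the on-premise calculus.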
Beyond the Big Three, a new category of cloud provider has emerged: companies that specialize in GPU compute for AI workloads.
These companies exist because GPU demand has outstripped what the hyperscalers can supply. When AWS tells a startup they'll have H100 access in 6 months, CoreWeave can often deliver in weeks. The question is whether these GPU-focused clouds survive as the hyperscalers expand capacity, or whether they're a temporary phenomenon born of supply constraints.
AI compute is expensive, and the cost structure is worth understanding.
Training a frontier model costs $50-200 million in compute alone. This is why only a handful of organizations can train frontier models — you need either the capital (OpenAI, Anthropic via investors) or the existing infrastructure (Google, Meta). Training costs scale roughly with model size and dataset size, following the scaling laws discussed in earlier chapters.
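A common rule of thumb makes these numbers inspectable: training takes roughly 6 FLOPs per parameter per training token (forward plus backward pass). Every input in the sketch below is an illustrative assumption, not any lab's actual figure, but the output lands inside the range quoted above:

```python
def training_cost_usd(params, tokens, flops_per_sec, utilization,
                      usd_per_gpu_hour, gpus):
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token,
    divided by sustained cluster throughput, priced per GPU-hour.
    All inputs are illustrative assumptions."""
    total_flops = 6 * params * tokens
    cluster_flops = flops_per_sec * utilization * gpus   # sustained, cluster-wide
    seconds = total_flops / cluster_flops
    gpu_hours = seconds / 3600 * gpus
    return gpu_hours * usd_per_gpu_hour

# e.g. a 400B-parameter model on 15T tokens, 10,000 GPUs at ~1e15 FLOP/s
# peak each, 40% sustained utilization, $3 per GPU-hour:
cost = training_cost_usd(400e9, 15e12, 1e15, 0.40, 3.0, 10_000)
print(f"${cost / 1e6:.0f}M")
```

The same formula explains the scaling-law pressure: doubling either parameter count or token count doubles compute cost, so frontier training runs grow roughly as fast as the capital available to fund them.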
Inference is cheaper per query but more expensive in aggregate because inference runs continuously for millions of users. As of early 2025, representative API pricing:
For context: a million tokens is roughly 750,000 words, or about 10 novels. A single ChatGPT conversation might use 2,000-10,000 tokens. The math at scale: a company processing 10 million API calls per day at 1,000 tokens each would spend roughly $30,000-150,000 per day on inference alone, depending on the model.
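That daily figure is simple arithmetic worth spelling out. The two per-million-token rates below are illustrative stand-ins for a cheap and an expensive model tier, matching the range in the text:

```python
def daily_inference_cost(calls_per_day, tokens_per_call, usd_per_million_tokens):
    """Total tokens per day times the per-million-token rate.
    Rates are illustrative, not any provider's actual pricing."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1e6 * usd_per_million_tokens

calls, tokens = 10_000_000, 1_000   # 10B tokens per day
print(daily_inference_cost(calls, tokens, 3.0))    # cheaper tier:  30000.0
print(daily_inference_cost(calls, tokens, 15.0))   # pricier tier: 150000.0
```

At 10 billion tokens a day, a $1-per-million-token difference between models is $10,000 a day — which is why model selection and routing cheap queries to cheap models is itself an optimization discipline.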
This is why inference optimization (quantization, caching, batching, speculative decoding) is such an active area: every percentage-point improvement in inference efficiency translates directly into cost savings at scale. It's also why subscription models like ChatGPT Plus ($20/month for effectively unlimited use) are loss leaders — OpenAI is subsidizing heavy users to build market share and collect feedback data.
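Caching is the simplest of those optimizations to illustrate. The sketch below is an exact-match response cache — real deployments go further with prefix/KV caching and semantic caches — but even this captures the savings on repeated queries:

```python
import hashlib

class PromptCache:
    """Minimal exact-match response cache: identical prompts skip the
    model entirely. `model_call` is any prompt -> response function."""
    def __init__(self, model_call):
        self.model_call = model_call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.store:
            self.hits += 1                     # served for free, no GPU time
            return self.store[key]
        self.misses += 1
        result = self.model_call(prompt)       # the expensive part
        self.store[key] = result
        return result

cache = PromptCache(lambda p: f"answer to: {p}")
for q in ["reset password", "reset password", "billing help"]:
    cache.complete(q)
print(cache.hits, cache.misses)
```

For workloads like customer support, where a small set of questions dominates traffic, hit rates can be substantial — and every cache hit is a forward pass that never touches a GPU.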
The hyperscalers provide the infrastructure; the enterprise adoption patterns show how AI gets integrated into real organizations. But what are people actually building with these models? The final chapter in this part of the guide covers the applied landscape: RAG, tool use, agentic systems, and the emerging patterns that define how AI gets used in practice. That's where your own work — building agents, optimizing retrieval, deploying autonomous systems — fits into the bigger picture.