Part IV — The Ecosystem

Hyperscalers and Enterprise AI

AWS, Azure, GCP, and how AI moves from research demos to production systems that serve millions of users.

What "Hyperscaler" Means

The term hyperscaler refers to companies that operate computing infrastructure at a scale where traditional approaches to data center design, networking, and operations break down and must be reinvented. There are effectively three: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Alibaba Cloud and Oracle Cloud occupy significant positions in specific markets (China and enterprise databases, respectively), but the first three dominate the global cloud infrastructure market with a combined ~65% share.

The scale is difficult to grasp. AWS operates millions of servers across 33 geographic regions. Azure serves 95% of Fortune 500 companies. Google runs data centers that collectively use as much electricity as a medium-sized country. When you call an API from OpenAI, Anthropic, or any other AI provider, your request almost certainly hits one of these three clouds.

Understanding the hyperscalers matters because they are the primary distribution channel for AI. The labs build the models; the hyperscalers deliver them to users. The relationship between the two determines pricing, availability, performance, and ultimately who has access to AI capabilities.

The Big Three

Amazon Web Services (AWS)

The largest cloud provider by revenue (~32% market share). AWS was first to market in 2006 with EC2 (Elastic Compute Cloud), and its first-mover advantage persists. For AI specifically:

AWS's strategy is to be the platform where AI gets deployed, regardless of which model or framework you use. They don't have a frontier foundation model of their own (Titan is not competitive with GPT-4 or Claude), so they compete on infrastructure breadth and enterprise relationships.

Microsoft Azure

Second largest cloud provider (~23% market share), but arguably the most strategically positioned for AI because of its partnership with OpenAI: Azure resells OpenAI's models as the Azure OpenAI Service and embeds them across Microsoft's product line as Copilot.

Google Cloud Platform (GCP)

Third largest (~11% market share), but uniquely positioned: Google both operates a cloud and builds its own frontier models (Gemini) and custom hardware (TPUs).

How AI Reaches End Users — the delivery chain from model to application:

  1. Foundation models: OpenAI (GPT-4), Anthropic (Claude), Google (Gemini), Meta (LLaMA), Mistral
  2. Hyperscaler platforms: AWS (Bedrock, SageMaker), Azure (Azure OpenAI, Copilot), GCP (Vertex AI, TPU Cloud)
  3. Serving infrastructure: vLLM, TGI, TensorRT-LLM, Triton — batching, caching, load balancing, autoscaling
  4. Application layer: direct API calls, RAG pipelines, agent frameworks, fine-tuned models, embedded AI features
  5. End users: developers, enterprises, consumers

How Models Get Deployed

There's a significant gap between "we trained a model" and "a user can call it at scale." Model serving — the infrastructure that turns a trained model into a reliable, fast, cost-efficient API endpoint — is its own engineering discipline.

The Inference Challenge

When a user sends a prompt to Claude or ChatGPT, the response doesn't appear instantly. The model generates tokens one at a time, where each token requires a full forward pass through the model. For a 100-token response from a model with hundreds of billions of parameters, that's hundreds of sequential operations, each involving massive matrix multiplications. The engineering challenge is doing this fast enough that the user perceives a fluid streaming response while serving millions of simultaneous users.
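
The sequential loop described above can be sketched with a toy stand-in for the model (ToyModel and its scoring rule are illustrative; a real forward pass runs large matrix multiplications over billions of parameters):

```python
EOS = 0  # toy end-of-sequence token id

class ToyModel:
    """Stand-in for a transformer. forward() returns per-position
    scores over a 10-token vocabulary; each call represents one
    full, expensive forward pass."""
    def forward(self, tokens):
        return [[1.0 if v == (t + 1) % 10 else 0.0 for v in range(10)]
                for t in tokens]

def generate(model, prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)        # full forward pass per token
        scores = logits[-1]                   # only the last position matters
        nxt = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens

print(generate(ToyModel(), [3]))  # → [3, 4, 5, 6, 7, 8]
```

Note that each iteration depends on the previous one, which is why generation is inherently sequential and why latency grows with response length.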

Key techniques: continuous batching (grouping many users' requests into one GPU pass and admitting new requests as others finish), KV caching (reusing attention computations from earlier tokens instead of recomputing them), quantization (running weights at lower numerical precision), and speculative decoding (a small draft model proposes tokens that the large model verifies in parallel).

Serving Frameworks

Framework | Origin | What It Does
vLLM | UC Berkeley | PagedAttention, continuous batching. The most widely adopted open-source LLM serving engine.
TGI (Text Generation Inference) | Hugging Face | Production-grade server for text generation. Good integration with the Hugging Face model hub.
TensorRT-LLM | Nvidia | Nvidia's optimized inference engine. Highest performance on Nvidia hardware, but Nvidia-specific.
llama.cpp | Community (Georgi Gerganov) | CPU and Apple Silicon inference in C/C++. Runs quantized models on consumer hardware. What Ollama builds on.
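
The continuous batching idea behind engines like vLLM and TGI can be illustrated with a toy scheduler (request lengths and the step model are simplifications; real engines schedule at the token level with KV-cache management):

```python
from collections import deque

def continuous_batching(requests, batch_size=2):
    """Toy scheduler. Each request is (id, n_tokens) and needs one
    decode step per token. Finished sequences are evicted immediately
    and waiting requests join mid-flight, keeping batch slots full."""
    waiting = deque(requests)
    running = {}            # request id -> tokens still to generate
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:   # admit new work
            rid, n_tokens = waiting.popleft()
            running[rid] = n_tokens
        steps += 1                                     # one decode step
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:                      # evict on finish
                del running[rid]
    return steps

# Uneven lengths: the short request finishes early and frees its slot
print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], batch_size=2))  # → 3
```

A static batch of the same three requests would take 5 steps (max(3, 1) for the first pair, then 2 for the last request); the continuous scheduler keeps both slots busy and finishes in 3.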

Enterprise Adoption Patterns

How companies actually use AI is often different from the demo-stage projects that get media coverage. Enterprise adoption follows a fairly predictable progression:

Phase 1: Experimentation

Someone in the company starts using ChatGPT or Claude for their work. This happens informally, often without IT approval. The company then faces a choice: ban it (and watch people use it anyway on personal devices) or provide a sanctioned option. Most companies choose the latter, often through an enterprise agreement with OpenAI, Anthropic, or through their existing cloud provider (Azure OpenAI Service, AWS Bedrock).

Phase 2: Internal Tools

The first production applications are typically internal: document summarization, code assistance, internal search over company knowledge bases (this is where RAG enters the picture), draft generation for reports or emails. These are low-risk because mistakes affect internal users, not customers.

Phase 3: Customer-Facing Applications

External deployment requires more guardrails: content filtering, monitoring, fallback mechanisms, legal review. Customer support chatbots, product recommendations, and content generation are common first customer-facing applications. The key concern shifts from "does it work?" to "what happens when it fails?"

Phase 4: Core Workflow Integration

AI becomes embedded in critical business processes — not as a standalone tool but as a component of existing systems. Examples: automated underwriting in insurance, medical record summarization in healthcare, contract analysis in legal. This phase requires the deepest technical integration and the most careful evaluation.

Key idea: Most enterprise AI is not about training custom models. It's about deploying existing models with the right data, guardrails, and integration. The term for this is AI engineering — building reliable systems with pre-trained models as components. This is distinct from ML engineering (training models) and ML research (developing new architectures). AI engineering is where most of the jobs and most of the practical value creation are in 2025.

The RAG Pipeline in Enterprise

Retrieval-Augmented Generation (RAG) deserves special mention here because it's the dominant pattern for enterprise AI deployment. The next chapter covers RAG in technical depth, but the enterprise context matters for understanding why it's so prevalent.

The fundamental problem: a foundation model knows what was in its training data, but enterprises need it to know their proprietary information — internal documents, product catalogs, customer records, compliance policies. You have three options:

  1. Fine-tuning: Train the model further on your data. Expensive, requires ML expertise, risks catastrophic forgetting (the model loses general capabilities), and needs to be repeated when data changes.
  2. RAG: At query time, search your data for relevant documents, inject them into the model's context window alongside the user's question, and let the model synthesize a response. No retraining required. Data can be updated in real-time. Much cheaper.
  3. Long context: Just put all the relevant data in the prompt. Works for small datasets but doesn't scale — context windows are large but finite, and cost scales linearly with context length.
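
A minimal sketch of option 2, using word overlap as a stand-in for real retrieval (production systems use BM25 or embedding similarity, and the prompt template here is illustrative):

```python
def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    Real retrieval uses BM25 or embedding similarity instead."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, documents):
    # Inject only the retrieved snippets, not the whole corpus
    context = "\n".join(retrieve(query, documents))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refund policy: purchases can be returned within 30 days.",
    "Shipping: orders ship within 2 business days.",
    "Warranty: hardware is covered for one year.",
]
print(build_prompt("What is the refund policy?", docs))
```

The resulting prompt goes to any foundation model unchanged — which is why RAG works without retraining and why updating the answer is as simple as updating the documents.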

RAG dominates because it's the pragmatic choice: it works with any foundation model, requires no model training expertise, and handles dynamic data. The engineering challenges are in the retrieval step — finding the right documents from potentially millions of candidates — which is exactly the problem the information retrieval community (Chapter 16) has been studying for decades. Your work on optimizing retrieval weights sits squarely in this space.

On-Premise vs. Cloud

Not all AI deployment happens in the cloud. Some organizations run models on their own hardware, in their own data centers. The decision depends on several factors:

Factor | Cloud | On-Premise
Data sensitivity | Data leaves your network (enterprise agreements mitigate this) | Data never leaves your network
Cost at scale | Usage-based; can be expensive at high volume | High upfront cost, lower marginal cost at volume
Flexibility | Scale up or down instantly | Fixed capacity; risk of over- or under-provisioning
Latency | Network round-trip to a cloud region | Local network latency only
Regulatory | Data residency requirements may constrain region choices | Full control over data location
Expertise required | Managed services reduce operational burden | Requires GPU infrastructure expertise (rare and expensive)
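
The cost trade-off can be made concrete with a back-of-envelope break-even calculation (all dollar figures are illustrative assumptions, not vendor quotes):

```python
def breakeven_months(upfront_hw, onprem_monthly_opex, cloud_monthly):
    """Months until cumulative on-prem spend drops below cloud spend.
    Assumes constant, fully utilized demand -- the case that favors
    on-prem; bursty demand favors the cloud's elasticity."""
    months = 0
    onprem, cloud = float(upfront_hw), 0.0
    while onprem >= cloud:
        months += 1
        onprem += onprem_monthly_opex
        cloud += cloud_monthly
    return months

# Illustrative: $250k of GPUs plus $5k/month power and ops,
# versus renting equivalent cloud capacity at $30k/month
print(breakeven_months(250_000, 5_000, 30_000))  # → 11
```

Under these assumptions on-prem becomes cheaper from month 11 onward — which is why the decision hinges on utilization: idle owned hardware still costs money, while idle cloud capacity costs nothing.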

Industries with strict data regulations — finance, healthcare, government, defense — often prefer on-premise or private cloud deployment. The open-weight model movement (Chapter 17) enables this: you can download LLaMA or Mistral, deploy it on your own hardware, and no data ever leaves your network. This is one of the strongest practical arguments for open weights.

The GPU Cloud Startups

Beyond the Big Three, a new category of cloud provider has emerged: companies that specialize in GPU compute for AI workloads.

These companies exist because GPU demand has outstripped what the hyperscalers can supply. When AWS tells a startup they'll have H100 access in 6 months, CoreWeave can often deliver in weeks. The question is whether these GPU-focused clouds survive as the hyperscalers expand capacity, or whether they're a temporary phenomenon born of supply constraints.

The Economics

AI compute is expensive, and the cost structure is worth understanding.

Training Costs

Training a frontier model costs $50-200 million in compute alone. This is why only a handful of organizations can train frontier models — you need either the capital (OpenAI, Anthropic via investors) or the existing infrastructure (Google, Meta). Training costs scale roughly with model size and dataset size, following the scaling laws discussed in earlier chapters.
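
The "scales with model size and dataset size" relationship can be turned into a rough estimate with the widely used ~6ND FLOPs heuristic for transformer training (N parameters, D tokens). The parameter count, token count, hardware throughput, and price below are illustrative assumptions, not any lab's disclosed figures:

```python
def training_cost_usd(n_params, n_tokens, flops_per_gpu, utilization,
                      usd_per_gpu_hour):
    """Estimate compute cost with the ~6*N*D FLOPs heuristic
    for transformer training (forward + backward pass)."""
    total_flops = 6 * n_params * n_tokens
    gpu_hours = total_flops / (flops_per_gpu * utilization) / 3600
    return gpu_hours * usd_per_gpu_hour

# Illustrative: 1T params, 15T tokens, 1e15 FLOP/s per GPU at 40%
# utilization, $2.50 per GPU-hour
cost = training_cost_usd(1e12, 15e12, 1e15, 0.40, 2.50)
print(f"${cost / 1e6:.0f}M")  # → $156M
```

Even a crude estimate like this lands in the stated $50-200 million range, and it makes the levers visible: halving either model size or token count halves the bill, as does doubling hardware utilization.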

Inference Costs

Inference is cheaper per query but more expensive in aggregate because inference runs continuously for millions of users. As of early 2025, representative API pricing ranged from a few dollars per million tokens for smaller models to tens of dollars for frontier models, with output tokens typically priced higher than input tokens.

For context: a million tokens is roughly 750,000 words, or about 10 novels. A single ChatGPT conversation might use 2,000-10,000 tokens. The math at scale: a company processing 10 million API calls per day at 1,000 tokens each would spend roughly $30,000-150,000 per day on inference alone, depending on the model.
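
The arithmetic behind that estimate is worth making explicit (the per-million-token prices are illustrative):

```python
def daily_inference_cost(calls_per_day, tokens_per_call,
                         usd_per_million_tokens):
    """Daily API spend: total tokens times the per-token rate."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1e6 * usd_per_million_tokens

# 10M calls/day at 1,000 tokens each = 10B tokens/day
for price in (3, 15):   # illustrative $/million tokens, cheap vs premium
    print(f"${daily_inference_cost(10e6, 1000, price):,.0f}/day")
# → $30,000/day and $150,000/day
```

The cost is linear in every input, so volume growth translates directly into spend — there is no economy of scale on the buyer's side of an API.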

This is why inference optimization (quantization, caching, batching, speculative decoding) is such an active area. Every percentage improvement in inference efficiency translates directly to cost savings at scale. It's also why subscription models like ChatGPT Plus ($20/month for effectively unlimited use) are loss leaders — OpenAI is subsidizing heavy users to build market share and collect feedback data.


The hyperscalers provide the infrastructure; the enterprise adoption patterns show how AI gets integrated into real organizations. But what are people actually building with these models? The final chapter in this part of the guide covers the applied landscape: RAG, tool use, agentic systems, and the emerging patterns that define how AI gets used in practice. That's where your own work — building agents, optimizing retrieval, deploying autonomous systems — fits into the bigger picture.