The term hyperscaler refers to companies that operate computing infrastructure at a scale where traditional approaches to data center design, networking, and operations break down and must be reinvented. There are effectively three: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Alibaba Cloud and Oracle Cloud occupy significant positions in specific markets (China and enterprise databases, respectively), but the first three dominate the global cloud infrastructure market with a combined ~65% share.1
The scale is difficult to grasp. AWS operates millions of servers across 33 geographic regions. Azure serves 95% of Fortune 500 companies. Google runs data centers that collectively use as much electricity as a medium-sized country. When you call an API from OpenAI, Anthropic, or any other AI provider, your request almost certainly hits one of these three clouds.
Understanding the hyperscalers matters because they are the primary distribution channel for AI. The labs build the models; the hyperscalers deliver them to users. The relationship between the two determines pricing, availability, performance, and ultimately who has access to AI capabilities.
AWS is the largest cloud provider by revenue (~32% market share). It was first to market in 2006 with EC2 (Elastic Compute Cloud), and its first-mover advantage persists. For AI specifically:
AWS's strategy is to be the platform where AI gets deployed, regardless of which model or framework you use. They don't have a frontier foundation model of their own (Titan is not competitive with GPT-4 or Claude), so they compete on infrastructure breadth and enterprise relationships.
Microsoft Azure is the second largest cloud provider (~23% market share), but arguably the most strategically positioned for AI because of its partnership with OpenAI. Key elements:
Google Cloud Platform (GCP) is the third largest (~11% market share) but uniquely positioned, because Google both operates a cloud and builds frontier models (Gemini) and custom hardware (TPUs).
There's a significant gap between "we trained a model" and "a user can call it at scale." Model serving — the infrastructure that turns a trained model into a reliable, fast, cost-efficient API endpoint — is its own engineering discipline.
When a user sends a prompt to Claude or ChatGPT, the response doesn't appear instantly. The model generates tokens one at a time, and each token requires a full forward pass through the model. For a 100-token response from a model with hundreds of billions of parameters, that's 100 sequential forward passes, each involving massive matrix multiplications. The engineering challenge is doing this fast enough that the user perceives a fluid streaming response while serving millions of simultaneous users.
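The sequential structure is the crux: each token depends on all the tokens before it, so the steps cannot be parallelized for a single request. A toy sketch of the decoding loop (the `forward_pass` here is a stand-in function, not any real model):

```python
def forward_pass(token_ids):
    """Stand-in for a full transformer forward pass over the sequence.
    In a real model this is billions of multiply-accumulates; here we
    just compute a deterministic 'next token' so the loop shape is clear."""
    return (sum(token_ids) * 31 + len(token_ids)) % 50000

def generate(prompt_ids, max_new_tokens=100):
    """Autoregressive decoding: each new token depends on every previous
    one, so 100 tokens means 100 sequential forward passes."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = forward_pass(ids)   # one full forward pass per token
        ids.append(next_id)
        yield next_id                 # stream to the user as soon as it's ready

tokens = list(generate([101, 2009, 2003], max_new_tokens=5))
print(len(tokens))
```

Because the output is a generator, tokens can be streamed to the user one at a time — which is why chat interfaces show text appearing incrementally rather than all at once.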
Key techniques:
| Framework | Origin | What It Does |
|---|---|---|
| vLLM | UC Berkeley | PagedAttention, continuous batching. The most widely adopted open-source LLM serving engine. |
| TGI (Text Generation Inference) | Hugging Face | Production-grade server for text generation. Good integration with Hugging Face model hub. |
| TensorRT-LLM | Nvidia | Nvidia's optimized inference engine. Highest performance on Nvidia hardware but Nvidia-specific. |
| llama.cpp | Community (Georgi Gerganov) | CPU and Apple Silicon inference in C/C++. Quantized models on consumer hardware. What Ollama builds on. |
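Continuous batching, the technique vLLM popularized, is worth seeing concretely. In static batching, a batch of requests must all finish before the next batch starts, so short requests wait on long ones. In continuous batching, a finished sequence frees its slot immediately. A toy scheduler (real engines schedule at the GPU-kernel level; this only models the slot accounting):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy scheduler illustrating continuous batching: finished sequences
    free their slot at once, and waiting requests join mid-flight rather
    than waiting for the whole batch to drain.
    `requests` maps request id -> number of tokens it needs."""
    waiting = deque(sorted(requests))
    active = {}              # request id -> tokens still to generate
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots before each decode step.
        while waiting and len(active) < max_batch:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        # One decode step generates one token for every active sequence.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed immediately for the next request
    return steps

# Five requests of uneven length share four slots:
print(continuous_batching({"a": 3, "b": 10, "c": 2, "d": 2, "e": 4}))
```

The short requests ("c", "d") finish early and their slots are reused for "e" while "b" is still generating — total decode steps equal the longest request, not the sum of batch rounds.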
How companies actually use AI is often different from the demo-stage projects that get media coverage. Enterprise adoption follows a fairly predictable progression:
Someone in the company starts using ChatGPT or Claude for their work. This happens informally, often without IT approval. The company then faces a choice: ban it (and watch people use it anyway on personal devices) or provide a sanctioned option. Most companies choose the latter, often through an enterprise agreement with OpenAI, Anthropic, or through their existing cloud provider (Azure OpenAI Service, AWS Bedrock).
The first production applications are typically internal: document summarization, code assistance, internal search over company knowledge bases (this is where RAG enters the picture), draft generation for reports or emails. These are low-risk because mistakes affect internal users, not customers.
External deployment requires more guardrails: content filtering, monitoring, fallback mechanisms, legal review. Customer support chatbots, product recommendations, and content generation are common first customer-facing applications. The key concern shifts from "does it work?" to "what happens when it fails?"
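The shape of those guardrails can be sketched as a wrapper around the raw model call. This is a minimal illustration — production systems use trained classifier models and policy engines, not keyword lists — but the three checkpoints (input filter, fallback, output filter) are the standard pattern:

```python
def guarded_completion(prompt, model_call, blocklist=("ssn", "password")):
    """Minimal sketch of customer-facing guardrails: input filtering,
    an error fallback, and output checking. `model_call` is any function
    that takes a prompt string and returns a response string."""
    # 1. Input filter: refuse prompts that trip a policy check.
    if any(term in prompt.lower() for term in blocklist):
        return {"status": "refused", "text": "I can't help with that request."}
    # 2. Call the model, degrading gracefully instead of surfacing errors.
    try:
        text = model_call(prompt)
    except Exception:
        return {"status": "fallback", "text": "Please try again or contact support."}
    # 3. Output filter: scan the response before it reaches the customer.
    if any(term in text.lower() for term in blocklist):
        return {"status": "filtered", "text": "[response withheld by content filter]"}
    return {"status": "ok", "text": text}

print(guarded_completion("reset my widget", lambda p: "Sure, here's how...")["status"])
```

Note that the fallback path answers the "what happens when it fails?" question directly: the customer sees a degraded-but-safe response, never a stack trace or an unfiltered model error.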
AI becomes embedded in critical business processes — not as a standalone tool but as a component of existing systems. Examples: automated underwriting in insurance, medical record summarization in healthcare, contract analysis in legal. This phase requires the deepest technical integration and the most careful evaluation.
Retrieval-Augmented Generation (RAG) deserves special mention here because it's the dominant pattern for enterprise AI deployment. The next chapter covers RAG in technical depth, but the enterprise context matters for understanding why it's so prevalent.
The fundamental problem: a foundation model knows what was in its training data, but enterprises need it to know their proprietary information — internal documents, product catalogs, customer records, compliance policies. You have three options:
RAG dominates because it's the pragmatic choice: it works with any foundation model, requires no model training expertise, and handles dynamic data. The engineering challenges are in the retrieval step — finding the right documents from potentially millions of candidates — which is exactly the problem the information retrieval community (Chapter 16) has been studying for decades. Your work on optimizing retrieval weights sits squarely in this space.
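A toy version of that retrieval step makes the problem shape concrete. Production systems use learned embeddings and approximate nearest-neighbor search rather than the bag-of-words scoring below, but the task is the same: rank candidates against a query, keep the top k, and paste them into the model's prompt.

```python
import math
from collections import Counter

def retrieve(query, documents, k=2):
    """Toy retriever: score each document by term overlap with the query,
    weighted by inverse document frequency so rare terms count more."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    def score(toks):
        return sum(math.log(n / df[t]) for t in query.lower().split() if t in toks)
    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [documents[i] for i in ranked[:k]]

docs = [
    "refund policy covers refunds within 30 days of purchase",
    "shipping times vary by region",
    "our office dog is named biscuit",
]
print(retrieve("what is the refund policy", docs, k=1))
```

The common query words ("what", "is", "the") contribute nothing because they match no document or every document; the rare terms ("refund", "policy") decide the ranking — the same intuition that TF-IDF and BM25 formalize.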
Not all AI deployment happens in the cloud. Some organizations run models on their own hardware, in their own data centers. The decision depends on several factors:
| Factor | Cloud | On-Premise |
|---|---|---|
| Data sensitivity | Data leaves your network (enterprise agreements mitigate this) | Data never leaves your network |
| Cost at scale | Usage-based; can be expensive at high volume | High upfront, lower marginal cost at volume |
| Flexibility | Scale up/down instantly | Fixed capacity; over-provisioning or under-provisioning |
| Latency | Network round-trip to cloud region | Local network latency only |
| Regulatory | Data residency requirements may constrain region choices | Full control over data location |
| Expertise required | Managed services reduce operational burden | Need GPU infrastructure expertise (rare, expensive) |
Industries with strict data regulations — finance, healthcare, government, defense — often prefer on-premise or private cloud deployment. The open-weight model movement (Chapter 17) enables this: you can download LLaMA or Mistral, deploy it on your own hardware, and no data ever leaves your network. This is one of the strongest practical arguments for open weights.
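Sizing the hardware for such a deployment starts with a back-of-envelope memory calculation: weight memory is parameter count times bytes per parameter, which is where quantization (as used by llama.cpp above) earns its keep. The figures below are illustrative, and real deployments need additional headroom for the KV cache and activations:

```python
def model_memory_gb(params_billions, bits_per_weight):
    """Back-of-envelope weight memory for serving a model on your own
    hardware: parameter count times bytes per parameter. Ignores KV cache
    and activation memory, which add a meaningful margin on top."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

# A 70B-parameter open-weight model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(70, bits):.0f} GB")
```

At 16-bit precision a 70B model needs ~140 GB just for weights — multiple data-center GPUs — while 4-bit quantization brings it to ~35 GB, within reach of a single high-memory accelerator or a well-equipped workstation. This arithmetic is much of why quantized open-weight models changed the on-premise calculus.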
Beyond the Big Three, a new category of cloud provider has emerged: companies that specialize in GPU compute for AI workloads.
These companies exist because GPU demand has outstripped what the hyperscalers can supply. When AWS tells a startup they'll have H100 access in 6 months, CoreWeave can often deliver in weeks. The question is whether these GPU-focused clouds survive as the hyperscalers expand capacity, or whether they're a temporary phenomenon born of supply constraints.
AI compute is expensive, and the cost structure is worth understanding.
Training a frontier model costs $50-200 million in compute alone. This is why only a handful of organizations can train frontier models — you need either the capital (OpenAI, Anthropic via investors) or the existing infrastructure (Google, Meta). Training costs scale roughly with model size and dataset size, following the scaling laws discussed in earlier chapters.
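A common rule of thumb makes these numbers inspectable: training takes roughly 6 FLOPs per parameter per training token (forward plus backward pass). Every input in the sketch below is an illustrative assumption, not any lab's actual figure, but the output lands inside the range quoted above:

```python
def training_cost_usd(params, tokens, flops_per_sec, utilization,
                      usd_per_gpu_hour, gpus):
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token,
    divided by sustained cluster throughput, priced per GPU-hour.
    All inputs are illustrative assumptions."""
    total_flops = 6 * params * tokens
    cluster_flops = flops_per_sec * utilization * gpus   # sustained, cluster-wide
    seconds = total_flops / cluster_flops
    gpu_hours = seconds / 3600 * gpus
    return gpu_hours * usd_per_gpu_hour

# e.g. a 400B-parameter model on 15T tokens, 10,000 GPUs at ~1e15 FLOP/s
# peak each, 40% sustained utilization, $3 per GPU-hour:
cost = training_cost_usd(400e9, 15e12, 1e15, 0.40, 3.0, 10_000)
print(f"${cost / 1e6:.0f}M")
```

The same formula explains the scaling-law pressure: doubling either parameter count or token count doubles compute cost, so frontier training runs grow roughly as fast as the capital available to fund them.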
Inference is cheaper per query but more expensive in aggregate because inference runs continuously for millions of users. As of early 2025, representative API pricing:
For context: a million tokens is roughly 750,000 words, or about 10 novels. A single ChatGPT conversation might use 2,000-10,000 tokens. The math at scale: a company processing 10 million API calls per day at 1,000 tokens each would spend roughly $30,000-150,000 per day on inference alone, depending on the model.
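That daily figure is simple arithmetic worth spelling out. The two per-million-token rates below are illustrative stand-ins for a cheap and an expensive model tier, matching the range in the text:

```python
def daily_inference_cost(calls_per_day, tokens_per_call, usd_per_million_tokens):
    """Total tokens per day times the per-million-token rate.
    Rates are illustrative, not any provider's actual pricing."""
    total_tokens = calls_per_day * tokens_per_call
    return total_tokens / 1e6 * usd_per_million_tokens

calls, tokens = 10_000_000, 1_000   # 10B tokens per day
print(daily_inference_cost(calls, tokens, 3.0))    # cheaper tier:  30000.0
print(daily_inference_cost(calls, tokens, 15.0))   # pricier tier: 150000.0
```

At 10 billion tokens a day, a $1-per-million-token difference between models is $10,000 a day — which is why model selection and routing cheap queries to cheap models is itself an optimization discipline.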
This is why inference optimization (quantization, caching, batching, speculative decoding) is such an active area: every percentage-point improvement in inference efficiency translates directly into cost savings at scale. It's also why subscription models like ChatGPT Plus ($20/month for effectively unlimited use) are loss leaders — OpenAI is subsidizing heavy users to build market share and collect feedback data.
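Caching is the simplest of those optimizations to illustrate. The sketch below is an exact-match response cache — real deployments go further with prefix/KV caching and semantic caches — but even this captures the savings on repeated queries:

```python
import hashlib

class PromptCache:
    """Minimal exact-match response cache: identical prompts skip the
    model entirely. `model_call` is any prompt -> response function."""
    def __init__(self, model_call):
        self.model_call = model_call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.store:
            self.hits += 1                     # served for free, no GPU time
            return self.store[key]
        self.misses += 1
        result = self.model_call(prompt)       # the expensive part
        self.store[key] = result
        return result

cache = PromptCache(lambda p: f"answer to: {p}")
for q in ["reset password", "reset password", "billing help"]:
    cache.complete(q)
print(cache.hits, cache.misses)
```

For workloads like customer support, where a small set of questions dominates traffic, hit rates can be substantial — and every cache hit is a forward pass that never touches a GPU.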
The hyperscalers provide the infrastructure; the enterprise adoption patterns show how AI gets integrated into real organizations. But what are people actually building with these models? The final chapter in this part of the guide covers the applied landscape: RAG, tool use, agentic systems, and the emerging patterns that define how AI gets used in practice. That's where your own work — building agents, optimizing retrieval, deploying autonomous systems — fits into the bigger picture.