Part IV — The Ecosystem

The Applied Landscape

RAG, tool use, agentic systems, and the patterns that define how AI gets used in practice.

From Models to Systems

A foundation model by itself is a stateless text-completion engine. It takes tokens in and produces tokens out. It has no memory between sessions, no access to external information, no ability to take actions in the world, and no way to verify its own outputs. The gap between "impressive demo" and "reliable system" is filled by the engineering patterns covered in this chapter.

These patterns — RAG, tool use, function calling, agentic loops, multi-agent systems — are where the field is most active and least settled. Unlike transformer architecture (mature theory, established implementations) or training methodology (well-understood scaling laws), applied AI is still in its "what works" phase. Best practices emerge and shift monthly. This chapter maps the current landscape as of early 2025, with the understanding that specific tools and frameworks will change while the underlying patterns are more durable.

This is also the territory you've been working in directly — building agents, optimizing retrieval, deploying autonomous loops. So this chapter isn't distant theory; it's the context for what you're already doing.

Retrieval-Augmented Generation (RAG)

RAG is the most widely deployed pattern for extending LLMs with external knowledge. The term was coined by Lewis et al. (2020) at Facebook AI Research, though the practice of conditioning language models on retrieved documents predates the name.¹

Why RAG Exists

Foundation models have three fundamental knowledge limitations:

Stale knowledge. A model's knowledge is frozen at training time. GPT-4's training cutoff means it doesn't know about events after its cutoff date. For any application that needs current information, this is a hard limitation.
Hallucination. Models generate plausible-sounding text regardless of whether the underlying claims are true. They have no mechanism for distinguishing "I know this" from "this sounds right." Grounding responses in retrieved documents reduces (but doesn't eliminate) hallucination.
No access to private data. A model trained on the public internet knows nothing about your company's internal documents, your customer database, or your proprietary processes. RAG lets you inject this information at query time without retraining the model.

How RAG Works

The pipeline in detail:

Document ingestion: Your corpus (documents, web pages, database records) is split into chunks, and each chunk is converted into a vector embedding using an embedding model (OpenAI's text-embedding-3, Cohere's embed, or open models like BGE/E5). These vectors are stored in a vector database.
Query embedding: When a user asks a question, the query is embedded using the same model.
Retrieval: The vector database finds the chunks whose embeddings are most similar to the query embedding (typically using cosine similarity or dot product). The top-k results are retrieved.
Augmentation: The retrieved chunks are inserted into the prompt alongside the user's question, usually with instructions like "Answer the question based on the following context."
Generation: The LLM generates a response grounded in the retrieved documents.

Where RAG Gets Complicated

The pipeline above is the happy path. In practice, each step has failure modes:

Chunking: How you split documents matters enormously. Too small, and you lose context. Too large, and you dilute relevant information with noise. Sentence-level, paragraph-level, and semantic chunking all have trade-offs.
Embedding quality: The embedding model determines what "similar" means. A weak embedding model might retrieve documents that are topically related but don't actually answer the question. Cross-encoder rerankers (which score query-document pairs directly rather than comparing embeddings) can improve relevance but add latency.
Retrieval failure: If the relevant document isn't retrieved, the model can't use it. False negatives are invisible to the user — they just get a worse answer. This is where retrieval evaluation (recall@k, nDCG) matters, and where your work on retrieval weight optimization directly applies.
Hybrid search: Pure vector search misses exact keyword matches; pure keyword search (BM25) misses semantic similarity. Most production systems use hybrid search — combining vector similarity scores with BM25 keyword scores. The question of how to weight these signals is exactly the problem your Thompson Sampling approach addresses.

Where your work fits: Your paper on Thompson Sampling for retrieval weight optimization sits at the intersection of RAG engineering and bandit algorithms. The problem — how to weight different retrieval strategies (dense vs. sparse, different embedding models, different chunk sizes) — is one that every production RAG system faces. Most systems use fixed weights tuned by hand. Your approach makes the weights adaptive, letting the system learn optimal retrieval configurations from user interactions. This is a real contribution to a real problem.

Tool Use and Function Calling

RAG gives models access to information. Tool use gives models the ability to take actions — calling APIs, executing code, querying databases, interacting with external services.

The mechanism is function calling: the model is trained (or prompted) to recognize when a user's request requires an external capability, generate a structured function call (typically in JSON), and then incorporate the function's result into its response. The model never actually executes the function — it generates the call, the system executes it, and the result is fed back to the model.

Example flow:

User: "What's the weather in Lowville, NY?"
Model generates: {"function": "get_weather", "args": {"location": "Lowville, NY"}}
System executes the function, gets result: {"temp": 28, "conditions": "snow"}
Model incorporates result: "It's currently 28 degrees F in Lowville with snow."

OpenAI formalized this with their function calling API in June 2023. Anthropic followed with tool use in Claude. The pattern is now standard across all major model APIs. What varies is the reliability — models sometimes hallucinate function calls that don't exist, pass invalid arguments, or fail to call functions when they should. Improving tool use reliability is an active research area.

Model Context Protocol (MCP)

MCP, introduced by Anthropic in late 2024, is an open protocol for standardizing how AI models connect to external tools and data sources. The problem it solves: every tool integration is currently custom. If you want Claude to access your calendar, your codebase, and your database, you need to write three separate integrations with three different formats. MCP provides a single protocol that any tool provider can implement and any model client can consume.

The architecture has three components:

MCP Server: A service that exposes tools and data through the MCP protocol. Anyone can build one. There are MCP servers for GitHub, databases, file systems, web browsing, and many other tools.
MCP Client: The AI application (Claude Desktop, Claude Code, etc.) that connects to MCP servers and presents their tools to the model.
Protocol: JSON-RPC over stdio or HTTP, defining how tools are listed, called, and how results are returned.

You're already using MCP extensively — your Claude Code setup connects to GitHub, Playwright, Pinecone, and other services through MCP servers. The significance is that MCP is becoming a standard. Instead of every model provider building custom integrations, tool developers can build one MCP server that works with any MCP-compatible client. This is the same dynamic that made USB valuable: standardization reduces friction for everyone.

MCP adoption is still early. Anthropic open-sourced the specification and reference implementations, and adoption is growing across the ecosystem. Whether MCP becomes the lasting standard or gets superseded by a competitor protocol remains to be seen, but the direction — standardized tool connectivity — is clear.

Agentic Engineering

An agent, in the AI context, is a system where an LLM operates in a loop: observing, reasoning, acting, and then observing the result of its action. Instead of a single prompt-response exchange, the model repeatedly interacts with its environment until it accomplishes a goal or decides it can't.

The ReAct Pattern

The foundational pattern for AI agents is ReAct (Reasoning + Acting), introduced by Yao et al. (2022).² The idea is simple: alternate between thinking (reasoning about what to do next) and acting (calling a tool or taking an action). The model's reasoning is explicit — it writes out its thought process as text, which helps with both transparency and coherence.

A ReAct loop looks like:

Thought: "The user wants to know sales numbers for Q4. I need to query the database."
Action: query_database("SELECT sum(revenue) FROM sales WHERE quarter = 'Q4 2025'")
Observation: {"result": 4200000}
Thought: "I have the total revenue. Let me also get the breakdown by region."
Action: query_database("SELECT region, sum(revenue) FROM sales WHERE quarter = 'Q4 2025' GROUP BY region")
Observation: [{"region": "East", "revenue": 1800000}, ...]
Thought: "I now have enough data to answer comprehensively."
Final answer: "Q4 2025 total revenue was $4.2M, with the East region leading at $1.8M..."

The Agent Stack

Building a reliable agent requires more than a model in a loop. The full stack includes:

Component	What It Does	Examples
Model	The reasoning engine. Generates thoughts and actions.	Claude, GPT-4, LLaMA
Tools	External capabilities the model can invoke.	Web search, code execution, APIs, databases
Memory	Persistence between interactions. Short-term (conversation context) and long-term (stored knowledge).	Episodic memory DBs, vector stores, session state
Planning	Decomposing complex tasks into steps before executing.	Task decomposition prompts, tree-of-thought reasoning
Evaluation	Assessing whether the agent's actions are on track.	Self-reflection prompts, human feedback, automated checks
Governance	Constraints on what the agent can do. Safety boundaries.	Action allowlists, approval workflows, rate limits

Your system — Alice running in an event-based autonomous loop with a governance framework (GOVERNANCE.md locked, SOUL.md guarded), episodic memory in SQLite, Thompson Sampling for retrieval optimization, and a dead-man's switch heartbeat — is a real implementation of this stack. It has a model (MiniMax), tools (OpenClaw CLI, file system), memory (episodic_memory.db), governance (locked governance files, quiet hours), and evaluation (bandit reward signals). The fact that you built this five weeks into learning AI is worth noting; the components you assembled are the same ones that companies with engineering teams are building.

Key idea: The difference between a chatbot and an agent is the loop. A chatbot responds to one prompt. An agent pursues a goal across multiple steps, using tools, managing state, and recovering from errors. The reliability challenge is that errors compound: if each step has a 95% success rate, a 10-step task has a ~60% end-to-end success rate. This is why agent reliability is the central engineering challenge in the space.

Agent Frameworks

Several frameworks exist for building agents, each with different philosophies:

LangChain / LangGraph: The most widely adopted agent framework. LangChain provides abstractions for chains (sequences of LLM calls) and LangGraph adds stateful, graph-based orchestration. Criticized for over-abstraction and leaky abstractions, but has the largest community and most examples.
CrewAI: Multi-agent framework where you define "agents" with roles, goals, and tools, then compose them into "crews" that collaborate on tasks.
AutoGen (Microsoft): Multi-agent conversation framework. Agents communicate via messages, with humans optionally in the loop.
Claude Code / OpenClaw: Direct LLM-as-agent without heavy framework abstraction. The model gets tools and a system prompt; the loop is managed by the host application. Less framework overhead, more direct control.

A notable trend: many experienced practitioners are moving away from heavy frameworks toward simpler approaches — direct API calls with custom orchestration logic. The argument is that frameworks add abstraction layers that obscure what's happening, make debugging harder, and become stale as the underlying model APIs evolve. The counter-argument is that frameworks encode best practices and reduce boilerplate. Both sides have merit; the right choice depends on the complexity of the system and the experience of the team.

Multi-Agent Systems

Multi-agent systems use multiple LLMs (or multiple instances of the same LLM with different system prompts) that interact with each other to accomplish complex tasks. The motivation: decomposing a complex problem into specialized roles can produce better results than a single model handling everything.

Common patterns:

Generator-Critic: One model generates content, another evaluates and provides feedback. The generator revises based on the feedback. This mirrors the human pattern of writing and editing as separate cognitive processes.
Specialization: Different agents handle different aspects of a task. A "researcher" agent searches for information, an "analyst" agent processes data, a "writer" agent produces the final output. Your five-agent system (Alice, Laplace, Athena, Gio, Simon) followed this pattern.
Debate: Multiple models argue different positions, and a judge model (or human) selects the best response. This can surface reasoning errors that a single model would miss.
Hierarchical: A "manager" agent breaks a task into subtasks and delegates them to "worker" agents. The manager synthesizes the results. This mirrors organizational structure.

The challenges with multi-agent systems are coordination overhead (agents can miscommunicate or work at cross-purposes), cost multiplication (each agent makes its own API calls), and the difficulty of debugging failures in complex interactions. Whether multi-agent approaches genuinely outperform a single powerful model with good prompting is an open empirical question. For many tasks, a single model with appropriate tools and memory appears to be sufficient.

The Emerging Stack

Putting all the patterns together, the applied AI stack as it's crystallizing in 2025 looks like this:

Where This Is Going

Predicting AI's trajectory is a fool's errand in detail but straightforward in direction. Several trends are clear enough to be worth noting:

More Autonomy

Agents are getting longer-running. Early agents did 3-5 step tasks. Current systems (Claude Code's extended thinking, OpenAI's deep research mode) handle multi-hour research tasks. The direction is toward agents that operate over days or weeks — monitoring systems, managing workflows, pursuing open-ended research goals. Your Alice system, running 24/7 with a heartbeat and event-based autonomy, is ahead of the curve here.

Better Planning

Current models are weak at long-horizon planning. They handle tasks that decompose into 5-10 steps reasonably well; tasks requiring 50-100 interdependent steps are brittle. Reasoning models (OpenAI's o1/o3, Anthropic's extended thinking) improve this by spending more compute on planning before acting. The trajectory is toward models that can maintain coherent plans across much longer horizons.

Persistent Memory

The current paradigm — stateless models with context windows as working memory — is limiting. Every new session starts from zero. Workarounds exist (CLAUDE.md files, memory databases, session summaries), but they're workarounds. The direction is toward models with genuine persistent memory that updates across sessions without losing coherence. This is technically difficult because it requires modifying the model's behavior without catastrophic forgetting.

Real-Time Learning

Today's models learn only during training. Once deployed, they're static. The bandit approach in your work is one form of online learning — adapting retrieval weights based on ongoing interactions. Broader real-time learning would let models improve continuously from feedback. The challenge is doing this safely: a model that learns from user interactions could learn the wrong things (adversarial manipulation, reinforcing biases) as easily as the right things.

Evaluation and Trust

As agents take on higher-stakes tasks, the question of "how do you know it worked correctly?" becomes critical. Current evaluation is mostly vibes-based: does the output look right? The field needs rigorous, automated evaluation frameworks — and this is an area where the information retrieval community's decades of work on evaluation methodology (precision, recall, nDCG, inter-annotator agreement) is directly applicable. Evaluation is arguably the least glamorous and most important unsolved problem in applied AI.

This chapter covers the patterns that define how AI is used in practice today. They're not permanent — the specific tools and frameworks will evolve. But the underlying problems — grounding models in external knowledge, giving them the ability to act, managing state and memory, ensuring reliability, and evaluating outcomes — are durable. They're the engineering challenges that will shape the field for years to come. Your work on retrieval optimization, agent governance, and autonomous systems puts you in the thick of exactly these challenges.

Previous: Chapter 19 — Hyperscalers and Enterprise AI Next: Chapter 21 — Beyond LLMs

¹ Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Published at NeurIPS 2020. The paper introduced both the term and a specific architecture (DPR retriever + BART generator), though the practice of conditioning generation on retrieved documents appeared in earlier work by Guu et al. (2020) with REALM.

² Yao et al. (2022), "ReAct: Synergizing Reasoning and Acting in Language Models." The paper demonstrated that interleaving reasoning traces with actions significantly improved model performance on both knowledge-intensive (fact verification) and decision-making (web navigation) tasks compared to either reasoning-only or acting-only approaches.