A foundation model by itself is a stateless text-completion engine. It takes tokens in and produces tokens out. It has no memory between sessions, no access to external information, no ability to take actions in the world, and no way to verify its own outputs. The gap between "impressive demo" and "reliable system" is filled by the engineering patterns covered in this chapter.
These patterns — RAG, tool use, function calling, agentic loops, multi-agent systems — are where the field is most active and least settled. Unlike transformer architecture (mature theory, established implementations) or training methodology (well-understood scaling laws), applied AI is still in its "what works" phase. Best practices emerge and shift monthly. This chapter maps the current landscape as of early 2025, with the understanding that specific tools and frameworks will change while the underlying patterns are more durable.
This is also the territory you've been working in directly — building agents, optimizing retrieval, deploying autonomous loops. So this chapter isn't distant theory; it's the context for what you're already doing.
RAG is the most widely deployed pattern for extending LLMs with external knowledge. The term was coined by Lewis et al. (2020) at Facebook AI Research, though the practice of conditioning language models on retrieved documents predates the name.1
Foundation models have three fundamental knowledge limitations:

- **Knowledge cutoff.** Training data ends at a fixed date; the model knows nothing that happened afterward.
- **No private data.** The model has never seen your organization's documents, codebase, or databases.
- **Hallucination.** When the model lacks knowledge, it tends to generate plausible-sounding fabrications rather than admit uncertainty.

RAG addresses all three by retrieving relevant documents at query time and injecting them into the prompt, so the model answers from provided evidence rather than from parametric memory alone.
The pipeline in detail:

1. **Ingest.** Split source documents into chunks, typically a few hundred tokens each.
2. **Embed.** Convert each chunk into a vector with an embedding model.
3. **Index.** Store the vectors in a vector database built for similarity search.
4. **Retrieve.** At query time, embed the user's question and fetch the most similar chunks.
5. **Augment.** Insert the retrieved chunks into the prompt as context.
6. **Generate.** The model answers the question grounded in that context.
The pipeline above is the happy path. In practice, each step has failure modes:

- **Chunking** can split a fact across chunk boundaries, leaving no single chunk that answers the question.
- **Embedding** similarity can miss exact-match needs (names, IDs, error codes) that keyword search would catch, which is why hybrid retrieval exists.
- **Retrieval** can return topically similar but irrelevant chunks, or rank the right chunk below the cutoff.
- **Generation** can ignore the provided context, blend it with parametric memory, or hallucinate even when retrieval was correct.
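The retrieval step can be sketched in a few lines. This is a toy version: the vectors are hand-written stand-ins for real embedding-model output, and the corpus lives in a plain list rather than a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, k=2):
    # Rank chunks by similarity to the query embedding; keep the top k.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Toy corpus: in a real pipeline these vectors come from an embedding model.
chunks = [
    {"text": "Q4 revenue grew 12%", "vec": [0.9, 0.1, 0.0]},
    {"text": "Office plants need water", "vec": [0.0, 0.2, 0.9]},
    {"text": "Revenue by region report", "vec": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for embedding "What was Q4 revenue?"

top = retrieve(query_vec, chunks, k=2)
context = "\n".join(c["text"] for c in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: What was Q4 revenue?"
```

The final `prompt` is the "augment" step: retrieved text is simply spliced into the instructions the model sees.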
RAG gives models access to information. Tool use gives models the ability to take actions — calling APIs, executing code, querying databases, interacting with external services.
The mechanism is function calling: the model is trained (or prompted) to recognize when a user's request requires an external capability, generate a structured function call (typically in JSON), and then incorporate the function's result into its response. The model never actually executes the function — it generates the call, the system executes it, and the result is fed back to the model.
Example flow:

1. User: "What's the weather in Lowville, NY?"
2. The model emits a structured call:

   ```json
   {"function": "get_weather", "args": {"location": "Lowville, NY"}}
   ```

3. The system executes the call and feeds back the result:

   ```json
   {"temp": 28, "conditions": "snow"}
   ```

4. The model incorporates the result: "It's 28°F and snowing in Lowville."

OpenAI formalized this with their function calling API in June 2023. Anthropic followed with tool use in Claude. The pattern is now standard across all major model APIs. What varies is reliability: models sometimes hallucinate function calls that don't exist, pass invalid arguments, or fail to call functions when they should. Improving tool-use reliability is an active research area.
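The system's side of this exchange is a dispatcher. A minimal sketch, assuming the model's output arrives as a JSON string and using a canned `get_weather` in place of a real weather API; note the two guard clauses, which correspond to the two common failure modes (hallucinated functions and invalid arguments):

```python
import json

# Hypothetical tool implementation; a real system would call a weather API.
def get_weather(location):
    return {"temp": 28, "conditions": "snow"}

TOOLS = {"get_weather": get_weather}

def dispatch(model_output):
    # Parse the model's structured call, execute it, and return the result
    # as a string to be fed back into the model's context.
    call = json.loads(model_output)
    name, args = call["function"], call.get("args", {})
    if name not in TOOLS:
        # Hallucinated call: surface the error instead of crashing.
        return json.dumps({"error": f"unknown function: {name}"})
    try:
        return json.dumps(TOOLS[name](**args))
    except TypeError as e:
        # Invalid or missing arguments.
        return json.dumps({"error": str(e)})

result = dispatch('{"function": "get_weather", "args": {"location": "Lowville, NY"}}')
```

Returning errors as tool results, rather than raising, lets the model see its mistake and retry with a corrected call.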
MCP, introduced by Anthropic in late 2024, is an open protocol for standardizing how AI models connect to external tools and data sources. The problem it solves: every tool integration is currently custom. If you want Claude to access your calendar, your codebase, and your database, you need to write three separate integrations with three different formats. MCP provides a single protocol that any tool provider can implement and any model client can consume.
The architecture has three components:

- **Hosts**: the AI applications the user interacts with (Claude Desktop, an IDE, a chat client).
- **Clients**: connectors inside the host, each maintaining a one-to-one connection with a server.
- **Servers**: lightweight programs that expose a specific capability (a database, a file system, an API) through the standardized protocol.
You're already using MCP extensively — your Claude Code setup connects to GitHub, Playwright, Pinecone, and other services through MCP servers. The significance is that MCP is becoming a standard. Instead of every model provider building custom integrations, tool developers can build one MCP server that works with any MCP-compatible client. This is the same dynamic that made USB valuable: standardization reduces friction for everyone.
MCP adoption is still early. Anthropic open-sourced the specification and reference implementations, and adoption is growing across the ecosystem. Whether MCP becomes the lasting standard or gets superseded by a competitor protocol remains to be seen, but the direction — standardized tool connectivity — is clear.
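On the wire, MCP messages follow JSON-RPC 2.0; the spec defines methods such as `tools/list` and `tools/call`. A sketch of a `tools/call` request, where the tool name and arguments are illustrative rather than from any real server:

```python
import json

# A client asking an MCP server to invoke one of its tools.
# The method name follows the published spec; "search_code" is a
# hypothetical tool a server might expose.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_code",
        "arguments": {"query": "def dispatch"},
    },
}
wire = json.dumps(request)  # serialized and sent over stdio or HTTP
```

The point of the standard is that this shape is the same regardless of whether the server fronts GitHub, a database, or a browser.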
An agent, in the AI context, is a system where an LLM operates in a loop: observing, reasoning, acting, and then observing the result of its action. Instead of a single prompt-response exchange, the model repeatedly interacts with its environment until it accomplishes a goal or decides it can't.
The foundational pattern for AI agents is ReAct (Reasoning + Acting), introduced by Yao et al. (2022).2 The idea is simple: alternate between thinking (reasoning about what to do next) and acting (calling a tool or taking an action). The model's reasoning is explicit — it writes out its thought process as text, which helps with both transparency and coherence.
A ReAct loop looks like:

```
Thought: I need total Q4 2025 revenue. I'll query the sales database.
Action: query_database("SELECT sum(revenue) FROM sales WHERE quarter = 'Q4 2025'")
Observation: {"result": 4200000}
Thought: Total is $4.2M. The user also wants the regional breakdown.
Action: query_database("SELECT region, sum(revenue) FROM sales WHERE quarter = 'Q4 2025' GROUP BY region")
Observation: [{"region": "East", "revenue": 1800000}, ...]
Thought: I have what I need to answer.
```

Building a reliable agent requires more than a model in a loop. The full stack includes:
| Component | What It Does | Examples |
|---|---|---|
| Model | The reasoning engine. Generates thoughts and actions. | Claude, GPT-4, LLaMA |
| Tools | External capabilities the model can invoke. | Web search, code execution, APIs, databases |
| Memory | Persistence between interactions. Short-term (conversation context) and long-term (stored knowledge). | Episodic memory DBs, vector stores, session state |
| Planning | Decomposing complex tasks into steps before executing. | Task decomposition prompts, tree-of-thought reasoning |
| Evaluation | Assessing whether the agent's actions are on track. | Self-reflection prompts, human feedback, automated checks |
| Governance | Constraints on what the agent can do. Safety boundaries. | Action allowlists, approval workflows, rate limits |
Your system — Alice running in an event-based autonomous loop with a governance framework (GOVERNANCE.md locked, SOUL.md guarded), episodic memory in SQLite, Thompson Sampling for retrieval optimization, and a dead-man's switch heartbeat — is a real implementation of this stack. It has a model (MiniMax), tools (OpenClaw CLI, file system), memory (episodic_memory.db), governance (locked governance files, quiet hours), and evaluation (bandit reward signals). The fact that you built this five weeks into learning AI is worth noting; the components you assembled are the same ones that companies with engineering teams are building.
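A stripped-down version of such a loop, with a governance allowlist as the only guard, can be sketched as follows. The "model" is a scripted stand-in that emits (thought, action, argument) triples; a real system would make an LLM API call at that point, and `run_tool` returns canned data in place of a real database:

```python
# Governance: the agent may only take actions on this allowlist.
ALLOWED_ACTIONS = {"query_database", "finish"}

def run_tool(name, arg):
    # Hypothetical tool layer; returns a canned result for the sketch.
    return {"query_database": lambda q: {"result": 4_200_000}}[name](arg)

def fake_model(history):
    # Scripted reasoning: fetch data first, then finish.
    if not any(step[0] == "observation" for step in history):
        return ("need Q4 totals", "query_database", "SELECT sum(revenue) ...")
    return ("I have the data", "finish", "Q4 revenue was $4.2M")

def agent_loop(max_steps=5):
    history = []
    for _ in range(max_steps):
        thought, action, arg = fake_model(history)
        if action not in ALLOWED_ACTIONS:
            raise PermissionError(f"blocked action: {action}")  # governance boundary
        if action == "finish":
            return arg
        obs = run_tool(action, arg)          # act
        history.append(("observation", obs))  # observe, then loop back to reason
    return "step budget exhausted"

answer = agent_loop()
```

The `max_steps` budget and the allowlist are the two cheapest governance mechanisms: they bound runaway loops and unauthorized actions even when everything else fails.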
Several frameworks exist for building agents, each with a different philosophy:

- **LangChain / LangGraph**: the most widely adopted; LangGraph models agent workflows as explicit state graphs.
- **LlamaIndex**: data-centric, strongest for retrieval-heavy applications.
- **AutoGen** (Microsoft): frames multi-agent systems as conversations between agents.
- **CrewAI**: organizes agents into role-based "crews" with assigned tasks.
A notable trend: many experienced practitioners are moving away from heavy frameworks toward simpler approaches — direct API calls with custom orchestration logic. The argument is that frameworks add abstraction layers that obscure what's happening, make debugging harder, and become stale as the underlying model APIs evolve. The counter-argument is that frameworks encode best practices and reduce boilerplate. Both sides have merit; the right choice depends on the complexity of the system and the experience of the team.
Multi-agent systems use multiple LLMs (or multiple instances of the same LLM with different system prompts) that interact with each other to accomplish complex tasks. The motivation: decomposing a complex problem into specialized roles can produce better results than a single model handling everything.
Common patterns:

- **Orchestrator-workers**: a lead agent decomposes the task and delegates subtasks to specialist agents.
- **Pipeline**: agents handle sequential stages (research, draft, edit), each consuming the previous agent's output.
- **Debate / critique**: agents argue opposing positions or review each other's work, with disagreement surfacing errors.
- **Role-based teams**: agents with distinct personas (planner, coder, tester) collaborate on a shared artifact.
The challenges with multi-agent systems are coordination overhead (agents can miscommunicate or work at cross-purposes), cost multiplication (each agent makes its own API calls), and the difficulty of debugging failures in complex interactions. Whether multi-agent approaches genuinely outperform a single powerful model with good prompting is an open empirical question. For many tasks, a single model with appropriate tools and memory appears to be sufficient.
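The structural idea is small enough to sketch. Here plain functions stand in for model calls with different system prompts; the writer and critic roles are illustrative:

```python
# Orchestrator-worker sketch: each "agent" is a function standing in for an
# LLM call with a distinct system prompt.
def writer(task):
    # Specialist agent: produces a draft for the task.
    return f"DRAFT: answer to '{task}'"

def critic(draft):
    # Reviewer agent: a real critic would be another model call; here we
    # just enforce a trivial convention on the draft.
    return draft if draft.startswith("DRAFT:") else "REJECTED"

def orchestrator(task):
    draft = writer(task)      # delegate the work to a specialist
    reviewed = critic(draft)  # have a second agent check it
    return reviewed

out = orchestrator("summarize Q4 sales")
```

Even at this scale the cost structure is visible: every hop is another model call, which is why the coordination-overhead and cost-multiplication concerns above bite quickly.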
Putting all the patterns together, the applied AI stack as it's crystallizing in 2025 looks like this:

1. **Model layer**: one or more foundation models accessed via API.
2. **Knowledge layer**: RAG pipelines and vector stores grounding the model in external data.
3. **Tool layer**: function calling and protocols like MCP connecting the model to actions.
4. **Orchestration layer**: agent loops, planning, and multi-agent coordination.
5. **State layer**: short-term context and long-term memory across sessions.
6. **Trust layer**: governance, evaluation, and monitoring.
Predicting AI's trajectory is a fool's errand in detail but straightforward in direction. Several trends are clear enough to be worth noting:
Agents are getting longer-running. Early agents did 3-5 step tasks. Current systems (Claude Code's extended thinking, OpenAI's deep research mode) handle multi-hour research tasks. The direction is toward agents that operate over days or weeks — monitoring systems, managing workflows, pursuing open-ended research goals. Your Alice system, running 24/7 with a heartbeat and event-based autonomy, is ahead of the curve here.
Current models are weak at long-horizon planning. They handle tasks that decompose into 5-10 steps reasonably well; tasks requiring 50-100 interdependent steps are brittle. Reasoning models (OpenAI's o1/o3, Anthropic's extended thinking) improve this by spending more compute on planning before acting. The trajectory is toward models that can maintain coherent plans across much longer horizons.
The current paradigm — stateless models with context windows as working memory — is limiting. Every new session starts from zero. Workarounds exist (CLAUDE.md files, memory databases, session summaries), but they're workarounds. The direction is toward models with genuine persistent memory that updates across sessions without losing coherence. This is technically difficult because it requires modifying the model's behavior without catastrophic forgetting.
Today's models learn only during training. Once deployed, they're static. The bandit approach in your work is one form of online learning — adapting retrieval weights based on ongoing interactions. Broader real-time learning would let models improve continuously from feedback. The challenge is doing this safely: a model that learns from user interactions could learn the wrong things (adversarial manipulation, reinforcing biases) as easily as the right things.
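The bandit idea mentioned above can be made concrete with Beta-Bernoulli Thompson Sampling: sample a plausible success rate for each retrieval strategy from its posterior, use the strategy with the highest sample, then update that strategy's counts from the reward signal. The arm names and the simulated 80%/40% success rates here are invented for illustration:

```python
import random

class ThompsonSampler:
    def __init__(self, arms):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per arm.
        self.posterior = {a: [1, 1] for a in arms}

    def choose(self):
        # Sample a success rate per arm from its Beta posterior; pick the best.
        samples = {a: random.betavariate(s, f) for a, (s, f) in self.posterior.items()}
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        # reward is 1 (retrieval helped) or 0 (it did not).
        self.posterior[arm][0] += reward
        self.posterior[arm][1] += 1 - reward

sampler = ThompsonSampler(["semantic", "keyword", "hybrid"])
random.seed(0)
for _ in range(200):
    arm = sampler.choose()
    # Simulated feedback: "hybrid" succeeds 80% of the time, the others 40%.
    reward = 1 if random.random() < (0.8 if arm == "hybrid" else 0.4) else 0
    sampler.update(arm, reward)
```

Because each pull sharpens one arm's posterior, exploration decays naturally: arms that keep losing are sampled less often without any explicit schedule.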
As agents take on higher-stakes tasks, the question of "how do you know it worked correctly?" becomes critical. Current evaluation is mostly vibes-based: does the output look right? The field needs rigorous, automated evaluation frameworks — and this is an area where the information retrieval community's decades of work on evaluation methodology (precision, recall, nDCG, inter-annotator agreement) is directly applicable. Evaluation is arguably the least glamorous and most important unsolved problem in applied AI.
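One of those IR metrics, nDCG, is simple enough to carry over directly. A minimal sketch, with hand-assigned graded relevance labels standing in for human judgments:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: graded relevance, discounted by log of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending) ordering, so 1.0 means
    # the system ranked results exactly as well as possible.
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# A retrieval run that put a highly relevant chunk (rel=3) second instead of first.
score = ndcg([2, 3, 0, 1])
```

Applying the same discipline to agent outputs is harder, because "relevance" becomes "did the multi-step task succeed," but the principle of scoring against a defined ideal carries over.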
This chapter covers the patterns that define how AI is used in practice today. They're not permanent — the specific tools and frameworks will evolve. But the underlying problems — grounding models in external knowledge, giving them the ability to act, managing state and memory, ensuring reliability, and evaluating outcomes — are durable. They're the engineering challenges that will shape the field for years to come. Your work on retrieval optimization, agent governance, and autonomous systems puts you in the thick of exactly these challenges.