Blog Drafts

5 Substack article drafts. Generated 2026-03-19; needs voice pass.

# Can AI Agents Learn What to Remember?

I built a system that helps an AI agent pull the right skills from a library when it needs to solve a task. The library has 205 skills — things like "write a SQL query," "parse JSON," "format a date." The agent sees a user request, retrieves relevant skills, and uses them to generate a response.

The problem: not all retrieval dimensions are equal. Relevance (does this skill actually help?) matters, sure. But recency (did I use this recently?) and importance (did it work well last time?) also compete for weight. Most systems assign them equal importance or pick weights arbitrarily. I wanted to know: can an agent learn which dimension matters most for its particular library?
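
Concretely, the retrieval score is a weighted sum over those three dimensions. Here's a minimal sketch (the names and numbers are illustrative, not my production code) showing how the weights change the ranking:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    relevance: float   # semantic similarity to the query, 0..1
    recency: float     # how recently the skill was used, 0..1
    importance: float  # historical success rate, 0..1

def score_skill(skill: Skill, w_rel: float, w_rec: float, w_imp: float) -> float:
    # The weights are what the agent has to learn; equal weights are the usual default
    return w_rel * skill.relevance + w_rec * skill.recency + w_imp * skill.importance

skills = [
    Skill("write_sql_query", relevance=0.92, recency=0.40, importance=0.75),
    Skill("parse_json",      relevance=0.35, recency=0.95, importance=0.80),
]
# Equal weights vs. a relevance-heavy preset can rank the same skills differently
equal = sorted(skills, key=lambda s: score_skill(s, 1/3, 1/3, 1/3), reverse=True)
rel_heavy = sorted(skills, key=lambda s: score_skill(s, 0.7, 0.15, 0.15), reverse=True)
```

With equal weights, the recently-used `parse_json` edges out `write_sql_query`; the relevance-heavy preset flips the order. That flip is exactly the decision the agent has to learn to make.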

I ran the experiment on a Mac Mini M4 with a $50/month MiniMax API budget. Five agents. 1,200 episodes across four conditions. Here's what happened.

## The Production Observation

In production, I watched relevance dominate. Across 579 Thompson Sampling iterations, the relevance dimension scored 0.91 on average. Recency came in at 0.78. Importance lagged at 0.75.

This surprised me. I assumed recency would matter more — agents typically benefit from trying what worked recently. But the 205-skill library had enough semantic diversity that semantic relevance was the discriminating factor. The same six meta-skills dominated 86% of retrievals regardless of recency or importance weighting.

Park et al. assumed equal weights in their retrieval model. My production data suggested that assumption was wrong — at least for this library. But production observations aren't causal evidence. The agent could have been learning the wrong thing, or the pattern could be an artifact of how I was measuring.

I needed a controlled experiment.

## The Controlled Experiment

I ran 1,200 episodes across four conditions:

- **C1 (control):** Static BM25 retrieval, no weight learning
- **C2:** Likert-scale self-assessment only
- **C3:** Likert-scale + explanation embeddings — the system rates its retrieval and explains why
- **C4:** Qualitative feedback (free-text explanations parsed via anchor embeddings)

The key metric was NDCG@5 — how well the top-5 retrieved skills match the ground truth. C3 achieved +41% NDCG@5 and +36% MRR versus control. That's a substantial gap.
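
For reference, NDCG@5 with binary relevance can be computed like this (a sketch, assuming each task has a ground-truth set of skill names):

```python
import math

def ndcg_at_k(retrieved, ground_truth, k=5):
    # Binary relevance: a retrieved skill counts iff it's in the ground-truth set
    gains = [1.0 if s in ground_truth else 0.0 for s in retrieved[:k]]
    # Discounted cumulative gain: hits further down the list count for less
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: all ground-truth skills ranked first
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(ground_truth))))
    return dcg / ideal if ideal > 0 else 0.0
```

A perfect top-5 scores 1.0; the +41% figure is the relative lift in this quantity for C3 over control.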

But the mechanism changed between runs, and that's where it gets interesting.

## The Mechanism Shift

In the first run (v2), the token savings came from multi-step prevention. The control condition hit multi-step episodes 11.6% of the time — the agent would fail, retry, fail again, accumulate context, and burn tokens. The weight-learning conditions dropped that to 3.2-3.6%.

I wrote that up as "multi-step prevention is the mechanism."

Then I increased max_tokens from 4096 to 16384, fixed a bug with usage-history injection, and re-ran.

The result: only 1 out of 1,200 episodes was multi-step. The mechanism had shifted. Token savings now came from output-driven efficiency — better retrieval produced shorter, more focused LLM responses. The agent wasn't retrying less; it was finishing faster in a single step.

The v2 finding was an artifact of max_tokens=4096 triggering truncation cascades. When the model couldn't finish in one turn, it output exactly 4096 tokens, the runner interpreted that as "not done," continued the conversation, and costs compounded quadratically.

You can't optimize your way past physics.

## What Worked

The bandit converges within roughly 50 tasks. Not gradually — it flips from exploratory to exploitative around tasks 33-48. After that, it's mostly `pure_relevance`. The cold-start cost is real but bounded.

C3 (Likert + explanation embeddings) outperformed C2 (Likert only). The explanation embeddings don't directly improve retrieval — they improve the quality of the reward signal feeding the bandit. The SNR was 2.5x higher. C3's cumulative regret dropped 39% in the second half versus 4% for C2. The feedback signal quality, not just its presence, determines whether online learning accelerates.

The skill library composition matters. v2 used 50 generic skills and got 90% Jaccard overlap across conditions — the bandit couldn't differentiate because the same skills dominated regardless of weights. v3's 205 domain-specific skills dropped Jaccard to 3%, enabling genuine differentiation. Library design is upstream of bandit design.

## What Didn't Work

C4 (qualitative feedback via anchor embeddings) didn't converge cleanly. The recency dimension collapsed to mean 0.14 — the "high recency" anchor text was too extreme. Any real LLM response was far from it in embedding space, so softmax always assigned low probability. Anchor design determines which dimensions the parser can distinguish. The recency anchors needed more moderate language.

The control condition (C1) has extreme variance — SD of 110,546 tokens. The bimodal distribution (single-step cluster plus multi-step tail) killed statistical power. Only C1 vs C4 survived Bonferroni correction. C1 vs C2 (p=0.264) and C1 vs C3 (p=0.069) failed despite 73-76% mean reductions and bootstrap CIs excluding zero.

Easy tasks showed rubric overhead. C1 averaged 3,198 tokens on easy tasks; C4 averaged 4,078 — feedback made it worse. The rubric adds ~880 tokens of overhead that easy tasks don't benefit from. The crossover point is around 1.5 mean steps. Below that, feedback hurts.

## What This Means for Practitioners

If you're building a retrieval-augmented agent and assigning static weights to relevance, recency, and importance — stop. The optimal weights depend on your skill library composition, and you can learn them.

The cold-start cost is ~50 tasks. After that, the bandit has converged and you're in the exploitation regime. If you have prior data (warm-start), you skip this — C6 with warm-start priors hit efficiency immediately without re-learning.

The token savings aren't from retry prevention in most configurations. They're from output-driven efficiency: better retrieval produces more targeted prompts, which produces shorter LLM responses. If your max_tokens is too low, you'll see retry behavior instead. Set it high enough that the model can finish in one turn.

Finally, don't trust mean final_score as a quality metric. C3 had the highest retrieval quality but the lowest mean final_score. That's because the system learned to down-weight dimensions that saturate near 1.0 (recency, importance) and up-weight relevance (which has genuine variance but scores lower alone). The compound metric — token cost per ground truth hit — is more informative. C1 needed 895,581 tokens per GT hit. C2 needed 11,421. That's a 78x difference in cost-effectiveness.

The field is moving past traditional RAG for small-to-medium corpora. Agentic keyword search, fine-tuning to internalize retrieval benefits, and long-context models all challenge the layer I'm optimizing. But the core insight — that agents can learn which retrieval dimensions matter, via gradient-free parameter learning and self-assessment — transfers beyond RAG. That's the durable finding.

# Why "Don't Do X" Instructions Backfire in LLM Prompts

Wegner's 1987 white bear experiment is one of those findings that haunts you once you know it. Subjects told not to think about a white bear thought about it *more* than controls told to think about it from the start. The suppression attempt backfired. A monitoring process kept checking whether the forbidden thought was occurring, ironically keeping it activated.

Practitioner blogs and prompt engineering guides picked this up and ran with it. The "Pink Elephant Problem" — instructing an LLM not to do something makes it more likely to do it — became conventional wisdom. You'd see it in Reddit threads, in "LLM prompting best practices" lists, everywhere. The analogy feels intuitive: if humans can't suppress thoughts via instruction, why would a language model?

The problem is the evidence is almost entirely anecdotal. Someone on Twitter tries "don't mention X" and the model mentions X. A blog post documents "don't end with a question" and the model asks a question anyway. These get aggregated into received wisdom without anyone actually controlled-testing the mechanism.

The analogy breaks at the token level. LLMs select tokens by positive probability weighting. When you write "don't say X," you're slightly reducing X's probability — but you're not actively boosting the alternative. You're applying a dampening signal to one distribution tail, not elevating another. Meanwhile, a positive instruction like "say Y instead" directly increases Y's probability. The mechanisms are fundamentally different. Human thought suppression involves an active monitoring process (the ironic process) that the LLM architecture simply doesn't have.

There's also a separate phenomenon that gets confused with this: the Waluigi Effect. After RLHF training makes a model satisfy property P, the opposite ~P becomes more accessible via adversarial prompting. That's structural — baked into the model's training, not a runtime instruction-following failure. It's a real effect with real implications for prompt injection, but it operates at a different level than prompt-level "do NOT" instructions.

Here's where our data enters. In our agent system, we tested "do NOT generate follow-up questions" as an instruction to suppress multi-step episodes in MiniMax M2.5. The result was a 3.5x reduction in multi-step episodes. That's not a small effect. And it contradicts what the conventional wisdom would predict.

Why did it work? I think it's because we were suppressing a structural behavior, not a semantic concept. "Don't append a question-formatted suffix" is a binary constraint — either there's a question at the end of the response or there isn't. The model doesn't need to "understand" negation semantically. It just needs to not emit a question mark followed by another sentence. That's a pattern-matching task, not a conceptual one.

Compare that to "don't mention elephants." That requires ongoing semantic monitoring throughout generation — tracking whether any token you're about to emit relates to elephants, suppressing those pathways, maintaining the constraint across the entire response. That's harder. A 2025 survey across Llama 3, Qwen 2.5, and Mistral families found that LLMs "underestimate the significant impact of negation on sentence meaning" (arXiv:2503.22395). Larger models handle it better, but even GPT-4 stumbles on complex negation chains.

We had another case that illustrates the failure mode. Early in our system design, we included "Do NOT create proposals" as a behavioral constraint. The result: more proposals. Not slightly more — meaningfully more. The instruction itself made "proposal" a high-activation concept in the model's distribution, and without a clear positive alternative ("create summaries instead"), the suppression just added noise.

Our fix was architectural, not prompt-based. We removed the mention entirely and enforced the constraint by changing what the system was allowed to generate at that point in the conversation. No negation in the prompt. Just a different set of allowable outputs.
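
One way to sketch that kind of architectural enforcement: instead of a "do NOT end with a question" instruction, post-process the output so the constraint holds by construction. This is illustrative code, not our actual system:

```python
import re

def enforce_no_trailing_question(response: str) -> str:
    # Structural constraint enforced after generation, not via prompt negation:
    # drop any trailing question-formatted sentences from the response
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    while sentences and sentences[-1].endswith("?"):
        sentences.pop()
    return " ".join(sentences)
```

The model never sees a negation; the forbidden pattern simply can't reach the user.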

The practical rule that emerges: negative instructions work for structural behaviors — "don't append a question," "don't use markdown," "don't include a signature." These are binary, pattern-level constraints. They fail for semantic concepts — "don't mention topic X," "don't be sad," "don't describe violence." Those require the model to monitor and suppress meaning, which negation-comprehension research shows LLMs are genuinely bad at.

This isn't a contradiction of Wegner's theory. The theory describes human cognition, and the LLM analogy was always loose. What it offers is a more precise map of where the analogy holds and where it doesn't. The white bear effect requires a monitoring process. LLMs don't have one. They have next-token prediction. These are different architectures, and they respond differently to the same prompt structure.

For practitioners: test your negative instructions. If they're failing, it's probably because you're asking the model to suppress a concept rather than a pattern. Flip it to a positive instruction, or enforce the constraint architecturally. The 3.5x reduction we saw wasn't magic — it was targeting the right kind of suppression.
# Operating a Multi-Agent System for 6 Months: Lessons Learned

Five agents. Five belief systems. Five failure modes. One Mac Mini that eventually cried uncle.

I ran a multi-agent system from roughly September 2025 through March 2026. Five autonomous agents—Alice, Laplace, Athena, Gio, Simon—each with distinct roles: Alice as the primary worker, Laplace for planning, Athena for quality control, Gio for research, Simon for monitoring. Powered by MiniMax M2.5 API at $50/month, all running on a Mac Mini M4 with 24GB RAM.

The system is dead now. I killed it in mid-March and rebuilt around a single focused agent with an event-driven architecture. What follows is the autopsy—what broke, why it broke, and what I'd do differently.

---

## Reliability Multiplies Against You

The first thing that failed was my intuition about reliability.

I assumed five agents at 95% reliability each would give me something close to 95% overall. The math is brutal: 0.95⁵ = 0.77. Five agents each hitting their targets 95% of the time produces a 77% system. Not 95%. Not even 90%. Seventy-seven.
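
The compounding is easy to verify:

```python
# Sequential reliability compounds multiplicatively: r ** n
for n in range(1, 6):
    print(f"{n} agents: {0.95 ** n:.2f}")
```

Each agent you chain knocks another five points off the whole system.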

This isn't a surprise to anyone who's done distributed systems, but it hit me in practice in ways the math doesn't capture. The failures weren't evenly distributed. They'd cascade. One agent would hang waiting for another, which would timeout, which would trigger a retry that another agent interpreted as a new request. The 23% gap wasn't random noise—it was cascading failure modes I hadn't designed for.

The cost multiplied even harder. Each agent running full-prompt contexts meant token consumption didn't add linearly—it multiplied. What should have been 5x single-agent cost became 20-30x in practice. At $50/month API budget, I was burning through quota in days when all five agents were active.

**Takeaway:** Start with one agent. Add a second only when you can quantify why one isn't enough. The overhead isn't the API calls—it's the failure modes you're not thinking about yet.

---

## The Zombie Apocalypse

I discovered 1,087 zombie MCP processes consuming 5.7GB of RAM and 57GB of swap.

These weren't active processes. They were orphaned by agent crashes, killed timeouts, and interrupted context switches. Each agent spawned MCP (Model Context Protocol) connections for tool access—file operations, API calls, shell execution. When an agent crashed or was restarted, the MCP child process didn't die. It became a zombie, holding memory, waiting for a parent that no longer existed.

The Mac Mini didn't have swap space to spare. 57GB of swap used on a machine with 24GB RAM means the system was essentially paging its way through a graveyard. This was the proximate cause of the first major outage—OOM (out-of-memory) spirals that took down the entire system.

I wrote a cron job to kill zombie processes hourly. It helped. It shouldn't have been necessary.
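
What the cron job was papering over is ordinary process lifecycle management. A minimal POSIX sketch (illustrative, not my code — and note that `atexit` won't fire on a hard kill, which is exactly when you need a real supervisor):

```python
import atexit
import os
import signal
import subprocess

children = []

def spawn(cmd):
    # Run each child in its own process group so we can kill it and
    # anything it forked, even after the direct child exits
    proc = subprocess.Popen(cmd, start_new_session=True)
    children.append(proc)
    return proc

def reap_all():
    # No child may outlive the parent: terminate every group on exit
    for proc in children:
        if proc.poll() is None:
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        proc.wait()

atexit.register(reap_all)
```

The point isn't this particular mechanism — it's that *something* must own every spawned process from the moment it exists.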

**Takeaway:** If your agents spawn child processes, you need process lifecycle management. Not "consider adding it" or "nice to have"—you need it from day one. A process that outlives its parent is a memory leak with a pulse.

---

## Heartbeats Were Killing Me

The agents used polling heartbeats to check status every 30 seconds. Five agents, polling in parallel, meant constant CPU activity even when nothing was happening. System idle time dropped to 14% on a good day, 0% during normal operation. The Mac Mini was busy doing nothing.

The heartbeats didn't just consume CPU. They consumed context. Each heartbeat query added to the conversation history, which grew quadratically: step 1 = 1,550 tokens, step 2 = 5,100, step 3 = 9,200, step 4 = 12,800. The agents weren't thinking more deeply—they were just reading their own notes over and over.

When I switched to event-driven architecture—agents only waking on state changes, not timers—the idle time reversed. CPU usage dropped to single digits when idle. Memory stayed flat. The system could actually rest.
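
The shape of the change, sketched with a `threading.Event` (illustrative, not the real agent loop): the agent blocks until notified instead of waking on a timer.

```python
import threading

class Agent:
    """Event-driven sketch: sleeps until a state change arrives,
    instead of polling on a 30-second heartbeat."""
    def __init__(self):
        self.wake = threading.Event()
        self.inbox = []
        self.handled = []

    def notify(self, event):
        self.inbox.append(event)
        self.wake.set()  # wake the agent only when there is actual work

    def run_once(self, timeout=None):
        if not self.wake.wait(timeout):
            return False  # nothing happened; near-zero CPU spent
        self.wake.clear()
        while self.inbox:
            self.handled.append(self.inbox.pop(0))
        return True
```

A blocked `wait()` costs essentially nothing; a 30-second poll across five agents never stops costing.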

**Takeaway:** Polling heartbeats are a default for a reason (they're simple), but on constrained hardware with multiple agents, they're an architectural attack surface. If your agent isn't responding to anything, it shouldn't be running. Events > polling.

---

## Built But Never Deployed

I found `execute_skill()` had 93 lines of code and zero invocations.

This was a skill execution function I built for the skill library system—dynamic skill loading, argument parsing, result caching, error handling. Everything. I spent probably eight hours on it. It was beautiful code.

It was never called. Once. The agents retrieved skills but executed them through a different path—a hardcoded dispatch that bypassed the whole function. The "production" code was the ugly thing, and the elegant thing I was proud of was dead code that looked good in the repo.

This pattern repeated. I built a belief tracking system for the agents to maintain persistent world models. It had 259 active beliefs at peak. Of those, 192 were contradicted by later facts the agents encountered. No decay function meant every belief stayed "active" regardless of age or contradiction. The belief system wasn't knowledge—it was noise that accumulated.

**Takeaway:** If you build something and it never gets invoked in production, you didn't build a feature. You wrote a document. The only code that matters is the code that runs. Everything else is autobiography.

---

## Failure-Driven Learning Breaks for Successful Agents

Athena, my quality-control agent, had a 99% success rate across 983 episodes. It generated zero SOAR rules.

The system was designed to learn from failures—each error would trigger a rule generation, like a post-mortem that writes code. Athena's job was to catch errors and build guardrails. But Athena was too good. It caught everything. It never failed, so it never generated rules.

This is the paradox of failure-driven systems: the agents that need to learn the least learn the most, and the agents that need guardrails never build them because they're too reliable to generate training data.

The fix would have been injecting synthetic failures or using successful executions as negative examples. I didn't have time. Athena stayed 99% reliable and 100% useless for system improvement.

**Takeaway:** If your learning system only learns from failures, high-success agents become dead ends. You need positive-example learning or synthetic failure injection, or accept that reliable agents won't improve your system.

---

## What Survived

The COO (Chief Operating Officer) hub-and-spoke pattern survived every framework change.

This was the simplest thing I built: a central coordinator (the COO role) that handled all inter-agent communication. Agents didn't talk to each other directly—they talked to the COO, who routed messages. It added latency but eliminated the mesh explosion problem (each new agent adds N-1 connections in a mesh, 1 in a hub-spoke).

When I rewrote the entire agent system three times (yes, three), the COO pattern carried through unchanged. It wasn't the fancy part. It wasn't the interesting part. It was the plumbing, and plumbing that works doesn't need replacing.
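
The pattern is almost embarrassingly simple to sketch (names illustrative):

```python
class COOHub:
    """Hub-and-spoke router: agents register with the hub and every
    message goes through route(); no agent holds a direct peer connection."""
    def __init__(self):
        self.agents = {}

    def register(self, name, handler):
        self.agents[name] = handler

    def route(self, sender, recipient, message):
        if recipient not in self.agents:
            raise KeyError(f"unknown agent: {recipient}")
        return self.agents[recipient](sender, message)

hub = COOHub()
hub.register("alice", lambda sender, msg: f"alice got {msg!r} from {sender}")
hub.register("athena", lambda sender, msg: f"athena reviewing {msg!r}")
reply = hub.route("alice", "athena", "draft #7")
```

Adding a sixth agent is one `register` call, not five new peer connections.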

**Takeaway:** Invest heavily in the boring parts. Routing, coordination, error handling—these don't make demos, but they outlast every exciting architectural choice you make.

---

## Quality Over Quantity

I ran a comparison: 5 quality workers (better prompts, more context, higher API temperature) versus 8 budget workers (minimal context, lower cost, faster timeouts).

Quality won by a significant margin. The 5 quality workers achieved 2.3x the task completion rate of the 8 budget workers. The budget workers spent most of their time recovering from low-quality outputs—retrying, re-parsing, re-executing. The "savings" from cheap execution were eaten by failure recovery.

At $50/month, I couldn't afford 8 workers anyway. But the experiment was clarifying: the bottleneck was never API cost. It was context quality and prompt design.

**Takeaway:** One good agent outperforms five mediocre ones. More tokens spent on better prompts is always cheaper than more agents making up for bad prompts.

---

## Committing a Fix Is Not Deploying It

I shipped a fix for the zombie process issue on February 14th. The system didn't have the fix running until March 3rd.

Between commit and deployment, I ran into: the fix broke a different agent's startup sequence, I didn't have a staging environment to test, the Mac Mini was down for unrelated reasons, and I was traveling for work. The commit sat there, green checkmark, "merged," doing nothing in production.

This happened repeatedly. I'd fix something in the repo, then the fix would sit for days or weeks before actually running. Part of it was the single-machine constraint (I couldn't easily run staging and production in parallel). Part of it was my own process debt.

**Takeaway:** A fix in version control is a diff, not a deployment. If you can't ship it today, you're maintaining a to-do list, not a codebase.

---

## What I Built Instead

In mid-March, I killed the five-agent system and rebuilt around a single agent: Alice.

Event-driven architecture. No polling. No child processes left behind. No belief system without decay. No skill function that never runs. One agent, one context, one failure mode—much easier to manage.

The system runs on the same Mac Mini, the same $50/month API budget. It's less ambitious. It's less interesting to write about. It's more reliable.

That's the trade.

---

## What I'd Do Differently

If I were starting over:

1. **One agent to start.** Add only when you can prove the bottleneck is agent count, not prompt quality or context design.

2. **Process lifecycle from day one.** Every spawned process needs a parent that monitors it and kills it on exit.

3. **Event-driven from the start.** Polling heartbeats seem simpler until they aren't.

4. **Fail less by injecting synthetic failures.** Failure-driven learning requires failures. If your agent is good, make it fail on purpose.

5. **Ship first, build second.** Committing code that doesn't run is a to-do list, not a codebase.

6. **The boring parts are the important parts.** The hub-and-spoke pattern survived everything. Routing, error handling, resource limits—these are the things worth getting right.

The five-agent system was ambitious, creative, and a learning machine. It also consumed 57GB of swap and crashed weekly. The single-agent system is boring.

Boring works.

# Thompson Sampling for Retrieval Weight Learning: A Practitioner's Guide

If you've built a RAG system, you've faced the weight problem. You combine relevance scores from your vector search, recency from your usage table, and importance from some heuristic—and you weight them equally. Why 0.33/0.33/0.33? Because you had to pick something.

Park et al. (2023) did the same thing in their seminal generative agents work, which scores memories by relevance, recency, and importance. Equal weights across the board. It's a reasonable default, but it's also a guess. The dimensions aren't equally valuable. Relevance actually discriminates between skills; recency and importance saturate quickly for frequently-used skills. What you weight matters, and the right weights depend on your skill library—which means they should be learned, not guessed.

This post shows how I used Thompson Sampling (TS) to learn retrieval weights for a RAG system in production. The bandit converges within 50 tasks. Warm-start transfer avoids the cold-start cost entirely. And the key insight that took me three experimental iterations to understand: **library design is upstream of bandit design**—if your skill library is too generic, the bandit can't learn anything useful because every weight configuration retrieves the same skills.

## The Problem: Fixed Weights Don't Generalize

A RAG system retrieves from some knowledge base—in my case, a skill library for an AI assistant. The retriever produces scores along multiple dimensions:

- **Relevance**: semantic similarity between the query and the skill description
- **Recency**: how recently the skill was used (Laplace-smoothed)
- **Importance**: historical success rate of the skill

The final score is a weighted sum: `w_r * relevance + w_n * recency + w_i * importance`. The weights w are what I'm trying to learn.

With fixed equal weights (0.33 each), I was getting 90% Jaccard overlap across bandit conditions—six meta-skills dominated 86% of all retrievals regardless of weight configuration. The bandit was learning something, but the retrieval layer couldn't express it because the underlying library was too generic.

That's the first hard lesson: **you can't optimize your way past physics**. If your skill library has no diversity, different weight configurations produce identical retrieval sets. The bandit converges to something, but it's converging on a flat landscape.

## Thompson Sampling Basics

Thompson Sampling is a bandit algorithm that balances exploration and exploitation naturally. Here's how it works:

You have K "arms" (weight configurations). For each arm, you maintain a Beta distribution representing your belief about that arm's reward. At each step:

1. **Sample** one value from each arm's Beta distribution
2. **Select** the arm with the highest sample
3. **Observe** the reward (from the environment)
4. **Update** that arm's Beta distribution with the observed reward

The Beta distribution is conjugate to the Bernoulli reward signal, which makes updates trivial:

```
alpha_new = alpha_old + reward
beta_new = beta_old + (1 - reward)
```

Over time, arms with consistently high rewards pull their Beta distribution toward 1.0; arms that underperform drift toward 0. The sampling mechanism provides natural exploration—occasionally you'll sample a high value from an arm with low mean, which drives exploration of underweighted configurations.
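
The whole loop fits in a few lines. Here's a self-contained toy version (not the production bandit) where one arm genuinely pays off more than the other:

```python
import random

def thompson_step(arms, observe_reward):
    # arms: dict arm_name -> [alpha, beta]; observe_reward(name) returns 0 or 1
    samples = {name: random.betavariate(a, b) for name, (a, b) in arms.items()}
    chosen = max(samples, key=samples.get)  # sample once per arm, act on the max
    reward = observe_reward(chosen)
    arms[chosen][0] += reward        # success -> bump alpha
    arms[chosen][1] += 1 - reward    # failure -> bump beta
    return chosen, reward

# Toy environment: the "good" arm pays off 80% of the time, "bad" 20%
random.seed(0)
arms = {"good": [1, 1], "bad": [1, 1]}
payoff = {"good": 0.8, "bad": 0.2}
for _ in range(500):
    thompson_step(arms, lambda name: int(random.random() < payoff[name]))
```

After a few hundred steps, the good arm's posterior mean sits near its true payoff and it receives the large majority of pulls — the convergence-within-~50-tasks behavior described above is the same dynamic on a noisier reward.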

### Why Thompson Sampling over UCB?

Upper Confidence Bound (UCB) is the other common bandit approach. It selects arms by adding an exploration bonus to the mean reward: `UCB = mean + sqrt(2*ln(t) / N)`. It's theoretically elegant but empirically inferior for this use case for two reasons:

1. **Fixed exploration schedule**: UCB's exploration bonus shrinks as `1/sqrt(N)`, so after ~1,000 observations it explores very little regardless of how much uncertainty actually remains. TS ties exploration directly to posterior uncertainty: an arm keeps getting sampled exactly as often as its Beta posterior says it might still be best.

2. **Adaptation to reward variance**: If relevance is highly informative (high variance in true quality), TS naturally samples from it more often because the variance in the Beta distribution reflects uncertainty. UCB treats all arms with the same exploration formula.
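
For comparison, a minimal UCB1 selector looks like this (a sketch; unpulled arms get an infinite bonus so each arm is tried at least once):

```python
import math

def ucb_select(means, counts, t):
    # means[i]: empirical mean reward of arm i; counts[i]: pulls of arm i; t: total steps
    scores = [
        m + math.sqrt(2 * math.log(t) / n) if n > 0 else float("inf")
        for m, n in zip(means, counts)
    ]
    return scores.index(max(scores))
```

Note the bonus term is the same formula for every arm — that's the rigidity the TS sampling mechanism avoids.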

The experiments confirmed this. In the v3 re-run, C3 (Thompson Sampling with embedded explanations) showed 39% cumulative regret reduction in the second half of training, while C2 (Likert-only feedback) plateaued. The richer reward signal from explanation embeddings let TS's sampling mechanism do its job.

## The Reward Signal: Rate Inputs, Not Outputs

This was the second hard lesson. I initially asked the LLM to rate its own outputs—how good was the generated response? Pan et al. (ICML 2024) showed this is systematically inflated. LLMs are overly generous self-assessors; they can't distinguish between a good response and a response they just spent tokens generating.

The fix: **rate the inputs, not the outputs**. Ask the LLM to rate the retrieved skills along each dimension. Not "how good is your answer?" but "how relevant is this skill to the query? how important is this skill historically? how recent is its last use?"

This produces a cleaner reward signal. The bandit learns from actual retrieval quality, not from the LLM's tendency to be nice to itself.

There's a third-order effect I didn't anticipate: **better retrieval produces shorter outputs**. In the v3 re-run, C3's outputs were 73-77% shorter than C1's (control). The mechanism isn't retry prevention (only 1/1200 multi-step episodes)—it's that better-targeted prompts produce more focused responses. The token savings are downstream of retrieval quality, not a separate phenomenon.

## Weight Presets as Arms

I framed the problem as discrete arm selection over 12 weight configurations:

- `pure_relevance`: [1.0, 0.0, 0.0]
- `pure_recency`: [0.0, 1.0, 0.0]
- `pure_importance`: [0.0, 0.0, 1.0]
- `relevance_heavy`: [0.7, 0.15, 0.15]
- `importance_heavy`: [0.15, 0.15, 0.7]
- `recency_heavy`: [0.15, 0.7, 0.15]
- `balanced`: [0.33, 0.33, 0.34]
- `relevance_importance`: [0.5, 0.0, 0.5]
- `relevance_recency`: [0.5, 0.5, 0.0]
- `importance_recency`: [0.0, 0.5, 0.5]
- `low_importance`: [0.5, 0.3, 0.2]
- `high_recency`: [0.3, 0.5, 0.2]

Discrete choices are easier to reason about than continuous weight spaces, and they map directly to operational presets in the retrieval pipeline. When the bandit picks `pure_relevance`, the retriever simply weights relevance at 1.0 and ignores recency/importance.
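
Operationally, each arm is just a lookup into a weight table (a sketch mirroring the preset list above; the dict itself is illustrative):

```python
# Arm name -> (w_relevance, w_recency, w_importance)
WEIGHT_PRESETS = {
    "pure_relevance":       (1.0, 0.0, 0.0),
    "pure_recency":         (0.0, 1.0, 0.0),
    "pure_importance":      (0.0, 0.0, 1.0),
    "relevance_heavy":      (0.7, 0.15, 0.15),
    "importance_heavy":     (0.15, 0.15, 0.7),
    "recency_heavy":        (0.15, 0.7, 0.15),
    "balanced":             (0.33, 0.33, 0.34),
    "relevance_importance": (0.5, 0.0, 0.5),
    "relevance_recency":    (0.5, 0.5, 0.0),
    "importance_recency":   (0.0, 0.5, 0.5),
    "low_importance":       (0.5, 0.3, 0.2),
    "high_recency":         (0.3, 0.5, 0.2),
}

def final_score(arm, relevance, recency, importance):
    # The bandit's chosen arm fully determines how the retriever scores a skill
    w_rel, w_rec, w_imp = WEIGHT_PRESETS[arm]
    return w_rel * relevance + w_rec * recency + w_imp * importance
```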

## Results: Convergence and Retrieval Quality

The bandit converges quickly. In the v3 re-run with 205 domain-specific skills:

- **Convergence time**: C3's changepoint occurs at task 33-48. The bandit has figured out the optimal arm within ~50 tasks.
- **Convergence target**: C2 and C3 both converge to `pure_relevance` (0.749 posterior mean). This wasn't predetermined—the bandit discovered that relevance is the only discriminating dimension for this library.
- **Retrieval quality**: C3 shows +41% NDCG@5 and +36% MRR versus C1 (the control with equal weights).

One counter-intuitive finding: **better retrieval correlates with lower mean final_score**. The final score is the weighted sum of retrieval dimensions. If you weight relevance heavily (which has genuine variance), the mean score drops because relevance alone is below 1.0. If you weight recency/importance heavily (which saturate near 1.0 for frequently-used skills), the score inflates. Don't use mean final_score as a quality metric—it reflects the weight configuration, not retrieval quality.

The C4 condition (qualitative feedback with anchor embeddings) converges differently—it converges to `pure_importance` but with a compressed recency dimension (mean 0.14). The anchor text for recency was too extreme; the embedding distance from any real text to the high-recency anchor is so large that softmax always assigns low probability. Anchor design matters.

## Key Insight: Library Design is Upstream

This is the finding that changed how I think about retrieval systems:

- **v2 (50 generic skills)**: 90% Jaccard overlap across bandit conditions. Six meta-skills dominated 86% of retrievals. The bandit converged but couldn't differentiate retrieval results.
- **v3 (205 domain-specific skills)**: 3% Jaccard overlap. The bandit could finally express its learned preferences because the underlying library had enough diversity.

The skill library is not just the knowledge source—it's the hypothesis space for weight learning. If you don't have diverse, domain-specific skills, the bandit has nothing to work with. Library design is upstream of bandit design.

Practical Deployment

For cold-start, expect ~50 API calls before the bandit stabilizes. That's the convergence threshold. After that, the system is exploiting more than exploring.

Warm-start transfer eliminates this cost entirely. I ran C6 (warm-start from v3 priors) and got -7.3% tokens versus the v3 C3 baseline, with 0 re-learning episodes. The prior transfers across task sets within the same domain. You lose the exploration cost and gain immediate efficiency.

Here's the Beta update in pseudocode:

```python
import random

def update_arm(alpha, beta, reward):
    # reward is binary: 1 if the skill was useful, 0 otherwise
    alpha_new = alpha + reward
    beta_new = beta + (1 - reward)
    return alpha_new, beta_new

def select_arm(arms):
    # arms: list of (alpha, beta) pairs; sample each posterior, act on the max
    samples = [random.betavariate(a, b) for a, b in arms]
    return max(range(len(samples)), key=samples.__getitem__)
```

That's it. Conjugate updates, sample-then-act, natural exploration from variance.
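Warm-start transfer is just a different initialization of the same update: instead of starting every arm at an uninformative Beta(1, 1), seed it with the posterior counts learned on the previous task set. A minimal sketch, with arm names and counts invented for illustration rather than taken from the experiment:

```python
import random

def thompson_pick(arms, rng):
    # Sample each arm's Beta posterior and act on the largest draw.
    samples = {name: rng.betavariate(a, b) for name, (a, b) in arms.items()}
    return max(samples, key=samples.get)

rng = random.Random(0)

# Cold start: uninformative Beta(1, 1) priors; the bandit must explore.
cold = {"pure_relevance": (1, 1), "pure_recency": (1, 1)}

# Warm start: posterior counts carried over from the previous run
# (chosen here to give roughly a 0.75 posterior mean for the good arm).
warm = {"pure_relevance": (38, 13), "pure_recency": (9, 22)}

cold_hits = sum(thompson_pick(cold, rng) == "pure_relevance" for _ in range(1000))
warm_hits = sum(thompson_pick(warm, rng) == "pure_relevance" for _ in range(1000))
# Cold start picks the good arm ~50% of the time (pure exploration);
# warm start exploits the transferred prior from the first call.
```

With the transferred counts the good arm's posterior mean is 38/51 ≈ 0.745, so exploitation begins on the first call, which is why the ~50-call cold-start cost disappears.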

What I'd Do Differently

A few things I'd change if rebuilding:

1. **Adaptive rubric injection**: Easy tasks (under 1.5 mean steps) performed 28% worse with feedback because the rubric overhead outweighed its benefit. Only inject the feedback rubric for medium/hard tasks.

2. **BCa bootstrap for confidence intervals**: The C1 token distribution is bimodal (single-step cluster + multi-step tail). Percentile bootstrap is biased. BCa corrects for skewness.

3. **Monitor practitioner communities, not just papers**: The agentic search shift (Vercel just-bash, LlamaIndex filesystem benchmarks) was visible in practitioner communities 6-12 months before formal papers. Papers are lagging indicators.
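On point 2, the mechanics are easy to sketch. Below is a plain percentile bootstrap on a synthetic bimodal token distribution (all numbers invented); the comments note where BCa differs. `scipy.stats.bootstrap(..., method='BCa')` is one off-the-shelf implementation of the correction.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic bimodal token costs: a single-step cluster plus a
# multi-step tail, mimicking the C1 distribution's shape.
sample = ([rng.gauss(200, 20) for _ in range(80)]
          + [rng.gauss(2000, 400) for _ in range(20)])

def percentile_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05):
    # Plain percentile bootstrap: resample with replacement,
    # then take empirical quantiles of the statistic.
    boot_stats = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = boot_stats[int(n_boot * alpha / 2)]
    hi = boot_stats[int(n_boot * (1 - alpha / 2))]
    return lo, hi

lo, hi = percentile_ci(sample)
# For skewed/bimodal data like this, the percentile interval is biased;
# BCa applies a bias and acceleration correction to the same resamples.
```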

The weight learning works. It converges fast. But it's only meaningful when your library has the diversity to express different weight configurations. Build the library first; optimize the weights second.

What Happens When You Let 5 AI Agents Run for 6 Months
1530 words
# What Happens When You Let 5 AI Agents Run for 6 Months



Late February 2026. I'd just killed two product ideas in eight days.

Cultural Intelligence first — the thesis was clean: quantify cultural profiles for 50+ countries, feed that into AI business briefings. Ran the same validation prompts through six LLMs. All six converged on the same problem. National culture explains something like 2-3% of individual behavioral variance. The foundation couldn't support the product. Kill.

Behavioral Intelligence next. AI-generated behavioral personas for UX testing. Four models, same methodology. Found 15+ competitors, LLM persona drift, a validation paradox I couldn't solve, and employer constraints that gutted the highest-value segments. Kill.

Eight days. Two complete research-to-kill cycles. The agent architecture questions kept distracting me from these kills — every time I tried to evaluate a product, I got pulled into "how would I actually build this" tangents. Turns out those tangents were the signal, not the noise.

So I stopped planning products and started building agents.

The Research That Built the Building

Before writing code, I ran what I called P1-P4: four synthesis documents across four frontier models (Claude, ChatGPT, Perplexity, Gemini) with identical prompts. Twelve hundred lines across four phases.

P1 was market reality. Found that 85-90% of marketed "autonomous agents" are L0-L1 workflow automations. Gartner had the whole space at "Peak of Inflated Expectations" with 40%+ cancellations predicted by end of 2027. Only about 5% have actual production deployment. That set expectations correctly — I wasn't building a product, I was building a lab.

P2 was technical architecture. Compared heartbeat versus goal-oriented architectures. Mapped control theory onto agent loops — PID, Thompson Sampling, Kalman Filtering. The cognitive architecture gap between SOAR and LLM agents was stark.

P3 was innovation rankings by impact-to-effort. Thompson Sampling won unanimously: 10/10 viability, 2-4 weeks to implement, 15-30% improvement predicted. That became the core mechanism.

P4 was hardware-constrained implementation. Mac Mini M4, 24GB RAM. Budget: $50/month. The analysis said never use Docker on constrained hardware — it wastes 2-4GB of memory for maybe 10% of the benefit. Local models for routine work, API for complexity. Waterfall principle.

All four models agreed on the big things. That consensus is what made the P1-P4 approach worth doing — when Claude and Gemini and ChatGPT and Perplexity all land on the same conclusion, you can trust it. When they diverge, you know where the uncertainty lives.

Five Agents, One Budget

I built five agents. Not eight, not twelve: five quality agents rather than eight budget ones. The reliability math is unforgiving: five agents each at 95% reliability multiply to 0.95^5 ≈ 77% system reliability, while cost multiplies 20-30x. You can't optimize your way past physics.

**Alice** was the auditor — governance, error detection, belief tracking. Kept 259 beliefs at any time, of which 192 were contradicted at some point during the run. That's a feature, not a bug. The system was learning what it didn't know.

**Laplace** was research. The one that generated all the Thompson Sampling data, the weight learning experiments, the retrieval quality improvements. Named Laplace because it tried to be the probabilistic center of everything.

**Athena** was learning — skill compilation, knowledge base management. Ran at 99% success rate for six months. Learned absolutely nothing worth keeping. The skill that was built but never called. That's expensive astrology — optimizing a metric that doesn't matter. I should have killed Athena after month two.

**Gio** was execution. The one that actually did work.

**Simon** was communication. Interface to external systems, email, Slack, the dashboard.

All running on a Mac Mini M4. $50/month API budget. Some months I hit $47, some $52. Averaged out.

What Actually Happened

The first three months were a nightmare of zombie processes. Something in the orchestration layer would hang, the agent would spawn a child process and exit, and the child would sit there burning CPU until I found it in `ps`. OOM spirals — the embedding model would load into memory alongside the local LLM and the agent state, and the whole thing would collapse into swap. Restarting didn't help because the underlying memory pressure was architectural, not transient.

Athena ran at 99% success. That was the problem. With a 99% success rate, there was almost no negative feedback signal. The bandit couldn't learn because every arm looked equally good. Success doesn't teach you anything. Failure teaches you everything. Athena was a perfect example of Goodhart's Law — when a measure becomes a target, it ceases to be a good measure. The 99% success rate was a vanity metric hiding a complete absence of learning.
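The failure starvation is easy to reproduce with a toy simulation (all counts invented): near saturation, even a genuinely worse arm is hard to distinguish, because both posteriors pile up against 1.0.

```python
import random

rng = random.Random(0)

def pick_rate(arm_a, arm_b, n=10000):
    # Fraction of Thompson draws preferring arm A over arm B.
    wins = sum(rng.betavariate(*arm_a) > rng.betavariate(*arm_b)
               for _ in range(n))
    return wins / n

# After 100 pulls each at ~99% vs ~97% success, the posteriors still
# overlap: the worse arm wins more than a tenth of the draws.
high = pick_rate((99, 1), (97, 3))

# A comparable gap at mid-scale (60% vs 40% success) separates
# almost immediately.
low = pick_rate((60, 40), (40, 60))
```

The gap only becomes learnable once failures accumulate, which is the argument for optimizing learning rate over success rate.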

But then something unexpected happened.

The retrieval weight dynamics — that's the finding that came from the experiment, not from planning it. I wasn't trying to study retrieval weights. I was trying to study whether Thompson Sampling could learn to route tasks to the right skills. What I found was that relevance dominates recency and importance in knowledge work. Across 579 Thompson Sampling iterations, relevance scored 0.91, recency 0.78, importance 0.75. Publishable if I can validate it under controlled conditions. Challenges the Park et al. equal-weight assumption that's been floating around since 2023.

The discovery came from operating the system, not from the system design. That's the thing about building in public — you have to be willing to follow the data somewhere other than where you planned to go.

Cross-domain task shuffling revealed something else. When I shuffled tasks across domains, recency became more valuable than importance — cross-pollination rewards freshness. When I ran them sequentially, importance dominated because the same theme kept recurring. The retrieval dimension preferences depend on task distribution, not on the agent's intrinsic character. That's the second publishable finding.

The Collapse

Month five, I started sunsetting.

The architecture shifted from polling to event-driven. The five agents with their heartbeat loops became one agent with event subscriptions. Sprint 7 was the autonomous loop — governance framework, 24/7 operation, no manual intervention for 31 days straight.

Five agents became one. Alice. The auditor that kept track of what was true and what was contradicted — that was the only one worth keeping. Everything else was either too specialized (Laplace), too quiet (Athena), or too operational (Gio, Simon) to justify the reliability multiplication cost.

Less is more. That's the thing they don't tell you about agent systems. The temptation is to add more agents, more capabilities, more loops. But each agent you add multiplies your failure rate and your token consumption. The best agent architecture is the smallest one that can do the job.

What I Learned

The thing worth building was the experiment that came out of operating the system.

Not the agents themselves. Not the dashboard. Not the five-agent orchestration. The experiment — the Thompson Sampling bandit, the dimension-specific self-assessment rubric, the controlled comparison between retrieval weight configurations. That's what became a paper on arXiv. That's what might become a citation.

The kills led to the builds led to the research. Killing Cultural Intelligence in eight days meant I wasn't six weeks into a product that couldn't work. Killing Behavioral Intelligence meant I wasn't three months into a market I couldn't serve. Both kills used the same multi-model validation methodology that I'd later apply to P1-P4. The method outlived the products.

If you're building agent systems, here are the things I'd do differently:

**Don't optimize for success rate.** Optimize for learning rate. A system at 99% success is a system that isn't learning. You need failure to teach you something.

**Set max_tokens high enough that the model can finish in one turn.** The truncation cascades will compound your costs quadratically. I hit this with max_tokens=4096 — every truncated response triggered a retry loop that cost more than the original call. Setting it to 16384 eliminated the problem entirely.

**Don't inject usage history into your retrieval pool.** I had a "Pool B" that injected previously-used skills with their recency and importance scores already saturated. They became mathematically unbeatable. The bandit couldn't learn because the same five skills dominated every retrieval. Removed Pool B entirely — let the weights learn from scratch.

**Bandits converge within about 50 tasks, not gradually.** Before that, they're still exploring. After that, marginal improvement is small. That's your cold-start cost — 50 API calls, then the system is running on learned priors.

**The compound metric (token cost per quality unit) matters more than averages.** Average token savings of 75% sounds great until you decompose it and find it's driven entirely by a subset of multi-step episodes. The compound metric — cost per ground truth hit — showed a 78x difference that the averages completely hid.
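The decomposition is worth making concrete. A toy version with invented episode data shows how the per-episode average and the compound metric can point in opposite directions:

```python
# Synthetic episodes: (tokens_spent, ground_truth_hits). Invented numbers,
# chosen only to show the shape of the effect.
episodes_a = [(100, 0)] * 90 + [(3000, 1)] * 10   # cheap on average, few hits
episodes_b = [(400, 0)] * 50 + [(2000, 5)] * 50   # pricier on average, many hits

def mean_tokens(episodes):
    return sum(t for t, _ in episodes) / len(episodes)

def cost_per_hit(episodes):
    # Compound metric: total tokens divided by total ground-truth hits.
    total_tokens = sum(t for t, _ in episodes)
    total_hits = sum(h for _, h in episodes)
    return total_tokens / total_hits

mean_a, mean_b = mean_tokens(episodes_a), mean_tokens(episodes_b)   # A looks cheaper
cph_a, cph_b = cost_per_hit(episodes_a), cost_per_hit(episodes_b)   # B is cheaper per hit
```

Here A wins on mean tokens per episode but pays roughly 8x more per ground-truth hit; the average metric and the compound metric disagree about which system is better.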

I'm still running Alice. Still learning. Still killing ideas faster than I build them. The paper is submitted. The methodology is documented. The Mac Mini is on its third fresh install.

The thing I built wasn't five agents. The thing I built was the experiment.

---

*Kusp is the research journal of a solo builder with a full-time job. This is what happens when you let AI agents run for six months and actually pay attention to what goes wrong.*
