# Can AI Agents Learn What to Remember?
I built a system that helps an AI agent pull the right skills from a library when it needs to solve a task. The library has 205 skills — things like "write a SQL query," "parse JSON," "format a date." The agent sees a user request, retrieves relevant skills, and uses them to generate a response.
The problem: not all retrieval dimensions are equal. Relevance (does this skill actually help?) matters, sure. But recency (did I use this recently?) and importance (did it work well last time?) also compete for weight. Most systems assign them equal importance or pick weights arbitrarily. I wanted to know: can an agent learn which dimension matters most for its particular library?
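For concreteness, the three-dimension scoring works like this, as a minimal Python sketch (field names and the weight tuple are illustrative, not the production code):

```python
def retrieval_score(rel, rec, imp, weights=(1.0, 1.0, 1.0)):
    """Combine the three dimensions; equal weights is the common default."""
    w_rel, w_rec, w_imp = weights
    return w_rel * rel + w_rec * rec + w_imp * imp

def top_k(skills, weights, k=5):
    """Rank skills, each pre-scored on all three dimensions, by combined score."""
    return sorted(
        skills,
        key=lambda s: retrieval_score(s["rel"], s["rec"], s["imp"], weights),
        reverse=True,
    )[:k]
```

Learning the weights means learning which of the three terms should dominate for your particular library.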
I ran the experiment on a Mac Mini M4 with a $50/month MiniMax API budget. Five agents. 1,200 episodes across four conditions. Here's what happened.
## The Production Observation
In production, I watched relevance dominate. Across 579 Thompson Sampling iterations, the relevance dimension scored 0.91 on average. Recency came in at 0.78. Importance lagged at 0.75.
This surprised me. I assumed recency would matter more — agents typically benefit from trying what worked recently. But the 205-skill library had enough semantic diversity that semantic relevance was the discriminating factor. The same six meta-skills dominated 86% of retrievals regardless of recency or importance weighting.
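The production loop treats each weight preset as a bandit arm. A minimal Thompson Sampling sketch, assuming Beta posteriors and a scalar reward in [0, 1] (the arm names other than pure_relevance are illustrative):

```python
import random

ARMS = {
    "equal":          (1.0, 1.0, 1.0),   # (w_rel, w_rec, w_imp)
    "pure_relevance": (1.0, 0.0, 0.0),
    "recency_heavy":  (0.5, 1.5, 0.5),
}
# Beta(alpha, beta) posterior over each arm's expected reward.
posterior = {arm: [1.0, 1.0] for arm in ARMS}

def choose_arm():
    """Sample a plausible reward for every arm; play the best sample."""
    samples = {a: random.betavariate(*posterior[a]) for a in ARMS}
    return max(samples, key=samples.get)

def update(arm, reward):
    """Fractional Beta update for a reward in [0, 1]."""
    posterior[arm][0] += reward
    posterior[arm][1] += 1.0 - reward
```

Each iteration picks a weight preset, runs retrieval with it, and feeds the resulting reward back into that arm's posterior.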
Park et al. assumed equal weights in their retrieval model. My production data suggested that assumption was wrong — at least for this library. But production observations aren't causal evidence. The agent could have been learning the wrong thing, or the pattern could be an artifact of how I was measuring.
I needed a controlled experiment.
## The Controlled Experiment
I ran 1,200 episodes across four conditions:
- **C1 (control):** Static BM25 retrieval, no weight learning
- **C2:** Likert-scale self-assessment only
- **C3:** Likert-scale + explanation embeddings — the system rates its retrieval and explains why
- **C4:** Qualitative feedback (free-text explanations parsed via anchor embeddings)
The key metric was NDCG@5 — how well the top-5 retrieved skills match the ground truth. C3 achieved +41% NDCG@5 and +36% MRR versus control. That's a substantial gap.
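With binary relevance, both metrics reduce to a few lines (a sketch; the actual harness may use graded relevance):

```python
import math

def dcg(gains):
    """Discounted cumulative gain over a ranked list of gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved, relevant, k=5):
    """retrieved: ranked skill ids; relevant: the ground-truth set."""
    gains = [1.0 if s in relevant else 0.0 for s in retrieved[:k]]
    idcg = dcg([1.0] * min(k, len(relevant)))
    return dcg(gains) / idcg if idcg else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first ground-truth hit."""
    for i, s in enumerate(retrieved):
        if s in relevant:
            return 1.0 / (i + 1)
    return 0.0
```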
But the mechanism changed between runs, and that's where it gets interesting.
## The Mechanism Shift
In the first run (v2), the token savings came from multi-step prevention. The control condition hit multi-step episodes 11.6% of the time — the agent would fail, retry, fail again, accumulate context, and burn tokens. The weight-learning conditions dropped that to 3.2-3.6%.
I wrote that up as "multi-step prevention is the mechanism."
Then I increased max_tokens from 4096 to 16384, fixed a bug with usage-history injection, and re-ran.
The result: only 1 out of 1,200 episodes was multi-step. The mechanism had shifted. Token savings now came from output-driven efficiency — better retrieval produced shorter, more focused LLM responses. The agent wasn't retrying less; it was finishing faster in a single step.
The v2 finding was an artifact of max_tokens=4096 triggering truncation cascades. When the model couldn't finish in one turn, it output exactly 4096 tokens, the runner interpreted that as "not done," continued the conversation, and costs compounded quadratically.
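The cascade arithmetic is worth making concrete (a sketch; the 1,000-token base prompt is an illustrative assumption):

```python
def cascade_cost(max_tokens, steps, base_prompt=1_000):
    """Total input+output tokens when every truncated turn is resent
    as context for the next one. Grows roughly quadratically in steps."""
    total, context = 0, base_prompt
    for _ in range(steps):
        total += context + max_tokens  # this turn's input + truncated output
        context += max_tokens          # output is appended to the next prompt
    return total
```

At max_tokens=4096, a three-step cascade already costs over five times a clean single step; raising the cap so the model finishes in one turn removes the entire tail.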
You can't optimize your way past physics.
## What Worked
The bandit converges within roughly 50 tasks. Not gradually — it flips from exploratory to exploitative around task 33-48. After that, it's mostly pure_relevance. The cold-start cost is real but bounded.
C3 (Likert + explanation embeddings) outperformed C2 (Likert only). The explanation embeddings don't directly improve retrieval — they improve the quality of the reward signal feeding the bandit. The SNR was 2.5x higher. C3's cumulative regret dropped 39% in the second half versus 4% for C2. The feedback signal quality, not just its presence, determines whether online learning accelerates.
The skill library composition matters. v2 used 50 generic skills and got 90% Jaccard overlap across conditions — the bandit couldn't differentiate because the same skills dominated regardless of weights. v3's 205 domain-specific skills dropped Jaccard to 3%, enabling genuine differentiation. Library design is upstream of bandit design.
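The Jaccard overlap reported here is just set overlap between the skills two conditions actually retrieved:

```python
def jaccard(a, b):
    """Intersection over union of two retrieved-skill sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

At 90% overlap the weights barely change what gets retrieved; at 3%, the weights are doing real work.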
## What Didn't Work
C4 (qualitative feedback via anchor embeddings) didn't converge cleanly. The recency dimension collapsed to mean 0.14 — the "high recency" anchor text was too extreme. Any real LLM response was far from it in embedding space, so softmax always assigned low probability. Anchor design determines which dimensions the parser can distinguish. The recency anchors needed more moderate language.
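A sketch of the failure mode, assuming cosine similarities against three recency anchor texts and a softmax temperature of 0.1 (all numbers illustrative):

```python
import math

def softmax(xs, temp=0.1):
    """Temperature-scaled softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Similarity of a typical explanation to the low/medium/high recency anchors.
# An over-extreme "high" anchor sits far from every real response.
anchor_sims = {"low": 0.52, "medium": 0.48, "high": 0.21}
probs = dict(zip(anchor_sims, softmax(list(anchor_sims.values()))))

# Parsed recency = expected value over anchor levels {0.0, 0.5, 1.0}.
recency = 0.5 * probs["medium"] + 1.0 * probs["high"]
```

No matter what the model wrote, the "high" anchor ends up with a few percent of the probability mass, so the parsed recency score pins low.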
The control condition (C1) has extreme variance — SD of 110,546 tokens. The bimodal distribution (single-step cluster plus multi-step tail) killed statistical power. Only C1 vs C4 survived Bonferroni correction. C1 vs C2 (p=0.264) and C1 vs C3 (p=0.069) failed despite 73-76% mean reductions and bootstrap CIs excluding zero.
Easy tasks showed rubric overhead. C1 averaged 3,198 tokens on easy tasks; C4 averaged 4,078 — feedback made it worse. The rubric adds ~880 tokens of overhead that easy tasks don't benefit from. The crossover point is around 1.5 mean steps. Below that, feedback hurts.
## What This Means for Practitioners
If you're building a retrieval-augmented agent and assigning static weights to relevance, recency, and importance — stop. The optimal weights depend on your skill library composition, and you can learn them.
The cold-start cost is ~50 tasks. After that, the bandit has converged and you're in the exploitation regime. If you have prior data, you can skip the cold start entirely: a separate condition (C6), seeded with warm-start priors, hit efficiency immediately without re-learning.
The token savings aren't from retry prevention in most configurations. They're from output-driven efficiency: better retrieval produces more targeted prompts, which produces shorter LLM responses. If your max_tokens is too low, you'll see retry behavior instead. Set it high enough that the model can finish in one turn.
Finally, don't trust mean final_score as a quality metric. C3 had the highest retrieval quality but the lowest mean final_score. That's because the system learned to down-weight dimensions that saturate near 1.0 (recency, importance) and up-weight relevance (which has genuine variance but scores lower alone). The compound metric — token cost per ground truth hit — is more informative. C1 needed 895,581 tokens per GT hit. C2 needed 11,421. That's a 78x difference in cost-effectiveness.
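The compound metric is trivial to compute from episode logs, and the reported gap checks out:

```python
def tokens_per_gt_hit(total_tokens, gt_hits):
    """Token cost per ground-truth skill retrieved; lower is better."""
    return total_tokens / gt_hits if gt_hits else float("inf")

# The per-hit costs reported above imply:
ratio = 895_581 / 11_421   # C1 vs C2, roughly 78x
```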
The field is moving past traditional RAG for small-to-medium corpora. Agentic keyword search, fine-tuning to internalize retrieval benefits, and long-context models all challenge the layer I'm optimizing. But the core insight — that agents can learn which retrieval dimensions matter, via gradient-free parameter learning and self-assessment — transfers beyond RAG. That's the durable finding.