My degree is in Intercultural & International Communications. Math & marketing minors. No CS.
The most useful thing I've learned building AI agents came from caseworker training in Albany, not a textbook.
CPS caseworkers use Likert anchoring to assess risk. Ambiguous situation — home visit, conversation with a parent — you map it against calibrated reference points. Not "is this bad?" but "compared to these specific scenarios, where does this fall?"
That's prompt engineering. I just didn't know the word yet.
In my experiment, the condition using anchored self-assessment — specific scenarios for each rating level instead of bare numbers — outperformed basic Likert by 41% in retrieval quality. Same technique. Same principle. Give the evaluator, human or LLM, concrete reference points instead of abstract scales.
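Roughly what that looks like in a prompt. The anchors and scale below are illustrative, not the ones from my experiment:

```python
# Illustrative only: these anchors are made up to show the shape of the technique.
BARE_LIKERT = "Rate the relevance of this memory to the query on a scale of 1-5."

ANCHORED_LIKERT = """Rate the relevance of this memory to the query.
1 = Unrelated (query about API errors, memory about the user's vacation plans)
2 = Same broad domain, no direct bearing (query about API errors, memory about UI styling)
3 = Related topic, different task (query about API errors, memory about an old timeout bug)
4 = Same task, partial overlap (query about API errors, memory of a similar error on a different endpoint)
5 = Directly answers the query (query about API errors, memory of this exact error and its fix)
Return only the number."""

def build_rating_prompt(query: str, memory: str, anchored: bool = True) -> str:
    """Same evaluator, same question. The only difference is concrete reference points vs. bare numbers."""
    scale = ANCHORED_LIKERT if anchored else BARE_LIKERT
    return f"{scale}\n\nQuery: {query}\nMemory: {memory}"
```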
Communications theory has this thing — Sapir-Whorf hypothesis — language shapes cognition. For humans, debatable. For LLMs, trivially true. Language IS their cognition. No escape hatch. No nonverbal processing. No gut feeling.
Everything I studied about how framing changes interpretation, how anchoring calibrates judgment, how context shapes meaning — it applies to LLMs more literally than it ever applied to people.
Turns out a communications degree IS an AI degree. You just have to build something to find out.
Added "don't do X" guardrails to a prompt. Got worse outputs than the baseline. Spent two hours debugging it.
Turns out there's a name for this. Wegner's white bear experiment, 1987 — tell someone not to think about white bears & they think about them more. A monitoring process keeps checking for the forbidden thought, which keeps it activated.
LLMs do something similar. But not always. I tested a batch of negative instructions across our agent system & the pattern is more specific than "negative instructions fail."
Suppress a structural behavior ("don't append a question") — works. We got a 3.5x reduction in multi-step episodes.
Suppress a semantic concept ("don't mention elephants") — backfires. Makes the concept high-activation with no alternative.
The difference is mechanical. "Don't say X" dampens X's token probability but doesn't boost an alternative. "Say Y instead" directly increases Y's probability. One pushes down, the other pulls up.
There's also a separate thing that gets confused with this — the Waluigi Effect. After RLHF makes a model satisfy property P, ~P becomes more accessible via adversarial prompting. Real effect, but structural — baked into training, not a runtime instruction failure. Different problem.
Our fix for the semantic cases was architectural. Removed the mention entirely & enforced the constraint by changing what the system was allowed to generate. No negation in the prompt. Just a different set of allowable outputs.
Rule of thumb: negative instructions work for patterns. They fail for concepts. If your "don't" is failing, flip it to a positive instruction or enforce architecturally.
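A sketch of that rule of thumb. The prompts and the action set here are hypothetical, just to show the shape, not what our system actually runs:

```python
# Pattern-level suppression: a negative instruction on structure tends to hold.
STRUCTURAL = "End your reply with a single summary sentence. Do not append a follow-up question."

# Concept-level suppression: flip the "don't" into a positive instruction that pulls up an alternative.
BAD_SEMANTIC = "Don't mention the internal project codename."
GOOD_SEMANTIC = "Refer to the project only as 'the platform'."

# Architectural enforcement: the constraint lives in code, so the prompt never names the forbidden thing.
ALLOWED_ACTIONS = {"search", "summarize", "answer"}

def enforce_action(raw_action: str) -> str:
    """Clamp the model's chosen action to the allowed set instead of asking it not to stray."""
    action = raw_action.strip().lower()
    return action if action in ALLOWED_ACTIONS else "answer"
```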
Equal retrieval weights. Relevance, recency, importance — 0.33 each. Park et al. did the same thing. Seemed reasonable.
Then I ran 1,200 episodes across 4 conditions & watched the system learn that relevance should be at 0.90, not 0.33. The equal-weight assumption was costing 41% in retrieval quality.
The fix was Thompson Sampling — a bandit algorithm from 1933. Each query is an experiment. The system samples from its beliefs about which weights work, observes what happens, updates. No grid search. No manual tuning. 50 queries to converge.
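Here's a minimal Thompson Sampling sketch under stand-in assumptions: three candidate weight configs, a binary "was retrieval useful" reward, one Beta posterior per config. The real search space and reward signal differ, but the loop is the same.

```python
import random

# Hypothetical candidate (relevance, recency, importance) weightings, not the real search space.
CONFIGS = [(0.33, 0.33, 0.33), (0.60, 0.20, 0.20), (0.90, 0.05, 0.05)]

# One Beta posterior per config: [successes + 1, failures + 1].
posteriors = [[1, 1] for _ in CONFIGS]

def choose_config() -> int:
    """Sample a plausible success rate for each config and pick the best draw."""
    samples = [random.betavariate(a, b) for a, b in posteriors]
    return samples.index(max(samples))

def update(idx: int, useful: bool) -> None:
    """Fold a binary 'was retrieval useful' reward back into the chosen config's posterior."""
    posteriors[idx][0 if useful else 1] += 1

def run_query_with_weights(weights) -> bool:
    """Stand-in for real retrieval + usefulness check; here, relevance-heavy configs win more often."""
    relevance, _, _ = weights
    return random.random() < relevance

# Each query is one experiment: sample, retrieve with that config, observe, update.
for _ in range(50):
    idx = choose_config()
    update(idx, run_query_with_weights(CONFIGS[idx]))
```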
What broke first: my skill library. 50 generic skills meant every weight configuration retrieved the same 6 results. 90% Jaccard overlap. The bandit couldn't learn because there was nothing to differentiate. Swapped to 205 domain-specific skills & Jaccard dropped to 3%. Library design is upstream of everything.
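The overlap check itself is a few lines of Python. The skill IDs below are placeholders:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two retrieval result sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Placeholder result sets: top skills retrieved under two different weight configs.
results_equal_weights = {"plan", "summarize", "search", "rank", "draft", "review"}
results_relevance_heavy = {"plan", "summarize", "search", "rank", "draft", "cite"}

print(jaccard(results_equal_weights, results_relevance_heavy))  # high overlap = nothing for the bandit to learn
```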
What broke second: the reward signal. The LLM was rating "importance" highest 53% of the time regardless of actual weights. Systematic bias. The bandit converged to the wrong point because the signal was lying.
What worked: rating the inputs, not the outputs. "Were the retrieved skills useful?" not "was your response good?" Pan et al. showed output self-assessment inflates scores. Input assessment avoids the self-referential loop.
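The difference in the reward prompt is small but it's the whole game. Wording here is illustrative, not verbatim from the system:

```python
# Illustrative prompt templates, not the exact wording used in the experiment.
OUTPUT_SELF_ASSESSMENT = (
    "Here is the response you just wrote:\n{response}\n"
    "Rate its quality from 1-5."  # inflates: the model is grading its own work
)

INPUT_ASSESSMENT = (
    "Here is the query:\n{query}\n"
    "Here are the skills that were retrieved for it:\n{skills}\n"
    "Rate from 1-5 how useful these skills were for answering the query."  # rates the inputs, not the output
)
```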
The result nobody expected — better retrieval made the LLM write shorter responses. Not fewer retries. Shorter single-step outputs. 20% token reduction from giving the model better context, not from preventing failures.
Code is open source on GitHub.