Substack Drafts

Voice-rewritten 2026-03-31.

Ready to Publish
Two Kills in Eight Days
1,040 words Ready

What happens when you run the same question through four AI models and they all tell you to stop.


I killed two product ideas in eight days. I'm writing about it because the kills taught me more than any launch would have.

The first idea: Cultural Intelligence

The pitch was compelling. Build AI-powered cultural briefings for businesses working across borders. Cross-cultural M&A prep. Trading strategies informed by cultural psychology. Local business automation that understands regional communication norms.

I went deep. Started with Hofstede's six canonical cultural dimensions — power distance, individualism, uncertainty avoidance, the classics. Then expanded to 58 dimensions across the GLOBE project, World Values Survey, & Erin Meyer's Culture Map. Built cross-validation prompts. Designed the data architecture. This felt like a genuine moat. Nobody else was layering cultural dimension analysis on top of standard business intelligence.

I was excited.

Then I did something I've started doing with every major direction: ran the same fundamental question through four different AI models in parallel. ChatGPT for synthesis, Perplexity for fact-checking & academic sources, Gemini for alternative angles, Claude for critical analysis.

The question: What percentage of individual behavioral variance is explained by national culture across validated research?

Every model pointed to the same literature. The same meta-analyses. The same conclusion.

2-4%.

National culture explains 2-4% of individual behavioral variance in business contexts. Individual personality explains 30-40%. Organizational culture: 15-25%. Industry norms: 10-15%. Situational context, education, generational differences — all bigger factors.

The variance within a culture is roughly 10x larger than the variance between cultures. A Japanese executive's behavior is more shaped by their personality, their company's culture, & their specific situation than by "Japan scores high on Uncertainty Avoidance."

I had 58 dimensions that looked impressive and explained less than background noise.

Why I couldn't save it

I tried.

"Maybe if I layer cultural dimensions on top of personality assessments and organizational culture data..." That's three complex datasets where the cultural layer is the weakest signal. Complexity without value.

"Maybe cultural dimensions matter more in specific contexts — dating apps, travel planning..." That's motivated reasoning. I wanted my research to be useful. But 2-4% doesn't change because I change the use case.

A client paying $150-400 for insights that explain 2-4% of what they care about? That's expensive astrology.

Using cultural dimensions to predict market behavior while ignoring 96-98% of what drives investor decisions? That's noise trading.

This wasn't an execution problem where better implementation would fix things. This was a physics problem. And you can't optimize your way past physics.

Kill #2: Behavioral Personas

Four days later I pivoted to AI-driven personality simulation. Structured personas for sales training, negotiation prep, that kind of thing.

Ran the same multi-model validation. Found 15+ commercial platforms already in this space. More importantly, found structural problems nobody had solved — validation paradoxes (how do you verify a simulated personality is accurate?), sycophancy bias in LLM-generated personas, demographic homogeneity in training data.

Same pattern. Surface-level appeal masking structural flaws.

Killed that too.

The methodology that saved me months

Here's the practical takeaway.

Running the same fundamental question through four different LLMs in parallel cost me maybe $3 in API calls & a few hours per idea. That's it. And it surfaced fatal flaws before I wrote production code.

The trick isn't just using multiple models — it's what you look for in the results.

When all four models converge on the same finding with the same source literature, that's high-confidence signal. When they diverge, the divergence itself tells you something — either the question has genuine nuance or the evidence base is weak.

I'm not asking models "is my idea good?" That's fishing for validation. I'm asking the specific factual question that my entire product depends on being true. What percentage of variance does X explain? How many competitors already exist in Y? What are the known structural problems with Z?

If the answer kills the idea, the answer kills the idea. Better to find out for $3 than for $30,000.

What actually happened next

During both research sprints, I kept getting pulled toward agent architecture questions. How would I build a continuous monitoring system? What does "genuinely autonomous" mean beyond cron-based heartbeats? How do feedback loops self-correct?

I was forcing myself to stay focused on the "smarter business case." But the agent questions were what kept me working past midnight.

The two kills gave me permission to build what actually excited me. And it turned into the most productive stretch of building I've ever had — Thompson Sampling, event-driven architecture, an autonomous agent running on a Mac Mini. None of it would exist if I'd spent two more months trying to make cultural intelligence work.

The pattern I'm watching for

Two product ideas in a row where I was drawn to "quantifying human behavior for business use" & both times the signal quality turned out to be garbage. That's not coincidence — that's a bias.

Something about these ideas appeals to me on a gut level that overrides my analytical judgment. The frameworks look rigorous. The market story writes itself. But underneath, the core variable just doesn't carry enough weight.

So I've got a new rule: when I'm evaluating a product idea that involves predicting human behavior, I apply extra skepticism to the core signal assumption. The appeal of the idea is itself a warning sign.

Two kills. Eight days. And the thing I was trying to ignore became the thing worth building.

Building in public means documenting the kills. Not just the wins.


This is part of an ongoing series documenting what I'm building, what I'm learning, & what I'm killing along the way. Follow along on LinkedIn or subscribe here for the detailed breakdowns.

Thompson Sampling for Retrieval Weight Learning: A Practitioner's Guide
1,320 words Ready

If you've built a RAG system, you've faced the weight problem. You combine relevance scores from your vector search, recency from your usage table, & importance from some heuristic — and you weight them equally. Why 0.33/0.33/0.33? Because you had to pick something.

Park et al. (2023) did the same thing in their seminal generative-agents paper, which scored memories by relevance, recency, & importance. Equal weights across the board. It's a reasonable default, but it's also a guess. The dimensions aren't equally valuable: relevance actually discriminates between skills, while recency & importance saturate quickly for frequently-used skills. What you weight matters, & the right weights depend on your skill library — which means they should be learned, not guessed.

This post shows how I used Thompson Sampling to learn retrieval weights for a RAG system. The bandit converges within 50 tasks. Warm-start transfer avoids the cold-start cost entirely. And the key insight that took me three experimental iterations to understand: library design is upstream of bandit design — if your skill library is too generic, the bandit can't learn anything useful because every weight configuration retrieves the same skills.

The Problem: Fixed Weights Don't Generalize

A RAG system retrieves from some knowledge base — in my case, a skill library for an AI assistant. The retriever produces scores along multiple dimensions:

  • Relevance: semantic similarity between the query & the skill description
  • Recency: how recently the skill was used (Laplace-smoothed)
  • Importance: historical success rate of the skill

The final score is a weighted sum: w_r * relevance + w_n * recency + w_i * importance. The weights w are what I'm trying to learn.
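A minimal sketch of that scoring (names are illustrative, not the pipeline's actual API; each dimension's score is assumed normalized to [0, 1]):

def final_score(relevance, recency, importance, weights):
    # weights = (w_r, w_n, w_i), summing to 1.0
    w_r, w_n, w_i = weights
    return w_r * relevance + w_n * recency + w_i * importance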

With fixed equal weights (0.33 each), I was getting 90% Jaccard overlap across bandit conditions — six meta-skills dominated 86% of all retrievals regardless of weight configuration. The bandit was learning something, but the retrieval layer couldn't express it because the underlying library was too generic.

That's the first hard lesson: you can't optimize your way past physics. If your skill library has no diversity, different weight configurations produce identical retrieval sets. The bandit converges to something, but it's converging on a flat landscape.
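For reference, the Jaccard overlap metric used above is just intersection over union of the retrieved skill sets:

def jaccard(set_a, set_b):
    # 1.0 means identical retrieval sets; 0.0 means fully disjoint
    return len(set_a & set_b) / len(set_a | set_b)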

Thompson Sampling Basics

Thompson Sampling is a bandit algorithm that balances exploration & exploitation naturally. You have K "arms" (weight configurations). For each arm, you maintain a Beta distribution representing your belief about that arm's reward. At each step:

  1. Sample one value from each arm's Beta distribution
  2. Select the arm with the highest sample
  3. Observe the reward
  4. Update that arm's Beta distribution

The Beta distribution is conjugate to the Bernoulli reward signal, so updates are trivial:

alpha_new = alpha_old + reward
beta_new = beta_old + (1 - reward)

Over time, arms with consistently high rewards pull their Beta distribution toward 1.0; arms that underperform drift toward 0. The sampling mechanism provides natural exploration — occasionally you'll sample a high value from an arm with low mean, which drives exploration of underweighted configurations.

Why Thompson Sampling over UCB?

Upper Confidence Bound (UCB) is the other common approach. It selects arms by adding an exploration bonus to the mean reward: UCB = mean + sqrt(2*ln(t) / N). Theoretically elegant but empirically inferior for this use case:

  1. Fixed exploration schedule: UCB's bonus decays as 1/sqrt(N), so it effectively stops exploring after roughly a thousand observations, whether or not it has found a good arm. TS keeps exploring in proportion to its remaining uncertainty.
  2. Adaptation to reward variance: under TS, consistently rewarded arms get sampled more often because their Beta posteriors tighten around a high mean, while uncertain arms still get occasional draws. UCB applies the same bonus schedule to every arm regardless of observed variance.

The experiments confirmed this. C3 (Thompson Sampling with embedded explanations) showed 39% cumulative regret reduction in the second half of training, while C2 (Likert-only feedback) plateaued.
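For contrast, a minimal UCB1 selection sketch (illustrative only; the experiments used the formula above, not this exact code):

import math

def select_arm_ucb(means, counts, t):
    # means[i]: empirical mean reward of arm i; counts[i]: times pulled; t: total steps
    scores = [m + math.sqrt(2 * math.log(t) / n) if n > 0 else float("inf")
              for m, n in zip(means, counts)]
    return scores.index(max(scores))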

The Reward Signal: Rate Inputs, Not Outputs

Second hard lesson. I initially asked the LLM to rate its own outputs — how good was the generated response? Pan et al. (ICML 2024) showed this is systematically inflated. LLMs are overly generous self-assessors; they can't distinguish between a good response & a response they just spent tokens generating.

The fix: rate the inputs, not the outputs. Ask the LLM to rate the retrieved skills along each dimension. Not "how good is your answer?" but "how relevant is this skill to the query? how important is this skill historically?"

This produces a cleaner reward signal. The bandit learns from actual retrieval quality, not from the LLM's tendency to be nice to itself.
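A sketch of how per-dimension input ratings could collapse into the bandit's Bernoulli reward (the rubric and threshold here are assumptions, not the exact implementation):

def reward_from_ratings(ratings, threshold=4.0):
    # ratings: 1-5 Likert scores the judge assigns to the retrieved skills
    mean_rating = sum(ratings) / len(ratings)
    return 1 if mean_rating >= threshold else 0  # feeds the Beta update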

There's a third-order effect I didn't anticipate: better retrieval produces shorter outputs. C3's outputs were 73-77% shorter than C1's (control). The mechanism isn't retry prevention (only 1/1200 multi-step episodes) — it's that better-targeted prompts produce more focused responses. The token savings are downstream of retrieval quality, not a separate phenomenon.

Weight Presets as Arms

I framed the problem as discrete arm selection over 12 weight configurations:

  • pure_relevance: [1.0, 0.0, 0.0]
  • pure_recency: [0.0, 1.0, 0.0]
  • balanced: [0.33, 0.33, 0.34]
  • relevance_heavy: [0.7, 0.15, 0.15]
  • ... and 8 more combinations

Discrete choices are easier to reason about than continuous weight spaces, & they map directly to operational presets in the retrieval pipeline.
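In code, the arm table is just a named preset map (only the four presets listed above are shown here):

WEIGHT_PRESETS = {
    "pure_relevance":  [1.0, 0.0, 0.0],
    "pure_recency":    [0.0, 1.0, 0.0],
    "balanced":        [0.33, 0.33, 0.34],
    "relevance_heavy": [0.7, 0.15, 0.15],
    # ... plus the remaining 8 combinations
}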

Results: Convergence & Retrieval Quality

The bandit converges quickly. In the v3 re-run with 205 domain-specific skills:

  • Convergence time: C3's changepoint occurs at task 33-48. The bandit has figured out the optimal arm within ~50 tasks.
  • Convergence target: C2 & C3 both converge to pure_relevance (0.749 posterior mean). This wasn't predetermined — the bandit discovered that relevance is the only discriminating dimension for this library.
  • Retrieval quality: C3 shows +41% NDCG@5 & +36% MRR versus C1 (control with equal weights).

One counter-intuitive finding: better retrieval correlates with a lower mean final_score. Weighting relevance heavily (the one dimension with genuine variance) drags the mean down, because raw relevance scores sit well below 1.0. Weighting recency or importance heavily (dimensions that saturate near 1.0) inflates it. Don't use mean final_score as a quality metric.
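For reference, the NDCG@k reported above compares the retrieved ranking against the ideal one. A minimal sketch, with hypothetical relevance grades:

import math

def ndcg_at_k(grades, k=5):
    # grades: graded relevance of retrieved skills, in retrieval order
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = sum(g / math.log2(i + 2) for i, g in enumerate(sorted(grades, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0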

Key Insight: Library Design is Upstream

This is the finding that changed how I think about retrieval systems:

  • v2 (50 generic skills): 90% Jaccard overlap across conditions. Six meta-skills dominated 86% of retrievals.
  • v3 (205 domain-specific skills): 3% Jaccard overlap. The bandit could finally express its learned preferences.

The skill library is not just the knowledge source — it's the hypothesis space for weight learning. Library design is upstream of bandit design.

Practical Deployment

For cold-start, expect ~50 API calls before the bandit stabilizes. After that, the system is exploiting more than exploring.

Warm-start transfer eliminates this cost entirely. I ran C6 (warm-start from v3 priors) & got -7.3% tokens versus the v3 C3 baseline, with 0 re-learning episodes. The prior transfers across task sets within the same domain.

import numpy as np

def update_arm(alpha, beta, reward):
    # Conjugate Beta-Bernoulli update; reward is 0 or 1
    return alpha + reward, beta + (1 - reward)

def select_arm(arms):
    # arms: list of (alpha, beta) pairs, one per weight preset
    samples = [np.random.beta(a, b) for a, b in arms]
    return int(np.argmax(samples))

That's it. Conjugate updates, sample-then-act, natural exploration from variance.
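A hypothetical episode loop tying it together (tasks, run_episode, & saved_posteriors are stand-ins for the real pipeline; WEIGHT_PRESETS is the preset map sketched earlier, expanded to all 12 arms):

preset_names = list(WEIGHT_PRESETS)
arms = saved_posteriors or [(1.0, 1.0)] * len(preset_names)  # warm-start if priors exist
for task in tasks:
    i = select_arm(arms)                        # sample-then-act
    weights = WEIGHT_PRESETS[preset_names[i]]
    reward = run_episode(task, weights)         # 0/1 from the input-rating judge
    arms[i] = update_arm(*arms[i], reward)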

What I'd Do Differently

  1. Adaptive rubric injection: Easy tasks showed +28% worse performance with feedback due to overhead. Only inject the feedback rubric for medium/hard tasks.
  2. BCa bootstrap for confidence intervals: The C1 token distribution is bimodal. Percentile bootstrap is biased. BCa corrects for skewness.
  3. Monitor practitioner communities, not just papers: The agentic search shift was visible in practitioner communities 6-12 months before formal papers. Papers are lagging indicators.

The weight learning works. It converges fast. But it's only meaningful when your library has the diversity to express different weight configurations. Build the library first; optimize the weights second.


Code is open source. Full experiment data (1,200+ episodes) available on GitHub.

Needs Work
Why "Don't Do X" Instructions Backfire
Needs Work

Strong analysis of Wegner's white bear effect applied to LLM prompting. Verify the "40 prompts across 3 models" claim before publishing — the rewritten LinkedIn version replaced this with "batch of negative instructions" which may be more accurate.

Can AI Agents Learn What to Remember?
Needs Work

Conflates the experiment (4 conditions, retrieval weight learning) with the old 5-agent system. The experiment was a controlled study, not an observation of multi-agent behavior. Needs reframing.

Kill
Operating a Multi-Agent System for 6 Months
Kill

Timeline is factually wrong. The multi-agent system ran for roughly 1 month, not 6. Core premise is false — cannot be salvaged without a complete rewrite that undermines the thesis.

What Happens When You Let 5 AI Agents Run for 6 Months
Kill

Same timeline fabrication. Additionally, the kill stories are redundant with the Cultural Intelligence post ("Two Kills in Eight Days"), which covers the same narrative better.