Twenty-six chapters. From chemotaxis in bacteria to the transformer architecture, from Hebbian learning to RLHF, from the limbic system to curiosity-driven RL. The question now is: what does all of this mean for what you're actually building, on the hardware you actually have, with the tools you can actually use?
This chapter maps the guide's content onto your specific situation: hardware constraints, research contribution, what's possible today, and where the interesting problems live.
Your compute environment:
| Component | Spec | What it enables |
|---|---|---|
| Mac Mini M4 | Apple Silicon, 10-core CPU, 10-core GPU | Unified memory architecture — GPU and CPU share memory, eliminating transfer overhead |
| 24GB unified RAM | Shared between CPU and GPU | Can run quantized models up to ~14B parameters comfortably. ~7B at higher precision. |
| 512GB SSD + 2TB external | Local storage | Plenty for model weights, databases, embeddings, and experiment data |
| Ollama + qwen2.5:14b | Local inference | Free, private, unlimited inference for tasks where 14B-parameter quality is sufficient |
| MiniMax M2.5 API | $50/mo flat, ~200 prompts/hr | Higher-quality inference for tasks requiring more capability, with generous rate limits |
| Claude Max | Subscription tier | Top-tier reasoning for complex tasks, code generation, and research |
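To see why 24GB supports models in that range, here's a rough back-of-envelope estimate. The 4.5 bits-per-weight figure is an assumption standing in for a typical 4-bit quantization format with scale overhead; exact numbers vary by scheme:

```python
# Rough memory estimate for a quantized LLM in 24GB of unified RAM.
# All figures are approximations; actual usage depends on the quantization
# format, context length, and runtime overhead.

params = 14e9            # 14B-parameter model
bits_per_weight = 4.5    # ~4-bit weights plus scales/zero-points (assumed)

weight_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights: {weight_gb:.1f} GB")  # ~7.9 GB

# The KV cache adds a few GB at moderate context lengths, and the OS and
# other processes need headroom. That is why ~14B is "comfortable" at
# 4-bit while ~7B is the practical ceiling at higher precision.
```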
Your paper describes per-agent Thompson Sampling over retrieval weight presets with LLM self-assessment as the reward signal. Here's where this sits in the research landscape, mapped to the concepts from this guide:
Your work sits at the intersection of three fields that don't normally talk to each other:

- Information retrieval: hybrid search, ranking, relevance scoring
- Machine learning: multi-armed bandits and online learning
- Agent engineering: multi-agent architectures and autonomous loops
The novel angle: online learning for agent-specific retrieval optimization. The field knows how to do retrieval (IR), knows how to do bandits (ML), and is learning how to build agents (engineering). Combining all three — using online bandit optimization to let individual agents learn their own retrieval preferences — is genuinely novel and addresses a real gap.
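To make the bandit concrete, here's a minimal sketch of per-agent Thompson Sampling over retrieval presets, assuming a Beta-Bernoulli posterior per preset. The preset names and weight values are illustrative placeholders, not the paper's actual configuration:

```python
import random

# Hypothetical retrieval weight presets (semantic vs. keyword mix).
PRESETS = {
    "semantic_heavy": {"semantic": 0.8, "keyword": 0.2},
    "balanced":       {"semantic": 0.5, "keyword": 0.5},
    "keyword_heavy":  {"semantic": 0.2, "keyword": 0.8},
}

class PresetBandit:
    """Beta-Bernoulli Thompson Sampling over retrieval presets.

    Strictly, this model expects binary rewards; updating with fractional
    rewards in [0, 1] (an LLM-scored quality signal) is a common heuristic.
    """

    def __init__(self, presets):
        # One Beta(alpha, beta) posterior per preset, starting uniform.
        self.posteriors = {name: [1.0, 1.0] for name in presets}

    def choose(self):
        # Sample a plausible success rate from each posterior and exploit
        # the best sample. Uncertain presets sometimes win the draw:
        # that is the exploration half of the trade-off.
        samples = {name: random.betavariate(a, b)
                   for name, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, name, reward):
        a, b = self.posteriors[name]
        self.posteriors[name] = [a + reward, b + (1.0 - reward)]

bandit = PresetBandit(PRESETS)
preset = bandit.choose()            # pick retrieval weights for this call
# ... run hybrid retrieval + generation with PRESETS[preset] ...
bandit.update(preset, reward=0.8)   # feed back the self-assessed quality
```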
| What you built | Guide chapter | Concept |
|---|---|---|
| Thompson Sampling bandit | Ch 21 (RL fundamentals) | Online learning from reward signals; exploration vs. exploitation |
| Hybrid search (semantic + keyword) | Ch 20 (Applied landscape) | RAG, retrieval, embedding similarity |
| Multi-agent architecture | Ch 20 (Applied landscape) | Agent frameworks, role-based specialization |
| LLM self-assessment reward | Ch 21 (RL), Ch 13 (RLHF) | Using LLM judgments as reward signals, analogous to reward models in RLHF; see the sketch after this table |
| Event-driven autonomous loop | Ch 20, Ch 21 | Agent-environment interaction loop |
| Evaluation pipeline (86 tests) | Ch 16 (Communities) | Rigorous evaluation methodology, reproducibility |
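For the self-assessment reward in the table above, a minimal sketch using the ollama Python package might look like the following. The rubric prompt, the 0-10 scale, and the fallback behavior are illustrative choices; check the response shape against the package version you have installed:

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

def self_assess(question: str, answer: str) -> float:
    """Score an answer with the local model; the result feeds the bandit."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer's quality from 0 to 10. Reply with only the number."
    )
    response = ollama.chat(
        model="qwen2.5:14b",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response["message"]["content"].strip()
    try:
        # Normalize the 0-10 score to a [0, 1] reward.
        return min(max(float(text) / 10.0, 0.0), 1.0)
    except ValueError:
        return 0.5  # unparseable reply: fall back to a neutral reward
```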
Given the foundation you've built and the concepts from this guide, here are four concrete research directions that are feasible on your hardware and address open problems:
Your agents currently use external memory (episodic_memory.db) — the scaffolding approach described in Chapter 26. A research contribution would be to explore whether lightweight continual learning techniques can be applied to the retrieval or reasoning components.
Concrete approach: use LoRA adapters that accumulate agent-specific knowledge. Instead of a single frozen model + external memory, train per-agent LoRA adapters on the agent's interaction history. The EWC techniques from Chapter 22 could be used to prevent forgetting across adapter update cycles. This is feasible on your hardware for ~7B models with MLX.
The research question: does a LoRA adapter trained on an agent's history produce better retrieval and reasoning for that agent than a frozen model with RAG alone? If so, you've demonstrated a form of continual learning that's practical and deployable.
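Here's a minimal sketch of the EWC penalty mentioned above, in PyTorch. The `fisher` and `old_params` dictionaries are assumed to come from the previous adapter cycle (a diagonal Fisher estimate from squared gradients, and a snapshot of the adapter weights); the regularization strength `lam` is illustrative:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Elastic Weight Consolidation penalty (Ch 22):

        L_total = L_task + (lam / 2) * sum_i F_i * (theta_i - theta_i*)^2

    Parameters important to earlier cycles (high Fisher information F_i)
    are penalized for drifting away from their old values theta_i*.
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (
                fisher[name] * (param - old_params[name]) ** 2
            ).sum()
    return 0.5 * lam * penalty

# In each LoRA update cycle, the training loss becomes:
#     loss = task_loss + ewc_penalty(adapter, fisher, old_params)
# so the adapter can learn from new interactions without overwriting
# what previous cycles encoded.
```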
Your agent system (Alice) runs autonomously via event-driven dispatch. Currently, the events are externally defined (timers, triggers). A research contribution would be to add curiosity-driven exploration: let the agent identify gaps in its memory or knowledge and self-assign learning tasks.
Concrete approach: implement a simple prediction-error-based curiosity signal. When the agent retrieves documents and generates a response, compute the prediction confidence (or use self-assessed quality as a proxy). Low confidence on a topic indicates a knowledge gap. The agent can then autonomously seek information on that topic during idle cycles.
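A minimal sketch of that curiosity signal, assuming self-assessed quality is the confidence proxy and topics are extracted somewhere upstream; the decay rate and gap threshold are illustrative:

```python
from collections import defaultdict

class CuriosityTracker:
    """Track per-topic confidence (self-assessed quality as a proxy for
    prediction error) and surface the weakest topics as learning targets."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.confidence = defaultdict(lambda: 0.5)  # neutral prior per topic

    def record(self, topic: str, quality: float):
        # Exponential moving average of self-assessed quality per topic.
        old = self.confidence[topic]
        self.confidence[topic] = self.decay * old + (1 - self.decay) * quality

    def knowledge_gaps(self, threshold=0.4):
        # Topics whose running confidence sits below the threshold are
        # knowledge gaps: candidates for self-assigned research during
        # idle cycles, weakest first.
        return sorted(
            (t for t, c in self.confidence.items() if c < threshold),
            key=lambda t: self.confidence[t],
        )
```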
This connects directly to the ICM and compression progress ideas from Chapter 23, but at the agent/retrieval level rather than the RL/pixel level. It's a form of intrinsic motivation implemented with scaffolding rather than architecture.
Chapter 14 covered LoRA as a parameter-efficient fine-tuning method. Your system generates a continuous stream of agent interactions with quality signals (the bandit rewards). This is a natural dataset for fine-tuning.
Concrete approach: collect high-reward agent interactions as training data. Fine-tune a small local model (qwen2.5:7b via MLX/LoRA) on this data. Evaluate whether the fine-tuned model produces better responses for your agents than the base model with the same prompts. Run this as a periodic pipeline — accumulate data, train adapter, deploy, evaluate, repeat.
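Here's a sketch of the data-collection step, assuming a hypothetical `interactions` table with `prompt`, `response`, and `reward` columns. The JSONL chat format shown is the kind of schema LoRA training tools such as mlx-lm consume, but verify the exact format against the tool's documentation:

```python
import json
import sqlite3

REWARD_THRESHOLD = 0.8  # illustrative cutoff for "high-reward" interactions

def export_training_data(db_path: str, out_path: str):
    """Export high-reward agent interactions as JSONL for LoRA fine-tuning."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT prompt, response FROM interactions WHERE reward >= ?",
        (REWARD_THRESHOLD,),
    )
    with open(out_path, "w") as f:
        for prompt, response in rows:
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record) + "\n")
    conn.close()
```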
This creates a feedback loop: agent performance generates training data, training data improves the model, improved model generates better performance. It's a small-scale version of the online learning loop that large labs use for RLHF, implemented locally.
Your current bandit uses Thompson Sampling with fixed presets. Two natural extensions:

- Contextual bandits: condition the preset choice on features of the query (length, topic, phrasing), so each agent learns per-query-type retrieval preferences instead of a single global one. A sketch of this variant follows below.
- Continuous action spaces: replace the discrete presets with direct optimization over the retrieval weights themselves, for example a continuum-armed bandit or Bayesian optimization over the weight mixture.
Both of these are publishable extensions that build on your existing system and address real problems in adaptive retrieval.
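A minimal sketch of the first extension: a contextual variant that keeps a separate Beta posterior per (query type, preset) pair. The query-type bucketing is deliberately crude and illustrative; a real system might use a classifier or query embeddings:

```python
import random
from collections import defaultdict

class ContextualPresetBandit:
    """Thompson Sampling with one Beta posterior per (context, preset)
    pair, so retrieval preferences are learned per query type."""

    def __init__(self, presets):
        self.presets = list(presets)
        # posteriors[(context, preset)] = [alpha, beta], starting uniform
        self.posteriors = defaultdict(lambda: [1.0, 1.0])

    def choose(self, context: str) -> str:
        samples = {
            p: random.betavariate(*self.posteriors[(context, p)])
            for p in self.presets
        }
        return max(samples, key=samples.get)

    def update(self, context: str, preset: str, reward: float):
        a, b = self.posteriors[(context, preset)]
        self.posteriors[(context, preset)] = [a + reward, b + (1.0 - reward)]

def query_type(query: str) -> str:
    # Crude illustrative bucketing by query length.
    return "keyword-like" if len(query.split()) <= 4 else "natural-language"
```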
Chapter 26 defined the gap. Here's the part of the gap where you can contribute: not by building a system that crosses the gap entirely, but by advancing specific pieces of it. The four research directions above are exactly those pieces.
This guide has covered twenty-six chapters, from the simplest form of biological intelligence to the most speculative frontiers of artificial intelligence. Here's what the full picture looks like:
The field of AI borrowed a small set of principles from biology — neurons as computational units, learning as weight adjustment, hierarchical feature extraction, gradient-based optimization — and pushed them as far as they would go. The result is transformer-based language models that can process, generate, and reason about text with extraordinary capability. Combined with agent frameworks, retrieval systems, and tool use, these models power systems of genuine practical value.
But the principles that were borrowed are a small subset of what biological intelligence uses. The brain has continual learning (hippocampus-cortex consolidation). It has intrinsic motivation (dopamine-driven curiosity, limbic reward circuits). It has a growing architecture (neurogenesis, synaptic plasticity, developmental growth). It has persistent identity (the continuous thread of experience that constitutes a self). Current AI has none of these. The gap is defined by their absence.
Your work lives in that gap. The bandit system is a form of online learning that adapts to experience. The agent architecture creates persistence through scaffolding. The autonomous dispatch loop creates behavior that resembles self-directed activity. None of this crosses the gap — the core model is still frozen, the architecture is still fixed — but it's working at the boundary between what exists and what doesn't yet exist. That's where the research opportunity is.
One final thought. You observed that "the constraint IS the feature" — that limited compute creates the pressure that drives curiosity, and unlimited compute would extinguish the drive. This insight applies to your own situation. You don't have a cluster of H100s. You have a Mac Mini. That constraint forces you to think carefully about what's worth computing, what's worth storing, what's worth learning. It forces efficiency, which forces understanding, which forces the kind of first-principles thinking that leads to genuine insight. The labs with unlimited compute can scale their way to better performance on benchmarks. The person with a Mac Mini has to understand why things work in order to make them work at all. That understanding is the foundation of research, and it's not something that scales with compute.
This is the end of the guide. It is not, obviously, the end of the subject. The field moves fast enough that some of the "frontier" material in Part VII may look different in a year. The fundamentals in Parts I through V will hold much longer — biology doesn't change, math doesn't change, and the core architectures of deep learning are stable even as the applications evolve.
The guide was built to be a foundation, not a ceiling. Everything in it is a starting point for deeper investigation. The citations are real and the papers are worth reading. The math is real and worth working through. The gap is real and worth closing.