Twenty-six chapters. From chemotaxis in bacteria to the transformer architecture, from Hebbian learning to RLHF, from the limbic system to curiosity-driven RL. The question now is: what does all of this mean for what you're actually building, on the hardware you actually have, with the tools you can actually use?
This chapter maps the guide's content onto your specific situation: hardware constraints, research contribution, what's possible today, and where the interesting problems live.
Your compute environment:
| Component | Spec | What it enables |
|---|---|---|
| Mac Mini M4 | Apple Silicon, 10-core CPU, 10-core GPU | Unified memory architecture — GPU and CPU share memory, eliminating transfer overhead |
| 24GB unified RAM | Shared between CPU and GPU | Can run quantized models up to ~14B parameters comfortably. ~7B at higher precision. |
| 512GB SSD + 2TB external | Local storage | Plenty for model weights, databases, embeddings, and experiment data |
| Ollama + qwen2.5:14b | Local inference | Free, private, unlimited inference for tasks where 14B-parameter quality is sufficient |
| MiniMax M2.5 API | $50/mo flat, ~200 prompts/hr | Higher-quality inference for tasks requiring more capability, with generous rate limits |
| Claude Max | Subscription tier | Top-tier reasoning for complex tasks, code generation, and research |
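To see why 24GB supports models in that range, here's a rough back-of-envelope estimate. The 4.5 bits-per-weight figure is an assumption standing in for a typical 4-bit quantization format with scale overhead; exact numbers vary by scheme:

```python
# Rough memory estimate for a quantized LLM in 24GB of unified RAM.
# All figures are approximations; actual usage depends on the quantization
# format, context length, and runtime overhead.

params = 14e9            # 14B-parameter model
bits_per_weight = 4.5    # ~4-bit weights plus scales/zero-points (assumed)

weight_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights: {weight_gb:.1f} GB")  # ~7.9 GB

# The KV cache adds a few GB at moderate context lengths, and the OS and
# other processes need headroom. That is why ~14B is "comfortable" at
# 4-bit while ~7B is the practical ceiling at higher precision.
```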
Your paper describes per-agent Thompson Sampling over retrieval weight presets with LLM self-assessment as the reward signal. Here's where this sits in the research landscape, mapped to the concepts from this guide:
Your work sits at the intersection of three fields that don't normally talk to each other:

- Information retrieval: hybrid search, ranking, relevance scoring
- Machine learning: multi-armed bandits and online learning
- Agent engineering: multi-agent architectures and autonomous loops
The novel angle: online learning for agent-specific retrieval optimization. The field knows how to do retrieval (IR), knows how to do bandits (ML), and is learning how to build agents (engineering). Combining all three — using online bandit optimization to let individual agents learn their own retrieval preferences — is genuinely novel and addresses a real gap.
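To make the bandit concrete, here's a minimal sketch of per-agent Thompson Sampling over retrieval presets, assuming a Beta-Bernoulli posterior per preset. The preset names and weight values are illustrative placeholders, not the paper's actual configuration:

```python
import random

# Hypothetical retrieval weight presets (semantic vs. keyword mix).
PRESETS = {
    "semantic_heavy": {"semantic": 0.8, "keyword": 0.2},
    "balanced":       {"semantic": 0.5, "keyword": 0.5},
    "keyword_heavy":  {"semantic": 0.2, "keyword": 0.8},
}

class PresetBandit:
    """Beta-Bernoulli Thompson Sampling over retrieval presets.

    Strictly, this model expects binary rewards; updating with fractional
    rewards in [0, 1] (an LLM-scored quality signal) is a common heuristic.
    """

    def __init__(self, presets):
        # One Beta(alpha, beta) posterior per preset, starting uniform.
        self.posteriors = {name: [1.0, 1.0] for name in presets}

    def choose(self):
        # Sample a plausible success rate from each posterior and exploit
        # the best sample. Uncertain presets sometimes win the draw:
        # that is the exploration half of the trade-off.
        samples = {name: random.betavariate(a, b)
                   for name, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, name, reward):
        a, b = self.posteriors[name]
        self.posteriors[name] = [a + reward, b + (1.0 - reward)]

bandit = PresetBandit(PRESETS)
preset = bandit.choose()            # pick retrieval weights for this call
# ... run hybrid retrieval + generation with PRESETS[preset] ...
bandit.update(preset, reward=0.8)   # feed back the self-assessed quality
```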
| What you built | Guide chapter | Concept |
|---|---|---|
| Thompson Sampling bandit | Ch 21 (RL fundamentals) | Online learning from reward signals; exploration vs. exploitation |
| Hybrid search (semantic + keyword) | Ch 20 (Applied landscape) | RAG, retrieval, embedding similarity |
| Multi-agent architecture | Ch 20 (Applied landscape) | Agent frameworks, role-based specialization |
| LLM self-assessment reward | Ch 21 (RL), Ch 13 (RLHF) | Using LLM judgments as reward signals, analogous to reward models in RLHF; see the sketch after this table |
| Event-driven autonomous loop | Ch 20, Ch 21 | Agent-environment interaction loop |
| Evaluation pipeline (86 tests) | Ch 16 (Communities) | Rigorous evaluation methodology, reproducibility |
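For the self-assessment reward in the table above, a minimal sketch using the ollama Python package might look like the following. The rubric prompt, the 0-10 scale, and the fallback behavior are illustrative choices; check the response shape against the package version you have installed:

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

def self_assess(question: str, answer: str) -> float:
    """Score an answer with the local model; the result feeds the bandit."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer's quality from 0 to 10. Reply with only the number."
    )
    response = ollama.chat(
        model="qwen2.5:14b",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response["message"]["content"].strip()
    try:
        # Normalize the 0-10 score to a [0, 1] reward.
        return min(max(float(text) / 10.0, 0.0), 1.0)
    except ValueError:
        return 0.5  # unparseable reply: fall back to a neutral reward
```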
Given the foundation you've built and the concepts from this guide, here are four concrete research directions that are feasible on your hardware and address open problems:
Your agents currently use external memory (episodic_memory.db) — the scaffolding approach described in Chapter 26. A research contribution would be to explore whether lightweight continual learning techniques can be applied to the retrieval or reasoning components.
Concrete approach: use LoRA adapters that accumulate agent-specific knowledge. Instead of a single frozen model + external memory, train per-agent LoRA adapters on the agent's interaction history. The EWC techniques from Chapter 22 could be used to prevent forgetting across adapter update cycles. This is feasible on your hardware for ~7B models with MLX.
The research question: does a LoRA adapter trained on an agent's history produce better retrieval and reasoning for that agent than a frozen model with RAG alone? If so, you've demonstrated a form of continual learning that's practical and deployable.
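Here's a minimal sketch of the EWC penalty mentioned above, in PyTorch. The `fisher` and `old_params` dictionaries are assumed to come from the previous adapter cycle (a diagonal Fisher estimate from squared gradients, and a snapshot of the adapter weights); the regularization strength `lam` is illustrative:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Elastic Weight Consolidation penalty (Ch 22):

        L_total = L_task + (lam / 2) * sum_i F_i * (theta_i - theta_i*)^2

    Parameters important to earlier cycles (high Fisher information F_i)
    are penalized for drifting away from their old values theta_i*.
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (
                fisher[name] * (param - old_params[name]) ** 2
            ).sum()
    return 0.5 * lam * penalty

# In each LoRA update cycle, the training loss becomes:
#     loss = task_loss + ewc_penalty(adapter, fisher, old_params)
# so the adapter can learn from new interactions without overwriting
# what previous cycles encoded.
```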
Your agent system (Alice) runs autonomously via event-driven dispatch. Currently, the events are externally defined (timers, triggers). A research contribution would be to add curiosity-driven exploration: let the agent identify gaps in its memory or knowledge and self-assign learning tasks.
Concrete approach: implement a simple prediction-error-based curiosity signal. When the agent retrieves documents and generates a response, compute the prediction confidence (or use self-assessed quality as a proxy). Low confidence on a topic indicates a knowledge gap. The agent can then autonomously seek information on that topic during idle cycles.
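A minimal sketch of that curiosity signal, assuming self-assessed quality is the confidence proxy and topics are extracted somewhere upstream; the decay rate and gap threshold are illustrative:

```python
from collections import defaultdict

class CuriosityTracker:
    """Track per-topic confidence (self-assessed quality as a proxy for
    prediction error) and surface the weakest topics as learning targets."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.confidence = defaultdict(lambda: 0.5)  # neutral prior per topic

    def record(self, topic: str, quality: float):
        # Exponential moving average of self-assessed quality per topic.
        old = self.confidence[topic]
        self.confidence[topic] = self.decay * old + (1 - self.decay) * quality

    def knowledge_gaps(self, threshold=0.4):
        # Topics whose running confidence sits below the threshold are
        # knowledge gaps: candidates for self-assigned research during
        # idle cycles, weakest first.
        return sorted(
            (t for t, c in self.confidence.items() if c < threshold),
            key=lambda t: self.confidence[t],
        )
```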
This connects directly to the ICM and compression progress ideas from Chapter 23, but at the agent/retrieval level rather than the RL/pixel level. It's a form of intrinsic motivation implemented with scaffolding rather than architecture.
Chapter 14 covered LoRA as a parameter-efficient fine-tuning method. Your system generates a continuous stream of agent interactions with quality signals (the bandit rewards). This is a natural dataset for fine-tuning.
Concrete approach: collect high-reward agent interactions as training data. Fine-tune a small local model (qwen2.5:7b via MLX/LoRA) on this data. Evaluate whether the fine-tuned model produces better responses for your agents than the base model with the same prompts. Run this as a periodic pipeline — accumulate data, train adapter, deploy, evaluate, repeat.
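Here's a sketch of the data-collection step, assuming a hypothetical `interactions` table with `prompt`, `response`, and `reward` columns. The JSONL chat format shown is the kind of schema LoRA training tools such as mlx-lm consume, but verify the exact format against the tool's documentation:

```python
import json
import sqlite3

REWARD_THRESHOLD = 0.8  # illustrative cutoff for "high-reward" interactions

def export_training_data(db_path: str, out_path: str):
    """Export high-reward agent interactions as JSONL for LoRA fine-tuning."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT prompt, response FROM interactions WHERE reward >= ?",
        (REWARD_THRESHOLD,),
    )
    with open(out_path, "w") as f:
        for prompt, response in rows:
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record) + "\n")
    conn.close()
```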
This creates a feedback loop: agent performance generates training data, training data improves the model, improved model generates better performance. It's a small-scale version of the online learning loop that large labs use for RLHF, implemented locally.
Your current bandit uses Thompson Sampling with fixed presets. Two natural extensions:

- Contextual bandits: condition the preset choice on features of the query (length, topic, phrasing), so each agent learns per-query-type retrieval preferences instead of a single global one. A sketch of this variant follows below.
- Continuous action spaces: replace the discrete presets with direct optimization over the retrieval weights themselves, for example a continuum-armed bandit or Bayesian optimization over the weight mixture.
Both of these are publishable extensions that build on your existing system and address real problems in adaptive retrieval.
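A minimal sketch of the first extension: a contextual variant that keeps a separate Beta posterior per (query type, preset) pair. The query-type bucketing is deliberately crude and illustrative; a real system might use a classifier or query embeddings:

```python
import random
from collections import defaultdict

class ContextualPresetBandit:
    """Thompson Sampling with one Beta posterior per (context, preset)
    pair, so retrieval preferences are learned per query type."""

    def __init__(self, presets):
        self.presets = list(presets)
        # posteriors[(context, preset)] = [alpha, beta], starting uniform
        self.posteriors = defaultdict(lambda: [1.0, 1.0])

    def choose(self, context: str) -> str:
        samples = {
            p: random.betavariate(*self.posteriors[(context, p)])
            for p in self.presets
        }
        return max(samples, key=samples.get)

    def update(self, context: str, preset: str, reward: float):
        a, b = self.posteriors[(context, preset)]
        self.posteriors[(context, preset)] = [a + reward, b + (1.0 - reward)]

def query_type(query: str) -> str:
    # Crude illustrative bucketing by query length.
    return "keyword-like" if len(query.split()) <= 4 else "natural-language"
```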
Chapter 26 defined the gap. Here's the part of the gap where you can contribute: not by building a system that crosses the gap entirely, but by advancing specific pieces of it. The four research directions above are exactly those pieces.
This guide has covered twenty-six chapters, from the simplest form of biological intelligence to the most speculative frontiers of artificial intelligence. Here's what the full picture looks like:
The field of AI borrowed a small set of principles from biology — neurons as computational units, learning as weight adjustment, hierarchical feature extraction, gradient-based optimization — and pushed them as far as they would go. The result is transformer-based language models that can process, generate, and reason about text with extraordinary capability. Combined with agent frameworks, retrieval systems, and tool use, these models power systems of genuine practical value.
But the principles that were borrowed are a small subset of what biological intelligence uses. The brain has continual learning (hippocampus-cortex consolidation). It has intrinsic motivation (dopamine-driven curiosity, limbic reward circuits). It has a growing architecture (neurogenesis, synaptic plasticity, developmental growth). It has persistent identity (the continuous thread of experience that constitutes a self). Current AI has none of these. The gap is defined by their absence.
Your work lives in that gap. The bandit system is a form of online learning that adapts to experience. The agent architecture creates persistence through scaffolding. The autonomous dispatch loop creates behavior that resembles self-directed activity. None of this crosses the gap — the core model is still frozen, the architecture is still fixed — but it's working at the boundary between what exists and what doesn't yet exist. That's where the research opportunity is.
One final thought. You observed that "the constraint IS the feature" — that limited compute creates the pressure that drives curiosity, and unlimited compute would extinguish the drive. This insight applies to your own situation. You don't have a cluster of H100s. You have a Mac Mini. That constraint forces you to think carefully about what's worth computing, what's worth storing, what's worth learning. It forces efficiency, which forces understanding, which forces the kind of first-principles thinking that leads to genuine insight. The labs with unlimited compute can scale their way to better performance on benchmarks. The person with a Mac Mini has to understand why things work in order to make them work at all. That understanding is the foundation of research, and it's not something that scales with compute.
This is the end of the guide. It is not, obviously, the end of the subject. The field moves fast enough that some of the "frontier" material in Part VII may look different in a year. The fundamentals in Parts I through V will hold much longer — biology doesn't change, math doesn't change, and the core architectures of deep learning are stable even as the applications evolve.
The guide was built to be a foundation, not a ceiling. Everything in it is a starting point for deeper investigation. The citations are real and the papers are worth reading. The math is real and worth working through. The gap is real and worth closing.