Part VII — The Frontier

Beyond LLMs

Reinforcement learning, neuroevolution, world models, and the paradigms that don't start with text.

The previous twenty chapters have built a picture that converges, in one way or another, on the transformer architecture and its descendants. But the transformer is a particular answer to a particular question: how do you model sequential data with long-range dependencies? It says nothing about how to learn from reward signals, how to evolve architectures, how to build internal models of physics, or how to plan in uncertain environments. For those problems, you need different paradigms entirely.

This chapter covers the major alternatives and complements to language modeling. Some of these — particularly reinforcement learning — are older than deep learning itself. Others, like JEPA, are active proposals for what might come after the current paradigm. What unites them is a fundamentally different relationship between the system and its data: instead of learning to predict the next token in a static dataset, these systems learn by acting in an environment and observing the consequences.

Reinforcement Learning: The Fundamentals

Reinforcement learning (RL) is the branch of machine learning concerned with how an agent should take actions in an environment to maximize cumulative reward. Unlike supervised learning, where you hand the model input-output pairs and say "learn this mapping," RL gives the agent a goal signal and lets it figure out how to achieve it through trial and error.

The formal framework is the Markov Decision Process (MDP), which has five components:

  1. A set of states S: the possible situations the agent can find itself in
  2. A set of actions A: what the agent can do in each state
  3. A transition function P(s' | s, a): the probability of landing in state s' after taking action a in state s
  4. A reward function R(s, a): the scalar feedback the agent receives for taking action a in state s
  5. A discount factor γ between 0 and 1: how much future rewards count relative to immediate ones

The "Markov" part means that the future depends only on the current state, not the full history. This is obviously a simplification — in the real world, history matters — but it makes the math tractable. The agent's goal is to learn a policy π(s) — a mapping from states to actions — that maximizes the expected sum of discounted future rewards.
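The objective the policy maximizes can be made concrete in a few lines. A minimal sketch of the discounted return (the reward values and γ here are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma per step: r0 + gamma*r1 + gamma^2*r2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Later rewards count for less. With gamma=0.9, three equal rewards of 1.0
# are worth 1 + 0.9 + 0.81:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 2.71
```

A discount factor close to 1 makes the agent far-sighted; a factor close to 0 makes it greedy for immediate reward.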

The agent-environment loop: the policy π(s) selects an action aₜ; the environment responds with the next state sₜ₊₁ and a reward rₜ. Act, observe, learn, repeat.

Two quantities are central to RL algorithms:

  1. The state-value function V(s): the expected discounted return starting from state s and following the current policy
  2. The action-value function Q(s, a): the expected discounted return from taking action a in state s and following the policy thereafter

If you know Q(s, a) for every state-action pair, the optimal policy is trivial: just pick the action with the highest Q-value in each state. The entire challenge of RL is estimating these values — or learning a good policy directly — from experience.

Q-Learning

Q-learning, introduced by Watkins in 1989, is the foundational value-based RL algorithm.1 The idea is to maintain a table of Q-values — one entry for each (state, action) pair — and update them as the agent interacts with the environment.

The update rule:

Q-learning update: Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

In natural language: after taking action a in state s, receiving reward r, and landing in state s', the agent adjusts its Q-value toward the sum of the immediate reward plus the best Q-value it can get from the next state. The learning rate α controls how much to adjust. The term in brackets — r + γ max Q(s', a') − Q(s, a) — is the temporal difference error: the gap between what the agent expected and what it actually got. This is the same prediction-error signal that dopamine neurons implement in the brain (Chapter 2).

Q-learning is off-policy: it learns about the optimal policy regardless of what policy the agent is actually following. This means the agent can explore randomly while still learning the optimal behavior. The catch is that Q-learning requires a table entry for every state-action pair, which rules it out for environments with continuous or astronomically large state spaces.
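The update rule and an ε-greedy exploration strategy fit in a few lines. A tabular sketch (the state and action encodings are illustrative, and a plain dict stands in for the Q-table):

```python
import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)  # temporal difference error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

Because the update bootstraps from max over next actions rather than from the action the agent actually takes next, the exploration policy and the learned policy can differ, which is exactly the off-policy property described above.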

Policy Gradient Methods

Instead of learning value functions and deriving a policy from them, policy gradient methods learn the policy directly. The policy is parameterized — typically as a neural network — and its parameters are adjusted to increase the expected reward.

The core idea: if an action led to high reward, increase the probability of taking that action in similar states. If it led to low reward, decrease it. The math formalizes this as gradient ascent on the expected reward. The key result is the policy gradient theorem (Sutton et al., 2000), which shows that the gradient of the expected reward with respect to the policy parameters can be estimated from sampled trajectories.2

The simplest version, REINFORCE (Williams, 1992), has a well-known problem: high variance. Because reward signals are noisy and delayed, the gradient estimates bounce around, making learning slow and unstable. This led to the development of actor-critic methods.
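The REINFORCE idea can be sketched for a softmax policy over discrete actions (the logit table `theta` and the pre-computed returns are illustrative; a real implementation would use autograd rather than hand-derived gradients):

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, episode, lr=0.01, baseline=0.0):
    """REINFORCE: push up the log-probability of each action in proportion
    to the return that followed it.

    theta: dict mapping state -> list of action logits.
    episode: list of (state, action_index, return) tuples.
    Subtracting a baseline from the return reduces variance without adding bias.
    """
    for s, a, G in episode:
        probs = softmax(theta[s])
        for i in range(len(theta[s])):
            # Gradient of log pi(a|s) with respect to logit i under softmax:
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[s][i] += lr * (G - baseline) * grad_log
    return theta
```

The `(G - baseline)` factor is the source of the variance problem: G is a single noisy sample of the return, so the update direction fluctuates from episode to episode.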

Actor-Critic Methods

Actor-critic architectures combine both approaches: a policy network (the actor) that decides what to do, and a value network (the critic) that evaluates how good the current state is. The critic reduces variance by providing a baseline — instead of asking "was this action good?", it asks "was this action better than expected?"

The actor-critic architecture: the actor (policy π(a|s; θ)) decides what to do; the critic (value V(s; w)) evaluates how good the current state is. The environment supplies states and rewards; the critic feeds an advantage signal back to the actor.

The advantage — A(s, a) = Q(s, a) − V(s) — measures how much better an action was compared to the average. This is what the critic provides, and it's what the actor uses to update. Actions with positive advantage get reinforced; actions with negative advantage get suppressed. This two-network structure is the basis for most modern RL algorithms.
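A common one-step estimate of the advantage, and the matching critic update, can be sketched as follows (tabular values stand in for the actor and critic networks; state encodings are illustrative):

```python
def td_advantage(V, s, r, s_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma*V(s') - V(s).

    Answers "was this step better than expected?" rather than "was it good?"
    """
    bootstrap = 0.0 if done else gamma * V.get(s_next, 0.0)
    return r + bootstrap - V.get(s, 0.0)

def critic_update(V, s, advantage, lr=0.05):
    """Move V(s) toward the observed target; the same error that scales the
    actor's gradient also trains the critic."""
    V[s] = V.get(s, 0.0) + lr * advantage
```

The actor's update then looks like REINFORCE with the advantage in place of the raw return, which is what cuts the variance.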

Deep RL: Neural Networks Meet Decision-Making

The combination of deep neural networks with RL was transformative. Two landmark results define the era.

DQN: Atari from Pixels

In 2015, Mnih et al. at DeepMind published a paper showing that a single neural network could learn to play 49 different Atari games from raw pixel input, achieving human-level performance on many of them.3 The architecture, called Deep Q-Network (DQN), replaced the Q-table with a convolutional neural network that takes screen pixels as input and outputs Q-values for each possible action.

Two innovations made this work:

  1. Experience replay: transitions are stored in a buffer and sampled at random during training, breaking the strong correlations between consecutive frames
  2. A target network: a periodically synced copy of the Q-network supplies the training targets, so the network is not chasing its own constantly shifting estimates

DQN was the proof of concept that deep learning could handle the perceptual complexity of RL environments. It took raw pixels — no hand-crafted features, no game-specific knowledge — and learned to play. But Atari games are relatively simple environments with discrete actions. The next breakthrough went further.

AlphaGo: Defeating Human Intuition

In March 2016, DeepMind's AlphaGo defeated Lee Sedol, one of the strongest Go players in history, 4 games to 1.4 Go has a state space of approximately 10^170 possible board positions — vastly more than atoms in the observable universe — making brute-force search impossible.

AlphaGo combined three ideas:

  1. A policy network, first trained by supervised learning on human expert games, to propose promising moves
  2. A value network, trained through self-play reinforcement learning, to evaluate board positions
  3. Monte Carlo tree search, guided by both networks, to look ahead selectively rather than exhaustively

The follow-up, AlphaGo Zero (Silver et al., 2017), eliminated the supervised learning phase entirely — it learned solely from self-play, starting from random play, and surpassed all previous versions.5 AlphaZero (2018) generalized this to chess and shogi as well, learning all three games from scratch with the same algorithm.

The lesson from AlphaGo Zero is significant: given a well-defined environment with clear rules and outcomes, RL plus neural networks can discover strategies that exceed human expert play without any human knowledge. The catch is the "well-defined environment" part. The real world doesn't come with a rules engine and a clear win condition.

PPO: The Workhorse of Modern RL

Proximal Policy Optimization (PPO), introduced by Schulman et al. in 2017, is the algorithm that made RL practical for a much wider range of problems.6 It's an actor-critic method that solves a specific instability problem: when policy gradient updates are too large, performance can collapse catastrophically.

PPO constrains how much the policy can change in a single update by clipping the objective function. In simplified terms: it lets the policy improve, but prevents it from changing so much that it moves into untested territory and breaks. The details involve a "clipped surrogate objective" that caps the policy ratio between updates, but the intuition is straightforward: take cautious steps.
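The clipped surrogate objective itself is only a few lines. A per-sample sketch (epsilon = 0.2 is the value suggested in the paper; the ratio and advantage inputs are illustrative):

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate, per sample: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive to push
    the ratio outside [1-eps, 1+eps], so one update cannot move the policy far.
    """
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# With positive advantage, gains are capped once the ratio exceeds 1+eps:
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2, not 1.5
# With negative advantage, the min keeps the worse (unclipped) value,
# so the penalty for a bad move is never softened by clipping:
print(ppo_clip_objective(1.5, advantage=-1.0))  # -1.5
```

The outer `min` is the subtle part: it clips optimistically but never pessimistically, which is what makes the objective a pessimistic (lower) bound on policy improvement.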

PPO matters for the LLM story because it's the algorithm used in RLHF (Reinforcement Learning from Human Feedback) — the process that turns a pre-trained language model into one that follows instructions, avoids harmful outputs, and generally behaves as humans prefer. In RLHF, the "environment" is the sequence of tokens generated so far, the "action" is the next token, and the "reward" comes from a reward model trained on human preferences. PPO optimizes the language model's policy to maximize this reward while staying close to the original pre-trained model. This is how ChatGPT, Claude, and similar systems are shaped after pre-training.

Key idea: RL isn't just an alternative to supervised learning — it's the mechanism used to align LLMs with human preferences. RLHF, the technique that makes language models helpful and safe, is PPO applied to text generation. The bandit system Alfonso built for retrieval weight optimization is a simplified version of the same principle: learning a policy from reward signals.

World Models: Learning to Imagine

Standard RL learns a policy by interacting with the real environment, which can be slow and expensive — especially when interactions involve physical robots or other costly systems. World models take a different approach: learn a model of the environment, then plan and train inside the model.

Ha and Schmidhuber's 2018 paper "World Models" demonstrated this approach on car racing and other tasks.7 The architecture has three components:

  1. Vision model (V) — a variational autoencoder that compresses raw observations into a compact latent representation
  2. Memory model (M) — a recurrent neural network that predicts future latent states given current states and actions
  3. Controller (C) — a small network that takes the current latent state and memory state and outputs an action
World model architecture (Ha & Schmidhuber, 2018): raw observations are compressed by V (a VAE) into a latent z; M (an RNN) predicts the next z given z and the action; C maps z and M's hidden state h to an action. Once M is trained, the agent can "dream": train C entirely inside the learned model.

The remarkable result: once the memory model is trained on real environment interactions, the controller can be trained entirely inside the learned model — the agent "dreams" its training data. Ha and Schmidhuber showed that agents trained entirely in the dream world could transfer to the real environment and perform well. This is dramatically more sample-efficient than standard RL.
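The dream-training idea reduces to evaluating the controller inside the learned dynamics instead of the real environment. A toy sketch (the hand-written `toy_dynamics` and `toy_controller` are stand-ins for the trained M and C components):

```python
def dream_rollout(dynamics, controller, z0, horizon=10):
    """Roll out the controller inside the learned model: no real environment calls.

    dynamics(z, a) -> (z_next, predicted_reward) stands in for the trained M;
    controller(z) -> a stands in for C.
    """
    z, total = z0, 0.0
    for _ in range(horizon):
        a = controller(z)
        z, r = dynamics(z, a)
        total += r
    return total

# Toy 1-D "world": the controller is scored on how fast it drives z toward 0,
# entirely inside the (here, hand-written) model.
toy_dynamics = lambda z, a: (z + a, -abs(z + a))
toy_controller = lambda z: -0.5 * z
score = dream_rollout(toy_dynamics, toy_controller, z0=1.0)
```

Any optimizer (including evolution, which Ha and Schmidhuber used for C) can then search for controller parameters that maximize this dreamed score, with real interactions spent only on training the model.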

The connection to biological dreaming is suggestive. There's evidence that mammals replay and consolidate experiences during sleep (Chapter 2), and Schmidhuber has explicitly drawn the parallel. Whether the analogy runs deep or is merely superficial is an open question.

JEPA: Predicting in Representation Space

Yann LeCun's Joint Embedding Predictive Architecture (JEPA), proposed in a 2022 position paper, represents a fundamentally different approach to learning from the world.8 LeCun argues that current approaches — both generative models (predicting pixels) and contrastive learning (pushing apart unrelated examples) — have deep limitations for building world models that support planning.

The core idea: instead of predicting raw observations (pixels, audio waveforms), predict in representation space. Two encoders map different views of the input into embeddings, and a predictor learns to predict one embedding from the other.

JEPA vs. generative prediction: a generative model maps input x directly to predicted pixels y, wastefully modeling irrelevant detail. JEPA encodes x and y separately and trains a predictor to match them in representation space, not pixel space.

Why does this matter? Consider predicting what you'll see if you turn your head 90 degrees. A generative model would need to predict every pixel — the exact pattern of light on every surface, shadows, reflections. Most of that detail is irrelevant for planning. A JEPA-style model would predict a high-level representation — "there will be a wall with a door" — abstracting away the irrelevant details.
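The objective can be sketched in a few lines (the encoder and predictor callables are stand-ins for trained networks; real implementations such as I-JEPA also use masking and a slowly updated target encoder to prevent collapse):

```python
def jepa_loss(enc_x, enc_y, predictor, x, y):
    """JEPA-style objective sketch: predict the target's embedding, not its pixels.

    enc_x, enc_y map two views of the input to embedding vectors; the predictor
    maps the context embedding toward the target embedding. The loss is squared
    error in latent space, so pixel-level detail never enters the objective.
    """
    z_context = enc_x(x)
    z_target = enc_y(y)          # in practice, gradients are stopped here
    z_pred = predictor(z_context)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target))
```

Because the target is an embedding rather than raw pixels, the encoders are free to discard detail that does not help prediction, which is the abstraction property the head-turning example describes.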

LeCun argues this is how biological perception works: the brain doesn't predict pixels, it predicts abstract features. And it's how any practical world model for planning needs to work — at the right level of abstraction, discarding details that don't matter for the decision at hand.

JEPA is still largely a research direction rather than a proven architecture. Meta AI has published I-JEPA (for images) and V-JEPA (for video) as concrete implementations, with promising but preliminary results. Whether this approach will deliver on the broader vision of world models for planning remains to be seen. LeCun's position paper explicitly frames it as a long-term research agenda, not a near-term product.

Neuroevolution: Evolving Intelligence

All the methods discussed so far optimize a fixed network architecture via gradient descent. Neuroevolution takes a radically different approach: use evolutionary algorithms to evolve both the architecture and weights of neural networks.

The basic process:

  1. Start with a population of networks (potentially random)
  2. Evaluate each network on the task (its "fitness")
  3. Select the best-performing networks
  4. Create offspring via mutation (randomly changing weights or structure) and crossover (combining parts of two parent networks)
  5. Repeat
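The loop above can be sketched directly (mutation-only over fixed-size weight vectors, so simpler than NEAT, with no crossover or topology changes; the elite fraction and mutation scale are illustrative):

```python
import random

def evolve(population, fitness, n_generations=50, mutation_scale=0.1, elite_frac=0.25):
    """Minimal neuroevolution loop over weight vectors.

    population: list of weight lists; fitness: weights -> score (higher is better).
    Each generation keeps the top elite_frac unchanged and refills the rest
    with mutated copies of randomly chosen elites.
    """
    n_elite = max(1, int(len(population) * elite_frac))
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[:n_elite]
        offspring = []
        while len(elite) + len(offspring) < len(population):
            parent = random.choice(elite)
            child = [w + random.gauss(0.0, mutation_scale) for w in parent]
            offspring.append(child)
        population = elite + offspring
    return max(population, key=fitness)
```

Because elites survive unchanged, the best fitness in the population never decreases, and no gradient of the fitness function is ever computed.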

The most influential neuroevolution algorithm is NEAT (NeuroEvolution of Augmenting Topologies), introduced by Stanley and Miikkulainen in 2002.9 NEAT starts with minimal networks — just inputs connected to outputs — and gradually adds neurons and connections through mutation. It tracks the evolutionary history of each gene to enable meaningful crossover between networks with different structures. This "complexification" approach mirrors biological evolution: start simple, add complexity only when it provides a fitness advantage.

Neuroevolution has some notable properties that gradient descent lacks:

  1. No differentiability requirement: fitness can be any score, including sparse or non-differentiable objectives
  2. Structural search: evolution can modify a network's topology, not just its weights
  3. Population diversity: maintaining many candidates at once helps escape local optima and deceptive reward landscapes
  4. Easy parallelism: every candidate's fitness can be evaluated independently

The downside: neuroevolution is generally much less sample-efficient than gradient-based methods. Evolution works well when evaluation is cheap (simulation) and the search space is manageable, but it can't compete with backpropagation for training large networks on large datasets. In practice, neuroevolution has found its niche in architecture search — evolving the structure of networks that are then trained with conventional gradient descent — and in RL environments where the reward signal is sparse or deceptive.

Comparing Paradigms

| Paradigm | Learns from | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Supervised (LLMs) | Static datasets of input-output pairs | Scales massively, powerful representations | No interaction, no goal-seeking, frozen after training |
| Reinforcement learning | Reward signals from environment interaction | Learns from consequences, can optimize complex goals | Sample-inefficient, reward shaping is hard |
| World models | Environment dynamics (learned model) | Sample-efficient, enables planning | Model errors compound, hard to learn accurate models |
| JEPA | Self-supervised prediction in representation space | Abstracts away irrelevant detail | Early-stage research, limited demonstrations |
| Neuroevolution | Population fitness over generations | Discovers architectures, no gradients needed | Slow, less sample-efficient than gradient methods |

None of these paradigms is "the answer." The most powerful systems tend to combine them: AlphaGo used supervised learning, RL, and tree search. RLHF combines supervised pre-training with RL fine-tuning. World models use supervised learning to build the environment model and RL to train the controller inside it. The future likely involves deeper integration rather than any single paradigm winning.


The paradigms in this chapter share a property that language modeling lacks: they involve an agent interacting with an environment over time. This temporal dimension — acting, observing consequences, adjusting — is what makes RL and its variants fundamentally different from next-token prediction. But all of them face a problem that the next chapter addresses directly: what happens when the environment changes? What happens when you need to learn new things without forgetting old ones?

Next: Chapter 22 — Continual Learning. Catastrophic forgetting, elastic weight consolidation, progressive networks. Why neural networks can't just keep learning — and what's being tried.

1 Watkins, C.J.C.H. (1989). "Learning from Delayed Rewards." PhD thesis, King's College, Cambridge. The convergence proof was published in Watkins and Dayan (1992), "Q-learning," Machine Learning 8:279–292.

2 Sutton, R.S., McAllester, D., Singh, S., and Mansour, Y. (2000). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Advances in Neural Information Processing Systems 12.

3 Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature 518:529–533.

4 Silver, D. et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature 529:484–489.

5 Silver, D. et al. (2017). "Mastering the game of Go without human knowledge." Nature 550:354–359.

6 Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.

7 Ha, D. and Schmidhuber, J. (2018). "World Models." arXiv:1803.10122. Published in an interactive format at worldmodels.github.io.

8 LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." Version 0.9.2, June 27, 2022. Published as an OpenReview preprint.

9 Stanley, K.O. and Miikkulainen, R. (2002). "Evolving Neural Networks through Augmenting Topologies." Evolutionary Computation 10(2):99–127.