Part VII — The Frontier

Continual Learning

Why neural networks forget, and what's being tried to fix it.

Chapter 21 introduced reinforcement learning, world models, and other paradigms where agents learn from ongoing interaction with an environment. But there's a problem hiding in all of them — a problem so fundamental that it undermines the entire premise of a system that "learns and grows through experience." The problem is this: when a neural network learns something new, it tends to destroy what it knew before.

This is called catastrophic forgetting, and it is arguably the single biggest obstacle between current AI and anything that could genuinely be described as growing through experience. A system that can't accumulate knowledge over time isn't learning in any meaningful long-term sense — it's just being repeatedly retrained on different snapshots of the world.

The Problem

Consider a standard neural network trained on Task A until it performs well. Now train it on Task B. After training on B, test it on A again. In most cases, performance on A has collapsed — not degraded slightly, but collapsed. The weights that encoded Task A have been overwritten by the gradients from Task B.

This isn't a bug in any particular architecture or training algorithm. It's a consequence of how gradient descent works. When you compute gradients for Task B and update the weights, the update doesn't know or care about Task A. It moves the weights in whatever direction reduces the loss on B, regardless of what that does to previously learned representations. There is no mechanism in standard backpropagation that says "these weights are important for something you learned earlier — don't touch them."
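This failure is easy to reproduce at toy scale. The sketch below (a synthetic illustration, not any published experiment) trains a logistic-regression classifier with plain gradient descent on one linearly separable task, then on a conflicting one, and measures accuracy on the first task before and after:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(direction, n=2000):
    # Linearly separable synthetic task: the label depends only on `direction`.
    X = rng.normal(size=(n, 2))
    y = (X @ direction > 0).astype(float)
    return X, y

def train(w, X, y, lr=0.1, steps=2000):
    # Plain logistic-regression gradient descent: the update has no memory
    # of, and no protection for, anything learned earlier.
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float((((X @ w) > 0) == (y > 0.5)).mean())

XA, yA = make_task(np.array([1.0, 1.0]))    # Task A: separate along (1, 1)
XB, yB = make_task(np.array([1.0, -1.0]))   # Task B: separate along (1, -1)

w = train(np.zeros(2), XA, yA)
acc_before = accuracy(w, XA, yA)            # near 1.0

w = train(w, XB, yB)                        # sequential training on B
acc_after = accuracy(w, XA, yA)             # drops toward chance

print("Task A before:", acc_before)
print("Task A after :", acc_after)
```

The update rule is the point: nothing in the gradient computation for Task B references Task A, so the Task A solution is simply overwritten.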

[Figure: Catastrophic forgetting. Train on Task A (performance: 95%); the weights encode A. Train on Task B (performance on B: 93%); the weights shift toward B. Test on Task A again: the weights no longer encode A, and performance falls to 25%. In weight space, gradient descent moves the weights from A's optimum toward B's. No force keeps them near A — nothing protects old knowledge. Ideal would be weights that work for both tasks, but such a region may or may not exist.]

The term "catastrophic" is precise. This isn't the gradual forgetting that humans experience — where memories fade slowly and older knowledge gives way to newer knowledge in a graceful decay. It's a sudden, wholesale destruction of previously learned representations. McCloskey and Cohen documented this in 1989, and French provided a thorough analysis in 1999.1 The problem has been known for over three decades and is still not solved.

Why It Happens: The Geometry of Learning

To understand catastrophic forgetting at a deeper level, think about what a trained neural network actually is: a point in a very high-dimensional weight space. Training on Task A finds a point where the loss on A is low. This point sits in some loss basin — a region of weight space where performance on A is good.

When you start training on Task B, gradient descent moves the weights in the direction that reduces the loss on B. But the loss landscape for B has a different geometry than the loss landscape for A. The optimal region for B is generally somewhere else in weight space. As the weights move toward B's optimal region, they leave A's optimal region, and performance on A degrades.

The fundamental issue is that gradient descent has no concept of protected knowledge. Every weight is equally available for modification by every gradient step. In a network with millions or billions of parameters, different tasks tend to use overlapping subsets of weights — the representations that are useful for one task are often partially useful for another. This is what makes transfer learning work, but it's also what makes forgetting so destructive: the shared representations get pulled in different directions by different tasks.

Approach 1: Elastic Weight Consolidation

Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. at DeepMind in 2017, attacks the problem directly.2 The core idea: after learning Task A, identify which weights are most important for A, and penalize changes to those weights when learning Task B.

"Importance" is measured using the Fisher information matrix — a statistical tool that quantifies how sensitive the model's output is to each parameter. Weights with high Fisher information for Task A are the ones whose values matter most for A's performance. When training on B, EWC adds a penalty term to the loss function that grows quadratically as these important weights move from their Task A values.

In intuitive terms: EWC puts elastic bands on the important weights. They can still move — the network can still learn B — but moving them costs something, and the more important they were for A, the stiffer the elastic. The result is that the network finds a solution for B that stays as close as possible to the important weights for A.
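In code, EWC adds two ingredients to ordinary training: a diagonal Fisher estimate computed after Task A, and a quadratic penalty applied during Task B. The sketch below uses a toy logistic-regression setting with two conflicting synthetic tasks, and the common empirical approximation of the diagonal Fisher (averaged squared per-example gradients); it illustrates the mechanism, not DeepMind's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(direction, n=2000):
    X = rng.normal(size=(n, 2))
    y = (X @ direction > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def accuracy(w, X, y):
    return float((((X @ w) > 0) == (y > 0.5)).mean())

def train(w, X, y, fisher=None, w_star=None, lam=0.0, lr=0.1, steps=2000):
    # Logistic-regression gradient descent with an optional EWC penalty:
    #   L(w) = L_task(w) + (lam / 2) * sum_i F_i * (w_i - w*_i)^2
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        if fisher is not None:
            grad = grad + lam * fisher * (w - w_star)  # elastic pull toward w*
        w = w - lr * grad
    return w

XA, yA = make_task(np.array([1.0, 1.0]))    # Task A
XB, yB = make_task(np.array([1.0, -1.0]))   # Task B, conflicting with A

# Train on Task A, then estimate the diagonal Fisher information there as the
# average squared per-example gradient of the log-likelihood.
w_A = train(np.zeros(2), XA, yA)
residual = sigmoid(XA @ w_A) - yA
fisher = ((residual[:, None] * XA) ** 2).mean(axis=0)

w_naive = train(w_A.copy(), XB, yB)                        # no protection
w_ewc = train(w_A.copy(), XB, yB, fisher, w_A, lam=100.0)  # elastic bands

print("Task A accuracy, naive sequential:", accuracy(w_naive, XA, yA))
print("Task A accuracy, with EWC        :", accuracy(w_ewc, XA, yA))
```

With tasks this directly in conflict, the penalty preserves Task A largely at the expense of Task B, which previews the core tension of regularization approaches: stiff elastics protect old knowledge by restricting new learning.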

[Figure: Elastic Weight Consolidation. After Task A, each weight w1…w5 is scored by its Fisher information: high-Fisher weights are important for A, low-Fisher weights are free to change. During Task B training, the loss becomes

Loss = L_B(θ) + (λ/2) ∑_i F_i (θ_i − θ*_i)²

where L_B is the Task B loss, the second term penalizes moving important weights away from their Task A values θ*_i, and F_i is the Fisher information of weight i — how much Task A's performance depends on it.]

EWC works — to a degree. It significantly reduces forgetting compared to naive sequential training. But it has limitations:

- It needs to know where one task ends and the next begins, so that the Fisher information can be computed at each boundary; realistic, gradually shifting data streams don't provide that signal.
- The Fisher information is usually approximated by its diagonal, which ignores interactions between weights and can misjudge which ones actually matter.
- The quadratic penalties accumulate as tasks pile up, progressively stiffening the network until it struggles to learn anything new.

Variants of EWC (online EWC, Synaptic Intelligence, Memory Aware Synapses) address some of these issues, but the core tension remains: protecting old knowledge constrains the network's ability to learn new things. With enough tasks, any fixed-capacity network saturates.

Approach 2: Progressive Networks

Progressive networks, introduced by Rusu et al. at DeepMind in 2016, take a completely different approach: don't modify old weights at all.3 Instead, freeze the network trained on Task A and add a new column of neurons for Task B. The new column can read from the frozen columns via lateral connections, but the old columns are never modified.
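The mechanism fits in a few lines. In this sketch the layer sizes are arbitrary and random weights stand in for a trained Task A column; the point is which parameters a Task B update is allowed to touch:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Column 1: a small MLP for Task A (weights assumed trained; frozen forever).
W1_A, W2_A = rng.normal(size=(4, 3)), rng.normal(size=(3, 1))

# Column 2: fresh weights for Task B, plus a lateral adapter U that reads
# column 1's hidden activations. Only these three matrices are trainable.
W1_B, W2_B = rng.normal(size=(4, 3)), rng.normal(size=(3, 1))
U = rng.normal(size=(3, 3))

def forward_A(x):
    return relu(x @ W1_A) @ W2_A          # uses frozen weights only

def forward_B(x):
    h_A = relu(x @ W1_A)                  # read column 1, never modify it
    h_B = relu(x @ W1_B + h_A @ U)        # lateral connection into column 2
    return h_B @ W2_B

x = rng.normal(size=(5, 4))
out_A_before = forward_A(x)
out_B_before = forward_B(x)

# Training column 2 touches only W1_B, W2_B, and U; simulate any update:
W1_B += 0.1; W2_B += 0.1; U += 0.1

# Column 1's function is unchanged: zero forgetting, by construction.
assert np.allclose(forward_A(x), out_A_before)
```

Column 2's output does change, of course; the guarantee is only that nothing a later task does can disturb an earlier column.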

[Figure: Progressive networks. Column 1 (Task A) is frozen; Column 2 (Task B) is training, with lateral connections that read from Column 1 without modifying it; Column 3 (Task C) would be added in the future. Each new task gets its own column, old columns stay frozen, and the network only grows.]

Progressive networks have a clear advantage: they guarantee zero forgetting, because old weights are never touched. And the lateral connections enable positive forward transfer — knowledge from previous tasks can help with new ones, since the new column can use features from the old columns.

The equally clear disadvantage: the network grows linearly with the number of tasks. After 100 tasks, you have 100 columns. This doesn't scale. It's also unclear how to share knowledge between tasks efficiently — the lateral connections are a relatively coarse mechanism. But as a proof of concept, progressive networks demonstrated that architectural approaches to continual learning are viable.

Approach 3: Replay Methods

The most intuitive solution to forgetting is also the simplest: replay. When training on Task B, mix in examples from Task A. This way, the gradients from A and B are balanced, and the network doesn't drift away from A's solution.

There are several variants:

- Experience replay stores a buffer of raw examples from earlier tasks and mixes them into every training batch.
- Generative replay trains a generative model to produce synthetic examples of earlier tasks, so no raw data needs to be stored.
- Gradient Episodic Memory (GEM) uses stored examples as constraints rather than training data: each gradient update is projected so that it does not increase the loss on any past task.4

Replay methods work well in practice — they're among the most effective continual learning techniques available — but they don't solve the fundamental problem. They mitigate forgetting by essentially doing multi-task training with a memory buffer. If the buffer is large enough, this converges to joint training on all tasks, which isn't continual learning at all — it's just batch training with extra steps.
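The storage side of replay is simple to sketch. One common choice (an illustration, not prescribed by any particular paper) is a fixed-size buffer filled by reservoir sampling, which maintains a uniform random sample of the stream without knowing its length in advance, plus a helper that mixes replayed examples into each new batch:

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over a data stream."""

    def __init__(self, capacity):
        self.capacity, self.seen, self.items = capacity, 0, []

    def add(self, x, y):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append((x, y))
        else:
            # Classic reservoir sampling: the i-th item seen replaces a
            # random slot with probability capacity / i.
            j = rng.integers(0, self.seen)
            if j < self.capacity:
                self.items[j] = (x, y)

    def sample(self, k):
        idx = rng.integers(0, len(self.items), size=k)
        xs, ys = zip(*[self.items[i] for i in idx])
        return np.stack(xs), np.array(ys)

def mixed_batch(XB_batch, yB_batch, buffer, replay_k):
    # Each Task B gradient step trains on fresh B data plus replayed old
    # examples, so the update balances both losses.
    Xr, yr = buffer.sample(replay_k)
    return np.concatenate([XB_batch, Xr]), np.concatenate([yB_batch, yr])
```

The buffer size is the knob: a larger buffer forgets less but drifts toward joint training on everything, which is exactly the critique above.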

How Biology Handles It

The contrast with biological memory systems is illuminating. Brains don't have the catastrophic forgetting problem — at least, not in the same way. You can learn to ride a bike at age 7 and still ride one at age 50, even though you've learned thousands of other skills in between. How?

The leading theory involves the complementary learning systems framework, proposed by McClelland, McNaughton, and O'Reilly in 1995.5 The idea is that the brain uses two systems with fundamentally different learning properties:

- The hippocampus learns fast. It can store a specific new experience after a single exposure, using sparse, largely non-overlapping representations that keep new memories from interfering with each other.
- The neocortex learns slowly. It gradually integrates new information into structured, distributed knowledge as the hippocampus replays stored experiences to it, notably during sleep, interleaving the new material with the old.

Key idea: The brain solves catastrophic forgetting by separating fast learning (hippocampus) from slow integration (neocortex), with replay during sleep as the bridge. Current neural networks try to do both with a single system — fast learning that immediately overwrites existing knowledge. The dual-system architecture that biology uses is a fundamentally different solution.

Several continual learning methods have been explicitly inspired by this biological architecture. Generative replay mirrors hippocampal replay. Some recent work on "sleep-like" consolidation phases in neural networks has shown promise, though the results remain preliminary.

It's worth noting that biological memory isn't perfect either. Humans do forget — the forgetting curve documented by Ebbinghaus in 1885 is real. But biological forgetting is typically gradual rather than catastrophic, and it preferentially affects less important or less rehearsed memories. The mechanisms that protect important memories (emotional tagging by the amygdala, spaced rehearsal, sleep consolidation) have no direct analog in standard neural networks.

The Capacity Problem

Even if you solve the forgetting problem, there's a deeper issue: capacity. A neural network with a fixed number of parameters can only encode a finite amount of information. As you add more tasks, eventually the network doesn't have enough representational capacity to hold all of them, regardless of how cleverly you manage the weights.

This is where progressive networks have the right intuition, even if their implementation is naive: the capacity needs to grow with the knowledge. A system that genuinely learns over time needs some mechanism for expanding its representational capacity — adding parameters, adding modules, adding structure. This connects directly to the question of growing architectures (Chapter 26) and is one of the deepest open problems in the field.
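One concrete version of capacity growth is function-preserving expansion, related in spirit to Net2Net-style transformations: new hidden units are added with random incoming weights (so they can learn something new) and zero outgoing weights (so the network's current behavior is exactly preserved at the moment of growth). A minimal sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A small MLP: 4 inputs, 8 hidden units, 2 outputs.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))
f = lambda x: relu(x @ W1) @ W2

x = rng.normal(size=(5, 4))
before = f(x)

# Grow the hidden layer by two units. New incoming weights are random, so
# the units can come to represent something new; new outgoing weights are
# zero, so the network computes exactly the same function right after growth.
W1 = np.concatenate([W1, rng.normal(size=(4, 2))], axis=1)
W2 = np.concatenate([W2, np.zeros((2, 2))], axis=0)

assert np.allclose(f(x), before)   # capacity added, behavior preserved
```

Later gradient steps can then recruit the new units, which is the easy half of the problem; deciding when and where to grow is the part with no accepted answer.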

Biological brains grow new neurons in specific regions (the hippocampus, notably) throughout life — a process called neurogenesis. Whether this is primarily about adding capacity or about something else (pattern separation, memory encoding) is debated. But the existence of the mechanism suggests that biological systems, too, face a version of the capacity problem and have evolved at least a partial solution.

Current State of the Field

Continual learning remains one of the hardest open problems in machine learning. Here's an honest assessment of where things stand:

| Approach | Works on | Breaks on |
| --- | --- | --- |
| EWC / regularization | Small numbers of tasks (2–10), task boundaries known | Many tasks, gradual distribution shift, high-dimensional tasks |
| Progressive nets | Any number of tasks (zero forgetting guaranteed) | Scaling — model size grows linearly with tasks |
| Replay methods | Practical settings with moderate numbers of tasks | Privacy constraints (can't store old data), very long task sequences |
| Architecture search / growth | Adapting structure to new tasks dynamically | Computational cost, no clear method for deciding what to grow |

In practice, the dominant approach in industry is to simply retrain from scratch when the data distribution changes significantly. This is expensive — pre-training a large language model costs millions of dollars — but it works. Fine-tuning (Chapter 13) offers a partial solution: you can adapt a pre-trained model to new data without retraining from scratch. LoRA (Chapter 14) makes this even cheaper. But fine-tuning always risks forgetting, and in practice, careful LoRA fine-tuning still degrades performance on tasks the model was originally good at.

The honest summary: we don't have continual learning that works at scale. The best methods mitigate the problem but don't solve it. A system that genuinely grows through experience — accumulating knowledge without forgetting, expanding its capacity as needed, integrating new information with old — remains an open research challenge.


Catastrophic forgetting is the structural reason why current AI systems are "frozen" after training. But even if we solved it, there's a separate question: what would drive a system to seek out new knowledge in the first place? Biological learners don't just passively accumulate information — they're driven by curiosity, prediction error, and an intrinsic desire to reduce uncertainty. The next chapter explores whether artificial systems can have anything like that.

Next: Chapter 23 — Intrinsic Motivation and Curiosity. Prediction error as drive, curiosity-driven learning, open-ended learning, and the insight that the constraint might be the feature.

1 McCloskey, M. and Cohen, N.J. (1989). "Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem." Psychology of Learning and Motivation 24:109–165. French, R.M. (1999). "Catastrophic forgetting in connectionist networks." Trends in Cognitive Sciences 3(4):128–135.

2 Kirkpatrick, J. et al. (2017). "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences 114(13):3521–3526.

3 Rusu, A.A. et al. (2016). "Progressive Neural Networks." arXiv:1606.04671.

4 Lopez-Paz, D. and Ranzato, M. (2017). "Gradient Episodic Memory for Continual Learning." Advances in Neural Information Processing Systems 30.

5 McClelland, J.L., McNaughton, B.L., and O'Reilly, R.C. (1995). "Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory." Psychological Review 102(3):419–457.