Chapter 2 ended with a list of architectural lessons the brain offers: learning is weight adjustment, hierarchy builds abstraction, uniform algorithms can handle diverse inputs. This chapter is the story of how people tried to turn those lessons into machines — and how they repeatedly got stuck on the gap between biological plausibility and mathematical tractability.
The central tension of early AI research was this: biological neurons are extraordinarily complex. A single synapse involves dozens of molecular cascades, timing-dependent dynamics, neuromodulatory context, and structural plasticity that unfolds over timescales from milliseconds to years. To build anything computable, researchers had to simplify. The question was always what to keep and what to discard — and the answers they chose shaped the entire trajectory of the field.
The first formal model of a neuron came from an unlikely pair: Warren McCulloch, a neurophysiologist, and Walter Pitts, a self-taught logician who was, at the time, a teenager living in the University of Chicago library. Their 1943 paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity," proposed a radically simplified neuron — the McCulloch-Pitts (MCP) neuron — that reduced the biological neuron to a binary threshold unit.1
The model works as follows: inputs are binary (0 or 1), each input connection is either excitatory or inhibitory, and the unit fires (outputs 1) only if its weighted input sum meets a fixed threshold. Everything is all-or-nothing: no graded signals, no timing, no change over time.
What McCulloch and Pitts showed is that networks of these simple units can compute any Boolean logic function — AND, OR, NOT, and combinations thereof. Since any computable function can be expressed in Boolean logic (a result from mathematical logic), networks of MCP neurons are, in principle, universal computers. This was the first formal proof that neuron-like elements, wired together, could compute.
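The Boolean claim is easy to verify in miniature. Here is a minimal sketch of an MCP unit in Python (function names and the particular weight/threshold settings are mine; many other settings realize the same gates):

```python
# A McCulloch-Pitts unit: binary inputs, fixed weights, hard threshold.
def mcp_neuron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of binary inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Boolean gates as threshold settings (illustrative choices, not unique):
AND = lambda a, b: mcp_neuron([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcp_neuron([a, b], [1, 1], threshold=1)
NOT = lambda a:    mcp_neuron([a],    [-1],   threshold=0)
```

Nothing here is learned: the weights and thresholds are wired in by hand, which is exactly the limitation the rest of this chapter is about.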
What the model kept from biology: the integration of many inputs into a single unit, the distinction between excitatory and inhibitory connections, and the all-or-nothing firing decision once a threshold is crossed.
What the model dropped: continuous-time dynamics, graded signals, the molecular machinery of the synapse, neuromodulation, and, most consequentially, any mechanism for change. The weights and threshold are fixed; the unit cannot learn.
The MCP neuron was a proof of concept, not a practical tool. It demonstrated that computation could emerge from neuron-like units, but it couldn't learn. To get from "neurons can compute" to "neurons can learn to compute," two more ideas were needed.
The learning mechanism came from Donald Hebb, a Canadian psychologist. In his 1949 book The Organization of Behavior, Hebb proposed a principle for how synaptic connections change with experience:2

> When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
The shorthand — "neurons that fire together, wire together" — isn't Hebb's phrasing (it was coined later by Carla Shatz in 1992), but it captures the idea. If neuron A consistently fires just before neuron B, the connection from A to B should strengthen. Activity-dependent plasticity.
Expressed mathematically, the simplest Hebbian learning rule for updating the weight w_ij between neurons i and j is:

Δw_ij = η · x_i · x_j

where η is a small learning rate and x_i, x_j are the activations of the two neurons.
This is a correlation-based rule: strengthen connections between co-active neurons. It has the virtue of being local — each synapse only needs information about its own pre- and postsynaptic neurons. No central controller needs to know what the whole network is doing. This is biologically plausible (a synapse can plausibly "know" only about its own two neurons) and computationally simple.
But Hebbian learning has problems. The most obvious: weights only increase. If two neurons happen to fire together, the connection strengthens, which makes them more likely to fire together again, which strengthens the connection further. This positive feedback loop leads to runaway excitation — all weights saturate at their maximum value. The network becomes useless. Biological brains solve this through LTD, homeostatic plasticity, and other regulatory mechanisms (covered in Chapter 2). The pure mathematical form of Hebb's rule needed similar stabilization mechanisms, which wouldn't be worked out until later.
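The runaway dynamic is easy to see in a toy simulation (a sketch with made-up numbers; the feedback loop, not the specific values, is the point). Here the postsynaptic response is proportional to the weight, so each Hebbian update makes the next one larger:

```python
def simulate_runaway(eta=0.1, steps=50):
    """Pure Hebbian rule with post = w * pre: growth feeds back on itself."""
    w, pre = 0.1, 1.0
    trace = [w]
    for _ in range(steps):
        post = w * pre             # a stronger weight -> a stronger response
        w += eta * pre * post      # Hebb: delta_w = eta * pre * post
        trace.append(w)
    return trace

trace = simulate_runaway()
# The weight multiplies by (1 + eta) every step: geometric, unbounded growth.
```

With nothing in the rule that can ever decrease a weight, the only question is how fast saturation arrives.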
Still, the conceptual contribution was enormous. McCulloch and Pitts showed that neuron-like units could compute. Hebb showed, in principle, how connections could be shaped by experience. The missing piece was a concrete, trainable system that combined both ideas.
Frank Rosenblatt, a psychologist at Cornell, built it. In 1958, he described the perceptron — the first machine that could genuinely learn from data.3 Not just execute a pre-wired computation like the MCP neuron, but adjust its own weights based on whether its answers were right or wrong.
The perceptron is a single artificial neuron with three key improvements over McCulloch-Pitts: weights are continuous real numbers rather than fixed binary connections; a bias term makes the firing threshold itself adjustable; and, crucially, there is a rule for changing the weights in response to errors.
The learning rule is simple: present an input, compute the output. If the output is correct, do nothing. If it's wrong, update each weight by adding (or subtracting) the input value, scaled by a learning rate:

w_i ← w_i + η (t − y) x_i

where t is the correct label, y is the perceptron's output, and η is the learning rate. Since t and y are both 0 or 1, the factor (t − y) is +1 or −1: the weights move toward inputs that should have fired the neuron and away from inputs that shouldn't have.
Geometrically, a perceptron is a linear classifier. It learns a hyperplane (in two dimensions, a line) that separates inputs into two categories. The weight vector defines the orientation of the hyperplane, and the bias defines its position. Training adjusts both until the hyperplane correctly separates all the training examples — if such a separation exists.
And here's where Rosenblatt proved something remarkable: the perceptron convergence theorem. If the training data is linearly separable (meaning a hyperplane can perfectly separate the two classes), the perceptron learning rule is guaranteed to find that hyperplane in a finite number of steps. This isn't a heuristic — it's a mathematical proof. The algorithm converges.4
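A sketch of the rule in code (function and variable names are mine), trained on AND, which is linearly separable and therefore covered by the convergence theorem:

```python
def train_perceptron(data, eta=0.1, max_epochs=100):
    """Perceptron rule: on a mistake, nudge weights toward/away from the input."""
    n = len(data[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in data:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
            if y != target:
                errors += 1
                sign = target - y                          # +1 or -1
                w = [wi + eta * sign * xi for wi, xi in zip(w, x)]
                b += eta * sign
        if errors == 0:              # a full pass with no mistakes: converged
            return w, b
    return w, b

# AND is linearly separable, so the theorem guarantees convergence.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

Running the same loop on XOR would spin forever without converging, which is exactly the wall the next section describes.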
Rosenblatt built a physical machine — the Mark I Perceptron — at the Cornell Aeronautical Laboratory in 1960, using photocells for input and potentiometers for adjustable weights. It learned to classify simple visual patterns. The Navy, which funded the project, told the press that perceptrons would one day "walk, talk, see, write, reproduce itself and be conscious of its existence." The New York Times ran the story.
The hype was extreme, and it set the stage for an equally extreme backlash.
Marvin Minsky and Seymour Papert, both at MIT, published Perceptrons: An Introduction to Computational Geometry in 1969.5 The book was a rigorous mathematical analysis of what single-layer perceptrons could and could not compute. Its most famous result was devastating in its simplicity.
Consider the XOR function (exclusive or): output 1 if exactly one of two inputs is 1, otherwise output 0.
| x1 | x2 | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Plot these four points on a 2D grid, with class 0 as one color and class 1 as another. You'll see that the two classes are arranged diagonally — (0,0) and (1,1) in one class, (0,1) and (1,0) in the other. No single straight line can separate them. The data is not linearly separable.
Since a perceptron can only learn linear decision boundaries, it cannot learn XOR. This is a mathematical certainty, not a training problem — no amount of data, no learning rate, no number of epochs will fix it.
Minsky and Papert went further. They proved that many useful functions — connected regions, symmetry detection, parity — are beyond the reach of single-layer perceptrons. The proofs were rigorous and the implications were clear: if you can't compute XOR, you can't compute most things that matter.
The obvious solution was already known: add more layers. A two-layer perceptron (with a hidden layer between input and output) can solve XOR easily — the first layer transforms the inputs into a linearly separable representation, and the second layer classifies them. Minsky and Papert acknowledged this in their book. But they argued, correctly at the time, that nobody had a practical learning algorithm for multi-layer networks. The perceptron convergence theorem only applied to single-layer networks. How do you adjust the weights of hidden neurons when you don't know what they "should" have computed?
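The two-layer fix is easy to see concretely. Here is a hand-wired network that computes XOR; the weights are chosen by a human, which is precisely the step no one knew how to automate:

```python
def step(z):
    """Hard threshold, as in the perceptron."""
    return 1 if z >= 0 else 0

def xor_two_layer(x1, x2):
    """Hand-wired 2-2-1 network (weights chosen by hand, not learned)."""
    h1 = step(x1 + x2 - 0.5)        # OR: fires if at least one input is on
    h2 = step(-x1 - x2 + 1.5)       # NAND: fires unless both inputs are on
    return step(h1 + h2 - 1.5)      # AND of the hidden units = XOR
```

In the hidden layer's coordinates (h1, h2), the four XOR cases become linearly separable, so the output unit's single line suffices. The open problem was a learning rule that could discover weights like these from data.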
This was not a minor point. It was the central obstacle, and it would take nearly two decades to resolve.
The impact of Perceptrons was amplified by several factors. Minsky was arguably the most influential figure in AI — he'd co-founded the MIT AI Lab and had the ear of funding agencies. The book gave skeptics of neural network research rigorous ammunition. DARPA and other funders, already frustrated by overpromising from perceptron advocates, used it as justification to redirect money away from neural network research and toward symbolic approaches.
Between roughly 1969 and 1982, neural network research entered what's now called the first AI winter — though the term wasn't used at the time. Funding dried up. Graduate students were advised against working on neural networks. The field didn't vanish completely — important work continued, particularly by Stephen Grossberg (adaptive resonance theory), Teuvo Kohonen (self-organizing maps), and James Anderson (associative memory networks) — but it was marginalized, operating at the fringes of respectability.6
The AI winter was not purely about Minsky and Papert. It was also about a real limitation meeting unrealistic expectations. The Navy had predicted conscious machines. Researchers had promised human-level vision and language within a decade. When a rigorous mathematical proof showed that the flagship model couldn't compute XOR — a function so trivial a child understands it intuitively — the credibility collapse was swift.
While neural networks were in the wilderness, an alternative paradigm dominated AI research: symbolic AI, also called GOFAI (Good Old-Fashioned AI), a term coined by philosopher John Haugeland in 1985.7
The two approaches represent fundamentally different theories of what intelligence is and how to build it.
Symbolic AI treats intelligence as symbol manipulation. The core idea: intelligent behavior consists of taking structured representations (symbols, rules, logical statements) and manipulating them according to formal rules. A chess program represents the board as a data structure, pieces as symbols, and legal moves as rules — then searches through possible futures to find the best move.
Key features of the symbolic approach: knowledge is explicit and human-readable; rules are written by hand by programmers and domain experts; reasoning proceeds by logical inference and search; and every step of a computation can, in principle, be inspected and explained.
The crown jewels of symbolic AI were expert systems — programs like MYCIN (1976, medical diagnosis), DENDRAL (1965, chemical structure analysis), and R1/XCON (1980, computer configuration). These systems encoded hundreds or thousands of domain-specific rules and could outperform human specialists within narrow domains.8
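The flavor of an expert system can be sketched in a few lines with a toy forward-chaining engine. The rules below are illustrative inventions, not MYCIN's actual knowledge base:

```python
# Facts are strings; each rule maps a set of premises to one conclusion.
rules = [
    ({"fever", "stiff_neck"}, "suspect_meningitis"),        # illustrative only,
    ({"suspect_meningitis"}, "recommend_lumbar_puncture"),  # not real MYCIN rules
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose premises are all established facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts
```

Real expert systems added certainty factors, explanation traces, and thousands of rules, but the core loop — match premises, fire rule, repeat — is this one.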
Connectionism treats intelligence as emergent from networks of simple units. There are no explicit rules — behavior arises from the pattern of weighted connections between nodes. Knowledge isn't stored in any single place; it's distributed across the network's weights.
Key features of the connectionist approach: knowledge is implicit, distributed across many weights; behavior is learned from examples rather than programmed; performance degrades gracefully under noisy or partial input; and the resulting system is hard to inspect, since no single weight means anything on its own.
| Dimension | Symbolic AI | Connectionism |
|---|---|---|
| Knowledge | Explicit rules, symbols | Distributed across weights |
| How it learns | Human encodes rules | Adjusts weights from data |
| Strengths | Interpretable, precise, composable | Learns patterns, handles noise, generalizes |
| Weaknesses | Brittle, can't handle ambiguity, requires expert | Opaque, needs data, hard to reason about |
| Fails when | Domain is messy, rules are unclear | Data is scarce, logic must be exact |
| Biological analog | Conscious reasoning, language | Perception, pattern recognition, intuition |
The debate wasn't purely academic — it determined who got funded, who got hired, and what got built. Through the 1970s and into the mid-1980s, the symbolic approach dominated. Expert systems attracted commercial investment. DARPA funded logic-based programs. Japan launched the Fifth Generation Computer Systems project in 1982, committing hundreds of millions of dollars to symbolic AI based on logic programming.
The irony is that the problems symbolic AI struggled with most — perception, language understanding, flexible common-sense reasoning — turned out to be exactly the problems connectionism was best suited for. And the problems connectionism struggled with — logical reasoning, planning, compositionality — were symbolic AI's strengths. The two paradigms were, in hindsight, complementary. Modern systems like large language models are, in a meaningful sense, connectionist architectures that have learned to manipulate symbols.
The algorithm that brought neural networks back from the dead was backpropagation — short for "backward propagation of errors." It solved the problem that Minsky and Papert had identified as intractable: how to train multi-layer networks.
The core idea is elegant. In a multi-layer network, the output layer's error is easy to compute — you compare what the network predicted against the correct answer. But what about hidden layers? A hidden neuron doesn't have a "correct" output to compare against. Backpropagation solves this by using the chain rule of calculus to propagate the error signal backward through the network, computing how much each weight in each layer contributed to the final error, and then adjusting each weight proportionally.
The math will be covered in full in Chapter 6 (The Training Loop), but the essential intuition is: if you can express the entire network as a differentiable function from inputs to outputs, you can compute the gradient of the error with respect to every weight in the network — no matter how deep. Then you adjust each weight in the direction that reduces the error. That's gradient descent applied to every weight simultaneously.
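The chain-rule bookkeeping can be sketched for a tiny 2-2-1 sigmoid network. This is a minimal illustration, not the 1986 implementation; the flat weight layout and names are mine. The analytic gradients are verified against finite differences, which is the standard sanity check for any backprop implementation:

```python
import math

def forward(x, w):
    """Tiny 2-2-1 network with sigmoid units; w is a flat list of 9 weights."""
    s = lambda z: 1.0 / (1.0 + math.exp(-z))
    h1 = s(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = s(w[3] * x[0] + w[4] * x[1] + w[5])
    y = s(w[6] * h1 + w[7] * h2 + w[8])
    return h1, h2, y

def loss(x, t, w):
    """Squared error between the network's output and the target t."""
    return 0.5 * (forward(x, w)[2] - t) ** 2

def backprop(x, t, w):
    """Exact gradient of the loss w.r.t. every weight, via the chain rule."""
    h1, h2, y = forward(x, w)
    dy = (y - t) * y * (1 - y)        # error signal at the output unit
    dh1 = dy * w[6] * h1 * (1 - h1)   # error propagated back to hidden unit 1
    dh2 = dy * w[7] * h2 * (1 - h2)   # ...and to hidden unit 2
    return [dh1 * x[0], dh1 * x[1], dh1,
            dh2 * x[0], dh2 * x[1], dh2,
            dy * h1, dy * h2, dy]

# Sanity check: analytic gradients match central finite differences.
w = [0.5, -0.3, 0.1, 0.8, 0.2, -0.4, 0.7, -0.6, 0.05]
x, t = (1.0, 0.0), 1.0
grad = backprop(x, t, w)
for i in range(9):
    wp = list(w); wp[i] += 1e-6
    wm = list(w); wm[i] -= 1e-6
    numeric = (loss(x, t, wp) - loss(x, t, wm)) / 2e-6
    assert abs(grad[i] - numeric) < 1e-6
```

Note that the hidden-layer gradients reuse the output error `dy`: the backward pass computes each layer's error signal from the one after it, which is what makes the whole thing a single backward sweep rather than nine separate derivative calculations.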
The algorithm has a tangled history. Paul Werbos described it in his 1974 PhD thesis at Harvard, but the work went largely unnoticed.9 Seppo Linnainmaa had described the general method of automatic differentiation (which backpropagation is a special case of) as early as 1970.10 David Parker independently derived it in 1985. But the paper that brought it into the mainstream was Rumelhart, Hinton, and Williams (1986), "Learning representations by back-propagating errors," published in Nature.11 This paper combined the algorithm with compelling demonstrations — including solving XOR — and reached a wide audience.
What made the 1986 paper matter wasn't just the algorithm. It was the demonstration that multi-layer networks trained with backpropagation could learn internal representations — the hidden layers discovered useful features that the designers never specified. The network wasn't just memorizing input-output pairs; it was learning to represent the structure of the data in its hidden activations. This was the connectionist dream realized: intelligence emerging from adjustment of weights, without explicit programming.
The 1986 paper ignited a connectionist revival. Funding returned. The PDP (Parallel Distributed Processing) Research Group at UCSD, led by Rumelhart and James McClelland, published a landmark two-volume set in the same year that laid out the theoretical foundations for connectionist cognitive science.12 Neural networks were back.
By the late 1980s, the artificial neural network had crystallized into a form recognizable as the ancestor of modern deep learning. It's worth pausing here to take stock of the translation from biology to math. What survived: weighted connections, summation of inputs, a nonlinear firing decision, learning as weight adjustment, and layered organization. What was discarded: spikes and their timing, neuromodulation, and the strict locality of biological learning rules. And what was added that biology never had: backpropagation itself, which requires every synapse to adjust based on errors computed far downstream, something no known biological mechanism straightforwardly provides.13
The connectionist revival of the late 1980s didn't last forever. By the early 1990s, neural networks hit practical limits. Training deep networks (more than two or three hidden layers) was notoriously unstable — gradients either vanished (shrank exponentially as they propagated backward through layers, so early layers learned nothing) or exploded (grew exponentially, causing weights to diverge). The vanishing gradient problem was formally characterized by Sepp Hochreiter in his diploma thesis (1991) and later in Hochreiter and Schmidhuber's landmark LSTM paper (1997).14
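The vanishing-gradient arithmetic is simple to demonstrate. Backpropagating through a sigmoid layer multiplies the gradient by s'(z) = s(z)(1 − s(z)), which is at most 0.25, so stacking layers shrinks the signal geometrically. A sketch with idealized per-layer factors:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of per-layer backward factors: weight * s'(z) per layer."""
    scale = 1.0
    for _ in range(depth):
        s = sigmoid(z)
        scale *= weight * s * (1 - s)   # s'(z) = s(1 - s) <= 0.25
    return scale

shallow = gradient_scale(depth=2)    # 0.25 ** 2  = 0.0625
deep = gradient_scale(depth=20)      # 0.25 ** 20 ~ 9.1e-13
```

At 20 layers the early weights receive gradients about twelve orders of magnitude smaller than the output layer's — effectively zero in finite-precision training. Large weights flip the same arithmetic into explosion instead.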
At the same time, alternative machine learning methods — support vector machines (SVMs), kernel methods, decision trees, random forests — offered strong performance with cleaner theoretical guarantees and less finicky training. Through the 1990s and early 2000s, these methods often outperformed neural networks on practical tasks and became the standard tools of machine learning.
Neural networks didn't disappear again, but they lost their position as the dominant paradigm. The second resurgence — triggered by Hinton's deep belief networks (2006), the ImageNet moment (2012), and the GPU computing revolution — would come later. Those developments belong to Part III of this guide.
This chapter covered the first fifty years of the journey from biological neurons to artificial ones — a period that established every foundational idea modern deep learning rests on: the formal neuron, learned weights, layered networks, gradient-based training, and the ongoing tension between biological inspiration and mathematical convenience.
But everything covered so far has been qualitative and conceptual. The perceptron convergence theorem was mentioned but not derived. Backpropagation was described in intuition but not in math. Gradient descent was invoked as a phrase, not as an equation. To actually understand how neural networks work — not as metaphors but as systems — you need the mathematical tools: vectors and matrices for representing data and weights, derivatives and gradients for optimization, probability for handling uncertainty.
That's where Chapter 4 begins.
Next: Chapter 4 — Mathematical Foundations. Linear algebra, calculus, probability, and information theory — each concept tied to where it appears in AI. The actual math behind vectors, gradients, loss functions, and everything the network computes.

1 McCulloch, W. S. and Pitts, W. (1943), "A Logical Calculus of the Ideas Immanent in Nervous Activity." Bulletin of Mathematical Biophysics 5:115-133. Pitts was approximately 17-18 years old at the time of this work; his exact birthdate is disputed. He had been living around the University of Chicago and studying with Rudolf Carnap and others.
2 Hebb, D. O. (1949), The Organization of Behavior: A Neuropsychological Theory. New York: Wiley. The quote given is from Chapter 4, "The First Stage of Perception: Growth of the Assembly."
3 Rosenblatt, F. (1958), "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review 65(6):386-408.
4 The convergence theorem was presented in Rosenblatt's 1962 book, Principles of Neurodynamics. Earlier proofs of convergence for similar algorithms had appeared in work by Agmon (1954) and Motzkin and Schoenberg (1954) in the context of solving systems of linear inequalities.
5 Minsky, M. and Papert, S. (1969), Perceptrons: An Introduction to Computational Geometry. MIT Press. An expanded edition with a new foreword addressing multi-layer networks was published in 1988.
6 The characterization of this period as a "winter" is somewhat retrospective. Work continued in relative isolation. Grossberg's Adaptive Resonance Theory dates from the 1970s. Kohonen's Self-Organizing Maps were published in 1982. Anderson and others at Brown maintained a connectionist research program throughout.
7 Haugeland, J. (1985), Artificial Intelligence: The Very Idea. MIT Press. Haugeland coined "GOFAI" specifically to distinguish logic-and-symbol-based AI from emerging connectionist approaches.
8 MYCIN: Shortliffe (1976). DENDRAL: Feigenbaum, Buchanan, and Lederberg, development began 1965. R1/XCON: McDermott (1982), deployed at DEC for computer system configuration. R1 reportedly saved DEC $40 million per year at its peak.
9 Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University. Werbos has noted that the work was largely ignored for over a decade.
10 Linnainmaa, S. (1970), "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors." Master's thesis, University of Helsinki. This described what is now called reverse-mode automatic differentiation, the general mathematical technique underlying backpropagation.
11 Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986), "Learning representations by back-propagating errors." Nature 323:533-536.
12 Rumelhart, D. E. and McClelland, J. L. (eds.) (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 2 vols. MIT Press. These volumes became the foundational texts of the connectionist movement.
13 The biological credit assignment problem remains open. Proposed biological alternatives to backpropagation include feedback alignment (Lillicrap et al., 2016), predictive coding (Rao and Ballard, 1999), and target propagation (Lee et al., 2015), but none has been definitively demonstrated in biological neural circuits.
14 Hochreiter, S. (1991), "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München. The more widely cited treatment is Hochreiter, S. and Schmidhuber, J. (1997), "Long Short-Term Memory." Neural Computation 9(8):1735-1780, which both characterized the vanishing gradient problem and proposed the LSTM architecture as a solution.