Part I — Intelligence in Nature

From Neuroscience to AI

How biology became math, what survived the translation, and why the field nearly died before it began.

The Translation Problem

Chapter 2 ended with a list of architectural lessons the brain offers: learning is weight adjustment, hierarchy builds abstraction, uniform algorithms can handle diverse inputs. This chapter is the story of how people tried to turn those lessons into machines — and how they repeatedly got stuck on the gap between biological plausibility and mathematical tractability.

The central tension of early AI research was this: biological neurons are extraordinarily complex. A single synapse involves dozens of molecular cascades, timing-dependent dynamics, neuromodulatory context, and structural plasticity that unfolds over timescales from milliseconds to years. To build anything computable, researchers had to simplify. The question was always what to keep and what to discard — and the answers they chose shaped the entire trajectory of the field.

The McCulloch-Pitts Neuron (1943)

The first formal model of a neuron came from an unlikely pair: Warren McCulloch, a neurophysiologist, and Walter Pitts, a self-taught logician who was, at the time, a teenager living in the University of Chicago library. Their 1943 paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity," proposed a radically simplified neuron — the McCulloch-Pitts (MCP) neuron — that reduced the biological neuron to a binary threshold unit.1

The model works as follows:

  1. A neuron receives binary inputs (0 or 1) from other neurons.
  2. Each input is either excitatory (contributes toward firing) or inhibitory (prevents firing entirely — a single inhibitory input vetoes the output).
  3. If the sum of excitatory inputs meets or exceeds a fixed threshold, and no inhibitory input is active, the neuron fires (outputs 1). Otherwise, it outputs 0.
  4. Time proceeds in discrete steps. All neurons compute simultaneously and instantaneously at each step.
Figure: The McCulloch-Pitts neuron — binary inputs, threshold activation, binary output. Excitatory inputs are summed and compared against a threshold θ, provided no inhibitory input is active. Example: with θ = 2 and x₃ inhibitory, inputs x₁ = 1, x₂ = 1, x₃ = 0 give a sum of 2 ≥ 2, so the output is 1. If x₃ = 1, the output is 0 (vetoed).
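The four rules above fit in a few lines of code — a minimal illustrative sketch, not anything from the original paper:

```python
def mcp_neuron(inputs, excitatory, threshold):
    """McCulloch-Pitts unit: binary inputs, fixed threshold, inhibitory veto.

    inputs:     list of 0/1 values
    excitatory: parallel list of booleans; False marks an inhibitory input
    threshold:  integer firing threshold (theta)
    """
    # Rule 2: any active inhibitory input vetoes the output entirely.
    if any(x == 1 and not exc for x, exc in zip(inputs, excitatory)):
        return 0
    # Rule 3: fire iff the sum of excitatory inputs meets the threshold.
    total = sum(x for x, exc in zip(inputs, excitatory) if exc)
    return 1 if total >= threshold else 0

# The example from the figure: theta = 2, x3 inhibitory.
print(mcp_neuron([1, 1, 0], [True, True, False], 2))  # fires: 1
print(mcp_neuron([1, 1, 1], [True, True, False], 2))  # vetoed: 0
```

Note how the logic gates fall out directly: two excitatory inputs with θ = 2 compute AND, and the same unit with θ = 1 computes OR.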

What McCulloch and Pitts showed is that networks of these simple units can compute any Boolean logic function — AND, OR, NOT, and combinations thereof. Since any computable function can be expressed in Boolean logic (a result from mathematical logic), networks of MCP neurons are, in principle, universal computers. This was the first formal proof that neuron-like elements, wired together, could compute.

What the model kept from biology:

  - All-or-none output: a neuron either fires or it doesn't, mirroring the binary character of the action potential.
  - Threshold activation: firing depends on whether summed input crosses a threshold.
  - Excitation and inhibition as distinct kinds of input.
  - Computation through connectivity: what a network does is determined by how its units are wired together.

What the model dropped:

  - Continuous time: real spike timing is replaced by synchronous discrete steps.
  - Graded signals: firing rates and subthreshold dynamics collapse into 0s and 1s.
  - Learning: the connections are fixed; nothing in the model changes with experience.
  - The molecular machinery of the synapse: the cascades, neuromodulation, and structural plasticity described above.

The MCP neuron was a proof of concept, not a practical tool. It demonstrated that computation could emerge from neuron-like units, but it couldn't learn. To get from "neurons can compute" to "neurons can learn to compute," two more ideas were needed.

Hebb's Rule (1949)

The learning mechanism came from Donald Hebb, a Canadian psychologist. In his 1949 book The Organization of Behavior, Hebb proposed a principle for how synaptic connections change with experience:2

Hebb's postulate: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

The shorthand — "neurons that fire together, wire together" — isn't Hebb's phrasing (it was coined later by Carla Shatz in 1992), but it captures the idea. If neuron A consistently fires just before neuron B, the connection from A to B should strengthen. Activity-dependent plasticity.

Expressed mathematically, the simplest Hebbian learning rule for updating the weight wᵢⱼ between neurons i and j is:

Δwᵢⱼ = η · xᵢ · xⱼ

where η (eta) is a learning rate, xᵢ is the output of neuron i, and xⱼ is the output of neuron j. When both neurons are active (both outputs positive), the weight increases. When they're uncorrelated, the weight doesn't change.

This is a correlation-based rule: strengthen connections between co-active neurons. It has the virtue of being local — each synapse only needs information about its own pre- and postsynaptic neurons. No central controller needs to know what the whole network is doing. This is biologically plausible (a synapse can plausibly "know" only about its own two neurons) and computationally simple.

But Hebbian learning has problems. The most obvious: weights only increase. If two neurons happen to fire together, the connection strengthens, which makes them more likely to fire together again, which strengthens the connection further. This positive feedback loop leads to runaway excitation — all weights saturate at their maximum value. The network becomes useless. Biological brains solve this through long-term depression (LTD), homeostatic plasticity, and other regulatory mechanisms (covered in Chapter 2). The pure mathematical form of Hebb's rule needed similar stabilization mechanisms, which wouldn't be worked out until later.
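A few lines of code make the runaway dynamic concrete. In this sketch (illustrative values, not from Hebb), two always-co-active neurons drive the weight up without bound:

```python
eta = 0.1  # learning rate

# Pure Hebbian updates: delta_w = eta * x_i * x_j.
w = 0.5
for step in range(10):
    x_i, x_j = 1, 1          # the two neurons keep firing together
    w += eta * x_i * x_j     # so the weight only ever grows
print(w)  # 1.5 after 10 steps, and nothing stops it from growing further

# Uncorrelated activity leaves the weight untouched:
w2 = 0.5
w2 += eta * 1 * 0            # one neuron silent, so delta_w = 0
print(w2)  # still 0.5
```

There is no term in the update that can ever decrease w, which is exactly the instability the text describes.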

Still, the conceptual contribution was enormous. McCulloch and Pitts showed that neuron-like units could compute. Hebb showed, in principle, how connections could be shaped by experience. The missing piece was a concrete, trainable system that combined both ideas.

Rosenblatt's Perceptron (1958)

Frank Rosenblatt, a psychologist at Cornell, built it. In 1958, he described the perceptron — the first machine that could genuinely learn from data.3 Not just execute a pre-wired computation like the MCP neuron, but adjust its own weights based on whether its answers were right or wrong.

The perceptron is a single artificial neuron with three key improvements over McCulloch-Pitts:

  1. Weighted inputs — each input connection has a real-valued weight (not just +1 or inhibitory). The neuron computes a weighted sum: z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, where b is a bias term (effectively a learnable threshold).
  2. A learning rule — weights are updated based on errors. If the perceptron outputs the wrong answer, each weight is adjusted in the direction that would have produced the right answer.
  3. Continuous weights, binary output — the weights are real numbers that change during training. The output is still binary: if the weighted sum exceeds zero, output 1; otherwise, output 0.

The learning rule is simple: present an input, compute the output. If the output is correct, do nothing. If it's wrong, update each weight by adding (or subtracting) the input value, scaled by a learning rate:

Perceptron update rule:

If the perceptron predicts 0 but should have predicted 1:
    wᵢ ← wᵢ + η · xᵢ

If the perceptron predicts 1 but should have predicted 0:
    wᵢ ← wᵢ − η · xᵢ

This nudges the decision boundary so the current input would be classified correctly.

Geometrically, a perceptron is a linear classifier. It learns a hyperplane (in two dimensions, a line) that separates inputs into two categories. The weight vector defines the orientation of the hyperplane, and the bias defines its position. Training adjusts both until the hyperplane correctly separates all the training examples — if such a separation exists.

And here's where Rosenblatt proved something remarkable: the perceptron convergence theorem. If the training data is linearly separable (meaning a hyperplane can perfectly separate the two classes), the perceptron learning rule is guaranteed to find that hyperplane in a finite number of steps. This isn't a heuristic — it's a mathematical proof. The algorithm converges.4
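The theorem can be watched in action on a linearly separable problem like AND. A minimal sketch of the update rule (η = 1, zero initialization, and a fixed epoch cap are illustrative choices, not from Rosenblatt):

```python
def train_perceptron(data, eta=1.0, max_epochs=100):
    """Perceptron learning rule on (inputs, target) pairs with 0/1 targets."""
    n = len(data[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if z > 0 else 0
            if pred != target:
                mistakes += 1
                sign = 1 if target == 1 else -1     # add the input if we under-shot,
                w = [wi + sign * eta * xi for wi, xi in zip(w, x)]
                b += sign * eta                     # subtract it if we over-shot
        if mistakes == 0:                           # converged: a full error-free pass
            break
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
for x, target in AND:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    print(x, 1 if z > 0 else 0)  # matches the AND truth table
```

Because AND is linearly separable, the theorem guarantees the loop above exits early with a separating line; on XOR, the same code would cycle until the epoch cap without ever converging.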

Rosenblatt built a physical machine — the Mark I Perceptron — at Cornell in 1960, using photocells for input and potentiometers for adjustable weights. It learned to classify simple visual patterns. The Navy, which funded the project, issued a press release predicting that perceptrons would one day "walk, talk, see, write, reproduce itself and be conscious of its existence." The New York Times ran the story on the front page.

The hype was extreme, and it set the stage for an equally extreme backlash.

The XOR Problem and Minsky-Papert (1969)

Marvin Minsky and Seymour Papert, both at MIT, published Perceptrons: An Introduction to Computational Geometry in 1969.5 The book was a rigorous mathematical analysis of what single-layer perceptrons could and could not compute. Its most famous result was devastating in its simplicity.

Consider the XOR function (exclusive or): output 1 if exactly one of two inputs is 1, otherwise output 0.

  x₁   x₂   XOR
  0    0    0
  0    1    1
  1    0    1
  1    1    0

Plot these four points on a 2D grid, with class 0 as one color and class 1 as another. You'll see that the two classes are arranged diagonally — (0,0) and (1,1) in one class, (0,1) and (1,0) in the other. No single straight line can separate them. The data is not linearly separable.

Figure: Why XOR breaks a single perceptron. Left: AND is linearly separable — one line separates (1,1) from the other three points. Right: XOR is not — (0,0) and (1,1) fall in one class, (0,1) and (1,0) in the other, and no single line can separate them.

Since a perceptron can only learn linear decision boundaries, it cannot learn XOR. This is a mathematical certainty, not a training problem — no amount of data, no learning rate, no number of epochs will fix it.

Minsky and Papert went further. They proved that many useful functions — connected regions, symmetry detection, parity — are beyond the reach of single-layer perceptrons. The proofs were rigorous and the implications were clear: if you can't compute XOR, you can't compute most things that matter.

The obvious solution was already known: add more layers. A two-layer perceptron (with a hidden layer between input and output) can solve XOR easily — the first layer transforms the inputs into a linearly separable representation, and the second layer classifies them. Minsky and Papert acknowledged this in their book. But they argued, correctly at the time, that nobody had a practical learning algorithm for multi-layer networks. The perceptron convergence theorem only applied to single-layer networks. How do you adjust the weights of hidden neurons when you don't know what they "should" have computed?
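The two-layer fix is easy to verify by hand-wiring the weights. In this sketch, one hidden threshold unit computes OR, the other NAND, and the output unit ANDs them together — thresholds chosen by hand, which is exactly the kind of solution nobody yet knew how to learn:

```python
def step(z):
    """Hard threshold, as in the perceptron: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_two_layer(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR
    h2 = step(-x1 - x2 + 1.5)       # hidden unit 2: NAND
    return step(h1 + h2 - 1.5)      # output unit: AND of the two

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_two_layer(x1, x2))  # reproduces the XOR truth table
```

The hidden layer remaps the four points so that the two classes become linearly separable — (h₁, h₂) is (0,1), (1,1), (1,1), (1,0) for the four inputs — and the output unit then draws a single line through that new space.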

This was not a minor point. It was the central obstacle, and it would take nearly two decades to resolve.

The First AI Winter

The impact of Perceptrons was amplified by several factors. Minsky was arguably the most influential figure in AI — he'd co-founded the MIT AI Lab and had the ear of funding agencies. The book gave skeptics of neural network research rigorous ammunition. DARPA and other funders, already frustrated by overpromising from perceptron advocates, used it as justification to redirect money away from neural network research and toward symbolic approaches.

Between roughly 1969 and 1982, neural network research entered what's now called the first AI winter — though the term wasn't used at the time. Funding dried up. Graduate students were advised against working on neural networks. The field didn't vanish completely — important work continued, particularly by Stephen Grossberg (adaptive resonance theory), Teuvo Kohonen (self-organizing maps), and James Anderson (associative memory networks) — but it was marginalized, operating at the fringes of respectability.6

The AI winter was not purely about Minsky and Papert. It was also about a real limitation meeting unrealistic expectations. The Navy had predicted conscious machines. Researchers had promised human-level vision and language within a decade. When a rigorous mathematical proof showed that the flagship model couldn't compute XOR — a function so trivial a child understands it intuitively — the credibility collapse was swift.

Key idea: The first AI winter teaches a recurring lesson in the field: the gap between "this is theoretically possible" and "we know how to train it" is often decades wide. Multi-layer networks could, in principle, solve the XOR problem and much more. But without a practical learning algorithm for those networks, the theory was stranded. Capability without trainability is just architecture.

Symbolicism vs. Connectionism

While neural networks were in the wilderness, an alternative paradigm dominated AI research: symbolic AI, also called GOFAI (Good Old-Fashioned AI), a term coined by philosopher John Haugeland in 1985.7

The two approaches represent fundamentally different theories of what intelligence is and how to build it.

Symbolic AI (GOFAI)

Symbolic AI treats intelligence as symbol manipulation. The core idea: intelligent behavior consists of taking structured representations (symbols, rules, logical statements) and manipulating them according to formal rules. A chess program represents the board as a data structure, pieces as symbols, and legal moves as rules — then searches through possible futures to find the best move.

Key features of the symbolic approach:

  - Knowledge is explicit: facts and rules are written down in a form humans can read and audit.
  - Reasoning is search: the program explores possible configurations of symbols, guided by formal rules of inference.
  - Systems are engineered, not trained: a human expert encodes the domain knowledge by hand.

The crown jewels of symbolic AI were expert systems — programs like MYCIN (1976, medical diagnosis), DENDRAL (1965, chemical structure analysis), and R1/XCON (1980, computer configuration). These systems encoded hundreds or thousands of domain-specific rules and could outperform human specialists within narrow domains.8

Connectionism

Connectionism treats intelligence as emergent from networks of simple units. There are no explicit rules — behavior arises from the pattern of weighted connections between nodes. Knowledge isn't stored in any single place; it's distributed across the network's weights.

Key features of the connectionist approach:

  - Knowledge is distributed: no single weight "means" anything on its own; representations are spread across many connections.
  - Behavior is learned from data: weights are adjusted by a learning rule rather than programmed.
  - Degradation is graceful: noisy or partial input degrades performance smoothly instead of breaking it.

Dimension         | Symbolic AI                                       | Connectionism
Knowledge         | Explicit rules, symbols                           | Distributed across weights
How it learns     | Human encodes rules                               | Adjusts weights from data
Strengths         | Interpretable, precise, composable                | Learns patterns, handles noise, generalizes
Weaknesses        | Brittle, can't handle ambiguity, requires experts | Opaque, needs data, hard to reason about
Fails when        | Domain is messy, rules are unclear                | Data is scarce, logic must be exact
Biological analog | Conscious reasoning, language                     | Perception, pattern recognition, intuition

The debate wasn't purely academic — it determined who got funded, who got hired, and what got built. Through the 1970s and into the mid-1980s, symbolicism dominated. Expert systems attracted commercial investment. DARPA funded logic-based programs. Japan launched the Fifth Generation Computer Systems project in 1982, committing hundreds of millions of dollars to symbolic AI based on logic programming.

The irony is that the problems symbolic AI struggled with most — perception, language understanding, flexible common-sense reasoning — turned out to be exactly the problems connectionism was best suited for. And the problems connectionism struggled with — logical reasoning, planning, compositionality — were symbolic AI's strengths. The two paradigms were, in hindsight, complementary. Modern systems like large language models are, in a meaningful sense, connectionist architectures that have learned to manipulate symbols.

The Backpropagation Breakthrough (1986)

The algorithm that brought neural networks back from the dead was backpropagation — short for "backward propagation of errors." It solved the problem that Minsky and Papert had identified as intractable: how to train multi-layer networks.

The core idea is elegant. In a multi-layer network, the output layer's error is easy to compute — you compare what the network predicted against the correct answer. But what about hidden layers? A hidden neuron doesn't have a "correct" output to compare against. Backpropagation solves this by using the chain rule of calculus to propagate the error signal backward through the network, computing how much each weight in each layer contributed to the final error, and then adjusting each weight proportionally.

The math will be covered in full in Chapter 6 (The Training Loop), but the essential intuition is: if you can express the entire network as a differentiable function from inputs to outputs, you can compute the gradient of the error with respect to every weight in the network — no matter how deep. Then you adjust each weight in the direction that reduces the error. That's gradient descent applied to every weight simultaneously.
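That intuition fits in a page of code. The sketch below trains a 2-2-1 sigmoid network on XOR by pushing the error backward through both layers (the initial weights and learning rate are illustrative choices; the full derivation waits for Chapter 6):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR training set
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Hand-picked asymmetric starting weights (illustrative, not canonical)
w1 = [[0.5, -0.5], [-0.3, 0.4]]   # input -> hidden
b1 = [0.1, -0.1]
w2 = [0.6, -0.4]                  # hidden -> output
b2 = 0.0
eta = 0.5

def forward(x):
    h = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(2)]
    y = sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA) / len(DATA)

initial_loss = loss()
for _ in range(5000):
    for x, t in DATA:
        h, y = forward(x)
        # Output layer: error times the sigmoid's slope (chain rule, step 1)
        delta_out = (y - t) * y * (1 - y)
        # Hidden layer: propagate that signal backward through w2 (step 2)
        delta_h = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Gradient-descent step on every weight simultaneously
        for j in range(2):
            w2[j] -= eta * delta_out * h[j]
            w1[j][0] -= eta * delta_h[j] * x[0]
            w1[j][1] -= eta * delta_h[j] * x[1]
            b1[j] -= eta * delta_h[j]
        b2 -= eta * delta_out
final_loss = loss()
print(initial_loss, final_loss)  # the error drops as training proceeds
```

The key line is the computation of delta_h: the hidden units never see a "correct answer," only their share of the output error, apportioned by the weights that connect them to it.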

The algorithm has a tangled history. Paul Werbos described it in his 1974 PhD thesis at Harvard, but the work went largely unnoticed.9 Seppo Linnainmaa had described the general method of automatic differentiation (which backpropagation is a special case of) as early as 1970.10 David Parker independently derived it in 1985. But the paper that brought it into the mainstream was Rumelhart, Hinton, and Williams (1986), "Learning representations by back-propagating errors," published in Nature.11 This paper combined the algorithm with compelling demonstrations — including solving XOR — and reached a wide audience.

What made the 1986 paper matter wasn't just the algorithm. It was the demonstration that multi-layer networks trained with backpropagation could learn internal representations — the hidden layers discovered useful features that the designers never specified. The network wasn't just memorizing input-output pairs; it was learning to represent the structure of the data in its hidden activations. This was the connectionist dream realized: intelligence emerging from adjustment of weights, without explicit programming.

Key idea: Backpropagation required one crucial thing that biology doesn't obviously provide: differentiability. The entire network must be a smooth, differentiable function so that gradients can flow. This forced a design choice that separated artificial from biological neurons permanently: the hard binary threshold (fire/don't fire) was replaced with smooth activation functions like the sigmoid, which outputs a continuous value between 0 and 1. This makes calculus possible — you can compute how a tiny change in any weight changes the output. Real neurons don't work this way. But the math requires it.
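The trade described above is easy to see numerically. The sigmoid σ(z) = 1/(1 + e⁻ᶻ) has the convenient closed-form derivative σ′(z) = σ(z)(1 − σ(z)), which a finite-difference check confirms; the hard threshold, by contrast, is flat everywhere its derivative exists, so gradients through it carry no information:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # the closed-form derivative

# Finite-difference check at a few points
eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(z, sigmoid_grad(z), numeric)   # the two columns agree

def step(z):
    return 1 if z > 0 else 0

# A tiny nudge to a step unit's input changes its output not at all
# (almost everywhere), so there is no slope to descend.
print(step(0.3) - step(0.3 + 1e-6))  # 0
```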

The 1986 paper ignited a connectionist revival. Funding returned. The PDP (Parallel Distributed Processing) Research Group at UCSD, led by Rumelhart and James McClelland, published a landmark two-volume set in the same year that laid out the theoretical foundations for connectionist cognitive science.12 Neural networks were back.

What Was Kept, What Was Dropped, What Was Gained

By the late 1980s, the artificial neural network had crystallized into a form recognizable as the ancestor of modern deep learning. It's worth pausing here to take stock of the translation from biology to math — what survived, what didn't, and what was added that biology never had.

Kept from biology

  - Networks of simple units: computation emerges from many neuron-like elements rather than from one complex processor.
  - Weighted connections: a connection's strength determines its influence, and learning means changing those strengths.
  - Layered hierarchy: stacked layers build increasingly abstract representations of the input.

Dropped from biology

  - Spikes and timing: discrete action potentials were replaced by continuous activation values.
  - Hard thresholds: the all-or-none firing decision gave way to smooth activation functions so gradients could flow.
  - Local learning: Hebbian updates need only local information, but backpropagation requires a global error signal with no clear biological counterpart.13
  - Neuromodulation, structural plasticity, and the rest of the synaptic machinery covered in Chapter 2.

Gained (not in biology)

  - Differentiability: the entire network is a smooth function, so calculus applies end to end.
  - A global loss function: a single number measures how wrong the whole network is.
  - Gradient descent: every weight is adjusted simultaneously, in the direction that reduces that loss.

Key idea: The relationship between artificial and biological neural networks is best understood as inspiration, not imitation. The analogy was productive — it suggested the right architecture (networks of simple units with adjustable connections). But the training mechanism (backpropagation) is a mathematical invention with no clear biological counterpart, and the features that make artificial networks work (differentiability, global loss functions, gradient descent) are precisely the features that biology lacks. By the late 1980s, the two fields had diverged into distinct disciplines — neuroscience studying biological brains, machine learning studying what works computationally, regardless of biological fidelity.

The Second Pause

The connectionist revival of the late 1980s didn't last forever. By the early 1990s, neural networks hit practical limits. Training deep networks (more than two or three hidden layers) was notoriously unstable — gradients either vanished (shrank exponentially as they propagated backward through layers, so early layers learned nothing) or exploded (grew exponentially, causing weights to diverge). The vanishing gradient problem was formally characterized by Sepp Hochreiter in his diploma thesis (1991) and later in Hochreiter and Schmidhuber's landmark LSTM paper (1997).14
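The arithmetic behind the vanishing gradient is stark. Each sigmoid layer multiplies the backward error signal by σ′(z), which is at most 0.25, so the signal shrinks exponentially with depth — a toy calculation, assuming the best case σ′ = 0.25 at every layer and ignoring the weight factors:

```python
# Best-case sigmoid derivative is 0.25 (attained at z = 0).
# Backpropagating through n layers multiplies the gradient by at most 0.25^n.
grad = 1.0
for layer in range(10):
    grad *= 0.25
print(grad)  # about 9.5e-07 after only 10 layers
```

With the error signal reduced a million-fold, the early layers of a 10-layer sigmoid network receive essentially no learning signal at all — which is why networks of that depth were untrainable in practice.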

At the same time, alternative machine learning methods — support vector machines (SVMs), kernel methods, decision trees, random forests — offered strong performance with cleaner theoretical guarantees and less finicky training. Through the 1990s and early 2000s, these methods often outperformed neural networks on practical tasks and became the standard tools of machine learning.

Neural networks didn't disappear again, but they lost their position as the dominant paradigm. The second resurgence — triggered by Hinton's deep belief networks (2006), the ImageNet moment (2012), and the GPU computing revolution — would come later. Those developments belong to Part III of this guide.

The Timeline

From Biology to Backpropagation

  1943: McCulloch-Pitts neuron. Formal proof that neural networks can compute any Boolean function.
  1949: Hebb's Organization of Behavior. Learning as synaptic weight change; "fire together, wire together."
  1958: Rosenblatt's perceptron. First trainable classifier; convergence theorem proved; massive hype.
  1969: Minsky and Papert's Perceptrons. XOR impossibility; neural network funding collapses.
  ~1969-1982: The first AI winter.
  1974: Werbos describes backpropagation in his PhD thesis. Goes largely unnoticed.
  1986: Rumelhart, Hinton, and Williams publish backpropagation in Nature. Connectionist revival begins; XOR solved; multi-layer networks become trainable.

Looking Forward

This chapter covered the first fifty years of the journey from biological neurons to artificial ones — a period that established every foundational idea modern deep learning rests on: the formal neuron, learned weights, layered networks, gradient-based training, and the ongoing tension between biological inspiration and mathematical convenience.

But everything covered so far has been qualitative and conceptual. The perceptron convergence theorem was mentioned but not derived. Backpropagation was described in intuition but not in math. Gradient descent was invoked as a phrase, not as an equation. To actually understand how neural networks work — not as metaphors but as systems — you need the mathematical tools: vectors and matrices for representing data and weights, derivatives and gradients for optimization, probability for handling uncertainty.

That's where Chapter 4 begins.

Next: Chapter 4 — Mathematical Foundations. Linear algebra, calculus, probability, and information theory — each concept tied to where it appears in AI. The actual math behind vectors, gradients, loss functions, and everything the network computes.

1 McCulloch, W. S. and Pitts, W. (1943), "A Logical Calculus of the Ideas Immanent in Nervous Activity." Bulletin of Mathematical Biophysics 5:115-133. Pitts was approximately 17-18 years old at the time of this work; his exact birthdate is disputed. He had been living around the University of Chicago and studying with Rudolf Carnap and others.

2 Hebb, D. O. (1949), The Organization of Behavior: A Neuropsychological Theory. New York: Wiley. The quote given is from Chapter 4, "The First Stage of Perception: Growth of the Assembly."

3 Rosenblatt, F. (1958), "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review 65(6):386-408.

4 The convergence theorem was presented in Rosenblatt's 1962 book, Principles of Neurodynamics. Earlier proofs of convergence for similar algorithms had appeared in work by Agmon (1954) and Motzkin and Schoenberg (1954) in the context of solving systems of linear inequalities.

5 Minsky, M. and Papert, S. (1969), Perceptrons: An Introduction to Computational Geometry. MIT Press. An expanded edition with a new foreword addressing multi-layer networks was published in 1988.

6 The characterization of this period as a "winter" is somewhat retrospective. Work continued in relative isolation. Grossberg's Adaptive Resonance Theory dates from the 1970s. Kohonen's Self-Organizing Maps were published in 1982. Anderson and others at Brown maintained a connectionist research program throughout.

7 Haugeland, J. (1985), Artificial Intelligence: The Very Idea. MIT Press. Haugeland coined "GOFAI" specifically to distinguish logic-and-symbol-based AI from emerging connectionist approaches.

8 MYCIN: Shortliffe (1976). DENDRAL: Feigenbaum, Buchanan, and Lederberg, development began 1965. R1/XCON: McDermott (1982), deployed at DEC for computer system configuration. R1 reportedly saved DEC $40 million per year at its peak.

9 Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University. Werbos has noted that the work was largely ignored for over a decade.

10 Linnainmaa, S. (1970), "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors." Master's thesis, University of Helsinki. This described what is now called reverse-mode automatic differentiation, the general mathematical technique underlying backpropagation.

11 Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986), "Learning representations by back-propagating errors." Nature 323:533-536.

12 Rumelhart, D. E. and McClelland, J. L. (eds.) (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 2 vols. MIT Press. These volumes became the foundational texts of the connectionist movement.

13 The biological credit assignment problem remains open. Proposed biological alternatives to backpropagation include feedback alignment (Lillicrap et al., 2016), predictive coding (Rao and Ballard, 1999), and target propagation (Lee et al., 2015), but none has been definitively demonstrated in biological neural circuits.

14 Hochreiter, S. (1991), "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universitat Munchen. The more widely cited treatment is Hochreiter, S. and Schmidhuber, J. (1997), "Long Short-Term Memory." Neural Computation 9(8):1735-1780, which both characterized the vanishing gradient problem and proposed the LSTM architecture as a solution.