> Backpropagation_
Rumelhart, Hinton, and Williams made backpropagation practical for multi-layer networks.
> DEEP DIVE_
The story of backpropagation — the algorithm that makes modern deep learning possible — is a story of an idea discovered, forgotten, rediscovered, and finally recognized. The basic concept was first described by Henry J. Kelley in 1960 in the context of control theory, and independently by Stuart Dreyfus in 1962. In 1969, Arthur Bryson and Yu-Chi Ho included it in their textbook on optimization. Paul Werbos described its application to neural networks in his 1974 PhD thesis at Harvard, but the thesis was largely ignored. The idea simply could not get traction in the frozen landscape of the first AI winter, when neural networks were considered a dead end thanks to Minsky and Papert.
Everything changed in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-propagating Errors" in the journal Nature. The paper was only a few pages long, but it demonstrated clearly and convincingly that multi-layer neural networks could be trained using backpropagation to learn complex, non-linear functions. This was the answer to Minsky and Papert's devastating 1969 critique: yes, single-layer perceptrons could not solve the XOR problem, but multi-layer networks trained with backpropagation could solve XOR and vastly more complex problems besides. The key insight was the chain rule of calculus: by propagating error signals backward through the network, layer by layer, you could compute how much each weight contributed to the overall error and adjust it accordingly.
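To make the chain-rule idea concrete, here is a minimal sketch in NumPy of a small two-layer network learning XOR with backpropagation. The hidden width, learning rate, and step count are illustrative choices, not the setup from the 1986 paper; the point is only that the backward pass turns the output error into a weight update for every layer.

```python
# A minimal sketch of backpropagation on XOR, in plain NumPy.
# Hyperparameters here are illustrative, not the 1986 paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# XOR: the dataset a single-layer perceptron cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One small hidden layer is enough to make XOR learnable.
W1, b1 = rng.normal(scale=1.0, size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=1.0, size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for step in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)        # hidden activations
    out = sigmoid(h @ W2 + b2)      # network output

    # Backward pass: the chain rule, applied layer by layer, converts the
    # output error into a gradient for every weight in the network.
    d_out = (out - y) * out * (1 - out)    # error through the output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)     # error pushed back through W2 and the hidden sigmoid

    # Gradient-descent step.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 3))  # should approach [[0], [1], [1], [0]]
```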
The 1986 paper was part of a broader intellectual movement known as the PDP (Parallel Distributed Processing) group, led by Rumelhart and James McClelland. Their two-volume book, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition," published the same year, was a manifesto for connectionism — the idea that intelligence emerges from the interactions of many simple processing units, rather than from explicit rules and symbols. It was a direct challenge to the symbolic AI paradigm that had dominated the field since Dartmouth. The PDP volumes became unlikely bestsellers in academic publishing, selling tens of thousands of copies and inspiring a new generation of researchers to work on neural networks.
Yet backpropagation's triumph was not immediate. The algorithm had serious practical limitations in the 1980s. Training deep networks — those with many layers — was painfully slow on the hardware available at the time. Worse, deep networks suffered from the "vanishing gradient problem": as error signals were propagated backward through many layers, they shrank exponentially, meaning that early layers learned almost nothing. Networks with more than two or three hidden layers were effectively untrainable. It would take another 20 years — until Geoffrey Hinton's 2006 breakthrough with deep belief networks — before researchers found ways to train truly deep networks. But the 1986 paper had established the principle. Backpropagation was the engine; the machine to run it at full speed had not yet been built. That machine eventually arrived in the form of GPUs, big data, and a series of clever architectural innovations that turned a short Nature paper into a trillion-dollar industry.
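A rough numerical sketch makes the vanishing gradient visible. The snippet below is an illustration with arbitrary depth, width, and weight scale, not a reconstruction of any period experiment: it pushes an error signal backward through a stack of sigmoid layers, and because the sigmoid's derivative never exceeds 0.25, the signal shrinks with every layer it crosses.

```python
# Illustrating the vanishing gradient problem: the depth, width, and weight
# scale below are arbitrary choices for demonstration purposes only.
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 20, 50
x = rng.normal(size=width)

# Forward pass, caching each layer's activations for the backward pass.
weights, activations = [], [x]
a = x
for _ in range(n_layers):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    weights.append(W)
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from an arbitrary unit-sized error at the output and
# apply the chain rule layer by layer, recording how large the signal is
# when it reaches each layer's weights.
g = np.ones(width)                       # d(loss)/d(activation) at the output
norms = []
for l in reversed(range(n_layers)):
    a_next = activations[l + 1]
    delta = g * a_next * (1.0 - a_next)  # through the sigmoid derivative (<= 0.25)
    norms.append(np.linalg.norm(delta))  # size of the learning signal at layer l
    g = weights[l].T @ delta             # pass the error down to the layer below

for depth, norm in enumerate(reversed(norms), start=1):
    print(f"layer {depth:2d}: |error signal| = {norm:.2e}")
# The norms fall by many orders of magnitude from the last layer back to the
# first: the early layers receive almost no learning signal.
```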