2015: Breaking the Depth Barrier

> ResNet — Deeper Than Human_

152 layers. 3.57% error vs human 5.1%. Machines see better.

> DEEP DIVE_

By 2015, deep learning researchers had established a seemingly obvious principle: deeper networks should be more powerful. More layers meant more capacity to learn complex representations. But in practice, simply stacking more layers hit a wall. Networks deeper than about 20 layers would actually perform worse than shallower ones, not because of vanishing gradients (a problem largely tamed by then through careful initialization and batch normalization) but because of a mysterious "degradation problem": a 56-layer network would produce higher training error than a 20-layer one, which made no theoretical sense. If the extra layers could at least learn the identity function, the deeper network should be no worse. Something was fundamentally wrong with how deep networks learned.

Kaiming He, then a researcher at Microsoft Research Asia in Beijing, proposed a solution of breathtaking simplicity: skip connections. Instead of requiring each layer to learn the desired mapping directly, ResNet's residual blocks learned only the difference (the "residual") between the desired output and the input. Each block included a shortcut connection that let the input bypass one or more layers and be added directly to the block's output. If the optimal thing for a block to do was nothing, it simply needed to drive its weights toward zero, letting the input pass through unchanged, which is far easier to optimize than learning an identity mapping from scratch.
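The mechanics fit in a few lines. Here is a minimal sketch of a basic residual block in PyTorch; the class name, layer sizes, and the identity shortcut with matching channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic two-layer residual block in the spirit of ResNet (illustrative)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The stacked layers learn only the residual F(x) = H(x) - x.
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # The shortcut adds the input back, so the block outputs F(x) + x.
        # Driving the weights toward zero reduces the block to the identity.
        return self.relu(residual + x)

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output shape matches input: (1, 64, 56, 56)
```

When a block changes the channel count or spatial resolution, the paper swaps the identity shortcut for a projection (a strided 1×1 convolution) so the addition still lines up.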

The results at the 2015 ImageNet competition were staggering. ResNet, with a depth of 152 layers, more than eight times deeper than the previous year's VGG networks, won first place with a top-5 error rate of 3.57%. For the first time, a machine had surpassed the estimated human error rate of approximately 5.1% on the ImageNet classification task. The result made headlines around the world: on this benchmark, a computer could now identify objects in photographs more accurately than a person. Kaiming He and his co-authors, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, published their paper "Deep Residual Learning for Image Recognition," which would go on to become one of the most cited papers in the history of computer science.

The impact of skip connections extended far beyond image classification. The residual learning principle was adopted across virtually every domain of deep learning. Transformers, the architecture behind GPT and BERT, use residual connections around every attention and feed-forward layer. Modern speech recognition, machine translation, and protein structure prediction all rely on architectures that incorporate residual connections. ResNet did not just win a competition; it solved the fundamental problem of training very deep networks, unlocking an era in which depth became a reliable path to power. The core insight, that it is easier to learn a small correction than a full transformation, turned out to be one of the most consequential ideas in the history of neural network design.
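To see how the same trick transfers, here is a rough sketch of a transformer-style residual wrapper in PyTorch; the class name, the pre-norm ordering, and the dimensions are illustrative assumptions rather than any specific model's layout:

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Wraps any sub-layer (attention, feed-forward, ...) in a residual
    connection plus layer normalization, one common transformer convention."""

    def __init__(self, sublayer: nn.Module, dim: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same principle as ResNet: the sub-layer learns a correction,
        # while the shortcut carries the input through unchanged.
        return x + self.sublayer(self.norm(x))

# Illustrative use: wrapping a feed-forward layer (dimensions are arbitrary).
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = ResidualSublayer(ffn, dim=512)
out = block(torch.randn(2, 16, 512))  # shape preserved: (batch, seq, dim)
```

Because each wrapped sub-layer only has to learn a correction, dozens of such blocks can be stacked without the degradation that plagued pre-ResNet architectures.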