1997 · Memory Unlocked

> LSTM_

Hochreiter and Schmidhuber solved the vanishing gradient problem.

> DEEP DIVE_

In 1997, Sepp Hochreiter, at the Technical University of Munich, and Jürgen Schmidhuber, at IDSIA in Switzerland, published a paper introducing the Long Short-Term Memory (LSTM) network — a specialized neural network architecture designed to solve one of the most fundamental problems in deep learning. Standard recurrent neural networks (RNNs), which processed sequential data by feeding their outputs back as inputs, suffered from the "vanishing gradient problem": when trained using backpropagation through time, the error signals that flowed backward through many time steps shrank exponentially, making it effectively impossible for the network to learn long-range dependencies. If you wanted a network to understand that a word at the beginning of a paragraph was relevant to a word at the end, a standard RNN simply could not maintain the connection.
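The decay is easy to see in a toy calculation. This sketch is illustrative, not the paper's derivation: it assumes a single recurrent unit with a made-up weight of 0.9 and tanh activation, and tracks how a backpropagated error signal is scaled at each earlier time step.

```python
import math

def gradient_magnitude(steps, w=0.9, h=0.5):
    """Toy model: each step back in time multiplies the error signal
    by roughly w * tanh'(h), which here is less than 1, so the
    product shrinks exponentially with the number of steps."""
    grad = 1.0
    for _ in range(steps):
        grad *= w * (1.0 - math.tanh(h) ** 2)  # tanh'(h) = 1 - tanh(h)^2
    return grad

for n in (1, 10, 50):
    print(n, gradient_magnitude(n))
```

After 50 steps the signal is many orders of magnitude smaller than after one step, which is why a standard RNN cannot connect the start of a paragraph to its end.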

The LSTM's solution was architecturally elegant: it introduced a "memory cell" — a dedicated unit that could store information over long periods — protected by three "gates." The forget gate decided what information to discard from the cell. The input gate decided what new information to store. The output gate decided what information to send to the next layer. Each gate was controlled by a sigmoid unit that learned, during training, when to open and when to close. The result was a network that could selectively remember or forget information across hundreds or even thousands of time steps — a feat that standard RNNs could not achieve in practice.
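The three gates can be sketched as a single-unit LSTM step in pure Python. This is a minimal illustration, not the paper's exact formulation (the original 1997 design had no forget gate; that was added by Gers et al. in 2000, and is shown here because it is the form in universal use today). The weight values are placeholders, not trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step for a single unit.

    Each gate is a sigmoid in [0, 1] that scales how much information
    flows: f protects old memory, i admits new memory, o controls what
    the cell exposes to the rest of the network.
    """
    f = sigmoid(W["f_x"] * x + W["f_h"] * h_prev + W["f_b"])    # forget gate
    i = sigmoid(W["i_x"] * x + W["i_h"] * h_prev + W["i_b"])    # input gate
    o = sigmoid(W["o_x"] * x + W["o_h"] * h_prev + W["o_b"])    # output gate
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev + W["g_b"])  # candidate memory
    c = f * c_prev + i * g      # cell state: keep old memory, add new
    h = o * math.tanh(c)        # hidden state sent to the next layer
    return h, c

# Placeholder weights (illustrative only) and a short input sequence.
W = {k: 0.5 for k in ("f_x", "f_h", "f_b", "i_x", "i_h", "i_b",
                      "o_x", "o_h", "o_b", "g_x", "g_h", "g_b")}
h, c = 0.0, 0.0
for x in (1.0, 0.0, -1.0):
    h, c = lstm_step(x, h, c, W)
```

The key design choice is the additive cell update `c = f * c_prev + i * g`: when the forget gate stays near 1, old memory passes through essentially unchanged, and gradients flow back along that same path without the repeated shrinking multiplications that cripple a standard RNN.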

Despite the elegance of its design, the LSTM spent nearly a decade in relative obscurity. The late 1990s and early 2000s were dominated by support vector machines, random forests, and other methods that worked well on the structured, tabular data that most machine learning practitioners dealt with. Sequential data — text, speech, time series — was a niche concern. Hochreiter and Schmidhuber continued refining the architecture, and a small community of researchers explored its applications, but major conferences and journals showed limited interest. The LSTM was a solution waiting for its problem to become important.

That problem arrived with the smartphone era. As voice interfaces like Apple's Siri pushed speech recognition into daily use, LSTMs became the workhorse: when Google rebuilt its voice recognition pipeline in the mid-2010s, it used LSTMs, and when Google Translate was overhauled in 2016 with a neural machine translation system that dramatically improved translation quality, LSTMs were at its core. The architecture published in Neural Computation in 1997 and largely ignored for a decade was suddenly processing billions of queries per day. Hochreiter and Schmidhuber's paper became one of the most cited in all of computer science. The LSTM demonstrated a recurring pattern in AI history: the most important ideas often arrive years or decades before the world is ready for them, and the researchers who develop them must endure long periods of obscurity before vindication arrives.