> TD-Gammon_
Self-play: 1.5M games to reach world-champion level.
> DEEP DIVE_
In 1992, Gerald Tesauro, a researcher at IBM's Thomas J. Watson Research Center, created TD-Gammon, a neural network that taught itself to play backgammon at a world-class level through pure self-play. The program used temporal difference (TD) learning, a form of reinforcement learning in which the network learned by comparing its predictions of future game states against what actually happened. Tesauro gave TD-Gammon no human expert knowledge of backgammon strategy beyond the basic rules. Starting as a complete novice, the network played approximately 1.5 million games against itself over several months; by the end, it was one of the best backgammon players on the planet.
What made TD-Gammon remarkable was not just its playing strength but the strategies it discovered. Backgammon had been played seriously for decades, and human experts had developed well-established theories about opening moves, positional play, and endgame technique. TD-Gammon's self-play produced strategies that contradicted conventional human wisdom. In particular, it favored certain opening moves and positional configurations that experts had long considered weak or risky. When top human players studied TD-Gammon's approach, many of them were initially skeptical — then gradually convinced. Several of TD-Gammon's innovations were adopted by the human backgammon community and are now considered standard play. The machine had not just learned from humans; it had taught them something new.
The technical foundation of TD-Gammon was elegant. The neural network took as input a representation of the current board state (the positions of all 30 checkers across the board's 24 points) and output an estimated probability of winning from that position. During self-play, after each move, the network compared its prediction for the current state with its prediction for the next state and adjusted its weights to minimize the difference. This is the essence of temporal difference learning: you learn by bootstrapping, using your own future predictions as a training signal. The approach was rooted in Richard Sutton's 1988 TD(λ) framework, but Tesauro was the first to demonstrate that it could scale to a complex, real-world game.
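To make that update concrete, here is a minimal sketch of TD(λ) with eligibility traces applied to a small value network, in the spirit of TD-Gammon rather than a reconstruction of it. The layer sizes roughly match the published network (about 198 input units and 40 hidden units), but the hyperparameters, the random stand-in "board features", and names like `td_lambda_episode` are assumptions for this sketch; the real program also used a richer outcome encoding and handled both players' perspectives.

```python
import numpy as np

# Minimal sketch of the TD(lambda) update at the heart of TD-Gammon.
# Layer sizes roughly follow the published network; everything else
# (hyperparameters, feature encoding, function names) is illustrative.

rng = np.random.default_rng(0)

N_INPUT, N_HIDDEN = 198, 40      # approximate TD-Gammon dimensions
ALPHA, LAMBDA = 0.1, 0.7         # learning rate, trace decay (assumed values)

# One hidden layer of sigmoid units, one sigmoid output: P(win).
W1 = rng.normal(0.0, 0.1, (N_HIDDEN, N_INPUT))
W2 = rng.normal(0.0, 0.1, N_HIDDEN)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value(x):
    """Estimated probability of winning from board features x."""
    h = sigmoid(W1 @ x)
    return sigmoid(W2 @ h), h

def gradients(x, h, v):
    """Gradient of the scalar output with respect to W1 and W2."""
    dv = v * (1.0 - v)                       # output sigmoid derivative
    g2 = dv * h
    g1 = np.outer(dv * W2 * h * (1.0 - h), x)
    return g1, g2

def td_lambda_episode(states, outcome):
    """Apply one self-play game's worth of TD(lambda) updates.

    states  -- feature vectors for the positions seen during the game
    outcome -- 1.0 if the tracked player won the game, else 0.0
    """
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)    # eligibility traces
    x = states[0]
    v, h = value(x)
    for t in range(1, len(states) + 1):
        # The target is the next prediction, or the true result at game end.
        target = outcome if t == len(states) else value(states[t])[0]
        delta = target - v                           # temporal-difference error
        g1, g2 = gradients(x, h, v)
        e1 = LAMBDA * e1 + g1                        # decay, then accumulate
        e2 = LAMBDA * e2 + g2
        W1 += ALPHA * delta * e1                     # nudge earlier predictions
        W2 += ALPHA * delta * e2                     # toward the new target
        if t < len(states):
            x = states[t]
            v, h = value(x)

# Toy usage: a random 10-position "game" that ends in a win.
game = [rng.random(N_INPUT) for _ in range(10)]
td_lambda_episode(game, outcome=1.0)
```

The eligibility trace is the key trick: each TD error nudges not only the most recent prediction but, with geometrically decaying weight, every earlier prediction in the game, which is how credit for the final result flows back toward the opening moves.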
TD-Gammon's influence extended far beyond backgammon. It was a proof of concept that reinforcement learning combined with neural networks could master complex strategic domains without human supervision. The lineage runs directly to DeepMind's AlphaGo, which defeated Go world champion Lee Sedol in 2016, and to AlphaGo Zero, which, like TD-Gammon, learned entirely from self-play without any human expert data. When Demis Hassabis and David Silver designed AlphaGo, they explicitly cited TD-Gammon as a key inspiration. The 1.5 million games of backgammon that TD-Gammon played against itself in 1992 were the distant ancestors of the millions of games of Go that AlphaGo Zero would play against itself a quarter-century later. Tesauro had demonstrated a principle that would reshape AI: given the right learning algorithm, a machine can discover strategies no human has ever imagined.