> GPT-1 and BERT_
The "pre-train → fine-tune" paradigm was established.
> DEEP DIVE_
The year 2018 marked the dawn of the pre-training era, a paradigm shift that transformed natural language processing from a collection of specialized tools into something approaching general-purpose language understanding. In June, OpenAI released GPT-1 (Generative Pre-trained Transformer), a model with 117 million parameters that was first pre-trained on a large corpus of text using unsupervised learning and then fine-tuned on specific tasks. The key insight was that a model trained to predict the next word in a sequence could, in the process, learn rich representations of language that transferred effectively to downstream tasks like sentiment analysis, question answering, and textual entailment.
Four months later, in October 2018, Google dropped a bombshell of its own: BERT (Bidirectional Encoder Representations from Transformers). While GPT-1 was unidirectional, reading text from left to right, BERT was bidirectional, able to consider context from both directions simultaneously during pre-training. BERT's training objective was elegantly simple: it masked random words in a sentence and trained the model to predict the masked words using both left and right context. With 340 million parameters in its large variant, BERT achieved state-of-the-art results on eleven NLP benchmarks simultaneously, often by wide margins. Google later integrated BERT into its search engine, in 2019, calling it the biggest improvement to search in five years.
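The masking step itself is straightforward to sketch. This is a simplified version: BERT's actual recipe masks about 15% of tokens and sometimes keeps the original token or swaps in a random one rather than always inserting `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; return the masked sequence plus a map of
    masked positions to the original tokens the model must recover.
    Simplified relative to BERT's full masking recipe."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok   # must be recovered from context on both sides
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

The crucial property is that predicting a masked word can draw on tokens to its right as well as its left, which a strictly left-to-right model cannot do.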
The concept of transfer learning, reusing knowledge learned on one task to solve another, had been standard practice in computer vision since AlexNet popularized ImageNet pre-training. But GPT-1 and BERT demonstrated that the same principle applied to language with even more dramatic effect. A single pre-trained model could be fine-tuned for dozens of different language tasks with minimal task-specific modification. This eliminated the need to design custom architectures for each new problem and democratized NLP: researchers with modest compute budgets could fine-tune a pre-trained model rather than training from scratch.
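The fine-tuning recipe reduces to: reuse the representations the pre-trained model already computes, and train only a small task-specific head on top. The sketch below stands in for the pre-trained encoder with a toy fixed feature extractor (a two-word-list sentiment count); the lexicons, function names, and training setup are illustrative, not any real model's API.

```python
import math

# Stand-in lexicons for the toy "pre-trained" feature extractor.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def pretrained_features(text):
    """Frozen stand-in for pre-trained representations: two word counts."""
    words = text.lower().split()
    return [sum(w in POSITIVE for w in words),
            sum(w in NEGATIVE for w in words)]

def fine_tune(examples, lr=0.5, epochs=200):
    """Train only a logistic-regression head on the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_features(text)
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - label                  # gradient of the log loss
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

train = [("a great movie", 1), ("i love it", 1),
         ("an awful film", 0), ("i hate this", 0)]
w, b = fine_tune(train)

def predict(text):
    x = pretrained_features(text)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
```

The point of the sketch is the division of labor: the expensive part (the feature extractor) is trained once, while each downstream task adds only a cheap head, which is why fine-tuning fit modest compute budgets.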
Also in 2018, Google demonstrated Duplex at its I/O conference, an AI system that could make phone calls to book restaurant reservations and hair appointments, complete with natural-sounding "um"s and "mm-hmm"s that made it nearly indistinguishable from a human caller. The demo was simultaneously thrilling and unsettling, raising immediate questions about disclosure and consent. Together, GPT-1, BERT, and Duplex signaled that the era of "foundation models," large pre-trained models that could be adapted to a wide range of tasks, had begun. The implications were still years from being fully understood, but the trajectory was unmistakable: language AI was about to change everything.