AI Essential Reading: The Bitter Lesson | 5Y View

5Y Capital · February 22, 2024

The most important lesson learned in AI research over the past 70 years.

Recently, as public discussion of OpenAI's text-to-video model Sora continues to heat up, an OpenAI engineer's daily schedule has also drawn attention. One item on it is studying The Bitter Lesson, a 2019 classic by Richard S. Sutton, the Canadian computer scientist known as the godfather of reinforcement learning.

The article argues that the biggest detour in 70 years of AI research has been an excessive reliance on existing human experience and knowledge. The real path toward AGI, Sutton contends, lies in setting aside domain-specific human knowledge and leveraging massive compute instead. We've selected this piece in the hope that it offers you something worth thinking about :)

By: Richard S. Sutton | Translated by: Zhezhou Zhou

Originally published March 13, 2019.

The Bitter Lesson

Richard S. Sutton

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason is Moore's Law, or more precisely its generalization: the cost per unit of computation keeps falling exponentially. Most AI research has proceeded as though the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance), but over a horizon only slightly longer than a typical research project, massively more computation inevitably becomes available. To seek short-term improvements, researchers try to draw on existing human knowledge of the domain, but in the long run, the only thing that matters is leveraging computation. Methods built on human knowledge tend to be complex in ways that make them poorly suited to taking advantage of general methods that leverage computation. There are many examples of AI researchers learning this bitter lesson belatedly, and it is instructive to review some of the most prominent:

In computer chess, the method that defeated world champion Garry Kasparov in 1997 was based on massive deep search. At the time, many computer chess researchers were disappointed — they had been working on methods that leveraged human understanding of chess's special structure. When a simpler search-based approach, combined with specialized hardware and software, proved far more successful, these human-knowledge-based chess researchers did not accept defeat graciously. They argued that "brute-force" search might have won this time, but it wasn't a general strategy, and it wasn't how people played chess. These researchers wanted to win by emulating human thought processes, and they were disappointed when it didn't work.

In computer Go, a similar pattern of research progress played out, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by exploiting human knowledge (a thousand years of game records) or special features of the game, but all of these efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self-play to learn a value function, as it was in many other games and even in chess, although learning played no significant role in the program that first defeated the world chess champion in 1997. Learning by self-play, like search, allows massive computation to be brought to bear. Search and learning are the two most important classes of techniques for using massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed toward using human understanding so that less search was needed, and much greater success came only later by embracing search and learning.

In speech recognition, an early competition sponsored by DARPA in the 1970s saw contestants bending over backward to deploy a range of human-knowledge-based tricks — understanding of words, phonemes, the human vocal tract, and so on. On the other side were more statistical methods, which did more computation using hidden Markov models (HMMs). Once again, statistical methods triumphed over human-knowledge-based approaches. This catalyzed a major shift in natural language processing, where statistics and computation gradually came to dominate the field over subsequent decades. The more recent rise of deep learning in speech recognition represents the latest step in this ongoing trajectory. Deep learning methods rely even less on human knowledge, use far more computation, and learn on massive training sets to produce better speech recognition systems. As in games, researchers kept trying to build systems that worked the way they thought their own minds worked — they tried to encode that knowledge into their systems — but this proved counterproductive and a tremendous waste of researchers' time, because through Moore's Law, large-scale computation became feasible and could be put to much better use.

In computer vision, a similar pattern holds. Early approaches conceived of vision as searching for edges or generalized cylinders, or in terms of SIFT features. But all of this has been discarded today. Modern deep neural networks use only the notions of convolution and certain kinds of invariance, and they perform far better.

This is a big lesson. As a field, we still have not thoroughly learned it, because we keep making the same kind of mistakes. To see this, and to resist the temptation effectively, we must understand the seductive appeal of these mistakes. We must learn the bitter lesson that building in how we think we think does not work in the long run.

This bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives through an opposing approach based on scaling computation by search and learning.

One thing to take from this bitter lesson is the great power of general-purpose methods — methods that continue to scale with increasing available compute. The two methods that seem to scale indefinitely in this way are search and learning.
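As a rough illustration of what "scaling with compute" can mean for search, here is a minimal Python sketch. It is not from Sutton's essay: the toy game (a pile of stones, each player removes 1 to 3, whoever takes the last stone wins), the function name `negamax`, and the depth budget are all assumptions chosen for illustration. The search is given only the rules of the game, not its known winning strategy, and the depth limit stands in for the amount of computation available.

```python
from functools import lru_cache

# Hypothetical toy rules, chosen only for illustration:
# a player may remove 1, 2, or 3 stones; taking the last stone wins.
MOVES = (1, 2, 3)


@lru_cache(maxsize=None)
def negamax(stones: int, depth: int) -> int:
    """Value of the position for the player to move.

    +1 = proven win, -1 = proven loss, 0 = undecided because the
    search budget (depth, in plies) ran out before the game ended.
    """
    if stones == 0:
        # The opponent just took the last stone, so the mover has lost.
        return -1
    if depth == 0:
        # Budget exhausted: no built-in knowledge to fall back on.
        return 0
    best = -1
    for take in MOVES:
        if take <= stones:
            # The opponent's value, negated, is our value for this move.
            best = max(best, -negamax(stones - take, depth - 1))
    return best


if __name__ == "__main__":
    print(negamax(10, 2))   # → 0  (too little search to decide)
    print(negamax(10, 10))  # → 1  (enough search proves a win)
    print(negamax(8, 8))    # → -1 (enough search proves a loss)
```

With a budget of 2 plies the search cannot decide the 10-stone position; with a budget of 10 it proves a win. Nothing in the code changes between the two runs except the amount of computation it is allowed to spend.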

The second general point to learn from this bitter lesson is that the actual contents of minds are tremendously complex; we should stop trying to find simple ways to think about those contents, such as simple ways to think about space, objects, multiple agents, or symmetries. All of these are part of the arbitrary, intrinsically complex outside world. They should not be built in, because their complexity is endless; instead, we should build in only the meta-methods that can find and capture this arbitrary complexity. What is essential to these methods is that they can find good approximations, but the search for those approximations should be done by our methods (such as learning), not by us. We want AI agents that can discover as we do, not agents that contain what we have discovered.
