It’s no secret that building a large language model (LLM) requires huge amounts of data. In conventional training, an LLM is fed mountains of text and trained to predict each word before it appears. With every prediction, the model makes tiny adjustments to improve its chances of guessing correctly. The end result is a system with some statistical “understanding” of what is proper language and what is not.
However, an LLM that has only undergone this so-called pre-training is not yet particularly useful. When asked to tell a joke, for example, the pre-trained GPT-2 model simply repeated the question back three times. When asked who the American president was, it replied: “The answer is no, the president is not the president.” Teaching an LLM to do what people actually want clearly requires something more.
One way to adapt such models to user expectations is reinforcement learning from human feedback (RLHF). OpenAI, an American startup, introduced the technique in a preprint published in March 2022. It was the main ingredient in its recipe for ChatGPT, which was released eight months later.
RLHF usually consists of three stages. First, volunteers are asked to choose which of two candidate LLM responses better fits a given prompt, a judgment repeated many thousands of times. The resulting dataset is then used to train a second LLM that effectively stands in for the human. This so-called reward model, designed to assign higher scores to the responses humans preferred and lower scores to everything else, is then used to train the original LLM. Finally, a machine-learning technique called reinforcement learning adjusts the knobs and levers of the original LLM to reinforce the behaviours that earn it reward.
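To make the second stage concrete, here is a minimal sketch of how a reward model is typically trained on such preference data, assuming a PyTorch-style setup; the function name and the toy numbers are illustrative, not code from OpenAI’s paper. The idea is simply to nudge the model’s score for the response the volunteer preferred above its score for the one they rejected.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # Each element is the reward model's scalar score for the response the
    # volunteer preferred ("chosen") and the one they did not ("rejected"),
    # for one prompt in the batch. The pairwise logistic loss pushes chosen
    # scores above rejected ones.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores for three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(reward_model_loss(chosen, rejected))  # a single scalar to minimise
```

Once trained, that reward model supplies the scores that the reinforcement-learning stage tries to maximise.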
This way of doing RLHF is quite complicated: running two separate LLMs takes time and money, and the algorithm used for reinforcement learning is, in the words of Rafael Rafailov of Stanford University, “quite painful.” That has meant that, apart from OpenAI, Google and their rivals, few have really exploited its full potential.
Now it turns out that the same results can be achieved with a fraction of the effort. Dr. Rafailov and his colleagues, including Archit Sharma and Eric Mitchell, presented the alternative in December 2023 at NeurIPS, an artificial-intelligence conference. Their method, Direct Preference Optimization (DPO), relies on a satisfying mathematical trick.
The trick rests on the observation that for every reward model there is a particular theoretical LLM that would score full marks on it, and that every LLM likewise has a theoretical reward model that would award it flying colors. (Just as, to put it more prosaically, every pair of trousers has a theoretical person they would fit perfectly, and every person has a theoretical pair of trousers that would fit them best.) This insight, that every LLM has a reward model hidden inside it, allowed the researchers to tinker with that model directly. In the old system, the LLM learned from a reward model, which learned from the data. Now the LLM can learn directly from the data.
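What that looks like in practice is a single loss function applied straight to the preference data. The sketch below, again assuming a PyTorch-style setup with illustrative names and toy numbers, shows the core of the DPO objective: the “hidden” reward of a response is measured by how much more probable the model being trained makes it compared with a frozen reference copy of itself, and those implicit rewards are then ranked exactly as the separate reward model would have been trained to rank them.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how far the trained model has shifted each response's
    # log-probability relative to the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Train the model to rank the human-preferred response above the other,
    # with the same pairwise logistic loss used for reward models in RLHF.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage: summed log-probabilities for three preference pairs
policy_chosen = torch.tensor([-12.0, -20.0, -8.0])
policy_rejected = torch.tensor([-15.0, -18.0, -9.5])
ref_chosen = torch.tensor([-13.0, -19.0, -8.5])
ref_rejected = torch.tensor([-14.0, -18.5, -9.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because this single step replaces both the reward-model training and the reinforcement-learning loop, there is no second model to run and no painful RL algorithm to tune.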
According to the authors, removing the middleman makes DPO three to six times more efficient than RLHF, and better at tasks such as text summarization. Its ease of use is already letting smaller companies tackle the alignment problem, says Dr. Sharma. A year ago, only the world’s leading models, such as Google’s Gemini and OpenAI’s GPT-4, could afford RLHF. As of March 12th, however, eight of the ten highest-ranked LLMs on one industry leaderboard used DPO. Mistral, a French startup hoping to compete with OpenAI, uses it. Meta, the social-media giant, has built it into its in-house LLM.
More improvements are sure to come. For one thing, the consensus is that the big AI labs have continued to improve their proprietary algorithms since they stopped publishing details in 2022. Even so, the problem of getting LLMs to do what a person wants and expects is still a long way from done and dusted. After all, even other people struggle with that sometimes.
© 2024, The Economist Newspaper Ltd. All rights reserved.
From The Economist, published under license. Original content can be found at www.economist.com