This hunger for the new has only intensified. In March Anthropic released Claude 3, which beat the previous top models from OpenAI and Google in various rankings. On April 9th OpenAI reclaimed the crown (by some measures) with an improved model. On April 18th Meta released Llama 3, which, according to preliminary results, is the most powerful open model to date. OpenAI is likely to make a splash this year with the release of GPT-5, which may have capabilities beyond any current large language model (LLM). If the rumors are to be believed, the next generation of models will be even more remarkable: able, for example, to carry out multi-step tasks rather than merely responding to prompts, or to analyze complicated questions carefully rather than blurting out the first algorithmically available answer.
For those who think this is just tech hype, consider this: investors are deadly serious about backing next-generation models. Training GPT-5 and other next-generation models is expected to cost billions of dollars. OpenAI is also reportedly working with Microsoft, its tech-giant backer, to build a new $100 billion data center. Going by the numbers alone, it would seem that boundless, exponential growth lies ahead. This chimes with a view shared by many AI researchers called the “scaling hypothesis”: that the architecture of current LLMs is on the path to unlocking phenomenal progress. All that is needed to exceed human capabilities, according to this hypothesis, is more data and more powerful computer chips.
But look closer at the technical frontier and you’ll notice some daunting obstacles.
Beauty is not enough
Data may be the most immediate bottleneck. Epoch AI, a research outfit, estimates that the well of high-quality text data on the public Internet will run dry by 2026. This has researchers scrambling for ideas. Some labs are turning to the private web, buying data from brokers and news outlets. Others are turning to the immense amounts of audio and visual data available on the Internet, which could be used to train ever-bigger models for decades. Video can be particularly useful for teaching AI models about the physics of the world around them. If a model can watch a ball flying through the air, it may more readily work out the mathematical equation that describes the projectile’s motion. Leading models such as GPT-4 and Gemini are now “multimodal”, capable of processing different types of data.
If data cannot be found, it can be created. Companies like Scale AI and Surge AI have built vast networks of people to generate and annotate data, including graduate students solving problems in math or biology. One executive at a leading AI startup estimates that this costs AI labs hundreds of millions of dollars a year. A cheaper approach is to generate “synthetic data”, in which one LLM produces billions of pages of text to train a second model. That method can run into problems, though: models trained this way can lose past knowledge and generate uncreative answers. A more fruitful way to train AI models on synthetic data is to have them learn through collaboration or competition. Researchers call this “self-play”. In 2017 Google DeepMind, the search giant’s AI lab, developed a model called AlphaGo that, after training against itself, beat the human world champion at the game of Go. Google and other companies now use similar techniques in their latest LLMs.
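To make the idea concrete, here is a stripped-down sketch of a self-play loop in Python: one policy plays a toy game against a copy of itself, and the games it generates become its own training signal. The game, the policy table and the update rule are invented for illustration; this is not AlphaGo’s actual method.

```python
import random

# A stripped-down self-play loop for a toy game: two players each pick a move,
# the higher move wins. The "policy" is just a preference weight over moves.
moves = [1, 2, 3]
policy = {m: 1.0 for m in moves}   # one shared policy, playing both sides

def sample_move():
    # Sample a move in proportion to the policy's preference weights.
    total = sum(policy.values())
    r = random.uniform(0, total)
    for m, w in policy.items():
        r -= w
        if r <= 0:
            return m
    return moves[-1]

for episode in range(1000):
    a, b = sample_move(), sample_move()   # the model plays against itself
    if a == b:
        continue                          # draws teach nothing in this toy game
    winner_move = max(a, b)
    policy[winner_move] += 0.1            # the games it generates become its training signal

print(policy)   # preference shifts toward stronger moves without any human data
```

The point of the exercise is that no human-labelled examples appear anywhere: the model’s own games are the dataset.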
Extending ideas like self-play to new domains is a hot topic of research. But most real-world problems, from running a business to being a good doctor, are more complicated than a game and lack clear winning moves. For such complex domains, data to train models is still needed from people who can distinguish good responses from bad ones. That, in turn, slows things down.
More silicon, but make it fashionable
Better hardware is another route to more capable models. Graphics processing units (GPUs), originally designed for video games, have become the chip of choice for most AI developers thanks to their ability to run intensive computations in parallel. One way to unlock new capabilities may be to use chips designed specifically for AI models. Cerebras, a Silicon Valley chipmaker, released a product in March with 50 times as many transistors as the largest GPU. Building a model is usually complicated by the need to keep loading data on and off the GPUs as the model is trained. Cerebras’s giant chip, by contrast, has memory built in.
New models that can take advantage of these developments will be more reliable and better able to handle tricky requests from users. One way this may happen is through larger “context windows”, the amount of text, image or video that a user can feed into a model when making a request. Enlarging context windows to let users supply more relevant information also appears to be an effective way to curb hallucination, the tendency of AI models to confidently answer questions with made-up information.
But while some model-builders race for more resources, others see signs that the scaling hypothesis is running into trouble. Physical constraints, such as insufficient memory and rising energy costs, place practical limits on the design of ever-larger models. More worrying, it is not clear that expanding context windows will be enough to sustain further progress. Yann LeCun, a star AI researcher now at Meta, is among many who believe that the limitations of current AI models cannot be fixed with more of the same.
Some scientists are therefore turning to a long-standing source of inspiration in the field of artificial intelligence: the human brain. The average adult can reason and plan far better than the best LLMs, despite using less energy and much less data. “Artificial intelligence needs better learning algorithms, and we know this is possible because the brain has them,” says Pedro Domingos, a computer scientist at the University of Washington.
One problem, he says, is the algorithm that LLMs use to learn, known as backpropagation. All LLMs are neural networks arranged in layers, which receive input data and transform it to predict an output. When an LLM is in its learning phase, it compares its predictions against the version of reality available in its training data. If the two diverge, the algorithm makes small adjustments to each layer of the network to improve future predictions. That makes the process computationally intensive and incremental.
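As a rough illustration of what backpropagation does, the sketch below trains a toy two-layer network in Python with NumPy: it makes a prediction, compares it with the training data, and propagates the error backwards to nudge every weight. The layer sizes, learning rate and data are purely illustrative; a real LLM applies the same recipe to billions of weights.

```python
import numpy as np

# Toy two-layer network trained with backpropagation (illustrative sizes only).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # 8 training examples, 4 input features
y = rng.normal(size=(8, 1))          # target values from the training data
W1 = rng.normal(size=(4, 16)) * 0.1  # first-layer weights
W2 = rng.normal(size=(16, 1)) * 0.1  # second-layer weights
lr = 0.01                            # learning rate: the size of each small adjustment

for step in range(1000):
    # Forward pass: transform the input layer by layer to predict an output.
    h = np.tanh(x @ W1)
    pred = h @ W2

    # Compare predictions with the training data (mean squared error).
    err = pred - y
    loss = (err ** 2).mean()

    # Backward pass: propagate the error back through each layer
    # and nudge every weight slightly to improve future predictions.
    grad_pred = 2 * err / len(x)
    grad_W2 = h.T @ grad_pred
    grad_h = grad_pred @ W2.T
    grad_W1 = x.T @ (grad_h * (1 - h ** 2))   # derivative of tanh

    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
```

Every pass over the data repeats this predict-compare-adjust cycle, which is why training is both incremental and expensive.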
The neural networks in today’s LLMs are also inefficiently structured. Since 2017 most AI models have used a neural-network architecture known as a transformer (the “T” in GPT), which establishes relationships between bits of data that sit far apart in a dataset. Previous approaches struggled to make such long-range connections. If, for example, a transformer-based model were asked to write the lyrics to a song, it could riff in its coda on lines from many verses earlier, whereas a more primitive model would have forgotten the beginning by the time it reached the end. Transformers can also be run on many processors at once, which significantly reduces their training time.
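The sketch below shows, in much simplified form, the self-attention step at the heart of a transformer: every position in a sequence is compared with every other, however far apart, and the output blends information from the whole sequence. The dimensions and weight matrices here are toy values for illustration only.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (seq_len x d_model).

    Every position attends to every other position, which is how a transformer
    can relate a line in a song's coda to one many verses earlier.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance of all positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # blend of values from the whole sequence

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                         # 10 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)                       # same shape as X: (10, 8)
```

Because the `scores` matrix compares all pairs of positions, its size grows with the square of the sequence length, which is where the efficiency worries discussed next come from.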
However, Albert Gu, a computer scientist at Carnegie Mellon University, thinks the transformer’s time may soon be up. Scaling up context windows is highly inefficient computationally: as the input doubles, the amount of computation required to process it quadruples. Together with Tri Dao of Princeton University, Dr Gu has developed an alternative architecture called Mamba. If, by analogy, a transformer reads all of a book’s pages at once, Mamba reads them sequentially, updating its worldview as it goes. This is not only more efficient, but also closer to the way human comprehension works.
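A back-of-the-envelope comparison illustrates the scaling argument. The sketch below uses a toy operation count in which self-attention grows with the square of the sequence length, while a sequential update of a fixed-size state, in the spirit of Mamba, grows linearly. This is a caricature for illustration, not Mamba’s actual selective state-space machinery.

```python
# Rough operation counts, to illustrate the scaling argument in the text.
def attention_ops(n, d=64):
    # Self-attention compares every token with every other: cost grows as n squared.
    return n * n * d

def recurrent_ops(n, d=64):
    # A sequential, Mamba-like model updates a fixed-size state once per token:
    # cost grows linearly with n.
    return n * d * d

for n in (1_000, 2_000, 4_000):
    print(n, attention_ops(n), recurrent_ops(n))
# Doubling the input quadruples the attention count but only doubles the recurrent one.
```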
LLMs also need help to get better at reasoning and planning. Andrej Karpathy, a researcher formerly of OpenAI, explained in a recent talk that current LLMs are capable only of “system 1” thinking. In humans, this is the automatic mode of thought involved in snap decisions. “System 2” thinking, by contrast, is slower, more deliberate and involves iteration. For AI systems, that may require algorithms capable of something called search: an ability to sketch out and explore many different courses of action before choosing the best one. This would be akin to the way game-playing AI models can pick the best moves after exploring several options.
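One simple version of search is sketched below: sample several candidate answers or plans, score each with some evaluator, and keep the best rather than committing to the first thing generated. The `propose` and `score` functions here are stand-ins for a language model and an evaluator; they are not any lab’s actual system.

```python
import random

def propose(prompt, k=5):
    # Stand-in for an LLM sampling k different candidate answers or plans.
    return [f"{prompt} -> candidate plan {i}" for i in range(k)]

def score(candidate):
    # Stand-in for an evaluator: a reward model, a verifier, or a human rater.
    return random.random()

def best_of_n(prompt, k=5):
    # "System 2"-style deliberation: explore several courses of action,
    # then pick the one that scores highest instead of the first one generated.
    candidates = propose(prompt, k)
    return max(candidates, key=score)

print(best_of_n("Draft a treatment plan"))
```

Real search methods explore branching trees of actions rather than a flat list, but the principle is the same: spend extra computation deliberating before answering.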
Advanced planning via search is the focus of much ongoing effort. Meta’s Dr LeCun, for example, is trying to program the ability to reason and make predictions directly into an AI system. In 2022 he proposed a framework called “Joint Embedding Predictive Architecture” (JEPA), which is trained to predict larger chunks of text or images in a single step than current generative AI models can. That lets it focus on the global features of a dataset. When analyzing images of animals, for example, a JEPA-based model can more quickly focus on size, shape and color, rather than individual patches of fur. The hope is that by abstracting things out in this way, JEPA learns more efficiently than generative models, which get distracted by irrelevant details.
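The sketch below gestures at the joint-embedding idea: encode a context and a target separately, then train a predictor to match the target’s embedding rather than reconstructing it in full detail. The encoders, weights and loss are invented for illustration and are not Dr LeCun’s published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Stand-in encoder: maps a raw input to a compact embedding of its global features.
    return np.tanh(x @ W)

d_in, d_emb = 32, 8
W_ctx = rng.normal(size=(d_in, d_emb)) * 0.1    # context-encoder weights
W_tgt = rng.normal(size=(d_in, d_emb)) * 0.1    # target-encoder weights
W_pred = rng.normal(size=(d_emb, d_emb)) * 0.1  # predictor weights

context = rng.normal(size=(1, d_in))   # e.g. the visible part of an image
target = rng.normal(size=(1, d_in))    # e.g. the part to be predicted

# Predict the target's embedding from the context's embedding.
z_ctx = encoder(context, W_ctx)
z_tgt = encoder(target, W_tgt)
z_hat = z_ctx @ W_pred

# The loss lives in embedding space: fine-grained detail (individual patches
# of fur) is abstracted away, and only the global features need to match.
loss = ((z_hat - z_tgt) ** 2).mean()
```

The contrast with a generative model is that nothing here ever tries to reproduce the target pixel by pixel or token by token.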
Experiments with approaches such as Mamba and JEPA remain the exception, however. Until data and computing power become insurmountable obstacles, transformer-based models will stay in vogue. But as engineers push them into ever more complicated applications, human expertise will still be needed to label data. That could mean slower progress than before. For a new generation of AI models to stun the world the way ChatGPT did in 2022, fundamental breakthroughs may be necessary.
© 2024, The Economist Newspaper Limited. All rights reserved.
From The Economist, published under license. Original content can be found at www.economist.com