Artificial intelligence is running out of internet to use. While you and I log on to this worldwide network of ours to have fun (or maybe not), learn, and connect, companies are harvesting its data to train their large language models (LLMs) and expand their capabilities. That’s how ChatGPT not only knows factual information but can also compose answers: much of what it “knows” is drawn from a huge corpus of web content.
While many companies are using the internet to train their LLMs, they face a problem: internet resources are finite, and the companies developing artificial intelligence need them to keep growing, and quickly. As the Wall Street Journal reports, companies like OpenAI and Google have to face this reality. By some industry estimates, they will run out of usable internet in about two years, as high-quality data becomes scarce and some companies wall their data off from AI.
AI needs a lot of data
Don’t underestimate the amount of data these companies need, now and in the future. Epoch researcher Pablo Villalobos told the Wall Street Journal that OpenAI trained GPT-4 on about 12 trillion tokens, which are words and parts of words broken down in a way an LLM can understand. (OpenAI says one token is about 0.75 words, so 12 trillion tokens comes out to roughly nine trillion words.) Villalobos believes GPT-5, OpenAI’s next massive model, would need 60 to 100 trillion tokens to keep up with expected growth. By OpenAI’s own conversion, that means between 45 and 75 trillion words. The kicker? Villalobos estimates that even after exhausting all the high-quality data available on the internet, you would still fall short by 10 to 20 trillion tokens, or even more.
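If you want to play with these conversions yourself, the arithmetic is simple. Here’s a minimal sketch in Python using OpenAI’s roughly 0.75 words-per-token rule of thumb; the token counts are the estimates quoted above.

```python
# Rough token-to-word conversion using OpenAI's ~0.75 words-per-token rule of thumb.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: float) -> float:
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN

# Figures from the WSJ / Epoch estimates quoted above.
gpt4_training_tokens = 12e12             # ~12 trillion tokens
gpt5_low, gpt5_high = 60e12, 100e12      # projected 60-100 trillion tokens

print(f"GPT-4: ~{tokens_to_words(gpt4_training_tokens) / 1e12:.0f} trillion words")
print(f"GPT-5: ~{tokens_to_words(gpt5_low) / 1e12:.0f} to "
      f"{tokens_to_words(gpt5_high) / 1e12:.0f} trillion words")
```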
Still, Villalobos doesn’t believe the data shortage will really bite until around 2028, but others aren’t so sure, especially the AI companies themselves. They see the writing on the wall and are looking for alternatives to internet data on which to train their models.
The AI data problem
There are obviously a few issues to deal with here. First, the aforementioned scarcity: you can’t train an LLM without data, and giant models like GPT and Gemini need a lot of it. Second is the quality of that data. Companies can’t simply scrape every possible corner of the internet, because so much of it is garbage. OpenAI doesn’t want to pump misinformation and poorly written content into GPT, because its goal is an LLM that responds accurately to user prompts. (Of course, we’ve seen plenty of examples of AI spewing disinformation anyway.) Filtering out that content leaves these companies with even fewer options.
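What does “filtering out the garbage” look like in practice? Here’s a deliberately simple, hypothetical sketch of the kind of heuristic checks a data pipeline might run on scraped text. Real pipelines use far more sophisticated classifiers, and every threshold below is an illustrative assumption, not anyone’s actual recipe.

```python
def passes_quality_heuristics(text: str) -> bool:
    """Hypothetical quality filter for scraped web text; thresholds are illustrative."""
    words = text.split()
    if len(words) < 50:                   # too short to be useful training data
        return False
    # Heavy repetition is a common signal of spam or boilerplate.
    if len(set(words)) / len(words) < 0.3:
        return False
    # Pages that are mostly symbols or markup fragments are likely junk.
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return alpha_ratio >= 0.8

spam = "Buy now! Buy now! " * 50
prose = ("Large language models are trained on text scraped from the web, "
         "so filtering out spam, boilerplate, and machine-generated junk "
         "is a routine preprocessing step before any training run. ") * 2
print(passes_quality_heuristics(spam))   # False: repetitive filler
print(passes_quality_heuristics(prose))  # True: varied, mostly alphabetic prose
```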
Finally, and above all, there is the ethics of scouring the internet for data. Whether you know it or not, AI companies have probably scraped your data and used it to train their LLMs. These companies evidently don’t put your privacy first: they want your data, and if they’re allowed to take it, they will. It’s also big business: Reddit sells your content to AI companies, in case you didn’t know. Some outlets are pushing back (The New York Times is suing OpenAI over exactly this), but until real user protections are in place, your public internet data will keep being funneled into your nearest LLM.
So where do companies look for new data? OpenAI is at the forefront here. For GPT-5, the company is considering training the model on transcripts of public videos, such as those pulled from YouTube, generated with its Whisper transcription tool. (It seems possible the company has already used the videos themselves for Sora, its AI video generator.) OpenAI is also working on smaller models for specific niches, as well as a system to pay information providers based on the quality of their data.
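The report doesn’t describe OpenAI’s internal pipeline, but the open-source Whisper model shows how straightforward the transcription step itself is. A minimal sketch, assuming the openai-whisper package (and ffmpeg) is installed and a media file exists at the placeholder path:

```python
import whisper  # pip install openai-whisper; also requires ffmpeg on the system

# Load a small pretrained Whisper checkpoint; larger ones trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local media file (the path is a placeholder). Whisper extracts
# the audio track and returns the recognized text along with timed segments.
result = model.transcribe("downloaded_video.mp4")
print(result["text"])
```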
Is synthetic data the solution?
But perhaps the most controversial next step some companies are considering is using synthetic data to train their models. Synthetic data is simply information generated from an existing dataset: the idea is to create a new dataset that resembles the original but is completely new. In theory, it could mask the contents of the original dataset while giving an LLM a similar set to train on.
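As a toy illustration of the concept (not of any lab’s actual pipeline), here’s a sketch that fits a simple statistical model to an “original” numeric dataset and samples a brand-new dataset from it. Text pipelines would use an LLM as the generator instead of a Gaussian, but the principle is the same: new samples that mimic the original distribution without copying any original data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an "original" dataset: 1,000 samples from some unknown source.
original = rng.normal(loc=5.0, scale=2.0, size=1000)

# Fit a simple generative model to the original data (here, a Gaussian).
mu, sigma = original.mean(), original.std()

# Sample a completely new dataset that resembles the original distribution.
synthetic = rng.normal(loc=mu, scale=sigma, size=1000)

print(f"original:  mean={original.mean():.2f}, std={original.std():.2f}")
print(f"synthetic: mean={synthetic.mean():.2f}, std={synthetic.std():.2f}")
```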
In practice, however, training LLMs on synthetic data can lead to “model collapse.” This happens because synthetic data merely reproduces the patterns already present in the original dataset. Once an LLM is trained on the same patterns over and over, it stops improving and may even forget vital parts of the dataset. Over time, such a model returns increasingly similar results, because it lacks the diverse training data needed to support unique responses. That would kill something like ChatGPT, and it defeats the purpose of using synthetic data in the first place.
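A toy simulation makes the mechanism concrete. Draw a finite sample from a model, refit the model on that sample, and repeat: rare items eventually fail to show up in some generation’s sample, and once their estimated probability hits zero they can never come back. This is a simplified analogue of the effect, not a claim about any production LLM.

```python
import numpy as np

rng = np.random.default_rng(42)

VOCAB, N = 50, 100                 # 50 distinct "tokens", 100 samples per generation
probs = np.full(VOCAB, 1 / VOCAB)  # the original data covers the whole vocabulary evenly

for generation in range(1, 11):
    # "Train" on data drawn from the current model...
    sample = rng.choice(VOCAB, size=N, p=probs)
    # ...then refit: the new "model" is the empirical distribution of that sample.
    counts = np.bincount(sample, minlength=VOCAB)
    probs = counts / counts.sum()
    print(f"gen {generation:2d}: {np.count_nonzero(probs)} of {VOCAB} tokens survive")
```

Run it and the count of surviving tokens only ever goes down; that irreversible loss of diversity is the “forgetting” described above.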
Still, AI companies are somewhat confident about synthetic data. Both Anthropic and OpenAI see a place for the technique in their training pipelines. These are capable companies, so if they can find a way to fold synthetic data into their models without burning the house down, more power to them. It would honestly be nice to know that my Facebook posts from 2010 aren’t being used to fuel the AI revolution.