If you asked the general public what the best AI model is, there’s a good chance most people would answer ChatGPT. While there are many players on the scene in 2024, OpenAI is the one that has truly broken through and introduced powerful generative AI to the masses. Fittingly, ChatGPT’s large language model (LLM), GPT, has consistently topped its competitors, from the introduction of GPT-3.5, through GPT-4, and now GPT-4 Turbo.
However, the tide seems to be turning: this week, Anthropic’s Claude 3 Opus LLM overtook GPT-4 for the first time in the Chatbot Arena, prompting app creator Nick Dobos to declare: “The king is dead.” At the time of writing, Claude still holds the edge: Claude 3 Opus has an Arena Elo rating of 1253, while GPT-4-1106-preview sits at 1251, closely followed by GPT-4-0125-preview at 1248.
For what it’s worth, Chatbot Arena ranks all three LLMs in first place, but Claude 3 Opus has a slight edge.
Anthropic’s other LLMs are also performing well. Claude 3 Sonnet is tied with Google’s Gemini Pro for fourth place, while Claude 3 Haiku, Anthropic’s lighter and faster LLM, ranks just below them, only slightly ahead of the 0613 version of GPT-4.
How Chatbot Arena evaluates LLMs
To rank the various LLMs currently available, Chatbot Arena asks users to enter a prompt and judge how two different, unnamed models respond. Users can continue the conversation until they decide which model performs better. Users don’t know which models they’re comparing (Claude might be pitted against ChatGPT, Gemini against Meta’s Llama, and so on), which eliminates any brand-preference bias.
However, unlike other types of benchmarks, there is no real rubric by which users evaluate the anonymous models. Users simply decide for themselves which LLM performs better, based on whatever criteria they care about. As AI researcher Simon Willison put it in an interview with Ars Technica, much of what makes one LLM seem better than another in users’ eyes comes down to “vibes” more than anything else. If you like the way Claude responds better than ChatGPT, that’s all that matters.
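Those Arena Elo ratings are computed from the accumulated head-to-head votes, much like chess ratings. As a rough illustration, here is a minimal sketch of a standard Elo update in Python; the K-factor of 32 and the exact update rule are common textbook defaults, not necessarily what Chatbot Arena itself uses:

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Predicted probability that model A beats model B,
    # given their current ratings (standard Elo formula).
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    # Shift both ratings toward the observed result.
    # k controls how much a single vote moves the ratings;
    # 32 is an illustrative choice, not Chatbot Arena's setting.
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1 - s_a) - (1 - e_a))
    return new_a, new_b

# Two closely rated models (the ratings cited above): a single win
# moves the winner up only slightly, since an upset was not expected.
claude, gpt4 = update_elo(1253, 1251, a_won=True)
```

Because the ratings are so close, each vote nudges them by only a few points, which is why the top models can trade places from week to week.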
Above all, it is a testament to how powerful these LLMs have become. If you had run the same test a few years ago, you would probably have relied on more standardized metrics to determine which LLM was stronger, whether speed, accuracy, or consistency. Now Claude, ChatGPT, and Gemini have become so good that they are almost interchangeable, at least for everyday use of generative AI.
While it’s impressive that Claude surpassed OpenAI’s LLM for the first time, what’s perhaps even more impressive is that GPT-4 stayed on top this long. The LLM itself is already a year old, setting aside iterative updates like GPT-4 Turbo, while Claude 3 was released just this month. Who knows what will happen when OpenAI introduces GPT-5, which, at least according to one anonymous CEO, is “…really good, like materially better.” For now, there are many generative AI models, each nearly as effective as the next.
Chatbot Arena has gathered over 400,000 human votes to rank these LLMs. You can try the test yourself and add your vote to the rankings.