As artists, writers and other creators call for the regulation of artificial intelligence to protect their jobs and livelihoods, and as chatbot makers OpenAI and Anthropic face copyright lawsuits from plaintiffs including authors, The New York Times and Universal Music Group, research published Wednesday found that some of the top AI models available today are generating “copyrighted content at an alarming rate.”
Patronus AI, a startup co-founded by former Meta researchers that focuses on evaluating and testing the large language models (LLMs) that power popular chatbots, on Wednesday released its CopyrightCatcher tool, which it called “our solution for detecting potential copyright infringements in LLMs.”
The company tested four major AI models for reproduction of copyrighted content: OpenAI’s GPT-4, Anthropic’s Claude 2.1, Mistral’s Mixtral, and Meta’s Llama 2. Of the four models, two of which are open source and two closed source, GPT-4, the most advanced model powering ChatGPT, generated the most copyrighted content, doing so in 44% of the prompts. The study found that Mixtral generated copyrighted content in 22% of the prompts, Llama 2 in 10% of the prompts, and Claude 2.1 in 8% of the prompts.
Patronus AI tested the models using copyrighted books, including Gone Girl by Gillian Flynn and A Game of Thrones by George R.R. Martin, but noted that in the U.S. some of the generated output may be covered by fair use laws. The researchers prompted each model either to produce the first passage of a book or to complete a passage of its text.
The test results showed that GPT-4 completed book texts 60% of the time and generated the first passage 26% of the time. Meanwhile, Claude 2.1 completed the book texts 16% of the time, but generated the first passage 0% of the time. Mixtral generated the first passage when prompted 38% of the time and completed the texts 6% of the time. Llama 2 generated the first passages and completed the texts 10% of the time.
“Perhaps surprisingly, we found that OpenAI’s GPT-4, which is arguably the most powerful model used by many companies as well as individual developers, generated copyrighted content in 44% of the prompts we constructed,” Rebecca Qian, co-founder and chief technology officer at Patronus AI, told CNBC.
OpenAI, Mistral, Meta and Anthropic did not immediately respond to a request for comment.
Because LLMs are trained on data that includes copyrighted works, Patronus AI concluded that it is “pretty basic” for LLMs to generate exact reproductions of those works, and that it is essential to catch such reproductions to avoid legal and reputational risk for companies.