The hype that followed ChatGPT’s public launch last year was, even by the standards of technological innovation, extreme. OpenAI’s natural-language system creates recipes, writes computer code and parodies literary styles. Its latest version can even describe photos. It has been hailed as a technological breakthrough on a par with the printing press. But it was not long before huge flaws also became apparent. It sometimes “hallucinates” non-facts that it states with complete certainty, insisting on these falsehoods when questioned about them. It also fails basic logic tests.
In other words, ChatGPT is not a general artificial intelligence, an independent thinking machine. It is, in the jargon, a large language model. This means it is very good at predicting which words tend to follow others, after being trained on a huge body of text (its creator, OpenAI, does not say exactly from where) and detecting patterns in it.
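To make “predicting which words tend to follow others” concrete, here is a toy sketch that simply counts word pairs in a tiny made-up sentence. It is not how ChatGPT is built: real large language models use neural networks trained on vast amounts of text, and the corpus and function names below are invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical miniature "training data"
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

Even this crude counter captures the basic idea of spotting patterns in text; the gap between it and a system that writes coherent paragraphs is what makes the achievement notable.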
Amid all this noise, it is easy to forget a minor miracle. ChatGPT has solved a problem that was long a distant dream for engineers: generating human-like language. Unlike earlier versions of the system, it can do this for paragraphs on end without becoming incoherent. And the dimensions of this achievement are even greater than they seem at first glance. ChatGPT is not only able to generate extremely realistic English. It can also instantly produce text in more than 50 languages; the exact number is apparently unknown to the system itself.
When asked (in Spanish) how many languages it knows, ChatGPT answers vaguely “over 50”, explaining that its ability to produce text depends on the amount of training data available for a given language. Then, asked a question in an unannounced switch to Portuguese, it offered a sketch of your columnist’s biography in that language. Most of it was correct, but it had him studying the wrong subject at the wrong university. The language itself was impeccable.
Portuguese is one of the largest languages in the world. To try a smaller one, your columnist tested ChatGPT in Danish, which is spoken by only about 5.5 million people. Danes do most of their online writing in English, so the training data for Danish must be an order of magnitude scarcer than that available for English, Spanish or Portuguese. ChatGPT’s answers were factually askew, but expressed in near-perfect Danish. (The only mistake detected in any of the languages tested was a minor gender-agreement error.)
Indeed, ChatGPT is too modest about its own capabilities. On request, it provides a list of 51 languages in which it can work, including Esperanto, Kannada and Zulu. It declines to say that it “speaks” these languages, saying rather that it “generates text” in them. This, too, is too modest an answer. Addressed in Catalan – a language not on the list – it responds in that language with a cheerful “Yes, I speak Catalan – how can I help you?” A few follow-up questions do not trip it up in the slightest, including one about whether it is merely translating into Catalan responses first generated in another language. This ChatGPT denies: “I don’t translate from any other language; I look for the best words and phrases in my database to answer your questions.”
Who knows if that is true? ChatGPT not only makes things up, but also incorrectly answers questions about the very conversation it is having. (It has no “memory”, but rather feeds the last few thousand words of each conversation back to itself as a new prompt. If you have been speaking English for a while, it will “forget” that you previously asked a question in Danish and say that the question was asked in English.) ChatGPT is untrustworthy not only about the world, but even about itself.
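The “forgetting” described in that parenthesis can be sketched in a few lines. This is a rough illustration under stated assumptions, not OpenAI’s implementation: real systems count tokens rather than words, the limit here is shrunk to 20 words so the effect shows in a short exchange, and the function name and sample conversation are made up.

```python
def build_prompt(turns, max_words=3000):
    """Join the most recent turns that fit within max_words, oldest first."""
    kept = []
    used = 0
    # Walk backwards from the newest turn, stopping once the budget is spent.
    for turn in reversed(turns):
        n = len(turn.split())
        if used + n > max_words:
            break
        kept.append(turn)
        used += n
    return "\n".join(reversed(kept))

# Hypothetical conversation that starts in Danish and continues in English
conversation = [
    "User: Hvor mange sprog taler du?",
    "Bot: Mere end 50.",
    "User: Tell me about Portugal.",
    "Bot: Portugal is a country in south-western Europe.",
    "User: Which language did I use first?",
]
prompt = build_prompt(conversation, max_words=20)
print(prompt)
```

With the 20-word budget, only the last three turns fit, so the Danish question never reaches the model. Asked which language was used first, it has nothing but English to go on.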
This should not overshadow the achievement of a model that can effortlessly mimic so many languages, including those with limited training data. Speakers of smaller languages have worried for years that language technologies will pass them by. Their legitimate concerns had two causes: companies’ weaker incentive to develop products in Icelandic or Maltese, and the relative lack of data with which to train them.
Somehow the developers of ChatGPT seem to have overcome these problems. It is too early to say what good the technology will do, but that alone gives reason for optimism. As machine-learning techniques improve, they may not require the massive amounts of programming time or data traditionally considered necessary to ensure that smaller languages are not overlooked on the Internet.
© 2024, The Economist Newspaper Limited. All rights reserved.
From The Economist, published under license. Original content can be found at www.economist.com
Posted: Jun 15, 2024 1:37 pm EST