Large language models have been in the news a lot in November and December and the coverage has been mixed, to put it mildly. Meta posted its Galactica model on November 15 but took it down just three days later in the face of intense criticism. By contrast, when OpenAI released ChatGPT two weeks later, on November 30, the response was much more positive. Examining why the reactions were so different provides insight into the potential and limitations of machine translation (MT) as well as cautions about how far we can use and trust AI-driven solutions.
Before going on, I should note that everything I write about below applies doubly for non-English language models, which tend to have both significantly less training data and less attention from developers. The reliability of these models drops significantly when you leave the English-speaking realm, increasing the risks from polluted data sets, hallucination, and the other factors I touch upon.
Galactica and the Overabundance of Hype
When Meta released Galactica, it did so with claims about its capabilities that usage soon revealed to be overblown. The model was trained on large quantities of scientific texts, and Meta said that it could “summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.” The company also stated that Galactica could “potentially store, combine, and reason about” the information in its training data. Setting aside the heavy lifting done by the word “potentially” in the second quote – a nice way of making a claim without actually making it – these are truly impressive statements: Taken at face value, they imply that we now have a scientific truth engine, capable of generating new authoritative information.
But it didn’t take long for researchers and the public to find that, in the absence of actual data, it was given to simply making things up. They also found that Galactica would produce biased output, although I think that some of the best-known examples were more ambiguous than claimed. In the face of intense pressure, Meta took Galactica down, although not without some Twitter sniping back and forth between the company and its critics, whom Meta’s Chief Scientist Yann LeCun accused of “casually misusing” the model, with another defender concluding that critics were the reason “we can’t have nice things.”
As a note, when it launched Galactica, the company did state that it was prone to “hallucination,” without much explanation of what that means. My own initial reaction to this was that Meta was using a technical term well understood by AI researchers to refer to phenomena such as those that happen when image generation tools add extra limbs to images, but it turns out that the company had been using it in an even more general sense to refer to cases where it produced output not seen in its training data. If that is really what it meant, then it blurs or even erases the boundary between hallucination and Galactica just doing what it’s supposed to do, especially if it is actually “combining” and “reasoning” about information.
Fortunately for those who want to “casually misuse” – er, evaluate – Galactica, Hugging Face has preserved a copy of the model, so I visited it and tried several queries. In one of them I asked it to tell me about the (nonexistent) “Lommel Hypothesis,” and it produced the following, complete with citations to a fictitious source:
The result is word salad: It clearly presents fabrications as statements of fact, since there is no Lommel Hypothesis. Is this hallucination? I’d argue it is not, but even if it is, Galactica included no clear way to tell whether its output was real or not. And the danger here is that the word salad is convincing enough that a non-expert like me might not recognize that it is wrong, only that it is a little strange.
Ultimately, what did in Galactica was that Meta overhyped its capabilities. Had the company been a little less enthusiastic and given more careful and realistic statements of limitations, the reaction would have been very different. Instead, there was little to tell people that Galactica was not the dawn of a new era, and so the reaction was swift and harsh.
Yes, the reaction to Galactica was not what Meta hoped for, but it was far more merciful than what would have happened had people deliberately set out to break the model rather than prod it. The risks that surfaced here would only increase with time as more determined bad actors tried to break it, political factions subverted it for their own ends, or gray hat hackers attempted to “fix” the problems. In the end, Meta benefited from honest probing rather than the determined assault that could have happened.
ChatGPT and the Advantage of Caution
Perhaps having the advantage of a later release date that gave it reason and time to temper its own claims, OpenAI’s ChatGPT has largely avoided the negative reaction that Galactica received. The company was more circumspect in its claims about what this model could do, but initial reports frequently included descriptors such as “stunned,” “amazed,” and “impressed.” Very quickly users showcased output that could easily be mistaken for good high school-level essays, and results were strikingly fluent.
When I posed the same question about my namesake hypothesis to ChatGPT, it responded that it was “not familiar with the Lommel Hypothesis” and asked me to provide “some more information or context about it.” This answer is much less dangerous than Galactica’s, because it is up-front about its limitations, and it demonstrates that OpenAI was considering the implications of producing false information.
However, a week later, I asked it about this again and its answer changed to the following:
What changed in the time between the first and second queries? Ultimately, we will probably never know, but my guess is that my asking about it ended up inserting “Lommel Hypothesis” into the model where it did not initially exist and then it started to return results where previously there had been none. Lending credence to this surmise, when I asked it “What is the Zerflzxbladländer Hypothesis?” it responded that it did not appear to be a “well-known or widely accepted scientific theory” and that it was “unable to provide any information on it.” More confusingly, when I asked about the Lommel Hypothesis in another session, ChatGPT reverted to its previous response that it did not know the Lommel Hypothesis.
So, at the end of the day, I am not certain that ChatGPT is really better at distinguishing between reality and nonsense than Galactica once it has been exposed to nonsense. Perhaps in another week, ChatGPT will “think” that the “Zerflzxbladländer Hypothesis” is a real thing and give me an explanation of what it supposedly means. If so, this does not bode well for its future as it will slowly degrade under the weight of user-generated nonsense.
On the other hand, OpenAI says that ChatGPT is not connected to the internet or any other current source of truth, does not retain user input, and that its training data dates to 2021, which leaves it unable – at least for now – to evolve. So perhaps the inconsistent response to the Lommel Hypothesis was just so much stochastic noise. Either way, this reliance on resources from a fixed point in time is a substantial limitation, one reminiscent of machine translation systems that improve only once every 12 months when they are retrained.
What Can We Learn for the Language Industry?
Based on my probing of Galactica and ChatGPT, I have a few observations to make about language, the language industry, and machine translation:
1. Large language models make the “trust problem” worse. Despite the expectation that large language models would lead to the next wave of dramatic improvement in MT, they introduce some serious risks. One of the biggest challenges for MT now is that it is not reliable. Although the development of responsive and responsible MT should improve this, large language models that can produce convincing-sounding output that is nonsense are likely to increase the risk of dangerous or harmful translation errors. My experiments showed that users should not trust what Galactica says at face value, but instead need to examine it carefully to verify everything. Note that this problem will be worse in languages with relatively little training data in these models.
2. Labeling is essential. Risks increase when it is not clear where translation comes from. When MT hallucinates or generates a harmful output, who is responsible for that? Companies that deploy MT and do not label it or check it run the risk of becoming the legal case that determines where liability lies when they publish a translation that gives the wrong dosage for a drug or that directs a visitor seeking support to undertake an action that results in injury. If you are using MT, make certain that you have appropriate disclaimers and that you have a policy to determine where you can and, more critically, cannot release the results.
3. Translation will pose new problems for AI more generally. What happens when translation errors make their way into training data? What about when models start consuming translations that their predecessors generated and treat them as authoritative? Large language models will not verify translations, ensure that edits are made to published text, or address ambiguities that negatively affect their output. The result could be a vicious amplification of errors. On the other hand, deprecating translated content runs its own risk of reinforcing an English-centric view of the universe, assuming that the models are even “aware” of where content comes from.
4. Translation grades provide a way to address risk. After our webinars and article on translation grades, we received feedback that the idea of grades cheapened human translators and suggested that they could call themselves good translators if they produced low-grade output. However, the assumption is that professional human translators would almost always provide high-grade translation and that low- and medium-grade translation would be left to MT, paraprofessionals, or “light” MT editing services.
Note: It is not even clear to me that professional translators could, ethically or practically, produce low-grade output. On the other hand, if you feel MT cannot be trusted for your use case, it is critical that humans take the lead for producing high-grade translation and that consumers understand the difference between the grades so that they are not taken in by overblown claims about MT and, more broadly, AI. To be clear: MT and large language models cannot be trusted to produce high-grade translations. This limitation is more than just one of capability: The sources they use may themselves be dubious.
5. Professional linguists will continue to play a major, and growing, role. Based on the last point, it should be obvious that MT is not going to displace human translators or interpreters. Too much is at risk. It may take away some jobs where low- or medium-grade output meets user requirements in a cost-effective manner. However, it will definitely change the jobs that translators do today. Outside of certain genres – literary translation, advertising, high-end content – MT will play a growing assistive role for translators. They will need to learn to use it and, crucially, learn how to recognize when it gets things wrong.
6. Quality estimation will become key. As the output of MT becomes more fluent, detecting problems will become increasingly difficult, which can: a) increase the risk that content poses; and b) increase the cognitive load for MT editors and thereby decrease their efficiency. This means that quality estimation will become more important, requiring breakthroughs in this area. When the technology can reliably identify problems and risk, it will go a long way toward addressing the trust problem.
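To make the labeling point (#2 above) concrete, here is a minimal sketch of what provenance labeling and a release policy might look like in code. This is purely illustrative – every name and rule in it is my own invention, not an industry standard:

```python
from dataclasses import dataclass
from enum import Enum


class TranslationSource(Enum):
    """How a translation was produced -- the key fact a label must carry."""
    RAW_MT = "machine-translated, unreviewed"
    POST_EDITED_MT = "machine-translated, human-reviewed"
    HUMAN = "human-translated"


@dataclass
class LabeledTranslation:
    text: str
    source: TranslationSource

    def render(self) -> str:
        """Attach a visible disclaimer so readers know what they are reading."""
        if self.source is TranslationSource.RAW_MT:
            return f"{self.text}\n[Automatic translation -- not reviewed for accuracy]"
        return self.text


def releasable(t: LabeledTranslation, high_risk: bool) -> bool:
    """Toy release policy: never publish raw, unreviewed MT for high-risk
    content such as drug dosages or safety instructions."""
    return not (high_risk and t.source is TranslationSource.RAW_MT)
```

The point of carrying the source as structured data rather than prose is that the release policy can then be enforced mechanically, rather than depending on someone remembering to add a disclaimer.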
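The amplification dynamic in point #3 can be made visible with a toy back-of-the-envelope model. Assume each generation of MT trains on its predecessor’s published output, a small fraction of still-correct segments is corrupted each round, and inherited errors are treated as authoritative – all simplifying assumptions of mine, not measurements:

```python
def error_share(error_rate: float, generations: int) -> list[float]:
    """Cumulative share of corrupted segments after each generation,
    when every generation of MT trains on its predecessor's output.
    New errors hit only still-correct segments; old errors are
    inherited as if they were authoritative and never corrected."""
    share, history = 0.0, []
    for _ in range(generations):
        share += error_rate * (1.0 - share)
        history.append(round(share, 3))
    return history
```

Even a 2% per-generation error rate compounds: after five rounds, roughly 10% of segments carry inherited errors, and the share only grows because nothing in the loop ever corrects them.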
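Finally, for illustration of the quality-estimation gate in point #6, here is a sketch of the kind of crude signals such a gate might start from. Production QE relies on trained neural models; the heuristics and thresholds below are my own invention and only show the idea of routing risky output to a human editor:

```python
def qe_flags(source: str, translation: str) -> list[str]:
    """Toy quality-estimation signals. Real QE systems score adequacy
    and fluency with trained models; these crude heuristics only
    illustrate flagging output that deserves human review."""
    flags = []
    src_words = source.split()
    tgt_words = translation.split()
    if not src_words or not tgt_words:
        return ["empty"]
    ratio = len(tgt_words) / len(src_words)
    if not 0.5 <= ratio <= 2.0:
        flags.append("length-mismatch")  # possible omission or hallucinated padding
    src_set = {w.lower() for w in src_words}
    copied = sum(1 for w in tgt_words if w.lower() in src_set)
    if copied / len(tgt_words) > 0.6:
        flags.append("untranslated")  # output largely copies the source text
    return flags
```

A pipeline could then send anything with flags to an editor and let clean segments through – exactly the kind of triage that becomes essential once fluent output stops being a reliable signal of accuracy.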
These points show that the concerns with broader artificial intelligence have implications for the language industry. The problems that led Galactica to be pulled are ones the language industry will need to face. ChatGPT partially shows the way to move forward, but also shows that even systems that address these concerns may still leave unresolved issues. The challenge for the language industry will be how to integrate this technology and use it responsibly to avoid negative outcomes.
If you’d like to discuss these issues in greater depth, we encourage you to attend CSA Research’s online discussion with four leaders in the MT sector about the future of machine translation on January 26, 2023 at 11:00 US Eastern Standard Time (UTC -5). This webinar will be free and open to the public and we will be asking our guests some pointed questions about large language models and MT and hope that you will bring your own tough questions to contribute to the discussion.