MT Reaches Human Quality? Maybe, If You Squint Really Hard

21 Mar

Microsoft decided to up its public relations game around machine translation. A recent blog post trumpeted that the company’s neural machine translation (NMT) system has reached parity with human translators for Chinese-English news text.

Predictably, mainstream tech media turned the announcement into a general claim that MT is now as good as human translation. It’s hard to blame Microsoft for wanting to make strides in an MT PR war dominated by its competitors, and the research paper with the actual findings is a good deal more careful than the news media in how it states them. But are we really at the point where machines are just as good as humans at translating global content, at least from Chinese to English?

There is good reason to be skeptical, and attendees at the AMTA conference in Boston learned why. Google’s Macduff Hughes discussed problems in MT quality assessment in his keynote addressing why MT needs to move beyond sentences. Although he focused on his company’s software, his comments apply equally to Microsoft’s blog post. His criticism falls into three general categories:

MT quality assessment focuses on sentences in isolation. Evaluators may see the Spanish “Fue a la tienda” translated into English as “He went to the store.” They will rate this as a good translation, but it could be wrong in context: the subject of “fue” could be “he,” “she,” or “it.” Even the best MT systems make such mistakes, a fact that sentence-by-sentence evaluation hides and in-context evaluation would expose.
Dubious human translations are not a good standard. When linguists are paid by the word for reference translations, they know that their translations will not be rejected. Thus, they have a strong incentive to produce high-volume, low-grade translation. CSA Research’s examinations of so-called “gold standard” reference translations used in previous years’ WMT test sets – for example, Microsoft used the 2017 version – have revealed that such problems are pervasive. Thus, instead of saying that NMT output rivals that of humans, we could conclude conversely that NMT performs like bad human translators.
MT lives and dies by statistics. Engines cannot handle issues such as supplying appropriate verb tenses when coming from languages – for example, Chinese – that lack grammatical tense. Although the translations they generate may be adequate in a sentence-by-sentence mode, a full translation will highlight the things they get wrong. For example, Google Translate – based on statistical data – equates doctors and engineers with men and depicts nurses and teachers as women, even if the context makes it clear that the opposite is true in a specific case.
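
To make the first of these points concrete, here is a minimal Python sketch of how the same output can pass a sentence-level check and fail an in-context one. Everything in it is a toy assumption for illustration: the Segment type, the context_subject field, and the first-word heuristic are hypothetical and do not describe the evaluation protocol used by Microsoft, Google, or WMT.

```python
# Toy illustration only: the data, names, and scoring rule are hypothetical.
from dataclasses import dataclass

# Pronouns that a subject-less Spanish "fue" can legitimately map to.
ACCEPTABLE_IN_ISOLATION = {"he", "she", "it"}


@dataclass
class Segment:
    source: str           # source sentence
    mt_output: str        # machine translation under evaluation
    context_subject: str  # pronoun the surrounding document actually requires


def subject_of(translation: str) -> str:
    """Return the first word of the translation, lowercased (toy heuristic)."""
    return translation.split()[0].lower()


def sentence_level_ok(seg: Segment) -> bool:
    """Judge the sentence alone: any grammatical reading of 'fue' passes."""
    return subject_of(seg.mt_output) in ACCEPTABLE_IN_ISOLATION


def in_context_ok(seg: Segment) -> bool:
    """Judge against the document: the pronoun must match the referent."""
    return subject_of(seg.mt_output) == seg.context_subject


# The surrounding text (not shown) is assumed to be about a woman.
seg = Segment("Fue a la tienda.", "He went to the store.", context_subject="she")

print(sentence_level_ok(seg))  # True: looks fine in isolation
print(in_context_ok(seg))      # False: wrong once the document is considered
```

An evaluator who sees only the isolated sentence plays the role of the first check, which is why sentence-level scores can overstate quality.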

Taken together, these criticisms should temper any overly optimistic interpretation of the results of sentence-level MT evaluations carried out in an artificial online environment. Sadly, the hype may also distract from the very real progress Microsoft has made, because it leads skeptics to throw the baby out with the bath water and believers to make unrealistic claims.

The accurate claim would be that MT is now “at parity” with poor-quality human translation when evaluated under a set of highly artificial conditions. But an article titled “New Study Shows That MT Isn’t Terrible” would not generate the attention Microsoft wanted, although the few people who chose to read it would learn that the gap between human and machine translation still exists and that flesh-and-blood linguists won’t lose their jobs any time soon.

Human translators and MT engines have complementary skill sets: Hughes’ keynote showed that machines perform well on tedious and repetitive tasks because they never get bored. On the other hand, humans perform better on challenging and interesting tasks where the machines’ linguistic algorithms break down. Combining the two to leverage both their strengths – the hybrid approach we call augmented translation – will always provide results superior to either humans or machines acting alone.

As MT – along with automated content enrichment, lights-out project management, and improved terminology management and translation memory – moves forward, the balance will shift: Volumes will grow inexorably, humans will perform the high-value tasks, and machines will execute the repetitive things they can handle well. CSA Research predicts that demand for human translation will continue growing.

As we expressed it a few years ago, the only translators who need to worry about machine translation are those who translate like machines. Nothing since then has changed the equation. The machines may be getting better, but – hyperbolic PR claims notwithstanding – they are not about to replace good translators.