Is Neural MT Really as Good as Human Translation?

18 Jan

By now, most language professionals have seen the claims that neural machine translation (NMT) is delivering results as good – or almost as good – as human translation. If these claims – which have been repeated in the mainstream tech press without much examination – are accurate, it is only a matter of time before human translators are out of work.

Based on our ongoing examination of the role of machine translation in facilitating multilingual communications, we find that these claims are overblown. In many respects, NMT represents a significant improvement over state-of-the-art statistical machine translation, but it has not closed the gap with human translation. It will make MT more acceptable and help address the huge mismatch between language needs and supply, but we predict that it will simultaneously help increase the value of human translators for the high-value tasks where they excel.

Google is among the first large-scale implementers of NMT in a production environment. Prior to 2016, artificial intelligence (AI) researchers saw NMT as something that would come, but largely treated it as a bit of “pie in the sky.” All this changed in 2016. Microsoft actually beat Google to launch with NMT by several months, but its deployment received minimal attention, whereas Google’s press release immediately garnered the sort of love-struck press attention that language technology almost never receives.

The discrepancy in attention is due, in part, to Google’s savvy media strategy that tied in with increasing attention to AI and delivered seemingly easy-to-digest claims that – on their surface – promised world-changing results. By contrast, more realistic and measured press releases from Microsoft and SYSTRAN – which announced a shift to NMT at the same time that Google was making its announcements – did not appeal to the imagination in the same way.

What particularly caught on with the press was Google’s internal finding that NMT-generated target text was almost indistinguishable from human translation when graded by human reviewers and that it had reduced translation errors by 55–85%. The latter claim relied on unclear math that the company’s own charts – which showed substantial but incremental improvement – did not seem to reflect:

[Figure: Comparison of Google’s Phrase-Based and Neural Systems with Human Translation]
Caption: Google claims quality levels of its NMT system approach that of human translation
Source: Google

A third-party clearinghouse for MT testing and data would help contextualize these findings and avoid the tendency for developers to present their advances in ways that may exaggerate success. However, in the absence of a neutral and reliable source to counterbalance it, Google’s announcement led to credulous media claims that MT was on the verge of replacing human linguists.

It appears that Google’s numbers refer to the share of the gap between the scores of the older phrase-based system and those of human translation that the NMT engine closed. That claim is misleading for the following two reasons:

1. The tests Google used to show high quality cannot actually demonstrate it. The testing methods that Google relied on ask non-translators who are fluent in both the source and target languages to assign a score to both the MT output and a human translation. These methods do not quantify the number of errors or give any sense of their severity. In addition, a reviewer’s fluency is not enough to guarantee full understanding of translation issues, and this approach tends to privilege fluent – natural-sounding – translation over accurate translation. An examination of Google’s samples shows that the company’s older phrase-based statistical system often sounded stilted but got the meaning right when the NMT version did not. This shift can be a problem if better-sounding output leads information consumers to assume the translation is accurate when it is not.
2. The human references would not pass muster with professional translators. The human references themselves show many problems: they were not the sorts of translations that LSPs would be proud to deliver to their clients. Google’s own results show that they often fall far short of the score that “perfect translation” would receive. Accordingly, Google has demonstrated that the NMT output is more fluent than that of the older system and that it is almost at the level of bad or preliminary human translation, but not that it is on a level with good-quality human translation.
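
To make the arithmetic behind a "share of the gap closed" figure concrete, here is a minimal Python sketch. It assumes that the interpretation described above is the one Google used, which the company has not confirmed. The function names and all scores are hypothetical illustrations, not Google's data.

```python
def gap_closed(baseline: float, candidate: float, reference: float) -> float:
    """Share of the baseline-to-reference score gap that the candidate closes."""
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference score must exceed the baseline score")
    return (candidate - baseline) / gap


def relative_gain(baseline: float, candidate: float) -> float:
    """Improvement in the candidate's raw score relative to the baseline's."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (candidate - baseline) / baseline


# Hypothetical average reviewer scores (assumed values, not Google's results).
pbmt_score = 4.0      # older phrase-based system
nmt_score = 4.75      # neural system
human_score = 5.0     # the human reference translation as actually scored
perfect_score = 6.0   # what a "perfect translation" would receive

print(f"Gap to human reference closed: {gap_closed(pbmt_score, nmt_score, human_score):.0%}")
print(f"Gap to perfect score closed:   {gap_closed(pbmt_score, nmt_score, perfect_score):.1%}")
print(f"Raw score improvement:         {relative_gain(pbmt_score, nmt_score):.2%}")
```

With these made-up numbers, the same system change closes 75% of the gap to the human reference but only 37.5% of the gap to a perfect score, while the raw score rises by 18.75%. The weaker the human reference, the smaller the gap and the larger the headline percentage, which is why the quality of the references matters so much to how the claim should be read.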

Admittedly, the mainstream press is not going to be interested in the details of translation quality assessment. Comparing human and machine translation is a particularly difficult topic where the reference-based methods favored by MT researchers have almost no connection with how professional language workers see and understand quality.

All that said, NMT holds the promise of delivering significantly better MT and extending it to areas where current solutions are not good enough and content remains untranslated. CSA Research predicts that NMT will play an ever-increasing role in enterprises’ multilingual content strategies. Current-generation MT is already helping LSPs deliver sustained growth that outstrips the unaided capability of human translators. NMT will deliver improved output that increases MT’s market share in areas where quality issues have so far kept it from making inroads.

NMT and other technologies such as adaptive MT will transform how language professionals work, but will not replace them any time soon. Instead, these technologies will help professional linguists increase their productivity and value. As linguists use good-quality MT in their daily activities, it will make them faster and allow them to focus on the “good parts” of their jobs: those aspects that require their expertise and creativity.