A Cynic's Guide to "Winning" at Machine Translation

05Apr

Every new development in the field of machine translation (MT) is accompanied by a set of numbers that purport to show dramatic improvement in terms of quality, usually BLEU or METEOR scores. These measures use a scale from 0 to 100 to quantify how similar the MT output is to one or more human translations of the same source text based on a mechanical analysis of how many of the same words show up and how likely they are to appear in the same order.
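To make the "mechanical analysis" concrete, here is a minimal sketch of a BLEU-style score for a single sentence. It assumes whitespace tokenization, n-grams up to length four, no smoothing, and a brevity penalty against the reference closest in length. Production tools differ in tokenization, smoothing, and corpus-level aggregation, so treat the function name `bleu` and its defaults as illustrative, not as any vendor's implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (assumption: no smoothing)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the most times it occurs in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Credit for a candidate n-gram is clipped at that reference maximum.
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: compare against the reference closest in length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
identical = bleu("the cat sat on the mat", [reference])
scrambled = bleu("mat the on sat cat the", [reference])
print(identical, scrambled)
```

The scrambled sentence contains exactly the same words as the reference, yet it shares no two-word sequence with it, so this unsmoothed version scores it at zero. Nothing in the calculation looks at meaning.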

These scores frequently play a prominent role in marketing from MT providers, and any new technology will be billed with a sentence something like, “With our new WhizzBang technology, average BLEU scores rise by three points over the baseline engine.” Developers like these numbers because they are easy to calculate and make it simple to quantify how changes affect system output.

However, CSA Research and leading MT experts have pointed out for over a decade that these metrics are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another.

To test how big this effect was, CSA Research used a popular online statistical MT (SMT) system to translate 1,000 segments of Chinese news data into English and compared its output with 11 human translations of the same content. We then generated over 1,000 BLEU scores: one for every possible combination of those 11 references. The results allowed us to see the full range of scores for the same translation.
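For a sense of how many scores "every possible combination" implies, the sketch below enumerates the reference sets. It assumes that a combination means any non-empty subset of the 11 references, regardless of order; the reference IDs 1 to 11 and the helper name `reference_subsets` are placeholders.

```python
from itertools import combinations

NUM_REFERENCES = 11  # number of human translations, from the text


def reference_subsets(reference_ids):
    """Yield every non-empty combination of reference IDs, smallest sets first."""
    ids = list(reference_ids)
    for size in range(1, len(ids) + 1):
        yield from combinations(ids, size)


subsets = list(reference_subsets(range(1, NUM_REFERENCES + 1)))
singles = [s for s in subsets if len(s) == 1]
pairs = [s for s in subsets if len(s) == 2]

print(len(subsets))  # 2047, i.e. 2**11 - 1
print(len(singles), len(pairs))  # 11 single references, 55 pairs
```

Under that assumption, each subset would yield one BLEU score for the same, unchanged MT output.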

Our findings showed that BLEU scores are highly dependent on which and how many references we chose, with the highest score almost four times the lowest. We found that:

Scores for single references varied by more than 50%. They ranged from a low of 13.68 to a high of 21.32, which is 56% above the low. For combinations of two, results ranged from 23.85 to 31.73. This shows us that using two references, even those that by themselves generated the lowest figures, produces higher values than just one. The easiest way to raise BLEU scores is to add more human translations, without even updating the MT output.
The random variation in scores exceeds what developers claim is significant. The standard deviation for any number of references up to nine exceeds the 0.25-point change that many academic publications cite as significant. If simply switching the references changes the score that much, we cannot rely on BLEU score differences smaller than about three points to indicate real change. Small differences only tell us how well systems perform against particular references, not in general.
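Why do extra references push scores up? One reason is that the n-gram matching step gives a candidate word credit if any reference contains it, so each added reference can only keep or increase the number of matches. The sketch below shows this for single words only, using invented toy sentences rather than data from our test. It leaves out longer n-grams and the brevity penalty, so it illustrates the mechanism and is not a full BLEU calculation.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Share of candidate words matched in the references, with each word's
    credit capped at its highest count in any single reference."""
    cand_counts = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    matched = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


# Invented toy sentences, not data from the study.
mt_output = "the minister said talks will begin next week"
ref_a = "the minister stated that negotiations start next week"
ref_b = "talks will begin next week the minister announced"

one_ref = clipped_unigram_precision(mt_output, [ref_a])
two_refs = clipped_unigram_precision(mt_output, [ref_a, ref_b])
print(one_ref, two_refs)
```

The MT output is identical in both calls. Only the set of references changed, and the match rate rose with it.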

We ran a second test that revealed even bigger problems. If higher BLEU scores equal better quality and human translations are presumed to exhibit higher quality than MT, they should themselves score well. To determine whether this was true, we took the individual human translations and calculated BLEU scores for them using the others as references. We then compared the average scores of the human translations with those of one rule-based (RbMT) system and two statistical systems.
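The leave-one-out procedure described above can be sketched as follows. The function accepts any scoring callable, so a real BLEU implementation could be plugged in. The `toy_overlap` metric here is a deliberately crude stand-in (it is not BLEU), and the three "human" translations are invented for illustration.

```python
from statistics import mean


def leave_one_out_scores(translations, score_fn):
    """Score each translation against all the others used as references.

    translations: dict mapping a translator label to that translator's text.
    score_fn: callable (candidate, references) -> float, e.g. a BLEU function.
    """
    scores = {}
    for name, text in translations.items():
        others = [t for other, t in translations.items() if other != name]
        scores[name] = score_fn(text, others)
    return scores


def toy_overlap(candidate, references):
    """Stand-in metric (NOT BLEU): percentage of candidate words found in any reference."""
    ref_words = {word for ref in references for word in ref.split()}
    cand = candidate.split()
    return 100 * sum(word in ref_words for word in cand) / len(cand) if cand else 0.0


# Invented examples, not data from the study.
humans = {
    "HT1": "talks will begin next week",
    "HT2": "negotiations start next week",
    "HT3": "the talks begin next week",
}

human_scores = leave_one_out_scores(humans, toy_overlap)
print(human_scores, mean(human_scores.values()))
```

Even in this toy case, the translator who chose different but equally valid words ("negotiations start") scores lowest, simply because the other translators phrased it another way.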

Our test found that BLEU is actually a poor predictor of whether a translation came from a human or a machine. In fact: 1) One online SMT system out-performed eight of 11 human translators in terms of BLEU scores; 2) the second SMT engine did not do as well, but still beat one or two of the HT options, depending on how many references we used; and 3) the RbMT system was soundly beaten by the human linguists, consistent with the well-known underperformance of such engines in BLEU.

However, even a casual examination shows that the English translations from the high-performing SMT system were significantly less fluent and accurate than the human references. Instead of measuring quality, the high scores demonstrate that BLEU can reward a word salad over a human translation if it happens to get chunks that resemble the references.

These results show us that the scores MT developers often use to show how great their technology is may not actually mean anything for potential adopters of the technology and that the changes they cite may not indicate any real difference.

In summary, if you are cynical and want to declare yourself a winner in developing MT software, you can take the following steps:

Tune your system to perform well against particular references. It won’t actually improve your quality, but you should be able to gain three or four points that you can cite as a major breakthrough.
Cite your “BLEU scores” without details. When adopters don’t know the references you used, they cannot compare your results against other systems. Thus, nobody can question your scores. While you are at it, throw in a few extra references to get an instant boost against your competitors. It’s fast and easy.
Focus on fluency over accuracy. Using combinations of words from your training data will give you better scores than you’ll get by trying to generate the correct meaning. This is bad news for developers of rule-based systems: They might actually provide a more useful result, but they won’t receive winning BLEU scores.
Emphasize the scores over other ways of assessing quality. Don’t work with user testing, rating, or feedback. These approaches might reveal that your differences in scores don’t mean anything in the real world. Instead, tell the public that BLEU scores “correlate to human judgment” – even if the findings actually show that human evaluators with access to reference translations tend to score translations based on whether the particular words in the reference are in the MT, not on how good the translation is.
Bonus point: Compare MT against human scores rather than “perfection.” This is how Google could transform modest – but real – improvements in fluency into claims that its new neural MT system reduced translation errors 55–85%. Human raters do not treat the references as perfect translations, and rate them accordingly, so focusing on them lowers the bar you have to clear and lets you magnify the importance of relatively small changes.
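The arithmetic behind the bonus point can be sketched with made-up numbers. The example assumes a headline "error reduction" is computed as the share of the gap between the old system and some target rating that the new system closes. The 0-to-6 scale and all four ratings are invented for illustration and are not figures from Google or any other study.

```python
def share_of_gap_closed(old, new, target):
    """Fraction of the distance from `old` to `target` that `new` covers."""
    return (new - old) / (target - old)


# Invented ratings on an assumed 0-to-6 scale; not figures from any study.
old_system, new_system = 4.0, 5.0
human_rating, perfect_rating = 5.2, 6.0

vs_human = share_of_gap_closed(old_system, new_system, human_rating)
vs_perfect = share_of_gap_closed(old_system, new_system, perfect_rating)
print(f"gap to human closed: {vs_human:.0%}, gap to perfect closed: {vs_perfect:.0%}")
```

The same one-point improvement looks much larger when the target is the human rating instead of a perfect score, because the lower target shrinks the denominator.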

As MT moves to become a mainstream solution, it is important that developers work toward meaningful ways of evaluating the quality of its output. Software developers’ continued emphasis on BLEU and other reference-based metrics only serves to muddy the water. Approaches that emphasize usability and user acceptance take more effort than automatic scores, but point the way toward a useful and practical discussion of MT quality.