Every new development in the field of machine translation (MT) is accompanied by a set of numbers that purport to show dramatic improvement in quality, usually BLEU or METEOR scores. These measures use a scale from 0 to 100 to quantify how similar the MT output is to one or more human translations of the same source text, based on a mechanical analysis of how many of the same words show up and how likely they are to appear in the same order. These scores frequently play a prominent role in marketing from MT providers, and any new technology will be billed with a sentence something like, “With our new WhizzBang technology, average BLEU scores rise by three points over the baseline engine.”
Developers like these numbers because they are easy to calculate and make it simple to quantify how changes affect system output. However, CSA Research and leading MT experts have pointed out for over a decade that these metrics are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another.
To test how big this effect is, CSA Research used a popular online statistical MT (SMT) system to translate 1,000 segments of Chinese news data into English and compared its output with 11 human translations of the same content. We then generated over 1,000 BLEU scores: one for every possible combination of those 11 references. The results allowed us to see the full range of scores for the same translation.
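The setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling CSA Research used: the `bleu` function is a simplified, unsmoothed sentence-level variant written for this example, the function names are hypothetical, and it assumes that "every possible combination" means every non-empty subset of the references (for 11 references, that reading gives 2,047 subsets).

```python
from collections import Counter
from itertools import combinations
from math import exp, log


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (no smoothing)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by the most times it appears in any one reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(log(matched / total))
    # Brevity penalty uses the reference length closest to the hypothesis length.
    closest = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(hyp)), rl))
    bp = 1.0 if len(hyp) >= closest else exp(1 - closest / len(hyp))
    return 100 * bp * exp(sum(log_precisions) / max_n)


def scores_for_all_reference_sets(hypothesis, references, max_n=4):
    """Score one hypothesis against every non-empty subset of the references."""
    results = {}
    for size in range(1, len(references) + 1):
        for subset in combinations(range(len(references)), size):
            chosen = [references[i] for i in subset]
            results[subset] = bleu(hypothesis, chosen, max_n)
    return results


if __name__ == "__main__":
    # Toy data invented for illustration; not taken from the study.
    hypothesis = "the cat sat on the mat today"
    references = [
        "the cat sat on the mat",
        "a cat was sitting on the mat today",
        "today the cat is on the mat",
    ]
    scores = scores_for_all_reference_sets(hypothesis, references, max_n=2)
    print(len(scores), min(scores.values()), max(scores.values()))
```

Even on toy data, the same output typically earns a different score for each reference set, which is the effect the experiment measured at scale.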
Our findings showed that BLEU scores depend heavily on which references we chose and how many of them we used, with the highest score almost four times the lowest. We found that:
We ran a second test that revealed even bigger problems. If higher BLEU scores equal better quality and human translations are presumed to exhibit higher quality than MT, the human translations should themselves score well. To determine whether this was true, we took the individual human translations and calculated BLEU scores for them using the others as references. We then compared the average scores of the human translations with those of one rule-based (RbMT) and two statistical systems.
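The leave-one-out procedure in this second test can be sketched as follows. The helper names are hypothetical, and `unigram_precision` is only a stand-in metric (clipped unigram precision, not BLEU) so that the example stays self-contained; any reference-based metric with the same `(hypothesis, references)` signature could be passed in instead.

```python
from collections import Counter


def unigram_precision(hypothesis, references):
    """Stand-in metric (not BLEU): clipped unigram precision on a 0-100 scale."""
    hyp_counts = Counter(hypothesis.split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, max_ref_counts[w]) for w, c in hyp_counts.items())
    return 100 * matched / total


def leave_one_out(human_translations, metric):
    """Score each human translation against all the other human translations."""
    scores = {}
    for name, text in human_translations.items():
        others = [t for other, t in human_translations.items() if other != name]
        scores[name] = metric(text, others)
    return scores


if __name__ == "__main__":
    # Toy data invented for illustration; not taken from the study.
    humans = {
        "translator_1": "the minister spoke to reporters on monday",
        "translator_2": "on monday the minister talked with the press",
        "translator_3": "the minister addressed journalists monday",
    }
    for name, score in leave_one_out(humans, unigram_precision).items():
        print(name, round(score, 1))
```

A human translator who phrases things differently from the others scores low under this scheme, regardless of how good the translation is.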
Our test found that BLEU is actually a poor predictor of whether a translation came from a human or a machine. In fact: 1) one online SMT system outperformed eight of the 11 human translators in terms of BLEU scores; 2) the second SMT engine did not do as well, but still beat one or two of the human translation (HT) options, depending on how many references we used; and 3) the RbMT system was soundly beaten by the human linguists, consistent with the well-known underperformance of such engines in BLEU.
However, even a casual examination shows that the English translations from the high-performing SMT system were significantly less fluent and accurate than the human references. Instead of measuring quality, the high scores demonstrate that BLEU can reward a word salad over a human translation if it happens to get chunks that resemble the references.
These results show us that the scores MT developers often use to show how great their technology is may not actually mean anything for potential adopters of the technology and that the changes they cite may not indicate any real difference. In summary, if you are cynical and want to declare yourself a winner in developing MT software, you can take the following steps:
As MT moves to become a mainstream solution, it is important that developers work toward meaningful ways of evaluating the quality of its output. Software developers’ continued emphasis on BLEU and other reference-based metrics only serves to muddy the water. Approaches that emphasize usability and user acceptance take more effort than automatic scores, but point the way toward a useful and practical discussion of MT quality.
Senior Analyst
Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization