The Path to Statistical MT Success is Paved with Big Data - and Not Just Yours
Better machine translation (MT) is surely at the top of many organizations’ globalization wishlists. But they all wonder how they can develop good MT engines that produce usable results. Two approaches, often used in combination with one another, allow them to collect enough linguistic assets to train their MT:
- Harvest open data sources. Since the open-source Moses system debuted in 2007, many organizations have crawled the internet for parallel language resources. They feed these into statistical MT (SMT) systems to generate their own translations. This approach has worked well for organizations like Alphabet (the parent company of Google) and Microsoft, both companies with search-based access to vast stores of content. However, their MT results are uneven and for many language pairs – particularly those involving “complex” languages – there simply is not enough publicly available data to achieve the desired results.
- Mine their own translations. What if you don’t have access to all that data? The alternative approach is for organizations to curate their own bilingual corpora of translated content to build systems. The results are typically superior to those generated from data they find on the web because the resulting training data reflects their organizational terminology and writing styles for both source and target. However, this approach can be difficult due to content mismanagement: Few organizations maintain a centralized repository of translations, so any initiative to find, align, and leverage their own linguistic assets can become a huge project.
Both of these approaches run into what experts characterize as “contaminated data”: Organizations may inadvertently end up using their own MT output as training data, thus decreasing quality. Take the case of the European Commission (EC): Many MT developers trained their engines on EC data. As a result, resources that the EC could theoretically use to improve its systems are increasingly influenced by those very systems, and relevant new “clean” datasets are becoming rarer. Accordingly, there are hard limits on how much better MT output based on these resources can get. The problem is particularly acute for less common languages, for which non-EC open datasets are few in number.
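To make the contamination problem concrete, here is a minimal sketch of one possible safeguard: dropping sentence pairs whose target side matches text already known to be MT output. It assumes the organization keeps a log of its own engine's output, which many do not. The function names `normalize` and `drop_known_mt_output` are hypothetical and do not come from any particular MT toolkit. Exact matching like this only catches unedited machine output; detecting post-edited or third-party MT text is a much harder problem that this sketch does not attempt.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace so trivial variants compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def drop_known_mt_output(pairs, known_mt_outputs):
    """Return only the (source, target) pairs whose target is not logged MT output.

    `pairs` is an iterable of (source, target) strings; `known_mt_outputs` is an
    iterable of target-language strings previously produced by an MT engine.
    Both inputs are illustrative assumptions about how the data is stored.
    """
    seen = {normalize(target) for target in known_mt_outputs}
    return [(src, tgt) for src, tgt in pairs if normalize(tgt) not in seen]


corpus = [
    ("Hello world", "Bonjour le monde"),
    ("Good morning", "Bon  matin"),  # matches logged MT output after normalization
]
mt_log = ["bon matin"]

clean_corpus = drop_known_mt_output(corpus, mt_log)
print(clean_corpus)
```

In this toy run, only the first pair survives, because the second pair's target matches an entry in the MT log once case and whitespace are normalized.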
What can organizations do to move forward if they have exhausted what they can get from open public (or paid) data and their own translations, or if they lack a sufficient centralized repository of translations?
The EC recently launched CEF.AT, an ambitious program to provide on-demand MT among 26 European languages (using English as a pivot language for most pairs). To address these problems of data contamination and scarcity, the EC is working with the European Language Resource Coordination (ELRC) to gather and validate previously unused or underused language resources from non-EC public administrations (such as national ministries) inside the EU member states. The EC is uniquely suited to access these public resources in a way that corporations cannot because of data access and privacy laws.
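For readers unfamiliar with pivoting, the idea is simply to chain two engines: source to English, then English to target. The sketch below illustrates that composition only and is not a description of how CEF.AT is implemented. The helper names and the toy word-level lookup tables are hypothetical stand-ins for real MT engines.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def lookup_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate


def make_pivot_translator(source_to_pivot: Translator,
                          pivot_to_target: Translator) -> Translator:
    """Compose two engines so text goes source -> pivot -> target."""
    def translate(text: str) -> str:
        return pivot_to_target(source_to_pivot(text))
    return translate


# Toy Polish -> English and English -> Portuguese "engines" (illustrative only).
pl_to_en = lookup_translator({"dobry": "good", "dzień": "day"})
en_to_pt = lookup_translator({"good": "bom", "day": "dia"})

pl_to_pt = make_pivot_translator(pl_to_en, en_to_pt)
print(pl_to_pt("dobry dzień"))
```

The practical appeal is that 26 languages need engines only to and from the pivot rather than one per language pair. The tradeoff is that any error in the first hop is carried into the second.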
Because the EC can centralize these datasets and build on its expertise in MT, it will provide services of a quality and breadth that no member state could develop on its own. The results will help meet the EC’s internal needs and will also support the member states’ requirements for translation, thus giving them an incentive to contribute. Although reaching out to external bodies with relevant content is expensive, it is the only way to move forward when more accessible data sources are tapped out.
The EC is not alone in dealing with the problems of scarce and contaminated training data. While most businesses and government agencies do not face trials as daunting as the EC’s, the situation it finds itself in threatens many global enterprises. If the results of CEF.AT meet expectations, the service will create a model for collaborative efforts that leverage content across organizational boundaries (for example, between a company and its suppliers or business partners). We will monitor CEF.AT closely to see how well this new approach works.
In the meantime, CSA Research is currently conducting surveys on MT usage by LSPs and global enterprises that purchase MT services in order to discover the latest trends in this field. If you have not already taken these surveys, we encourage you to fill them out (even if you are not using MT) before February 17. All respondents will receive a summary of responses. The results will help us understand where organizations are today and how they are responding to the challenges they face in a world where customers increasingly see translation as a basic service.
About the Author
Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization