08 Feb

The Path to Statistical MT Success is Paved with Big Data - and Not just Yours

Better machine translation (MT) is surely at the top of many organizations’ globalization wishlists. But they all wonder how they can develop good MT engines that produce usable results. Two approaches, often used in combination with one another, allow them to collect enough linguistic assets to train their MT:

  • Harvest open data sources. Since the open-source Moses system debuted in 2007, many organizations have crawled the internet for parallel language resources. They feed these into statistical MT (SMT) systems to generate their own translations. This approach has worked well for organizations like Alphabet (the parent company of Google) and Microsoft, both of which have search-based access to vast stores of content. However, their MT results are uneven, and for many language pairs, particularly those involving “complex” languages, there simply is not enough publicly available data to achieve the desired results.

  • Mine their own translations. What if you don’t have access to all that data? The alternative is for organizations to curate their own bilingual corpora of translated content and build systems from them. The results are typically superior to those generated from data found on the web because the resulting training data reflects the organization’s terminology and writing style in both source and target. However, this approach can be difficult due to content mismanagement: Few organizations maintain a centralized repository of translations, so an initiative to find, align, and leverage their own linguistic assets can be a huge project.

Both of these approaches run into what experts characterize as “contaminated data”: MT developers may end up inadvertently using machine-translated output as training data, which decreases quality. Take the case of the European Commission (EC): Many MT developers trained their engines on EC data. As a result, resources that the EC could theoretically use to improve its systems are increasingly influenced by those very systems, and relevant new “clean” datasets are becoming increasingly rare. Accordingly, there are hard limits on how much better MT output based on these resources can get. The problem is particularly acute for less common languages, where non-EC open datasets are few in number.
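One way to picture the contamination problem is as a provenance filter over a parallel corpus. The following is a minimal Python sketch, not a description of how the EC or any vendor actually works. It assumes each segment pair carries a hypothetical `origin` label ("human", "mt", or "unknown"); the `SegmentPair` type and the `filter_clean_pairs` function are invented for illustration. In practice such labels are often missing, which is precisely why contamination is hard to remove.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    source: str
    target: str
    origin: str  # hypothetical provenance label: "human", "mt", or "unknown"


def filter_clean_pairs(pairs, allowed_origins=("human",)):
    """Keep pairs with trusted provenance and drop exact duplicates."""
    seen = set()
    clean = []
    for pair in pairs:
        # Skip anything not known to be human-translated.
        if pair.origin not in allowed_origins:
            continue
        # Normalize lightly so trivially repeated pairs count once.
        key = (pair.source.strip().lower(), pair.target.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        clean.append(pair)
    return clean


corpus = [
    SegmentPair("Good morning", "Guten Morgen", "human"),
    SegmentPair("Good morning", "Guten Morgen", "human"),        # duplicate
    SegmentPair("Press the button", "Drücken Sie die Taste", "mt"),
    SegmentPair("Thank you", "Danke", "unknown"),
]

clean = filter_clean_pairs(corpus)
print(len(clean))  # 1
```

Treating "unknown" as untrusted is a deliberately conservative choice: it shrinks the corpus, which illustrates the tension between scarcity and contamination described above.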

What can organizations do to move forward if they have exhausted what they can get from open public (or paid) data and their own translations, or if they lack a sufficient centralized repository of translations?

The EC recently launched CEF.AT, an ambitious program to provide on-demand MT among 26 European languages (using English as a pivot language for most pairs). To address these problems of data contamination and scarcity, the EC is working with the European Language Resource Coordination (ELRC) to gather and validate previously unused or underused language resources from non-EC public administrations (such as national ministries) in the EU member states. The EC is uniquely suited to access these public resources in a way that corporations cannot because of data access and privacy laws.
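The appeal of a pivot language comes down to simple arithmetic, sketched below in Python. This is an illustration under stated assumptions rather than a description of CEF.AT's architecture: it assumes English is one of the 26 languages and that every pair is routed through the pivot, whereas the program uses the pivot only for most pairs. The function names and the toy lookup-table "engines" are hypothetical.

```python
def direct_engine_count(n_languages: int) -> int:
    """Directed engines needed if every language pair is trained directly."""
    return n_languages * (n_languages - 1)


def pivot_engine_count(n_languages: int) -> int:
    """Directed engines needed if every pair is routed through one pivot."""
    return 2 * (n_languages - 1)


def translate_via_pivot(text, to_pivot, from_pivot):
    """Chain two engines: source -> pivot, then pivot -> target."""
    return from_pivot(to_pivot(text))


# Toy stand-ins for real engines (hypothetical lookup tables, not real MT).
de_to_en = {"Guten Morgen": "Good morning"}.get
en_to_fr = {"Good morning": "Bonjour"}.get

print(direct_engine_count(26))  # 650
print(pivot_engine_count(26))   # 50
print(translate_via_pivot("Guten Morgen", de_to_en, en_to_fr))  # Bonjour
```

Pivoting also concentrates the data requirement on pairs that include English, where parallel text tends to be more plentiful. The trade-off is that errors from the first hop carry into the second.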

Because the EC can centralize these datasets and build on its expertise in MT, it will provide services of a quality and breadth that no member state could develop on its own. The results will help meet the EC’s internal needs and will also support the member states’ requirements for translation, thus giving them an incentive to contribute. Although reaching out to external bodies with relevant content is expensive, it is the only way to move forward when more accessible data sources are tapped out.

The EC is not alone in dealing with the problems of training data scarcity and contamination. While most businesses and government agencies do not face trials as daunting as the EC’s, the situation it finds itself in threatens many global enterprises. If the results of CEF.AT meet expectations, the service will create a model for collaborative efforts that leverage content across organizational boundaries (for example, between a company and its suppliers or business partners). We will monitor CEF.AT closely to see how well this new approach works.

In the meantime, CSA Research is currently conducting surveys on MT usage by LSPs and global enterprises that purchase MT services in order to discover the latest trends in this field. If you have not already taken these surveys, we encourage you to fill them out (even if you are not using MT) before February 17. All respondents will receive a summary of responses. The results will help us understand where organizations are today and how they are responding to the challenges they face in a world where customers increasingly see translation as a basic service.

About the Author

Arle Lommel

Senior Analyst

Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization
