Machine Translation

The Path to Statistical MT Success is Paved with Big Data - and Not just Yours

December 5, 2018

Arle Lommel

Better machine translation (MT) is surely at the top of many organizations’ globalization wishlists. But they all wonder how they can develop good MT engines that produce usable results. Two approaches, often used in combination with one another, allow them to collect enough linguistic assets to train their MT:

Harvest open data sources. Since the open-source Moses system debuted in 2007, many organizations have crawled the internet for parallel language resources. They feed these into statistical MT (SMT) systems to generate their own translations. This approach has worked well for organizations like Alphabet (the parent company of Google) and Microsoft, both of which have search-based access to vast stores of content. However, their MT results are uneven, and for many language pairs, particularly those involving “complex” languages, there simply is not enough publicly available data to achieve the desired results.

Mine their own translations. What if you don’t have access to all that data? The alternative is for organizations to curate bilingual corpora of their own translated content and build systems from them. The results are typically superior to those generated from data found on the web because the training data reflects the organization’s terminology and writing style in both source and target. However, this approach can be difficult because of content mismanagement: few organizations maintain a centralized repository of translations, so finding, aligning, and leveraging their own linguistic assets can be a huge project.

Both approaches must contend with what experts characterize as “contaminated data”: developers may inadvertently end up using output from their own systems as training data, which decreases quality. Take the case of the European Commission (EC). Many MT developers trained their engines on EC data. As a result, the resources that the EC could theoretically use to improve its systems are increasingly influenced by those very systems, and relevant new “clean” datasets are becoming rarer. Accordingly, there are hard limits on how much better MT output based on these resources can get. The problem is particularly acute for less common languages, for which non-EC open datasets are few in number.
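
One way to picture a guard against contaminated data is a filter that discards candidate training pairs whose target side is nearly identical to what an existing engine already produces for the same source. The following is only a minimal sketch under stated assumptions, not a description of how the EC or any vendor handles the problem: the function names, the 0.9 threshold, and the lookup-table "engine" are all hypothetical, and the similarity measure is simply the one available in Python's standard library.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] computed with difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def filter_suspected_mt(
    pairs: Iterable[Pair],
    mt_translate: Callable[[str], str],
    threshold: float = 0.9,  # assumed cut-off; would need tuning on real data
) -> List[Pair]:
    """Keep only pairs whose target differs enough from the engine's own output."""
    clean = []
    for source, target in pairs:
        if similarity(mt_translate(source), target) < threshold:
            clean.append((source, target))
    return clean


# Hypothetical stand-in for an existing MT engine: a fixed lookup table.
_FAKE_ENGINE = {
    "Guten Morgen": "Good morning",
    "Das Haus ist alt": "The house is old",
}


def fake_engine(source: str) -> str:
    return _FAKE_ENGINE.get(source, "")


candidates = [
    ("Guten Morgen", "Good morning"),                  # matches engine output
    ("Das Haus ist alt", "That building is ancient"),  # independent wording
]
kept = filter_suspected_mt(candidates, fake_engine)
print(kept)
```

A filter like this is crude: it also throws away human translations that happen to agree with the engine, and it cannot detect output from engines it has no access to, which is part of why clean data remains scarce.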

For organizations that have exhausted what they can get from open public (or paid) data or from their own translations, or that lack a sufficient centralized repository of translations, what can they do to move forward?

The EC recently launched CEF.AT, an ambitious program to provide on-demand MT among 26 European languages (using English as a pivot language for most pairs). To address these problems of data contamination and scarcity, the EC is working with the European Language Resource Coordination (ELRC) to gather and validate previously unused or underused language resources from non-EC public administrations (such as national ministries) in the EU member states. The EC is uniquely suited to access these public resources in a way that corporations cannot because of data access and privacy laws.

Because the EC can centralize these datasets and build on its expertise in MT, it will provide services of a quality and breadth that no member state could develop on its own. The results will help meet the EC’s internal needs and will also support the member states’ translation requirements, which gives them an incentive to contribute. Although reaching out to external bodies with relevant content is expensive, it is the only way to move forward when more accessible data sources are tapped out.
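
The pivot-language arrangement mentioned above can be illustrated with a short sketch. It shows the general technique only and makes no claim about how CEF.AT is built: the registry of engines, the function names, and the toy dictionary "engines" are hypothetical. The idea is that when no direct engine exists for a pair, the text is translated into the pivot language and then from the pivot into the target.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]
EngineRegistry = Dict[Tuple[str, str], Translator]  # (source lang, target lang) -> engine


def make_pivot_translator(engines: EngineRegistry, pivot: str = "en"):
    """Return a translate() function that falls back to two hops through the pivot."""

    def translate(text: str, src: str, tgt: str) -> str:
        if src == tgt:
            return text
        direct = engines.get((src, tgt))
        if direct is not None:
            return direct(text)  # a direct engine exists for this pair
        to_pivot = engines[(src, pivot)]    # raises KeyError if the pair is unsupported
        from_pivot = engines[(pivot, tgt)]
        return from_pivot(to_pivot(text))

    return translate


def direct_engine_count(n_languages: int) -> int:
    """Engines needed if every ordered language pair had its own engine."""
    return n_languages * (n_languages - 1)


def pivot_engine_count(n_languages: int) -> int:
    """Engines needed if every pair is routed through a single pivot language."""
    return 2 * (n_languages - 1)


# Toy engines backed by lookup tables (hypothetical stand-ins for real systems).
toy_engines: EngineRegistry = {
    ("de", "en"): {"Hallo": "Hello"}.__getitem__,
    ("en", "fr"): {"Hello": "Bonjour"}.__getitem__,
}
translate = make_pivot_translator(toy_engines)
print(translate("Hallo", "de", "fr"))
print(direct_engine_count(26), pivot_engine_count(26))
```

The arithmetic shows why pivoting is attractive when data is scarce: 26 languages give 650 ordered pairs, but routing everything through one pivot needs only 50 engines, each of which can draw on the comparatively plentiful data paired with the pivot language. The trade-off is that errors from the first hop carry into the second.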

The EC is not alone in dealing with the problems of training-data scarcity and contamination. While most businesses and government agencies do not face trials as daunting as the EC’s, the situation in which it finds itself threatens many global enterprises. If the results of CEF.AT meet expectations, the service will create a model for collaborative efforts that leverage content across organizational boundaries (for example, between a company and its suppliers or business partners). We will monitor CEF.AT closely to see how well this new approach works.

In the meantime, CSA Research is conducting surveys on MT usage by LSPs and by global enterprises that purchase MT services in order to discover the latest trends in this field. If you have not already taken these surveys, we encourage you to fill them out (even if you are not using MT) before February 17. All respondents will receive a summary of responses. The results will help us understand where organizations are today and how they are responding to the challenges they face in a world where customers increasingly see translation as a basic service.

Meet Our Analyst

Arle Lommel

VP Of Research

After obtaining a BA in linguistics in 1997, I began working for the now-defunct Localization Industry Standards Association (LISA), where I headed up standards development and worked on quality assessment models. At the same time, I completed a...
