Better machine translation (MT) is surely at the top of many organizations’ globalization wishlists. But they all wonder how they can develop good MT engines that produce usable results. Two approaches, often used in combination with one another, allow them to collect enough linguistic assets to train their MT:
Both of these approaches run into what experts characterize as “contaminated data”: organizations may inadvertently end up using their own MT output as training data, which decreases quality. Take the case of the European Commission (EC). Many MT developers trained their engines on EC data. As a result, the resources that the EC could theoretically use to improve its systems are increasingly influenced by those very systems, and relevant new “clean” datasets are becoming rarer. Accordingly, there are hard limits on how much better MT output based on these resources can get. The problem is particularly acute for less common languages, where non-EC open datasets are few in number.
What can organizations do to move forward once they have exhausted what they can get from open public (or paid) data or their own translations, or when they lack a sufficient centralized repository of translations?
The EC recently launched CEF.AT, an ambitious program to provide on-demand MT among 26 European languages (using English as a pivot language for most pairs). To address these problems of data contamination and scarcity, the EC is working with the European Language Resource Coordination (ELRC) to gather and validate previously unused or underused language resources from non-EC public administrations (such as national ministries) inside the EU member states. Because of data access and privacy laws, the EC is uniquely suited to access these public resources in a way that corporations cannot. Because the EC can centralize these datasets and build on its expertise in MT, it will provide services of a quality and breadth that no member state could develop on its own. The results will help meet the EC’s internal needs, but will also support the member states’ requirements for translation, thus giving them an incentive to contribute.
Although reaching out to external bodies with relevant content is expensive, it is the only way to move forward when more accessible data sources are tapped out. The EC is not alone in dealing with the scarcity and contamination of training data. While most businesses and government agencies do not face trials as daunting as the EC’s, the situation it finds itself in threatens many global enterprises. If the results of CEF.AT meet expectations, the service will create a model for collaborative efforts that leverage content across organizational boundaries (for example, between a company and its suppliers or business partners). We will monitor CEF.AT closely to see how well this new approach works.
In the meantime, CSA Research is conducting surveys on MT usage by LSPs and by global enterprises that purchase MT services in order to discover the latest trends in this field. If you have not already taken these surveys, we encourage you to fill them out before February 17, even if you are not using MT. All respondents will receive a summary of responses. The results will help us understand where organizations are today and how they are responding to the challenges they face in a world where customers increasingly see translation as a basic service.
Senior Analyst
Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization