Are You Ready for the Challenge of Machine-Generated Content?

14 Jun

User-generated content (UGC) has garnered a lot of attention because of the challenges it poses for localization: an abundance of spelling errors, meaning that depends heavily on context, a lack of consistency, and time sensitivity. But even as enterprises and language service providers (LSPs) struggle to deal with it, another type of generated content has been quietly swelling into a looming tsunami: machine-generated content (MGC). Today, a growing volume of content appears that no person has written and that – in some cases – no human may ever read. Nevertheless, these materials play a vital role in e-commerce and various other fields, and their volume may come to exceed that of UGC.

CSA Research first examined the importance of MGC last year and predicted that it would become a major concern for localization. While we were at Localization World in Warsaw last week, representatives of several companies told us that MGC is an increasingly important topic for them and that they are just now grappling with its implications. For some LSPs, translation of MGC has already become a significant revenue stream, albeit one that is hard to control and monetize.

MGC is not limited to text. Increasingly, machines also generate audio-visual content that seeks to drive clicks and ad impressions. AI-based agents produce videos from news stories and even generate music, all in a quest to convert computing power into cash. Some games feature MGC-based dialogue, in part as a way to deal with the sheer complexity of virtual worlds, which can contain millions of objects that players interact with and that may figure in audio, visual, or written content.

One of the challenges MGC poses is that humans are often only a secondary audience. Much of it is instead intended to communicate with other computers, which then use it to perform some action. Because this MGC is not written for humans, traditional linguistic processes may not be optimized to deal with it effectively. For example, an optimal e-commerce process might deliberately introduce apparent errors and inconsistencies into translations of MGC because they help it perform effectively in different markets.

Consider a product description for a tablet computer that contains misspelled terms, names of competitors’ products, and other items that would be considered mistakes in traditional localization. An effective localization would contain similar mistakes specific to the target locale, as long as they help match user-generated search strings to product descriptions and boost the ranking for a product. Much as “clickbait” articles target humans, these items try to capture the attention of the machines that decide what to show to humans. However, MGC can also pose a problem if the AIs and scripts that generate it do not have appropriate human supervision.

Beyond e-commerce, increasingly large numbers of Internet of Things (IoT) devices generate content that can be delivered to humans but more often targets other machine services for further processing. Although some of this information consists of sensor data, images, and other non-linguistic data, much of it takes the form of templated text that substitutes data for variables. This text may be structured – usually as XML or JSON – or it may be plain text. Machine-driven chains, in which automatic agents act upon content and then pass their own results on to other agents, are growing in importance, but they require machine translation services wherever multilingualism comes into play. For example, if a poorly designed IoT-enabled window sensor provides status updates in German plain text to a home-monitoring system from Spain, the system must be able to understand the meaning instantly and act upon it. Badly generated content, or MT that cannot interpret it, can destroy the ability of these systems to function together.
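To make the difference between plain-text and structured templated output concrete, here is a minimal Python sketch. It assumes a sensor that sends a language-neutral event code plus data as JSON, and a receiver that fills in a template for its own locale. The event code `window.open`, the field names, the `TEMPLATES` table, and the `render` function are all hypothetical illustrations, not taken from any real device or standard.

```python
import json
from string import Template

# Hypothetical message templates, keyed by (event code, locale).
# A real system would load these from localized resource files.
TEMPLATES = {
    ("window.open", "de"): Template("Fenster $name ist seit $minutes Minuten geöffnet."),
    ("window.open", "es"): Template("La ventana $name lleva $minutes minutos abierta."),
    ("window.open", "en"): Template("Window $name has been open for $minutes minutes."),
}


def render(payload: str, locale: str, fallback: str = "en") -> str:
    """Render a structured event for a human reader in the given locale."""
    event = json.loads(payload)
    # Use the requested locale if a template exists, else the fallback locale.
    template = TEMPLATES.get((event["code"], locale)) or TEMPLATES[(event["code"], fallback)]
    return template.substitute(event["data"])


# The sensor sends a code and data, not a finished German sentence.
payload = json.dumps({"code": "window.open", "data": {"name": "K2", "minutes": 15}})

print(render(payload, "es"))
print(render(payload, "fr"))  # no French template, so the English fallback is used
```

In this arrangement the receiving system acts on the event code and never has to interpret a sentence, so no machine translation is needed between the two devices. A sensor that emits only the finished German sentence forces the Spanish system to parse or translate free text before it can do anything.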

The localization industry is still at the early stages of addressing these problems, and best practices have yet to emerge. Although semantic markup schemes based on microdata, such as Schema.org, help provide some of the needed interoperability, most of them remain fundamentally monolingual and cannot deal with the complexities of a multilingual world. This situation will not change until large technology developers start to feel the pain of processes that do not work or products that fail because they cannot communicate with other devices. Fixing these problems will require a combination of technological change, development of internationalization best practices for MGC, and expertise in how to work around limitations. Enterprises that master these skills will have a marketplace advantage, and those LSPs that gain experience in this area will find it a valuable tool in their transition to becoming global content service providers (GCSPs).

UGC posed fundamental challenges but is now something that organizations can deal with. MGC is likely to follow a similar path: today it is scarcely on the radar of most LSPs, but within a few years it will be a common content type for them. Facing the issues today can be a challenge, but those organizations that do will be better prepared to capitalize on opportunities in the future.