CSA Research

19 Sep

Intelligent Content Goes Global

Intelligent or smart content has been a dream since the late 1990s. The concept refers to text, data, and audio-visual materials that contain machine-interpretable information describing their structure, giving some guidance as to their meaning, and defining their relationship to other content. Various technologies have tried to deliver on the promise of content that machines can act upon. Today some approaches are beginning to bear fruit, but significant hurdles remain in the base technologies and in their use beyond English. CSA Research’s recently released reports “Four Futures for Global Intelligent Content” and “Four Trends in Intelligent Content for LSPs” explore the potential – and challenges – in this area.

One of the most serious deficiencies today is weak international support in the formats needed for intelligent content. In our reports, we analyzed Schema.org, a widely implemented specification for describing web-based content in terms of known schemas (that is, descriptions of how items fit together within a given field). This initiative has tremendous backing from major IT companies and web developers, and yet it does not provide mechanisms for handling information across languages. For example, its coverage of the medical field allows content authors to link a medicine to the conditions it treats, but the results, shown below, assume a common language (English in this case). In this scenario, a search could find all pages that list toothache as an “indication” but would not find pages that list Zahnschmerzen (German), fogfájás (Hungarian), or 歯痛 (Japanese) in the same fashion.

As a result, the intelligence encapsulated in the text is monolingual and cannot draw on information published in languages other than English. Because the project is still grappling with basic issues in its monolingual version, international concerns have low priority, so it may be some time before this situation changes.
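The toothache scenario can be made concrete with a minimal Python sketch. Everything in it is an assumption for illustration: the page records only loosely imitate Schema.org-style markup, the URLs are placeholders, and the `CONCEPTS` table is a hypothetical cross-language mapping that Schema.org itself does not supply. It contrasts a search on the literal label with a search on a shared concept identifier.

```python
# Illustrative page records (placeholders, not real markup or real URLs).
PAGES = [
    {"url": "https://example.com/en/drug-a", "lang": "en", "indication": "toothache"},
    {"url": "https://example.com/de/drug-a", "lang": "de", "indication": "Zahnschmerzen"},
    {"url": "https://example.com/hu/drug-a", "lang": "hu", "indication": "fogfájás"},
    {"url": "https://example.com/ja/drug-a", "lang": "ja", "indication": "歯痛"},
]

# Hypothetical concept table linking labels across languages.
# This is an assumption of the sketch, not a feature of Schema.org.
CONCEPTS = {
    "toothache": "concept:toothache",
    "zahnschmerzen": "concept:toothache",
    "fogfájás": "concept:toothache",
    "歯痛": "concept:toothache",
}


def find_by_label(pages, label):
    """Return URLs whose indication text matches the label (case-insensitive)."""
    wanted = label.casefold()
    return [p["url"] for p in pages if p["indication"].casefold() == wanted]


def find_by_concept(pages, label, concepts):
    """Return URLs whose indication maps to the same concept as the label."""
    concept = concepts.get(label.casefold())
    if concept is None:
        return []
    return [
        p["url"]
        for p in pages
        if concepts.get(p["indication"].casefold()) == concept
    ]


if __name__ == "__main__":
    print(find_by_label(PAGES, "toothache"))              # English page only
    print(find_by_concept(PAGES, "toothache", CONCEPTS))  # all four pages
```

The label search stays monolingual, while the concept search reaches the German, Hungarian, and Japanese pages. Building and maintaining a table like `CONCEPTS` is the hard part, and it is the kind of cross-locale linking that is missing today.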

Limitations such as these mean that many intelligent content initiatives fail when their developers try to extend them globally. In CSA Research’s recent coverage of this topic, we identified four challenges that developers face in this arena:

Semantic markup requires better multilingual support. Building semantic links is hard enough in English, but the resources and formats are lacking for many of the world’s languages. Automated content enrichment (ACE) provides a way to start filling in the gaps. When developers link data across locales, it helps build information that benefits both human and machine translation.
Machine-generated content (MGC) threatens to swamp localization processes. An increasing share of the text, video, and audio you encounter on the web is created by computers rather than people. Humans may design the templates, but many pages do not exist until the moment they are requested. For example, a weather page about Patagonia in Finnish appears based on an underlying model that interprets data and renders it in another language. Because such content is created at the moment it is needed, enterprises and LSPs cannot develop it in advance, but they can work together to design automated processes that render it upon request. MGC is one area where localization providers can deliver significant benefit beyond the alternative of raw machine translation.
Conversational content requires models of interaction. As our analysis in “The Winds of Content Are Changing” demonstrated, the shift to conversational content has profound implications for localization. The term refers to materials that exist in dialogue with users and respond to them. They often mix MGC and human-created text. The important thing is that machines must be able to store a sense of “state” – that is, where the conversation has been and what it has done. Without this capability, human-machine dialogue resembles conversations with severe amnesiacs. Unfortunately, a patchwork of rules, regulations, and cultural norms combines to make this task extremely difficult. Those LSPs that can help their clients negotiate this field can offer significant commercial advantage for next-generation user interfaces.
Speech-enabled and headless devices are transforming the nature of content. Increasingly large numbers of devices have no screen, no mouse, and no keyboard. They interact with users via speech – think of Alexa, Cortana, and Siri, as well as a host of other, lesser-known devices – or through remote interfaces that interpret streams of data from the device and make them accessible over the internet. The limitations of speech recognition technology in terms of language and domain coverage make such projects difficult to scale and will require LSPs to develop new skill sets in order to build the resources businesses will need.
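To illustrate the machine-generated content challenge described above, here is a minimal Python sketch of a page that does not exist until it is requested. The template strings, the `render_weather` function, and the sample reading are all hypothetical. A production system would also need to handle inflection, number formatting, and units for each locale, which is where localization expertise comes in.

```python
# Hypothetical per-locale templates, designed in advance by people.
TEMPLATES = {
    "en": "Place: {place}. Temperature: {temp_c} °C. Wind: {wind_ms} m/s.",
    "fi": "Paikka: {place}. Lämpötila: {temp_c} °C. Tuuli: {wind_ms} m/s.",
}


def render_weather(data, locale, fallback="en"):
    """Render a weather summary on request, falling back to a default locale."""
    template = TEMPLATES.get(locale, TEMPLATES[fallback])
    return template.format(**data)


if __name__ == "__main__":
    # Sample reading (made-up values); real data would arrive at request time.
    reading = {"place": "Patagonia", "temp_c": 12, "wind_ms": 9}
    print(render_weather(reading, "fi"))
```

The labeled-field layout is a deliberate simplification: it avoids inflecting the place name, which a full Finnish sentence would require. That design choice hints at why templates for MGC cannot simply be translated string by string.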

Each of these four topics has the potential to overturn what we mean by content and how we work with it. Despite the difficulties they pose for both enterprises and LSPs, they also point to a future in which content becomes ever more important and compelling. Companies that can solve these issues to create intelligent global content will have a decisive advantage going forward.