FREME Services Power Vistatec Automated Content Enrichment
The digital universe grows by a massive amount of structured and unstructured content every day. This flood of data comes in a broad range of standardized and proprietary formats and languages for uncounted application needs around the planet. Much of this data never leaves the silo for which it was created, but business analysts and information scientists have long researched how it might be useful in other applications, markets, and languages. For example, tagging the topics or categories of documents, data, and applications could make them reusable, discoverable by other applications, and perhaps even salable.
For decades theorists have encouraged programmers to document their code both internally and in repository indexes and begged content authors to create taxonomies, structure their writings, insert tags, and harvest terminology. The problem is that most programmers and content authors – already stressed by deadlines and productivity measures – fail to enhance their work with metadata (that is, information about the content) that would broaden its usefulness. Although recent advances – such as OpenCalais, which automatically enriches data with links to Thomson Reuters news articles – have simplified the task for narrow use cases, no systematic approach has emerged to support it, and the multilingual challenges in this area remain virtually unaddressed.
Automating content enrichment is the mission that the FREME project undertook in February 2015.
- The project. Funded by a European Commission Horizon 2020 Innovation Action, it will determine whether six multilingual and semantic technologies are ready for use in real-life business cases: 1) the Internationalization Tag Set (ITS) for tagging information related to content internationalization and localization; 2) Linked Data based on the Natural Language Processing Interchange Format (NIF), the DBpedia Ontology, and RDF; 3) an entity processing service to recognize, link, and classify entities – such as names, places, and events – in multilingual texts; 4) a terminology service to identify, manage, and annotate terms; 5) a machine translation service; and 6) a publishing service to package and export content in the open EPUB3 format.
- The goal. FREME's mission is to "build an open, innovative, commercial-grade framework of e-services for multilingual and semantic enrichment of digital content. Its e-services will be capable to process (harvest and analyze) content, capture datasets, and add value throughout content and data value chains across sectors, countries, and languages." The term the project employs is "content enrichment," but what these language technologies and linked data services allow for is the automated creation of metadata that can accompany code, data, and files wherever they go.
- The partners. The eight FREME partners are: language technology firms Tilde in Latvia and Vistatec in Ireland; web analytics firm Wripl in Ireland; agricultural and food sciences specialist Agroknow in Greece; iMinds, a use case partner for the publishing industry, in Belgium; and three research institutes, ISMB in Italy and DFKI and InfAI, both in Germany.
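To make the entity processing idea concrete, here is a minimal sketch of what "recognize, link, and classify" might look like in code. It is not FREME's implementation: the `GAZETTEER` lookup table, the `EntityAnnotation` class, and the `annotate` and `fragment_id` helpers are all hypothetical names invented for illustration, and the identifier format is only modeled loosely on the character-offset style that NIF is built around.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAnnotation:
    begin: int        # offset of the first character of the mention
    end: int          # offset one past the last character
    surface: str      # the text that was matched
    entity_type: str  # coarse class, such as "Place"
    link: str         # URL of the linked resource


# Hypothetical stand-in for a real entity service: a tiny lookup table.
GAZETTEER = {
    "Dublin": ("Place", "http://dbpedia.org/resource/Dublin"),
    "Riga": ("Place", "http://dbpedia.org/resource/Riga"),
}


def annotate(text):
    """Find every gazetteer entry in the text and record where it occurs."""
    annotations = []
    for name, (entity_type, link) in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            annotations.append(
                EntityAnnotation(start, end, name, entity_type, link)
            )
            start = text.find(name, end)
    return sorted(annotations, key=lambda a: a.begin)


def fragment_id(doc_url, annotation):
    """Build an offset-based identifier for one mention (assumed format)."""
    return f"{doc_url}#char={annotation.begin},{annotation.end}"


if __name__ == "__main__":
    sample = "The workshop moves from Dublin to Riga."
    for ann in annotate(sample):
        print(fragment_id("http://example.org/doc/1", ann), ann.entity_type, ann.link)
```

Because each annotation carries its own offsets and a link, the metadata can be stored separately from the text and still point back to the exact words it describes.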
Last week, Vistatec's CTO Phil Ritchie briefed us on the development of Ocelot, an open-source translation editor that the company is using as a deployment platform for FREME services and the backbone for its new "Deep Content" service. To start his demo, he opened a technical documentation file for translation. As soon as he did:
- Ocelot immediately began executing a pipeline of instructions behind the scenes. It called a cloud-based MT server to translate the document, identified acronyms that it looked up in DBpedia, harvested terms, and presented several other metadata elements to the editor. In just a few seconds, it produced metadata that would help a content author, translator, or reviewer better understand what the file contains and how it could be used.
- The Linked Data service used the URLs assigned to all of the participating components and software to define and document the relationships among them. Because these stored links persist, the objects can communicate their attributes to one another, and the internet becomes a vast database of semantic and business connections among those objects.
- As users open more files and applications, Ocelot picks up more attributes about what they do and how they do it, thus increasing the value of the content they contain – and eliminating much of the tedious work associated with today's mostly manual content-enrichment schemes. In the translation and localization arena, both end-buyers and language service providers will benefit with relatively little effort from the intelligent content that Vistatec's innovation brings to the table. Besides that, search engines will be able to zero in with a laser-like focus on the attributes these files carry with them, thus making today's tools seem crude compared to what they can find using intelligent content.
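The steps in the demo can be sketched as a small enrichment pipeline. This is an illustration under stated assumptions, not Ocelot's or FREME's code: the step functions, the `Document` class, and the hard-coded `ACRONYM_LINKS` table are hypothetical, the translation step is a placeholder that makes no network call, and term harvesting is reduced to simple word counting.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical lookup table standing in for a live DBpedia query.
ACRONYM_LINKS = {
    "MT": "http://dbpedia.org/resource/Machine_translation",
    "RDF": "http://dbpedia.org/resource/Resource_Description_Framework",
}


def translate(doc):
    # Placeholder: a real step would call a cloud MT server here.
    doc.metadata["translation"] = f"[machine translation of {len(doc.text)} characters]"
    return doc


def link_acronyms(doc):
    # Treat runs of two or more capitals as acronyms and keep those we can link.
    found = sorted(set(re.findall(r"\b[A-Z]{2,}\b", doc.text)))
    doc.metadata["acronyms"] = {a: ACRONYM_LINKS[a] for a in found if a in ACRONYM_LINKS}
    return doc


def harvest_terms(doc, min_count=2):
    # Naive term harvesting: words of four or more letters that repeat.
    counts = Counter(re.findall(r"[a-z]{4,}", doc.text.lower()))
    doc.metadata["terms"] = sorted(w for w, c in counts.items() if c >= min_count)
    return doc


PIPELINE = [translate, link_acronyms, harvest_terms]


def enrich(text):
    """Run every pipeline step in order and return the enriched document."""
    doc = Document(text)
    for step in PIPELINE:
        doc = step(doc)
    return doc


def as_triples(doc_url, doc):
    """Express the acronym links as (subject, predicate, object) statements."""
    predicate = "http://purl.org/dc/terms/subject"
    return [(doc_url, predicate, uri) for uri in doc.metadata["acronyms"].values()]


if __name__ == "__main__":
    doc = enrich("The MT engine and the RDF store share a pipeline. The pipeline calls the engine.")
    print(doc.metadata)
    print(as_triples("http://example.org/doc/1", doc))
```

Keeping each step as a function that takes and returns a document means new services can be added to the list without changing the others, and the final triples show how URL-to-URL links turn the collected metadata into linked data.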
FREME's funders expect that the project's enhanced access to public and private sector data will generate hundreds of multilingual applications and reuse billions of open data records. If the six e-services validated by this project work as well as they did in Vistatec's technology preview, we see opportunity for sharing and reuse that extends far beyond documents and their translations. For example, automated enrichment could make marketing software far more intelligent in processing the many inputs that it gets from the marketing supply and campaign management chain. It could make application code smarter and more transparent to its developers. And with its metadata available to analytics programs, it could make all this multilingual content in innumerable formats far more analyzable.
The European Commission has been funding language technology projects for nearly a decade under the Seventh Framework Programme (FP7) and Horizon 2020. At conferences that we've attended and addressed, the EC has highlighted its goal of sparking innovation in the European development community. Projects like FREME and Falcon show that its money is well spent. Initiatives like these add real value and make cutting-edge research technology accessible to the public in Europe and beyond.
An earlier version of this post omitted iMinds, the eighth partner in the FREME consortium.