Artificial Languages and the Dream of Universal Knowledge
What do Barsoomian, Esperanto, Klingon, Ku, Na’vi, and Tenctonese have in common? They are all languages created for film and television – except for Esperanto, whose creator, L. L. Zamenhof, sought to build a universal means of communication. All of them also represent a human desire to explore or undo the effects of Babel. Although such languages seem a thoroughly modern project, they belong to a scholarly tradition that has been underway since at least the Middle Ages and has spawned a whole series of “a priori philosophical languages.” These artificial tongues aspired to mirror the nature of reality directly in their structure. The most ambitious of their creators sought linguistic forms in which the meaning of a word could be understood directly from its form and sounds, even by a speaker who had never encountered the word before. For a masterful account of such creations, see Umberto Eco’s The Search for the Perfect Language, from which the examples below are taken.
In Cave Beck’s 1657 The Universal Character, for example, the sentence leb toreónfo pee tofosénsen and pif tofosénsen would unambiguously convey the Biblical commandment “honor your father and your mother,” with the individual words and pieces of words serving as a lookup table into a universal dictionary of concepts tied to no natural language. Another proposal, George Dalgarno’s 1661 Ars signorum, contained “words” such as Neik (“quadruped animals”), NeiPTeik (“warm-blooded animals”), Nƞk/pot (“horse = animal with an uncleft hoof-courageous”), and Nƞk/sof/pad (“mule = animal with an uncleft hoof-deprived-sex”), all distinguished by the meanings assigned to individual letters.
Although none of these creations proved practically usable – and may seem laughably naïve today – their creators put them forward as a way to eliminate ambiguity and promote world peace. In an era of nearly constant war, they believed that if everyone simply understood and agreed upon the nature of reality and how to discuss it, there would be no need for conflict. This train of thought inspired countless imitators, some of whom worked to create actually usable languages, such as Volapük (the bane of companies that simply dump ISO language lists into their software) and Esperanto, which recently received a high-profile shot in the arm when U.S. President Trump misspelled the name of his defense secretary, Mark Esper, as “Esperanto” in a tweet.
Even outside the realm of “let’s give peace a chance” projects such as Esperanto, the intellectual children of these efforts live on in the structure of library call numbers, the Standard Industrial Classification (SIC) codes used to classify vertical industries, bar codes, and almost any information cataloguing and retrieval activity. Even computer programming languages show their influence. Most of these derivatives have given up on the idea of universal knowledge classification to focus instead on narrower applications that they can realistically manage.
However, the development of big data, web-scale architecture, and automated content enrichment (ACE) is now reviving dreams of universal knowledge and understanding. Machines that can process large amounts of text and successfully disambiguate it can link words – theoretically in any language – to authoritative references for concepts. ACE thus allows machines to “understand” content, deliver more relevant results to readers, and act upon the meaning of the text. Artificial intelligence promises to usher in an era in which intent and meaning can be conveyed accurately, although human language consistently refuses to cooperate with these plans.
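The linking step at the heart of such enrichment can be sketched as a simple gazetteer lookup: surface words from different languages all resolve to a single language-independent concept identifier. This is a minimal toy sketch, not any real system’s implementation – production ACE adds statistical disambiguation on top – and the Wikidata-style IDs below are chosen purely for illustration.

```python
# Toy sketch of the ACE linking step: surface forms in several languages
# all map to one language-independent concept ID. The IDs are
# Wikidata-style identifiers used here for illustration only.
CONCEPTS = {
    "horse": "Q726",   # English
    "cheval": "Q726",  # French
    "pferd": "Q726",   # German
    "mule": "Q41692",
}

def enrich(text):
    """Return (word, concept_id) pairs for every word we can link."""
    links = []
    for word in text.split():
        w = word.strip(".,;:!?").lower()  # crude tokenization
        if w in CONCEPTS:
            links.append((w, CONCEPTS[w]))
    return links

# The German and English surface forms resolve to the same concept.
print(enrich("Das Pferd and the horse are one concept."))
# → [('pferd', 'Q726'), ('horse', 'Q726')]
```

A real system must also decide *which* concept an ambiguous word denotes in context (“mule” the animal versus the shoe), which is where the disambiguation the paragraph above describes comes in.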
Unfortunately, far too many of the resources meant to liberate content from the shackles of its superficial written form work only in English or Chinese, the main languages of academic research and the targets of most funding. As content creators move beyond a handful of major languages, they find that the resources they need to add intelligence-driven features for customers are few and far between, if they exist at all. Standard formats such as schema.org or Dublin Core rely on the surface form of words and so cannot provide intelligence across language boundaries. One of the next great challenges for researchers and the language industry – on par with going to the moon or Mars, if not more difficult – will be to extend such resources to cover the hundreds or thousands of tongues that would benefit from them.
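The surface-form limitation is easy to see in a toy example: two Dublin Core-style records about the same topic, tagged in different languages, cannot be retrieved together by subject string, whereas records linked to a shared concept ID can. The `subject_concept` field and the ID below are hypothetical conventions invented for this illustration, not part of Dublin Core.

```python
# Two Dublin Core-style records on the same topic, one tagged in English
# and one in German. Because dc:subject holds a literal word, a search
# for one surface form misses the other record entirely.
records = [
    {"dc:title": "Horse Breeding Basics", "dc:subject": "horses"},
    {"dc:title": "Grundlagen der Pferdezucht", "dc:subject": "Pferde"},
]

def find_by_subject(recs, subject):
    return [r["dc:title"] for r in recs if r["dc:subject"] == subject]

print(find_by_subject(records, "horses"))  # finds only the English record

# Linking subjects to a language-independent concept ID instead
# (a hypothetical convention, not part of Dublin Core itself)
# lets both records be retrieved together.
linked = [
    {"dc:title": "Horse Breeding Basics", "subject_concept": "Q726"},
    {"dc:title": "Grundlagen der Pferdezucht", "subject_concept": "Q726"},
]

def find_by_concept(recs, concept_id):
    return [r["dc:title"] for r in recs if r["subject_concept"] == concept_id]

print(find_by_concept(linked, "Q726"))  # finds both records
```

The point is not that these formats are badly designed, but that string-valued metadata stops at the language boundary while concept-valued metadata crosses it.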
As we celebrate AI’s dramatic advances into areas we could scarcely conceive of a few years ago, it is useful to look back and consider why earlier artificial languages failed to live up to their promises – and to find ways to ensure that today’s ACE- and machine learning-driven approaches avoid the same problems and deliver benefits to the entire world. Chatbots and personal virtual assistants may not create world peace, but they can deliver real benefit. Language service providers (LSPs) will play a major role in supplying the insight and data needed to realize these goals. If organizations succeed, they may finally realize the dreams of scholars spanning almost a thousand years of history.
About the Author