CSA Research

28 Nov

Translating the Library of Babel

In his short story “The Library of Babel,” Argentine author Jorge Luis Borges describes a building of seemingly limitless extent that contains, in no particular order, every possible 410-page book that can be written using Roman letters. Stephen L. Peck’s novella A Short Stay in Hell is the narrative of an individual who has been condemned to wander a literal version of the library in search of the single book that best describes his life: When he finds it, he will be liberated from hell and allowed to enter paradise.

At the time of narration, he has been in the library for 800 billion years and has examined only a tiny portion of it. In a note at the end, Peck calculates that his version, which is arranged in two columns of stacks around a central space – I imagine it as being something like Trinity College Dublin’s famous library writ large at a cosmic scale – contains about 2.345×10^2,594,773 books and is approximately 4.5×10^1,109,139 light years in height and depth. He notes that it contains more books than there are electrons in the universe. Although he does not calculate its volume, it would far exceed that of the observable universe, but at least his wanderer can take solace that the library doesn’t use the 100,000+ characters of Chinese or all of Unicode.
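For readers who want to check the scale of that first figure, here is a minimal sketch of the arithmetic. It rests on assumptions that this post does not state: that each book has 410 pages of 40 lines with 80 characters per line, and that the character set has 95 symbols (roughly a standard keyboard). The names `ALPHABET_SIZE`, `CHARS_PER_BOOK`, and `book_count_scientific` are hypothetical and used only for illustration. Because the number is far too large for a float, the sketch works with its base-10 logarithm.

```python
import math
from typing import Tuple

# Assumed book format (NOT stated in this post): 410 pages, 40 lines per page,
# 80 characters per line, drawn from an assumed 95-character keyboard set.
ALPHABET_SIZE = 95
CHARS_PER_BOOK = 410 * 40 * 80  # 1,312,000 character positions per book


def book_count_scientific(alphabet_size: int, chars_per_book: int) -> Tuple[float, int]:
    """Return (mantissa, exponent) with alphabet_size**chars_per_book ~ mantissa * 10**exponent."""
    # log10(a**n) = n * log10(a); this avoids building the huge number itself.
    log10_total = chars_per_book * math.log10(alphabet_size)
    exponent = math.floor(log10_total)
    mantissa = 10 ** (log10_total - exponent)
    return mantissa, exponent


mantissa, exponent = book_count_scientific(ALPHABET_SIZE, CHARS_PER_BOOK)
print(f"{mantissa:.3f} x 10^{exponent:,}")
```

Under these assumptions the result agrees with the book count quoted above. Changing `ALPHABET_SIZE` to a larger character set shows how quickly the exponent would grow.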

These numbers of course do not reflect our reality, but increasingly we live in a digital Library of Babel, stored as bits and bytes on countless hard drives and cloud servers. It is ephemeral – parts of it appear and disappear as needed, created by authors to serve both temporary and long-term needs. In 2016, CSA Research conducted a thought experiment that found that translation firms translate just 0.00000000009% of content generated each day. Even after discarding the portion that is not translatable, far less than 1% of all potentially useful content appears in another language. We estimated that it would take the combined population of the United States, China, and Indonesia – roughly two billion people – to render all of this content in a second language, not to mention the 16 languages needed to reach 90% of online economic potential in 2022 or the 3,000+ languages in the world. Although these numbers are minuscule compared to the Library of Babel, they nevertheless represent a level that could never be attained through unaided human endeavor.

Addressing the needs of this reality without machines would leave us in the shoes of Peck’s narrator, condemned to a “short stay” in hell. On the other hand, without human intelligence, machines would be no more capable of finding their way out of the library than the narrator. We need both machines’ abilities to index and process vast amounts of text and the skills of professional linguists to make sense of it. Either one alone will fail to deliver what is needed: Machine translation does not understand what it is translating and cannot address the changes that occur in language without access to a constant supply of human translation, and humans will never keep up with the volume of work that is needed.

Fortunately, new technologies – such as those implemented in augmented translation – make it easier for this vital collaboration to happen. They allow humans to focus on their strengths – creative language, dealing with new domains and concepts, and handling nuance – while allowing machines to do the heavy lifting for content that can be translated effectively based on training data. This capability acts as a force multiplier and allows humans to address millions of times more content than they can unaided. It would be as if the lonely wanderer in A Short Stay in Hell could dispatch ever-multiplying numbers of automata to scour the stacks in search of the perfect book, bringing him closer and closer to his eventual salvation.

As our recent blog post on the fire at the National Museum of Brazil shows, the task of digitizing content and making it accessible is vital if we are to preserve the human aspects of language. This task is urgent, and achieving this goal will require ever closer collaboration and integration between computers and humans. Only then can we escape our own short stay in the Library of Babel.