Happy 30th Birthday, Unicode!

I recently missed an important birthday that everyone in the language industry should have celebrated with fireworks and music. In October 1991, Unicode 1.0 was first released. In the 30 years since that publication, an entire generation of language workers has been educated and begun work, never having had to know the “joys” of trying to ungarble text that had gone through multiple encodings. I have no nostalgia for the days when LSPs had to hire an engineer just to deal with text encoding, or when projects came back because someone’s résumé on Windows had turned into rÈsumÈ on a Mac, or a German name like Müller had turned into M□ller with no warning. And if your internet content happened to pass through a seven-bit gateway, the loss might have been unrecoverable.
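
For anyone who never saw it happen, here is a minimal sketch of that exact résumé-to-rÈsumÈ mangling (the strings are purely illustrative). The mechanism was mundane: the same bytes were interpreted through two different legacy code pages – Windows-1252 on the PC and Mac OS Roman on the Mac – and a seven-bit gateway simply stripped the high bit.

```python
# Minimal sketch of pre-Unicode "mojibake": the same bytes interpreted
# through two different legacy code pages. Purely illustrative.

text = "résumé"

# A Windows application of the era typically wrote Windows-1252 bytes.
windows_bytes = text.encode("windows-1252")   # b'r\xe9sum\xe9'

# A Mac application reading the same file assumed Mac OS Roman,
# where byte 0xE9 means "È" rather than "é".
print(windows_bytes.decode("mac_roman"))      # rÈsumÈ

# And a seven-bit gateway that strips the high bit destroys the data outright.
seven_bit = bytes(b & 0x7F for b in windows_bytes)
print(seven_bit.decode("ascii"))              # risumi -- the accents are gone for good
```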

Those who have entered the language industry since the days when my beard was still red may not be familiar with the little boxes – colloquially known as “tofu” – that appear when software doesn’t recognize a character. They have come of age in a world fundamentally better than the one I started out in. They are used to text just working and to phones that switch between dozens of languages with different writing systems without any trouble.


Prior to Unicode, if you wanted to use a different writing system, you would switch to a different font that supported it. For example, someone working with Old Church Slavonic might create a font where ꙁ is stored in the slot for “z” and ꙗ is stored as “q”. Because these fonts were made to support specific projects and purposes, different people would come up with different approaches and keyboard mappings, ensuring that data was tied to specific fonts and could not be shared with anyone using a different font or approach – at least not without considerable difficulty. It also meant that natural language processing tools could not reliably know what content was represented or intended. Many lifetimes of effort still sit in dusty boxes of diskettes that cannot be read because nobody has the appropriate font today.
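
Rescuing data from such a font hack today comes down to reconstructing the slot assignments and translating the text into real Unicode code points. Here is a hedged sketch assuming only the two slots mentioned above; a real conversion table would have to cover every repurposed slot in the font.

```python
# Sketch of converting text typed with a hypothetical Old Church Slavonic
# "font hack" into proper Unicode. The mapping covers only the two slots
# mentioned in the text; a real table would be much larger.

FONT_HACK_TO_UNICODE = str.maketrans({
    "z": "\ua641",  # CYRILLIC SMALL LETTER ZEMLYA (ꙁ) lived in the "z" slot
    "q": "\ua657",  # CYRILLIC SMALL LETTER IOTIFIED A (ꙗ) lived in the "q" slot
})

legacy_text = "zq"  # looks like gibberish unless the original font is installed
print(legacy_text.translate(FONT_HACK_TO_UNICODE))  # ꙁꙗ
```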
 

Devanagari character picker in PopChar X, a commercial tool for inserting Unicode characters in text.

Double-byte character sets – such as Big5 and GBK for Traditional and Simplified Chinese respectively – were even more complex. Most applications simply failed to render the languages that used them (such as Japanese, Korean, and Chinese) or required specialized workarounds, such as one system I used in 1996 that let you type in Chinese but rendered it by switching between hundreds of single-byte fonts, each containing some portion of the Chinese character set. As I learned the hard way, a single change of font in that system could easily ruin a day’s work, so I saved frequently under different file names so that I could always roll back to a recent version.
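
Part of what made those double-byte sets so fragile is that the same character had a completely different byte sequence in each of them, so software had to know exactly which encoding it was handling. A quick illustration with a single character:

```python
# One character, three different byte sequences. Before Unicode, an
# application that guessed the wrong encoding simply rendered garbage.

char = "中"  # U+4E2D, CJK UNIFIED IDEOGRAPH-4E2D

print(char.encode("big5"))    # b'\xa4\xa4'     -- Traditional Chinese legacy encoding
print(char.encode("gbk"))     # b'\xd6\xd0'     -- Simplified Chinese legacy encoding
print(char.encode("utf-8"))   # b'\xe4\xb8\xad' -- the Unicode encoding used almost everywhere today
```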

In the late 1990s I had a job editing linguistic research papers. I typically spent about 40% of my time in that job tweaking fonts in FontLab and dealing with special character needs, all because QuarkXPress didn’t support Unicode. Then in August 1999, Adobe InDesign 1.0 came out; I switched to it and never looked back. Overnight that 40% of my work became about 5%. It shrank to 1–2% as Unicode-rich fonts appeared and I no longer had to kludge fonts or work with fine-grained character attributes to represent characters like ǘ or ǟ. By the last few times I worked in that field, in the late 2010s, it had reached 0% because Unicode support had improved so much. I probably owe the creators of InDesign and Unicode at least a lifetime supply of the beverage of their choice, because their efforts have saved me months of my life.
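
Characters like ǘ and ǟ are now just ordinary Unicode: each has a precomposed code point as well as an equivalent base-letter-plus-combining-marks sequence, and normalization converts between the two. A small sketch of what the standard library reports for them:

```python
import unicodedata

# ǘ (U+01D8) and ǟ (U+01DF) exist as single precomposed code points, and
# canonical normalization relates them to base letter + combining marks.

for char in ("\u01d8", "\u01df"):
    decomposed = unicodedata.normalize("NFD", char)   # e.g. u + U+0308 + U+0301
    recomposed = unicodedata.normalize("NFC", decomposed)
    print(char, [unicodedata.name(c) for c in decomposed], recomposed == char)
```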

However, as much as I am indebted to Unicode, its most meaningful contribution has not been in saving me hundreds of hours of work. Instead, it has been in enabling people around the world to enter the digital era. As of Unicode 14 (released in September 2021), the specification provides standard encodings for 93 modern scripts (writing systems) and 66 historic or ancient ones, with specializations for many languages. Speakers of hundreds of languages can represent their words in digital form because of Unicode. CSA Research now tracks 130+ languages with significant economic and cultural activity on the web, and hundreds of others appear at lower levels of usage. Many or most of these would not exist online were it not for Unicode.

So, raise a glass in honor of thirty years of Unicode if you remember the bad old days when you had an engineer named Kevin in a back room who spent his days and nights fixing encoding problems. Raise it if you don’t remember those days, precisely because you will never have to know what they were like. Raise it if you can use your native language online and it isn’t one that originated in a handful of Western European countries. Or finally, if you work with and love languages and communication across borders, thank the countless individuals who have contributed their time and effort to developing and expanding Unicode.

About the Author

Arle Lommel

Senior Analyst

Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization
