
17 Oct

Brown M&Ms and Bad Locale Tags

In the 1980s, the American rock band Van Halen became famous for a clause in its contracts with concert venues that required them to provide a bowl of M&Ms with all of the brown ones removed. At the time, this was widely seen as an example of how out of touch rock musicians were with reality, but it actually served a purpose. The band’s manager explained that if venues took care of the small details, he could be reasonably certain that they had also addressed more important things. However, if they did not pay attention to the request for no brown M&Ms, he knew that he had to double-check everything else. For example, a stage once collapsed under the weight of the rockers’ equipment because the venue team had not read the requirements for weight-bearing loads.

Similarly, when organizations publish content on the web, they have the option of tagging it for language and locale. If they get these tags correct – a small detail that is solved at the last moment before publication – it is a good sign that other aspects of their localization processes are working properly. By contrast, if language tags are missing, broken, or incorrect, it may indicate that their translation technology or approach has other problems. 

Language tags – which appear as the value of “lang” or “xml:lang” attributes in online content – are defined by the Internet Engineering Task Force’s (IETF’s) BCP 47 standard. When proper locale tags appear in HTML or XHTML pages, they play an important role in helping consumers, data-processing tools, and various processes determine what language the content is in. However, CSA Research’s examination of 4.1 million web pages from major organizations reveals that these tags are frequently incorrect, broken, or missing.
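To make this concrete, here is a minimal Python sketch of how a script might read the language a page declares. It relies only on the standard library's html.parser module. The class and function names are illustrative assumptions rather than part of any tool mentioned in this post, and the sketch looks only at the root html element, not at elements nested inside it.

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Record the language declared on the root <html> element, if any."""

    def __init__(self):
        super().__init__()
        self.declared_lang = None  # stays None when no usable tag is present
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "html" or self._seen_html:
            return
        self._seen_html = True
        attributes = dict(attrs)
        # Prefer "lang"; fall back to "xml:lang", which XHTML pages may use.
        value = attributes.get("lang") or attributes.get("xml:lang")
        if value and value.strip():
            self.declared_lang = value.strip()


def declared_language(html_text):
    """Return the declared language tag of a page, or None if it is absent."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    return parser.declared_lang
```

A page whose root element carries no lang or xml:lang attribute, or carries an empty one, yields None, which is the "missing" case.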

At first glance, these problems – like a stray brown M&M in Van Halen’s dressing room – may appear to be quite minor, but we found that just two-thirds of these pages correctly identify their language. Over one quarter do not identify language at all, and 6% get it wrong. The reasons that sites have incorrect information range from simple misspellings, to broken HTML templates, to more serious process corruption.
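Some of these failures, such as misspellings and unfilled template placeholders, can be caught by checking the shape of the tag alone. The following Python sketch is an assumption-laden simplification, not a full BCP 47 validator: it accepts only a two- or three-letter language subtag followed by optional script, region, and variant subtags, ignores extensions, private-use subtags, and grandfathered tags, and does not check subtags against the IANA registry. The names SIMPLE_TAG and tag_status are hypothetical.

```python
import re

# Simplified shape: language[-script][-region][-variant...]
SIMPLE_TAG = re.compile(
    r"[A-Za-z]{2,3}"                                # language, e.g. "en"
    r"(-[A-Za-z]{4})?"                              # script, e.g. "Hant"
    r"(-([A-Za-z]{2}|[0-9]{3}))?"                   # region, e.g. "US", "419"
    r"(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"   # variants, e.g. "1996"
)


def tag_status(value):
    """Classify a declared tag as 'missing', 'well-formed', or 'malformed'."""
    if value is None or not value.strip():
        return "missing"
    if SIMPLE_TAG.fullmatch(value.strip()):
        return "well-formed"
    return "malformed"
```

A well-formed tag can still be the wrong one for the content, so a shape check like this only finds broken tags, not incorrect ones.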



Although incorrect tagging will not usually degrade the user experience for site visitors, it can have more serious downstream effects – the localization equivalent of Van Halen’s collapsing stage. These include corruption of translation processes and of resources such as translation memory or machine translation (MT); poor SEO; broken HTML headers when tags are malformed; and incorrect MT results when MT services trust incorrect tags.

Fortunately, correcting language tagging problems is usually straightforward if you pay attention to the details. Conducting an audit to find incorrect tags and determine their cause can help you identify breakdowns in processes or find content that has escaped proper management in the past.

When you conduct such an audit, make certain that you use a spider to crawl your site rather than trusting what your web publication and maintenance tools tell you. Compare the declared tags against the output of a language identification tool such as fastText to find potential problems. If you do uncover broken or incorrect tags, figure out where they come from and fix the underlying problems.
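The comparison step might look something like the Python sketch below. It is a rough illustration under stated assumptions: the function names are hypothetical, the detector is passed in as a plain callable so that any language identification tool can be plugged in, and only the primary language subtag is compared. The commented-out wrapper assumes the fasttext Python package and its lid.176.bin language identification model are available locally.

```python
def primary_subtag(tag):
    """Return the lowercase primary language subtag, e.g. 'pt' for 'pt-BR'."""
    return tag.split("-")[0].lower()


def audit_page(declared_tag, text, detect_language):
    """Compare a page's declared tag with a detector's guess.

    detect_language is any callable that maps text to a language code.
    Returns 'missing', 'match', or 'mismatch'.
    """
    if declared_tag is None or not declared_tag.strip():
        return "missing"
    detected = detect_language(text)
    if primary_subtag(declared_tag.strip()) == primary_subtag(detected):
        return "match"
    return "mismatch"


# One possible detector, assuming fastText and its language ID model:
#
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#
#   def fasttext_detector(text):
#       # predict() handles one line at a time, so flatten newlines first.
#       labels, _scores = model.predict(text.replace("\n", " "))
#       return labels[0].replace("__label__", "")
```

Treat a mismatch as a prompt for human review rather than proof of an error: detectors can be wrong on short or mixed-language pages, and a detector's language codes may not line up exactly with the subtags used on your site.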

Taking care of language tags is usually so far down the priority list for web developers that they just trust their tools and never verify the results. However, checking for tagging errors may help you find bigger problems, just like discovering a brown M&M in a Van Halen dressing room. Even if you don’t turn up any problems, knowing that your content is tagged appropriately can give you assurance that you are prepared to maximize its value.

About the Author

Arle Lommel

Senior Analyst

Focuses on language technology, artificial intelligence, translation quality, and overall economic factors impacting globalization

