CSA Research

17Oct

Brown M&Ms and Bad Locale Tags

In the 1980s, the American rock band Van Halen became famous for including a requirement in contracts with concert venues that they provide a bowl of M&Ms candy with all of the brown ones removed. At the time, this was widely seen as an example of how out of touch rock musicians were with reality, but it actually served a purpose. The band’s manager explained that if venues took care of the small details, he could be reasonably certain that they had also addressed more important things. However, if they did not pay attention to the request for no brown M&M’s, he knew that he had to double-check everything else: For example, a stage once collapsed under the weight of the rockers’ equipment because the venue team had not read the requirements for weight-bearing loads.

Similarly, when organizations publish content on the web, they have the option of tagging it for language and locale. If they get these tags correct – a small detail that is solved at the last moment before publication – it is a good sign that other aspects of their localization processes are working properly. By contrast, if language tags are missing, broken, or incorrect, it may indicate that their translation technology or approach has other problems.

Language tags – which appear as the value of “lang” or “xml:lang” attributes in online content – are defined by the Internet Engineering Task Force’s (IETF’s) BCP 47 standard. When proper locale tags appear in HTML or XHTML pages, they play an important role in helping consumers, data-processing tools, and various processes know what language content appears in. However, CSA Research’s examination of 4.1 million web pages from major organizations reveals that these tags are frequently incorrect, broken, or missing.

At first glance, these problems – like a stray brown M&M in Van Halen’s dressing room – may appear to be quite minor, but we found that just two-thirds of these pages correctly identify their language. Over one quarter do not identify language at all, and 6% get it wrong. The reasons that sites have incorrect information vary from simple misspellings, to broken HTML templates, to more serious process corruption.

Although incorrect tagging will not usually degrade the user experience for site visitors, it can result in more serious downstream effects – the localization equivalent of Van Halen’s collapsing stage – such as: corruption of translation processes and resources such as translation memory or MT; poor SEO; breaking HTML headers if tags are malformed; and incorrect machine translation results when MT services trust incorrect tags.

Fortunately, correcting language tagging problems is usually straight-forward if you pay attention to the details. Conducting an audit to find incorrect tags and determine their cause can help you identify breakdowns in processes or find content that has escaped proper management in the past.

When you do check this, make certain that you use a spider to review your site rather than trusting what your web publication and maintenance tools tell you. Compare the results against a language identification tool such as fastText to find potential problems. If you do uncover broken or incorrect tags, figure out where they come from and fix the underlying problems.

Taking care of language tags is usually so far down the priority list for web developers that they just trust their tools and never verify but checking for problems may help you find bigger problems, just like discovering a brown M&M in a Van Halen dressing room. Even if you don’t turn up any problems, knowing that your content is tagged appropriately can give you assurance that you are prepared to maximize its value.

About the Author

Arle Lommel

Intelligent or smart content has been a dream since the late 1990s. The concept refers to text, data...

Our Analysts' Insights

Brown M&Ms and Bad Locale Tags

About the Author

Arle Lommel

Related

Happy 30th Birthday, Unicode!

Thinking Big about Interoperability

TBX:2019: A New Version of the ISO Standard Raises the Bar

Getting Derailed: Why Standards Initiatives Fall Short

TAPICC – Because No One Has Time for Closed Systems

Intelligent Content Goes Global

Subscribe

Categories

Follow Us on Twitter

Americas

Europe