CSA Research

21Mar

The Rise of Dialectal MT-ism

Back when I was a practicing Slavicist, I met up with a Czech friend a few weeks after arriving in Prague for the Summer School of Slavonic Studies. We had become acquainted a few years earlier when we both studied at the Institute of Foreign Languages in Moscow. Our common language then had been Russian, but I was intent on using the Czech I had been studying at Brown University and at Charles University. After an hour or so of walking around Old Town and conversing exclusively in Czech, I asked what she thought. She replied, “Je to báječné! Zní to, jako bys právě vystoupil z románu z 19. století” – “It’s marvelous! It sounds like you just stepped out of a 19th-century novel.”

Her comment inspired my new mission for that summer: I would learn to code-switch, shifting among and between the literary (spisovná), spoken (hovorová), and common (obecná) Czech variants depending on the venue and my interlocutors. Given my limited time and location, I focused on the Prague variant, deciding to avoid regional dialects such as Moravian Hanák.

Language Variants Pose Challenges to Localization

My experience with just this one language and its national variants underscores the challenges to a government or business of providing readable content to any citizen or customer, prospect, or employee. They face what sociolinguists call varieties – that is, dialects, registers, styles, lexicons, and gender conventions – in both written and spoken language. Furthermore, usage and comprehension issues extend deep into any interaction, further affected by payment systems, regulations, and other extralinguistic realities – for each country. Businesses have a much bigger challenge. They must justify the resources and expenses required to support those dialects – and most choose not to do so.

“One” language used in many countries. The most common reason for businesses to support dialects is for eight languages with “official” status in multiple nations where there’s substantial online traffic and opportunity (“Multilingual Digital Opportunity: 2022”): Arabic (25), Chinese (4), English (67), French (30), German (6), Portuguese (9), Russian (8), and Spanish (21). For example, multiple forms of Spanish provide access to 8.6% of global online GDP and 9.0% of the global total online population. Practically speaking, one country typically generates most of the opportunity, but there are incremental prospects for business in other nations using that language.

Note: This gang of eight pluricentric languages is not exhaustive. For example, Danish and Norwegian largely share their morphology and lexicon (but not phonology); Dutch and Vlaams are close neighbors; some sites fork off a Swiss-Italian variant justified by per capita income; and Iran, Afghanistan, and Tajikistan share a Persian heritage (Farsi, Dari, and Tajik, respectively).

Dialects of a single language. Except for maybe Vatican City, Monaco, and a few other sovereign states with small footprints, every official language splinters into regional variants. Any business would have a tough time demonstrating the ROI for hyperlocalizing a product or website to the 48 dialects of Japan or to the various patois of Czechia’s 10 million people (in every company that we’ve ever worked with, there’s no splitting the Czech). However, governments must communicate with their citizens about voting, education, and social services. National companies often want to tap into local preferences – consider regional advertising campaigns in any large country.
Languages in diaspora. Around the world, there are small linguistic bubbles that reflect population borders redrawn by wars and by immigration to escape current conflicts, political and religious persecution, and economic turmoil. For example, Central Europe is dotted with populations that reflect ancient and not-so-old borders. Support typically stems from a governmental obligation to communicate with citizens.

That’s the Problem – Is MT the Solution?

With apologies to Karl Marx and his “dialectical materialism,” I’ll use the term dialectal MT-ism to emphasize, as he did, the importance of real-world conditions and complexities, but in this case related to where languages are spoken, and the socioeconomic factors that complicate support. I avoid exact comparisons of dialects – if you’re really interested, look to the voluminous oeuvre of dialectologists. Suffice it to say that there is a high level of intelligibility among and between them, with major differences and nuances that keep people occupied, annoyed, and employed.

Practically speaking, dealing with language varieties is one case where machine translation should be ubiquitous. Unfortunately, today’s automated translation solutions don’t score very highly on the dialectal MT-ism scale. Most ISVs and LSPs support just one variant per pluricentric language, with some offering two variants each of Chinese, French, and Portuguese.

[Table: Dialect_MT_tables_v0...]
Notes:

We harvested this data from MT provider and LSP websites on March 7-10, 2023. Their software may not support these languages as both source and target. This data could be outdated – it often is, but it is a starting point for analysis. “Total” is the number of languages that the suppliers list.
We chose the languages based on our long-running data analytics series, “Multilingual Digital Opportunity – 2022.” The list of MT providers is not exclusive – we listed companies that have come up in recent inquiries to our analysts.

Why are there so few variants? The answer is ROI – MT developers see insufficient demand from their customers who choose to ignore dialects, so the cost of developing versions to support pluricentric variants may not seem worth it. For an overview of what CSA Research recommends to companies in one such scenario, see “Spanish Variants Deliver Global Value.” For some languages, the degree of variation may not seem to justify the investment, even though – in the case of German – the written form in Germany has a letter (ß) that is not used at all in Switzerland (where it appears as ss).
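
Some variant differences are regular enough to handle with a rule rather than a separate engine. The German example above can be sketched in a few lines of Python. The function name de_de_to_de_ch is hypothetical, and the sketch covers only the ß spelling rule, not the vocabulary or formatting differences between the two countries.

```python
def de_de_to_de_ch(text: str) -> str:
    """Rewrite Germany-style spelling to Swiss-style spelling for the Eszett only.

    Assumption: the input is standard German as written in Germany. Lexical and
    formatting differences between the variants are deliberately out of scope.
    """
    # Swiss Standard German writes "ss" wherever Germany's orthography uses "ß".
    # The capital form "ẞ" is handled the same way, as "SS".
    return text.replace("ß", "ss").replace("ẞ", "SS")


print(de_de_to_de_ch("Die Straße ist groß."))
```

A rule like this is cheap to apply as a post-editing step on MT output, which is one reason a vendor might not see the case for a dedicated Swiss engine. It does nothing for the differences that cannot be reduced to a character substitution.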

Improving Dialectal MT with Humans at the Core

Machine learning and neural technology drove the leap in MT fluency, if not correspondence (“Making the (Translation) Grade”). The same technologies can make similar jumps for dialects. However, this effort requires the involvement of humans at the core of training and refining these variants. Once augmented by humans, pluricentric variants can more easily propagate tailored government and business messages to their respective audiences, increase engagement and conversions in the process, and thus build a strong case for investment. MT vendors that want to add dialectal support should:

Add locale identifiers to your training syllabus. Machine learning ballooned in the last decade, with data-rich companies like Google and Microsoft leveraging the vast amounts of content running through their servers. They supplemented that input with the growing ranks of data factories like Appen and TELUS AI, and smaller language-centric companies like e2f and Pangeanic. MT ISVs can pass dialect information to their engines, in the process enabling their software to acquire knowledge of lexical and syntactic similarities. Once embedded in large language models, this same locale-specific data becomes available to other NLP systems such as semantic analysis, search, knowledge management, and, of course, generative AI.

Note: When you include identifiers in your data, be sure to follow the BCP 47 guidelines. Our examination of sites and language resources shows that many implementers choose to create their own codes or use them improperly (such as the ubiquitous “es-LA” that, contrary to popular practice, refers to the variety of Spanish spoken in Laos, not in Latin America – the correct code is “es-419”). Using the proper codes ensures a long and ambiguity-free life for your content in downstream applications.

Run your zero-shot MT engines with locale-identifying labels. In 2017 we outlined the pros and cons of zero-shot NMT engines trained on multiple languages. In this model, an engine asked to translate between two languages where it has insufficient data can extrapolate using training data for other language pairs. For example, if it does not have Finnish-Greek data, it can use Finnish<>German, Finnish<>English, Greek<>English, and Greek<>German training sentences to fill in the gaps. Google recently described how it will leverage zero-shot MT with human-labeled dialectal differences to improve locale-specific dialects.
Verify any domain-specific data for locale differences. Companies typically train MT engines to their industry requirements – legal, finance, health, manufacturing, high-tech. Terminology and practices in some verticals might be identical, but there will likely be regulatory and other references that require attention. Larger MT providers typically offer solutions tuned to meet the needs of an industry, but don’t offer dialectal variants of their engines. That means two levels of labeling for many implementations.
Automate extra-linguistic support. There’s an array of extra-linguistic factors that complexify any dialectal implementation – address and phone formats, payment systems, regulations, working hours, support methods, and much more. Some of these differences rely on known structured data that you can automate and code into your website, content and document management systems, and ERP and accounting models. The Unicode Consortium’s CLDR should be your starting point for information on local variants.
Leverage AI and specialists for anything moving beyond simple differences. You can identify legal and regulatory issues with specialists, outsourcers, and maybe even a generative AI chatbot as a starting point. Note that in our website analytics research, we list the HQ country of sites but find that most sites don’t distinguish between dialects for the pluricentric languages. When a website does list country sites (for example, Spanish for Mexico), we often find that the content does not differ much by the dialects beyond things specific to the country such as payment systems in the local currency.
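
The first two recommendations, locale identifiers and locale-labeled training data, can be sketched together in Python. Everything named here is an assumption for illustration: the KNOWN_MISUSES table, the simplified TAG_SHAPE pattern (language, optional script, optional region, not the full BCP 47 grammar), and the "<2xx-YY>" label format prepended to the source sentence. Real engines define their own label conventions, and a production check should validate subtags against the IANA registry.

```python
import re

# Hypothetical table of commonly misused tags and the tag that was probably meant.
# "LA" is the region code for Laos; "419" is Latin America and the Caribbean.
# Assumption: data tagged "es-LA" is really Latin American Spanish.
KNOWN_MISUSES = {"es-LA": "es-419"}

# Simplified shape check: language[-Script][-REGION]. Not the full BCP 47 grammar.
TAG_SHAPE = re.compile(r"^[a-z]{2,3}(-[A-Z][a-z]{3})?(-([A-Z]{2}|\d{3}))?$")


def normalize_tag(tag: str) -> str:
    """Apply conventional BCP 47 casing and correct known misuses."""
    parts = tag.replace("_", "-").split("-")
    parts[0] = parts[0].lower()
    for i in range(1, len(parts)):
        part = parts[i]
        if len(part) == 4 and part.isalpha():
            parts[i] = part.title()   # script subtag, e.g. Hant
        elif len(part) == 2 and part.isalpha():
            parts[i] = part.upper()   # region subtag, e.g. MX
    normalized = "-".join(parts)
    normalized = KNOWN_MISUSES.get(normalized, normalized)
    if not TAG_SHAPE.match(normalized):
        raise ValueError(f"not a recognizable language tag: {tag!r}")
    return normalized


def label_pair(source: str, target: str, target_locale: str) -> tuple[str, str]:
    """Prepend a target-locale label to the source side of a training pair."""
    return (f"<2{normalize_tag(target_locale)}> {source}", target)


print(normalize_tag("es_la"))
print(label_pair("Where are you from?", "¿De dónde eres?", "es-mx"))
```

Normalizing before labeling means the engine sees one label per variant rather than several spellings of the same locale, and the corrected tags stay usable in downstream systems that expect standard codes.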

Finally, think outside the MT text box. The dialectal MT challenge extends to spoken language – interpreting, audio, video, multimedia, wherever people communicate (“Introduction to Multimedia Localization”). Consider a scene from the film Inglourious Basterds in which the German major Hellstrom’s ear correctly identifies two of the Germans in the room as Mr. Frankfurt and Mr. Munich but struggles to place the “bizarren Akzent” of the British officer posing as a German, who ultimately gives himself away with a three-fingered British hand gesture when he orders three drinks.

Machines will be expected to recognize these accents and hand gestures. Some MT engines have been reading language in photos and images for years. In 2018 Google added Translatotron to its Android and iOS Translate products, offering direct speech-to-speech translation without using intermediate text representation. It included spoken dialects for Bengali, English, French, and Spanish – and has since added support for variants of Arabic, Chinese, Portuguese, Swahili, Tamil, and Urdu. All of this derives from the company’s long-term investment in machine learning.

[Figure: The-Range-of-Adaptat...]

Dialectal MT-ism: There’s More Work to Be Done

The mission to improve the local experience and thus increase engagement and conversion will consume the efforts of mainstream and NLP developers for years to come – in written form as well as the far more challenging task of spoken and visual localization across the omnichannel spectrum on to the metaverse and whatever comes next.

Thinking back to my Bohemian idyll as a grad student, I remember cutting class one morning to meet friends in a kavárna (café). Shortly after we arrived, a contingent of fellow travelers from another class joined our group. Not having Major Hellstrom’s ear for accents, I wondered about the Czech pronunciation of one woman in the other group.

Using the common Czech that we were speaking in the café, I asked, “vodkud seš?” (where are you from?). She answered, to my surprise, “Ne, nechci vodku” (no, I don’t want any vodka) and she went on to explain, “it’s only 9:30. I’ll have coffee.” Then the penny dropped. Not conversant in common Czech, she heard “vodku chceš,” which in literary Czech means “do you want vodka?” The next generation of dialect-aware automated speech recognition and MT must help us avoid mixing up location with libation.