Back when I was a practicing Slavicist, I met up with a Czech friend a few weeks after arriving in Prague for the Summer School of Slavonic Studies. We had become acquainted a few years earlier when we both studied at the Institute of Foreign Languages in Moscow. Our common language then had been Russian, but I was intent on using the Czech I had been studying at Brown University and at Charles University. After an hour or so of walking around Old Town and conversing exclusively in Czech, I asked what she thought. She replied, “Je to báječné! Zní to, jako bys právě vystoupil z románu z 19. století” – “It’s marvelous! It sounds like you just stepped out of a 19th-century novel.”
Her comment inspired my new mission for that summer: I would learn to code-switch, shifting among and between the literary (spisovná), spoken (hovorová), and common (obecná) Czech variants depending on the venue and my interlocutors. Given my limited time and location, I focused on the Prague variant, deciding to avoid regional dialects such as Moravian Hanák.
My experience with just this one language and its national variants underscores the challenges to a government or business of providing readable content to any citizen or customer, prospect, or employee. They face what sociolinguists call varieties – that is, dialects, registers, styles, lexicons, and gender conventions – in both written and spoken language. Furthermore, usage and comprehension issues extend deep into any interaction, further affected by payment systems, regulations, and other extralinguistic realities – for each country. Businesses have a much bigger challenge. They must justify the resources and expenses required to support those dialects – and most choose not to do so.
Note: This gang of eight pluricentric languages is not exhaustive. For example, Danish and Norwegian largely share their morphology and lexicon (but not phonology); Dutch and Vlaams are close neighbors; some sites fork off a Swiss-Italian variant justified by per capita income; and Iran, Afghanistan, and Tajikistan share a Persian heritage (Farsi, Dari, and Tajik, respectively).
With apologies to Karl Marx and his “dialectical materialism,” I’ll use the term dialectal MT-ism to emphasize, as he did, the importance of real-world conditions and complexities, but in this case related to where languages are spoken, and the socioeconomic factors that complicate support. I avoid exact comparisons of dialects – if you’re really interested, look to the voluminous oeuvre of dialectologists. Suffice it to say that there is a high level of intelligibility among and between them, with major differences and nuances that keep people occupied, annoyed, and employed.
Practically speaking, dealing with language varieties is one case where machine translation should be ubiquitous. Unfortunately, today’s automated translation solutions don’t score very highly on the dialectal MT-ism scale. Most ISVs and LSPs support just one variant per pluricentric language, with some offering two variants each of Chinese, French, and Portuguese.
Why are there so few variants? The answer is ROI – MT developers see insufficient demand from their customers, who choose to ignore dialects, so the cost of developing versions to support pluricentric variants may not seem worth it. For an overview of what CSA Research recommends to companies in one such scenario, see “Spanish Variants Deliver Global Value.” For some languages, the degree of variation may not seem to justify the investment, even though – in the case of German – the written form in Germany has a letter (ß) that is not used at all in Switzerland (where it appears as ss).
Machine learning and neural technology drove the leap in MT fluency, if not correspondence (“Making the (Translation) Grade”). The same technologies can make similar jumps for dialects. However, this effort requires the involvement of humans at the core of training and refining these variants. Once augmented by humans, pluricentric variants can more easily propagate tailored government and business messages to their respective audiences, increase engagement and conversions in the process, and thus build a strong case for investment. MT vendors that want to add dialectal support should:
Note: When you include identifiers in your data, be sure to follow the BCP 47 guidelines. Our examination of sites and language resources shows that many implementers create their own codes or use existing ones improperly (such as the ubiquitous “es-LA” that, contrary to popular practice, refers to the variety of Spanish spoken in Laos, not in Latin America – the correct code is “es-419”). Using the proper codes ensures a long and ambiguity-free life for your content in downstream applications.
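For readers who tag their own training data, here is a minimal Python sketch of the kind of check the note above implies. It is an illustration under stated assumptions, not a full BCP 47 validator: the pattern covers only simple language-script-region tags, and the names `normalize_tag`, `check_tag`, and `LIKELY_INTENDED` are hypothetical. Only the es-LA to es-419 mapping comes from the text above.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: language[-Script][-REGION]. This is only a subset of the
# BCP 47 grammar (no variants, extensions, or private-use subtags).
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

# Hypothetical lookup of commonly misused tags and the tag that was probably
# intended. Only the es-LA entry comes from the article; extend as needed.
LIKELY_INTENDED = {"es-LA": "es-419"}


def normalize_tag(tag: str) -> str:
    """Return the tag with conventional casing, e.g. 'PT_br' -> 'pt-BR'."""
    match = _TAG.match(tag.replace("_", "-"))
    if match is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    parts = [match["language"].lower()]
    if match["script"]:
        parts.append(match["script"].title())   # scripts are title case: Hant
    if match["region"]:
        parts.append(match["region"].upper())   # regions are upper case: BR
    return "-".join(parts)


def check_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Return (normalized tag, suggested replacement or None)."""
    normalized = normalize_tag(tag)
    return normalized, LIKELY_INTENDED.get(normalized)


if __name__ == "__main__":
    for raw in ["es-la", "es-419", "PT_br", "zh-hant-tw"]:
        normalized, suggestion = check_tag(raw)
        note = f" (did you mean {suggestion}?)" if suggestion else ""
        print(f"{raw} -> {normalized}{note}")
```

The sketch flags a suspicious tag rather than rewriting it silently, because es-LA is a well-formed tag in its own right; a human should confirm whether Latin American Spanish was actually intended.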
Finally, think outside the MT text box. The dialectal MT challenge extends to spoken language – interpreting, audio, video, multimedia, wherever people communicate (“Introduction to Multimedia Localization”). Consider a scene from the film Inglourious Basterds in which the German major Hellstrom’s ear correctly identifies two of the Germans in the room as Mr. Frankfurt and Mr. Munich but puzzles over the “bizarren Akzent” (bizarre accent) of the British officer posing as a German, who ultimately gives himself away with a three-fingered British hand gesture when he orders three drinks.
Machines will be expected to recognize these accents and hand gestures. Some MT engines have been reading language in photos and images for years. In 2018 Google added Translatotron to its Android and iOS Translate products, offering direct speech-to-speech translation without using intermediate text representation. It included spoken dialects for Bengali, English, French, and Spanish – and has since added support for variants of Arabic, Chinese, Portuguese, Swahili, Tamil, and Urdu. All of this derives from the company’s long-term investment in machine learning.
The mission to improve the local experience and thus increase engagement and conversion will consume the efforts of mainstream and NLP developers for years to come – in written form as well as the far more challenging task of spoken and visual localization across the omnichannel spectrum on to the metaverse and whatever comes next.
Thinking back to my Bohemian idyll as a grad student, I remember cutting class one morning to meet friends in a kavárna (café). Shortly after we arrived, a contingent of fellow travelers from another class joined our group. Not having Major Hellstrom’s ear for accents, I wondered about the Czech pronunciation of one woman in the other group.
Using the common Czech that we were speaking in the café, I asked, “vodkud seš?” (where are you from?). She answered, to my surprise, “Ne, nechci vodku” (no, I don’t want any vodka) and she went on to explain, “it’s only 9:30. I’ll have coffee.” Then the penny dropped. Not conversant in common Czech, she heard “vodku chceš,” which in literary Czech means “do you want vodka?” The next generation of dialect-aware automated speech recognition and MT must help us avoid mixing up location with libation.
Chief Research Officer
Focuses on market trends, business models, and business strategy