AI Can Help Improve Even Mature Enabling Technologies
I have an interest in 19th-century sporting newspapers and would like to have them in digital form. Unfortunately, these newspapers used almost microscopic fonts, the type was often heavily worn, and their paper has degraded over time. Despite these problems, humans usually don’t have any trouble reading them, but optical character recognition (OCR) is another matter. Although this technology is widely considered mature, the results may disappoint, as can be seen in the image below, which provides short samples from two common systems for an 1858 issue of the New York Clipper:
- Basic OCR with a dictionary. The first example would never win any awards, although it would give a dedicated reader some idea of what the text is about. It makes mistakes in 40% of the words, and some of them – such as “pco<1iog” instead of “pending” – are cases where the character recognition was so poor that the dictionary simply could not make sense of it.
- OCR with a language model. The second result introduces a basic language model. Much like those used in phrase-based statistical machine translation, it tries to predict likely word choices from the context. It delivers much better results, with just a 7.5% error rate, and is quite readable, but finding problems like “bag” instead of “has” would require careful proofing.
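The article does not say how the two error rates were measured. A common approach is word error rate: the word-level edit distance between a hand-checked reference and the OCR output, divided by the number of reference words. The sketch below assumes that definition; the sample sentence is invented for illustration and only borrows the two misreadings quoted above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentence, not taken from the 1858 newspaper.
reference = "the match has been pending for some time"
ocr_output = "the match bag been pco<1iog for some time"
print(word_error_rate(reference, ocr_output))  # 2 wrong words out of 8 -> 0.25
```

Measured this way, a 40% rate means roughly two words in five need correction, while 7.5% means fewer than one in ten.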
To improve output, developers have long turned to basic AI techniques. For example, they rely on dictionaries to try to spot mistakes, although the example above shows that this doesn’t always work. Language models rely on statistical analysis of corpora to predict the probability of individual words in context. Such approaches are somewhat prone to “hallucination” – they see words that are not on the page, or that no human reader would find plausible, because they try to make the output fit the language model – but they clearly do help.
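As a rough illustration of these two techniques, the sketch below pairs a dictionary lookup with a tiny bigram model. It is not how any particular OCR product works: the vocabulary, the bigram counts, and the function names are all invented for this example, and the fuzzy matching uses `difflib` from the Python standard library.

```python
import difflib

# Hypothetical vocabulary and bigram counts, invented for illustration only.
VOCAB = ["the", "match", "has", "bag", "been", "pending"]
BIGRAM_COUNTS = {
    ("match", "has"): 3,
    ("has", "been"): 25,
    ("match", "bag"): 1,
}


def dictionary_fix(token, vocab, cutoff=0.6):
    """Return the closest dictionary word, or the token unchanged if none is close."""
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def pick_by_context(prev_word, candidates, next_word, counts):
    """Choose the candidate seen most often next to its neighbouring words."""
    def score(word):
        return counts.get((prev_word, word), 0) + counts.get((word, next_word), 0)
    return max(candidates, key=score)


print(dictionary_fix("pendiog", VOCAB))    # one bad character: recovers "pending"
print(dictionary_fix("pco<1iog", VOCAB))   # too garbled: returned unchanged
print(pick_by_context("match", ["bag", "has"], "been", BIGRAM_COUNTS))  # "has"
```

The dictionary step can only repair a token that still resembles a real word, while the context step helps only when the counts happen to favor the right candidate. That is also why a language model can confidently pick a real but wrong word.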
These sorts of results are typical if you happen to work in languages that use the major European and East Asian writing systems. However, if you need OCR for Tibetan, Georgian, Telugu, or most of the roughly 190 writing systems documented in Unicode, you may find yourself wishing that you had something even as good as the first result. Options become scarce or non-existent for most of the world’s languages, and major developers see little incentive to add support because traditional models for adding new scripts and languages for smaller markets are expensive and require data that may not be available to them.
Although CSA Research has focused on the role of advanced artificial intelligence (AI) in machine translation, as well as on some innovative applications to other translation technologies, AI also has the potential to change other established technologies – like OCR – in fundamental ways. For example, consider OCRopus, an open-source OCR package that uses recurrent neural networks – the same basic technology that underlies current neural machine translation – to perform character recognition. Developed by the German Research Center for Artificial Intelligence (DFKI) with funding from Google, OCRopus served as the engine behind Google Books. It delivers advantages parallel to those seen in neural MT: more accurate results, easier training, and greater flexibility. In principle, even a small enterprise or LSP could use OCRopus to add a new script and start digitizing documents, something that is not possible with most commercial offerings.
So how do the results of a neural system compare? At least for the English example, such a system does very well indeed. In the short sample, it makes zero word errors and has only a minor punctuation problem. In a longer test, it made some errors, but still far fewer than even the second option shown above.
Just as neural MT does away with the distinction between translation and language models in favor of a single-pass approach to generating translation, so too can OCR with neural networks combine character and language models into one layer to deliver better results. Although OCR will still lack any understanding of the content – the same problem that MT faces – it will get much better at deducing what is written. Rather than recognizing individual characters and then trying to make sense of them, it will come closer to reading words as units, just as humans do.
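To make the single-pass idea concrete, here is a minimal sketch of how a recurrent network's output for a line of text might be turned into characters. It assumes connectionist temporal classification (CTC), a decoding scheme commonly paired with recurrent networks for line recognition; the article does not say which scheme any given system uses. The frame scores, the three-letter alphabet, and the function name are made up for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC "no character" label


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks.

    frame_scores: one list of scores per image column (frame);
                  index 0 is the blank, index i maps to alphabet[i - 1].
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    collapsed = [label for label, _ in groupby(best)]
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)


# Hypothetical network output for seven frames over the alphabet "has".
alphabet = "has"
frames = [
    [0.10, 0.80, 0.05, 0.05],  # h
    [0.20, 0.70, 0.05, 0.05],  # h (same character, still in view)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.60, 0.10, 0.20, 0.10],  # blank
    [0.10, 0.00, 0.10, 0.80],  # s
    [0.20, 0.00, 0.10, 0.70],  # s
]
print(ctc_greedy_decode(frames, alphabet))  # has
```

Nothing here segments the line into characters or consults a dictionary afterwards. In a real recurrent model, each frame's scores already depend on the surrounding frames, which is where the context sensitivity comes from.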
Chinese OCR Takes a Different Approach
Dictionaries improve OCR for most languages, but systems seldom use them for one major language: Chinese. Why? Chinese words are frequently only one or two characters long. When the character recognition model and the dictionary overlap so significantly, adding a separate dictionary offers little benefit. However, neural network approaches that do not require a separate language model should deliver much better results, because they can detect which characters are most likely to be appropriate in a given context without separate modules for character recognition, word lookup, and language modeling.
Although most of the major developers have yet to move to neural networks for OCR, the advantages in terms of flexibility and scalability are clear. We predict that many of the technologies we describe in our TechStack series will experience a re-evaluation in the near future as their developers identify cases where AI can breathe new life into mature approaches and deliver unforeseen benefits. We are already witnessing such a transformation of the TMS category. Expect new entrants and disruptors to challenge entrenched developers in other spaces as well. OCR may not usually appear at the top of lists of the most exciting localization technologies, but it shows how developments in AI can shake up even established fields.