TBX:2019: A New Version of the ISO Standard Raises the Bar
Localization industry veterans may recall when the OSCAR standards group in the now-defunct Localization Industry Standards Association introduced TermBase eXchange (TBX) way back in 2002, based on earlier work from 1999. Released in the early days of XML, it promised to be a major step forward for making terminological data useful. After it was adopted as an international standard (ISO 30042) in 2008, it seemed that it had reached maturity and a firm place as a star among language industry standards. However, TBX never quite lived up to its potential. A new version, released this year, could rehabilitate its position and prepare it for the next generation of content applications.
Translation tool vendors claimed to support TBX, but their products never quite managed to interoperate properly with competing and complementary terminology tools. As a result, many – if not most – LSPs and translators continued to exchange terminology via spreadsheets or CSV files, even though these mechanisms have serious problems, such as inconsistent formats and encodings and a lack of vital metadata. Although TBX offered a good solution to such difficulties, users preferred the apparent simplicity of a spreadsheet.
TBX Grows Up
The situation has recently changed. Over the past few years TBX underwent a major overhaul to address its limitations and prepare it to meet new goals. A steering committee in ISO Technical Committee 37 – composed of representatives from ASTM, CSA Research, FH Köln, LTAC Global, Universidad de Las Palmas de Gran Canaria, Kent State University, and the XLIFF committee – recently completed the 2019 version of ISO 30042, which ISO has now published. This new edition streamlines the format and addresses many of the complaints about and limitations of the 2008 version. Some of the major changes are:
- Updated XML syntax. The earlier version adopted a syntax where data categories appeared as attribute values in the XML code. Since then, XML best practice has shifted to using tag names for this purpose. As a result, TBX now supports two “styles” of XML: the original DCA (Data Categories as Attributes) and a newer DCT (Data Categories as Tag names). In the long term, practice may evolve to DCT exclusively, and the current format provides a migration path for existing TBX implementations. Other changes that apply to both styles are designed to make it easier to parse and work with TBX files and to use terminological data with XLIFF.
- Dialects simplify adoption. Perhaps the biggest impediment in the past has been that TBX does not define a single format for terminological data, but instead a way to represent the different formats various termbases use. As a result, many different data sets with different models have proven to be incompatible for interchange purposes. The newer version defines several “dialects” of TBX intended for common use and data interchange. The availability of standard dialects will remove a lot of guesswork and provide specific implementation targets for tool developers. In addition, the official dialects – TBX-Core, TBX-Min, and TBX-Basic – “telescope” into each other: Each one is a progressive superset of the preceding one, which facilitates interoperability between them. The standard also provides approaches for handling customized data categories and for developing custom dialect extensions.
- Required dialect names. The 2008 version was problematic because implementers often ignored the requirement to declare, in a separate file attached to every document, which variant of TBX they were using. As a result, when someone received a file, there often was no way of knowing what data categories it would contain. The new version makes this declaration mandatory and replaces the separate file with a dialect name, so that implementers know what to expect from a given TBX document. No longer will someone receive a “TBX file” with no guidance concerning which data categories it implements. Creators of customized extensions to dialects are required to post formal dialect definitions as links (using an XML namespace for the DCT style) where users can find the information they need to ensure reliable interchange.
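To make the difference between the two styles and the role of the dialect name more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not an excerpt from the standard: the two sample fragments are heavily simplified, the element names, the `type` and `style` attributes on the root element, and both namespace URIs are assumptions made for this example, and the helper `summarize` is hypothetical. Consult ISO 30042 and the resources at TBXinfo.net for the authoritative markup.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace URIs and element names below are illustrative only
# and have not been validated against ISO 30042:2019.
TBX_NS = "urn:iso:std:iso:30042:ed-2"
BASIC_NS = "http://www.tbxinfo.net/ns/basic"

# DCA style: the data category ("partOfSpeech") is an attribute value.
DCA_SAMPLE = f"""<tbx xmlns="{TBX_NS}" type="TBX-Basic" style="dca" xml:lang="en">
  <text><body>
    <conceptEntry id="c1">
      <langSec xml:lang="en">
        <termSec>
          <term>termbase</term>
          <termNote type="partOfSpeech">noun</termNote>
        </termSec>
      </langSec>
    </conceptEntry>
  </body></text>
</tbx>"""

# DCT style: the same data category is a namespaced tag name.
DCT_SAMPLE = f"""<tbx xmlns="{TBX_NS}" xmlns:basic="{BASIC_NS}"
     type="TBX-Basic" style="dct" xml:lang="en">
  <text><body>
    <conceptEntry id="c1">
      <langSec xml:lang="en">
        <termSec>
          <term>termbase</term>
          <basic:partOfSpeech>noun</basic:partOfSpeech>
        </termSec>
      </langSec>
    </conceptEntry>
  </body></text>
</tbx>"""

NAMESPACES = {"tbx": TBX_NS, "basic": BASIC_NS}


def summarize(xml_text):
    """Return (dialect, style, [(term, part_of_speech), ...]) for a fragment."""
    root = ET.fromstring(xml_text)
    dialect = root.get("type")   # declared dialect name, e.g. "TBX-Basic"
    style = root.get("style")    # "dca" or "dct"
    if dialect is None:
        raise ValueError("no dialect name declared on the root element")

    terms = []
    for term_sec in root.iterfind(".//tbx:termSec", NAMESPACES):
        term = term_sec.findtext("tbx:term", namespaces=NAMESPACES)
        if style == "dca":
            # Data category carried in the 'type' attribute of a generic element.
            pos = term_sec.findtext(
                "tbx:termNote[@type='partOfSpeech']", namespaces=NAMESPACES
            )
        else:
            # Data category carried in the tag name itself.
            pos = term_sec.findtext("basic:partOfSpeech", namespaces=NAMESPACES)
        terms.append((term, pos))
    return dialect, style, terms


if __name__ == "__main__":
    for sample in (DCA_SAMPLE, DCT_SAMPLE):
        print(summarize(sample))
```

Both fragments carry the same information, so a consumer that first reads the declared dialect and style can choose the matching lookup and treat the two files identically from that point on.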
To simplify implementation, the TBX Steering Committee set up TBXinfo.net with guidance, tools, and resources for implementers. This site helps ensure that materials needed to work with TBX are open to the public and freely available. The standard itself, by contrast, carries a price of CHF 158 (~US$160), although it has been streamlined and shortened to reduce cost. In most cases, only developers will need to purchase ISO 30042, because other interested parties will find answers to their questions at the TBXinfo.net site.
TBX Plays a Vital Role in the Intelligent Content World
Why does this matter to language service providers and enterprise content creators? The most common type of translation error is failure to comply with terminology. Although TBX cannot resolve every problem, it does provide a standards-based approach to exchanging data about terms and implementing best practices for terminology management. Managing and controlling terminology is also a key requirement for creating intelligent content and translating it. Terminology management is thus set to become more important in the language industry, especially as TBX guides processes past spreadsheets to automated workflows and deployment of terminological resources.
The changes to TBX have modernized it and prepared it for the next generation of content applications. The new version resolves many of the challenges that implementers of the previous version faced and sets TBX up to fill a vital role in the language industry and intelligent content applications.
About the Author