XML in Localisation: Reuse Translations With TM and TMX
Reduce Translation Time and Effort With the Aid of XML Standards
Translation memories (TM) are essential components in the translation process. The following article demystifies TM technology for beginners and explains how Translation Memory eXchange (TMX), an XML standard, helps you to achieve independence from translation tool vendors.
TM is a fundamental component of computer-aided translation (CAT) tools. It has become so common in the translation industry that the term "translation memory tool" is often used in place of "computer-aided translation tool." However, these terms should not be used interchangeably, as CAT technologies also include machine translation, a computer technology based on linguistic rules and the use of bilingual dictionaries.
A TM system remembers translations that have been typed by a human translator. When the translator needs to work on a similar text, the system offers the previously saved version. This can save a lot of time when a translator works with repetitive texts, such as technical manuals, and can also help to achieve terminological consistency.
Search types
TM systems are basically search engines that specialize in performing two kinds of searches: exact searches, which retrieve only entries identical to the searched text, and fuzzy searches, which retrieve entries that are similar to the searched text up to a specified level.
Similarity is often measured using Levenshtein distance, named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. Levenshtein distance counts the number of insertions, deletions and substitutions required to turn one phrase into another. For example, to change the word "magazine" into "magazines," you type one character ("s"), so the distance between them is 1. One character in eight represents 12.5% (the difference is usually measured against the source phrase), so a CAT tool using this method would measure the similarity between the two words as 87.5%.
The Levenshtein algorithm is practical, but it does not capture every kind of similarity. For example, a human translator can tell that two sentences built from the same words in a different order are similar in meaning.
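The character-based measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular CAT tool: the function names `levenshtein` and `similarity` are hypothetical, and clamping the result at 0% is an assumption made here so that very different phrases never yield a negative score.

```python
def levenshtein(source: str, target: str) -> int:
    """Count the insertions, deletions and substitutions needed
    to turn source into target."""
    # previous[j] holds the distance between the source prefix handled
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character from target
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(source: str, target: str) -> float:
    """Similarity as a percentage, measured against the source phrase."""
    if not source:
        return 100.0 if not target else 0.0
    difference = levenshtein(source, target) / len(source)
    return max(0.0, (1 - difference) * 100)  # clamp: an assumption of this sketch


print(levenshtein("magazine", "magazines"))  # 1
print(similarity("magazine", "magazines"))   # 87.5
```

Keeping only two rows of the distance table at a time keeps memory use proportional to the length of the target phrase, which matters little for single segments but is a common way to write the algorithm.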
However, a program computing the number of keystrokes required to change one phrase into the other would conclude that these phrases are quite different, and therefore would not offer translations from the database. Rearranging the words requires too many keystrokes. With languages like English or French, it is easy to separate the words that compose a sentence, making it possible to take word order into account when measuring similarity. However, languages like Chinese or Japanese have no spaces between words, making it difficult to go beyond Levenshtein distance.
The key to providing better matches lies in the ability of the TM engine to break every segment into very small pieces and to store them as a large collection of fragments in a database. The drawback is that storing the data this way takes considerable time and effort, because the fragments must be properly indexed. Nevertheless, searching for fragments is faster than full text search combined with similarity calculations.
During the translation process, it is often necessary to transfer translation memories along with the documents being translated. Because the various participants in the process may have different TM systems, a standard for TM data exchange was created. The Localization Industry Standards Association's (LISA's) Open Standards for Container/Content Allowing Re-use (OSCAR) group defined Translation Memory eXchange (TMX) as a common standard to allow users to reuse text more effectively when working with different CAT tools and translation providers.
TMX: Translation Memory eXchange
The formal definition of TMX on LISA's OSCAR web site states: TMX (Translation Memory eXchange) is the vendor-neutral open XML standard for the exchange of Translation Memory (TM) data created by Computer Aided Translation (CAT) and localization tools. The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process.
Several complete translation tools are available to help you. Translators have a choice of tools that specialize in, among other things, general documentation, software localization, technical manuals and brochures. When a translator needs to work with two or more tools, the ability to reuse translation memories across tools is a must. This is where TMX becomes the hero. Translation memories are valuable corporate assets, so being tied to a proprietary database format is a bad idea. Open standards like TMX give translators, language service providers and customers who require localization a reasonable degree of independence from tool vendors.
TMX in detail
A TMX document is an XML document whose root element is <tmx>. General information about the TMX document is described in the attributes of the <header> element. The main content of the TMX document is stored inside the <body> element. It holds a collection of translations contained in translation unit elements (<tu>). Each translation unit holds the text in one or more languages as translation unit variants (<tuv>); the TMX DTD will allow a <tu> with a single <tuv>. The text of a translation unit variant is enclosed in a <seg> element.
Table 1
Inline tags depend on the format of the original document. In HTML, bold text is delimited with opening and closing <b> tags, and those tags may be enclosed in <bpt> and <ept> elements.
Figure 1 below shows a sample TMX document with entries in English, Spanish and Chinese, while Figure 2 shows the same document opened in a TMX editor.
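To make the element structure concrete, here is a minimal sketch that reads translation units from a small TMX document using Python's standard `xml.etree.ElementTree` module. The sample document is hypothetical and is not a reproduction of the article's Figure 1: the header values (such as `ExampleTool`) are placeholders, the segments are invented, and the helper name `read_units` is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document; the header values are placeholders.
SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="unknown" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello, world.</seg></tuv>
      <tuv xml:lang="es"><seg>Hola, mundo.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="es"><seg>Guarde el archivo.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return one {language: text} dictionary per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline elements.
            variants[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(variants)
    return units


for unit in read_units(SAMPLE_TMX):
    print(unit["en"], "->", unit["es"])
```

Flattening each segment with `itertext()` discards inline formatting, which is acceptable for a quick inspection but not for a tool that needs to preserve markup.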
Figure 1
Figure 2
TMX Levels
You can exchange translation memories using TMX documents at two levels: Level 1, which carries plain text only, and Level 2, which also carries the inline markup needed to recreate formatting.
In theory, any tool that supports TMX at Level 2 should be able to use the tags generated by another Level 2 compliant tool. However, in the real world, differences in implementation mean that tags can't be reused. All major CAT tools currently support the TMX standard, but some compatibility issues still exist:
Version 2.0 of TMX will attempt to correct some of the problems mentioned above. A draft of TMX 2.0 is available on the TMX home page. TMX 2.0, in its current draft, includes a new set of inline tags designed to be compatible with the XLIFF standard. It also has a new set of rules for selecting inline elements. Only one level of compliance is defined; therefore, all tools that implement support for TMX 2.0 should include inline markup in exported TMX files.
Document alignment
Thanks to a process called alignment, you can still reuse translations that are not part of a TM engine. An alignment tool requires two files, (1) an original document and (2) its translation, to generate a TMX file that can be imported into a TM system. A simple alignment method might look like the following:
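As a rough illustration of alignment, the sketch below pairs segments by position and writes them out as TMX translation units. It rests on several assumptions that are not taken from the article: both documents have already been converted to XLIFF, each file keeps its text in `<source>` elements, the two files contain the same number of segments in the same order, and inline formatting can be flattened to plain text. The function names and header values are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def source_segments(xliff_text):
    """Return the text of every <source> element, in document order."""
    root = ET.fromstring(xliff_text)
    return ["".join(el.itertext()).strip()
            for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "source"]  # ignore any namespace


def align(original_xliff, translated_xliff, src_lang, tgt_lang):
    """Pair segments by position and return a <tmx> element."""
    originals = source_segments(original_xliff)
    translations = source_segments(translated_xliff)
    if len(originals) != len(translations):
        raise ValueError("segment counts differ; manual review is needed")

    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {  # placeholder header values
        "creationtool": "alignment-sketch", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "xliff", "adminlang": "en",
        "srclang": src_lang, "datatype": "xml",
    })
    body = ET.SubElement(tmx, "body")
    for original, translation in zip(originals, translations):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, original), (tgt_lang, translation)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


# Hypothetical input: the original and the translated document, each
# converted to XLIFF separately.
ORIGINAL = """<xliff version="1.2"><file original="a.html"
 source-language="en" datatype="html"><body>
<trans-unit id="1"><source>Open the file.</source></trans-unit>
<trans-unit id="2"><source>Close the window.</source></trans-unit>
</body></file></xliff>"""

TRANSLATED = """<xliff version="1.2"><file original="a_es.html"
 source-language="es" datatype="html"><body>
<trans-unit id="1"><source>Abra el archivo.</source></trans-unit>
<trans-unit id="2"><source>Cierre la ventana.</source></trans-unit>
</body></file></xliff>"""

print(ET.tostring(align(ORIGINAL, TRANSLATED, "en", "es"), encoding="unicode"))
```

Refusing to align files with different segment counts is a deliberately cautious choice: positional pairing silently produces wrong translation units as soon as one side has an extra or missing segment.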
The process described above aligns two XLIFF files, not the original documents. Original formatting information is converted to XLIFF inline tags, so that it can then be wrapped in TMX inline elements.
The result of the alignment process depends on the quality of the translation. If the aligned documents have the same or a very similar structure, the translation can be recovered for reuse without much effort. When translation is performed using a CAT tool, structure and formatting are usually preserved.
Managing glossaries with TMX
In "XML in localisation: A practical analysis," I mentioned that TermBase eXchange (TBX), a standard also administered by LISA's OSCAR group, is the appropriate XML format for preparing glossaries. Nevertheless, TMX can also be used for creating and maintaining multilingual glossaries. TMX includes the elements <note> and <prop>, which can carry additional information about a translation unit. Listing 1 shows how to annotate a TMX file with terminological data using elements available in the TMX DTD.
Listing 1
However, the team that wrote the TMX specification noticed that the TMX standard was not enough for managing terminological data and therefore created the TBX standard. This new standard is better suited for creating and maintaining dictionaries and glossaries, but TMX provides a good alternative for beginners. It is important to note that as TMX and TBX are both XML-based, it is possible to use XSL transformations to convert data from one format to the other.
Glossaries in either TMX or TBX format can be used by CAT tools to provide partial translations in a machine translation-like style. I mentioned before that one of the best ways to ensure high quality matching is to divide the text to be translated into very small fragments. Translations are extracted from TM databases when a number of fragments above the required similarity percentage match database entries. In a similar fashion, it is possible to retrieve partial translations when a certain number of fragments exactly match all the fragments in a glossary entry. Extracting partial translations from a glossary (usually a few words in a sentence) and presenting them clearly helps the translator to achieve consistency throughout the whole document.
Summary
This article has explained the importance of translation memories and how the localization industry uses them, highlighting the relevance of the TMX format for transferring translation data between different TM implementations. I also presented a brief explanation of an alignment process used to recover legacy translations, illustrating how you can use other related standards like XLIFF to help solve a common problem.
A previous version of this article was originally published by IBM developerWorks in February 2005.
Rodolfo Raya is an XML specialist for Heartsome Holdings, developing multi-platform translation/localization and content publishing tools using XML and Java technology. He can be reached at rmraya@maxprograms.com.