Use XML as a Java Localization Solution

The reusability that XML affords TMX-formatted data benefits Java internationalization development

by Masaki Itagaki
mitagaki@uswest.net

Java has been one of the best programming languages for developing applications for the global market since JDK 1.1 introduced the basic components for internationalization. Java supports internationalization in many ways, including Unicode 2.0, multilingual environments, and Locale objects, to name a few. However, you still have to deal with the daunting, fundamental work that a global market requires: translating all text items such as labels, messages, and menu items. Even for this kind of localization work, Java offers a nice solution in the ResourceBundle class. You can extract all the text items from the original source code and isolate them in ResourceBundle components such as a ListResourceBundle class or a properties file.
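
As a minimal sketch of that extraction step, the following example moves two text items into a ListResourceBundle subclass and looks one of them up by key. The class names and keys (DemoLabels, ListBundleDemo, "menu.file") are hypothetical and chosen only for illustration.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

// Hypothetical bundle holding text items pulled out of the application code.
class DemoLabels extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"button.ok", "OK"},
            {"menu.file", "File"},
        };
    }
}

class ListBundleDemo {
    // Looks up a text item by key instead of hard-coding the string.
    static String label(ResourceBundle bundle, String key) {
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        ResourceBundle labels = new DemoLabels();
        System.out.println(label(labels, "menu.file"));
    }
}
```

In a real application the bundle would normally be obtained through ResourceBundle.getBundle() with a locale, so that a translated subclass or properties file is picked up automatically.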

 
Figure 1. The Decorator Class
Although such a scheme makes a developer's life much easier, it's rather clumsy from the translation point of view, especially when it comes to reusing translations. In the localization industry, Translation Memory eXchange (TMX) is a standardized, XML-based data format for software and document translation assets. Most commercial translation tools can use TMX files to reuse translation data. Translators who want to use the TMX solution for Java must implement their own data conversion between TMX and ResourceBundle data.

Translators can use their TMX scheme to embed Java translation data into their central translation repository and leverage that data for Java translation. Once Java text resource data is presented in a TMX format, translators can import and export translation data to or from their translation tools. There are roughly seven major translation memory tools, such as TRADOS and Transit, and most of them are TMX compliant. If you can come up with your own ResourceBundle approach that integrates TMX data directly, bridging Java ResourceBundle data and TMX data becomes much easier and simpler (see Figures 1 and 2). Let's see how we can make a ResourceBundle-extended class read text resource data from TMX-formatted files instead of a ListResourceBundle class or a PropertyResourceBundle file.

 
Figure 2. Building a Bridge
Since 1997 the localization industry has put a lot of effort into standardizing a translation data format. The Localization Industry Standards Association (LISA), a nonprofit internationalization and localization organization, formed a special interest group called Open Standards for Container/Content Allowing Reuse (OSCAR) to define a translation memory data format and publish the TMX standard. This is simply XML-formatted data defining elements and attributes that are necessary to organize translation data efficiently. Listing 1 shows an example of a TMX file format.

Among all the TMX elements and properties, the most important ones provide text resource information. They are shown in Table 1. To use the ResourceBundle class, key information must also be stored. In this case the key can be an attribute of the translation unit element. The simplest TMX structure could result from the conversion diagrammed in Figure 3.
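
One possible shape for that conversion is sketched below: a resource key and its per-language text are written out as a single TMX translation unit. The element names tu, tuv, and seg come from the TMX format; using the tuid attribute for the key and a plain lang attribute for the language are assumptions made for this sketch, and the class and method names (TmxUnitWriter, toTu) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TmxUnitWriter {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Builds one <tu> element: the resource key goes into tuid (an assumption),
    // and each language gets its own <tuv>/<seg> pair.
    static String toTu(String key, Map<String, String> textByLang) {
        StringBuilder sb = new StringBuilder();
        sb.append("<tu tuid=\"").append(escape(key)).append("\">\n");
        for (Map.Entry<String, String> e : textByLang.entrySet()) {
            sb.append("  <tuv lang=\"").append(escape(e.getKey())).append("\">")
              .append("<seg>").append(escape(e.getValue())).append("</seg></tuv>\n");
        }
        sb.append("</tu>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Insertion order is kept so the output lists languages as they were added.
        Map<String, String> text = new LinkedHashMap<>();
        text.put("EN", "Save & Exit");
        text.put("FR", "Enregistrer et quitter");
        System.out.print(toTu("menu.save", text));
    }
}
```

A complete TMX file would wrap such units in the tmx, header, and body elements; they are left out here to keep the sketch short.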

The TMXResourceBundle Class
Now we can create the TMXResourceBundle class by extending the ResourceBundle class. This class reads resource text data directly from an attached TMX file. First, the TMXResourceBundle class is instantiated with two parameters: a TMX file name and a target language name. Then, using a DOM parser, it reads each translation unit's key information and the text for the specified language, and populates a hashtable with them. Using the hashtable data, the class implements the handleGetObject() method.
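
The design just described might look roughly like the following. This is a sketch under stated assumptions, not the article's own listing: it uses the standard JAXP DOM API, takes an InputStream rather than a file name so that the example is self-contained, and assumes the key is stored in a tuid attribute and the language in a lang attribute. The names TmxBundleSketch and TmxBundleDemo are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.ResourceBundle;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class TmxBundleSketch extends ResourceBundle {
    private final Hashtable<String, String> table = new Hashtable<>();

    // Parses the TMX data into a DOM tree and keeps only the target language.
    TmxBundleSketch(InputStream tmx, String language) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(tmx);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot read TMX data", e);
        }
        NodeList units = doc.getElementsByTagName("tu");
        for (int i = 0; i < units.getLength(); i++) {
            Element tu = (Element) units.item(i);
            String key = tu.getAttribute("tuid"); // assumed key attribute
            NodeList variants = tu.getElementsByTagName("tuv");
            for (int j = 0; j < variants.getLength(); j++) {
                Element tuv = (Element) variants.item(j);
                if (!language.equalsIgnoreCase(tuv.getAttribute("lang"))) {
                    continue; // skip the other languages
                }
                NodeList segs = tuv.getElementsByTagName("seg");
                if (segs.getLength() > 0) {
                    table.put(key, segs.item(0).getTextContent());
                }
            }
        }
    }

    @Override
    protected Object handleGetObject(String key) {
        return table.get(key);
    }

    @Override
    public Enumeration<String> getKeys() {
        return table.keys();
    }
}

class TmxBundleDemo {
    // Tiny in-memory TMX fragment; a real file would also carry a header element.
    static final String SAMPLE =
        "<tmx version=\"1.1\"><body>"
      + "<tu tuid=\"greeting\">"
      + "<tuv lang=\"EN\"><seg>Hello</seg></tuv>"
      + "<tuv lang=\"FR\"><seg>Bonjour</seg></tuv>"
      + "</tu></body></tmx>";

    static ResourceBundle load(String language) {
        return new TmxBundleSketch(
            new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8)), language);
    }

    public static void main(String[] args) {
        System.out.println(load("FR").getString("greeting"));
    }
}
```

Because the sketch extends ResourceBundle, callers keep using getString() and the other inherited lookup methods unchanged.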

Other approaches are possible here. Because the TMXResourceBundle class simply reads XML data, it doesn't necessarily have to be a subclass of the ResourceBundle class.

 
Figure 3. Conversion
However, code changes will be minimal if we extend the ResourceBundle class. Listing 2 shows a sample TMXResourceBundle class that uses Sun's XML parser. The TMXResourceBundle class is more similar to the PropertyResourceBundle class than it is to the ListResourceBundle class. You instantiate it in a program to read data from a TMX file. Once the class is instantiated, it reads all the data in the TMX file and loads it into a DOM tree. It then populates a hashtable so that the handleGetObject() method can find text by key, just as a standard ResourceBundle class does.

Instantiating the TMXResourceBundle class is the same as instantiating the PropertyResourceBundle class. First you obtain a system language code from the locale information. In TMX the value of the language attribute must be one of the ISO language identifiers (a two- or three-letter code) or one of the standard locale identifiers (a two- or three-letter language code, a dash, and a two-letter region code). In the sample code, the getLanguage() method is used simply to obtain a two-letter language code.
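
The following sketch shows one way the language code from a Locale could be compared with both forms of TMX language value described above. The helper names (TmxLanguageMatch, languageOf, matches) are hypothetical, and matching on the primary language subtag only is a simplifying assumption.

```java
import java.util.Locale;

class TmxLanguageMatch {
    // Returns the lowercase language code of a locale, e.g. "fr" for a French locale.
    static String languageOf(Locale locale) {
        return locale.getLanguage();
    }

    // True when a TMX language value such as "FR" or "fr-CA" refers to the given
    // language. Only the part before the first dash is compared, ignoring case.
    static boolean matches(String tmxLang, String language) {
        String primary = tmxLang.split("-", 2)[0];
        return primary.equalsIgnoreCase(language);
    }

    public static void main(String[] args) {
        String lang = languageOf(Locale.CANADA_FRENCH);
        System.out.println(lang);                    // prints "fr"
        System.out.println(matches("FR", lang));     // prints "true"
        System.out.println(matches("fr-CA", lang));  // prints "true"
        System.out.println(matches("EN-US", lang));  // prints "false"
    }
}
```

An application that needs region-specific text would compare the region part as well rather than discarding it.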

 
Figure 4. Proper English
Calling the getString() method with a key returns the correct text string, exactly as in the standard ResourceBundle approach (see Listing 3). Figures 4 and 5 show simple applet interfaces when the TMXResourceBundle class is called with English ("EN") and French ("FR"), respectively.

The Importance of Reusability
This XML approach has some drawbacks, just as there are some drawbacks to using the PropertyResourceBundle class. Reading an external text file obviously adds overhead and takes more time to load data than the ListResourceBundle class, which contains all of the data inside its instance. You also have to consider memory resources, especially when the number of resource text items becomes significant. Because a TMX file contains the text for every language, loading the data for languages you don't need into memory can be a serious overhead.
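
One way to reduce that memory overhead, offered here only as an illustration and not as the approach taken in the article, is to read the file with a streaming SAX parser and keep just the target language instead of building a full DOM tree. The sketch assumes the same tuid and lang attributes as before; TmxStreamingLoader is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TmxStreamingLoader extends DefaultHandler {
    private final String language;
    private final Map<String, String> table = new HashMap<>();
    private String currentKey;      // tuid of the <tu> being read
    private boolean wantedVariant;  // true while inside a <tuv> in the target language
    private StringBuilder segment;  // non-null only while inside a wanted <seg>

    private TmxStreamingLoader(String language) {
        this.language = language;
    }

    // Streams through the TMX data and returns key -> text for one language.
    static Map<String, String> load(InputStream tmx, String language) {
        TmxStreamingLoader handler = new TmxStreamingLoader(language);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(tmx, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot read TMX data", e);
        }
        return handler.table;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("tu")) {
            currentKey = atts.getValue("tuid");
        } else if (qName.equals("tuv")) {
            wantedVariant = language.equalsIgnoreCase(atts.getValue("lang"));
        } else if (qName.equals("seg") && wantedVariant) {
            segment = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (segment != null) {
            segment.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("seg") && segment != null) {
            if (currentKey != null) {
                table.put(currentKey, segment.toString());
            }
            segment = null;
        } else if (qName.equals("tuv")) {
            wantedVariant = false;
        }
    }
}
```

The parser still has to scan the whole file, so load time is not eliminated, but text for the other languages is never held in memory.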

 
Figure 5. En Français
Nevertheless, the TMXResourceBundle class is an efficient approach to handling translation data. If the text resources are relatively small, all languages can be handled at once. Further, the resource bundle file itself is fully compatible with the current major translation data standard. Once you create a TMX-formatted resource text file, all you have to do is either hand it over to translators or create a simple translation tool that reads a TMX file. Any TMX-formatted data can be loaded into a major translation memory tool of the kind commonly used in software and document translation tasks.

Most benefits of the TMXResourceBundle class are on the development side. Since the number of words usually determines the cost of translation, requesting translation of the same items more than once is not cost efficient. Using TMX's DTD, you can also embed information such as a package name, a class name, and a project name. This lets you find exact matches in existing translation data, so you can extract only the new items for translation. Meanwhile, if you want to achieve consistency between software translation and document translation (such as guides, manuals, and even computer-based training programs), TMX proves to be a great solution. By importing your Java TMX file into any translation tool, you can reuse Java translations through the word book or glossary functions that are included in most translation tools. Thus, TMX benefits not just the translation industry, but Java internationalization development as well.


Masaki Itagaki is a linguistic engineering solution specialist at J.D. Edwards World Source Company. Last year he published The Software Localization Handbook, a coauthored Japanese-language guidebook for software localization.

This article is reproduced with permission from Java Pro