© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org



Designing Tools-Friendly File Formats

Shailendra Musale

Localizers are often faced with the need to localize custom file formats for specific clients. While it is generally best to use standard file formats when localization is a consideration, this is not always possible. In this article Shailendra Musale of F-Secure Corporation, author of Localizing for Mobile Devices: A Primer, discusses some of the issues that must be taken into account when designing and processing custom file formats that end up in a localization workflow. When developers are aware of these issues they can design file formats that will be easier to deal with during localization. Even with a format that should work, the bottom line is test, test, test.



Creating user-friendly software is a common objective for any software company. By creating user-friendly and useful software, a company wins customers and, if the software is well localized, can capture international markets. Despite the importance of localization, how many developers design tools-friendly file formats? How many of them try to make their file formats compatible with localization tools so that they can minimize problems during the localization process?

A tools-friendly file format is one that is compatible with the localization tools in use today and that helps speed up the localization process by minimizing the most common errors. Today there are various tools on the market that are designed for localization and internationalization. Most of them support a wide variety of file formats, whereas some have been designed specifically to support particular file formats such as XML.

Often developers try to design their own file formats for various reasons, and such proprietary file formats may not always work well with localization tools. Before attempting to design custom file formats developers should first find out if there are any standard formats that can be used to serve their needs and if the localization tools they are using support these formats. If there are no standard formats, then and only then should they design their own file format. After that, they must test their format to verify that the file format is compatible with the localization tools.

This article focuses on some points that will help ensure that file formats are compatible with localization tools. Although it deals with software file formats rather than the file formats used in documentation and desktop publishing, many of the principles apply in both cases.

Your Own File Format

File formats commonly used in localization include Windows Resource (*.rc) files, HTML, XML, Java properties files, and RTF files. Developers store localizable strings in these formats, and the files are then localized using localization tools. However, developers often design their own file format, thinking that it will ease localization tasks. In most cases, the file is tab-delimited: a tab separates the source-code part from the translatable string, which is enclosed in double quotes.
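As a rough illustration of the kind of script a tools engineer might end up writing for such a format, the following Python sketch splits one line into its identifier and its translatable string. The exact layout (a #define identifier, a tab, then a double-quoted string with backslash escapes) is an assumption based on the descriptions in this article, and the name parse_line is hypothetical.

```python
import re

# Assumed line layout (not a real standard):
#   #define IDENTIFIER<TAB>"translatable text"   // optional comment
LINE_RE = re.compile(r'^#define\s+(\w+)\t"((?:[^"\\]|\\.)*)"')

def parse_line(line):
    """Return (identifier, text) for a string definition, or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

sample = '#define UPDATE_TITLE\t"Update complete"\t// shown in the title bar'
print(parse_line(sample))
```

Even a parser this small embeds decisions (which whitespace separates fields, how quotes are escaped) that a localization tool has to reproduce exactly, which is why a standard format is usually the safer choice.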

Figure 1: Sample file format

The format and design of such proprietary files often differ considerably from standard file formats; hence they may not always be compatible with the localization tools that you are using. You may need to write additional macros or scripts to make them work with localization tools, and this task often falls to a localization engineer or tools engineer. Developers are usually not aware of the localization tools used by the company and are unlikely to take time away from writing code to learn about them. Sometimes you may need to seek help from the tool vendor's support team to get a file format to work, but this can be expensive and time-consuming. Here are some points to consider before coming up with your own file format.

Discuss Requirements with Your Localization Team

Before you design a proprietary file format, contact the localization team in your company. Find out what kind of localization or translation memory (TM) tool they are using. If you have time, try the tool yourself and get more information from its online help or manuals. With the help of the localization team, run a demo translation of a small file to test whether your file format works and whether it is tools-friendly. Make any necessary corrections, and write guidelines and instructions that will help translators during translation.

Comments & Instructions

Developers often include comments or instructions to help translators understand the context of strings and terms. Some notes tell translators not to translate particular strings. Refer to the two comments (in blue) appearing after // in Figure 1: Sample file format. During the demo translation, check whether your tools handle comments correctly. Find out whether translators can still see your instructions after the tool has processed the file. If your instructions are not visible to the translator, you will probably need to supply them in a separate file, such as a README file.
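If the tool hides comments from translators, one option is to extract them with a script and ship them separately. The sketch below assumes that notes are written as trailing // comments after the quoted string, as the article describes for Figure 1; the function name translator_notes and the exact line layout are hypothetical.

```python
import re

# Assumed layout: #define ID<whitespace>"text"<whitespace>// note for translators
ENTRY_RE = re.compile(
    r'^#define\s+(\w+)\s+"((?:[^"\\]|\\.)*)"\s*(?://\s*(.*))?$'
)

def translator_notes(lines):
    """Map each string identifier to its trailing comment, if it has one."""
    notes = {}
    for line in lines:
        match = ENTRY_RE.match(line.rstrip())
        if match and match.group(3):
            notes[match.group(1)] = match.group(3)
    return notes

lines = [
    '#define APP_NAME\t"Everest Monitor"\t// Do not translate',
    '#define HELP_URL\t"http://example.invalid"',
]
for ident, note in translator_notes(lines).items():
    print(f"{ident}: {note}")
```

Note that the pattern looks for // only after the closing quote, so a // inside a quoted URL is not mistaken for a comment. The resulting list could be pasted into the README file mentioned above.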

Line or Word Wrapping

Some projects have dialog-box or screen-size requirements that restrict the length of strings. Developers may use certain characters or symbols to indicate a line or word wrap. In some cases particular words must not be wrapped in the middle. For example, a product name or application name must appear intact on one line. Developers must indicate this with certain characters in the code.

In the sample file shown in Figure 1, the product name Everest Monitor and the website address are enclosed in SGML-style angle brackets. The developers used < and > to bracket words that must be kept on one line and not broken in the middle. The code is written so that, when the file is compiled, the < and > characters are interpreted correctly and not displayed, while the enclosed words are kept on one line.

In trials by the author this simple file format was not correctly parsed by some major translation tools, pointing out the difficulties that can lie in store even with seemingly simple and intuitive custom file formats. In these tests the tools did not correctly display all translatable strings and had other difficulties. This does not say anything bad about the tools, but does point out the real nature of the problems discussed in this article.

Sometimes hard returns (carriage returns) are used within a string. These might not lead to any compilation problems, but they might affect how TM tools segment the file, reducing the leverage you expect from the translation memory. For example, in Figure 1, on the line beginning with #define ERROR_MSG1, there is a carriage return (in red) in the middle of the string, before the sentence Please check the server address. The presence of this carriage return might cause a translation tool to treat the two lines as separate segments, even though the creator of the format might wish them to be treated as a single segment.
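A simple pre-flight check can flag such strings before the file goes to the localization team. The Python sketch below assumes the strings have already been read into a dictionary of identifier to text; the function names are hypothetical, the first sentence of the sample message is invented for illustration, and the line-based segmenter only imitates how a tool might behave.

```python
def strings_with_hard_returns(entries):
    """Return identifiers whose text contains a raw CR or LF character."""
    return [ident for ident, text in entries.items()
            if "\r" in text or "\n" in text]

def segment_by_line(text):
    """Naive segmentation: every hard return starts a new segment."""
    return [part for part in text.splitlines() if part.strip()]

entries = {
    # The first sentence here is made up; only the second comes from Figure 1.
    "ERROR_MSG1": "Cannot connect to the server.\nPlease check the server address.",
    "UPDATE_COMPLETE": "Update complete.",
}
for ident in strings_with_hard_returns(entries):
    print(ident, "->", segment_by_line(entries[ident]))
```

Any identifier reported here is a candidate for either removing the hard return or confirming in the demo translation that the tool segments it as intended.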

To avoid these problems, perform a demo translation to make sure that they will not arise during the localization process.

Escape Characters

In software files, some characters are reserved; to use them as literal characters, you need to escape them by inserting a backslash before them. In Figure 1, refer to the strings #define ERROR_MSG3, #define ERROR_MSG4 and #define UPDATE_COMPLETE, in which a backslash is used to escape reserved characters. In the string #define ERROR_MSG4, the developers want to display double quotes to the end user. Since double quotes are reserved characters in this format, the developers have escaped them by inserting a backslash before each one. The compiler will then process these escaped double quotes properly, and the application will display them.

During demo localization, check if your tools can handle such escape characters properly. If there are any problems, then make necessary changes to your code and test again.
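One way to check this after a demo localization is to scan each translated string for double quotes that have lost their backslash. The following Python sketch assumes backslash escaping as described above; the function name has_unescaped_quote and the sample messages are hypothetical.

```python
def has_unescaped_quote(text):
    """True if text contains a double quote that is not backslash-escaped."""
    escaped = False
    for char in text:
        if escaped:
            escaped = False   # this character is consumed by the escape
        elif char == "\\":
            escaped = True    # the next character is escaped
        elif char == '"':
            return True
    return False

# Hypothetical source and translated text for a message like ERROR_MSG4.
source_text = 'Press \\"OK\\" to continue.'
translated_text = 'Appuyez sur "OK" pour continuer.'
print(has_unescaped_quote(source_text))
print(has_unescaped_quote(translated_text))
```

A translated string that fails this check would probably terminate the string early when the file is compiled, so it is worth catching before compilation.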

Language-Specific Information

Many times a proprietary file format may not contain language information such as character set encoding, code pages or language codes. Developers might store this language-related information in some other part of the code, or they might not even be aware of the fact that there must be such language-related information in the file. Before starting localization the localization team should check contents of localizable files and discuss any language-related issues with the developers.

During pre-processing of files with localization tools, the tool will ask you to specify source- and target-language information (such as language, locale, font name and size, target file name etc.). After you complete the pre-processing of files or the demo localization, check that the tool has stored the target language information correctly in the target file and the file format is not corrupted.

Saving Localized Files

If there is a requirement that the localized file must be saved in a specific encoding (such as UTF-8 or UCS), make sure that your localization tool allows you to do so. Most recent tools support a large variety of encodings, but if support for the encoding you need is not available, then add one more step in your localization process where localized files will be saved in the required format.
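The extra conversion step can be very small. The sketch below shows one way to do it in Python, assuming you know the encoding the tool actually wrote; the function names and the Shift_JIS example are illustrative only.

```python
from pathlib import Path

def reencode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from one encoding and encode them in another."""
    return data.decode(source_encoding).encode(target_encoding)

def reencode_file(path, source_encoding, target_encoding="utf-8"):
    """Rewrite a file in place in the target encoding."""
    file_path = Path(path)
    # Bytes are read and written directly so that line endings are untouched.
    file_path.write_bytes(
        reencode(file_path.read_bytes(), source_encoding, target_encoding)
    )

# Example: text the tool saved as Shift_JIS, converted to UTF-8.
shift_jis_bytes = "サーバー".encode("shift_jis")
print(reencode(shift_jis_bytes, "shift_jis"))
```

Decoding raises an error if the file is not really in the assumed source encoding, which is a useful early warning in its own right.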

File-Naming Conventions

You will need to decide on a file-naming convention for your localized files. Check whether your tool allows you to specify the target-language filename. Some tools rename the target-language file automatically. In that case, you will need an extra step to rename all localized files according to your file-naming convention.

A proprietary file may have a filename such as uimenu.loc. For a source-language file, you can name the file uimenu.len, where en is the two-letter language code for English. For a target-language file, such as Japanese, you might name the file uimenu.lja, where ja is the two-letter ISO 639-1 language code for Japanese (jp is the country code for Japan, not a language code). Adopting a file-naming convention makes localized files easier to maintain. It is important that the convention be consistent and follow standard practice; otherwise it may cause confusion among the development teams.
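If the tool cannot produce names in your convention, the renaming step is easy to script. This Python sketch builds a localized filename by replacing the extension with the letter l plus a two-letter language code, following the uimenu.len example; the function name localized_name is hypothetical.

```python
from pathlib import PurePath

def localized_name(source_name, language_code):
    """Build a localized filename, e.g. uimenu.loc + 'en' -> uimenu.len."""
    if len(language_code) != 2 or not language_code.isalpha():
        raise ValueError(f"expected a two-letter language code: {language_code!r}")
    new_suffix = ".l" + language_code.lower()
    return str(PurePath(source_name).with_suffix(new_suffix))

for code in ("en", "fi"):
    print(localized_name("uimenu.loc", code))
```

Generating every name from one function keeps the convention consistent across all target languages.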

Unused or Dead Strings

A localizable file may contain unused or "dead" strings. Instead of removing such strings, developers tend to comment them out using characters such as //. In Figure 1, refer to the unused string shown as //#define ERROR_MSG1 "The server cannot be reached". When you pre-process your file, make sure that the tool ignores such dead strings or that you can mark them as non-translatable. Files often contain a considerable number of dead strings, so it is advisable to clean up your source files periodically and remove them; otherwise you might end up paying for the translation of unused and unneeded material. Dead strings may also increase the size of your application unnecessarily and create problems when localized files are updated using TM tools.
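A periodic cleanup can start with a report of the commented-out definitions. The following Python sketch assumes that dead strings look like the example above, that is, a #define line prefixed with //; the function name dead_strings is hypothetical.

```python
import re

# Assumes dead strings are written as: //#define ID "text"
DEAD_RE = re.compile(r'^\s*//\s*#define\s+(\w+)')

def dead_strings(lines):
    """Return the identifiers of commented-out string definitions."""
    found = []
    for line in lines:
        match = DEAD_RE.match(line)
        if match:
            found.append(match.group(1))
    return found

lines = [
    '//#define ERROR_MSG1 "The server cannot be reached"',
    '#define UPDATE_COMPLETE\t"Update complete."',
    '// An ordinary comment',
]
print(dead_strings(lines))
```

The report can be used either to delete the lines from the source file or to verify that the localization tool did not expose them for translation.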

Compatibility with Compilers and Tools

Even though many localization tools can compile localized files, they probably cannot compile the custom file format that you have designed. The tool will give you translated files as output, and you will need to compile them using your own compilers or scripts. Perform small test localizations, compile the translated files, and check for errors or warnings. Check the integrity of the translated file as well: sometimes special characters, symbols or escape characters are not handled properly by the tool. Find and correct such errors before you start compiling; doing so will save your project a lot of time.
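Part of this integrity check can be automated. The Python sketch below compares source and translated strings, both assumed to be already loaded as dictionaries of identifier to text, and reports missing identifiers and mismatched counts of special markers. The marker list (the < and > wrap markers and escaped double quotes) is an assumption drawn from the examples in this article, and the function name is hypothetical.

```python
def integrity_problems(source, target, markers=("<", ">", '\\"')):
    """Compare {identifier: text} dictionaries and list suspicious differences."""
    problems = []
    for ident, text in source.items():
        if ident not in target:
            problems.append(f"{ident}: missing from translated file")
            continue
        for marker in markers:
            # A marker that appears a different number of times was probably
            # dropped, duplicated or altered during translation.
            if text.count(marker) != target[ident].count(marker):
                problems.append(f"{ident}: count of {marker!r} differs")
    for ident in target:
        if ident not in source:
            problems.append(f"{ident}: not present in source file")
    return problems

source = {"ABOUT": "About <Everest Monitor>", "DONE": "Update complete."}
target = {"ABOUT": "Tietoja: Everest Monitor"}
for problem in integrity_problems(source, target):
    print(problem)
```

A check like this does not replace compiling the translated file, but it tends to find the cheapest errors first.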

Updating with Translation Memory

Localization tools are designed to work with standard file formats. Some tools may have support for additional file formats. You may discover that your tool can localize your proprietary files in a demo translation, but it is equally important to go one step further and check if the tool can handle any version updates of your files.

A test localization of a version update file will help to ensure this. Check if the tool can give best matches from the translation database. Once you design your own file format and know that it works, stick with that file format and be consistent in its layout. Give information about this proprietary format to other team members and make sure that they follow the guidelines regarding the file format properly. Any changes in the layout or format of the file may adversely affect the number of matches in future projects or updates.
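To see why layout changes matter, it can help to look at how a similarity score reacts to them. The sketch below uses Python's standard difflib module as a stand-in; real TM tools use their own matching algorithms, so the numbers are only indicative, and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(old, new):
    """Rough similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, old, new).ratio()

old = "Please check the server address."
# Version update with a text change only.
same_layout = "Please check the server address and port."
# Same text change, plus new markup introduced by a format change.
new_layout = "<Please check> the server address and port.\\n"

print(round(similarity(old, same_layout), 2))
print(round(similarity(old, new_layout), 2))
```

The second score is lower even though the wording changed by the same amount, which mirrors how a changed file layout can turn would-be high matches into lower fuzzy matches.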

Make Best Use of Tools

You can make the best use of localization tools when localizing your proprietary files. To do so, you will need to take proper and consistent steps to align your files with your tools. Discuss this with the localization team in your company, as they have the expertise in the tools area. Proper planning and small test projects (such as a demo translation) will save valuable time in the development cycle and will also reduce the overall cost of your project.

Conclusion

Custom file formats can be a real hassle for localizers if the basic guidelines set forth in this article are not taken into account. Even when the guidelines are followed, problems may occur, but consistent and incremental testing makes it far more likely that they will be detected early and corrected before they become disasters. Keeping the concerns raised in this article in mind will make the design and use of custom file formats less likely to cause problems.


*Note: The product names and URLs used in this article are fictitious, and any resemblance to existing products or URLs is purely coincidental and unintended. They are used for the purpose of example only.


Shailendra Musale studied Japanese language at the University of Pune, India. He received the Japan Foundation's Study Tour Award Scholarship in 1995. He then studied the Finnish language. He holds a Master's Degree in Economics from the University of Pune, India. He has worked in the field of software localization since 1994 and his positions have included one at the Machine Translation Research project of the Institute of Systems Science (now known as Kent Ridge Digital Labs) in Singapore. He is currently a Senior Localization Engineer with F-Secure Corporation in Finland, where he works primarily on localization for mobile devices. He can be reached by email at Shailendra.Musale@F-Secure.com.



