FOCUS ON STANDARDS
Coping with Babel: How to Localize XML
(Part 2 of 2)
This article continues Andrzej Zydron’s exploration of common problems in XML localization and their solutions. In the first of this two-part series, he outlined the pitfalls that are often encountered by authors, programmers and localizers when first using XML, as well as ways to avoid these problems. In this installment, Zydron takes on localizing graphics, dealing with text expansion, and marking up text to facilitate localization. Following Zydron’s advice can save developers time, money and headaches, and can help them reach out effectively to the world.

Avoid Processing Instructions (PIs) in Translatable Text
Processing Instructions (PIs) are a syntactically 'weak' instrument in XML: there is no built-in mechanism in XML to help preserve them during processing. Above all, avoid translatable text in PIs.
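The fragility of PIs inside translatable text can be illustrated with a short Python sketch. It assumes a toy translation memory that is simply a dictionary keyed on the exact source segment, and a regular expression that treats anything of the form `<?target ...?>` (other than the XML declaration) as a PI. The names `TM`, `strip_pis` and `lookup`, the sample segment and the `<?pagebreak?>` PI are all hypothetical; a real extraction tool would use a proper XML parser rather than a regular expression.

```python
import re

# Matches a processing instruction such as <?pagebreak?> or <?fmt bold="yes"?>.
# The negative lookahead leaves the XML declaration (<?xml ...?>) alone.
# Assumption: the input has no CDATA sections or comments containing "<?".
PI_PATTERN = re.compile(r"<\?(?!xml[\s?])[^\s?]+.*?\?>", re.DOTALL)


def strip_pis(segment: str) -> str:
    """Return the segment with every processing instruction removed."""
    return PI_PATTERN.sub("", segment)


# Hypothetical translation memory: exact source segment -> stored translation.
TM = {"Press the green button to start.": "Appuyez sur le bouton vert pour démarrer."}


def lookup(segment: str):
    """Exact-match TM lookup; returns None when the segment is not found."""
    return TM.get(segment)


if __name__ == "__main__":
    with_pi = "Press the green <?pagebreak?>button to start."
    print(lookup(with_pi))             # the PI makes the lookup miss
    print(lookup(strip_pis(with_pi)))  # the match is restored after stripping
```

The sentence is linguistically identical in both cases; only the embedded PI makes the first lookup fail.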
Example 10: Incorrect Use of Translatable Text in PIs.
Example 11: Proposed Solution

It is generally not a good idea to have any PIs present within translatable text. There is no guarantee that they will survive the translation process unless special processing is carried out to preserve them. The difficulty lies in deciding whether the PIs are significant or not, and this can cause problems with translation memory (TM) systems. Because PIs are syntactically weak and lack the handling capabilities of elements, it is not easy for off-the-shelf extraction software to parameterize their handling. The insertion of a PI can cause otherwise linguistically identical text to fail TM matching. It is better to strip out all PIs prior to translation.

Avoid the Use of Text in Bitmap Graphics
With the existence of the SVG (Scalable Vector Graphics) format, there should be no excuse to use bitmapped graphics. Text in bitmaps poses particular problems, in that the original bitmap will need to be recreated for the target language with the translated text. This is usually a very costly and error-prone process, and it requires appropriate target language knowledge on the part of the person who edits the graphics.

Never Make Any Assumptions About Text Length Sizes in Your Design
Always allow for the fact that the target language text may be significantly longer than the source. For example, "Welcome" becomes "шчыра запрашаем" in Belarusian and "maligayang pagdating" in Tagalog. Design your output with flexibility in mind.

Always Use UTF-8 (Or Alternatively UTF-16) Encoding Throughout Your Process
With English source, we are often tempted to use 7-bit ASCII or ISO 8859-1 encoding. As soon as you are required to translate into a language that is not covered by ISO 8859-1, you will discover that trying to maintain documents in different encoding schemes is a real problem. Always use UTF-8 from the start. It gives you immediate access to commonly used punctuation characters such as the em dash and the en dash.
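The encoding problem can be shown with a few lines of Python. This is a minimal sketch that checks whether sample strings can be stored in ASCII, ISO 8859-1 and UTF-8; the helper name `can_encode` and the sample strings are illustrative assumptions.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample content: English with an em dash (U+2014), and Belarusian.
SAMPLES = {
    "English with an em dash": "Save \u2014 then close.",
    "Belarusian": "шчыра запрашаем",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        for encoding in ("ascii", "iso-8859-1", "utf-8"):
            print(f"{label:25} {encoding:11} {can_encode(text, encoding)}")
```

Of the three encodings tried, only UTF-8 accepts both samples, so a pipeline that starts in UTF-8 never has to be converted when a new target language arrives.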
UTF-8 also significantly simplifies your document processing. All XML parsing tools are required to handle both UTF-8 and UTF-16. UTF-8 is more economical in terms of space usage for most European languages whose scripts are based on the Latin alphabet.

Never Break a Linguistically Complete Text Unit Over More Than One Non-inline Element
Never start a sentence in one non-inline element and continue it in another. You cannot rely on the translated text being in the same word sequence in the target language. It also makes the job of translation much more difficult, as the translator cannot see the whole sentence.
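One way to catch this during authoring is a simple lint pass over the source. The sketch below is a heuristic under stated assumptions: the non-inline elements are hypothetically named `p` and `item`, and a block whose text does not end in sentence-final punctuation is reported as a possible broken sentence. The function name `find_broken_units` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: these are the non-inline elements of a hypothetical DTD.
BLOCK_TAGS = {"p", "item"}
SENTENCE_END = (".", "!", "?", ":")


def find_broken_units(xml_text: str) -> List[str]:
    """Return the text of block elements that do not end a sentence."""
    root = ET.fromstring(xml_text)
    suspects = []
    for elem in root.iter():
        if elem.tag not in BLOCK_TAGS:
            continue
        # itertext() also collects text inside inline children such as <emph>.
        text = " ".join("".join(elem.itertext()).split())
        if text and not text.endswith(SENTENCE_END):
            suspects.append(text)
    return suspects


# Hypothetical document: the first sentence continues in the second <p>.
DOC = """<doc>
  <p>To change the filter, first switch off</p>
  <p>the <emph>main</emph> power supply.</p>
  <p>Then open the cover.</p>
</doc>"""

if __name__ == "__main__":
    for text in find_broken_units(DOC):
        print("Possible broken sentence:", text)
```

A check like this will produce false positives (headings, list fragments), so it is better suited to flagging candidates for an author to review than to rejecting documents outright.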
Example 12: Example of a Sentence Broken Over More Than One Element.

Avoid the Use of Typographical Elements
Use logical elements that encompass the text, instead of typographical elements.
Example 13: Example of Typographical Element Usage.

Use "emph" instead of "bold." Enclose any text that must appear together on the same line in line elements.
Example 14: Suggested Correct Usage.

Avoid at all costs introducing any line breaks into the text stream. If you do so, it is unconditionally guaranteed that this will cause problems in some, if not all, of the target languages.

Do Not Mix Translatable and Non-translatable Text in the Same Elements
Keep non-translatable PCDATA in different elements than translatable PCDATA.
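When the two kinds of content live in different elements, an extraction tool can decide what to extract by element name alone. A minimal sketch, assuming a hypothetical vocabulary in which `title` and `desc` hold translatable text and `partno` holds a non-translatable part number; `TRANSLATABLE_TAGS` and `extract_translatable` are illustrative names, not the API of any real tool.

```python
import xml.etree.ElementTree as ET

# Assumption: in this hypothetical vocabulary only these elements are translatable.
TRANSLATABLE_TAGS = {"title", "desc"}


def extract_translatable(xml_text):
    """Return (tag, text) pairs for elements whose name marks them translatable."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, (elem.text or "").strip())
        for elem in root.iter()
        if elem.tag in TRANSLATABLE_TAGS and (elem.text or "").strip()
    ]


# Hypothetical parts list: part numbers and descriptions are kept apart.
DOC = """<parts>
  <part>
    <partno>AX-4471-B</partno>
    <desc>Replacement filter cartridge</desc>
  </part>
  <part>
    <partno>AX-4480-C</partno>
    <desc>Drain hose, 2 m</desc>
  </part>
</parts>"""

if __name__ == "__main__":
    for tag, text in extract_translatable(DOC):
        print(tag, "->", text)
```

The part numbers never reach the translator, and no attribute values have to be inspected to make that decision.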
Example 15: Example of Mixed PCDATA.

Most XML translation tools will have problems with this type of construct. It is only when inspecting the 'id' attribute that a decision can be made as to whether the PCDATA should be extracted or not.
Example 16: Suggested Solution.

Avoid Holding Source and Target PCDATA in the Same Document
This can cause all manner of problems for processing and extraction tools.
Example 17: Example of Mixed Source and Target PCDATA

Unless your document requires mixed language content, use a separate document instance to store each target language version. If you store both source and target data in the same document, it will become unwieldy, overly large and cumbersome to process.

Clearly Define Text That Requires Translation
Keep any PCDATA that requires translation in different elements from PCDATA that does not require translation. Use special elements for text within PCDATA that is specifically not to be translated.
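Marking non-translatable text with its own inline element lets a tool protect it mechanically. The sketch below assumes a hypothetical inline element named `notrans`; it replaces each such element with a numbered placeholder before translation and restores the protected text afterwards. The element name, the `{n}` placeholder syntax and the function names `protect` and `restore` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NOTRANS_TAG = "notrans"  # hypothetical inline element for protected text


def protect(elem):
    """Flatten elem to text, swapping each <notrans> child for a {n} placeholder."""
    parts = [elem.text or ""]
    protected = []
    for child in elem:
        if child.tag == NOTRANS_TAG:
            protected.append("".join(child.itertext()))
            parts.append("{%d}" % len(protected))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts), protected


def restore(translated, protected):
    """Put the protected strings back into a translated segment."""
    for number, original in enumerate(protected, start=1):
        translated = translated.replace("{%d}" % number, original)
    return translated


if __name__ == "__main__":
    para = ET.fromstring(
        "<p>Run <notrans>setup.exe</notrans> and select <notrans>COM1</notrans>.</p>"
    )
    segment, protected = protect(para)
    print(segment)  # what the translator sees
    # A stand-in French translation that keeps the placeholders.
    print(restore("Lancez {1} et sélectionnez {2}.", protected))
```

Because the placeholders are numbered, the translator is free to reorder them to suit the word order of the target language.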
Example 18: Suggested Solution.

Suggested Further Reading
Yves Savourel of ENLASO Corporation, who has done so much good work in the field of localizing XML, has an excellent web page dedicated to the subject, the XML Internationalization and Localization FAQ. Another very good reference work is the paper by Richard Ishida of W3C, Localisation Considerations in DTD Design.

Finally – Please Invest Time and Effort in the Quality of the Source Text
If the source text is properly written in a clear and understandable manner, then it will be easier to read and easier to localize. It is worth investing in tools that will check the grammar and terminology in your source text. Without tools, your authors do not have a benchmark against which to test themselves, and it is thus all too easy for poorly written text to make its way into your documents.

Andrzej Zydron is a member of the LISA OSCAR Steering Committee. He is the technical architect and editor of the GILT Metrics proposed specification suite, as well as editor of the proposed TBX Link specification. Zydron also sits on the OASIS technical committees for Translation Web Services, XLIFF and XLIFF segmentation. As CTO for xml-Intl Ltd., he is currently developing the next generation of XML-based text memory systems to reduce authoring and translation costs for documentation. Zydron is fluent in English, Polish and French.
23-27 June 2008