© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org


FOCUS ON STANDARDS

Coping with Babel: How to Localize XML
(Part 2 of 2)

Andrzej Zydron, CTO, xml-Intl Ltd. & Member, OSCAR Steering Committee

This article continues Andrzej Zydron’s exploration of common problems in XML localization and their solutions. In the first of this two-part series, he outlined the pitfalls that are often encountered by authors, programmers and localizers when first using XML, as well as ways to avoid these problems. In this installment, Zydron takes on localizing graphics, dealing with text expansion, and marking up text to facilitate localization. Following Zydron’s advice can save developers time, money and headaches, and can help them reach out effectively to the world.


Avoid Processing Instructions (PIs) in Translatable Text

Processing Instructions are a very 'weak' syntactical instrument in XML. There is no built-in mechanism in XML to assist syntactically in the preservation of Processing Instructions. Above all, avoid translatable text in PIs.

<para>
  Use a <?tool name="claw hammer"?> to release
  the CPU retention catch.
</para>

Example 10: Incorrect Use of Translatable Text in PIs.

<para>
  Use a <tool id="a1098">claw hammer</tool>
  to release the CPU retention catch.
</para>

Example 11: Proposed Solution

It is generally not a good idea to have any PIs present within translatable text. Unless special processing is carried out to preserve them, there is no guarantee that they will survive the translation process, and it is hard to decide which PIs are significant and which are not. PIs also cause problems for translation memory (TM) systems: because they are syntactically weak and lack the handling capabilities of elements, off-the-shelf extraction software cannot easily be configured to handle them, and an inserted PI can cause otherwise linguistically identical text to fail TM matching. It is better to strip out all PIs prior to translation.
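
As a rough illustration of stripping PIs before translation, the sketch below uses Python's standard xml.etree.ElementTree module, whose default parser discards processing instructions (and comments) while keeping the surrounding text. The function name is hypothetical, and a production pipeline would probably want to record the removed PIs so they can be restored afterwards.

```python
import xml.etree.ElementTree as ET


def strip_processing_instructions(xml_text: str) -> str:
    """Return xml_text re-serialized without processing instructions.

    The default ElementTree parser drops PIs (and comments) but keeps
    the character data around them.
    """
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


# Input adapted from Example 10.
source = (
    '<para>Use a <?tool name="claw hammer"?> to release '
    "the CPU retention catch.</para>"
)
cleaned = strip_processing_instructions(source)
print(cleaned)
```

Note that the text hidden inside the PI ("claw hammer") is lost along with it, which is exactly why translatable text should never be placed there.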

Avoid the Use of Text in Bitmap Graphics

With the existence of the SVG (Scalable Vector Graphics) format, there should be no excuse to use bitmapped graphics that contain text. They pose particular problems in that the original bitmap will need to be recreated for each target language with the translated text. This is usually a very costly and error-prone process, and it requires the person who edits the graphics to have appropriate knowledge of the target language.

Never Make Any Assumptions About Text Length Sizes in Your Design

Always allow for the fact that the target language text may be significantly longer than the source. For example, "Welcome" becomes "шчыра запрашаем" in Belarusian and "maligayang pagdating" in Tagalog. Design your output with flexibility in mind.
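
To get a feel for how much a string can grow, a simple character-count ratio is often enough. The sketch below is only a rough proxy: the function name is hypothetical, and character counts ignore font metrics and rendered width, which are what ultimately matter in a layout.

```python
def expansion_ratio(source: str, target: str) -> float:
    """Length of target relative to source, counted in characters."""
    return len(target) / len(source)


welcome = "Welcome"
belarusian = "шчыра запрашаем"

ratio = expansion_ratio(welcome, belarusian)
print(f"Belarusian text is {ratio:.2f} times the length of the English")
```

A ratio above 2 for a one-word label shows why fixed-width buttons and table cells sized for English tend to break after translation.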

Always Use UTF-8 (Or Alternatively UTF-16) Encoding Throughout Your Process

With English source, we are often tempted to use 7-bit ASCII or ISO 8859-1 encoding. As soon as you are required to translate into a language that is not covered by ISO 8859-1, you will discover that maintaining documents in different encoding schemes is a real problem.

Always use UTF-8 from the start. It gives you immediate access to commonly used punctuation characters such as the em dash and en dash, and it significantly simplifies your document processing.

All XML parsing tools are required to handle both UTF-8 and UTF-16. UTF-8 is more economical in terms of space usage for most European languages whose scripts are based on the Latin alphabet.
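
The trade-offs described above can be checked directly. The following sketch (helper names are hypothetical) compares the storage cost of UTF-8 and UTF-16 for a given string and tests whether the string could have been stored in ISO 8859-1 at all.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def fits_latin1(text: str) -> bool:
    """True if every character can be stored in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


polish = "Mój poduszkowiec jest pełen węgorzy."
japanese = "私のホバークラフトは鰻で一杯です。"

for sample in (polish, japanese):
    print(encoded_sizes(sample), fits_latin1(sample))
```

For the Polish sentence UTF-8 is the smaller encoding, while for the Japanese sentence UTF-16 is. Neither sentence fits in ISO 8859-1, and nor does a lone em dash.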

Never Break a Linguistically Complete Text Unit Over More Than One Non-inline Element

Never start a sentence in one non-inline element and continue it in another. You cannot rely on the translated text being in the same word sequence in the target language. It also makes the job of translation much more difficult as the translator cannot see the whole sentence.

<para>
  <line>This text should not be</line>
  <line>broken this way – the translated
  text may well be in a different order.</line>
</para>

Example 12: Example of a Sentence Broken Over More Than One Element.
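
One way to catch this pattern in existing documents is a simple lint pass. The sketch below is a heuristic, not a definitive check: it flags any block element whose text does not end in sentence-final punctuation. The function name and the punctuation list are assumptions, and headings or list items that legitimately lack a full stop would be flagged as well.

```python
import xml.etree.ElementTree as ET

# Assumed set of sentence-final characters; extend as needed.
SENTENCE_END = (".", "!", "?", ":", "。")


def unfinished_blocks(xml_text: str, block_tag: str) -> list:
    """Return the text of each block_tag element that stops mid-sentence."""
    root = ET.fromstring(xml_text)
    flagged = []
    for block in root.iter(block_tag):
        # Collapse whitespace so trailing newlines do not hide the ending.
        text = " ".join("".join(block.itertext()).split())
        if text and not text.endswith(SENTENCE_END):
            flagged.append(text)
    return flagged


# Input adapted from Example 12.
sample = (
    "<para>"
    "<line>This text should not be</line>"
    "<line>broken this way; the translated "
    "text may well be in a different order.</line>"
    "</para>"
)
flagged = unfinished_blocks(sample, "line")
print(flagged)
```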

Avoid the Use of Typographical Elements

Use logical elements that encompass the text, instead of typographical elements.

<para><b>Do not use</b>
  '<br/>' type elements.
</para>

Example 13: Example of Typographical Element Usage.

Use a logical element such as "emph" instead of a typographical one such as "b" or "bold." Enclose any text that must stay together on one line in a line element rather than forcing a break.

<para>
  <emph>Do not use</emph> 'br' type elements.
</para>

Example 14: Suggested Correct Usage.


Avoid at all costs introducing any line breaks into the text stream. If you do so, it is unconditionally guaranteed that this will cause problems in some, if not all, of the target languages.
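
A document set can be audited for typographical markup with a few lines of code. In the sketch below, the set of tag names is an assumption chosen for illustration and should be adjusted to the vocabulary of your own DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of typographical element names.
TYPOGRAPHIC_TAGS = {"b", "i", "u", "br", "font"}


def typographic_elements(xml_text: str) -> list:
    """Return the typographical tag names found, in document order."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if el.tag in TYPOGRAPHIC_TAGS]


# Input adapted from Example 13.
sample = "<para><b>Do not use</b> '<br/>' type elements.</para>"
found = typographic_elements(sample)
print(found)
```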

Do Not Mix Translatable and Non-translatable Text in the Same Elements

Keep non-translatable PCDATA in different elements than translatable PCDATA.

<data-items>
  <data id="class">
    com.xmlintl.data.dataDefDefinition
  </data>
  <data id="text">
    Replace generic data
    definitions with specific instances.
  </data>
</data-items>

Example 15: Example of Mixed PCDATA.


Most XML translation tools will have problems with this type of construct. It is only when inspecting the 'id' attribute that a decision can be made as to whether the PCDATA should be extracted or not.

<data-items>
  <class id="com.xmlintl.data.dataDefinition">
    <text>
      Replace generic data
      definitions with specific instances.
    </text>
  </class>
</data-items>

Example 16: Suggested Solution.
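
With the structure of Example 16, an extraction rule can be stated purely in terms of element names. The following minimal sketch, with a hypothetical function name, collects the content of every "text" element and never needs to inspect an attribute value to decide what is translatable.

```python
import xml.etree.ElementTree as ET


def extract_translatable(xml_text: str, text_tag: str = "text") -> list:
    """Return the whitespace-normalized content of every text_tag element."""
    root = ET.fromstring(xml_text)
    return [
        " ".join("".join(el.itertext()).split())
        for el in root.iter(text_tag)
    ]


sample = """<data-items>
  <class id="com.xmlintl.data.dataDefinition">
    <text>
      Replace generic data
      definitions with specific instances.
    </text>
  </class>
</data-items>"""

segments = extract_translatable(sample)
print(segments)
```

The class name stays in the "id" attribute and is never offered to the translator.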


Avoid Holding Source and Target PCDATA in the Same Document

This can cause all manner of problems for processing and extraction tools.

<para>
<text xml:lang="en">
My hovercraft is full of eels.
</text>
<text xml:lang="fr">
Mon aéroglisseur est plein d'anguilles.
</text>
<text xml:lang="hu">
Légpárnás hajóm tele van angolnákkal.
</text>
<text xml:lang="ja">
私のホバークラフトは鰻で一杯です。
</text>
<text xml:lang="pl">
Mój poduszkowiec jest pełen węgorzy.
</text>
<text xml:lang="es">
Mi aerodeslizador está lleno de anguilas.
</text>
<text xml:lang="zh-HK">
我隻氣墊船裝滿晒鱔.
</text>
<text xml:lang="zh-TW">
我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]
</text>
</para>

Example 17: Example of Mixed Source and Target PCDATA


Unless your document requires mixed language content, use a separate document instance to store each target language version. If you store both source and target data in the same document, it will become unwieldy, overly large and cumbersome to process.
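
If you have inherited a document like Example 17, it can be split into one instance per language. The sketch below assumes the flat "para"/"text" layout of that example and uses a hypothetical function name; real documents will need a mapping suited to their own structure.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_by_language(xml_text: str) -> dict:
    """Map each xml:lang value to a standalone single-language document."""
    root = ET.fromstring(xml_text)
    documents = {}
    for text in root.findall("text"):
        lang = text.get(XML_LANG)
        if lang is None:
            continue  # skip content with no declared language
        para = ET.Element(root.tag, {XML_LANG: lang})
        para.text = " ".join((text.text or "").split())
        documents[lang] = ET.tostring(para, encoding="unicode")
    return documents


sample = """<para>
<text xml:lang="en">My hovercraft is full of eels.</text>
<text xml:lang="fr">Mon aéroglisseur est plein d'anguilles.</text>
</para>"""

documents = split_by_language(sample)
for lang, doc in documents.items():
    print(lang, doc)
```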

Clearly Define Text That Requires Translation

Keep any PCDATA that requires translation in different elements from PCDATA that does not require translation. Use special elements for text within PCDATA that is specifically not to be translated.

<para>
  The following part of this sentence should
  <notrans>not be translated</notrans>
  at all.
</para>

Example 18: Suggested Solution.
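
A dedicated element such as "notrans" lets an extraction tool protect that text mechanically. The sketch below swaps each protected span for a numbered placeholder before the sentence is sent for translation. The function name and the {n} placeholder syntax are assumptions made for illustration, not the format of any particular tool.

```python
import xml.etree.ElementTree as ET


def to_segment(xml_text: str, protected_tag: str = "notrans"):
    """Return (segment, protected), with protected text replaced by {n}."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    protected = []
    for child in root:
        if child.tag == protected_tag:
            protected.append("".join(child.itertext()))
            parts.append("{%d}" % len(protected))
        else:
            parts.append("".join(child.itertext()))
        # Text that follows the child element belongs to the sentence too.
        parts.append(child.tail or "")
    segment = " ".join("".join(parts).split())
    return segment, protected


# Input taken from Example 18.
sample = """<para>
  The following part of this sentence should
  <notrans>not be translated</notrans>
  at all.
</para>"""

segment, protected = to_segment(sample)
print(segment)
print(protected)
```

The translator sees the whole sentence and can move the placeholder to wherever the target language requires, while the protected text is restored unchanged afterwards.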


Suggested Further Reading

Yves Savourel of ENLASO Corporation, who has done so much good work in the field of localizing XML, maintains an excellent web page on the subject, the XML Internationalization and Localization FAQ. Another very good reference is the paper by Richard Ishida of the W3C, Localisation Considerations in DTD Design.

Finally – Please Invest Time and Effort in the Quality of the Source Text

If the source text is written in a clear and understandable manner, it will be easier both to read and to localize. It is worth investing in tools that check the grammar and terminology in your source text. Without such tools, your authors have no benchmark against which to test themselves, and it is all too easy for poorly written text to make its way into your documents.


Andrzej Zydron is a member of the LISA OSCAR Steering Committee. He is the technical architect and editor of the GILT Metrics proposed specification suite, as well as editor of the proposed TBX Link specification. Zydron also sits on the OASIS technical committees for Translation Web Services, XLIFF and XLIFF segmentation. As CTO for xml-Intl Ltd., he is currently developing the next generation of XML-based text memory systems to reduce authoring and translation costs for documentation. Zydron is fluent in English, Polish and French.



