© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org

Standards

Global Information Management Metrics (GMX)
Slaying the Word Count Dragon

Andrzej Zydroń, CTO, XML-INTL & Member, LISA OSCAR Steering Committee

The latest standard from LISA’s OSCAR, Global Information Management Metrics eXchange – Volume (GMX-V), has been approved and has entered the final public comment phase. This standard addresses an issue that has long plagued the translation and localization industry, namely the lack of a verifiable industry standard for word and character counts, by mandating XLIFF as the canonical form for those counts. Localization tool providers have been consulted and have contributed to this standard. We would appreciate your views and comments as well. Please visit http://www.lisa.org/standards/gmx/GMX-V.html.


In the Beginning...

One of the most enduring features of the Localization Industry has been the inconsistency of word and character counts, not only between rival products, but even between different versions of the same product. Trying to establish a measure for the size of a given localization task is not unlike trying to fight a many-headed dragon. The same can be said for word processing software – word and character counts differ between vendors and versions in a similar way.

The havoc that the lack of a uniform system of measurement can cause was exemplified in 1999, when the Mars Climate Orbiter spacecraft was lost because one engineering team used Imperial units, while another used metric units, for a key spacecraft operation. The lost orbiter had cost $125 million. Trying to cope without a common definition for estimating the size of a localization project can lead to similar problems.

This is reminiscent of the situation for general measurements before the French Revolution. A French foot ('pied du roi', 12.79 inches) differed from an English foot, which in turn differed from the Welsh foot (9 inches). The basis of the current Imperial linear measures was unified by Edward I in 1308, who ordained (in a highly scientific manner for the 14th century) that an inch was to be three grains of barley, dry and round, taken from the middle of the ear, and that twelve inches were to make a foot. It took the French Revolution to provide a (mostly) logical approach to establishing general units of measure based on a decimal scale (although somehow, the 10-day week did not catch on).

Global Information Management Metrics (GMX) is a proposed, XML-based LISA standard, aimed at providing a unified and verifiable (and, unlike the French Revolution, bloodless) way of establishing the size of a given localization task for electronic files, as well as allowing this data to be exchanged electronically. Why metrics? The American Heritage® Dictionary of the English Language (Fourth Edition) defines the noun 'metric' as 'a standard of measurement'.

Component Parts

There are three aspects to determining the metrics for a localization task:

1. Volume (GMX-V): This quantifies the character and word counts for the task.

2. Complexity (GMX-C): This quantifies the complexity of the task.

3. Quality (GMX-Q): This quantifies the quality requirements of the task.

GMX-V: Volume

Words and Characters

GMX-V mandates both word and character counts. Character counts convey the most precise definition of a localization task, whereas word counts are the most commonly used metric in the localization industry. GMX-V encompasses both measurements, thus giving localization suppliers and customers a choice as to which measurement most adequately reflects the localization task in question.

Other Metrics

GMX-V allows for the exchange of all metrics pertaining to a given localization task. The XML exchange notation of GMX-V allows for the definition and electronic exchange of any metric that is relevant to a given localization task such as page counts, file counts, screen shot counts, etc.

Canonical Form

One of the main problems with calculating word and character counts is the plethora of differing proprietary file formats that can contain a mix of form and content data. Trying to establish a standard that addresses all of these formats is impossible – the word count dragon has too many heads to attempt to cut them all off with one swipe. As soon as one head is cut off, a new one will appear somewhere else. A better approach is to force the dragon to enter a narrow passage where the heads are all forced together. Enter the knight in shining armor: XLIFF, backed by Unicode.

XLIFF, the XML Localization Interchange File Format, is the OASIS standard designed for exchanging translatable data in an XML format. The GMX-V proposal relies on using the XLIFF representation as the canonical form for establishing the basis of word and character counts. The proposal mandates that all characters be counted in their Unicode representation and that every run of multiple space characters be reduced to a single space character. In addition, word boundaries are defined with reference to Unicode Technical Report 29 (TR29-9) – Text Boundaries. This provides an unambiguous definition of what constitutes a word.
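As a rough illustration of the space-reduction and counting rules just described, the following Python sketch collapses runs of white space and then counts characters and words. It rests on several assumptions that are not taken from the GMX-V text: the function names are invented for this example, the single spaces that remain are included in the character count, and the word pattern is only a crude regular-expression stand-in for the TR29 word-boundary rules, not an implementation of them.

```python
import re

# Assumption: any run of Unicode white space collapses to one U+0020 space.
_WHITESPACE_RUN = re.compile(r"\s+")

# Crude stand-in for TR29 word boundaries (NOT a TR29 implementation):
# runs of word characters, optionally joined by an apostrophe,
# so that "don't" is treated as one word.
_WORD = re.compile(r"\w+(?:['\u2019]\w+)*")


def normalize_spaces(text: str) -> str:
    """Reduce every run of white space to a single space character."""
    return _WHITESPACE_RUN.sub(" ", text)


def count_characters(text: str) -> int:
    """Count Unicode code points after space reduction.

    Assumption: the remaining single spaces are part of the count.
    """
    return len(normalize_spaces(text))


def count_words(text: str) -> int:
    """Approximate word count after space reduction."""
    return len(_WORD.findall(normalize_spaces(text)))


sample = "Slaying  the \t word\ncount   dragon"
print(normalize_spaces(sample))
print(count_characters(sample), count_words(sample))
```

A production counter would replace the regular expression with a full TR29 word segmenter; the point here is only that both counts are taken after the text has been brought into one normalized form.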

By using XLIFF as the canonical form for counting the source language text, the GMX-V proposal establishes a common and well-defined format for word and character counts.

Within XLIFF, inline codes are interpreted as inline XML elements. The inline elements are not included in the word and character counts, but form a separate inline element count of their own. The frequency of inline elements can have an impact on the translation workload, so a separate count is useful when sizing a job.
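The separation of text from inline elements might be sketched as follows in Python, using only the standard library. The inline element names (g, x, bx, ex, ph, bpt, ept, it) come from XLIFF 1.2; the function name and the sample segment are invented here, and the decision to treat the content of ph, bpt, ept and it as native code while keeping the text wrapped by g is an assumption of this example rather than a quotation of GMX-V.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# XLIFF 1.2 inline element names.
INLINE_TAGS = {"g", "x", "bx", "ex", "ph", "bpt", "ept", "it"}
# Assumption: these elements carry native markup, not translatable text.
CODE_CONTENT_TAGS = {"ph", "bpt", "ept", "it"}


def split_text_and_inline(source: ET.Element) -> Tuple[str, int]:
    """Return (countable text, number of inline elements) for a <source>."""
    parts: List[str] = []
    inline_count = 0

    def walk(element: ET.Element, is_root: bool) -> None:
        nonlocal inline_count
        name = element.tag.rsplit("}", 1)[-1]  # strip the XML namespace
        if not is_root and name in INLINE_TAGS:
            inline_count += 1
        if not is_root and name in CODE_CONTENT_TAGS:
            return  # native code: excluded from the text to be counted
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child, is_root=False)
            if child.tail:  # text after a child belongs to the parent
                parts.append(child.tail)

    walk(source, is_root=True)
    return "".join(parts), inline_count


# Hypothetical segment, invented for this example.
SAMPLE = (
    '<source xmlns="urn:oasis:names:tc:xliff:document:1.2">'
    'Press the <g id="1">Start</g> button<x id="2"/> to begin'
    '<ph id="3">&lt;br/&gt;</ph>.</source>'
)

text, inline = split_text_and_inline(ET.fromstring(SAMPLE))
print(repr(text), inline)
```

The returned text is what the word and character counters would see, while the second value feeds the separate inline element count.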

Standalone punctuation characters are also featured as an additional category in both word and character counts. They are included in the main count, but can be deducted from both by mutual consent if they do not increase the translation workload.
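One possible reading of 'standalone punctuation' is a white-space-delimited token made up entirely of punctuation characters, such as a dash set off by spaces. The Python sketch below counts such tokens under that assumption; the interpretation and the function names are this example's own, not wording from the standard.

```python
import unicodedata
from typing import Tuple


def is_punctuation_only(token: str) -> bool:
    """True if every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def standalone_punctuation_counts(text: str) -> Tuple[int, int]:
    """Return (word count, character count) of standalone punctuation.

    Assumption: a token is 'standalone' when it is delimited by white
    space and contains nothing but punctuation.
    """
    tokens = [t for t in text.split() if is_punctuation_only(t)]
    return len(tokens), sum(len(t) for t in tokens)


words, characters = standalone_punctuation_counts("Volume - Complexity - Quality")
print(words, characters)
```

Reporting these two figures alongside the main counts lets the customer and supplier subtract them by mutual consent without recounting the file.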

GMX-V addresses all issues related to counting words and characters in the XLIFF canonical format. Since the sentence is the commonly accepted atomic unit for translation, it proposes sentence-level granularity for counting purposes within XLIFF.

GMX-V does not preclude producing metrics directly from non-XLIFF format files, as long as the counting is based on the XLIFF canonical form for each text unit being counted. The conversion can be done on the fly. In these instances, an audit file is necessary for verification purposes.

In summary, the main goal of GMX-V is to provide a detailed count for words and characters based on the characteristics of individual sentences. The aim is to provide sufficient detail to enable an accurate definition of the scale of the translation task. The customer and supplier can then decide which of the statistics to use when costing the translation task for a given file.

Words and Characters

As noted above, GMX-V uses Unicode Technical Report 29 (TR29-9) – Text Boundaries to define both words and characters. This provides a clear and unambiguous definition of word and ‘grapheme’ boundaries.

Logographic Scripts

To date, word counts have had little relevance for Chinese, Japanese or Korean (CJK) source text. For these languages, GMX-V recommends using only character counts. There is a proposal before ISO TC 37, submitted by Professor Sun Maosong, relating to the automatic identification of word boundaries for CJK languages. Should this proposal become a standard, GMX-V should refer to it for the provision of word counts for CJK.
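The recommendation could be applied in a tool along the following lines. This Python sketch chooses which metrics to report from the source language code; the function name, the metric labels and the idea of keying the decision on the language code are assumptions made for illustration only.

```python
from typing import Tuple

# Primary language subtags for Chinese, Japanese and Korean.
CHARACTER_ONLY_LANGUAGES = {"zh", "ja", "ko"}


def recommended_metrics(source_language: str) -> Tuple[str, ...]:
    """Return the metrics to report for a given source language code."""
    # Keep only the primary subtag, e.g. "ja" from "ja-JP" or "zh_CN".
    primary = source_language.replace("_", "-").split("-")[0].lower()
    if primary in CHARACTER_ONLY_LANGUAGES:
        return ("characters",)
    return ("characters", "words")


for code in ("en-US", "ja-JP", "zh-Hans-CN"):
    print(code, recommended_metrics(code))
```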

Quantitative and Qualitative Measurements

GMX-V measurements fall into two categories – how many and what type. The primary count will always be unqualified, i.e., how many characters and words are in the file. This is the minimal conformance level proposed for GMX-V.

A typical translatable document will contain a variety of text elements. Some of these elements will contain non-translatable text, some will have been matched from translation memory and some will have been fuzzy matched by the customer. It is therefore important to be able to categorize the word and character counts according to type in order to provide a figure in words and characters for a given localization task.

Count Categories

Apart from the Total Word Count and Total Character Count values, GMX-V allows the following count categories:

  • Exact Matched Count – an accumulation of the word and character count for text units that have been matched unambiguously with a prior translation and that require no translator input.
  • Leveraged Matched Count – an accumulation of the word and character count for text units that have been matched against a leveraged translation memory database.
  • Repetition Matched Count – an accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
  • Fuzzy Matched Count – an accumulation of the word and character count for text units that have been fuzzy matched against a leveraged translation memory database.
  • Alphanumeric-Only Text Unit Count – an accumulation of the word and character count for text units that have been identified as containing only alphanumeric words.
  • Numeric-Only Text Unit Count – an accumulation of the word and character count for text units that have been identified as containing only numeric words.
  • Punctuation-Only Text Unit Count – an accumulation of the word and character count for text units that have been identified as containing only punctuation.
  • Standalone Punctuation Count – an accumulation of the standalone punctuation word and character counts from the individual text units that make up a document.
  • Measurement-Only Count – an accumulation of the word and character count from measurement-only text units.
  • Other Non-Translatable Word Count – other non-translatable word and character counts.
  • Automatically Treatable Text Counts – a count of automatically treatable inline elements such as date, time, measurements or simple and complex numeric values.

GMX-V allows for additional count categories such as automatically treatable text (Auto Text) as well as repetition counts for identical text that is repeated within the document etc.
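To show how per-category figures relate to the totals, here is a minimal Python sketch that accumulates word and character counts by category over a list of text units. The TextUnit class, the category labels and the sample figures are all hypothetical; they are not the names used in the GMX-V exchange notation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class TextUnit:
    words: int
    characters: int
    # Hypothetical label such as "exact-matched"; None means no
    # special category applies to this text unit.
    category: Optional[str] = None


def accumulate(units: Iterable[TextUnit]) -> Dict[str, Tuple[int, int]]:
    """Sum (words, characters) for the whole file and for each category."""
    totals = defaultdict(lambda: [0, 0])
    for unit in units:
        keys = ["total"] if unit.category is None else ["total", unit.category]
        for key in keys:
            totals[key][0] += unit.words
            totals[key][1] += unit.characters
    return {key: (w, c) for key, (w, c) in totals.items()}


# Invented figures, for illustration only.
units = [
    TextUnit(12, 70),
    TextUnit(8, 41, "exact-matched"),
    TextUnit(5, 29, "fuzzy-matched"),
    TextUnit(2, 9, "numeric-only"),
    TextUnit(8, 41, "exact-matched"),
]

for name, (words, characters) in accumulate(units).items():
    print(f"{name}: {words} words, {characters} characters")
```

Every unit contributes to the unqualified total, so the category figures can be deducted or priced differently without changing the primary count.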

Verifiability

Any measurement standard must have a reference implementation as well as an authoritative body that tests and validates the measuring instruments. In the USA, this is provided by the National Institute of Standards and Technology. In order to be successful, GMX-V must provide for a certification authority that will (1) maintain reference documents with known metrics and (2) provide an online facility to test given XLIFF documents. In this way, both customers and suppliers can be confident that GMX-V provides an unambiguous and reliable way of quantifying a Global Information Management task.
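The idea of reference documents with known metrics can be pictured as a small conformance check. In the Python sketch below, a counting function is run against reference texts and any disagreement is reported; the Reference type, the verify function, the toy counter and the reference figures are all invented for this example and do not describe an actual certification service.

```python
from typing import Callable, List, NamedTuple, Tuple


class Reference(NamedTuple):
    text: str
    words: int       # known word count
    characters: int  # known character count


CountFn = Callable[[str], Tuple[int, int]]


def verify(counter: CountFn, references: List[Reference]) -> List[str]:
    """Return one message per reference text the counter gets wrong."""
    failures = []
    for ref in references:
        got = counter(ref.text)
        expected = (ref.words, ref.characters)
        if got != expected:
            failures.append(f"{ref.text!r}: expected {expected}, got {got}")
    return failures


def whitespace_counter(text: str) -> Tuple[int, int]:
    """Toy counter: white-space tokens, characters after space reduction."""
    tokens = text.split()
    return len(tokens), len(" ".join(tokens))


# Hypothetical reference set with known metrics.
REFERENCES = [
    Reference("Slaying the word count dragon", 5, 29),
    Reference("one  two", 2, 7),
]

print(verify(whitespace_counter, REFERENCES))
```

A tool that returns an empty list for the whole reference set agrees with the published figures; any other result pinpoints the text on which it diverges.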

Non-verifiable Metrics and Exchange Notation

There are many instances where it is not possible to verify the metrics data electronically, e.g., screen shots, number of pages to be proofread, etc. GMX-V allows for the annotation and exchange of all relevant metrics for a given localization task.

Summary

The GMX-V proposal is based on the following well-defined standards:

1. XLIFF
2. Unicode / ISO 10646
3. Unicode TR29-9

GMX-V proposes maintaining counts for words and characters, standalone punctuation, and inline codes and cross-references. It also recommends additional qualitative counts for the text element categories detailed above. All of this detail allows for a precise and unambiguous definition of the localization task for a given electronic file. It lets suppliers and customers measure the task at hand precisely and do business with one another more easily, thanks to the greater level of trust generated.

GMX-C: Complexity

GMX-C provides a notational mechanism for establishing the complexity level of a given localization task. Complexity is predicated on the notional existence of a 'simple' or non-complex task. Tasks are then rated in relation to this base task using a percentage indicator. This figure establishes how much more work is necessary to carry out the task compared to the notion of a simple task.
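As a worked example of the percentage indicator, the Python sketch below scales a word count by a complexity rating. It assumes that 0% denotes the simple base task and that the percentage is applied as a straightforward uplift; the article does not specify the arithmetic, so both the formula and the function name are illustrative only.

```python
def weighted_volume(word_count: int, complexity_percent: float) -> float:
    """Scale a word count by an agreed complexity percentage.

    Assumption: 0 means 'as much work as the simple base task' and
    25 means '25% more work than the base task'.
    """
    if complexity_percent < 0:
        raise ValueError("complexity_percent must not be negative")
    return word_count * (100 + complexity_percent) / 100


# 10,000 words rated 25% more complex than the base task.
print(weighted_volume(10_000, 25))
```

Under these assumptions, a 10,000-word file rated at 25% represents the same effort as 12,500 words of simple text.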

It is envisaged that the complexity factor will be mutually agreed by the supplier and customer. Trying to provide an algorithmic mechanism that will cover all eventualities for complexity is an impossible task, given all the potential factors that can affect a given localization task. Some of the readily quantifiable complexity issues are already covered within GMX-V with regard to inline elements and cross-reference inline elements.

GMX-Q: Quality

GMX-Q provides a notation for determining the required quality level for a given localization task. This can then be passed electronically with the source data to the supplier and used as part of any quality review to check that the completed localization task meets the required quality levels.

GMX-V Has Now Entered its Public Comment Phase

The GMX-V proposed standard is now available online as part of the public comment phase for the standard. It can be reviewed at the following URL: http://www.lisa.org/standards/gmx/GMX-V.html

Your Feedback Is Most Welcome

If you are interested in commenting on any of the ideas presented in this article, please forward your feedback to letters@lisa.org.



Andrzej Zydroń is a member of the LISA OSCAR steering committee. He is also a member of the British Computer Society and sits on the OASIS Technical Committees for Translation Web Services (TransWS), XML Localization Interchange File Format (XLIFF) and XLIFF segmentation. He is an invited expert on the W3C ITS Committee. Zydroń was responsible for the design and architecture of the European Patent Office data capture system for Xerox, as well as the Xerox Language Services XTM translation memory system. As CTO of XML-INTL, he is currently developing the next generation of XML-based text memory systems to reduce authoring and translation costs for documentation.



