© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org



Translation Memories - Smoke? Or Mirrors?

Mick McAllister, Operations Manager (Boulder), Sykes Enterprises

Ten years after translation memories first became available commercially, Mick McAllister reviews their benefits and limitations. His critical conclusion: TMs are a necessary and valuable tool, but all too frequently oversold.


The localization industry is beginning to reap the benefits of translation memory technology, after ten years of selling and client education. Unfortunately, we are also about to reap the whirlwind of client irritation prompted by the hyperbolic promises we sowed during those years.

“Translate once, use many,” we claimed, shushing the traditional translators in the background murmuring, “Yes, but….” “With fuzzy matching, we can even reuse old translations that merely resemble the new source. (Quiet back there!)”

It’s time to review the claims and reduce them to the realities—the 95% we can deliver rather than the 150% we’d like to. There are four areas where our claims need adjustment:

1. Data Repositories

The notion that we can arbitrarily create documents by assembling sentence and paragraph chunks from a vast warehouse of old documents is a bit overgeneralized.

2. Automatic Re-use

The idea that old translations can be culled automatically from existing memories is linguistically unsound.

3. Fuzzy Matching

The average computer’s idea of a fuzzy match is a joke.

4. Time and Cost Savings

The use of TMs does not guarantee cost or time savings, or even higher quality, any more than the use of FrameMaker guarantees well-designed documents.

This is not to say that data repositories don’t work; Fortune 100 companies are building them, using them, and localizing them successfully. Nor are automatic re-use and resultant cost/time savings out of the question. Even fuzzy matching may actually be useful, if used properly. The key to effective use will always be continuing education and risk management.

Data Repositories—The Post-It Approach to Information Development

A few years ago someone described the data repository to me as providing this service to the technical writer: “You can just go through and pick out whatever sentences you need. If you need to describe how to create a new file, there’s gotta be sentences about that in there somewhere!”

As an apostate technical writer, I wondered if the speaker had ever actually written a manual. Imagine writing a business letter by digging through a pile of sentences you used in previous business letters. “I know I have a greeting in here somewhere…”

Whether the data is indexed on content (“To open a file, go to the…”) or meta-content (“Sentences about creating files”), it is difficult to imagine a search tool that would work faster than just writing the stuff.

A simplistic example, perhaps. Data repositories should, one assumes, index data by document, so that you could call up a “word processor user guide”, for example, and then mine that corner of the warehouse for editorial assistance. However, that usage differs very little from the traditional method of revision, where the old document is opened, reviewed, and updated.

Localization gets into the data warehousing model because a parallel index of translation strings, one per English chunk, is at least theoretically possible. Often there is not a one-to-one match of source and target languages, however. Translation chunks often are only ‘theoretically’ parallel.

For example, if the source document contained the sentence, “Go to the Files menu,” followed by the sentence, “Select Open,” and the Vascokeresan translator combined these sentences to create the ambling sentences Vascokere computer users expect (“A philix pila gehtti, am krakit grap.”), then the data repository would contain a rather odd match for the first English sentence, and none for the second.

Solvable problems, perhaps. But the potential for generating strange translations is high. Unless the “100% matches” are edited manually, the translator might respond to the news that the second sentence was unmatched by translating it. The revised Vascokeresan manual would say, in effect, “Go to the Files Menu, and Select Open. Select Open.”
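
The merged-sentence mismatch can be sketched in a few lines of Python. This is an illustration only: the function name `naive_align` and the strict position-by-position pairing are assumptions made for the example, not the method of any particular alignment tool.

```python
from itertools import zip_longest

def naive_align(source_segments, target_segments):
    """Pair segments strictly by position (a deliberately simplistic, hypothetical aligner)."""
    return list(zip_longest(source_segments, target_segments))

source = ["Go to the Files menu.", "Select Open."]
# The translator merged both source sentences into one target sentence.
target = ["A philix pila gehtti, am krakit grap."]

pairs = naive_align(source, target)
# Source segments left without any target text after pairing.
unmatched = [src for src, tgt in pairs if tgt is None]

for src, tgt in pairs:
    print(f"{src!r} -> {tgt!r}")
print("Unmatched source segments:", unmatched)
```

The first pair stores a translation that says more than its source, and the second source sentence is reported as untranslated, which is exactly the invitation to translate “Select Open” a second time.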

Automatic Re-Use—Last Year’s Sentences Were Pretty Darn Good!

Imagine that your TM contains the sentence “Press the Return key to activate the macro.” You haven’t needed that sentence for a few years, but here it is at last, a chance to save some writing time by recycling it. Unfortunately, the usability lab decided two years ago that it should be called the “Enter” key, and you have forgotten. But your TM hasn’t. The next sentence you select to cobble the section together, culled from a service manual written a few weeks ago, says, “Caution: Pressing the Enter key twice can cause the application to lock up.”

In English, the example seems a bit far-fetched. But if the translation of 100% matches is generated automatically, with no costly human intervention, what is to prevent it from happening?
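
One inexpensive safeguard is to screen stored segments against a list of retired terms before they are reused. The sketch below assumes a hypothetical glossary (`DEPRECATED_TERMS`) and a hypothetical helper (`find_deprecated_terms`); a real terminology check would be richer than a whole-phrase search.

```python
import re

# Hypothetical glossary: retired term -> current replacement.
DEPRECATED_TERMS = {"Return key": "Enter key"}

def find_deprecated_terms(segment, deprecated):
    """Return (old, new) pairs for every retired term that appears in the segment."""
    hits = []
    for old, new in deprecated.items():
        # Whole-phrase, case-insensitive search for the retired term.
        if re.search(r"\b" + re.escape(old) + r"\b", segment, flags=re.IGNORECASE):
            hits.append((old, new))
    return hits

old_segment = "Press the Return key to activate the macro."
new_segment = "Caution: Pressing the Enter key twice can cause the application to lock up."

print(find_deprecated_terms(old_segment, DEPRECATED_TERMS))
print(find_deprecated_terms(new_segment, DEPRECATED_TERMS))
```

A check like this catches only the terms someone remembered to list. It does not replace a human reading the assembled text.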

Patching sentences from one document into another is not re-use but recycling. The purveyors of translation memory often neglect to consider this important distinction. When a sentence is pulled from a TM into a child of its original context (a revision or new edit of the document it was created for), it is being re-used. Chances are that its terminology will be accurate and it will be appropriate to the context.

But recycling old translations into new documents, such as pulling 100% matches from last year’s service manual to speed up translating this year’s user guide, is an economy offset by the dangers of inappropriateness and inconsistency. A recycled sentence must be reviewed, even if it is a 100% match. And a translation automatically assembled from “Post-It data” of multiple origins needs a full edit, just as it would if it were the result of multiple translators working from scratch.
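
The distinction between re-use and recycling can be written down as a simple rule. The sketch below is a hypothetical policy, not a feature of any TM product: `classify_match`, the document identifiers, and the idea of a "lineage" set (the earlier versions of the document being translated) are all assumptions made for the example.

```python
def classify_match(match_percent, origin_doc, lineage):
    """Classify a TM hit by where its translation came from (hypothetical policy)."""
    if match_percent < 100:
        return "fuzzy: translator must edit"
    if origin_doc in lineage:
        # The sentence returns to a child of its original context.
        return "re-use"
    # A 100% match from an unrelated document still needs human review.
    return "recycle: review required"

# Earlier versions of the user guide now being revised (hypothetical identifiers).
user_guide_lineage = {"user-guide-v1", "user-guide-v2"}

print(classify_match(100, "user-guide-v2", user_guide_lineage))
print(classify_match(100, "service-manual-v7", user_guide_lineage))
print(classify_match(90, "user-guide-v2", user_guide_lineage))
```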

Fuzzy Matches Light My Fire

English is inherently ambiguous. It’s a word-order language that won’t allow word order to clarify this sentence: “I saw a man in the park with my telescope.” It’s a language with lexical richness driven by our willingness to change nouns to verbs at will (‘game’, ‘input’) as long as we follow undefined rules (I can ‘re-purpose’ something, but I can’t ‘purpose’ it).

The notion that when two English source sentences resemble each other, their translations will resemble each other in the target language is, in a word, silly. Sentences don’t map that way, any more than words do. An absolutely 100% match, accurate down to the last italic, may map to its translation, but even then contextual information can interfere.

Consider the statement, “I can fish.” Coincidentally, the verb for catching fish and the noun ‘fish’ are similar in English (unlike, say, ‘deer’), and—two coincidences in three words—the verb for producing what the British might call “potted fish” looks like an auxiliary verb. But if these coincidences did not transfer precisely into another language, you might find yourself, naïve user of your second language, explaining that while on your annual vacations to Yellowstone, you put fish in cans for recreation.

I recently had a tools vendor remark to me that fuzzy matches below 90% “aren’t useful.” I would have to agree. And if you like word games, it can be great fun generating sentences with 90% matching words but little or no matching meaning.

The problem, quite simply, is that fuzzy matches run against non-linguistic elements. In other words, the same kind of logic that considers “I can can some fish” and “I can fish some here” an 80% match would call ‘catch’ and ‘couch’ an 80% match. The alikeness is linguistically irrelevant.
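
To see how such a score can arise, here is a minimal sketch of one non-linguistic metric: the share of words two sentences have in common, ignoring order and meaning. The function `bag_overlap` is an assumption made for illustration; commercial tools use their own, usually undisclosed, formulas, and a different metric would give a different number.

```python
from collections import Counter

def bag_overlap(a, b):
    """Share of words two sentences have in common, ignoring order and meaning."""
    words_a = Counter(a.lower().split())
    words_b = Counter(b.lower().split())
    # Multiset intersection: each shared word counts as often as it appears in both.
    shared = sum((words_a & words_b).values())
    longest = max(sum(words_a.values()), sum(words_b.values()))
    return shared / longest if longest else 0.0

print(bag_overlap("I can can some fish", "I can fish some here"))  # prints 0.8
```

Four of the five words are shared, so the pair scores 80% although the sentences mean quite different things.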

What is being matched in translation tools is the typographical tokens and, in sophisticated systems, the word order. A system which recognizes linguistic similarities beyond word stems (‘sing’ is the stem of ‘sung’) might be more useful. But if it could tell us that “I am canning some fish” and “I was canning some fish” are 80% similar, even then, in a language considerably different from English, the grammatical and lexical connections might not facilitate recycling the target sentence.

To use a simple example, assume that “Now click the green tab to see your configuration options” is changed to “Now click the green button to see your configuration options.” With one word in ten changed, the fuzzy match is 90%. But if ‘tab’ is masculine and ‘button’ is feminine, chances are the word ‘green’ will also need changing. And what if the word used to translate ‘green’ doesn’t happen to work well with ‘button’? (Imagine “If the light turns scarlet, stop your car immediately.” Stop lights are red, not scarlet.) Then the entire word selected to translate ‘green’ must be replaced. And, if the English writer referred to the tab anywhere as ‘it’ in English, then every use of the masculine pronoun appropriate to the translation of ‘tab’ must be changed to match the regendered ‘button.’ Write once, translate many, indeed.
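
The 90% figure is easy to reproduce on the source side. The sketch below uses Python's standard `difflib.SequenceMatcher` on word tokens. It stands in for whatever a given tool does internally, which is an assumption, and `token_match` is a name invented for the example.

```python
from difflib import SequenceMatcher

def token_match(old, new):
    """Score two sentences on word tokens and list the words that changed."""
    old_tokens, new_tokens = old.split(), new.split()
    matcher = SequenceMatcher(None, old_tokens, new_tokens)
    # Collect every non-matching stretch as an (old words, new words) pair.
    changes = [
        (" ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes

old = "Now click the green tab to see your configuration options"
new = "Now click the green button to see your configuration options"

score, changes = token_match(old, new)
print(score)    # 0.9
print(changes)  # [('tab', 'button')]
```

The score reports one changed word. It cannot report that the translation of ‘green’, or of every pronoun referring to the tab, may have to change as well.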

If every other sentence on the page had a 100% match, and the page were not edited for these contextual problems, the translation would sound as stupid to a French reader as “Stop your car and close his windows.”

The ‘fuzz’ that matters in translation is the similarity of the target sentences (which can’t be determined until after the sentence is translated), not the source. Because the value of a fuzzy match is driven by the target data, pricing a project by checking source fuzzy matches is, at best, a marketing gimmick. There is no predictable correlation between the percentage of non-linguistic matches and the amount of work required to localize a sentence, paragraph, or document. At worst, this sort of pricing will jeopardize costs, schedules, quality, mental health, client relationships, and perhaps even world peace.

Time and Cost Savings—No Joke

The time and cost economies that TMs “guarantee” are as dependent on coincidences and conjunctions as an astrological chart or a weather report. Leveraging a revision from a poor translation can be more trouble than it’s worth. Building a TM with alignment tools can take as long as starting over, and it can introduce disastrous problems like offsets or false matches. Extensive evolution of the glossary can make aging TMs irrelevant. Multisource TMs will contain inconsistent terminologies.

To appreciate the value of a translation memory, clients must keep in mind the speed/cost/quality balance. If quality is not the top priority, a TM can guarantee cost and time savings. If speed is not a top priority, then a TM can improve quality; and so forth. And under ideal circumstances, all three goals can be achieved.

What Is to Be Done?

In other words, this is not to say that TMs are all smoke and mirrors, nor that they are a useless marketing gimmick. Used properly, in the proper environment, they ensure higher quality, reduce costs, and accelerate translation. Here’s how:

1. Data Repositories

Take into account the complex, living, organic quality of language while building the system. Build it around the writers’ best practices, mirroring and facilitating them, and give the translators’ best practices (which are not the same) equal attention.

As with any database, the record template and the user interface, not the data, will determine the ultimate value. Finally, recognize that the repository is a translation candidate, requiring updates as part of maintenance, to remove archival inconsistencies in, for example, terminology.

2. Automatic Re-Use

Recognize the important distinction between re-using and recycling. A TM is at its best when being used to optimize the translation of a revised document. Re-using those translations and integrating the newly translated sentences, whether the process occurs during the document’s English development or later, during a product roll, may indeed allow the translator to accept, unedited, 100% matches.

The implications for data repositories? Each data record should also track the provenance of the data: where it came from, where it is used. If documents are stored with provenance, then TMs can be extracted that contain only the previous version.
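
A record that carries its provenance might look like the following sketch. The field names (`doc_id`, `version`), the helper `extract_previous_version`, and the sample data are hypothetical; the target strings are placeholders rather than real translations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TMRecord:
    source: str
    target: str
    doc_id: str   # document the pair was created for
    version: int  # revision of that document

def extract_previous_version(records, doc_id, current_version):
    """Keep only the pairs from the immediately preceding revision of one document."""
    return [
        r for r in records
        if r.doc_id == doc_id and r.version == current_version - 1
    ]

records = [
    TMRecord("Go to the Files menu.", "<target text>", "user-guide", 2),
    TMRecord("Press the Return key to activate the macro.", "<target text>", "user-guide", 1),
    TMRecord("Select Open.", "<target text>", "service-manual", 2),
]

# Build a document-specific TM for revising the user guide from version 2 to version 3.
tm_for_revision = extract_previous_version(records, "user-guide", 3)
print(tm_for_revision)
```

Filtering on provenance keeps the old “Return key” sentence and the service manual's segments out of the memory used for the revision.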

Re-use to accelerate revision, however, is not the same thing as recycling translations into other documents. Recycling must include human review to ensure consistency and even appropriateness.

3. Fuzzy Matches

For now, forget ’em. From the vendors’ perspective, they help estimate the complexity of a job. For instance, when the fuzzy match is ‘fuzzed’ by formatting changes, then the translator can work on such sentences a bit faster, so there will be some economies. But anything short of a 100% match will have an undetermined negative effect on translation, governed by the nature of the target language.
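
The formatting-only case is the one kind of fuzz a tool can identify reliably. A minimal sketch, assuming a hypothetical inline markup of the form `<b>...</b>` (real file formats encode formatting in many different ways):

```python
import re

# Hypothetical inline formatting tags such as <b>, </b>, <i>, </i>.
TAG = re.compile(r"</?[A-Za-z][^>]*>")

def strip_formatting(segment):
    """Remove inline tags and normalize whitespace."""
    return " ".join(TAG.sub("", segment).split())

def differs_only_in_formatting(old, new):
    """True when two segments have identical text but different markup."""
    return old != new and strip_formatting(old) == strip_formatting(new)

print(differs_only_in_formatting("Select <b>Open</b>.", "Select <i>Open</i>."))
print(differs_only_in_formatting("Select <b>Open</b>.", "Select <b>Close</b>."))
```

The first pair differs only in markup, so the translator can work quickly. The second pair differs in wording and needs a translator's full attention, whatever its percentage.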

Fuzzy logic is, in that peculiar sense the computer world uses the adjective, ‘sexy.’ But without knowledge-based translation systems capable of recognizing grammatical and lexical ‘fuzz,’ it is not as useful here as in other, differently challenged applications.

4. Cost and Time Savings

Clients expect them, and vendors can provide them. But promises of savings should be qualified by accurate risk assessments. A localization project that begins during the development phase will almost always benefit from the use of translation memories. The benefit will translate into better quality, faster response to revisions, and lower costs driven by reductions of work.

However, the savings will also be affected by the English developers’ disciplines and methods. A document that has as its last development phase a complete revision of ‘style’ might well require re-translation from scratch, simply because changes that are trivial and focussed in English have wider implications in the grammar and syntax of the target language.

Thus yes, certainly, the translation memory is a valuable tool, however much its merits have been exaggerated and oversold. The advent of the TMX standard means that clients will be able to buy non-proprietary TMs for their own re-use. And intelligent re-use of document-specific TMs will allow the information development cycle to avoid becoming bogged down, as it does now, at the localization phase.

The curmudgeons who refused to believe the Wright brothers flew, because they could prove, scientifically, that it was impossible, were wrong. And localization suppliers who refuse to take advantage of new tools will fall behind in price, in time to market, and in quality. However, anyone who sells TMs as if they were the Ginsu knives of our industry will eventually have project disasters or unhappy clients. Or both.


Mick McAllister
Operations Manager, Boulder
Sykes Enterprises, Inc.
5757 Central Ave., Suite G
Boulder, CO 80301
Tel +1 303 440 0909
Fax +1 303 440 6369
Mick.McAllister@corp.sykes.com



