© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org



Towards Second-Generation Translation Memories
Targeting the Second-Generation Translation Memories

Dr. Emmanuel Planas, Nippon Telegraph and Telephone

This study proposes that "first-generation" TM tools can be, and must be, significantly improved in order to meet real-life requirements. According to the author, the road ahead lies in the ability of such tools to "shallow translate" close sentences.


NOTE: Japanese characters in this document may not be displayed properly by Netscape Navigator 4.x


Introduction

Companies have been using translation memories (TM) for some years now. These revolutionary tools allow them to speed up production and maintain the consistency that today's quality assurance demands. When used for an update, they can typically save from 15% to 90% of the literal translation, but what about the layout? And what about the remaining 10% to 85%? When the memory is built on "close products" from the same company, an average of only 5% is reusable. Why? When hyperlinks or revision marks have to be translated, DTP staff have to insert them again. Why not do it automatically?

In this study, we propose to show that first-generation TM tools can be improved to solve such issues. Upgrading to second-generation TM suggests a complete remodeling of the product's core architecture. Yet, if accomplished, it should open new horizons towards third-generation tools able to "shallow translate" close sentences.

I would like to make clear at the beginning that this article does not, in any way, deny the merits of first-generation TM. I was the first to recommend the use of such tools when I created the former I&G Com, now Bowne France R&D center. And all editors and translators are aware of the quality and the competitiveness they can produce.

Two small tests

I will now present two small tests which will help us understand what can be improved upon.

Test one: using linguistic knowledge

This test was designed for a Xerox-internal, state-of-the-art study of TMs available on the market. A French sentence of fourteen words was chosen and imported into the memory, constituting with its translation the only translation unit (TU) in the memory. Twelve sentences were then derived from the first one, changing only the form of successive words, in the following manner:

  1. "Les données concernant la puissance:"
  2. "La données concernant la puissance:"
  3. "La donnée concernant la puissance:"

This produced ungrammatical sentences, but ones which were suitable for testing the ability of TM to find the similarity between sentences containing the same words in a different form. The test was conducted with Star Transit™ 2.7, Trados Workbench™ 2.1, and Xerox XMS Memory Manager™ 1.0. The curves on the following graph represent the number of similar characters, words, basic forms and parts of speech, as well as the percentage of similarity between each successive sentence compared to the original sentence for the three tools.

Correlation between Xerox (Xtr), Trados (TTW), Star (STR) tools, and the percentage of words, basic forms, part of speech and characters changed.

The results clearly show that while the Xerox tool correlates strongly with basic forms ("lem") and parts of speech (the top three, overlapping curves), the Trados (second curve from the bottom) and Star (third curve from the bottom) tools follow the number of characters that have been changed (fourth curve from the bottom). What are the implications? Consider the following sentence: "The white horse is nice". Linguistically based tools like Xerox give sentence number 1 as the closest to sentence number 0 (as a human would), while non-linguistic tools give sentence number 2, where only one letter has been changed.

#   Test sentence                T     X
0   The white horse is nice      100   100
1   The white horses are nice    78    98
2   The white house is nice      88    79
3   The white houses are nice    75    79

Percentage of similarity of sentences 1, 2, and 3 comparing Trados (T) and Xerox (X) tools.

This shows the crucial importance of using linguistic data for enabling more precise retrieval of the closest sentence in the database. It is also important for getting the maximum number of close sentences.
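
The difference between the two families of tools can be illustrated with a small sketch. It is not the algorithm of any of the tested products: the lemma table is a hypothetical stand-in for a real morphological analyzer, and the similarity measure is simply the one provided by Python's standard difflib module. The sketch only shows why a character-based measure prefers "house" while a measure on basic forms prefers "horses are".

```python
from difflib import SequenceMatcher

# Toy lemma table (hypothetical); a real tool would use a morphological analyzer.
LEMMAS = {"horses": "horse", "houses": "house", "is": "be", "are": "be"}


def char_similarity(a: str, b: str) -> float:
    """Similarity computed on raw characters, with no linguistic knowledge."""
    return SequenceMatcher(None, a, b).ratio()


def lemma_similarity(a: str, b: str) -> float:
    """Similarity computed on the sequences of basic forms (lemmas)."""
    def lemmas(sentence: str) -> list:
        return [LEMMAS.get(word, word) for word in sentence.lower().split()]
    return SequenceMatcher(None, lemmas(a), lemmas(b)).ratio()


reference = "The white horse is nice"
candidates = [
    "The white horses are nice",
    "The white house is nice",
    "The white houses are nice",
]

# Closest candidate according to each measure.
best_by_chars = max(candidates, key=lambda c: char_similarity(reference, c))
best_by_lemmas = max(candidates, key=lambda c: lemma_similarity(reference, c))

print("characters:", best_by_chars)
print("basic forms:", best_by_lemmas)
```

With these toy data, the character measure selects the "house" sentence and the basic-form measure selects the "horses are" sentence, the same ranking as in the table above. The absolute scores are not comparable with the percentages reported by the commercial tools.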

Test two: layout and non-literal object management

This test was conducted on slightly older versions of the tools: IBM TM 2.0.7™ (TM), Star Transit™ 2.6 (TRA), and Trados Workbench™ 1.14 (WB). The aim was to test their ability to transfer layout attributes attached to characters or words, and non-literal objects.

The following table lists the tested features. A "1" means that this object is transferred from the source sentence to its translation, "0" means that it is not transferred, and "n/a" means that it could not be tested because the sentence in which it should appear was not translated.

Features            WB     TM     TRA
Paragraph style     1      1      1
Bolded word         0      0      0
Italics on words    0      0      0
Italics and bold    0      0      0
Paragraph italics   1      n/a    n/a
Index mark          0      0      0
Footnote mark       0      n/a    n/a
Total               2/7    1/7    1/7

Transferring layout and non literal objects with Workbench, TM, and Transit.

Thus, apart from paragraph attributes, none of the first-generation TM tools was able to transfer these features. As a result, this transfer has to be done by human translators or DTP staff. This can add an extra 50-150% of the translation time to the job, a figure that should not be underestimated. Consequently, a tool that could do this job would considerably reduce the overall localization process time.

Guidelines for improving TM

Let us now look at some ways to solve the problems shown in the previous tests.

Introducing a light linguistic analysis

If we want to be able to match similar sentences, such as those differing only in the form of their words, we need linguistic analysis. We do not need a deep analysis that would take a long time to process, just "stemming" and "tagging", which together give a light yet crucial analysis. Parsers performing this operation are legion, and are now quick and reliable. In fact the Xerox tool, for example, uses such a parser. (This is one reason why it gave such good results in the first test.) Similarly, in order to be able to transfer the layout from the words of the source sentence to the target text, we need a bilingual dictionary, and the ability to use user-defined glossaries that Trados' Multiterm or Star's TermStar tools, for example, could manage.

Eliminating the heterogeneity of the internal data representation

First-generation TM tools tend to have a heterogeneous internal data representation: the literal data, the layout, and non-literal objects are represented in a linear flow of data, often based on SGML, like the following:


w1 w2 w3 w4 w5.

As a first step towards standardization of the internal representation, which allows the tool to abstract itself from commercial file formats, we recommend the use of the XML standard, which can potentially represent all features of multimedia documents.

Nevertheless, we maintain that this kind of representation does not allow correct manipulation of the data. Therefore, a second step is needed: separating the different dimensions of data into uniform linked "floors" of data:


w1   w2   w3   w4   w5.
     

This structure is called TELA in my Ph.D. work. In this way it is possible to work on the words' floor only, for example, and directly apply an analyzer that would give the basic forms of each word:


w1    w2    w3    w4    w5.
bf1   bf2   bf3   bf4   bf5
  

Similarly, because we have a link between basic forms and layout attributes, and because we have a bilingual dictionary, it becomes possible to transfer the layout onto the translated sentence. This is illustrated by the following schemata, in which w'1 represents the translation of word w1:


w'1    w'3    w'2    w'4    w'5.
bf'1   bf'3   bf'2   bf'4   bf'5
 
dict(w1,w'1),     dict(w3,w'3)

Note that in the target language the word order can be changed, as is the case here for w2 and w3. Because we have a suitable representation, this causes no problem for the layout transfer.

It would be dishonest to claim that the process is simple. Care has to be taken, for example, when one source word, like the English "potato", corresponds to several target words, like the French "pomme de terre". Yet this structure really does allow the following key benefits:

  • Format transfer
  • Non-literal objects management
  • Linguistic processing
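
The separation into floors and the layout transfer described above can be sketched as follows. This is a minimal illustration, not the TELA implementation: the Token class, the transfer_layout function and the small dictionary are all hypothetical names and data introduced for the example, and only one-to-one word correspondences are handled (the "potato" / "pomme de terre" case is deliberately left out).

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One position in the sentence; each attribute belongs to a separate floor."""
    word: str                          # words floor
    base: str                          # basic-forms floor
    layout: frozenset = frozenset()    # layout floor, e.g. {"bold"}


def transfer_layout(source, target, dictionary):
    """Copy layout from each source token to the target token that translates it.

    `dictionary` maps a source basic form to a target basic form.
    Assumption: one-to-one correspondences and no repeated basic forms.
    """
    layout_by_target_base = {
        dictionary[tok.base]: tok.layout
        for tok in source
        if tok.base in dictionary
    }
    return [
        Token(tok.word, tok.base, layout_by_target_base.get(tok.base, frozenset()))
        for tok in target
    ]


# "the white horse" with "white" in bold; the French word order differs.
source = [
    Token("the", "the"),
    Token("white", "white", frozenset({"bold"})),
    Token("horse", "horse"),
]
target = [Token("le", "le"), Token("cheval", "cheval"), Token("blanc", "blanc")]
dictionary = {"the": "le", "white": "blanc", "horse": "cheval"}

result = transfer_layout(source, target, dictionary)
bold_words = [tok.word for tok in result if "bold" in tok.layout]
print(bold_words)
```

Because the layout is looked up through the basic forms rather than through word positions, the change of word order between the two languages does not disturb the transfer.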

Shallow translation: Where TM becomes a real translation tool

The separation of the data into floors and the use of linguistic data allow TM to retrieve the best translation unit from the memory, and to manage the layout correctly. This should lead to second-generation tools. Yet by doing so, we have created an environment which is able to adapt close sentences to the input sentence: we are able to "shallow translate".

A first prototype for Japanese, English and French

To illustrate the above, I will now present the results of a first research prototype for Japanese, English and French. The prototype was first built for English and French; however, the overall design of the core shallow translation system is not language-dependent. Once the core system has been built with linguistic modules for the languages to be processed (analyzers, dictionaries), all we have to do to add a new language is add the appropriate linguistic modules for that language. Thus, I added the NTT Japanese parser and an English-Japanese dictionary and, without changing a line in the core system, it translated from English to Japanese as it did from English to French. Of course, the system should be able to support several languages at a low level: we manage this by building our system in the Java programming language. This gives a first solution for coping with the different languages managed by Unicode, as far as Unicode can go (some Japanese kanji cannot be represented correctly with Unicode).

The following gives a number of input English sentences (I), the corresponding closest TU retrieved from the database (a TU has a source part sTU and a target part tTU), and the French translation solution (S) proposed by the prototype. Some of these sentences are artificially composed, some come from real software help files. The parentheses indicate that only the basic forms are given in the solution, because the generation module has not yet been implemented in our prototype.


I:   Press on the red button.
sTU: Press on the green window
tTU: Appuyez sur la fenêtre verte
S:   (Appuyer) (sur) (le) (bouton) (rouge)

I:   Type IP address 129.23.563.89 to access the ftp site.
sTU: Type IP address 102.53.236.25 to access the http site.
tTU: Entrez l'adresse IP 102.53.236.25 pour accéder au site http.
S:   (Entrer) (le) (adresse) (IP) (129.23.563.89) (pour) (accéder) (à) (le) (site) (ftp).

I:   Select "Open It" and click "OK"
sTU: Select "Now" and click "Done"
tTU: Choisissez "Maintenant", et Cliquez sur "Fini".
S:   Choisissez "Ouvrir", et Cliquez sur "OK".

The last example implies that a preference translation glossary has been entered into the system, to apply the company conventions. According to these conventions, "Open It" should be translated as "Ouvrir", for example. Note that the concurrent use of a standard dictionary and such a glossary is only possible because of the TELA architecture.
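
The adaptation step behind the first example can be sketched in a few lines. This is a strongly simplified illustration, not the prototype's code: the function name shallow_translate and its parameters are hypothetical, sentences are given directly as lists of basic forms, and the input is assumed to differ from the retrieved source sentence only by word-for-word substitutions. The sketch also assumes that glossary entries take priority over the standard dictionary, which is one possible reading of "preference translation glossary".

```python
def shallow_translate(input_bases, stu_bases, ttu_bases, dictionary, glossary=None):
    """Adapt the target side of a retrieved TU to a close input sentence.

    All sentences are lists of basic forms. The two source sentences must have
    the same length and differ only by substituted words (a simplification).
    """
    if len(input_bases) != len(stu_bases):
        raise ValueError("this sketch only handles word-for-word substitutions")

    def translate(base):
        # Assumption: company glossary entries override the standard dictionary.
        if glossary and base in glossary:
            return glossary[base]
        return dictionary[base]

    # Translation of each replaced source word -> translation of its substitute.
    replacements = {
        translate(old): translate(new)
        for old, new in zip(stu_bases, input_bases)
        if old != new
    }
    return [replacements.get(base, base) for base in ttu_bases]


# First example above, reduced to basic forms.
input_bases = ["press", "on", "the", "red", "button"]
stu_bases = ["press", "on", "the", "green", "window"]
ttu_bases = ["appuyer", "sur", "le", "fenêtre", "vert"]
dictionary = {"green": "vert", "window": "fenêtre", "red": "rouge", "button": "bouton"}

solution = shallow_translate(input_bases, stu_bases, ttu_bases, dictionary)
print(" ".join(f"({base})" for base in solution))
```

As in the prototype's output, the result is a sequence of basic forms: agreement and inflection would be the job of a generation module, which this sketch does not attempt.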

Below are a number of English sentences and their Japanese translation. The sentences have been borrowed from the Nikkei Journal, a financial journal that offers both Japanese and English versions of short business news pieces:


I:   Sept. Nikkei 100 options did not change hands.
sTU: Aug. Nikkei 300 options did not change hands.
tTU: 日経300の8月物は商いが成立しなかった。
S:   日経100の9月物は商いが成立しなかった。

I:   KDD and Japan Telecom fell back
sTU: DDI and Japan Telecom fell back
tTU: 日経DDI、テレコムが反落し、NTTデータも小幅安。
S:   日経KDD、テレコムが反落し、NTTデータも小幅安。

In the first example, "8月" (Aug.) and "300" (300) are changed to "9月" (Sept.) and "100" (100). In the second example KDD replaces DDI.

Conclusion and perspectives

This paper claims that first-generation TM needs a completely new technical design of its core system. We have shown that the implementation of these second-generation TM tools would not only allow a drastic improvement in first-generation TM features, but also offer new possibilities, like the systematic transfer of layout attributes and non-literal objects. This should reduce the translator's DTP process time, a task that has progressively come to be considered part of the translator's job, even though he or she should really be dealing with languages, not hyperlinks and revision marks! This reduction should also lead to comprehensive gains for editing and translation companies.

Real fully automatic machine translation systems have not really penetrated the translation business. Well-known success stories "can be counted on the fingers of one hand", as the saying goes.

Shallow translation could be the introduction of machine translation the translation sector has been waiting for: it is reliable because it is based on human translators' output, and because it is based on memory exactly suited to the company's domain. Despite this, it does not require great (or homogeneous) effort to build the system.

Will TM editors and software editors accept the challenge of investing in the revolutionary changes that this TM design demands? Will the TM editors, whose philosophy is never to modify the translation units, be convinced by the robustness of shallow translation? There is no doubt that their opinion will only change with a comprehensive prototype that shows them the validity of this approach. These critical issues should be discussed at the next LISA Shanghai conference, and we invite all those interested to come and argue their positions.

Acknowledgment:

The author would like to thank Xerox for supporting part of this research (test 1).


Dr. Emmanuel Planas
Multilingual Machine Translation Group
Intelligent Media Project
NTT Cyber Solutions Laboratory
2-4 Hikaridai, Seika-cho, Soraku-gun,
Kyoto 619-0237, Japan
Tel: +81 (0)774-93-5327
Fax: +81 (0)774-93-5345
E-mail: planas@cslab.kecl.ntt.co.jp



