Towards Second-Generation Translation Memories
Targeting the Second-Generation Translation Memories
This study proposes that "first-generation" TM tools can be, and must be, significantly improved in order to meet real-life requirements. According to the author, the road ahead lies in the ability of such tools to "shallow translate" close sentences.

NOTE: Japanese characters in this document may not be displayed properly by Netscape Navigator 4.x

Introduction

Companies have been using translation memories (TM) for some years now. These revolutionary tools allow them to speed up production and maintain the homogeneity that today's quality assurance demands. When used for an update, they can typically save from 15% to 90% of the literal translation, but what about the layout? And what about the remaining 10% to 85%? When using memory built on "close products" from the same company, an average of only 5% is reusable. Why? When hyperlinks or revision marks have to be translated, DTP has to insert them again. What about doing it automatically?

In this study, we propose to show that first-generation TM tools can be improved to solve such issues. Upgrading to second-generation TM implies a complete remodeling of the product's core architecture. Yet, if accomplished, it should open new horizons towards third-generation tools able to "shallow translate" close sentences.

I would like to make clear at the beginning that this article does not, in any way, deny the merits of first-generation TM. I was the first to recommend the use of such tools when I created the former I&G Com, now Bowne France R&D center. And all editors and translators are aware of the quality and the competitiveness they can produce.

Two small tests

I will now present two small tests which will help us understand what can be improved upon.

Test one: using linguistic knowledge

This test was designed for a Xerox-internal, state-of-the-art study of TMs available on the market.
A French sentence of fourteen words was chosen and imported into the memory, constituting with its translation the only translation unit (TU) in the memory. Twelve sentences were then derived from the first one, changing only the form of successive words, in the following manner:
This produced ungrammatical sentences, but ones that were suitable for testing the ability of TM tools to find the similarity between sentences containing the same words in a different form. The test was conducted with Star Transit™ 2.7, Trados Workbench™ 2.1, and Xerox XMS Memory Manager™ 1.0. The curves on the following graph represent the number of similar characters, words, basic forms and parts of speech, as well as the percentage of similarity between each successive sentence and the original sentence for the three tools.
Correlation between Xerox (Xtr), Trados (TTW), Star (STR) tools, and the percentage of words, basic forms, part of speech and characters changed.

The results clearly show that while the Xerox tool has a strong correlation with basic forms ("lem") and parts of speech (first top three overlapping curves), the Trados (second curve from the bottom) and Star (third curve from the bottom) tools follow the number of characters (fourth curve from the bottom) that have been changed. What are the implications? Consider the following sentence: "The white horse is nice". Linguistically based tools like Xerox give sentence number 1 as the closest to sentence number 0 (as humans would), while non-linguistically based tools give sentence number 2, where only one letter has been changed.
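The contrast between character-based and basic-form-based matching can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: the two variant sentences, the toy lemma table `LEMMAS`, and the function names are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever similarity measure the commercial tools actually use.

```python
from difflib import SequenceMatcher

# Toy lemma table (hypothetical); a real tool would call a morphological analyzer.
LEMMAS = {"horses": "horse", "are": "be", "is": "be"}


def lemmas(sentence):
    """Map each word of the sentence to its basic form."""
    return [LEMMAS.get(w, w) for w in sentence.lower().split()]


def char_similarity(a, b):
    """Similarity computed on raw characters."""
    return SequenceMatcher(None, a, b).ratio()


def lemma_similarity(a, b):
    """Similarity computed on sequences of basic forms."""
    return SequenceMatcher(None, lemmas(a), lemmas(b)).ratio()


s0 = "The white horse is nice"
s1 = "The white horses are nice"  # assumed variant: only word forms change
s2 = "The white house is nice"    # assumed variant: a single letter changes

for name, s in (("s1", s1), ("s2", s2)):
    print(name, round(char_similarity(s0, s), 3), round(lemma_similarity(s0, s), 3))
```

With these assumed sentences, the character measure ranks `s2` closer to `s0`, while the basic-form measure ranks `s1` closer, which is the behavior described for the two families of tools.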
Percentage of similarity of sentences 1, 2, and 3 comparing Trados (T) and Xerox (X) tools.

This shows the crucial importance of using linguistic data for enabling more precise retrieval of the closest sentence in the database. It is also important for getting the maximum number of close sentences.

Test two: layout and non-literal object management

This test was conducted on slightly older versions of the tools: IBM TM 2.0.7™ (TM), Star Transit™ 2.6 (TRA), and Trados Workbench™ 1.14 (WB). The aim was to test their ability to transfer layout attributes attached to characters or words, and non-literal objects. The following table lists the tested features. A "1" means that this object is transferred from the source sentence to its translation, "0" means that it is not transferred, and "n/a" means that it could not be tested because the sentence in which it should appear was not translated.
Transferring layout and non-literal objects with Workbench, TM, and Transit.

Thus, none of the first-generation TM tools were able to transfer these features apart from paragraph attributes. As a result, this transfer has to be done by human translators or DTP people. This can add an extra 50-150% of the translation time to the job: a figure that should not be underrated. Consequently, a tool that could do this job would reduce overall localization process time by a considerable margin.

Guidelines for improving TM

Let us now look at some ways to solve the problems shown in the previous tests.

Introducing a light linguistic analysis

If we want to be able to match similar sentences, such as those differing only in the form of their words, we need linguistic analysis. We do not need a deep analysis that would take a long time to process, just "stemming" and "tagging", which give a light yet crucial analysis. Parsers performing this operation are legion, and are now quick and reliable. In fact the Xerox tool, for example, uses such a parser. (This is one reason why it gave such good results in the first test.) Similarly, in order to be able to transfer the layout from the words of the source sentence to the target text, we need a bilingual dictionary, and the ability to use user-defined glossaries that Trados' Multiterm or Star's TermStar tools, for example, could manage.

Eliminating the heterogeneity of the internal data representation

First-generation TM tools tend to have a heterogeneous internal data representation: the literal data, the layout, and non-literal objects are represented in a linear flow of data, often based on SGML, like the following:
For the first step towards standardization of internal representation, which allows the tool to abstract itself from commercial file formats, we recommend the use of the XML standard that potentially represents all features of multimedia documents. Nevertheless, we maintain that this kind of representation does not allow correct manipulation of the data. Therefore, a second step is needed: separating the different dimensions of data into uniform linked "floors" of data:
This structure has been called TELA in the Ph.D. work. In this way it is possible to work on the words' floor only, for example, and directly apply an analyzer that would give the basic forms of each word:
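As a minimal sketch of the floors idea, and not the TELA implementation itself, a sentence can be stored as parallel lists that share word indices, so that an analyzer touches one floor without disturbing the others. The names `Segment`, `analyze` and `TOY_LEMMAS` are hypothetical, and the tiny lemma table stands in for a real analyzer.

```python
from dataclasses import dataclass, field

# Hypothetical lemma table standing in for a real morphological analyzer.
TOY_LEMMAS = {"files": "file", "opened": "open", "are": "be"}


@dataclass
class Segment:
    """One sentence stored as separate floors linked by word index."""
    words: list                                  # surface forms
    lemmas: list = field(default_factory=list)   # basic forms, same indices as words
    layout: dict = field(default_factory=dict)   # word index -> set of attributes


def analyze(segment):
    """Fill the lemma floor from the word floor only; layout is left untouched."""
    segment.lemmas = [TOY_LEMMAS.get(w.lower(), w.lower()) for w in segment.words]
    return segment


seg = Segment(words=["Files", "are", "opened"], layout={0: {"bold"}})
analyze(seg)
print(seg.lemmas)
print(seg.layout)
```

Because every floor is addressed by the same word index, the bold attribute on word 0 remains attached to that word whatever is computed on the lemma floor.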
Similarly, because we have a link between basic forms and layout attributes, and because we have a bilingual dictionary, it becomes possible to transfer the layout onto the translated sentence. This is illustrated by the following schemata, in which w'1 represents the translation of word w1:
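The transfer can also be approximated in code. The sketch below is an assumption-laden illustration rather than the author's method: the dictionary entries, the sample sentences and the function `transfer_layout` are hypothetical. It copies the attributes of each source word onto the target word or words that the dictionary gives as its translation, wherever they appear in the target sentence.

```python
# Hypothetical bilingual dictionary: source lemma -> list of target words.
DICTIONARY = {
    "the": ["la"],
    "red": ["rouge"],
    "car": ["voiture"],
    "potato": ["pomme", "de", "terre"],
}


def transfer_layout(src_lemmas, src_layout, tgt_words, dictionary):
    """Copy layout attributes from source words to their translations.

    src_layout maps a source word index to a set of attributes; the result
    maps target word indices to attributes.
    """
    tgt_layout = {}
    lowered = [w.lower() for w in tgt_words]
    for i, lemma in enumerate(src_lemmas):
        attrs = src_layout.get(i)
        translation = dictionary.get(lemma)
        if not attrs or not translation:
            continue
        n = len(translation)
        # Look for the translation as a contiguous run in the target sentence.
        for j in range(len(lowered) - n + 1):
            if lowered[j:j + n] == translation:
                for k in range(j, j + n):
                    tgt_layout.setdefault(k, set()).update(attrs)
                break
    return tgt_layout


# Source w1 w2 w3 = the red car, with "red" in bold; target order is w'1 w'3 w'2.
reordered = transfer_layout(["the", "red", "car"], {1: {"bold"}},
                            ["la", "voiture", "rouge"], DICTIONARY)

# One source word translated by three target words.
multiword = transfer_layout(["a", "potato"], {1: {"italic"}},
                            ["une", "pomme", "de", "terre"], DICTIONARY)
print(reordered, multiword)
```

Looking translations up by content rather than by position is what makes the reordering harmless in this sketch; a production tool would also need to handle ambiguous or repeated words, which are ignored here.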
Note that in the target language the word order can be changed, as is the case here for w2 and w3. Because we have a suitable representation, this causes no problem for the layout transfer. It would be dishonest to claim that the process is simple. Care has to be taken, for example, when one source word, like the English "potato", corresponds to several target words, like the French "pomme de terre". Yet this structure really does allow the following key benefits:
Shallow translation: Where TM becomes a real translation tool

The separation of the data into floors and the use of linguistic data allows TM to retrieve the best translation unit from the memory, and to manage the layout correctly. This should lead to second-generation tools. Yet by doing so, we have created an environment which is able to adapt close sentences to the input sentence: we are able to "shallow translate".

A first prototype for Japanese, English and French

To illustrate the above, I will now present the results of a first research prototype for Japanese, English and French. The prototype was first built for English and French; however, the overall design of the core shallow translation system is not language-dependent. Once the core system has been built with linguistic modules for the languages to be processed (analyzers, dictionaries), the only thing we have to do in order to add a new language is to add the appropriate linguistic modules for that language. Thus, I added the NTT Japanese parser and an English-Japanese dictionary, and without changing a line in the core system, it translated just as it did from English to French. Of course, the system should be able to support several languages at a low level: we manage this by building our system using the Java programming language. This gives a first solution for coping with the different languages managed by Unicode, as far as Unicode can go (some Japanese kanji cannot be represented correctly with Unicode).

The following gives a number of input English sentences (I), the corresponding closest TU retrieved from the database (a TU has a source part sTU and a target part tTU), and the French translation solution (S) proposed by the prototype. Some of these sentences are artificially composed, some come from real software help files. The parentheses indicate that only the basic forms are given in the solution, because the generation module has not yet been implemented in our prototype.
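Independently of the prototype's own examples, the substitution idea behind shallow translation can be sketched as follows. Everything here is a simplifying assumption: the dictionary, the sample TU and the function `shallow_translate` are hypothetical, only one-for-one word substitutions are handled, and no linguistic analysis or generation step is performed.

```python
# Hypothetical bilingual dictionary for the words that may differ.
DICTIONARY = {"open": "Ouvrir", "save": "Enregistrer", "print": "Imprimer"}


def shallow_translate(input_words, s_tu, t_tu, dictionary):
    """Adapt the target side of a TU to an input that differs by substitutions.

    Returns None when the input is not a word-for-word variant of s_tu, or when
    a needed dictionary entry is missing, so the caller can defer to a human.
    """
    if len(input_words) != len(s_tu):
        return None
    result = list(t_tu)
    for new, old in zip(input_words, s_tu):
        if new.lower() == old.lower():
            continue
        old_t = dictionary.get(old.lower())
        new_t = dictionary.get(new.lower())
        if old_t is None or new_t is None or old_t not in result:
            return None
        # Replace the translation of the old word by that of the new word.
        result[result.index(old_t)] = new_t
    return result


s_tu = ["Click", "the", "Open", "button"]               # source part of the TU
t_tu = ["Cliquez", "sur", "le", "bouton", "Ouvrir"]     # target part of the TU
solution = shallow_translate(["Click", "the", "Save", "button"],
                             s_tu, t_tu, DICTIONARY)
print(solution)
```

Returning `None` instead of a guess reflects the reliability argument: the output stays anchored to a human translation, and anything the sketch cannot adapt safely is left to the translator.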
The last example implies that a preference translation glossary has been entered into the system, to apply the company conventions. According to these conventions, "Open It" should be translated as "Ouvrir", for example. Note that the concurrent use of a standard dictionary and such a glossary is only possible because of the TELA architecture. Below are a number of English sentences and their Japanese translation. The sentences have been borrowed from the Nikkei Journal, a financial journal that offers both Japanese and English versions of short business news pieces:
In the first example, "8月" (Aug.) and "300" (300) are changed to "9月" (Sept.) and "100" (100). In the second example KDD replaces DDI.

Conclusion and perspectives

This paper claims that first-generation TM needs a completely new technical design of its core system. We have shown that the implementation of these second-generation TM tools would not only allow a drastic improvement in first-generation TM features, but also offer new possibilities, like systematic layout attributes and non-literal object transfer. This should result in a reduction in the translator's DTP process time, something that has progressively come to be considered part of the translator's job, even though he or she should really be dealing with languages, not hyperlinks and revision marks. This reduction should also lead to comprehensive gains for editing and translation companies.

Fully automatic machine translation systems have not really penetrated the translation business. Well-known success stories "can be counted on the fingers of one hand", as the saying goes. Shallow translation could be the introduction of machine translation the translation sector has been waiting for: it is reliable because it is based on human translators' output, and because it is based on memory exactly suited to the company's domain. Despite this, it does not require great (or homogeneous) effort to build the system.

Will TM editors and software editors accept the challenge of investing in the revolutionary changes that this TM design demands? Will the TM editors, whose philosophy is never to modify the translation units, be convinced by the robustness of shallow translation? There is no doubt that their opinion will only change with a comprehensive prototype that will show them the validity of this approach. These critical issues should be discussed in the next LISA Shanghai conference, and we invite all those interested to come and argue their positions.
Acknowledgment: The author would like to thank Xerox for supporting part of this research (test 1).

Dr. Emmanuel Planas