© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org

Present and Future Needs in the CAT World

Matthias Heyn, Trados GmbH

1. Introduction

Whereas in the past the automation of the professional translation process was mostly associated with machine translation (MT), this has changed significantly in the last few years. Today, the keywords for professional translators are computer-aided translation tools (CAT Tools) and, notably, one key component: the translation memory.[1] Modern CAT Tools, in most cases an integration of several functionalities into one “workbench”, are gaining more and more ground as a standard tool in the hands of professional translators. Except for literary translations or generally idiosyncratic text types, the use of CAT Tools has been extended to almost every type of translation work. This includes political, administrative, technical, advertising, biographical, and other text types.


Whereas the general idea of a translation memory is fairly simple, the practical realization of a functioning product is a rather complex task. This is mainly due to the subtasks that such a system has to perform. The problem area “translation memory” covers many aspects of information science and linguistics, such as database design, retrieval technology, mapping of complex data (text) structures, client-server architecture, networking, support of language-dependent phenomena (character sets, tokenization, morphology, syntax), software ergonomics, etc. We are faced with a very interesting type of application, which presents a rather simple interface to users but has very complex internal workings underneath.

Up to now, the different user needs for this technology have not received enough attention. We can distinguish between a kernel set of functionalities of a translation memory and the functional extensions due to specific user needs. In combination with an overall broadening of the application area of translation memories, the functional extensions of such a system are becoming more and more diverse. Therefore, we will first identify different user profiles and user needs (section 2) and then, against this background, discuss the technical aspects of the component parts of CAT Tools in section 3. A few observations concerning side-effects of using translation memories are given in section 4, and a final summary of developments to be expected in the future in section 5.

Furthermore, we can observe in this area a rather confusing use of different notions, which often leads to misunderstandings. Therefore, we will try to introduce a few helpful notions and define unclear terms more precisely.

2. Tools at the translator’s desktop

2.1 What is a translation memory?

The general idea of a translation memory is very simple: all translations made by a translator are stored in a database and are then immediately retrievable when the same text has to be translated again. This process can be subdivided into several phases:

  1. Selection phase: A translator decides to translate a certain part of a text. This is generally something that a translator treats as a natural unit of a text. It can be either a sentence or a cell in a table, a footnote, etc. This text in the source document will be called a translation unit or, more precisely, a source translation unit, since a translation unit is divided into a source part and the translated counterpart: the target translation unit.
  2. Retrieval phase: The translation memory tries to retrieve the source translation unit. If it succeeds, the retrieved target translation unit will be made available for phase 3.
  3. Translation phase: The “classical” translation takes place with the following exceptions: if a 100% identical translation unit has been found (100% match), no translation work is required. If a similar translation unit has been identified (fuzzy-match), the translation task consists of adapting the retrieved target translation unit with the required changes.
  4. Update phase: The translation unit is stored in the translation memory.
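
The retrieval, translation and update phases can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the class `TranslationMemory`, the function `translate_unit`, the callback `human_translate` and the threshold of 0.75 are hypothetical names and values, and the similarity score comes from Python's standard `difflib` module rather than from any commercial retrieval technique.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory translation memory: source unit -> target unit."""

    def __init__(self):
        self.units = {}

    def retrieve(self, source, threshold=0.75):
        """Retrieval phase: return (stored_source, target, score) or None."""
        best = None
        for stored_source, target in self.units.items():
            score = SequenceMatcher(None, source, stored_source).ratio()
            if best is None or score > best[2]:
                best = (stored_source, target, score)
        if best is not None and best[2] >= threshold:
            return best
        return None

    def update(self, source, target):
        """Update phase: store the finished translation unit."""
        self.units[source] = target


def translate_unit(tm, source, human_translate):
    """Run phases 2 to 4 for one selected source translation unit."""
    match = tm.retrieve(source)
    if match is not None and match[2] == 1.0:
        target = match[1]                            # 100% match: no work required
    elif match is not None:
        target = human_translate(source, match[1])   # fuzzy match: adapt suggestion
    else:
        target = human_translate(source, None)       # no match: translate from scratch
    tm.update(source, target)
    return target
```

The selection phase is left to the caller, who decides which stretch of text is passed in as `source`. All output still originates from the human translator, represented here by the `human_translate` callback.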

2.2 What are the benefits of using CAT Tools?

There are several aspects to applying a CAT Tool to translation projects. Usually, tool vendors have three arguments at hand: 1. Large quantities of text can be translated faster (Quantity-Argument); 2. The quality of translation is increased (Quality-Argument); 3. Subsequent similar translation projects can benefit from earlier work (Re-usability-Argument). These statements are certainly true, but they are rather general. Therefore, we will describe in more detail a few factors that play a major role in the application of this technology. These factors can later be used to distinguish between the needs of different user profiles.

2.2.1 Repetition factor

From the explanation of the functioning of a translation memory system, it is clear that translation memories find their main application in the translation of repetitive text material. It is important to distinguish between internal repetitions in a document itself and external repetitions within typical update translations where the repetitions are inherent to a family of documents. We will address this phenomenon later as the repetition factor.

2.2.2 Consistency factor

Another effect of the use of translation memories is that, by nature, they enforce higher consistency in translations. Especially in technical documentation, it is very important to have a certain consistency and uniformity when addressing topics. It helps the reader to understand complicated information. In short, consistent wording is a precondition in technical writing, and translation memories support this automatically. The integration of a translation memory and a terminology database system (term bank system) can significantly improve this effect. This will be discussed in more depth when we turn to the term bank part of a CAT Tool. We will call this phenomenon the consistency factor.

2.2.3 Reference factor

Very closely related to the consistency factor is the feature that every translation unit can be classified according to several information types such as creation user, creation time, update user, update time, subject codes, notes, etc. Therefore, the retrieval of a translation unit can be accompanied by a “quality” mark. This leads to quality improvement in the translation by re-using revised and approved wordings. It is like using the translation of an authorized reference, and as such leads to a standardization of translations. This factor will be called the reference factor.

2.2.4 Concordance factor

As can be seen from phase 4, a translation memory is successively filled with translation units, embedding source and target language information in the form of sentences and subsentence units. From a linguistic point of view, a translation memory can therefore be described as a bilingual parallel corpus. In the case of systems allowing more than one source or target language, this leads to the construction of multilingual parallel corpora. It is very useful to access this corpus in order to retrieve translation units by searching for one or several keywords. This function is commonly referred to as concordancing or, more precisely, bilingual concordancing. Translation memories can be seen as a rich source of implicit terminology, compared to the storage of explicit terminology in a term bank system. In this sense, translation memories compete with term banks and, as we will see later, do so very successfully. We will address this topic as the concordance factor.
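
A bilingual concordance search can be pictured as a keyword filter over the stored translation units. The following Python sketch is an illustration only: the function name `concordance` and the representation of the memory as a list of (source, target) pairs are assumptions, and plain substring matching stands in for the error-tolerant retrieval a real system would use.

```python
def concordance(units, keyword):
    """Return all (source, target) pairs whose source contains the keyword."""
    needle = keyword.casefold()
    return [(src, tgt) for src, tgt in units if needle in src.casefold()]


# Hypothetical sample memory (English to German).
units = [
    ("Open the file menu.", "Öffnen Sie das Menü Datei."),
    ("Save the file.", "Speichern Sie die Datei."),
    ("Close the window.", "Schließen Sie das Fenster."),
]

for source, target in concordance(units, "file"):
    print(source, "|", target)
```

Every hit shows the keyword in its source context together with the way it was translated, which is what makes the memory usable as a source of implicit terminology.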

2.2.5 Terminology factor

The second important component part of a CAT Tool is a term bank.[2] Terminology management is a complex task that shares common properties with classical lexicography, but that has its own parameters. Proper terminology management is a costly task, but cannot be abandoned if a certain quality in translation has to be reached and maintained.

A key role in CAT Tools is played by the automatic search for terminology in the source translation unit. This is called terminology recognition. This notion should not be confused with terminology extraction, which means the automatic extraction of terminology from text material.[3] Terminology recognition replaces the manual search in databases. The system automatically draws the translator’s attention to existing terminology to be used. Therefore, two types of benefit can be expected from the CAT Tool: 1. the possibility of keeping track of specialized language terms; 2. the manual or automatic retrieval of terminology. This factor will be called the terminology factor.
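
In its simplest form, terminology recognition amounts to looking up every term bank entry in the current source translation unit. The sketch below assumes a term bank represented as a plain dictionary and a hypothetical function `recognize_terms`; it matches whole words case-insensitively and performs no morphological analysis, so inflected forms are missed.

```python
import re


def recognize_terms(source_unit, term_bank):
    """Return (term, translation) pairs for term bank entries found in the unit."""
    hits = []
    for term, translation in term_bank.items():
        # Whole-word, case-insensitive match; no morphological analysis.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, source_unit, flags=re.IGNORECASE):
            hits.append((term, translation))
    return hits


# Hypothetical term bank entries.
term_bank = {
    "translation memory": "Translation Memory",
    "term bank": "Termbank",
}
```

The example also shows why language-dependent support matters: the plural “term banks” is not recognized by this naive matcher.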

2.2.6 Resource creation factor

There are three ways in which resources can be created automatically by CAT Tools:

  1. The creation of a translation memory out of existing parallel texts. This is done by a process called sentence alignment.[4]
  2. The creation of a list of term candidates of one language to be introduced into the term bank system. This is done by a process called monolingual terminology extraction.
  3. The creation of a list of term pair candidates from source and target language to be introduced into a term bank system. This is done by a process called word alignment or bilingual terminology extraction.[5] Translation memories are a precondition for bilingual terminology extraction.
We will later refer to this factor as the resource creation factor.

2.3 Platforms used by translators

As with other professions, we can observe a certain predominant platform used by translators.[6] This is in most cases a standard PC. The standard application of translators is the standard word processor found on that platform. Exceptions can be found within translation departments of institutions or industries where other platforms are chosen for strategic reasons. In the field of complex advertising material, Macintosh-based platforms are predominant. Generally, we predict that in the near future most developments will concentrate on PC-based platforms. Due to the budget profile of translators, programs must run within the typical restricted PC environment available at present.

2.4 Text formats on translators’ desks

Translators are faced with an overwhelming variety of different text formats. The predominant formats vary from country to country, but at present, there are a few global tendencies visible. The format is also influenced by the type of document. It is not possible to list all the formats, nor is it possible to give a statistical overview. But in what follows, we try to show a few tendencies:

The most widespread format is Microsoft Word (Version 2/6/7)[7], used for several kinds of text types up to technical documentation. In the field of technical documentation, FrameMaker is gaining more and more importance, followed by Interleaf. In some countries, WordPerfect (Version 5.2/6.1) plays the role that Microsoft Word plays elsewhere. It is notable that WordPerfect has recently been losing ground. A certain tendency towards WordPro from Lotus is visible in the same countries. For advertising material and generally for documents with a strong graphic arts orientation, Quark Xpress and PageMaker are often used.[8] Other formats play a subordinate role. Among tagged text formats, SGML/HTML plays the biggest role.

2.5 Conditions for the success of translation memories

The recent success of translation memories can be explained by several factors:

  • Translators have more and more access to the machine readable version of documents. Text flow is more and more computerized;
  • The processing power of modern computers now enables functionalities that were not available in the past and that are crucial for the successful implementation of such a tool. Generally speaking, we may state that time is on the side of this technology;
  • New technologies have been introduced for the handling of error-tolerant retrieval;
  • CAT Tools have reached the overall industry software quality standards concerning usability and software ergonomics.
  • The integration into the software environment of translators has significantly improved;
  • CAT Tools aim at supporting translators, not at replacing them. They avoid generative capacities and can be described by “all output is human input”. They free translators from boring work and let them concentrate on what they do better than machines, i.e. handling semantics and pragmatics. Generally, this leads to broader acceptance by translators;
  • The knowledge of translators about the benefits of computerizing their work is steadily growing.

2.6 Profile of CAT Tool users

As mentioned before, the general market for CAT Tools is broadening. In what follows, we give a rough classification of typical user profiles and user needs. This is not an exhaustive list, and there are always exceptions to it.

2.6.1 Multilingual societies

The use of CAT Tools is not only determined by different user profiles, but also differs significantly from country to country. Especially in countries with more than one national language, an overall sensitivity towards translation leads more quickly to automation of the translation process. Another factor is translation cost: in countries with high translation costs, rationalization software is introduced faster.[9]

2.6.2 Localization industry

In the past, CAT Tools have predominantly been used by a specific group of translators. This group can be described as “translators translating manuals of software products”, and they have therefore been close to software in general. This group is commonly referred to as the localization industry. They are the initial user group and have played a major role in influencing the development of CAT Tools. For the localization industry, the repetition and consistency factors are the most important.

2.6.3 Industrial users

Clearly, not only software manuals but every type of technical documentation is a good candidate for a translation memory system. This applies to all kinds of industries: from general engineering to pharmaceuticals, food production, electronics, consumer products, the automotive industry, aeronautics, etc. The size of a company is not as important as the international marketing strategy for its products. Another factor is product liability law, which enforces properly localized documentation for targeted markets. As in the localization industry, the repetition and consistency factors play the biggest role in applying translation memory technology.

2.6.4 Banking, Insurance and documents dealing with legal matters

Banks and insurance companies have several reasons to consider the application of translation memories in order to solve their translation problems:

  1. Many document types like balances, reports, documents dealing with shares etc. are highly repetitive;
  2. They are faced with a complicated international terminology and phrasal use of notions;
  3. We can observe a steadily growing standardization of this market.

Applications in the field of legal translation are very similar. For example, patent attorneys apply translation memories very successfully in this field. There is some overlap here with the user type “institution”.

In contrast to the former fields, we observe more stress on the terminology and reference/consistency factor than on the repetition factor.

2.6.5 Military

Communication within the military field is driven by the aim of being as exact as possible in descriptions of international actions that have to be taken. Military texts also have a lot in common with institutional texts. Strong emphasis lies here on document protection and on preserving the degree of confidentiality. This influences the way in which translation units are allowed to be stored. The most important factors are the terminology factor, the reference factor and the repetition factor.

2.6.6 Media / Multimedia Sector

The multimedia sector is influenced by a tendency towards globalization and centralization. More and more translation tasks arise here. Compared to other types of users, the degree of overall text repetition does not play a major role. The stress lies more on the terminology factor and the concordance factor. In the multimedia world, the orientation towards the Internet will produce more and more translation tasks involving HTML documents in the near future.

2.6.7 Institutions

A big user group can be identified within national, European and international institutions. This has to do partly with the existence of repetitive texts and especially with terminological problems within European and international communication. Extensive term banks are mainly built at institutional sites. But here, a few problems are visible: 1. The creation of term banks is extremely costly; 2. The proper administration of term banks is a complex task; 3. Terminology develops so fast in certain fields that it is almost impossible to keep track of it as quickly as required. It is interesting that the concordance factor, as a “side-effect” of CAT Tools, is extremely helpful in this context. During a test phase of the TRADOS Translator’s Workbench for Windows at the European Commission, it turned out that concordance access to a translation memory with only 28,000 translation units was in many cases more helpful for retrieving terminology than access to Eurodicautom, the biggest multilingual term bank in the world (over 600,000 entries). Therefore, we observe here a certain tendency towards the concordance factor and the resource creation factor. In institutions with important political text material, the reference and consistency factors play a major role.

In institutions with standardized report documents the repetition factor plays an important role.

Some institutional documents tend to be subphrase repetitive. That means that the repetition can be found only in a part of a translation unit. Up to now, there is no commercial system dealing properly with this problem (see also section 5.4).

2.6.8 Translation agencies

For several reasons a clear-cut description of the needs of this user type is rather difficult. This has to do with the diversification within this group of users.

In the case of translation agencies doing outsourced work for industry, the conditions mentioned under industry apply. Ideally, it makes no difference whether e.g. automotive documents are translated at production sites or within a translation agency. Because of a general tendency towards outsourcing, we find many instances of CAT Tool applications in the industrial field at the level of translation agencies. The same is true if localization is done by a translation agency or if a translation agency works mainly for an institution.

Translation agencies handling several clients face the problem that they are highly dependent on their clients with regard to the format and the type of the documents. In a few cases, current CAT Tools do not yet support the format in question; in other cases, the documents are not even available in machine-readable format.

Whereas in other areas the CAT Tool can be “tuned” to the type of text in question, within translation agencies the handling of the CAT Tool must be more flexible. This management task of the CAT Tool has led in the recent past to a new professional profile in translation agencies, the so-called IT-manager.

A distinction has to be drawn between translation agencies whose translations are made by in-house translators and translation agencies that manage freelance translators, with or without in-house revisers.[10] For the first group, the above-mentioned conditions on CAT Tool use apply.

For the second group, other conditions apply. Managing freelance translators requires several functionalities from a CAT Tool, such as “pretranslation” and “off-line-updating” (see technical section 0), as well as term bank functionalities for print or electronic publishing. For this type of translation agency, quality assurance can depend heavily on the proper use of CAT Tool applications.

2.6.9 Freelance translators

The use of a full-fledged CAT Tool by freelancers is up to now relatively rare. If there is a tool, in most cases it will be a term bank system. This is related to the fact that professional CAT Tools are in the price range of typical professional software products.[11] Freelance translators are by tradition very conservative with professional investments.

On the other hand, more and more freelancers will be confronted in the near future with pre-processed documents coming from CAT Tools, or they will be temporarily forced by their work suppliers to use a CAT Tool. In this area we can identify at least three problematic topics: 1. Who owns the copyright of a translation memory? 2. Translation memories make freelance translators more replaceable; 3. We observe a changing policy in translation pricing in combination with translation memories (see also chapter 4).

In the market, it can already be observed that skills in mastering this new technology are appreciated. It can be predicted that like in other professions, investments in the professional environment will be necessary for freelance translators.

2.6.10 Terminologists

In parallel to modern corpus-based lexicography, terminographic work can profit enormously from concordancing in specialized language translation memories. This is up to now a relatively underdeveloped field, but it can be predicted that it will play a major role in the future. Terminologists are therefore mainly interested in the concordance factor and the advanced resource creation factor.

2.6.11 Non-professionals

Up to now, non-professionals have not been using CAT Tools. In principle, it is foreseeable that applications will emerge in which such tools assist in casual translations on the basis of existing translation memories.

3. Technical requirements: Component parts of CAT Tools

3.1 Translation memory

A translation memory is a database storing translation units. The complication of this simple definition derives from the following requirements:

  1. The retrieval has to find similar translation units;
  2. This has to be performed as quickly as possible;
  3. The formatting information of several external formats has to be preserved within a translation memory;
  4. The handling of the translation memory system has to be integrated into the translator’s word processor;
  5. The system has to support as many document formats as possible.

3.1.1 How to cope with “similarity”?

“When are two sentences similar?” This is a very tricky question. There could be misspellings, differences in the formatting, differences in the use of punctuation marks, differences in embedded elements like e.g. index-markers, morphosyntactical differences, syntactical differences etc. But, for a human being it takes only a short time to say: “Yes, they are similar”. How to do this with computers? Certainly not with classical computation based on “0/1”, “yes-no”, “true-false”, “black-white” approaches.

Computation beyond Boolean logic has traditionally been called fuzzy. The problem with this term is that, outside of mathematical definitions, it has grown into something very vague. Even in a standard microwave handbook, you will find today a remark saying that the making of yoghurt is due to the application of fuzzy technology.

In modern computer science, there are successful approaches to similarity problems such as picture recognition, error-tolerant retrieval of DNA chains, robotics, etc. Neural networks and sparsely coded matrices in particular are a useful means of attacking similarity problems. Interestingly, the same applies to the retrieval task within translation memory systems. Whereas e.g. the first generation of the translation memory system of TRADOS was based on a classical “0/1” approach, doing (linguistically motivated) substring operations on classical database indices, the current generation employs a sparsely coded matrix approach. The advantages are evident: phenomena like misspellings and complicated syntactical deviations are now manageable, and the access time has been significantly reduced. Additional functionalities such as concordancing became possible only after a proper technique for error-tolerant retrieval had been introduced.

For simplicity, the notion fuzzy-match is used within the CAT Tools world to indicate the measure of similarity of two source translation units. It is important to understand that this is only a relative notion: a higher fuzzy-match value simply means higher similarity.

Good implementations of similarity measures are very important for all types of users. In fact, this is the central feature of a translation memory system. We will come back to the underlying techniques in section 3.1.4.
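
To make the idea of a relative similarity value concrete, here is a minimal sketch of one well-known error-tolerant measure: the Dice coefficient over character trigrams. It is a stand-in chosen for brevity, not the sparsely coded matrix approach mentioned above, and the function names `trigrams` and `fuzzy_match` are hypothetical.

```python
def trigrams(text):
    """Set of character trigrams of the lower-cased, space-padded text."""
    padded = "  " + text.lower() + "  "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def fuzzy_match(a, b):
    """Dice coefficient over trigram sets, expressed as a percentage (0 to 100)."""
    ta, tb = trigrams(a), trigrams(b)
    return round(100 * 2 * len(ta & tb) / (len(ta) + len(tb)))
```

Because a misspelling disturbs only the few trigrams that overlap it, a unit such as “Pres the buton.” still scores far higher against “Press the button.” than an unrelated sentence does. The absolute numbers carry no meaning of their own; only their ordering does.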

3.1.2 Interactive Access

If a translator wishes to translate a translation unit, the response time of the system is crucial. Acceptable responses are in the range of up to 1 second. The response time depends on:

  1. The power of the computer;
  2. The size of the translation memory;
  3. The type and number of additional processes joined with the translation memory access;
  4. The time spent in exchanging the translation unit information between the translation memory system and the front-end (word processor).

3.1.2.1 Processing power

The power of computers is steadily increasing, which is very welcome in the case of translation memory systems. In general, the faster the computer, the better the access time and the bigger a translation memory can be. In this respect, time is on the side of translation memories.

3.1.2.2 Size of translation memories

Working translation memories

On systems with traditional data access that do not use error-tolerant retrieval technology, access time is a big problem. The standard solution here is to pre-process the text in order to derive a smaller working translation memory[12] containing all translation units that are close to the text. Therefore, a decision has to be made on a threshold similarity value defining which translation units are sufficiently similar to be placed in the working translation memory. The pre-processing works in the following way:

  1. The system segments the text into translation units, according to the segmentation strategies of the product;
  2. It retrieves the translation units and places all matches above the threshold value into the working translation memory.
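
The two pre-processing steps can be sketched as follows. The sketch assumes a master memory held as a dictionary, a naive sentence splitter as the segmentation strategy and a `difflib` ratio as the similarity value; the names `segment` and `build_working_memory` and the threshold of 0.7 are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def segment(text):
    """Naive segmentation: split after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_working_memory(text, master, threshold=0.7):
    """Copy every master unit that is close enough to some segment of the text."""
    working = {}
    for seg in segment(text):
        for source, target in master.items():
            if SequenceMatcher(None, seg, source).ratio() >= threshold:
                working[source] = target
    return working
```

Units scoring below the threshold never reach the working memory, which is the source of several of the disadvantages of this approach.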

The interactive access then works on this smaller translation memory. Working translation memories are a work-around with the following disadvantages:

  1. An additional pre-processing phase is required;
  2. The threshold value is based on heuristics. A user has no access to translation units below this value, even if there could be a benefit in it. Especially with concordancing this is not acceptable;
  3. If a translator interactively changes the segmentation, e.g. in case of segmentation errors, the system is no longer able to retrieve translation units;
  4. The meaning of “interactivity” is broken: the user is restricted to a “buffer” translation memory and is not working interactively on the real translation memory. That means that changes to the working translation memory are not accessible to other users. If translation memories are used in a group, this can be a great disadvantage;
  5. The changes to the working translation memory must be resynchronized after the translation process with the original master translation memory. This is again an additional processing phase.

Master translation memory

In modern systems based on sparsely coded matrices, real interactive work on big translation memories is possible. Typical big translation memories are in the size of 100,000 translation units. Since this technology is currently in a rather dynamic development phase, memories in the range of 500,000 to 1,000,000 translation units will be possible by the end of this year. Current research estimates speak of possible enhancement factors of up to 20% and 40% bigger translation memories with constant access times.[13]

The ideal situation for all user groups is a translation memory system that is based on modern technology and at the same time allows both the creation of a working translation memory and direct access to the master translation memory.[14] A combination of both could be interesting: translating with the help of a working translation memory and at the same time concordancing in a master translation memory.[15]

3.1.2.3 Additional processes

Other processes can run on the system concurrently with the main translation memory process. These could be e.g. term recognition processes or the passing of translation units to an attached machine translation system. Here again, the response time can degrade, possibly significantly. Possible solutions in this respect are:

  1. Additional processes can be controlled by the translator;
  2. Processes are performed in the background, so that the translation memory process always has the highest priority.

3.1.2.4 Data exchange with the front end

Normally, the data exchange between word processor and translation memory system is sufficiently fast. This applies to all possible architectures concerning front-end integration (see also section 3.2).

3.1.3 Batch-Processing

In many cases it is useful to perform certain tasks in a batch. That means that the translation memory runs certain non-interactive processes in order to compute certain results. We may distinguish the following off-line processes and we will explain their aim and when they should be considered.

3.1.3.1 Preparation of a working translation memory

The preparation of a working translation memory has already been addressed in section 3.1.2.2. The aim is the generation of a smaller working translation memory. This could be necessary if the computer environment does not allow fast enough access to a big translation memory. The use of working translation memories only makes sense if a versatile update mechanism is available to resynchronize the changes in the working translation memory with the master translation memory. The disadvantages of the approach have already been mentioned in section 3.1.2.2, as well as the benefits for different user groups.

3.1.3.2 Pretranslation

Pretranslation, as we understand it, is the off-line process of replacing all 100% matches, all fuzzy-matches above a certain threshold and possibly terminology detected by the terminology recognition process. For ergonomic reasons, the result of the pretranslation process should be marked by the system, preferably by the use of colours.

Pretranslation has the following advantages:

  1. If a text contains many 100% matches, the translation can be performed much more quickly, since the translator can skip the parts that are already translated;
  2. In the case of external translators who do not have access to the CAT Tool, texts can be pre-processed and then translated off-line by the external user. This requires sophisticated updating possibilities in the translation memory system (see section 3.1.3.5).

Pretranslation is a process which is needed by nearly all user groups. Especially for translation agencies supplying work to freelancers, this is a very important feature.
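The pretranslation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary-based memory and the 0.75 threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the fuzzy matcher of a real system.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1]; a stand-in for a real fuzzy matcher."""
    return SequenceMatcher(None, a, b).ratio()


def pretranslate(segments, memory, threshold=0.75):
    """Return one (source, proposed_target, kind) triple per segment.

    memory maps source units to target units. kind is "exact", "fuzzy"
    or "none", so that a front-end could colour the result accordingly.
    """
    results = []
    for seg in segments:
        if seg in memory:
            results.append((seg, memory[seg], "exact"))
            continue
        best_src, best_score = None, 0.0
        for src in memory:
            score = similarity(seg, src)
            if score > best_score:
                best_src, best_score = src, score
        if best_src is not None and best_score >= threshold:
            results.append((seg, memory[best_src], "fuzzy"))
        else:
            results.append((seg, None, "none"))
    return results


# Hypothetical example data.
memory = {"Press the red button.": "Appuyez sur le bouton rouge."}
doc = ["Press the red button.", "Press the green button.", "Close the cover."]
result = pretranslate(doc, memory)
for source, target, kind in result:
    print(kind, "|", source, "|", target)
```

The `kind` value is what a front-end would translate into colours; segments marked "none" are left for the translator.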

3.1.3.3 Analysis of repetitivity

Repetitivity analysis is a process that runs a document against a translation memory and computes statistics on the 100% matches, fuzzy-matches and internal repetitions encountered. In addition, word counts and translation unit counts as well as the overall statistical distribution of items in a document should be reported. It is important that the segmentation of the text is sufficiently powerful to segment words and translation units properly and that it treats “non-translatable” items correctly. Non-translatable items are e.g. graphics and automatic field codes (automatic numbering, dates etc.) that are normally not translated but simply placed in the target translation unit by the translator. They are therefore called placeables.[16]

Also very important is the ability to compute the delta of the similarity across a set of documents. In short, computing deltas answers questions like: “What would happen if I first translated this document with the translation memory system and then the following one?”

Repetitivity analysis plays an increasing role in negotiating the prices of translation projects. For that reason the statistical analysis (counts of words, translation units, repetitions etc.) must be exact.

This feature is in itself very important for all users who do pricing. In addition, for all user groups where the repetitivity of documents is not easy to measure (e.g. for users at the institutional level), delta computing is a means of making estimates objective.
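A minimal sketch of such an analysis, including the delta computation, might look as follows. The function names, the whitespace word count and the 0.75 fuzzy threshold are assumptions made for illustration; a real system would use its own segmentation, placeable handling and similarity measure.

```python
from difflib import SequenceMatcher


def analyse(segments, memory_sources, fuzzy_threshold=0.75):
    """Count exact matches, fuzzy matches, internal repetitions and new units."""
    stats = {"exact": 0, "fuzzy": 0, "repetition": 0, "new": 0, "words": 0}
    seen = set()
    for seg in segments:
        stats["words"] += len(seg.split())  # naive whitespace word count
        if seg in memory_sources:
            stats["exact"] += 1
        elif seg in seen:
            stats["repetition"] += 1        # repeated inside the document itself
        elif any(SequenceMatcher(None, seg, src).ratio() >= fuzzy_threshold
                 for src in memory_sources):
            stats["fuzzy"] += 1
        else:
            stats["new"] += 1
        seen.add(seg)
    return stats


def delta(first_doc, second_doc, memory_sources):
    """Analyse second_doc as if first_doc had already been translated."""
    return analyse(second_doc, set(memory_sources) | set(first_doc))


# Hypothetical example data.
memory = {"Open the file."}
doc_a = ["Open the file.", "Save the file.", "Save the file."]
doc_b = ["Save the file.", "Print the report."]

print(analyse(doc_a, memory))
print(analyse(doc_b, memory))       # doc_b against the memory alone
print(delta(doc_a, doc_b, memory))  # doc_b after doc_a has been translated
```

Comparing the last two results shows the effect of the translation order: a unit that is only a fuzzy-match against the memory alone becomes a 100% match once the first document has been translated.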

3.1.3.4 Analysis of frequent occurrences of translation units

The detection of all translation units occurring more than a certain number of times in a document can be very helpful. A list of these translation units can be translated in isolation. In this way, they pre-fill the translation memory with the highly repetitive parts of the document. This is only possible if a translation “out of context” is feasible, which is frequently the case with technical documentation.

As a counterpart, the export of all translation units where no similar unit (below a certain threshold value) can be found is useful. They can be pre-processed by an automatic translation system, in order to speed up the translation project.[17] This is only interesting if automatic machine translation plays a role for a user group.
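The first of these two analyses, the detection of frequent translation units, amounts to a simple frequency count. The following sketch assumes that the document has already been segmented; the function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def frequent_units(segments, min_count=2):
    """Return (unit, count) pairs for units occurring at least min_count times."""
    counts = Counter(segments)
    return [(seg, n) for seg, n in counts.most_common() if n >= min_count]


# Hypothetical example data.
doc = [
    "Press OK.",
    "Select a file.",
    "Press OK.",
    "Enter your name.",
    "Press OK.",
    "Select a file.",
]
print(frequent_units(doc))
```

The resulting list could be handed to a translator in isolation, provided that translation out of context is feasible for the text type.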

3.1.3.5 Updating and revisions

The revision of translation memories plays a very important role. All user groups need this feature. Revision is especially important if reference and consistency are major factors. Unfortunately, only a few current systems properly support all facets of revision procedures. But first, a summary of the technical possibilities:

In interactive translation memory systems, updating is done automatically whenever the user accepts a translation unit. Revisions can easily be done by “re-opening” a translation unit. This is the ideal situation since the updated translation unit is immediately visible to all users of the translation memory system.

Another important requirement is the direct update of the translation memory without using the front-end. This can be achieved by concordancing and editing in the concordance results, that is, directly in the translation memory. Only this procedure enables revisors to get an immediate overview of a certain topic in question. Up to now only the TRADOS system supports this functionality.

In non-interactive translation memory systems, as well as when using working translation memories, an explicit update has to take place in the form of a batch process when the translation project has been finished.

Source preserving systems

A distinction should be made between systems that keep the source translation unit in the document in a “hidden” form and systems that don’t. During the translation process, the first type of system creates a bilingual document in which the original source translation units are hidden. We will call this a source preserving system. Source preserving requires a “clean-up” step: after a translation project has been finalized, all source translation units have to be deleted from the document. This is normally done by an update procedure.

Although both types of systems allow for updating working translation memories, only source preserving systems are flexible enough for off-line revisions. Off-line revision means:

  1. A revisor has access to source and target translation units in a document without using the CAT Tool. This enables global replace operations in a document that has no connection to a translation memory system. It also allows the use of the revision facilities provided by the word-processor. Another very important aspect here is that revisors traditionally prefer to work on paper. In that case, the revisions are later typed in by typists who neither know nor use the CAT Tool. After revision, the revised document can be used to update the translation memory;
  2. In source preserving systems, it is even possible to update a document from a translation memory. In cases where a translation memory is more up to date than a text, this can be very useful.

To clarify this again: documents that have been translated with source preserving systems and have not been cleaned up embed a translation memory in themselves. Therefore, giving away such a document is like giving away a translation memory.[18]
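The clean-up step can be pictured as follows. In this sketch a bilingual document is modelled simply as a list of (source, target) pairs, which is an assumption made for illustration; real systems hide the source units in the word-processor format itself.

```python
def clean_up(bilingual_units, memory):
    """Update the memory from the (source, target) pairs and return the target-only text."""
    for source, target in bilingual_units:
        memory[source] = target  # a revised target overwrites the older entry
    return " ".join(target for _, target in bilingual_units)


# Hypothetical bilingual document, revised off-line before the clean-up.
bilingual = [
    ("Switch off the printer.", "Schalten Sie den Drucker aus."),
    ("Close the cover.", "Klappen Sie die Abdeckung zu."),
]
memory = {"Close the cover.": "Deckel zu."}

clean_text = clean_up(bilingual, memory)
print(clean_text)
print(memory)
```

The sketch also makes the last remark concrete: the list of pairs in an uncleaned document is all that is needed to rebuild the corresponding part of the translation memory.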

3.1.4 Search-Engine and Data Storage

Many misunderstandings derive from the fact that the distinction between the search engine and the data storage within translation memory systems is not drawn clearly enough. The search engine is responsible for the retrieval of similar translation units and the data storage is responsible for the physical storage of translation units. Physical storage can be done with every kind of database system, from standard data structures up to SQL-Servers etc.

The confusion arises from systems that operate directly on the indices of the chosen data storage system and where the intelligence for performing error-tolerant retrieval is closely intertwined with the data storage engine. From the point of view of attacking the “similarity” problem, this is rather inefficient and not a problem-oriented solution. Arguing in favour of a translation memory system by pointing to the data storage engine is therefore rather misleading. Data storage in fact plays only a minor role in judging a translation memory system. Data storage has more to do with security and networking than with the main task of a translation memory system: the fast retrieval of similar translation units.
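The separation argued for here can be illustrated with a small sketch in which the search engine keeps its own index and only exchanges unit identifiers with the storage layer. The class names are hypothetical, and the character-trigram overlap used for retrieval is a deliberately simple stand-in; it is not the matrix approach discussed in this paper.

```python
from collections import defaultdict


class DictStorage:
    """Physical storage: maps a unit id to a (source, target) pair."""

    def __init__(self):
        self._units = {}

    def put(self, unit_id, source, target):
        self._units[unit_id] = (source, target)

    def get(self, unit_id):
        return self._units[unit_id]


def trigrams(text):
    """Set of character trigrams of the lower-cased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramSearch:
    """Search engine: owns its index and knows nothing about the storage."""

    def __init__(self):
        self._index = defaultdict(set)  # trigram -> unit ids
        self._grams = {}                # unit id -> trigram set

    def add(self, unit_id, source):
        grams = trigrams(source)
        self._grams[unit_id] = grams
        for gram in grams:
            self._index[gram].add(unit_id)

    def best_match(self, query):
        """Return (unit_id, score) of the most similar unit, or None."""
        q = trigrams(query)
        candidates = set()
        for gram in q:
            candidates |= self._index.get(gram, set())
        best = None
        for uid in candidates:
            grams = self._grams[uid]
            score = len(q & grams) / len(q | grams)  # Jaccard overlap
            if best is None or score > best[1]:
                best = (uid, score)
        return best


# Hypothetical example data.
storage, engine = DictStorage(), TrigramSearch()
pairs = [("Close the door.", "Fermez la porte."),
         ("Switch on the light.", "Allumez la lampe.")]
for uid, (src, tgt) in enumerate(pairs, start=1):
    storage.put(uid, src, tgt)
    engine.add(uid, src)

hit = engine.best_match("Close the doors.")
print(hit, storage.get(hit[0]))
```

Because the two classes only share identifiers, `DictStorage` could be replaced by any database without touching the retrieval logic, which is the point made above.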

More crucial for the behaviour of a translation memory system is the architecture of the search engine. As already pointed out in section 3.1.1, sparsely coded matrix approaches (as a subtype of neural networks) are currently state of the art.

From traditional search engines ‑ in the case of translation memory systems characterized by linguistically motivated string operations on data-storage indices ‑ we cannot expect significant improvements in the future.[19]

The contrary applies to matrix approaches. Even though this technique is still at an early stage of development, it is already ahead of the classical approach. It is certain that research will lead to significant enhancements in the near future. Current prototype developments show that retrieval can be improved by 20 to 40% (see also the explanations in section 3.1.2.2).

3.1.5 Networking

Another area of misunderstanding is the network capabilities of current translation memory systems.

In an ideal world, the following client-server scenario can be depicted:

If a large number of users search for different source translation units at the same time, a translation memory server should be capable of providing them with the required set of target translation units in nearly real-time. Up to now, such an architecture does not exist on the market.

Solutions in the client-server area are only available on the basis of batch-processing, where a server creates a temporary working translation memory for each user, which is then copied to the user’s workstation. The disadvantages of working translation memories have already been mentioned in section 3.1.2.2.

Interactive solutions are up to now only available using file-sharing architectures. There remains considerable room for future improvements here.

Generally it could be said that for the majority of current users the file-sharing architecture is a sufficiently powerful solution. If a big group of users has to share a translation memory, a good choice is a system which fits major needs and which ensures development in client-server direction.[20]

3.1.6 Additional information stored in translation memory systems

To facilitate the interpretation of target translation units, they should be classified according to additional information.

3.1.6.1 Formatting information

Preserving formatting information is very important in order to:

  1. Reduce the time spent in the translation phase;
  2. Detect formatting differences between translation units;
  3. Adapt fonts automatically. Clever translation memory systems even allow for explicit font mapping in the translation phase.

3.1.6.2 Administrative information

Administrative information should be linked with a translation memory. Requirements here are:

  1. Administrative information must be user-definable. All users have different needs. Therefore, “fixed field” approaches are unacceptable. Users should check whether the number and the length of the fields provided by a translation memory system are sufficient for them;
  2. There must be automatically maintained fields. Typically, these are fields such as “creation date”, “creation user”, “change date”, “change user”, “used date” etc. Since these fields enlarge a translation memory, there should be a means to select and deselect them. Very important is an automatically updated usage counter, which keeps track of the use of translation units. This allows for a later reduction of a translation memory to all the translation units that have been used at least during a certain time.
  3. All additional information must be accessible by the selection process of target translation units; E.g. a user must be able to express the preference for translation units belonging to the subject field “chemistry” etc.
  4. Additional fields should be used to install and enforce security mechanisms;
  5. Additional fields should be used with concordancing into the translation memory. Here again we see many improvements possible in the future.

In principle, a translation memory is subject to the same classification requirements as a term bank system. A translation memory system has to support this feature. This is a major functionality for all user groups.
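The requirements listed above can be summarized in a small data structure. The following sketch is only an illustration: the class and function names are hypothetical, user-definable information is modelled as a free dictionary, and only two of the automatically maintained fields are shown.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TranslationUnit:
    source: str
    target: str
    attributes: dict = field(default_factory=dict)           # user-definable fields
    created: datetime = field(default_factory=datetime.now)  # maintained automatically
    usage_count: int = 0                                     # maintained automatically


def select(units, source, preferred):
    """Return the unit for source that shares the most attributes with preferred."""
    candidates = [u for u in units if u.source == source]
    if not candidates:
        return None

    def rank(unit):
        return sum(1 for key, value in preferred.items()
                   if unit.attributes.get(key) == value)

    best = max(candidates, key=rank)
    best.usage_count += 1  # the usage counter is updated on every retrieval
    return best


def prune(units, min_usage):
    """Reduce a memory to the units used at least min_usage times."""
    return [u for u in units if u.usage_count >= min_usage]


# Hypothetical example data.
units = [
    TranslationUnit("base", "Base", {"subject": "chemistry"}),
    TranslationUnit("base", "Sockel", {"subject": "architecture"}),
]
hit = select(units, "base", {"subject": "chemistry"})
print(hit.target, hit.usage_count)
print([u.target for u in prune(units, 1)])
```

Because `attributes` is an open dictionary, no fixed set of fields is imposed, and the same fields could equally be used for concordance filters or access rules.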

3.2 Front-end: integration into word processors

The term front-end denotes the application with which the translator controls the translation memory system. This should normally be a standard word processor. But there are still some old-fashioned systems on the market that provide their own editor:

Translation memory systems with own editor

Some translation memory systems force the user to use a built-in editor for translating text. This has turned out to be an unacceptable solution for translators. The disadvantages can be briefly summarized:

  1. The document has first to be converted into the internal editor format and later, after translation, it must be converted back into the original format. This is a source of formatting errors and requires additional work to be done.
  2. Idiosyncratic editors are not as comfortable as standard word processors on the market. Most often they even lack automatic reformatting capabilities, multi-level undo/redo operations, spelling checkers, thesaurus, hyphenation; they are character based, etc. Nowadays nobody likes to work with such a tool.
  3. Translators are already familiar with their standard word processor. Therefore, it is a natural solution to integrate the translation memory into the environment translators are already familiar with.

The only advantage that systems with their own editor can offer is that an integrated editor allows for more control over the user. But this is already changing due to new developments in the area of standard word processor systems and, all in all, it can already be seen that this type of tool will disappear from the market.

Translation memory systems integrated into standard word processors

The current state of the art is the integration of translation memory systems into standard word processors like Microsoft Word for Windows (Version 6 or 7) or WordPerfect for Windows (Version 6.1).

3.2.1 What does “integration” mean

Integration into an existing word processor can be done in several ways. It is not sufficient to say that a system is integrated; it is more important to see how this is done and how far the integration goes. Again, different solutions have advantages and disadvantages for users:

3.2.1.1 Internal and external integration

The first distinction which has to be drawn concerns the dependence of the translation memory system on the word-processor environment. External integration means that the translation memory runs as an application independent of the word-processor, using its own windows to display retrieval results and its own menus for the manipulation of the translation memory system. Internal integration means that the translation memory system is completely integrated into the word-processor, using only the means for display and manipulation that the word processor offers.

Internal integration

Internal integration has the advantage that the application appears to end-users as a functional extension of the word-processor. On the other hand, there are several disadvantages:

  1. Internal integration can only use the display means provided by the word-processor. This leads to clashes if a certain formatting is used both within the document itself and by the translation memory system. For example, the translation memory system uses word-processor formatting to mark differences between source and target translation units. If colours are used for this purpose, they cannot be used within the document itself. Additional information such as type of changes, retrieved terminology, placeables etc. makes the visualization almost impossible. Therefore, internally integrated solutions often avoid direct display, which means that users have to open and close windows in order to consult information from the system. From an ergonomic point of view this is a major disadvantage.
  2. Internal integration also means a higher dependency on the word-processor which makes it ‑ from a software engineering point of view ‑ more difficult to quickly integrate a translation system into new word-processors re-using existing functionalities. Therefore, if internal integration is used, quick responses to new platforms or updates of word processors cannot be expected.

External integration

Externally integrated systems appear to users as applications of their own. This has the disadvantage that users are forced to use a tool outside their well-known word-processor. On the other hand, translation memory systems integrate many functionalities, so that bundling all functions into one running application seems more natural than “misusing” the already densely packed menu structures of modern word-processors.

As already mentioned, a translation memory system confronts the user with a great deal of additional information on the screen. A fixed window with its own formatting means, like brackets, marking, use of colours, etc. is easier for users to interpret.[21] The disadvantage here is that bigger screens are needed for this type of application.[22]

External integration has the advantage that quicker upgrades to new platforms are possible, since only the part responsible for communication with the word-processor has to be reimplemented. This is an important point to be considered when purchasing a system.

All in all, and especially from an ergonomic point of view, users will have advantages from external integration.

3.2.1.2 Indirect integration

In some cases the front-end used for the creation of documents seems rather complicated to translators. This is especially the case in desktop publishing systems used for technical translations, such as FrameMaker or Interleaf. In this case it is sensible to convert from the desktop publishing system to the standard word processor translators are familiar with. This type of integration is called indirect integration.

Powerful conversion tools have recently been developed which reduce the complex formats of desktop publishing systems to a format that translators can work with.[23] The advantage of staying in the normal word-processor environment outweighs the disadvantage of the two conversion steps involved.[24] Another area for successful indirect integration is SGML/HTML.

3.2.1.3 Depth of integration

Big differences can be observed according to the depth and sophistication of integration.

Coverage of constructs

As already mentioned, a translation memory system identifies a source translation unit and retrieves a similar or identical target translation unit. But, what does the system identify as a source translation unit? Here, we can see some deviations: A proper translation memory system should support all constructions foreseen in the word-processing system, like: tables, footnotes, endnotes, field-codes, frames, columns, embedded objects, pictures, indices, revision codes, etc.

This is a very important aspect for all types of users. They have to verify that a certain tool properly supports all constructions they have to deal with in their daily documents.

Segmentation

Another very important aspect is the segmentation capabilities of a translation memory system. User acceptance will be very low if users are bothered with segmentation errors.

Experience in computational linguistics has shown that segmentation is not at all a trivial task. There are ambiguities that cannot be decided exactly (e.g. a punctuation mark after a number), there are language dependent phenomena (e.g. the semicolon in Greek, abbreviations in Finnish), and there are document and user type dependent phenomena (e.g. the treatment of tabulators, semicolons etc.).

The only way of overcoming this problem is to open the segmentation to users, so that it can be adapted to languages, to preferences in ambiguous cases and to document inherent phenomena. Therefore a translation memory system should allow users to define their own segmentation rules.[25]

Another factor in segmentation is the possibility for users to define lists of abbreviations, ordinal followers etc.
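How user-defined lists influence segmentation can be shown with a short sketch. It assumes a single, very simple rule (a sentence ends at a full stop, exclamation mark or question mark followed by whitespace) and treats “ordinal followers” as words that may follow an ordinal number such as “3.”; both the rule and this reading of the lists are simplifications made for illustration.

```python
import re


def segment(text, abbreviations=frozenset(), ordinal_followers=frozenset()):
    """Split text into segments, honouring user-defined exception lists."""
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        last_token = text[start:match.end()].split()[-1]  # e.g. "ca." or "3."
        rest = text[match.end():].split(maxsplit=1)
        next_word = rest[0].strip(".,;:!?") if rest else ""
        if last_token.lower() in abbreviations:
            continue  # a full stop after a known abbreviation does not end a segment
        if last_token[:-1].isdigit() and next_word in ordinal_followers:
            continue  # an ordinal number followed by e.g. a month name
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


# Hypothetical German example text.
text = "Die Pumpe wiegt ca. 5 kg. Sie wurde am 3. Mai 1996 geliefert. Bitte testen!"
naive = segment(text)
tuned = segment(text, abbreviations={"ca."}, ordinal_followers={"Mai"})
print(len(naive), naive)
print(len(tuned), tuned)
```

Without the two lists the text is cut after “ca.” and after “3.”; with them, the user obtains the three segments a translator would expect.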

As can already be seen from ambiguous cases, segmentation errors are sometimes unavoidable. Therefore a system should allow for an interactive shrinking and expanding of source translation units in order to specify exactly the size of a translation unit. If this is not possible, users will be displeased after a while.

In many texts, numbers are frequent and changes to the document often consist only of changed numbers. The automatic substitution of numbers and other invariable constructions is a big help. This functionality is especially important for users in the banking area. At the same time, users must be able to deactivate this function if it does not apply to a certain document type.
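The automatic substitution of numbers can be sketched as follows. The function names and the regular expression for numbers are assumptions; in particular, the sketch copies numbers unchanged and does not convert number formats between languages.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def skeleton(text):
    """Replace every number by a placeholder so that only the wording is compared."""
    return NUMBER.sub("#", text)


def substitute_numbers(new_source, old_source, old_target):
    """Reuse old_target if new_source differs from old_source only in its numbers.

    Returns the adapted target, or None if the wording differs or the
    numbers cannot be mapped unambiguously.
    """
    if skeleton(new_source) != skeleton(old_source):
        return None
    mapping = {}
    for old, new in zip(NUMBER.findall(old_source), NUMBER.findall(new_source)):
        if mapping.setdefault(old, new) != new:
            return None  # the same old number would need two different replacements
    return NUMBER.sub(lambda m: mapping.get(m.group(), m.group()), old_target)


# Hypothetical example data.
old_source = "Revenue rose by 12 percent in 1995."
old_target = "Der Umsatz stieg 1995 um 12 Prozent."
new_source = "Revenue rose by 15 percent in 1996."
adapted = substitute_numbers(new_source, old_source, old_target)
print(adapted)
```

Since the mapping is built from the source side, the order of the numbers in the target unit may differ from the order in the source unit, as it does in this example.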

Segmentation must also provide a means for exclusion. The term exclusion means that parts of the document can be marked so that they are excluded from processing by the translation memory system. This must be possible at the paragraph level (e.g. to exclude foreign language citations or parts of programming language code) and at the character level (e.g. to exclude invariable elements like proper names in biographical documents etc.). This feature could be, for some projects, a precondition for the successful application of a translation memory system.

In the subfield of the automatic detection and conversion of date and number formats, many improvements are possible in future developments.

3.2.2 What does “front-end independence” mean?

The storage of translation units in a translation memory should be independent of the front-end. That means that, in an extreme case, a user can translate part of a document by using e.g. Microsoft Word and the rest within WordPerfect, operating on the same translation memory.

This is an important feature for the use of translation memories if:

  1. Heterogeneous front-ends are used by different users;
  2. A change of a front-end is planned on the user’s side.
  3. Translation memories are exchanged between different user groups.

In principle this feature is valuable for all users, since changes to front-ends are happening constantly in the very dynamic software field.

Although this condition appears to be simple, its technical realization is complex, because the formatting conventions of different front-ends have to be mapped into one single representation in a translation memory system. If a translation memory system offers front-end independence, this indicates a highly sophisticated format management.[26]

3.2.3 A special front-end: concordance access

As mentioned before, concordancing provides access to translation memories. In this sense concordancing is a front-end of its own, enabling tasks like:

  1. Terminology search;
  2. Maintenance of a translation memory.

In this field, many improvements can be expected in the future. E.g. using filters for concordancing, displaying selective parts of a translation memory via concordance windows, concordance access to more than one translation memory, etc.

Concordance access is important for all users, but especially for institutions and terminologists. It has to be remarked that only modern systems with sparse matrix technology allow for concordancing.
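What a concordance query returns can be illustrated with a very small sketch. It scans the memory linearly and therefore says nothing about how an efficient system would implement concordancing; the function name, the list-of-pairs memory and the `width` parameter are hypothetical.

```python
def concordance(memory, phrase, width=30):
    """Return (source_context, target) for every source unit containing phrase."""
    hits = []
    needle = phrase.lower()
    for source, target in memory:
        pos = source.lower().find(needle)
        if pos >= 0:
            left = max(0, pos - width)
            right = pos + len(phrase) + width
            hits.append((source[left:right], target))
    return hits


# Hypothetical example data.
memory = [
    ("Open the cover and remove the empty cartridge.",
     "Klappen Sie die Abdeckung auf und entnehmen Sie die leere Patrone."),
    ("Switch off the printer.", "Schalten Sie den Drucker aus."),
    ("Close the cover.", "Klappen Sie die Abdeckung zu."),
]
hits = concordance(memory, "cover")
for context, target in hits:
    print(context, "|", target)
```

Seeing all units that contain a word side by side is what makes concordancing useful both for terminology search and for the maintenance of a translation memory.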

3.3 Terminology databases

Terminology database systems are, as applications, older than translation memory systems. They are an integral part of CAT Tools and serve the maintenance of terminology. Terminology management is a complex task and this paper will not address all facets of terminology systems. Only a short overview of requirements and user needs will be given.

3.3.1 Central features of term bank systems

There are certain preconditions to a term bank system, which are absolutely necessary for a successful management of terminology:

  1. The database must be user definable. Every type of user has its own specific requirements, so that inflexible systems have proven not to be applicable;
  2. Every field must be of variable length. Fixed field length and mask-oriented data input are not acceptable any more;
  3. Terminology databases have to be concept-oriented and multilingual. Bilingual databases, where swapping of language directions is required, are not acceptable any more on the market;
  4. Term bank systems must provide proper means for revision and user dependent access rules;
  5. They must support cross-references;
  6. Terminology retrieval must be error tolerant. Chemical terms, or languages in which terms tend to be rather long chains of characters (e.g. Finnish), cannot be handled in a meaningful way without error tolerant retrieval.

In short, since terminology can play a key role for certain types of CAT Tools users, a CAT Tool should offer a full-fledged term bank system as an application of its own. It must be possible to work only with the term bank without the translation memory and vice versa.

If a CAT Tool does not offer a sufficient solution for terminology management, this can be dangerous for users, who will probably have to convert to a real term bank system later. It is in no way a solution to have two term bank systems running in parallel, one real application for terminology management and a second one for the translation memory system.
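Several of the requirements listed in section 3.3.1 (concept orientation, multilinguality, user-definable fields, cross-references and error tolerant retrieval) can be made concrete in a short sketch. The class names and the 0.8 cutoff are hypothetical, and `difflib.get_close_matches` from the Python standard library merely stands in for a real error tolerant retrieval component.

```python
from dataclasses import dataclass, field
from difflib import get_close_matches


@dataclass
class Concept:
    concept_id: int
    terms: dict = field(default_factory=dict)     # language code -> list of terms
    fields: dict = field(default_factory=dict)    # user-definable, variable length
    see_also: list = field(default_factory=list)  # cross-references (concept ids)


class TermBank:
    def __init__(self):
        self.concepts = {}

    def add(self, concept):
        self.concepts[concept.concept_id] = concept

    def lookup(self, term, source_lang, target_lang, cutoff=0.8):
        """Error tolerant lookup in any language direction."""
        index = {}
        for concept in self.concepts.values():
            for t in concept.terms.get(source_lang, []):
                index[t.lower()] = concept
        matches = get_close_matches(term.lower(), list(index), n=1, cutoff=cutoff)
        if not matches:
            return []
        return index[matches[0]].terms.get(target_lang, [])


# Hypothetical example data.
bank = TermBank()
bank.add(Concept(1,
                 terms={"en": ["hydrochloric acid"],
                        "fr": ["acide chlorhydrique"],
                        "fi": ["suolahappo"]},
                 fields={"subject": "chemistry"}))

print(bank.lookup("hydrocloric acid", "en", "fr"))  # misspelled query
print(bank.lookup("suolahapo", "fi", "en"))         # other direction, no swapping
```

Because the terms of all languages hang on one concept, any language can serve as source or target without swapping language directions.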

3.3.2 Terminology recognition

If a source translation unit is passed to the translation memory system, the system can automatically search for existing terminology in the unit. This is called terminology recognition. Terminology recognition differs between systems in certain ways:

  1. In the coverage of languages supported;
  2. The types of morphosyntactical constructions supported: single words, complex terms, compounding phenomena, separable prefix constructions etc. Especially the support of compounding constructions is crucial within technical language.

In the case of pretranslation via batch processing, recognized terms should be automatically replaced or inserted into the document. This gives the possibility to provide e.g. external translators with a dictionary of terms for a given document, which is already included into the document. For certain user groups like e.g. translation agencies supplying freelance translators this is an important functionality.
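A minimal sketch of terminology recognition is a longest-match lookup of word sequences in a glossary. The function name and the glossary are hypothetical, and the sketch deliberately ignores the morphology and compounding phenomena mentioned above, which a real system would have to handle.

```python
import re


def recognise_terms(segment, glossary, max_words=4):
    """Return (term, translation) pairs, preferring the longest term at each position."""
    words = re.findall(r"\w+", segment.lower())
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in glossary:
                found.append((candidate, glossary[candidate]))
                i += n  # continue after the recognized term
                break
        else:
            i += 1      # no term starts at this word
    return found


# Hypothetical example data.
glossary = {
    "ink cartridge": "Tintenpatrone",
    "cartridge": "Patrone",
    "paper tray": "Papierfach",
}
terms = recognise_terms("Insert the ink cartridge before loading the paper tray.", glossary)
print(terms)
```

In a batch pretranslation, the recognized pairs could be written into the document, so that an external translator receives the relevant dictionary together with the text.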

3.4 Alignment of Sentences

Sentence alignment is a recycling process. Old text material in the form of parallel texts (texts which are translations of each other) is transformed into a translation memory.

Sentence alignment can be based on several strategies, like statistics on the distribution of words or characters, formatting information or heuristics like numbers and acronyms that appear in a text. Other interesting approaches could be pattern recognition strategies. In all cases alignment results have their errors. This paper does not intend to go into detail of sentence alignment strategies and error conditions, but we can state the following requirements for the improvement of the sentence alignment component of CAT Tools:

  1. A user should have the possibility to influence the weighting in the sentence alignment strategy used;
  2. An alignment tool should use as much available information as possible. This could include access to the term bank system as well as access to the translation memory;
  3. Sentence alignment results should be revisable. An intelligent editor for alignment results means that erroneous sentence alignment zones are marked and easily manageable in the tool;
  4. Alignment programs should support the creation of alignment projects which can later be run in a batch mode;
  5. Alignment programs should support various formats. The more native formats that can be read, the better;
  6. Segmentation should be user definable via segmentation rules and must be subject to influence by lists of abbreviations etc. In principle, everything applies that has been said on segmentation in section 3.2.1.3.
  7. The automatic classification according to the translation memory settings should be supported.

More and more users of CAT Tools are becoming aware of the importance of sentence alignment. Further refinements and extensions in this area are expected in the near future.
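To give an impression of how a length-based strategy with user-adjustable weighting works, the following sketch aligns two lists of sentences by dynamic programming. It is a simplified heuristic and not the algorithm of any particular product: the cost function, the penalties for the different alignment types and the `weights` parameter are all assumptions made for illustration.

```python
def align(src, tgt, weights=None):
    """Align two sentence lists using character lengths only.

    Returns a list of (source_indices, target_indices) pairs. weights lets
    a user override the penalty of an alignment type such as (1, 2).
    """
    penalty = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}
    if weights:
        penalty.update(weights)

    def cost(s_part, t_part, kind):
        ls = sum(len(s) for s in s_part)
        lt = sum(len(t) for t in t_part)
        # Relative length difference, scaled, plus the penalty for this type.
        return 10.0 * abs(ls - lt) / max(ls, lt, 1) + penalty[kind]

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj) in penalty:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(src[i:ni], tgt[j:nj], (di, dj))
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Follow the back pointers from the end to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]


# Hypothetical parallel text: the second source sentence was split in translation.
src = ["Switch off the printer.",
       "Open the cover and remove the empty cartridge.",
       "Close the cover."]
tgt = ["Schalten Sie den Drucker aus.",
       "Klappen Sie die Abdeckung auf.",
       "Entnehmen Sie die leere Patrone.",
       "Klappen Sie die Abdeckung zu."]
alignment = align(src, tgt)
print(alignment)
```

A real alignment tool would combine such length information with further evidence such as formatting, numbers, acronyms or terminology, and would mark low-confidence zones for revision.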

3.5 Alignment below sentence level (Word Alignment)

As already mentioned, there are two possibilities for alignment below the sentence level:

  1. Monolingual terminology extraction is used for filtering term candidates from documents. The resulting lists are used for the semi-automatic creation of term bank entries.
  2. Word alignment or, even better, bilingual terminology extraction is used for filtering term candidates from parallel texts. Again, the resulting lists are used for the


