Present and Future Needs in the CAT World
1. Introduction

Whereas in the past the automation of the professional translation process was mostly associated with machine translation (MT), this has changed significantly in the last few years. Today, the keywords for professional translators are computer-aided translation tools (CAT Tools) and, notably, one key component: the translation memory.[1] Modern CAT Tools, in most cases an integration of several functionalities into one “workbench”, are gaining more and more ground as a standard tool in the hands of professional translators. Except for literary translations or generally idiosyncratic text types, the use of CAT Tools has been extended to almost every type of translation work. This includes political, administrative, technical, advertising, biographical, and other text types.

Whereas the general idea of a translation memory is fairly simple, the practical realization of a functioning product is a rather complex task. This has mainly to do with the subtasks that such a system has to perform. The problem area “translation memory” covers many aspects of information science and linguistics, such as database design, retrieval technology, mapping of complex data (text) structures, client-server architecture, networking, support of language-dependent phenomena (character sets, tokenization, morphology, syntax), software ergonomics, etc. We are faced with a very interesting type of application, which presents users with a rather simple interface but has very complex internal workings underneath.

Up to now, the different user needs in this technology have not received enough attention. We can distinguish between a kernel set of functionalities of a translation memory and the functional extensions due to specific user needs. In combination with an overall broadening of the application area of translation memories, the functional extensions of such a system are becoming more and more diverse.
Therefore, we will first identify different user profiles and user needs (section 2) and then, against this background, discuss the technical aspects of the component parts of CAT Tools in section 3. A few observations concerning side-effects of using translation memories are given in section 4, and a final summary of developments to be expected in the future in section 5. Furthermore, we can observe in this area a rather confusing use of different notions, which often leads to misunderstandings. Therefore, we will try to introduce a few helpful notions and define more precisely the terms that are used unclearly.

2. Tools at the translator’s desktop

2.1 What is a translation memory?

The general idea of a translation memory is very simple: all translations made by a translator are stored in a database and are then immediately retrievable whenever the same text has to be translated again. This process can be subdivided into several phases:
2.2 What are the benefits of using CAT Tools?

There are several aspects to applying a CAT Tool to translation projects. Usually, tool vendors have three arguments at hand: 1. Large quantities of text can be translated faster (Quantity-Argument); 2. The quality of translation is increased (Quality-Argument); 3. Subsequent and similar translation projects can benefit from earlier work (Re-usability-Argument). These statements are certainly true, but they are rather general. Therefore, we will describe in more detail a few factors that play a major role in the application of this technology. These factors can later be used to distinguish between the different needs of different user profiles.

2.2.1 Repetition factor

From the explanation of how a translation memory system works, it is clear that translation memories find their main application in the translation of repetitive text material. It is important to distinguish between internal repetitions within a document itself and external repetitions in typical update translations, where the repetitions are inherent to a family of documents. We will refer to this phenomenon as the repetition factor.

2.2.2 Consistency factor

Another effect of the use of translation memories is that, by their nature, they enforce a higher consistency in translations. Especially in technical documentation, it is very important to address topics consistently and uniformly. This helps the reader to understand complicated information. In short, consistent wording is a precondition in technical writing, and translation memories support it automatically. The integration of a translation memory and a terminology database system (term bank system) can significantly improve this effect. This will be discussed in more depth when we come to the term bank part of a CAT Tool. We will call this phenomenon the consistency factor.
2.2.3 Reference factor

Very closely related to the consistency factor is the feature that every translation unit can be classified according to several information types, such as creation user, creation time, update user, update time, subject codes, notes, etc. Therefore, the retrieval of a translation unit can be accompanied by a “quality” mark. This leads to quality improvement in the translation by re-using revised and approved wordings. It is like using the translation of an authorized reference, and as such it leads to a standardization of translations. This factor will be called the reference factor.

2.2.4 Concordance factor

As can be seen from phase 4, a translation memory is progressively filled with translation units, embedding source and target language information in the form of sentences and subsentence units. From a linguistic point of view, a translation memory can therefore be described as a bilingual parallel corpus. In systems allowing more than one source or target language, this leads to the construction of multilingual parallel corpora. It is very useful to access this corpus in order to retrieve a translation unit by searching for one or several keywords. This function is commonly referred to as concordancing, or more precisely bilingual concordancing. Translation memories can be seen as a rich source of implicit terminology, compared to the storage of explicit terminology in a term bank system. In this sense, translation memories compete with term banks and, as we will see later, very successfully. We will refer to this topic as the concordance factor.

2.2.5 Terminology factor

The second important component part of a CAT Tool is a term bank.[2] Terminology management is a complex task that shares common properties with classical lexicography, but that has its own parameters. Proper terminology management is a costly task, but cannot be abandoned if a certain quality in translation has to be reached and maintained.
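The reference and concordance factors described in sections 2.2.3 and 2.2.4 can be illustrated with a minimal sketch. It assumes a translation memory held as a simple in-memory list; the class name `TranslationUnit`, the `approved` flag (standing in for the “quality” mark) and the function `concordance` are hypothetical names chosen for this illustration and do not describe any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TranslationUnit:
    """A source/target pair with the classification fields named in section 2.2.3."""
    source: str
    target: str
    creation_user: str
    creation_time: datetime
    update_user: Optional[str] = None
    update_time: Optional[datetime] = None
    subject_codes: List[str] = field(default_factory=list)
    notes: str = ""
    approved: bool = False  # hypothetical stand-in for the "quality" mark


def concordance(memory, keywords):
    """Bilingual concordance: units whose source contains every keyword.

    Matching is a plain case-insensitive substring test (an assumption of
    this sketch). Approved units are listed first, so that revised
    reference translations lead the result.
    """
    wanted = [k.lower() for k in keywords]
    hits = [u for u in memory if all(k in u.source.lower() for k in wanted)]
    return sorted(hits, key=lambda u: not u.approved)
```

Each hit carries both the source and the target text, so the translator sees how a keyword was rendered in earlier, possibly approved, translations.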
A key role in CAT Tools is played by the automatic search for terminology in the source translation unit. This is called terminology recognition. This notion should not be confused with terminology extraction, which means the automatic extraction of terminology from text material.[3] Terminology recognition replaces the manual search in databases. The system automatically draws the translator’s attention to existing terminology to be used. Therefore, two types of benefits can be expected from the CAT Tool: 1. the possibility of keeping track of specialized language terms; 2. the manual or automatic retrieval of terminology. This factor will be called the terminology factor.

2.2.6 Resource creation factor

There are three possibilities for the automatic creation of resources by CAT Tools:
2.3 Platforms used by translators

As with other professions, we can observe a certain predominant platform used by translators.[6] This is in most cases a standard PC. The standard application of translators is the standard word processor found on that platform. Exceptions to this majority can be found within translation departments of institutions or industries where other platforms are chosen for strategic reasons. In the field of complex advertising material, Macintosh-based platforms are predominant. Generally, we predict that in the near future most developments will concentrate on PC-based platforms. Due to the budget profile of translators, programs must run within the typical restricted PC environment that is affordable at present.

2.4 Text formats on translators’ desks

Translators are faced with an overwhelming variety of different text formats. The predominant formats vary from country to country, but at present a few global tendencies are visible. The format is also influenced by the type of document. It is not possible to list all the formats, nor is it possible to give a statistical overview. But in what follows, we try to show a few tendencies.

The most widespread format is Microsoft Word (Version 2/6/7)[7], used for several kinds of text types up to technical documentation. In the field of technical documentation, FrameMaker is gaining more and more importance, followed by Interleaf. In some countries, WordPerfect (Version 5.2/6.1) plays the role that Microsoft Word plays elsewhere. It is remarkable that WordPerfect has recently been losing ground. A certain tendency towards WordPro from Lotus is visible in the same countries. For advertising material, and generally for documents with a strong graphic arts orientation, Quark Xpress and PageMaker are often used.[8] Other formats play a subordinate role. Among tagged text formats, SGML/HTML plays the biggest role.
2.5 Conditions for the success of translation memories

The recent success of translation memories can be explained by several factors:
2.6 Profile of CAT Tool users

As mentioned before, the general market for the application of CAT Tools is broadening. In what follows, we will give a rough classification of typical user profiles and user needs in CAT Tools. This is not an exhaustive list and there are always exceptions to it.

2.6.1 Multilingual societies

The use of CAT Tools is not only determined by different user profiles, but also differs significantly from country to country. Especially in countries with more than one national language, an overall sensitivity towards translation leads more quickly to automation of the translation process. Another factor is translation costs. In countries with high translation costs, a faster introduction of rationalization software can be observed.[9]

2.6.2 Localization industry

In the past, CAT Tools have predominantly been used by a specific group of translators. This group can be described as “translators translating manuals of software products”, and they have therefore been close to software in general. This group is commonly referred to as the localization industry. They are the initial user group and they have played a major role in influencing the development of CAT Tools. For the localization industry, the repetition and consistency factors are of major importance.

2.6.3 Industrial users

Clearly, not only software manuals but every type of technical documentation is a good candidate for a translation memory system. This applies to all kinds of industries: from general engineering to pharmaceuticals, food production, electronics, consumer products, the automotive industry, aeronautics, etc. The size of the companies is not as important as the international marketing strategies for their products. Another factor is product liability legislation, which enforces properly localized documentation for targeted markets.
As within the localization industry, the repetition and uniformity factors play the biggest role in applying translation memory technology.

2.6.4 Banking, insurance and documents dealing with legal matters

Banks and insurance companies have several reasons to consider the application of translation memories in order to solve their translation problems:
Applications in the field of translating legal matters are very similar. For example, patent attorneys apply the technology very successfully in this field. There is some overlap here with the user type “institution”. In contrast to the former fields, we observe more stress on the terminology and reference/consistency factors than on the repetition factor.

2.6.5 Military

Communication within the military field is driven by the aim to be as exact as possible in descriptions of international actions that have to be taken. Besides, military texts have a lot in common with institutional texts as well. Strong emphasis lies here on document protection and on preserving the degree of confidentiality. This influences the way in which translation units are allowed to be stored. The most important factors are the terminology factor, the reference factor and the repetition factor.

2.6.6 Media / Multimedia sector

The multimedia sector is influenced by a tendency towards globalization and centralization. More and more translation tasks arise here. Compared to other types of users, the degree of overall text repetition does not play a major role. The stress lies more on the terminology factor and the concordance factor. In the multimedia world, the orientation towards the Internet will produce more and more translation tasks involving HTML documents in the near future.

2.6.7 Institutions

A big user group can be identified within national, European and international institutions. This has to do partly with the existence of repetitive texts and especially with terminological problems within European and international communication. The construction of extensive term banks can mainly be found at institutional sites. But here, a few problems are visible: 1. The creation of term banks is extremely costly; 2. The proper administration of term banks is a complex task; 3. Terminology develops so fast in certain fields that it is almost impossible to keep track of it as quickly as is required.
It is interesting that the concordance factor as a “side-effect” of CAT Tools is extremely helpful in this context. During a test phase of the TRADOS Translator’s Workbench for Windows at the European Commission, it turned out that concordance access to a translation memory with only 28,000 translation units was in many cases more helpful for retrieving terminology than access to Eurodicautom, the biggest multilingual term bank in the world (over 600,000 entries). Therefore, we observe here a certain tendency towards the concordance factor and the resource creation factor. In institutions with important political text material, the reference and consistency factors play a major role. In institutions with standardized report documents, the repetition factor plays an important role. Some institutional documents tend to be subphrase repetitive. That means that the repetition can be found only in a part of a translation unit. Up to now, there is no commercial system dealing properly with this problem (see also section 5.4).

2.6.8 Translation agencies

For several reasons, a clear-cut description of the needs of this user type is rather difficult. This has to do with the diversity within this group of users. In the case of translation agencies doing outsourced work for industry, the conditions mentioned under industry apply. Ideally, it makes no difference whether, for example, automotive documents are translated at production sites or within a translation agency. Because of a general tendency towards outsourcing, we will find many instances of CAT Tool applications in the industrial field at the level of translation agencies. The same is true if localization is done by a translation agency or if a translation agency works mainly for an institution. Translation agencies handling several clients face the problem that they are highly dependent on their clients concerning the format and the type of the documents.
In a few cases, current CAT Tools do not yet support the format in question; in other cases, the documents are not even available in machine-readable form. Whereas in other areas the CAT Tool can be “tuned” to the type of text in question, within translation agencies the handling of the CAT Tool must be more flexible. This task of managing the CAT Tool has recently led to a new professional profile in translation agencies, the so-called IT-manager.

A distinction has to be drawn between translation agencies whose translations are made by in-house translators and translation agencies that manage freelance translators with or without in-house revisers.[10] For the first group, the above-mentioned conditions on CAT Tool use apply. For the second group, other conditions apply. Managing freelance translators requires several functionalities from a CAT Tool, such as “pretranslation” and “off-line updating” (see technical section 0), as well as term bank functionalities in the direction of print or electronic publishing. For this type of translation agency, quality assurance can depend heavily on the proper use of CAT Tool applications.

2.6.9 Freelance translators

Up to now, the use of a full-fledged CAT Tool by freelancers has been relatively rare. If there is a tool, in most cases it will be a term bank system. This is related to the fact that professional CAT Tools are in the price range of typical professional software products.[11] Freelance translators are by tradition very conservative with professional investments. On the other hand, more and more freelancers will be confronted in the near future with pre-processed documents coming from CAT Tools, or they will be temporarily forced by their work suppliers to use a CAT Tool. In this area we can identify at least three problematic topics: 1. Who owns the copyright of a translation memory? 2. Translation memories make freelance translators more replaceable; 3. We observe a changing policy in translation pricing in combination with translation memories (see also chapter 4). In the market, it can already be observed that skills in mastering this new technology are appreciated. It can be predicted that, as in other professions, investments in the professional environment will become necessary for freelance translators.

2.6.10 Terminologists

In parallel to modern corpus-based lexicography, terminographic work can profit greatly from concordancing in specialized language translation memories. Up to now this has been a relatively underdeveloped field, but it is predictable that it will play a major role in the future. Terminologists are therefore mainly interested in the concordance factor and the advanced resource creation factor.

2.6.11 Non-professionals

Up to now, non-professionals have not been using CAT Tools. In principle, it is foreseeable that applications will emerge in which such tools assist in casual translations on the basis of existing translation memories.

3. Technical requirements: Component parts of CAT Tools

3.1 Translation memory

A translation memory is a database storing translation units. The complication of this simple definition derives from the following requirements:
3.1.1 How to cope with “similarity”?

“When are two sentences similar?” This is a very tricky question. There could be misspellings, differences in formatting, differences in the use of punctuation marks, differences in embedded elements such as index markers, morphosyntactic differences, syntactic differences, etc. But a human being needs only a short time to say: “Yes, they are similar”. How can this be done with computers? Certainly not with classical computation based on “0/1”, “yes-no”, “true-false”, “black-white” approaches. Computation beyond Boolean logic has traditionally been called fuzzy. The problem with this term is that, outside of mathematical definitions, it has grown into something very vague. Even in a standard microwave handbook, you will find today a remark saying that the making of yoghurt is due to the application of fuzzy technology.

In modern computer science, there are successful approaches to similarity problems in areas such as picture recognition, error-tolerant retrieval of DNA chains, robotics, etc. Neural networks and sparsely coded matrices in particular are a useful means of attacking similarity problems. Interestingly, the same applies to the retrieval task within translation memory systems. Whereas the first generation of the TRADOS translation memory system, for example, was based on a classical “0/1” approach, performing (linguistically motivated) substring operations on classical database indices, the current generation employs a sparsely coded matrix approach. The advantages are evident: phenomena like misspellings and complicated syntactic deviations are now manageable, and the access time has been significantly reduced. Only after a proper technique for error-tolerant retrieval had been introduced did additional functionalities such as concordancing become possible.

For simplicity, the notion fuzzy-match is used within the CAT Tools world to indicate the measure of similarity of two source translation units.
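To make the notion of a fuzzy-match value concrete, here is a minimal sketch of a similarity score between two source translation units. It is not the sparsely coded matrix approach described above: as a stand-in technique it uses the sequence matcher from Python's standard `difflib` module on word tokens. The function names, the 0 to 100 scale and the threshold of 70 are assumptions of this illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lower-case word tokens; punctuation and case differences are ignored."""
    return re.findall(r"\w+", text.lower())


def fuzzy_match(a, b):
    """Similarity of two source translation units as a value from 0 to 100."""
    ratio = SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()
    return round(100 * ratio)


def best_matches(query, memory, threshold=70):
    """Return (score, source, target) tuples at or above the threshold, best first.

    `memory` is assumed to be a list of (source, target) pairs.
    """
    scored = [(fuzzy_match(query, src), src, tgt) for src, tgt in memory]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

Because the comparison works on whole words, a misspelled word counts as a completely different token; a character-level or index-based measure would be needed to tolerate misspellings in the way the text describes.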
It is important to understand that the fuzzy-match value is only a relative notion: a higher fuzzy-match value simply means higher similarity. Good implementations of similarity measures are very important for all types of users. In fact, this is the central feature of a translation memory system. We will come back to the underlying techniques in section 3.1.4.

3.1.2 Interactive access

When a translator wishes to translate a translation unit, the response time of the system is crucial. Acceptable response times are in the range of up to 1 second. The response time depends on:
3.1.2.1 Processing power

The power of computers is steadily increasing, which is very welcome in the case of translation memory systems. In general, the faster the computer is, the better the access time and the bigger a translation memory can be. In this respect, time is on the side of translation memories.

3.1.2.2 Size of translation memories

Working translation memories

On systems with traditional data access that do not use error-tolerant retrieval technology, access time is a big problem. The standard solution here is to pre-process the text in order to derive a smaller working translation memory[12] which contains all translation units that are close to the text. Therefore, a decision has to be made on a threshold similarity value, defining which translation units are sufficiently similar to be placed in the working translation memory. The pre-processing works in the following way:
The interactive access then works on this smaller translation memory. Working translation memories are a work-around with the following disadvantages:
Master translation memory

In modern systems based on sparsely coded matrices, truly interactive work on big translation memories is possible. Typical big translation memories hold around 100,000 translation units. Since this technology is currently in a rather dynamic development phase, memories in the range of 500,000 to 1,000,000 translation units will be possible by the end of this year. Current research estimates speak of possible enhancements of 20% to 40%, that is, correspondingly bigger translation memories at constant access times.[13]

The ideal situation for all user groups is a translation memory system that is based on modern technology and at the same time allows for both the creation of a working translation memory and direct access to the master translation memory.[14] A combination of both could be interesting: translating with the help of a working translation memory and at the same time concordancing in a master translation memory.[15]

3.1.2.3 Additional processes

Other processes can be running on the system concurrently with the main translation memory process. These could be, for example, term recognition processes or the passing of translation units to an attached machine translation system. Here again, the response time can degrade, possibly significantly. Possible solutions in this respect are:
3.1.2.4 Data exchange with the front end

Normally, the data exchange between word processor and translation memory system is sufficiently fast. This applies to all possible architectures for front-end integration (see also section 3.2).

3.1.3 Batch processing

In many cases it is useful to perform certain tasks in a batch. That means that the translation memory system runs certain non-interactive processes in order to compute certain results. We distinguish the following off-line processes and explain their aim and when they should be considered.

3.1.3.1 Preparation of a working translation memory

The preparation of a working translation memory has already been addressed in section 3.1.2.2. The aim is the generation of a smaller working translation memory. This could be necessary if the computer environment does not allow fast enough access to a big translation memory. The use of working translation memories only makes sense if a versatile update mechanism is available to resynchronize the changes in the working translation memory with the master translation memory. The disadvantages of the approach have already been mentioned in section 3.1.2.2, as well as the benefits for different user groups.

3.1.3.2 Pretranslation

Pretranslation, as we understand it, is the off-line replacement of all 100% matches, all fuzzy matches down to a certain threshold, and possibly the terminology detected by the terminology recognition process. For ergonomic reasons, the results of the pretranslation process should be marked by the system, preferably by the use of colours. Pretranslation has the following advantages:
Pretranslation is a process which is needed by nearly all user groups. Especially for translation agencies supplying work to freelancers, this is a very important feature.

3.1.3.3 Analysis of repetitivity

Repetitivity analysis is a process that runs a document against a translation memory and computes statistics of the encountered 100% matches, fuzzy matches and internal repetitions. In addition, word counts and translation unit counts, as well as the overall statistical distribution of items in a document, should be reported. It is important that the segmentation of the text is sufficiently powerful to do proper word and translation unit segmentation and that it treats “non-translatable” items correctly. Non-translatable items are, for example, graphics and automatic field codes (automatic numbering, dates, etc.) that are normally not translated but simply placed in the target translation unit by the translator. They are therefore called placeables.[16]

The capability of computing the delta of the similarity of a set of documents is very important. In short, computing deltas answers questions like: “What would happen if I first translated this document with the translation memory system and then the following one?” Repetitivity analysis plays an increasing role in negotiating the prices of translation projects. For that reason, the statistical analysis in counting words, translation units, repetitions, etc. must be exact. This feature is in itself very important for all users who are doing pricing. In addition, for all user groups where the repetitive character of documents is not easy to measure (e.g. for users at the institutional level), delta computing is a means of making estimates objective.

3.1.3.4 Analysis of frequent occurrences of translation units

The detection of all translation units occurring more than a certain number of times in a document can be very helpful. A list of these translation units can be translated in isolation.
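The repetitivity analysis of section 3.1.3.3 and the detection of frequent translation units can be sketched as follows. This is a simplified illustration under stated assumptions: the document is already segmented into translation units, the word count is a plain whitespace split, placeables are not handled, and similarity is measured with Python's standard `difflib` as a stand-in for a real fuzzy-match engine. The names `analyse` and `frequent_units` and the threshold of 70 are hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity in percent (stand-in measure, not a product algorithm)."""
    ta = re.findall(r"\w+", a.lower())
    tb = re.findall(r"\w+", b.lower())
    return round(100 * SequenceMatcher(None, ta, tb).ratio())


def analyse(segments, memory_sources, fuzzy_threshold=70):
    """Count exact matches, fuzzy matches, internal repetitions and unmatched units."""
    memory_set = set(memory_sources)
    stats = Counter()
    seen = set()
    for seg in segments:
        stats["units"] += 1
        stats["words"] += len(seg.split())
        if seg in memory_set:
            stats["exact"] += 1  # 100% match against the memory
        elif seg in seen:
            stats["internal_repetition"] += 1  # repeated within the document
        elif max((similarity(seg, s) for s in memory_sources), default=0) >= fuzzy_threshold:
            stats["fuzzy"] += 1
        else:
            stats["no_match"] += 1
        seen.add(seg)
    return stats


def frequent_units(segments, min_count=2):
    """Units occurring at least min_count times, most frequent first."""
    return [(s, n) for s, n in Counter(segments).most_common() if n >= min_count]
```

A delta between two documents could be estimated with the same function by adding the segments of the first document to `memory_sources` before analysing the second.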
These translations then pre-fill the translation memory with the highly repetitive parts of the document. This is only possible if a translation “out of context” is feasible, which is frequently the case with technical documentation. As a counterpart, it is useful to export all translation units for which no similar unit (above a certain threshold value) can be found. They can be pre-processed by an automatic translation system in order to speed up the translation project.[17] This is only of interest if machine translation plays a role for a user group.

3.1.3.5 Updating and revisions

The revision of translation memories plays a very important role. All user groups need this feature. Revision is especially important if the reference and consistency factors play a big role. Unfortunately, only a few current systems properly support all facets of revision procedures. But first, a summary of the technical possibilities.

In interactive translation memory systems, updating is done automatically when the user accepts a translation unit. Revisions can easily be done by “re-opening” a translation unit. This is the ideal situation, since the updated translation unit is immediately visible to all users of the translation memory system. Another important requirement is the direct update of the translation memory without using the front-end. This can be achieved by concordancing and editing in the concordance results, that is, directly in the translation memory. Only this procedure enables revisers to get an immediate overview of the topic in question. Up to now, only the TRADOS system supports this functionality.

In non-interactive translation memory systems, as well as when using working translation memories, an explicit update has to take place in the form of a batch process when the translation project has been finished.
Source preserving systems

A distinction should be made between systems that keep the source translation unit in the document in a “hidden” form and systems that do not. During the translation process, the first type of system creates a bilingual document in which the original source translation units are hidden. We will call this a source preserving system. Source preserving requires a “clean-up” step: after a translation project has been finalized, all source translation units have to be deleted from the document. This is normally done by an update procedure. Although both types of systems allow for updating working translation memories, only source preserving systems are flexible enough for off-line revisions. Off-line revision means:
To clarify this again: documents that have been translated with source preserving systems and have not been cleaned up embed a translation memory in themselves. Therefore, giving away such a document is like giving away a translation memory.[18]

3.1.4 Search engine and data storage

A lot of misunderstandings derive from the fact that no sufficient distinction is drawn between the search engine and the data storage within translation memory systems. The search engine is responsible for the retrieval of similar translation units, and the data storage is responsible for the physical storage of translation units. Physical storage can be done with every kind of database system, from standard data structures up to SQL servers. The confusion arises from systems that operate directly on the indices of the chosen data storage system and in which the intelligence for performing error-tolerant retrieval is closely intertwined with the data storage engine. From the point of view of attacking the “similarity” problem, this is rather inefficient and not a problem-oriented solution. Arguing in favour of a translation memory system by pointing to its data storage engine is therefore rather misleading. Data storage in fact plays only a minor role in judging a translation memory system. It has more to do with security and networking than with the main task of a translation memory system: the fast retrieval of similar translation units.

More crucial for the behaviour of a translation memory system is the architecture of the search engine. As already pointed out in section 3.1.1, sparsely coded matrix approaches (as a subtype of neural networks) are currently state of the art. From traditional search engines, which in the case of translation memory systems are characterized by linguistically motivated string operations on data storage indices, we cannot expect significant improvements in the future.[19] The contrary applies to matrix approaches.
Even if this technique is only at an early stage of development, it is already ahead of the classical approach. It is certain that research will lead to significant enhancements in the near future. Current prototype developments show that retrieval can be improved by 20 to 40% (see also the explanations in section 3.1.2.2).

3.1.5 Networking
Another area of misunderstanding is the network capabilities of current translation memory systems. In an ideal world, the following client-server scenario can be depicted: if a large number of users search for different source translation units at the same time, a translation memory server should be capable of providing them with the required set of target translation units in near real time. Up to now, such an architecture does not exist on the market. Solutions in the client-server area are only available on the basis of batch processing, where a server creates a temporary working translation memory for each user, which is then copied to the user’s workstation. The disadvantages of working translation memories have already been mentioned in section 3.1.2.2. Interactive solutions are up to now only available using file-sharing architectures. Here a large field remains for future improvements. Generally, it can be said that for the majority of current users the file-sharing architecture is a sufficiently powerful solution. If a big group of users has to share a translation memory, a good choice is a system which fits the major needs and which ensures further development in the client-server direction.[20]

3.1.6 Additional information stored in translation memory systems
To facilitate the interpretation of target translation units, they should be classified according to additional information.

3.1.6.1 Formatting information
Very important is the preservation of formatting information in order to:
3.1.6.2 Administrative information
Administrative information should be linked with a translation memory. Requirements here are:
In principle, a translation memory is subject to the same classification requirements as a term bank system. A translation memory system has to support this feature. This is a major functionality for all user groups.

3.2 Front-end: integration into word processors
The notion front-end means the application with which the translator controls the translation memory system. This should normally be a standard word processor. But there are still some old-fashioned systems on the market that provide their own editor:

Translation memory systems with own editor
Some translation memory systems force the user to work in a built-in editor in which she or he has to translate the text. This has turned out to be an unacceptable solution for translators. The disadvantages can be briefly summarized:
The only advantage that systems with their own editor can offer is that an integrated editor allows for more control over the user. But this is already changing due to new developments in the area of standard word processors and, all in all, it can already be seen that this type of tool will disappear from the market.

Translation memory systems integrated into standard word processors
The current state of the art is the integration of translation memory systems into standard word processors like Microsoft Word for Windows (Version 6 or 7) or WordPerfect for Windows (Version 6.1).

3.2.1 What does “integration” mean?
Integration into an existing word processor can be done in several ways. It is not sufficient to say that a system is integrated; it is more important to see how this is done and how far the integration goes. Again, different solutions have advantages and disadvantages for users:

3.2.1.1 Internal and external integration
The first distinction which has to be drawn concerns the dependence of the translation memory system on the word-processor environment. External integration means that the translation memory runs as an application independent of the word processor, using its own windows to display retrieval results and its own menus for the manipulation of the translation memory system. Internal integration means that the translation memory system is completely integrated into the word processor, using only the means for display and manipulation that the word processor offers.

Internal integration
Internal integration has the advantage that the application appears to end users as a functional extension of the word processor. On the other hand, there are several disadvantages:
External integration
Externally integrated systems appear to users as applications of their own. This has the disadvantage that users are forced to use a tool outside their well-known word processor. On the other hand, translation memory systems integrate a lot of functionalities, so that bundling all functions into one running application seems more natural than “misusing” the already densely packed menu structures of modern word processors. As already mentioned, a translation memory system confronts the user with a great deal of additional information on the screen. A fixed window with its own formatting means, like brackets, marking, use of colours, etc., is easier for users to interpret.[21] The disadvantage here is that bigger screens are needed for this type of application.[22] External integration has the advantage that quicker upgrades to new platforms are possible, since only the part that handles communication with the word processor has to be reimplemented. This is an important point to consider when purchasing a system. All in all, and especially from an ergonomic point of view, users will benefit from external integration.

3.2.1.2 Indirect integration
In some cases the front-end used for the creation of documents seems rather complicated to translators. This is especially the case with desktop publishing systems used for technical translations, such as FrameMaker or Interleaf. In this case it is sensible to convert from the desktop publishing system to the standard word processor translators are familiar with. This type of integration is called indirect integration.
Powerful conversion tools have recently been developed which smooth the complex format of desktop publishing systems into a format that translators can work with.[23] The advantage of staying in the normal word-processor environment outweighs the disadvantage of the two conversion steps involved.[24] Another area for successful indirect integration is SGML/HTML.

3.2.1.3 Depth of integration
Big differences can be observed in the depth and sophistication of integration.

Coverage of constructs
As already mentioned, a translation memory system identifies a source translation unit and retrieves a similar or identical target translation unit. But what does the system identify as a source translation unit? Here, we can see some deviations: a proper translation memory system should support all constructions foreseen in the word-processing system, like tables, footnotes, endnotes, field codes, frames, columns, embedded objects, pictures, indices, revision codes, etc. This is a very important aspect for all types of users. They have to verify that a given tool properly supports all constructions they have to deal with in their daily documents.

Segmentation
Another very important aspect is the segmentation capabilities of a translation memory system. User acceptance will be very low if users are bothered with segmentation errors. Experience in computational linguistics has shown that segmentation is not at all a trivial task. There are ambiguities that cannot be decided exactly (e.g. a punctuation mark after numbers); there are language dependent phenomena (e.g. the semicolon in Greek, abbreviations in Finnish); and there are document and user type dependent phenomena (e.g. the treatment of tabulators, semicolons, etc.). The only way to overcome this problem is to open the segmentation to users, so that it can be adapted to languages, to preferences in ambiguous cases and to document inherent phenomena.
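What “opening the segmentation to users” could look like can be illustrated with a minimal sketch in Python. It is not the implementation of any actual product: the `SegmentationRules` class, its field names and the `segment` function are hypothetical and introduced here only for illustration. The sketch assumes that segments end at a configurable set of terminator characters, that tokens on a user-defined abbreviation list never end a segment, and that the ambiguous case of a full stop after a number is decided by a user preference.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SegmentationRules:
    """Hypothetical user-editable settings (names are assumptions, not a real API)."""
    terminators: str = ".!?"  # characters that may end a segment
    abbreviations: set = field(default_factory=lambda: {"e.g.", "etc.", "Dr."})
    break_after_number: bool = False  # False: "3." is read as an ordinal, not a segment end


def segment(text, rules):
    """Split text into segments according to the user's rules."""
    segments, start = [], 0
    for match in re.finditer(r"\S+", text):
        token = match.group()
        if token[-1] not in rules.terminators:
            continue  # token does not end with a terminator
        if token in rules.abbreviations:
            continue  # user-listed abbreviation: no break
        if token.endswith(".") and token[:-1].isdigit() and not rules.break_after_number:
            continue  # ambiguous number + full stop: follow the user's preference
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)  # trailing text without a terminator
    return segments


text = "See Dr. Smith on the 3. floor. He is in."
print(segment(text, SegmentationRules()))
# ['See Dr. Smith on the 3. floor.', 'He is in.']
print(segment(text, SegmentationRules(break_after_number=True)))
# ['See Dr. Smith on the 3.', 'floor.', 'He is in.']
print(segment("First part; second part.", SegmentationRules(terminators=".;")))
# ['First part;', 'second part.']
```

The same text yields different segments depending on the rule set, which is the point of making the rules user-adjustable. A deliberately simple rule set like this one still fails in some cases (for example, an abbreviation that happens to end a sentence), which is why interactive correction of segment boundaries remains necessary.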
Therefore a translation memory system should allow users to define their own segmentation rules.[25] Another factor in segmentation is the possibility for users to define lists of abbreviations, ordinal followers, etc. As can already be seen from the ambiguous cases, segmentation errors are sometimes unavoidable. Therefore a system should allow for interactive shrinking and expanding of source translation units in order to specify exactly the size of a translation unit. If this is not possible, users will be displeased after a while. In many texts, numbers are frequent, and changes to the document often consist only of changing the numbers. Automatic exchange of numbers and other invariable constructions is a big help. This functionality is especially important for users in the banking area. But users must at the same time be able to deactivate this function if it does not apply to a certain document type. Segmentation must also foresee a means for exclusion. The term exclusion means that parts of the document can be marked so that they are excluded from treatment by the translation memory system. This must be possible at the paragraph level (e.g. to exclude foreign language citations or parts of programming language code) and at the character level (e.g. to exclude invariable elements like proper names in biographical documents). This feature could be, for some projects, a precondition for the successful application of a translation memory system. In the subfield of the automatic detection and conversion of date and number formats, many improvements are possible in future developments.

3.2.2 What does “front-end independence” mean?
The storage of translation units in a translation memory should be independent of the front-end. That means that, in an extreme case, a user can translate part of a document using e.g. Microsoft Word and the rest within WordPerfect, operating on the same translation memory.
This is an important feature for the use of translation memories if:
In principle this feature is valuable for all users, since changes to front-ends are happening constantly in the very dynamic software field. Although this condition appears to be simple, its technical realization is complex, because the formatting conventions of different front-ends have to be mapped onto one single representation in a translation memory system. If a translation memory system supplies front-end independence, this indicates highly sophisticated format management.[26]

3.2.3 A special front-end: concordance access
As has been mentioned before, concordancing provides access to translation memories. In this sense concordancing is a front-end of its own, enabling tasks like:
In this field, many improvements can be expected in the future, e.g. using filters for concordancing, displaying selective parts of a translation memory via concordance windows, concordance access to more than one translation memory, etc. Concordance access is important for all users, but especially for institutions and terminologists. It has to be remarked that only modern systems with sparse matrix technology allow for concordancing.

3.3 Terminology databases
Terminology database systems, as applications, are older than translation memory systems. They are an integral part of CAT Tools and serve the maintenance of terminology. Terminology management is a complex task and this paper will not address all facets of terminology systems. Only a short overview of requirements and user needs will be given.

3.3.1 Central features of term bank systems
There are certain preconditions for a term bank system which are absolutely necessary for the successful management of terminology:
In short, since terminology can play a key role for certain types of CAT Tool users, a CAT Tool should offer a full-fledged term bank system as an application of its own. It must be possible to work only with the term bank without the translation memory and vice versa. If a CAT Tool does not offer a sufficient solution for terminology management, this can be dangerous for users, who will probably have to convert to a real term bank system later. It is in no way a solution to have two term bank systems running in parallel, one real application for terminology management and a second one for the translation memory system.

3.3.2 Terminology recognition
If a source translation unit is passed to the translation memory system, the system can automatically search for existing terminology in the unit. This is called terminology recognition. Terminology recognition differs between systems in certain ways:
In the case of pretranslation via batch processing, recognized terms should be automatically replaced or inserted into the document. This makes it possible, for example, to provide external translators with a dictionary of terms for a given document that is already included in the document. For certain user groups, e.g. translation agencies supplying freelance translators, this is an important functionality.

3.4 Alignment of Sentences
Sentence alignment is a recycling process. Old text material in the form of parallel texts (texts which are translations of each other) is transformed into a translation memory. Sentence alignment can be based on several strategies, like statistics on the distribution of words or characters, formatting information, or heuristics like numbers and acronyms that appear in a text. Other interesting approaches could be pattern recognition strategies. In all cases alignment results contain errors. This paper does not intend to go into the details of sentence alignment strategies and error conditions, but we can state the following requirements for the improvement of the sentence alignment component of CAT Tools:
More and more users of CAT Tools are becoming aware of the importance of sentence alignment. Further refinements and extensions in this area are expected in the near future.

3.5 Alignment below sentence level (Word Alignment)
As has already been mentioned, there are two possibilities for alignment below the sentence level: