In this issue…
Building Machine Translation on a Firm Foundation
How can we build the next generation of machine translation systems on a firm foundation? We should build on the current generation of systems by incorporating proven technology - i.e. we should emphasize indicative-quality translation where appropriate and high-quality controlled-language translation where appropriate, leaving other kinds of translation to humans. This article, which is based on a paper given by the author at the 1996 Aslib conference on Translating and the Computer, makes five practical proposals to machine translation vendors for enhancing current machine translation systems. Some of these enhancements will also benefit human translators who are using translation technology.
Theoretical Framework
Unless astounding breakthroughs in computational linguistics appear on the horizon, the next generation of machine translation systems is not likely to replace all human translators or even reduce the current level of need for human translators. We should therefore build the next generation of systems on the current generation, looking for ways to further help both human and machine translation benefit from technology that has been shown to work. Certain aspects of currently understood technology have not yet been fully implemented and can provide a firm foundation, both for the further development of existing systems and for implementing new systems.
My five practical proposals are based on the assumption that the result of putting a text through a machine translation system largely depends on the type of text being translated. Machine translation has always done best on narrow texts restricted to a single domain of knowledge, with predictable meaning and rigorously controlled use of syntax and vocabulary. This type of text is sometimes called a controlled domain-specific text. There is also another text type called a dynamic-general text (see Melby and Warner in the references for details of the contrast between the two).
Although most texts fall somewhere between these two extremes, the more dynamic and general the elements found in a text, the more difficulty the text presents to machine translation.
Practical Proposals
1. Consumer Labeling for Machine Translation Systems
My first proposal applies both to current and future machine translation systems. Machine translation vendors should be open about what kind of text their system is intended to translate, and for what purpose. Presently, there are two very different uses for machine translation.
One major use for machine translation is for what I call "indicative translation" (others have called it "gisting" or "information gathering"). An indicative translation is not intended to be a high-quality translation comparable to the work of a professional human translator. Its purpose is only to provide a rough indication of the content of a document written in a language that the user does not read. An indicative translation into the user's first language may satisfy the needs of the user relative to the document in question, or it may be used as a basis for deciding whether to commission a high-quality human translation of the document.
A machine translation system for indicative translation can take on any kind of text and is likely to produce a somewhat useful, though rather ugly, result. The ugliness usually stems from how the raw output is produced. The source language is manipulated by replacing source-language words with target-language words. As Humphreys put it at the 1991 Aslib conference on Translating and the Computer (5), such systems perform "extensive cosmetic surgery" on the source text, rather than really translating it. This description may be overly unkind, but nevertheless, the result is a non-natural language which is interpreted by the user as natural language thanks to his or her intelligence, not the intelligibility of the raw output.
A second major use of machine translation is for what I call "publication-quality translation". A publication-quality translation must be comparable to a professional human translation intended for publication. Raw machine translation is seldom up to publication standard before post-editing by a skilled human bilingual. In using machine translation as a step towards publication-quality translation, an important economic and human issue is the amount of post-editing required to bring the raw output up to publication standards. If the gap between the raw output and the finished product is too wide, the post-editing process will be excessively expensive for the user and cruelly tedious for the human post-editor.
Current machine translation systems for publication-quality translation are highly sensitive to text type. Usually, in order to keep post-editing down to a reasonable level, the source text must be an example of controlled language, such as a Xerox photocopier maintenance manual. If the source text is more general, then it must be highly predictable and not require reference to pragmatics for its interpretation. More likely, however, a machine translation system for publication quality will be tailored to one knowledge domain at a time. This tailoring involves access to a domain-specific terminology database.
A combination of indicative and publication-quality translation is conceivable. Some human post-editors, but not all, are able to make limited corrections to an indicative translation, keeping the cost of post-editing down while making the text somewhat more readable.
Just as food is labeled in some countries to indicate nutritional content as a service to the consumer, perhaps a machine translation system should be clearly labeled to show the type or types of text it is intended to translate, whether it should be expected to produce indicative translation or provide a basis for publication-quality translation, and what amount of post-editing should be expected.
Of course, no system can produce useful output if the dictionary is not well-made and appropriate. But it is unlikely that a single machine translation system would be equally suitable for both indicative and publication-quality translation. A publication-quality system would be lost in the variety of texts presented to an indicative translation system, and an indicative translation system would not be able to produce sufficiently high quality raw output because it would not take advantage of the properties of a controlled-language text.
Software localization requires publication-quality translation. Thus an indicative quality machine translation system, although useful for some tasks, will not be useful as a tool in software localization.
2. A Formal Definition of Grammar for Controlled Languages
Consider a machine translation system that is labeled for use as a tool in the process of producing publication-quality translations of controlled-language texts. Such systems usually attempt a complete syntactic analysis of each sentence of source text. As mentioned above, the successful use of such a system requires minimizing the amount of post-editing required on the raw output. One factor that heavily influences the ability of the system to perform syntactic analysis, directly affecting the quality of the raw output and thus the amount of post-editing, is the degree of match between the syntactic structure of the source sentences and the formal grammar that defines what structures the system is expecting. Note that the system is not expected to accept all sentences of the source language that a human would accept. By definition, a controlled language is a formal language which resembles naturally produced human language but is not identical to it.
My second proposal is that computational linguists create an explicit formal definition of syntax for controlled languages that can be studied independently of any particular computer implementation.
This definition could be as simple as a context-free grammar such as those used in computer science to define the syntax of programming languages. Another possibility would be to use some branch of mainstream generative grammar, such as the principles and parameters approach or the head-driven phrase structure approach, or a branch of dependency grammar. Along with the core definition would be a mechanism for modifying the definition to allow or disallow certain constructions and for testing the modified grammar for proper format and internal consistency. Essential to this already substantial task is the non-trivial task of developing verification software that can be used to check a source text for compliance with a formal definition. Some computer programming languages have such verifiers, sometimes called "lint" programs. This controlled-language syntax checker should be used as early on in the document production chain as possible, preferably by the author. In some environments, the syntax checker could even be integrated with the word processor that is used by the author to create the text. What is called a grammar and style checker in today's word processors does not go far enough toward a complete syntactic analysis, but is certainly an important step along the way. The grammar of a controlled language should be carefully crafted to strike a balance that reduces ambiguity without so limiting the range of allowable constructions that authors feel suffocated. Some syntax checkers do exist, but they are still too proprietary. I am calling for a public standard for defining the grammar of a controlled language. The same grammar would be used by both a syntax checker and a controlled-language machine translation system. This is a big project, but an appropriate one for a university, where it could be viewed as a public service. 
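As a rough illustration of what a shared grammar and checker might look like, here is a minimal sketch in Python. It assumes the simplest option mentioned above, a context-free grammar, written here in Chomsky normal form and checked with the standard CYK recognition algorithm. The grammar, the vocabulary, and the names `complies` and `lint` are all invented for this example and do not come from any existing controlled language or product.

```python
# Toy controlled-language grammar in Chomsky normal form.
# Every rule and word below is hypothetical and chosen only for illustration.
BINARY_RULES = {
    ("NP", "VP"): {"S"},    # S  -> NP VP
    ("DET", "N"): {"NP"},   # NP -> DET N
    ("V", "NP"): {"VP"},    # VP -> V NP
}
LEXICON = {
    "the": {"DET"}, "a": {"DET"},
    "operator": {"N"}, "cover": {"N"}, "lever": {"N"},
    "opens": {"V"}, "removes": {"V"}, "presses": {"V"},
}


def complies(sentence: str) -> bool:
    """Return True if the sentence can be derived from S (CYK recognition)."""
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that span words[i : i + j + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][0] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            for split in range(1, span):
                left = table[start][split - 1]
                right = table[start + split][span - split - 1]
                for b in left:
                    for c in right:
                        table[start][span - 1] |= BINARY_RULES.get((b, c), set())
    return "S" in table[0][n - 1]


def lint(sentences):
    """Report (position, sentence) pairs that fall outside the grammar."""
    return [(i, s) for i, s in enumerate(sentences, start=1) if not complies(s)]


if __name__ == "__main__":
    draft = ["The operator removes a lever.", "Opens the cover."]
    for position, sentence in lint(draft):
        print(f"sentence {position} is outside the controlled language: {sentence}")
```

Because the rules live in plain data structures rather than in the checking code, the same definition could in principle be handed both to an authoring-time checker and to a translation system, which is the point of the proposal. A real controlled language would of course need a far richer grammar and more helpful diagnostics than a simple accept or reject.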
If text is re-checked for compliance with the syntax expected by the machine translation system just before it goes into the system, then there should be no syntactic analysis errors, and the quality of the raw output should be noticeably higher. One does not expect top performance from an automobile if the type of fuel does not match the design of the engine. Diesel engines should be fed only diesel fuel.
It is also possible that grammar checkers could be told which domain or domains are allowed in the source text and thus enter into the territory of semantics, detecting ambiguities of reference caused by allowing multiple domains. As a colleague, Klaus Schubert, has suggested, an ideal grammar checker for machine translation is a filter that tells you whether the text is likely to produce high-quality raw machine translation and, if not, how to remedy that.
To my knowledge, a formal grammar for the text of a software product has not yet been defined, and perhaps this would be a worthwhile project. One factor in localization that works against such a grammar is the tricks that are used to make a label or message fit into a limited space.
3. Format Use During Translation, not just Format Preservation
Typically, the format of a translation is expected to match the format of the source text. It is well established that in this case a machine translation system is more effective if it can preserve the format of the source text in the raw translation. My third proposal for improving machine translation systems is a public standard to facilitate not just the preservation of formatting information during machine translation but the use of that information to improve the quality of the raw machine translation.
The obvious basis for such a standard would be SGML. SGML is an international standard for formally defining the structure of a class of documents.
Specifically, my proposal is to develop a method of associating elements of an SGML DTD (Document Type Definition) with an inventory of element types (such as headings and bibliographic references) that are useful to a machine translation algorithm. One of the best-known applications of SGML is called HTML and is used to define the structure of a page on the World Wide Web.
A simple example of an element type would be a heading. Headings are often noun phrases rather than full sentences, and the machine translation grammar can benefit from knowing that a particular piece of text is a heading. Another simple example is to distinguish among various uses of quotation marks. A literal quotation of what someone said should be treated differently from a quotation that indicates an unusual or made-up word. Using SGML, the above distinctions can be made available to guide computer processing even where not all the distinctions are visible to the end user.
Some work on the use of format during machine translation is being carried out at Carnegie Mellon University. Hopefully, some of the fruit of such projects will soon become available to all machine translation developers. To preserve the format of a text in the output text without taking advantage of it during translation is, from the perspective of the MT system, like eating the skin of a peach and throwing away the inside.
Admittedly, it may be a long time before all source texts are marked up in SGML when they are authored, but there is a portion of this proposal that could be implemented very soon. In addition to marking up the body of a text with format codes, there should be a standard translation request header before the body of the text which would include the language of the source text, the desired target language or languages, and other specifications of how the translation should be processed by a machine (or by a human, for that matter).
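The two ideas above, a translation request header and an inventory of element types, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the header field names (`Source-Language`, `Target-Languages`), the `ELEMENT_TYPES` inventory, and the function `parse_request` are invented here, since no such standard has been defined. HTML is used for the body because it is the SGML application most readers will have at hand, and the parsing relies on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical inventory: markup tag -> element type useful to an MT system.
ELEMENT_TYPES = {
    "h1": "heading",
    "h2": "heading",
    "p": "sentence",
    "cite": "bibliographic-reference",
}


class SegmentCollector(HTMLParser):
    """Collect (element_type, text) pairs for the tags listed in ELEMENT_TYPES."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in ELEMENT_TYPES:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.segments.append((ELEMENT_TYPES[self._stack[-1]], text))


def parse_request(document: str):
    """Split a document into a request header (dict) and typed body segments."""
    header_text, _, body = document.partition("\n\n")
    header = {}
    for line in header_text.splitlines():
        key, _, value = line.partition(":")
        header[key.strip().lower()] = value.strip()
    collector = SegmentCollector()
    collector.feed(body)
    collector.close()
    return header, collector.segments


# Illustrative request: the header lines come first, then a blank line, then the body.
REQUEST = """Source-Language: en
Target-Languages: fr, de

<h1>Replacing the toner cartridge</h1>
<p>Open the front cover.</p>"""

if __name__ == "__main__":
    header, segments = parse_request(REQUEST)
    print(header)
    for element_type, text in segments:
        print(element_type, "|", text)
```

A translation system receiving these typed segments could, for instance, analyze the heading as a noun phrase rather than forcing a full-sentence parse on it. How the element types would actually be tied to a DTD is left open here, as it is in the proposal itself.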
In software, we must, of course, distinguish between user interface and help files. Help files, though typically marked up using RTF, could be authored in HTML and automatically converted to RTF. As software becomes more integrated with the Web, this is already happening. Then machine translation systems could be designed to use the format codes in HTML to improve the quality of the output.
4. Universal Terminology Interchange
No matter how well the syntax of a controlled-language source text has been checked and its format specified, the raw translation cannot be of high quality unless an appropriate bilingual terminology database (sometimes called a "termbase") is available to the machine translation system. An appropriate termbase is one that contains pairs of source-language and target-language terms which match the domain of the source text and the desires of the requester of translation. The quality of the termbase may well be more important to the quality of the raw output than is the quality of the syntactic analysis component. As Robin Bonthrone points out in the September 1996 issue of the LISA Forum Newsletter, the "availability of validated terminology" is essential to translation quality.
A present hindrance to the use of terminology is the diversity of formats in which termbases are laid out, which makes it difficult to use them with different machine translation systems. My fourth proposal is for further effort toward the development of a standard format for the interchange of terminology in machine-readable form.
There are several international projects now attempting to define a terminology interchange format that could be used by both computer tools for human translators and machine translation systems. One is the TRANSTERM project, which recently produced its final report. Another effort is phase two of the MARTIF project. MARTIF (a project of ISO/TC 37) was limited in its first phase to pre-negotiated interchange between termbases designed for human use.
Phase Two of the MARTIF project is intended to build on the work in Phase One in various ways and with various partners. One potential partner is OTELO (a European Union project). A presentation on terminology interchange is being prepared for the March 1997 LISA Forum, and a workshop on the subject is scheduled for the International LSP Conference next August in Copenhagen.
A widely accepted standard format for terminology interchange would allow the transmission of terminology along with a source text. In most source texts, some specialized terms need to be translated consistently. Concept entries for those terms would be extracted from an appropriate termbase and passed along with the document in a universal interchange format. The termbase subset would then be converted, more or less automatically, to the internal representation used by the machine translation system and consulted during the translation process.
Of course, we are not talking about function words (grammatical markers such as prepositions and conjunctions), since these are coded very differently in different systems and it would be very difficult to find a universal format for the information which the systems need about such words. Initially, verbs, adjectives and adverbs would not be interchanged either. However, specialized terms are typically nouns or phrases that can be treated as if they were nouns, and these are more straightforward to encode for use by multiple systems. Even so, the universal encoding of the various features used by machine translation systems on nouns would be an extremely challenging project.
It remains to be seen whether a terminology interchange format will permit automatic interchange among different machine translation systems, but the rewards justify a considerable effort in this regard, and even a partially automatic interchange might be worthwhile. An interchange format would also allow shared development of a very friendly dictionary update module.
5. Links Between a Source Text and a Termbase
Once most machine translation systems take advantage of the format codes in the source text and most source texts are delivered with machine-readable termbase subsets, my fifth proposal would be to mark terms in the source text at authoring or soon afterwards with the help of an editorial assistant, and then link them to a termbase. This would improve the quality of the source text and facilitate later translation. Using SGML, a term can be marked unambiguously in the source text even if it consists of several words, and the markup need not be visible in the presentation of the text to the end user.
Potential Benefits
The five proposals just made are ambitious yet realistic: they assume no breakthroughs in linguistic theory or computer software or hardware. Although they will require a lot of hard work and a spirit of cooperation among developers, translators, and linguists, the potential benefits are substantial.
Consumer labeling for machine translation systems would reduce disappointment among users. Systems for gisting and systems for producing publication-quality output are not interchangeable.
A standard way of defining syntax for controlled languages would permit higher quality output, while comparisons of two systems would be easier if they both accepted the same formal grammar. For publication-quality systems, the emphasis should not be on continuously broadening the range of structures that are accepted. Neither should next-generation production systems attempt to produce high-quality output for dynamic and creative texts. The emphasis should be on defining controlled languages that are as restricted and easily processed as possible, while allowing sufficient structures and concepts to express what needs to be said. Then effort can be put into improving the quality of the raw output for a given controlled language.
The third, fourth, and fifth proposals (marking and using SGML format codes in the source text, accompanying the source text with a termbase subset in a universal format, and linking the text with the termbase subset) would benefit not just machine translation and post-editors but also human translators who do not use machine translation. These last three proposals would allow more sophisticated and effective translator productivity tools.
References
Thanks are expressed to the colleagues who provided valuable feedback on this paper, especially Roald Skarsten, Karin Spalink, Klaus Schubert, Michael Sneddon, and Arle Lommel. Deborah Fry suggested some revisions for the LISA audience as opposed to the Aslib audience.
1) Snell-Hornby, Mary. 1988. Translation Studies: An Integrated Approach. Amsterdam/Philadelphia: John Benjamins Publishing Company.
2) Lakoff, George. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago: University of Chicago Press.
3) Melby, Alan and C. Terry Warner. 1996. Translation and Free Will. A paper presented at an international symposium on historical and theoretical aspects of translation held at the Geneva school for translators and interpreters in October 1996.
4) Melby, Alan K. with C. Terry Warner. 1995. The Possibility of Language. Amsterdam: John Benjamins Publishing Company.
5) Humphreys, R. Lee. 1992. Proceedings of the 1991 Aslib conference "Translating and the Computer 13", page 93. London: Aslib.