© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org
Making Money with Machine Translation
Every Cash Cow Starts Out as a Calf!

Monika Röthlisberger, Group Manager Language Technologies, CLS Communication

Several of our customers told us, “Our people keep sending confidential texts out to free machine translation systems on the internet. This is clearly a security issue; we must offer them a secure alternative. I don’t want to see any more of our documents on unidentified web servers. Do something about it!” If a customer actively asks for such a service, there must be some money in it, we thought, and that’s how our machine translation (MT) adventure began at CLS Communication.


In the beginning, it truly was an adventure. Being used to working with off-the-shelf language technology products, such as translation memory systems, we felt a little like Robinson, Adam and Eve in one person, as we had to cut back the jungle and bite into a number of poisonous apples. Many things were not as easy as we had expected. With time and experience, we have been able to cut down the adventure part and now work with a system we can safely and proudly offer to our customers in exchange for payment. This article describes CLS Machine Translation today and takes a look at some of the issues we have had to overcome during the “adventure phase.”

CLS Machine Translation Today

To our users, CLS MT looks very similar to any free translation service on the web. Via a web interface, it supports translation of text fragments as well as entire documents. The interface allows the user to specify translation parameters such as source language, target language and subject area. Since we developed our own interface, we can easily adapt it to a customer’s corporate identity. Users can choose among fifteen translation directions (German-English, English-German, German-French, French-German, English-French, French-English, English-Spanish, Spanish-English, Spanish-German, German-Spanish, English-Italian, Russian-English, English-Russian, Russian-German, and German-Russian). The system now accepts documents in several formats, including .doc, .rtf, .htm and .txt.

Web Interface to CLS Machine Translation

Terminology

CLS Machine Translation integrates 30,000 terms in four languages (German, English, French and Spanish) from our terminology database, CLSTerm, and 1,250,000 translation units of translation memory data. The terminology mainly covers the financial, insurance, legal and telecoms sectors. Our MT team has coded an additional 30,000 entries, including the names of companies and people, as well as unknown words identified in customer texts. As our MT team is permanently available for dictionary coding, it is able to quickly integrate a new customer’s terminology into the system to serve that particular customer.

Security

Security concerns were the mother of our MT project, as I explained earlier, so security is still a top priority: Data transfer between customers and our MT server is encrypted, based on SSL technology. Some customers even prefer to have their own dedicated MT server, accessed via a direct line, so that their data never travels over the internet. The system boasts a current uptime of 99.9%, with the MT team offering technical and linguistic support during office hours.

Uses

Most customers use CLS MT directly through its web interface, sending entire texts or just looking up single words and expressions in other languages. Currently, this type of use is what generates most of the money made with the service. For some of our customers, texts translated by the machine are post-edited by human translators at reduced rates compared to human translation. Such texts are typically internal documents that customers use purely for information purposes.

Turning the Prototype Into a Product

When we first installed our MT engine and performed a couple of tests, everything ran smoothly. This, of course, was because we were all very “kind” users, in the sense that we only used the system as described in the documentation. However, when we made CLS MT available to our first test customers, system stability was a big issue. The system would crash when a customer sent a document that was too big, a Word document(!), or just because it did not feel like working that day, it seemed. We could not possibly offer an MT service on the market with considerable downtime, nor were we able to teach all of our customers how to convert a Word document to RTF, so this was a very urgent issue. In other words, we needed to turn the prototype into a product. First, we had to closely monitor the system “by hand,” restarting the servers whenever necessary (nice job looking at uptime controls all day…). Later, architectural changes reduced the number of crashes, and monitoring was automated.
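The article does not say how monitoring was automated. As a rough illustration only, the loop below shows what a minimal watchdog could look like: it probes a service at regular intervals and restarts it after a failed probe. The names `watchdog`, `is_alive` and `restart` are hypothetical, and the probe and restart actions are passed in as functions because the real ones depend entirely on the MT engine in use.

```python
import time
from typing import Callable


def watchdog(is_alive: Callable[[], bool],
             restart: Callable[[], None],
             checks: int,
             interval_s: float = 60.0,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Probe the service `checks` times and restart it after each failed probe.

    Returns the number of restarts that were triggered.
    """
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            restart()
            restarts += 1
        sleep(interval_s)
    return restarts


# Simulated run: the fake service fails on the second of three probes.
probe_results = iter([True, False, True])
log = []
count = watchdog(lambda: next(probe_results),
                 lambda: log.append("restart"),
                 checks=3,
                 sleep=lambda s: None)  # skip real waiting in the demo
print(count, log)  # 1 ['restart']
```

In a real deployment the probe would more likely submit a short test sentence to the engine than merely check that a process exists, since an engine can be running and still fail to translate.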

Terminology Database Data Is Not MT Terminology Data

As CLS already had a terminology database (TDB) of more than 60,000 entries, we expected to be able to simply transfer them from one system to the other, exporting them from the terminology management system and importing them into the MT terminology module. However, in many respects, TDB data is not MT terminology data. TDB data’s target audience is human, while MT terminology data’s target audience is a machine.

Even though our TDB is very elaborately structured, including definitions, usage notes and context sources for each entry, the only fields that were of real use to the MT module were the actual terms along with their gender. As the two systems used very different classification systems, the subjects could not be transferred and recycled either. Moreover, our TDB system is concept-centered, i.e. synonyms such as investment fund, unit trust and FCP (and their French equivalents fonds de placement, fonds d’investissement and FCP) appear in the same entry. The MT’s terminology module, however, is word-centered, i.e. the terms investment fund, unit trust and FCP all have their own entries and transfer definitions to the target terms. Consequently, all TDB entries containing synonyms had to be hand-coded if no term pairs were to be lost. Transferring these entries automatically would have required them to also contain information about when each term pair is to be used.

To deal with this in a pragmatic way, we decided to automatically import the first of the synonyms in each language, thus sacrificing all of the other possible term pairs in each entry. In a time-consuming process, we then had to comb through all the imported entries to check the imported term pairs. In the meantime, the CLS terminology team is working away, and of course, we need to include their current work in the MT system, too. As the issues described above cannot be solved quickly, the transfer of the new TDB entries created by the terminologists still involves a large amount of hand-coding.
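The pragmatic import described above can be sketched as follows. This is a minimal illustration, not the actual CLS import routine: the `ConceptEntry` layout, the language codes and the function name are assumptions made for the example. It takes the first synonym in each language as the imported pair and also lists the pairs that are sacrificed, which is the material a manual review would have to go through.

```python
from dataclasses import dataclass


@dataclass
class ConceptEntry:
    """One concept-centered TDB entry (hypothetical layout)."""
    terms: dict[str, list[str]]  # language code -> synonyms, first one is imported


def first_synonym_pairs(entries, source_lang, target_lang):
    """Return (imported, dropped) term pairs for one translation direction.

    imported: one word-centered pair per entry, built from the first synonym
              in each language.
    dropped:  every other source/target combination, kept so that a human
              can review what the automatic import sacrificed.
    """
    imported, dropped = [], []
    for entry in entries:
        sources = entry.terms.get(source_lang, [])
        targets = entry.terms.get(target_lang, [])
        if not sources or not targets:
            continue  # entry has no term in one of the two languages
        for i, src in enumerate(sources):
            for j, tgt in enumerate(targets):
                pair = (src, tgt)
                (imported if i == 0 and j == 0 else dropped).append(pair)
    return imported, dropped


entry = ConceptEntry(terms={
    "en": ["investment fund", "unit trust", "FCP"],
    "fr": ["fonds de placement", "fonds d’investissement", "FCP"],
})
imported, dropped = first_synonym_pairs([entry], "en", "fr")
print(imported)      # [('investment fund', 'fonds de placement')]
print(len(dropped))  # 8
```

With three synonyms on each side, a single entry yields nine possible pairs, of which only one is imported. Note that many of the dropped combinations would not be valid translations anyway, which is exactly why usage information per term pair would be needed for a fully automatic transfer.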

TM Data for Humans Is Not TM Data for Machines

Translation Memory (TM) data was surprisingly easy to export from the TM system and import into the MT system. Here, the issue was more with the actual TM content. Again, TM data for human translators is not necessarily TM data for machines. Often, source segments and their matching target segments depend on a specific context and are not of a generic nature. In one of the problematic segments, for example, the German title “Zweck” (‘purpose’) was translated as “What you can use your bank card for,” which was fine in the particular letter the segment stemmed from. In most other contexts, however, this would be a rather awkward translation. For a human translator, this is easy to spot, and she will not make any mistakes because of this. The machine, however, will take any match for granted…

In order to deal with this problem we have had to comb through the translation memories – partly using automated scripts, partly through a manual process – and delete such non-generic and context-specific sentence pairs. As this is quite labor-intensive, we cannot possibly integrate newly produced TM content into the MT system without the time delay required for cleaning the data.
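The article does not describe what the automated scripts checked. One plausible heuristic, shown here purely as an assumption, is to flag segment pairs whose source and target differ greatly in length, since a one-word title rendered as a full clause (as in the “Zweck” example) is a typical sign of a context-specific translation. The function names and the threshold of 3.0 are hypothetical.

```python
def is_suspicious(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Flag a segment pair whose word counts differ by more than max_ratio."""
    src_words = len(source.split())
    tgt_words = len(target.split())
    if src_words == 0 or tgt_words == 0:
        return True  # an empty side is never a usable pair
    ratio = max(src_words, tgt_words) / min(src_words, tgt_words)
    return ratio > max_ratio


def split_tm(pairs, max_ratio: float = 3.0):
    """Split (source, target) pairs into those to keep and those to review."""
    keep, review = [], []
    for source, target in pairs:
        bucket = review if is_suspicious(source, target, max_ratio) else keep
        bucket.append((source, target))
    return keep, review


pairs = [
    ("Zweck", "What you can use your bank card for"),
    ("Zweck", "Purpose"),
]
keep, review = split_tm(pairs)
print(keep)    # [('Zweck', 'Purpose')]
print(review)  # [('Zweck', 'What you can use your bank card for')]
```

A filter like this can only pre-sort candidates for manual review. German compounds, for instance, often correspond to several English words, so legitimate pairs will be flagged too, and the threshold would need tuning per language pair.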

In summary, here are the lessons we learned the hard way while bringing CLS Machine Translation from the jungle to the marketplace:

  1. Be aware that many commercially available MT systems are not really comparable to mature IT products yet. This means that the actual MT engine may work well, but the “accessories” necessary for professional use are often still in a prototype phase. Time and money must be invested.
  2. Transferring terminology data from a terminology database to an MT system may sound easy at first, but it is not. Very often, parallel coding of the same data cannot be avoided. Again, time and money must be invested.
  3. Transferring translation memory data to an MT system may sound straightforward, too, but TM data needs revising before it can be used in MT. Once again, time and money must be invested.

Making Money?

If you’ve been reading carefully, you may have noticed that the term money appears more often next to the term invest than next to the term make. This clearly mirrors the situation during the first phase, the “adventure phase,” when the actual product is being defined and set up. There is money in MT. However, if you want to make some, you need to be very patient and plan for the long term. Every cash cow starts out as a calf!


Monika Röthlisberger joined CLS Communication six years ago and is now responsible for translation memory and machine translation support for translators. She is a trained translator and terminologist and began her career in the language industry eight years ago with TRADOS Switzerland. Röthlisberger can be reached at monika.roethlisberger@cls.ch.



