As a case study, this paper demonstrates how Euterpe, the large terminological database of the Parliament of the European Communities, can be made accessible through a Web Server interface in a form that can be used both by the Parliament's translators in their in-house Intranet environment and, at a later stage, by users all over the world who wish to query the rich contents of Euterpe.
The paper first introduces Euterpe and then discusses the possibilities and difficulties of accessing multilingual terminological data on the basis of Intranet or Internet technology. Finally, a short description of our ongoing project and a summary are given.
Euterpe: the database of the Parliament of the European Community
Euterpe is the terminological database of the European Parliament and extensively covers political, scientific, technical, medical and educational domains in 11 European languages and Latin. It was created in the 1980s using MultiTerm and has by now grown to more than 160,000 entries.
Within the Parliament a distinction is drawn between the terminology division and the translation division. Members of the terminology division are responsible for creating new entries, updating existing entries and, in general, ensuring the consistency of the database. They need full access to the database through a sophisticated terminology management system. Members of the translation division are end users who do not create or edit entries but use the existing database as an online reference during their translation work. They do, however, give feedback to the terminologists when entries are incomplete or incorrect. For this type of user, read-only access to the database is sufficient; their main interest lies in flexible search facilities and easy copying of search results into their text processor. Currently, Euterpe is accessed either by using MultiTerm in read-only mode or by loading Euterpe into MultiTerm Dictionary, possibly together with other databases, and using the fast and flexible search and data exchange facilities that MultiTerm and MultiTerm Dictionary provide.
Recently the Parliament decided to upgrade its computer infrastructure from the vintage text-mode DOS to the more modern graphical MS-Windows operating system. The in-house information system will be converted to Web technology, and Netscape Navigator will be the standard Web Browser on all desktops. This development led us to launch a project that allows Euterpe to be accessed through Web Browsers. We expect that embedding Euterpe as a resource in the common information flow and using one single tool to access all information will increase the attractiveness and usability of the overall system for all users.
Terminology on the WEB
What advantages does Internet technology have to offer?
The Internet and the Web allow an easy and platform-independent access to a huge variety of remote resources. Resources may range from plain informational texts through hierarchically structured and interlinked documents to highly complex data managed by (relational or other) database management systems. The seamless integration of all these resources into one information system is achieved by Web-Servers and Browsers that hide the different storage techniques and data formats from the user.
Users are thus relieved of the need to know and use many different tools to search for and access information. There is no need to clutter up the desktop with separate windows for Email, News and database access, each with its own idiosyncratic interface. Instead, users only need to learn one single Browser. Browsers present all information through an easily understandable graphical interface. Navigation through resources works by the user clicking on highlighted links or consulting her history of already visited pages. The integration of Email into Browsers combines access to and exchange of information and enables users not only to consume information but also to participate actively and successfully in the very complex information flow.
Technical possibilities
Browsers
Web Browsers provide a consistent interface to a variety of resources thereby hiding the different sources, the underlying storage and representation differences and much more. For users, the consistent view of information thereby achieved simplifies not only the access of and navigation through complex information structures but also hides the hardware and operating system idiosyncrasies among different platforms. Accessing the Web from a Windows PC or a Macintosh or a Unix machine requires no tedious learning of new software interfaces. Temporarily using a colleague’s machine just to look something up is possible without any problems.
Browsers are more and more endowed with sophisticated rendering and layout facilities as well as embedded programming languages:
- Font size and color attributes can be used to mark important elements of texts such as the search term in a term entry and in general increase the readability and understandability of complex information.
- Tables can be used to present simple or moderately complex tabular information in its most appropriate form. The current prototype implementation of the Euterpe interface uses tables to display terms.
- Frames allow the division of the main presentation window into subwindows with dedicated contents: one frame can always display navigation links and help information while others display the search term, the hit list and the current terminological entry. Frames at a fixed position and with fixed contents greatly increase the user’s understanding of the browsed resource and her ability to navigate without “getting lost in hyperspace”. At a later stage of the project, frames will be used to provide an interface resembling MultiTerm Dictionary.
- Embedded (interpreted or compiled) languages such as Java or Javascript increase the versatility of the Browser’s presentation engine and the flexibility of user interaction. Through configurable options the interface can be adapted to individual users and their needs. Embedded programs also increase the maintainability of the interface for the information supplier because the program source needs to be maintained in a single place only and can easily be updated by downloading it from the central repository.
Despite their enormous potential to increase the flexibility of Browsers, the recent embedding of programming languages still suffers from security and trust problems. The access to operating system resources such as external files or fonts to render text in different languages is still limited because of possible security infringements of the user’s machine. Further developments are needed in this area to allow safe yet useful operations.
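As a rough illustration of the table-based display mentioned in the list above, the following Python sketch renders a term entry as an HTML table and marks the search term in bold. The function name, the language codes and the sample terms are assumptions made for this example; they are not taken from the Euterpe prototype.

```python
from html import escape


def entry_to_html_table(entry, search_term):
    """Render a {language code: term} mapping as a simple HTML table."""
    rows = []
    for lang, term in sorted(entry.items()):
        cell = escape(term)
        if term == search_term:
            # Mark the search term so that it stands out in the entry.
            cell = "<B>" + cell + "</B>"
        rows.append("<TR><TD>%s</TD><TD>%s</TD></TR>" % (escape(lang), cell))
    return "<TABLE BORDER=1>\n" + "\n".join(rows) + "\n</TABLE>"


# Hypothetical sample data, not an actual Euterpe entry.
sample = {"EN": "detergent", "DE": "Waschmittel", "FR": "détergent"}
html_table = entry_to_html_table(sample, "detergent")
print(html_table)
```

Escaping the terms before they are placed into the markup keeps characters such as "<" or "&" in a term from being mistaken for HTML.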
Servers and data sources
Web Servers provide flexible means to interface with existing external databases. The oldest and most widespread interface is the Common Gateway Interface (CGI); other, more powerful interfaces such as ODBC are being developed to access (relational) databases. This interfacing allows:
- The flexible integration of virtually any existing database into the Web. Databases do not have to be copied or converted into special formats at regular intervals, so that direct access to the most up-to-date information is always possible.
- The proper separation between the external access routines and the internal storage structures and access mechanisms. An interface program hides the database-specific details and handles all necessary data conversions automatically and transparently to the Server.
- An additional layer in which security checks and logging can be performed on a per database basis. Whereas Intranet access should always be possible, external Internet access must be subject to access and accounting policies and, in case of sensitive data, comprehensive logging of all accesses must be performed.
- Distributed databases. The Web Server and the database engine need not run on a single machine nor need the data to be stored in a single place. This leads to high availability and overall performance of the system. Together with firewalls, a distributed architecture can be used to increase the security of company- or institution-internal data by routing all requests through a single authorizing server that enforces proper access restrictions.
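The role of such an interface program can be sketched as a small CGI-style gateway in Python. This is a minimal sketch under stated assumptions: the dictionary `TERMBASE` merely stands in for a real database engine, the query parameter name `term` is invented for the example, and none of it reflects the actual Euterpe implementation. Only the use of the `QUERY_STRING` environment variable and the header-then-blank-line reply follow the CGI convention.

```python
import os
from html import escape
from urllib.parse import parse_qs

# Stand-in for the real database (hypothetical data). A production gateway
# would call the terminology engine here instead of using a dictionary.
TERMBASE = {"detergent": {"DE": "Waschmittel", "FR": "détergent"}}


def handle_query(query_string):
    """Turn a query string such as 'term=detergent' into a CGI reply."""
    params = parse_qs(query_string)
    term = params.get("term", [""])[0]
    entry = TERMBASE.get(term.lower())
    if entry is None:
        body = "<P>No entry found for %s</P>" % escape(term)
    else:
        items = "".join(
            "<LI>%s: %s</LI>" % (lang, escape(text))
            for lang, text in sorted(entry.items())
        )
        body = "<UL>%s</UL>" % items
    # A CGI program writes header lines, an empty line, then the content.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The Web Server passes the part of the URL after '?' in QUERY_STRING.
    print(handle_query(os.environ.get("QUERY_STRING", "")))
```

The Web Server only sees the query string going in and the HTML coming out; how the entry is stored and retrieved remains hidden behind `handle_query`.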
Although powerful SQL databases exist today, like the Oracle WebSystem or Microsoft SQL Server, that can be efficiently accessed through ODBC drivers or that are even able to act directly as an HTTP server and convert their data to HTML, they are not well suited to managing terminological data. The following reasons speak against using relational databases:
- Terminological entries are not fixed-size table structures. They are variable-length, highly structured data; pressing them into relations means breaking them into small pieces and then suffering from the size and integrity-checking limitations of the DBMS as well as from sluggish data access.
- SQL databases support (up to now) only one 8-bit character set per database. Multilingual entries, however, require different character sets to encode terms of non-Western languages like Greek, Hebrew, Russian, etc. Languages like Chinese or Korean, which have to use non-8-bit codings, cannot be handled at all. No database system yet supports the unifying Unicode character set, which would solve all language representation problems immediately.
- Indexing of terms cannot be handled in the standard SQL way of binary sorting the textual (ASCII) representation of lemmata. Language dependent sorting algorithms must use extensive linguistic knowledge about the language a term belongs to and, e.g., suppress punctuation. Multilingual entries must be sorted according to several languages in parallel, which no standard database currently supports.
- Term retrieval must use error-tolerant algorithms so that the search for an entry retrieves not only exact matches (of which there may be none in the case of misspellings or mismatching whitespace and punctuation) but also all similar matches. This is an important feature, especially in the language industry, and its absence can render an otherwise valuable database useless.
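The indexing problem named in the list above can be made concrete with a small Python comparison of binary sorting and a very rough, language-neutral sort key. The key function is an assumption made for illustration only: it ignores case, accents and punctuation, whereas real language-dependent sorting needs per-language rules that this sketch does not attempt to model.

```python
import unicodedata


def simple_sort_key(term):
    """Rough sort key: ignore case, accents and punctuation."""
    # NFD splits an accented letter into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFD", term.lower())
    # Keep letters and digits only; this drops combining marks, hyphens
    # and whitespace.
    return "".join(ch for ch in decomposed if ch.isalnum())


terms = ["zebra", "Éclair", "e-mail", "eclipse"]

binary_order = sorted(terms)                       # plain code-point order
linguistic_order = sorted(terms, key=simple_sort_key)

print(binary_order)
print(linguistic_order)
```

In binary order the accented capital is placed after "zebra" and the hyphen pulls "e-mail" in front of "eclipse"; with the simplified key both effects disappear.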
The Trados MultiTerm and MultiTerm Dictionary are based on fulltext databases in which terms are stored in SGML format. All terms are indexed according to the language they belong to. The use of an error-tolerant indexing mechanism allows for the retrieval of terms similar to the search term. All results are ranked and presented to the user in form of a hit list from which she may choose the correct term. For these reasons, the MultiTerm database and powerful indexing routines will be used in the Euterpe project.
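The idea of error-tolerant retrieval with a ranked hit list can be sketched in Python as follows. Note that this is not the MultiTerm indexing mechanism, whose details are not described here; the sketch substitutes the generic similarity measure from Python's standard `difflib` module, and the cutoff value and sample terms are arbitrary assumptions.

```python
import difflib


def fuzzy_hits(query, terms, limit=5, cutoff=0.6):
    """Return (score, term) pairs ranked by similarity to the query."""

    def normalise(text):
        # Ignore case and collapse runs of whitespace before comparing.
        return " ".join(text.lower().split())

    scored = []
    for term in terms:
        matcher = difflib.SequenceMatcher(None, normalise(query), normalise(term))
        score = matcher.ratio()
        if score >= cutoff:
            scored.append((round(score, 2), term))
    # Best score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


# Hypothetical term list, not taken from Euterpe.
terms = ["detergent", "deterrent", "non-biodegradable detergent", "parliament"]
hits = fuzzy_hits("detergant", terms)   # misspelled query
print(hits)
```

A linear scan like this is only workable for small term lists; a database of Euterpe's size would need an index built for approximate matching.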
Data exchange protocols and Multilingualism in the Web
The Web’s promise to be one huge information source includes access to multilingual documents. However, the standards currently in use, HTTP and HTML, suffer from their fixation on Western languages that can be coded in 7-bit US-ASCII or, at most, in 8-bit ISO-8859-1. To avoid confusion about the meaning of these two ubiquitous abbreviations, let us first look at how information exchange is handled by them and straighten out the roles they play in the Web:
HTTP stands for HyperText Transfer Protocol and is used as a transfer protocol between Browsers requesting a resource (a text document, a sound recording, a terminological entry, etc.) and Servers locating and delivering the resource. Resources are named by so-called Uniform Resource Locators (URL). The protocol works like this:
- The Browser receives a uniform resource locator (URL) from the user.
- The Browser requests the resource specified by the URL from the appropriate Server by sending it an HTTP GET request. The URL is thereby communicated in the form of a sequence of octets, which represent characters encoded according to ISO-8859-1.
- The Server parses the URL, locates and retrieves the document, and then constructs a reply. The reply consists of a header describing the type of content that follows, and the content itself. Several content types are possible, like “text/html” or application-dependent types like “application/postscript”. The contents of the resource are transmitted as is. An additional header line “content-encoding”, present only occasionally, may notify the client that the document does not consist of lines of text but is actually a compressed or otherwise binary file.
- The Browser reads the header’s content description and, using a mapping table, decides how to read the content and how to present it to the user. Most binary files like compressed files or video data are not handled by the Browser itself but instead sent to an external application program or simply stored in a file.
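The exchange described in the steps above can be sketched in a few lines of Python. This is a simplified illustration that builds and parses the messages as byte strings without opening a network connection; the host name, the path and the contents of the mapping table are assumptions made for the example.

```python
def build_get_request(host, path):
    """Compose a minimal HTTP/1.0 GET request as the octets sent to the Server."""
    request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
    return request.encode("iso-8859-1")


def parse_reply(raw):
    """Split a raw reply into status line, header dictionary and content."""
    head, _, content = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, content


# Browser-side mapping table from content type to action (illustrative only).
HANDLERS = {
    "text/html": "render in Browser window",
    "application/postscript": "pass to external viewer",
}


def choose_action(headers):
    """Decide what to do with the content, based on the content-type header."""
    content_type = headers.get("content-type", "").split(";")[0].strip()
    return HANDLERS.get(content_type, "save to file")


# Hypothetical request and a hand-written reply for demonstration.
request = build_get_request("termbase.example", "/euterpe?term=detergent")
reply = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<H1>Euterpe</H1>"
status, headers, content = parse_reply(reply)
print(status, "->", choose_action(headers))
```

The content is passed through as octets and is interpreted only after the header has been consulted, which mirrors the division of labour between the transfer protocol and the document format.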
Note that HTTP is but one information transmission protocol among others like SMTP (for Email), FTP (for generic file transfers) or TELNET (for interactive two-way communication with a remote machine). Since HTTP is a very simple and stateless protocol (compared to, say, the highly complex stateful SMTP) and is sufficient to transfer any file, it has developed into the one protocol that today operates the Web.
HTML, in turn, stands for HyperText Markup Language. HTML is a document format, not a transfer protocol. HTML-coded documents are text files that adhere to a special internal format, the HTML markup. As documents, HTML files are communicated over HTTP by the server specifying the content-type “text/html” in the header and appending the file as is.
HTML markup was initially designed to specify the content structure of a document. It consists of SGML-like tags that specify header lines (e.g. <H1>), highlighted or emphasized text parts (e.g. <EM>) and so on. Little provision for layout was made in the initial design of HTML, and this is what most page designers suffer from today. More possibilities to influence the layout are nowadays available through tables or frames.
A much stronger limitation of HTML is its inherent lack of support for multilingual data. Imagine a termbank entry containing information in English and Greek (we leave aside non-European languages like Chinese for the moment). An HTML document containing that entry has no means to specify that the Greek information is coded in a different character set (ISO-8859-7) from the English information (ISO-8859-1). Specifying, let alone switching, character encodings within one document is not possible. Displaying the document using an ISO-8859-1 font will map all Greek characters to unrelated accented or special Western European characters, thus rendering the document unreadable.
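The effect described above can be reproduced with a short Python sketch. The Greek spelling of "Euterpe" and the French word are used purely as sample data; the helper function `can_encode` is introduced for this example only.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


greek_term = "Ευτέρπη"                      # "Euterpe" written in Greek
octets = greek_term.encode("iso-8859-7")    # one octet per character

# The same octets, interpreted under two different 8-bit character sets.
shown_with_greek_charset = octets.decode("iso-8859-7")
shown_with_latin1_charset = octets.decode("iso-8859-1")
print(shown_with_greek_charset)
print(shown_with_latin1_charset)

# An entry mixing French and Greek fits neither character set as a whole.
mixed_entry = "détergent / " + greek_term
print(can_encode(mixed_entry, "iso-8859-1"), can_encode(mixed_entry, "iso-8859-7"))
```

The octets themselves carry no indication of the character set they were written in, so the receiving side has to be told, or has to guess.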
So in the end we are faced with the following problems:
- Universal Resource Locators are not universal at all. They can only specify resources that can be named using ISO-8859-1 Western characters. Searching for a document named with non-Western characters or for a Greek term in a termbase is not possible unless there is a way to communicate the language and the character coding in which the URL is given.
- Character sets cannot be specified for parts of a document. Proposals exist for introducing special tags for switching languages and character sets, but up to now no agreement has been reached.
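The URL problem in the list above can be illustrated with Python's standard `urllib.parse` functions. The path `/search` and the parameter name `term` are hypothetical; the point of the sketch is only that the same Greek term yields different octet sequences depending on the character coding, and that the URL itself does not say which one was used.

```python
from urllib.parse import quote, unquote

greek_term = "Ευτέρπη"    # sample search term ("Euterpe" in Greek)

# The same term, written into a URL under two different character codings.
encoded_greek = quote(greek_term, encoding="iso-8859-7")
encoded_utf8 = quote(greek_term, encoding="utf-8")
print("/search?term=" + encoded_greek)
print("/search?term=" + encoded_utf8)

# A Server that assumes ISO-8859-1 recovers a different string altogether.
misread = unquote(encoded_greek, encoding="iso-8859-1")
correct = unquote(encoded_greek, encoding="iso-8859-7")
print(misread, correct)
```

Unless Browser and Server agree on the coding by some means outside the URL, the Server cannot tell which of the possible readings was intended.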
Unicode is a 16-bit character set that could solve all problems of representing multilingual documents at once. Since 16 bits are enough to represent any character in any language, documents could contain all possible characters peacefully side by side. Despite increasing globalization and the development of the information society, Unicode as the ultimate solution is still a long way off in our 8-bit, ANSI-dominated world. At the moment, the use of Unicode is hindered by the following problems:
- Proper support of operating and window systems is missing. Receiving user input and displaying texts in Unicode is not possible on all machines.
- Use of Unicode or one of its transfer encodings would require non-trivial changes to Browsers and Servers. Browsers must accept URLs in Unicode from the user and display Unicode documents correctly. Both Browsers and Servers have to use a Unicode-enhanced transport protocol to successfully transmit URLs and documents.
- Individual solutions are not viable because they violate the Web’s universal philosophy. Real multilinguality must be subject to international agreement.[1]
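For comparison, the following Python sketch shows text in two scripts held in one Unicode string and converted to two transfer encodings. The sample text is an assumption made for the example, and the byte counts apply to this particular text only.

```python
entry_text = "Euterpe / Ευτέρπη"    # Latin and Greek script in one string

# Two transfer encodings of the same sequence of Unicode characters.
utf16 = entry_text.encode("utf-16-be")   # two octets per character here
utf8 = entry_text.encode("utf-8")        # one octet per ASCII character,
                                         # two per Greek character

print(len(entry_text), len(utf16), len(utf8))

# Both encodings decode back to the identical text without loss.
assert utf16.decode("utf-16-be") == entry_text
assert utf8.decode("utf-8") == entry_text
```

Both sides of the connection must nevertheless know which transfer encoding is in use, which is why the change affects Browsers, Servers and the protocol alike.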
In the case of Euterpe we implemented a solution that successfully covers Greek terms. An example of a multilingual entry containing Greek is given at the end of this paper.
Copyright and commercial issues
Making a valuable resource like a termbank accessible on a network may quickly run into copyright problems. Even in the case of institutions or companies employing their own terminologists to create term entries, copying entries from other sources leads to questionable copyright situations on the resulting termbank. In most cases the copyright situation is unclear and the termbank cannot be easily made publicly accessible. Furthermore, term creation is a very costly activity and therefore many owners of terminology are not willing to make their data publicly accessible at all.
If companies are in a (legal) position to make their databases accessible on the Internet, the databases most often constitute the very basis of the companies’ business operations. Unless publication is restricted to marketing purposes, access to the database can be allowed only when users pay a usage fee. Fees can be collected by, e.g., creating an account that is regularly replenished by credit card payments and from which a small amount of money is deducted every time the user accesses the termbase. Money transfers to and from such accounts must be performed very safely and carefully. As yet unsolved problems with safe money transfers include:
- Accounting and (credit card) payment procedures are not yet standardized or generally accepted for use. They must involve transaction semantics, which the HTTP protocol does not currently provide. Transactions also slow down network communication; if casual access to information is punished by long waits, clients will refrain from using the service.
- Secure authentication schemes are necessary to guarantee that the provider charges the correct client’s account and that the client does not transmit virtual money or the right to debit an account to a fraudulent supplier.
- Any communication involving authentication and payments must be strongly encrypted so that nobody other than the parties involved is able to decrypt and use the transmitted information. Governments usually do not like strong and safe encryption schemes they cannot control. They try every possible way to prohibit their (non-governmental) use or at least to get their hands on a backdoor that guarantees them post-transactional traffic-tracing.
- Mutual trust in general is required between the provider and the client and, depending on the protocol, a third independent authority may be involved. Independent authorities are things governments usually do not want to see either. Their existence, like strong encryption, is tolerated only if the government can control them (or is identical to them, which is worse).
Most commercial applications will therefore, at least in the near future, be based on Intranet solutions allowing no or only strongly controlled access from the outside world.
Current implementation
MultiTerm on the WEB has been implemented for the European Parliament to integrate Euterpe into the general in-house information system. The current implementation is based on the rich experience of Trados MultiTerm and consists of a new and fully object-oriented redesign of the main data handling and interface routines. All classes are based on Unicode to process data in all possible languages. The client/server architecture constructed around a central database enables us to achieve high performance through the Web interface. Error-tolerant search routines allow meaningful results even in the case of misspellings or inexact or partial queries. Multilingual display and query of entries can be handled. Flexible access restrictions may be enforced based either on the machine a request originates from or on a standard password protection scheme.
If our tests within the Parliament are successful, opening the database to the Internet will be considered.
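The two kinds of access restriction mentioned above, by originating machine and by password, can be sketched in Python as follows. This is a minimal sketch under stated assumptions and does not describe the actual implementation: the network range, the user name and the password are invented, and the plain SHA-256 digest is a simplification of what a real password scheme would use.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical Intranet address range whose machines are always admitted.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

# Hypothetical user table. A real system would store salted hashes
# produced by a dedicated password-hashing function.
PASSWORD_HASHES = {"guest": hashlib.sha256(b"secret").hexdigest()}


def is_allowed(remote_addr, user=None, password=None):
    """Admit a request by originating machine or, failing that, by password."""
    address = ipaddress.ip_address(remote_addr)
    if any(address in network for network in ALLOWED_NETWORKS):
        return True
    stored = PASSWORD_HASHES.get(user)
    if stored is None or password is None:
        return False
    supplied = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # Constant-time comparison of the two hexadecimal digests.
    return hmac.compare_digest(stored, supplied)


print(is_allowed("10.1.2.3"))                      # Intranet machine
print(is_allowed("192.0.2.7"))                     # outside, no password
print(is_allowed("192.0.2.7", "guest", "secret"))  # outside, with password
```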
Summary
From our experience in the field we expect the following developments in the future:
- Better support for multilingualism by including Unicode into all operating systems and user software.
- Inclusion of termbases into general information systems will broaden their usage to more and more non-professional users.
- Difficult copyright situations will in the near future lead mainly to Intranet-based solutions. The main distribution channels will remain books or conventional electronic publishing, e.g., CD-ROM.
The authors can be contacted at:
TRADOS Benelux S.A.
Av. de Tervuren 303
1150 Brussels, Belgium
Notes
[1] For instance, the Accent Multilingual Browser is based on a Unicode-enhanced HTML dialect and allows for the display of multilingual text. The HTML enhancements have been proposed to the IETF for acceptance as an international standard. Changing HTML, however, is only part of the solution. The transmission protocols and URL naming are not covered by Accent's proposal.
Figure 1: Euterpe sample entry for “non-biodegradable detergent”