Knowledge-Based Translation Technology in Japan: The EDR Project
In Japan, knowledge processing and natural language processing are perceived as the core of information technology. Natural language processing plays an increasingly important role in electronic knowledge representation. Japan's major investments in the development of adequate technology, in the form of very large-scale knowledge bases, are the foundation on which the advanced software technology of the future will be based. Both the creation of an efficient framework for knowledge management and an approach emphasizing multimedia for the broad range of knowledge to be handled are considered to be of central future relevance. They are connected by an approach based on document processing and natural language processing. Although all knowledge representation media (natural languages, formal languages, and picture languages as well as images and sounds) are taken into consideration, at present particular emphasis is placed on natural language processing, because natural language is the medium with the most versatile expressive capacity and the most highly developed ability for symbolic representation. Natural language is therefore the central medium in Japanese very large-scale knowledge base development.

At the same time, there has been increasing awareness in Japan of the importance of access to research and development as well as technological and industrial progress abroad. In order to overcome critical language barriers between Japan and other countries, considerable funds have been allocated, by both the Japanese government and industry, to the development of suitable natural language processing tools. Japanese corporations such as Fujitsu, NEC, Hitachi, Sharp, Toshiba, Oki Electric Industry, Mitsubishi, and Matsushita have developed machine translation systems and have been cooperating in a large-scale dictionary and knowledge base development project.
The EDR Electronic Dictionary

Rather than sponsoring isolated developments, the Japanese Ministry of International Trade and Industry (MITI) planned and implemented a comprehensive nine-year national project. Joint funding, which is expected to amount to about 14 billion yen, has been provided by the government and the eight industrial corporations mentioned above. The overall objective of the project is to achieve inter-industrial and international cooperation in the development of a general dictionary specification, a development methodology, and support systems. In 1986, the Japan Electronic Dictionary Research Institute Ltd. (EDR) was established and started developing the EDR Electronic Dictionary.

The EDR Electronic Dictionary encompasses both the Japanese and the English languages. English was adopted as a foreign language because of its international significance, but EDR plans to establish cooperative relations with research groups outside the English-speaking community. Cooperation has already been established with China, Thailand, Malaysia, and Indonesia.

The EDR Electronic Dictionary has the desired properties of a very large-scale knowledge base, due to a balanced and systematic connection between what can only be understood by humans and what can be understood by computers. Two types of knowledge representation media have been used in the electronic dictionary, namely natural language and a formal or dictionary description language. Natural language is provided for human users, establishing word meanings through examples. Formal language, to be understood and processed by computers, is used to represent the contents of the concept dictionary (cf. below) and grammatical information.

The EDR dictionary is composed of a number of inter-related large-scale dictionaries. Four types of dictionaries are distinguished, namely the word dictionary, the concept dictionary, the co-occurrence dictionary, and the bilingual dictionary.

1. The Word Dictionary

The word dictionary is divided into the general vocabulary dictionary and the technical term dictionary. The general vocabulary dictionary is further subdivided into Japanese and English dictionaries of 200 000 words each. The technical term dictionary covers the information processing domain. It is divided into Japanese and English dictionaries, each containing 100 000 words. Apart from the lexical entries, the word dictionary contains headword information, morphological and syntactic information, and semantic information in the form of concept identifiers with concept illustrations. The concept represents the meaning of a word, and the concept identifier is used as an index or interface for computers to refer to the concept dictionary. The development of the word dictionary has been concluded.

2. The Concept Dictionary

The concept dictionary contains knowledge on the concepts which are linked to the word dictionary. It provides a knowledge base for computers, much as human beings have to rely on knowledge they already have in order to understand sentences. The dictionary contains information on the 400 000 concepts defined in the word dictionary. Depending on the type of information, it is divided into the concept classification dictionary and the concept description dictionary.

A concept is defined as an abstract entity used to represent the meanings of sentences. Most concepts are entities corresponding to the meanings of words. Sentence concepts are concept relation representations and are defined as extensions of word concepts. Each concept has a concept identifier and a concept illustration. A concept identifier is a number uniquely assigned to a concept. A concept illustration gives examples of the use of a concept. A concept description is a description of the concept in natural language, either written by a lexicographer on the basis of intuition or abstracted from a set of descriptions in the EDR Corpus (cf. below).
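The link between the word dictionary and the concept dictionary described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the EDR format: the class names, field names, sample identifiers, and sample texts are all assumptions made for the example. It shows a word entry carrying concept identifiers, and a lookup that uses those identifiers as an index into a concept table holding an illustration and a description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Concept:
    """One concept dictionary record (field names are hypothetical)."""
    concept_id: int     # number uniquely assigned to the concept
    illustration: str   # example of the concept in use
    description: str    # natural-language description of the concept


@dataclass
class WordEntry:
    """One word dictionary record (field names are hypothetical)."""
    headword: str
    part_of_speech: str  # stand-in for morphological/syntactic information
    concept_ids: List[int] = field(default_factory=list)


# Invented sample data; these identifiers and texts are not taken from EDR.
CONCEPTS: Dict[int, Concept] = {
    1001: Concept(1001, "deposit money in a bank", "a financial institution"),
    1002: Concept(1002, "sit on the bank of a river", "land beside a river"),
}

WORDS: Dict[str, WordEntry] = {
    "bank": WordEntry("bank", "noun", [1001, 1002]),
}


def concepts_for(headword: str) -> List[Concept]:
    """Follow a headword's concept identifiers into the concept table."""
    entry = WORDS.get(headword)
    if entry is None:
        return []
    return [CONCEPTS[cid] for cid in entry.concept_ids if cid in CONCEPTS]


if __name__ == "__main__":
    for concept in concepts_for("bank"):
        print(concept.concept_id, "-", concept.description)
```

Keeping the language-dependent word entries apart from the concept table means that several word dictionaries could, in principle, point at the same concept records through the shared identifiers.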
A concept classification shows the hierarchical relationships between concepts. Abstract concept descriptions enable the computer to judge whether a sentence concept is meaningful, to convert it to the concept of a similar sentence, and to judge the degree of similarity between sentence concepts. The computer is, at present, only able to understand a small fraction of the meaning of a word due to the limited number of concept relations defined so far. A considerable, yet at present sufficient, amount of approximation is therefore employed. The concept dictionary is thus a collection of the simplest sentences comprehensible to the computer, a kind of text base. The development of the concept classification has been concluded. The concept description component has been developed and is currently in an improvement and expansion phase.

3. The Co-occurrence Dictionary

The co-occurrence dictionary is created for each of the two languages separately, with 300 000 entries each. It consists, for each entry, of typical word collocations in the subject domain of information processing. In addition to the collocations listed, co-occurrence relations are defined in terms of the syntactic relationships in which words occur. This dictionary is used to select the most suitable word to express a given concept in relation to other words. It is particularly useful in selecting corresponding words in the target language in machine translation. The co-occurrence dictionary has been developed in connection with concept description. It is currently being improved and expanded.

4. The Bilingual Dictionary

The bilingual dictionary consists of an English-to-Japanese and a Japanese-to-English component. Each bilingual dictionary consists of 300 000 entries, including general and technical terms. The dictionary contains the translation equivalents of the Japanese and English headwords, supplementary information, and the correspondence relations between headwords and their translation equivalents.
Correspondence relations are defined in terms of four categories: equivalence relation, synonymous relation, subset relation, and superset relation. The development of the bilingual dictionary has been concluded.

The EDR Corpus

Apart from this extensive and complex set of databases, a large corpus consisting of 500 000 sentences, 250 000 of them English, the other half Japanese, has been compiled. For each sentence, semantic and syntactic structures have been encoded. The corpus is central to the construction of the EDR dictionary and to the formation of a substantial basis for various types of natural language processing research. The EDR Corpus is created as follows:
At present, the syntactic and semantic structures assigned to the sentences are being reviewed with a view to improving the system.

The Development Objectives

The dictionary was designed according to the following strategy: surface-level information, which is heavily language dependent, is completely separated from deep (semantic) level information. Surface information is stored in the word dictionary, whereas semantic information is stored in the concept dictionary. Interfaces between headwords and concepts are established in the word dictionary. This approach ensures, first, that semantic information can be shared among various dictionaries in each language. Second, language dictionaries can be developed independently of one another. Information which depends on specific grammatical rules and algorithms has been excluded.

Future Plans and Exploitation

The research and development of the EDR electronic dictionary will be completed at the end of fiscal year 1994. After that, all final results of the project will be distributed on a commercial basis according to the following policy:
It will be necessary to maintain the EDR dictionary continuously after 1994. It is expected that the maintenance cost will be covered by the income achieved through sales. The lexical knowledge acquisition support systems developed in the course of the EDR dictionary project will continue to be used after its completion. There are plans to develop them into a dictionary development plant or a dictionary factory.

Further plans comprise the development of a very large-scale knowledge base into a knowledge archive. Knowledge archives represent an attempt to create a technology for knowledge itself, leading to new integrating software technology. Rather than designing natural language interfaces for knowledge-based systems or using knowledge-based systems in natural language understanding, both components are expected to be integrated in the design of knowledge archives. Knowledge archives will contain knowledge documents in a knowledge representation medium which can be objectively observed and analyzed. Some knowledge documents can be represented in Japanese, others in formal languages, in foreign languages, in picture languages, or in images and sounds. Knowledge archives are expected to function as the most universal expert system. The archives will have knowledge extraction functions, support functions for creating knowledge documents, and knowledge storage and retrieval functions, as well as knowledge translation and communication functions supporting the translation of large volumes of knowledge documents and their widespread and efficient transmission and exchange.

Conclusion

In Japan, substantial progress has been made in the field of computerized knowledge representation within a limited period of time. Cooperative industrial and public efforts have led to an integrated approach to large-scale dictionary compilation which will soon be commercially exploited.
A long-term view of the development of information processing and artificial intelligence has been adopted. The results will be exploited for translation purposes in the short term and, in the medium and long term, further developed for the design of expert systems and of software technology that increasingly takes advantage of an integrated approach to comprehensive multilingual and multimedia knowledge processing.

It will be important to understand the implications of natural language processing for future software technology development and therefore to adopt a unified and cooperative approach to knowledge-based strategies for creating multilingual terminology resources and corpora. Watch for further developments in Japan by systematically collecting information about ongoing research and development there. It is also highly advisable to evaluate and take advantage of suitable commercially available tools.

The information contained in this article was collected during the Fourth Machine Translation Summit, held in Kobe, Japan, July 20-22, 1993. Three talks were presented on the EDR project. A comprehensive study of commercially available Japanese translation technology, documenting and comparing the numerous Japanese computer translation tools, is being undertaken. Please sign up for a report by December 1, 1993. The study will begin in January; reports will be available from April 1994. The price of the report is $1800.