In this issue…
The ELRA/ELDA Language Resource Survey
Jeffrey Allen presents the preliminary findings of the 1999 ELRA/ELDA Language Resources Survey, providing a general overview of how respondents (researchers as well as product developers) are using language resources today.

About ELRA and ELDA

The European Language Resources Association (ELRA) was created in 1995 as a non-profit organization to collect, market and distribute language resources and manage the diffusion of general information in the field of language engineering. In this field, the term “Language Resource” (LR) refers to sets of language data and descriptions in machine-readable form, which are used specifically for
in general, as core resources for
This language data (i.e., speech, written and terminology databases) is produced and distributed for use in a variety of language engineering systems and applications. The tasks of the association and its distribution center (the European Language Resources Distribution Agency, or ELDA) are to identify and collect existing resources and to distribute this information to potential users in Europe and beyond; e.g., through
ELRA/ELDA also conducts a series of market surveys designed to determine the current and potential market for LRs. One series of surveys is being conducted as part of the European Language Resources - Packaging and Production project (LE4-8335), and a report on the preliminary findings of this survey has already been published (Allen, 1999). The following excerpt of that report is printed here with ELRA’s consent.

General information on the ELDA 1999 survey

This report describes the current state of an ongoing survey aimed at determining the needs of users with respect to available and potentially available LRs. As part of the market monitoring activities outlined in the LE4-8335 project, the main objective of this survey is to provide concrete figures as a basis for a more reliable and workable business plan for ELRA and ELDA, and to determine investment plans for sponsoring the development of new resources. These results reflect only the information obtained from the summer 1999 user needs questionnaire, conducted primarily among ELRA non-members. The survey consisted of direct contact with personalized messages sent to 667 individual addresses. The full range of questions in the study includes:
It is important to note that, unlike earlier questionnaires, the summer 1999 questionnaire was not sent to the list of regular ELRA members or clients. This does not mean that past LR members or clients were not contacted, but rather that the current customer list was deliberately not used as the basis for obtaining information about LR user needs. Addresses were extracted and compiled from a single database of contact addresses. While the general objective was to contact as many different players as possible, we acknowledge that a single database is not exhaustive. In addition, the individuals contacted for this survey were likely to be more interested in written LRs, since ELDA focused on improving its network of contacts in the written and terminology LR fields in 1999.

Survey statistics

Of the nearly 670 questionnaires sent out to language engineering specialists, 17.5% were returned because of invalid addresses. Those addresses have since been removed from or corrected in the database. After adjustment for invalid addresses, 16.4% of recipients (90 respondents) returned a completed questionnaire. Additional follow-up strategies are underway to contact those recipients who did not respond, and to contact other individuals who were not included in the summer 1999 batch.

Each LR type was divided into basic non-annotated data and annotated data. 30% of respondents expressed interest in basic speech data and 29% in annotated speech data. This results in a round figure of 30% of participants who seek speech LRs. Of those respondents working on written LRs, 28% seek basic data and 42% seek annotated data for syntactic bases. A large group of respondents is interested in lexical databases, with 54% looking for basic data and 63% for annotated data. The fourth major LR type is text databases, and of the respondents interested in this type, 63% seek basic data and 58% annotated data.
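The adjusted response rate quoted under Survey statistics can be reproduced from the figures given in the text (667 addresses, 17.5% invalid, 90 completed questionnaires). The following is only a minimal sketch of that arithmetic; the variable and function names are illustrative and do not come from the report, and the invalid-address share is applied to the exact count of 667 rather than to "nearly 670".

```python
# Figures taken from the article; all names below are illustrative assumptions.
SENT = 667                 # personalized messages sent
BAD_ADDRESS_SHARE = 0.175  # share returned because of invalid addresses
COMPLETED = 90             # completed questionnaires returned


def adjusted_response_rate(sent: int, bad_share: float, completed: int) -> float:
    """Response rate relative to the addresses that were actually reachable."""
    reachable = sent * (1 - bad_share)
    return completed / reachable


raw_rate = COMPLETED / SENT
rate = adjusted_response_rate(SENT, BAD_ADDRESS_SHARE, COMPLETED)

print(f"raw response rate:      {raw_rate:.1%}")  # 13.5%
print(f"adjusted response rate: {rate:.1%}")      # 16.4%
```

The adjusted figure matches the 16.4% stated in the report, which suggests the percentage was computed against the roughly 550 valid addresses rather than against all 667 messages sent.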
The survey shows that approximately 1/3 of the 90 respondents are interested in speech LRs and approximately 2/3 in written LRs. This is a completely different audience from the one targeted in our survey efforts in 1997, 1998, and early 1999, demonstrating that ELDA’s efforts to target the area of written LRs in 1999 have been successful.

Speech processing

One section of the questionnaire aims at gathering information about the type of work being done in the speech domain. The users in this section and all subsequent sections are divided into researchers and product developers. According to the findings, the main types of work are speech recognition and speech synthesis. 30% of respondents are involved in speech recognition research and 14% in speech recognition product development, while 24% are involved in speech synthesis research and 9% in speech synthesis product development. Next come those working on the development of speech databases (22% for research and 19% for product development) and speech analysis (22% for research and 6% for product development). The lower end of the spectrum includes speech coding (9% for research and 4% for product development), followed by speech workstation software (8% for research and 4% for product development). These results show that users of speech LRs are primarily involved in speech recognition and speech synthesis research (between 1/4 and 1/3 of respondents). Additional statistics on each speech processing subtopic are provided in the full report that is available to ELRA members.

Text processing

The questionnaire also includes a section on general types of text processing systems, which produced the following results:
A large proportion of the respondents (including those working on speech and text processing) are involved in automatic machine translation (41% for research and 23% for product development), followed by terminology management tools (32% for research and 23% for product development), translation memory applications (19% for research and 16% for product development), grammar checkers (22% for research and 18% for product development), style checkers (20% for research and 11% for product development), and spell checkers (19% for research and 17% for product development).

Multi-media and Multi-modal LRs

One of the most recent demands for LRs falls in the area of multi-media and multi-modal data. The survey shows that 50% of all respondents are interested in multi-media data and 35% in multi-modal data. Approximately 10% of all respondents expressed interest in one of several types of multi-modal LRs for research. Product development is still low, but this is to be expected in such a new area of research. The figures represent a sharp increase over the information obtained in the autumn 1997 survey, in which only 1/18th of the surveyed participants were interested in multi-modal LRs.

Languages needed

Another questionnaire section inquired about the languages desired for LR data. These statistics help us understand user needs both in terms of what is currently being offered and what has yet to be developed. Because each respondent could tick more than one language box in the questionnaire, the percentages refer to the total number of language boxes ticked, not to the total number of respondents:
The full report provides more detailed information on each of the above-mentioned areas. The summer 1999 user needs questionnaire was aimed at complementing information already received from ELRA members and customers, and at determining whether the trends among non-ELRA member institutions are similar. We have taken the results of this and previous questionnaires as a basis for redesigning our strategy for further work. This includes extending the survey to cover a larger base of recipients and targeting other domains specific to the human language technology field. The current questionnaire results are therefore setting the benchmark for future survey work. These results are also helping ELDA rework its overall marketing strategy for promoting language resources.

Additional survey work

An updated version of this survey (Allen and Choukri, 2000), based on the analysis of 250 questionnaire responses, will be presented at ELRA’s upcoming Second International Language Resources and Evaluation Conference (see Industry Events). ELDA is also participating in the Gates for an Enhanced Multilingual Resource Access (GEMA) project (MLIS-5021), which aims at serving the end-user community (translators, terminologists, etc.). The GEMA project will also study and specify the needs expressed by different types of portal users. See www.elda.fr/proj/gema.html for more information, or fill out an online version of the GEMA survey at www.elda.fr/proj/gemasurv.html. Please contact Jeff Allen (jeff@elda.fr) for more information or to participate in these surveys.

References

ALLEN, Jeffrey. 1999. Report on ELDA’s Survey of Language Resource User needs. In European Language Resources Association (ELRA) Newsletter, Vol. 4 No. 4, October–December 1999, pp. 8-9.

ALLEN, Jeffrey and Khalid CHOUKRI. 2000. Survey of Language Engineering needs: a Language Resources perspective. Paper to be presented at the Second International Language Resources and Evaluation Conference (LREC2000) to take place 31 May–2 June 2000, Athens, Greece.

Jeffrey Allen