© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org
Globalization of Voice Applications: It’s Only the Beginning! (part 2)
Installment 2 of 2

Ashish Vora and Curtis Tuckey, Voice Laboratory, Oracle Corporation

Globalizing software (creating software for multiple languages and locales), and the follow-on process of localization, is challenging enough for “normal” software products and not-too-complex web sites. However, when it comes to one of the “newest kids on the block,” voice-enabled applications, the fun really begins.

There are only a handful of voice technology providers who have attempted to create globalized solutions, and Oracle Corporation is one of them. Recently, LISA interviewed Curtis Tuckey, Director, and Ashish Vora, Senior Speech Applications Engineer, at Oracle’s Voice Laboratory in Chicago in the U.S., to gain insight into their vision for voice application globalization. In installment one, the two men outlined Oracle’s voice applications strategy, as well as the business and technical challenges that lie ahead.

In the second installment in this issue of the Globalization Insider, they:


  1. outline current trends in voice applications standards;
  2. describe the very real challenges presented by voice application globalization;
  3. and provide recommendations for content creators and localization vendors who are preparing to become preferred service providers to voice applications developers.

If you would like to meet Curtis Tuckey or Ashish Vora in person to increase your knowledge of voice-enabled applications, plan to attend their presentations at the LISA FORUM EUROPE, “Managing Content - Moving Markets: Streamlining Global Workflow Through Content Management,” to be held in London from June 30 to July 3, 2003.

What standards exist in the voice applications industry? What groups are driving these standards?

Unfortunately, there are very few standards in voice applications development. Within the Internet application space, we have started to see an effort at standardization driven by the emergence of markup-driven application development languages such as VoiceXML. The VoiceXML specification incorporates aspects of several other specifications, including the Speech Recognition Grammar Specification (SRGS), Semantic Interpretation of SRGS (SI) and the Speech Synthesis Markup Language (SSML). There is another proposal for a voice application development language, Speech Application Language Tags (SALT), that is being pushed by Microsoft. Additionally, there are a variety of standards and specifications for lower-level, telephony-related issues, including Call Control XML (CCXML), Session Initiation Protocol (SIP), Parlay, Java APIs for Integrated Networks (JAIN), etc.

There are a variety of groups driving these standards efforts. Most of the VoiceXML efforts (as well as CCXML) are being driven by the W3C. The W3C is also actively exploring the creation of a new language for multimodal application development (applications with both visual and voice-based interfaces). VoiceXML will form a significant part of this new language. Microsoft’s SALT proposals are being driven by the SALT Forum. Many of the various telephony specifications have their own working groups helping to drive the definition of the specification. For example, SIP is being driven by the SIP Forum, Parlay is organized by the Parlay Group and JAIN is being led by the Java Community Process.

What standards does Oracle support in this field and why?

Oracle supports a number of standards in the field of voice applications. The Oracle9iAS platform supports all of the W3C proposals in the context of voice application development. We are active participants in working groups related to VoiceXML Interoperability and Conformance. We feel strongly that VoiceXML provides a good model for application development and that there is a large enough development community behind the specification to ensure its success.

Why does Oracle view globalization as one of the critical driving factors in the adoption of voice-enabled applications?

Globalization is a driving factor in the adoption of voice-enabled applications quite simply because having more applications available in more languages increases the reach of any software offering. More specifically, there are two main reasons to treat globalization as a critical factor in voice applications:

  1. As stated earlier, voice applications are unique in that they turn any type of phone into an Internet device, with no requirements for network connectivity or device processing power. As such, voice applications have the capability to greatly democratize access to information, even more so than the visual Internet has been able to do. The potential for such a wide audience for applications necessitates the globalization of these applications – it is simply not reasonable to expect all users to be able to communicate in English (or any other particular language for that matter). Furthermore, even those with fluency in a particular language may have trouble interacting with voice applications, as many of the recognition models for ASR are built upon speech samples of native speakers, not simply fluent speakers.
  2. Having a global software offering exposes voice applications to new markets altogether and will help educate and create greater mindshare among new types of users. This in turn will help drive further adoption of voice applications in the future. As stated earlier, voice is the most natural medium for communication, but its success as an application modality will be determined by how effectively it is embraced by users. The only way to make sure that the user experience of voice applications is effective is to expose it to the widest array of users possible.

What are the shortcomings of current internationalization/localization practices as applied to voice apps?

Without going into too much technical detail (please refer to Globalization of Voice Applications: Issues, Approaches and Challenges for the Future for a more in-depth treatment of this question), current internationalization/localization practices for screen-based applications within Oracle follow four main guidelines:

  • All user interface components (resources) are separated from the application functionality. These resources are then placed in an external resource bundle that can be easily translated.
  • Interactions that rely on freeform user input are minimized since they are much harder to deal with in the context of globalization.
  • Text incorporated into binary resource files such as images, icons, etc. is minimized (with the goal of total elimination) since translating such text often requires a complete re-implementation of the binary file for each language.
  • Output resources that concatenate static and dynamic information together are minimized. If these types of resources are required, placeholders are used in the resource string to denote where dynamic data is to be inserted at runtime.
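
The last guideline can be illustrated with a minimal sketch. The bundle layout, the resource key and the `render` helper below are hypothetical and are not taken from any Oracle product; the point is only that named placeholders let a translator reorder dynamic data to suit the target language.

```python
# Hypothetical resource bundles: one dictionary per locale, keyed by resource ID.
# Named placeholders mark where dynamic data is inserted at runtime.
BUNDLES = {
    "en": {"flight.status": "Flight {number} departs at {time}."},
    "de": {"flight.status": "Um {time} startet Flug {number}."},
}


def render(locale, key, **values):
    """Look up a resource string and fill in its named placeholders."""
    template = BUNDLES[locale][key]
    return template.format(**values)


print(render("en", "flight.status", number="212", time="9:40"))
print(render("de", "flight.status", number="212", time="9:40"))
```

Because the application code never concatenates fragments itself, the German entry is free to put the time before the flight number.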

For voice application development, two of these guidelines fall short: the guideline to minimize freeform user input and the guideline to minimize text in binary resource files. Many interactions in voice applications approximate freeform user input because a variety of different inputs (synonyms) often map to a particular behavior. As voice applications become more sophisticated and adopt more conversational interfaces, this problem is exacerbated, because the application must semantically evaluate the input passed to it. As for binary resource files, many voice applications output content to users through professionally recorded audio files, which are analogous to image files in the complexity of their translation.

Beyond these shortcomings, voice applications also create certain new requirements. Foremost among these is the need to properly present all data to the user in a way that achieves maximum understandability. Because there is no visual or spatial awareness associated with a voice application, it is imperative that voice applications properly format content so that it is free of abbreviations and symbols that may have ambiguous pronunciations. This is especially true for various types of content that need to be presented in a locale-aware fashion – dates, times, currencies, etc. In screen-based applications, this content must be formatted according to the conventions of a particular locale, but often this simply affects the ordering of elements. The final representation of the information still relies on numeric and symbolic information that a user can interpret when viewing it, e.g., a string written as “6/2/2003” can be interpreted as a date that means the second day of June in the U.S. versus the sixth day of February in Great Britain. For voice applications, this level of formatting is insufficient.
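
As a rough sketch of what "sufficient" formatting might look like for the date example above, the following code expands the same string into spoken text for two locales. All names here (`FIELD_ORDER`, `speak_date` and so on) are illustrative assumptions, and the phrasing rules are simplified: real locale conventions, including how a year is read aloud, are more varied than this.

```python
# Illustrative only: expands a numeric date into fully spelled-out English text.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}
# Assumed field order of a slash-separated date in each locale.
FIELD_ORDER = {"en-US": ("month", "day", "year"),
               "en-GB": ("day", "month", "year")}


def words(n):
    """Spell out an integer from 0 to 9999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + words(rest) if rest else "")


def ordinal(n):
    """Spell out an ordinal by rewriting the last word of the cardinal."""
    head, sep, last = words(n).rpartition("-")
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last


def speak_date(text, locale):
    """Expand a date such as '6/2/2003' using the locale's field order."""
    numbers = (int(part) for part in text.split("/"))
    fields = dict(zip(FIELD_ORDER[locale], numbers))
    month = MONTHS[fields["month"] - 1]
    day = ordinal(fields["day"])
    year = words(fields["year"])
    if locale == "en-US":
        return f"{month} {day}, {year}"
    return f"the {day} of {month}, {year}"


print(speak_date("6/2/2003", "en-US"))  # June second, two thousand three
print(speak_date("6/2/2003", "en-GB"))  # the sixth of February, two thousand three
```

The same input yields two different spoken dates, which is why a voice application cannot simply pass symbolic data through to the user.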

There is also a larger variety of platforms for which to write speech applications than for screen-based Internet applications. If your goal is to write a truly portable voice application, this fact, combined with the variations in how different platform providers implement the VoiceXML specification, presents a huge set of challenges. Even if you plan to run against only a single VoiceXML browser, variations in the underlying ASR and TTS engines can cause your application to deliver a different user experience or, in the worst case, not to work at all.

How is Oracle addressing these shortcomings?

Oracle has defined the Voice Globalization Framework to cover two main aspects: application output and application input. On the application output side, Oracle has put together two pieces of technology. The first is the Structured Datatype Expansion Framework (SDEF), which takes values of various primitive datatypes as input and formats each as the fully spelled-out, correctly localized interpretation of that datatype. The SDEF allows application developers to write applications free of abbreviations and symbolic representations of data. Once content has been expanded correctly, the remaining task is to associate pre-recorded audio files with that content to achieve a professional-sounding interface. To accomplish this, Oracle has created the Concatenative Speech Server (CSS), a domain-specific text-to-speech synthesis system. Application developers create application- or domain-specific libraries that contain mappings between text strings and audio files. The CSS can then use these mappings to match strings of textual content and replace them with the matching audio file reference.
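
The idea of mapping text strings to audio files can be sketched as follows. This is not the CSS implementation: the library contents, the file paths, the greedy longest-match strategy and the fallback to general text-to-speech for unmatched words are all assumptions made for illustration.

```python
# Hypothetical domain-specific library: text phrases mapped to recorded audio.
LIBRARY = {
    "your flight departs at": "audio/flight_departs_at.wav",
    "nine": "audio/nine.wav",
    "forty": "audio/forty.wav",
}


def to_prompts(text, library):
    """Replace the longest matching phrases with audio file references.

    Words with no recording are passed through for general TTS (an assumption).
    """
    tokens = text.lower().split()
    prompts = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in library:
                prompts.append(("audio", library[phrase]))
                i = j
                break
        else:
            prompts.append(("tts", tokens[i]))
            i += 1
    return prompts


for prompt in to_prompts("Your flight departs at nine forty today", LIBRARY):
    print(prompt)
```

Matching the longest phrase first favors whole-sentence recordings, which tend to sound more natural than word-by-word concatenation.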

Voice application input presents its own set of challenges. Again, more detailed information on these issues is provided in the white paper. Voice application input is facilitated through the use of grammars, which are codified representations of the words or phrases that a user may speak. There are a variety of grammars that require internationalization; for the initial version of the Voice Globalization Framework, Oracle decided to address one of these: the VoiceXML Builtin Grammars.

In an effort to simplify voice application development, the VoiceXML specification defines input grammars for a handful of basic datatypes such as dates, times, numbers, digits, etc. Unfortunately, the implementation of these builtin grammars varies greatly from one VoiceXML platform to another. Furthermore, the specification provides very little direction on how these grammars are to be handled for other languages. Therefore, in an effort to create some standardization around this, Oracle has created the Oracle Global Builtin Grammars (OGBG), which enforce a standard set of functionality on the builtin grammars, both across VoiceXML platforms and across languages.
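
To make the goal concrete, here is a toy sketch of a "digits" grammar that returns the same result regardless of the language spoken. It does not reflect how the OGBG are implemented; the vocabulary tables and the `interpret_digits` function are hypothetical, and a real grammar would be expressed in a grammar format understood by the ASR engine rather than in application code.

```python
# Hypothetical per-locale vocabularies for a "digits" grammar.
DIGIT_WORDS = {
    "en-US": {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"},
    "es-ES": {"cero": "0", "uno": "1", "dos": "2", "tres": "3", "cuatro": "4",
              "cinco": "5", "seis": "6", "siete": "7", "ocho": "8",
              "nueve": "9"},
}


def interpret_digits(utterance, locale):
    """Return the digit string for an utterance, or None if out of grammar."""
    vocabulary = DIGIT_WORDS[locale]
    digits = []
    for word in utterance.lower().split():
        if word not in vocabulary:
            return None
        digits.append(vocabulary[word])
    return "".join(digits)


print(interpret_digits("four oh one", "en-US"))
print(interpret_digits("cuatro cero uno", "es-ES"))
```

Both calls yield the same digit string, so the application logic that consumes the result does not need to know which language the caller used.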

There has been on-going work in Natural Language Processing (NLP) for many years. In an ideal world, what should NLP technology be able to deliver for globalized voice-enabled applications?

In an ideal world, the promise of NLP is conversational voice interfaces with a minimal amount of effort required to constrain the types of input at application development time. Thus, an application developer could write an application without really knowing what a user might say, and the NLP processing engine would be able to recognize arbitrary speech and perform some useful instructions based on its recognition results.

Unfortunately, the reality is that NLP is a really difficult problem, and we have yet to see it done in an effective way, even for English. Expanding the complexity of this problem to many other languages only increases the challenges that NLP researchers face, but we certainly look forward to breakthroughs in this space in the years to come.

What recommendations can you provide to content creators and localization vendors to enable them to become preferred vendors to voice applications developers?

Here’s what they can do to prepare:

  1. Develop linguistic and grammar expertise for localization of input and output grammars. This is by far the most challenging aspect of the globalization process for voice applications, so having a vendor that understands these issues is critical.
  2. Develop processes that allow the developers to provide context for translatable resources, particularly those resources that concatenate static and dynamic information together. Make sure that your processes maintain this context once translators start work.
  3. Participate in the testing of these applications and create a process that allows for a feedback loop. Voice applications require a fair amount of tuning to ensure their success. For example, the output of voice applications is often directly tied to the input phrases. Testing may reveal that an accepted input phrase causes many misrecognitions in the ASR engine and therefore needs to be modified; when this happens, the output phrase corresponding to that input phrase must be changed as well. Without a feedback loop, the application will not be very usable.
  4. Once a localization vendor has developed experience with voice application localization, we strongly encourage their involvement from the earliest stages of the voice application design phase to ensure that the designs that are implemented are truly localizable.
  5. For content providers, we have one more recommendation: for structured content feeds (e.g., weather, stock quotes, etc.), provide data in regular data formats, most preferably XML. This will provide the greatest flexibility for utilizing the other aspects of the Voice Globalization Framework.

What can LISA do to help bridge the gap between all of the various stakeholders (platform providers, voice applications developers, NLP researchers, content and localization vendors, etc.)?

We think LISA is in an excellent position to help drive innovation among the different groups that interact with the voice application development process. In particular, we would like to see LISA take an active role in the following areas:

  1. There are currently a variety of standards being proposed by the voice technology community for how to represent input and output grammars for maximum translatability. It would be good to have LISA members involved in the definition of these. This will become especially critical as interest in conversational and natural language interfaces increases.
  2. Once we have a defined representation for input and output grammars, how do we translate them? To put it another way, to construct a grammar requires knowledge of three pieces of information: a semantic concept (what the grammar means), the words spoken that map to this concept (how to present the grammar), and the syntactic description of the grammar (how the grammar is represented in code). From a translatability standpoint, we need to figure out how to represent these three pieces and then successfully translate the latter two. We need linguists to study this problem.
  3. There are several shortcomings with existing internationalization schemes as defined by various programming languages (see Translator X, “What Planet Are They On?” Globalization Insider, Vol. XII, No. 2.3). One of the biggest problems we see at Oracle is that it is extremely difficult to provide context for strings in resource bundles. In Java, this context can only be provided as comments incidental to the actual resource string, rather than as an integral part of it. As a result, if any type of preprocessing is used to extract strings from the resource bundle, the context is often stripped away. We would like to see LISA help coordinate an industry standard format for resource bundles (perhaps similar to XLIFF) that includes better support for contextual information.
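
The third point can be illustrated with a small sketch in which context is a field of the resource itself rather than a comment beside it. The `Resource` type, the bundle contents and the extraction function are hypothetical and do not correspond to any existing standard or Oracle format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A translatable string that carries its own context (hypothetical)."""
    text: str
    context: str


BUNDLE = {
    "balance.prompt": Resource(
        text="Your balance is {amount}.",
        context="Spoken after login; {amount} is a spelled-out currency value.",
    ),
}


def extract_for_translation(bundle):
    """Build translation jobs; context travels with each string."""
    return [
        {"id": key, "source": resource.text, "note": resource.context}
        for key, resource in bundle.items()
    ]


for job in extract_for_translation(BUNDLE):
    print(job)
```

Because the context is data rather than a comment, a preprocessing step cannot strip it away without also changing the structure of the resource.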

Are there additional requirements for support organizations when supporting global voice-enabled applications?

From a production standpoint, there are not any special requirements for supporting global voice-enabled applications. But, when it comes to deployment of these applications, there can be some requirements placed on support organizations. For example, though there is no special configuration required of the actual application server, a support organization will have to build up the voice gateway infrastructure in each of the target languages. Additionally, once an application goes live, it is necessary to have help desks trained on the various language-specific versions.


For more detailed technical information, please consult Globalization of Voice Applications: Issues, Approaches and Challenges for the Future, a white paper by Ashish Vora, available only on the LISA web site.


Curtis Tuckey is Director of the Voice Laboratory at Oracle Corporation. Before joining Oracle, he held various research and development positions at Motorola, Lucent Technologies, AT&T and General Motors. He holds a Ph.D. in mathematics from the University of Wisconsin and can be reached at curtis.tuckey@oracle.com.

Ashish Vora is a Senior Speech Applications Engineer in the Voice Laboratory at Oracle Corporation. He has developed a set of voice applications that ship with Oracle9i Application Server Wireless & Voice, co-authored an integration and acceptance process for voice gateway vendors and created an architecture to simplify the globalization of voice applications. He holds a B.S. degree in Computer Science from Stanford University and can be reached at ashish.vora@oracle.com.



