LISA Home page [© 2010 • ISSN 1420-3693 • www.localization.org]

Designing Cross-Cultural Speech Applications

José Elizondo, Peter Crimmin and Paul Greiner, SpeechWorks International, Inc.

Speech recognition technology is changing our understanding of the computer-human interface. As a companion to the Internet and mobile phone technology, speech recognition gives users extended access to information and enables them to perform transactions anytime, anywhere. Speech offers a more natural, human quality to user interfaces. Its unique characteristics challenge technology providers and designers, especially for localization and internationalization issues.


Keeping it real

Attention to word order, semantics, tone of voice, inflection, and cadence, as well as social norms, can contribute to making interactions between humans and machines more like those between humans. Speech interfaces are new, but speech itself is not. Our brains process much more than bare words written on a page when listening to spoken language, and speech interfaces should feed those expectations. Consider the following:

“What is the first number?”
“5”
“And the second number?”

How should the last line be spoken? With the stress on ‘number’, “And the second number?” emphasizes that the second input is a number. With the stress on ‘second’, the same words contrast the ‘second’ number with the ‘first’. “Aaannnd… the second number?” stretches the word ‘and’ as a transition from the first collection. This last technique suggests the idea of a person writing the first number and transitioning to the second. The technique achieves two goals: it very subtly assures the user that the transaction is progressing as planned; and it implies the existence of an interactive personality.

The psychological confidence of users has a demonstrable effect on the accuracy of the speech recognizers. There is a self-fulfilling prophecy that the technology works better when users believe in the capabilities. Conversely, when users lose faith in the system, accuracy decreases. One way this happens is when users ‘hyper-articulate’ in the mistaken belief that a forced, unnatural manner of speaking will improve recognition: “Newark! Ennnn-Eeeee-DuhBulYouuuu-Aaaaa-Arrrr-Kayyyyy!” Thus, all designs must predict, influence, and support the user’s psychology.

When errors occur and the user’s discomfort and uncertainty grow, failures can spiral out of control: reduced recognition accuracy causes more frequent error-correction dialogues, which further increase frustration, extend telephone calls, and reduce success rates. Part of supporting the user’s psychology is defusing negative reactions to errors.

“For all other calls, say ‘pound.’”

Users of touch-tone systems expect the same menu structure with every call, and this predictability helps ‘repeat’ callers to become experts at rapid navigation. They quickly dial touch-tone sequences to perform the desired transactions. Some experts program their favorite sequences into the speed-dial feature of their telephones!

Callers speaking to live agents have different expectations. They know they can interrupt at any time, and their comments can deviate from the pre-ordained sequence of the agent’s script. After several calls with the same agent, both caller and agent form a social bond, and the dynamics of the conversation reflect this bond.

Users’ expectations for speech systems lie between their expectations for touch-tone systems and for live humans. Users expect menu structures that never change, and when speech replaces an existing touch-tone system they expect an exact match in the structure. Users also assume they can interrupt speech systems with tangential questions just as they do with live agents. By necessity, the interface design must allow for a degree of flexibility in the conversation because users will attempt deviations.

In a sense, users are caught between their expectations for machine automation and human spontaneity, and designers must borrow techniques from both realms. By transferring conventions from other realms to speech, the designer can leverage the user’s previous knowledge and facilitate learning.

But conventions can impede innovation. As the technology begins to handle undirected conversation, designers too must broaden their horizons. A technology called “How may I help you” (HMIHY) is already available in the marketplace. Just as a live agent would start a conversation with an open question like “How may I help you?”, this technology supports an unconstrained style of interaction where users can respond in a freer way. Designers cannot afford to lag behind the deployment curve. Consider the folly of designs like the following:

“For questions about your bill, say ‘one.’
To report service outages, say ‘two.’
For all other calls, say ‘pound.’”

This speech prompt was designed by a touch-tone designer, and an unimaginative one at that. To reach maturity, designs for speech interfaces need independence from inherited touch-tone conventions rather than conformance to them.

“Ham sandwich, ham sandwich, ham sandwich!”

Prompts must be clear and unambiguous and they must lead the user to provide needed information. The key factor is prompting in a natural manner. About six years ago, a test system was offering a voice menu of food items and querying the user to select one:

System: Your options are ham sandwich and pizza. Which one would you like?
User: Ham sandwich.

(At this point, the system is unable to hear the user’s input due to a noisy connection)

System: I’m sorry. I couldn’t hear you. Please repeat your selection.
User: Er… ham sandwich, ham sandwich, ham sandwich!

Humorously, the user says the choice again and again, but this was not the desired result of the prompt. An alternative wording was significantly more successful, “Please say your choice again.” This anecdote shows the difficulty of predicting, even in seemingly simple cases and even when dealing with only one language, how users will respond to prompts.

The way you phrase a question influences the answer you get. For example, compare the following:

“What is the destination of your flight?”
“Please say the arrival city and state for your flight.”

Which is better depends on the particular design situation. In response to “What is the destination of your flight?” the user might say, “I’m going to Boston” or “Boston Logan Airport.” The question feels natural and does not impose obvious constraints on the answer. However, it requires complementary design of the recognizer’s vocabulary and grammar to handle the variety of possible answers. If the user says “Portland”, the application might require an additional step to clarify “Portland, Maine” or “Portland, Oregon.”

On the other hand, “Please say the arrival city and state for your flight” is more directed and likely to elicit a more focused answer like “Boston, Massachusetts.” But how will users respond if they are flying to another country? What if they know the airport name but not the city? What if there is more than one airport in the destination city?
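The clarification step for an ambiguous answer such as “Portland” can be sketched in a few lines. This is only an illustration of the dialogue logic, not the authors’ implementation: the `CITY_STATES` table, the `next_prompt` function, and the prompt wordings are all hypothetical, and a real application would draw its city list from its own data and its recognizer’s grammar.

```python
# Hypothetical lookup table (assumption for this sketch): city name -> states.
CITY_STATES = {
    "Boston": ["Massachusetts"],
    "Portland": ["Maine", "Oregon"],
}


def next_prompt(city_heard):
    """Choose the follow-up prompt for a recognized city name."""
    city = city_heard.strip().title()
    states = CITY_STATES.get(city)
    if states is None:
        # Out-of-vocabulary answer: ask the open question again.
        return "Sorry, I didn't catch that. What is the destination of your flight?"
    if len(states) == 1:
        # Unambiguous: confirm and move on.
        return f"Okay, {city}, {states[0]}."
    # Ambiguous: add one clarification step listing the candidates.
    options = " or ".join(f"{city}, {state}" for state in states)
    return f"Is that {options}?"


print(next_prompt("Portland"))
print(next_prompt("boston"))
```

The extra turn is only taken when the answer is actually ambiguous, so callers who say “Boston” are not slowed down by a question they do not need.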

“Parlez-Vous Français?”

Multi-lingual speech systems often let users choose their preferred language at the beginning of a telephone call, and the system must ascertain the user’s preference.

Web pages can display a list of languages and let the user choose. The names persist on the screen for as long as the user cares to study them. One approach for speech is to offer the languages one by one:

“To continue in English, say English.”
“Para servicio en español, diga español…”

A sequential list might work well for few languages, but what if there are eight or ten? How will users feel when confronted with a series of messages that they do not understand? Will they hang up their phones assuming that the system does not handle their language? Or will they wait? Which languages should be spoken last? A mistake on a web page can be corrected with a simple click. But once a speech application begins communicating in the wrong language, the caller is at the mercy of the speech system’s intelligence. The system requires a clever design to allow for a graceful recovery.

Choosing a language from a list is an example of the “option menu,” a common component in automated telephone systems. For speech, a perfect translation of a list is not enough. Length (number of words) becomes critically important because speech is linear and transitory, and users must remember their options. The duration of each item on the list adds to the burden on the user’s memory. Technologies such as “barge-in” (allowing the user to interrupt messages) enable users to act immediately upon hearing a choice. But this technology alone is not a solution.

More than words

Differences in psychology, social convention, shades of meaning, and perceptions of time all present localization challenges. Speech recognition systems require a deep understanding of the user audience. Often, knowledge of the user’s purpose and psychology is transferable to new languages and geographies, but the social, political, and linguistic differences remain. Each of these areas presents design issues superimposed on the goals of the application and limited by the capabilities of the technology.

Automated translation tools are being used more successfully in certain contexts. Consider the voice portal, which is the telephone equivalent of a web site. Some of its content is dynamic (for example, horoscopes, news flashes, and email). No static translation is available for such content, and automated translation and text-to-speech technology can come to the rescue.

But automated translation tools cannot localize the speech interface itself. The script for the dialogue interaction still requires careful consideration of social and cultural issues, predictable patterns of speech, and complementary vocabulary and grammars for the recognizer.

Are you male or female?

In English, we can address another person with the genderless “you.” But other languages have stricter rules for gender agreement. Some require agreement with the speaker, others require agreement with the person being spoken to, and still others require both. The most difficult case is when agreement is determined by the gender of the user: how does the speech system ascertain the gender? It could explicitly ask, but users might find the question offensive, and at a minimum the question makes the system appear unintelligent. The user’s trust in the system might be lessened. The question might even be a source of legal liability if construed as preferential treatment for one gender over another.

Conventions for gender agreement can differ between written text and speech. In Spanish the words for “welcome” are “bienvenido” and “bienvenida” (male and female). In a printed form or in a web page you can print “bienvenido(a)” to include both cases, but in speech you would need to explicitly say both. However, a phrase like “Bienvenido o bienvenida al Banco XYZ” (“Male welcome or female welcome to Bank XYZ”), would not only sound stupid to users, awkwardly calling attention to the ignorance of the speech application; it would also violate the cultural norm of using the male form of the word as a neutral choice. The standard phrase “Bienvenido al Banco XYZ” is inclusive without being awkward or offensive.

When in Rome…

A faithful translation is not enough for a speech interface to work for a new language or culture. The function of the dialogue in these systems is not solely to communicate information. It is also to persuade or direct the user to respond in a specific way. At a minimum, questions need to be carefully worded in each language to produce a focused answer in that language.

While researching a bilingual application, we visited a call center to observe conversations between callers and human representatives. The calling populations for each language corresponded to groups from different cultures. We quickly noted that some calls were substantially longer than others. Callers would add unsolicited information and would engage in tangential conversations with the agents. Interestingly, this happened primarily with female callers talking to female agents. In response, we experimented with a deep male voice for our speech application, modeled after our observations of interactions with live male agents. Our tests showed that users were more focused and deviated less from the questions that were asked. (Keeping users engaged and avoiding digression is critical to speech applications. Recognizers only understand words they expect to hear; users must speak “in-vocabulary” words.) This case poses interesting questions about the impact of culture-specific gender dynamics on computer-human interactions. An example of research in this area is Cliff Nass’s work on the effect of gender on people’s liking and trust of different voices.

Pronunciation and politics

Speech applications also speak to callers and their choice of pronunciation has ramifications. Should the Spanish branch of a bilingual system in the United States speak city names using the Spanish pronunciation (“mee-ah-mee”, for Miami) or the anglicized version? Should it translate all city names to pure Spanish, even if they are better known in their English form (“Salt Lake City” vs. “Ciudad del Lago Salado”)? Should a system in Quebec, Canada speak street names with Francophone or Anglophone pronunciations (“pee-nuhf” or “pie nine” are real-life examples of how people pronounce the Montreal street “Pie-IX”)? Your choice could be interpreted as a political statement on a heated regional issue. You might appease one group of users while enraging another.

Finding your voice

One goal of a speech interface design is to create a persona with whom users identify and interact. The voice of the application communicates the persona with more than words; tone, inflection, rhythm, and pacing often carry more semantic information than the actual words spoken. As critical as this element is, the intelligent use of the voice faces numerous barriers:

  • There is little precedent for using subtleties of voice in traditional telephone applications (i.e. touch-tone systems). This leads to under-estimation and trivialization of the need.
  • Existing development processes in corporations are often streamlined in ways that prevent close contact between designers of the voice interface and the voice talents who make the recordings.
  • Companies buying speech systems may have preferred voice talents who are used as “branding” elements for their radio and television advertising. But great-sounding voices on radio and television may sound muffled and unclear when down-sampled to the narrow bandwidth of the telephone, and deep authoritative voices that are good for branding purposes can intimidate speech users.

If the voice is key to the personality of the system, then localization is key to the casting of the voice. A voice speaking with a sophisticated British accent would sound out of place in a system targeted to American users buying lawnmower parts. A Quebecois dialect might not be understood by a user in France. Mistakes like these can make the sponsoring company appear amateurish and uncaring. But the rules are not so clear as to suggest a policy of always matching the dialect and accent of target users. A strong regional accent might be part of a branding strategy, it might send a message of national identity, or it might engage users who also speak with heavy accents.

In addition to the recorded voice of a voice talent, automated text-to-speech (TTS) engines are used because information is often dynamic (it is not possible to predict and pre-record all needed messages). Just as the dialect and accent affect the choice and coaching of a voice talent, the languages and accents of TTS engines must be localized. Research on topics like emotional and expressive synthesized speech will need to include cultural sensitivity considerations to support localization for TTS.

TTS pron mgmt

Speech systems often rely on database information to produce pronunciations for TTS engines and to recognize words and sentence structures that are determined dynamically, in real time. A mismatch in character sets between the data and the speech engine can ruin the accuracy of a recognition engine. If accented characters are lost or mis-decoded along the way, the system might assign identical pronunciations to very different words (the French words “côte” and “coté” for example).
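A small sketch can make the character-set problem concrete. It uses Python’s standard `unicodedata` module; the helper names `normalize_key` and `strip_accents` are hypothetical, and `strip_accents` merely simulates a lossy pipeline rather than describing any particular speech product.

```python
import unicodedata


def normalize_key(text):
    """Normalize to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text):
    """Simulate a lossy pipeline that drops combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "c\u00f4te"     # "côte" stored with a precomposed ô
decomposed = "co\u0302te"  # the same word stored as o + combining circumflex

# The two spellings look identical but do not match until normalized.
print(composed == decomposed)
print(normalize_key(composed) == normalize_key(decomposed))

# Once accents are lost, two different words collapse into one entry.
print(strip_accents("côte") == strip_accents("coté"))
```

Normalizing text at the point where it is loaded, and before it reaches the grammar or the TTS engine, is one way to keep database entries and recognizer vocabulary in agreement.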

Abbreviations must also be handled in a localized fashion. Consider the Mexican name “Lic. Ma. Antonieta García Hdez” which abbreviates “Licenciada” (title), “María” (first name), and “Hernández” (second surname). Failing to expand these abbreviations before sending them to a text-to-speech engine or before loading them into a speech grammar may cause bad TTS pronunciations and poor recognition accuracy.
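The expansion step might look like the following minimal sketch. The `ABBREVIATIONS` table and the `expand_abbreviations` function are hypothetical and cover only the example above; a production table would be far larger, locale-specific, and would need context to resolve ambiguous cases (for instance, whether “Lic.” stands for “Licenciada” or “Licenciado”).

```python
# Illustrative table for this one example only (an assumption, not a real
# lexicon). Keys are lowercase with any trailing period removed.
ABBREVIATIONS = {
    "lic": "Licenciada",  # assumed feminine here, as in the example name
    "ma": "María",
    "hdez": "Hernández",
}


def expand_abbreviations(name, table=ABBREVIATIONS):
    """Expand known abbreviations token by token; leave other tokens as-is."""
    expanded = []
    for token in name.split():
        key = token.rstrip(".").lower()
        expanded.append(table.get(key, token))
    return " ".join(expanded)


print(expand_abbreviations("Lic. Ma. Antonieta García Hdez"))
```

Running the same expansion before both TTS output and grammar loading keeps what the system says consistent with what it expects to hear.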

Applications with a BANG!

People from different cultures can surf the web because conventions for navigation icons have become de facto standards. While not immune to failure, visual icons have great potential to bridge language gaps.

Similar conventions are forming for speech technologies too. This is an exciting time of exploration for interface designers as new ideas are emerging quickly and finding their way into deployed systems:

  • Audio icons are short “tag lines” that identify a company, slogan, or product.
  • Audio location markers are background audio elements that perform a role called “back-channel communication.” That is, they communicate at an almost subconscious level. The user asks for a weather report and the next prompt has the sound of a thunderstorm in the background. The back channel implicitly communicates that the user has arrived at the correct location in the application.

Audio icons and location markers present new layers for localization efforts. The field is wide-open for researchers to confirm or reject these and other new ideas, especially with respect to cross-linguistic issues. Research will provide the basis for de facto and real standards, and studies are underway. The European Telecommunications Standards Institute has established a group to develop a standard for the smallest common set of voice commands a speech system needs to be usable.

“No really, it’s great!”

The most successful speech systems are the ones that are enjoyable to use. But deploying such systems requires a careful study of user perceptions. Every good speech system involves usability tests to study the ease of use, progression of user learning, perceived speed of conversations, and emotional responses of users.

But conducting usability trials in multiple languages and cultures requires methodologies that address the needs of those locales. In some cultures or sub-groups, people avoid criticism and confrontation of any kind. Trials must be designed with this in mind. These audiences may provide indirect clues that can be understood with a clever strategy for interpreting results.

In one set of tests, we observed that Spanish-speaking participants were unanimous in their overly polite responses. They would make excuses for the system’s errors. Users who were unable to accomplish any of their assigned tasks and who were visibly frustrated by the system’s behavior spoke about the system in flattering terms. They were single-mindedly focused on giving positive responses. All of this contrasted with the responses of English-speaking participants, who were balanced in their praise and criticism.

The English-speaking participants entered the study with the expectation of a system that provided good service and that would speak all help messages in English. They assumed that their criticism would be used to improve the system.

The Spanish speakers had a very different mindset. They were accustomed to difficult systems that only speak English. They had come to expect second-rate service, and they were surprised and delighted that even help messages were spoken in Spanish. Having an interface in their own language was the overriding consideration, and they were immediately predisposed to enjoy the system and to rate it in a positive way.

In addition, culturally induced politeness and formality played a role in the Spanish speakers’ mindset. Participants tended to blame themselves for system mistakes. They would not report errors or problems for fear of showing incompetence or technology illiteracy. Some participants showed concern that the designers might get into trouble for design flaws or sub-optimal features.

A usability trial must consider how different populations give feedback. The tester must know how to interpret data obtained from different cultures. In some cases, alternative techniques are needed to elicit more truthful and complete opinions from the participants. In this sense, “localization” applies not just to the test, but also to the testing methods and the interpretation of results.

You say potato…

Every linguistic community across the globe has its own unique solutions to the task of communicating ideas. The most effective and natural-to-use speech interface designs are sensitive to the intricacies of these different linguistic systems. Localized speech interfaces should capture all the communicative signals of the target language, which go far beyond the bare words to include the manner of expression, cultural convention, and psychology. Anything less will distract the user from interacting naturally—and successfully!—with the system. Interface designers must not only understand what they want users to say, but also what their users are thinking when they say it.

About the authors

José L. Elizondo, International Projects Specialist
José taught at the Instituto Tecnológico y de Estudios Superiores de Monterrey in Guadalajara, Mexico as an Associate Professor of Scientific Research Methodology and Mathematics. In 1996, he joined the product development team at SpeechWorks. His most recent efforts focus on project management and user-interface design for multilingual speech-recognition systems, as well as internationalization issues for application development. He studied Electrical Engineering, Humanities and Music Composition at MIT and Harvard University.

Peter Crimmin, Senior Technical Writer
Peter Crimmin has a background in localization from his years at Digital Equipment Corp., where he helped define standards for international documentation. His work on Digital’s early graphical user interfaces prepared him for efforts with telephone touch-tone systems and, most recently, speech. He is currently in charge of SpeechWorks’ international documentation.

Paul Greiner, Applied Linguistics Engineer
Paul Greiner earned his doctoral degree in Germanic Linguistics from the University of California and has seven years of experience in user interface design for products ranging from toys, electronic learning aids, and other voice-activated consumer electronics to, currently, over-the-telephone automated speech recognition systems for SpeechWorks International. This includes multi-lingual and mono-lingual designs, covering seven languages.

About SpeechWorks International, Inc.

SpeechWorks is headquartered in Boston, Massachusetts and has offices around the world. Complementing the self-service model of e-business, SpeechWorks speech solutions, including the SpeechSite™ product, Speechify™ text-to-speech engine and SpeechSecure™ module, let customers direct their calls, obtain information and complete transactions automatically, simply by speaking naturally over any phone, anytime.

SpeechWorks customers include some of the world’s most sophisticated customer service innovators such as America Online, Continental Airlines, E*Trade, Microsoft and United Airlines. SpeechWorks strategic partners include leading corporations such as AT&T, Avaya, Dialogic, an Intel Company, InterVoice-Brite, and Net2Phone.


SpeechWorks International, Inc.
695 Atlantic Avenue
Boston, Massachusetts 02111
Tel: +1-617-428-4444
Fax: +1-617-428-1122
http://www.speechworks.com



