Speech Technology Overview


What Is Speech Technology?

In the mid- to late 1990s, personal computers became powerful enough for people to speak to them and for the computers to speak back. Today, the technology still falls well short of the natural, unstructured, human-sounding conversations envisioned in 2001: A Space Odyssey and Star Trek.

However, speech technology is delivering some very real benefits in real applications right now. For example:
  • Many large companies have started adding speech recognition to their IVR systems. Just by phoning a number and speaking, users can, for example, buy and sell stocks with Charles Schwab, check flight information with United Airlines, or order goods from Office Depot. The systems respond using a combination of pre-recorded prompts and an artificially generated voice.
  • Microsoft Office XP users in the United States, Japan, and China can dictate text into Microsoft Word or PowerPoint® documents. Users can also issue commands and control menus by speaking. For many users, particularly in Japan and China, dictating is far quicker and easier than using a keyboard. Office XP can speak back, too. Excel, for example, reads back text as the user enters it into cells, which saves users from checking back and forth between screen and paper.
The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Speech Recognition


Speech recognition, or speech-to-text, involves capturing and digitizing the sound waves, converting them to basic language units or phonemes, constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right). The figure below illustrates this high-level description of the process.


Figure 1 - Speech recognition process flow
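
For illustration, the last two stages of this flow can be sketched in a few lines of Python. Everything in the sketch is invented for the example: the phoneme key "rayt", the sound-alike table, and the bigram scores stand in for the acoustic and language models a real recognizer would use.

    # Toy sketch of the word-construction and contextual-analysis stages of
    # Figure 1. The sound-alike table and bigram scores are invented; a real
    # engine relies on trained acoustic and statistical language models.
    SOUND_ALIKE = {"rayt": ["write", "right"]}                 # hypothetical phoneme key
    BIGRAMS = {("to", "write"): 0.9, ("turn", "right"): 0.9}   # invented probabilities

    def words_from_phonemes(phoneme_groups):
        """Map each phoneme group to its candidate spellings."""
        return [SOUND_ALIKE.get(p, [p]) for p in phoneme_groups]

    def contextual_analysis(candidates):
        """Pick the spelling that best fits the preceding word."""
        chosen = []
        for options in candidates:
            prev = chosen[-1] if chosen else ""
            chosen.append(max(options, key=lambda w: BIGRAMS.get((prev, w), 0.0)))
        return chosen

    # Same phonemes, different spellings once context is considered.
    print(contextual_analysis(words_from_phonemes(["i", "want", "to", "rayt"])))
    print(contextual_analysis(words_from_phonemes(["please", "turn", "rayt"])))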


Recognizers, also referred to as speech recognition engines, are the software drivers that convert the acoustical signal to a digital signal and deliver recognized speech as text to your application. Most recognizers support continuous speech, meaning you can speak naturally into a microphone at the speed of most conversations. Isolated or discrete speech recognizers require the user to pause after each word, and are currently being replaced by continuous speech engines.

Continuous speech recognition engines currently support two modes of speech recognition:
  • Dictation, in which the user enters data by reading directly to the computer.
  • Command and control, in which the user initiates actions by speaking commands or asking questions.
Dictation mode allows users to dictate memos, letters, and e-mail messages, as well as to enter data using a speech recognition dictation engine. The possibilities for what can be recognized are limited by the size of the recognizer's "grammar" or dictionary of words. Most recognizers that support dictation mode are speaker-dependent, meaning that accuracy varies on the basis of the user's speaking patterns and accent. To ensure accurate recognition, the application must create or access a "speaker profile" that includes a detailed map of the user's speech patterns used in the matching process during recognition.

Command and control mode offers developers the easiest implementation of a speech interface in an existing application. In command and control mode, the grammar (or list of recognized words) can be limited to the list of available commands, a much smaller scope than that of continuous dictation grammars, which must encompass nearly the entire dictionary. This provides better accuracy and performance, and reduces the processing overhead required by the application. The limited grammar also enables speaker-independent processing, eliminating the need for speaker profiles or "training" of the recognizer.
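
A minimal sketch shows why a small grammar helps: with the recognizable phrases limited to a handful of commands, anything else can simply be rejected rather than forced into the nearest-sounding word. The command list and the printed actions below are hypothetical, not taken from any particular engine.

    # Minimal command-and-control dispatch. The grammar is just the keys of
    # this dictionary, so the matcher only chooses among a few phrases.
    COMMANDS = {
        "open file": lambda: print("opening the file dialog"),
        "save file": lambda: print("saving the document"),
        "check spelling": lambda: print("running the spell checker"),
    }

    def handle_utterance(text):
        """Dispatch recognized text; anything outside the grammar is rejected."""
        action = COMMANDS.get(text.strip().lower())
        if action is None:
            print("not in grammar:", text)
            return False
        action()
        return True

    handle_utterance("Save File")        # matches a command in the grammar
    handle_utterance("buy some stocks")  # rejected: outside the grammar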

Speech recognition technology enables developers to include the following features in their applications:
  • Hands-free computing as an alternative to the keyboard, or to allow the application to be used in environments where a keyboard is impractical (for example, small mobile devices, AutoPCs, or mobile phones).
  • A more "human" computer that users can talk to may make educational and entertainment applications seem friendlier and more realistic.
  • Voice responses to message boxes and wizard screens can easily be designed into an application.
  • Streamlined access to application controls and large lists enables a user to speak any one item from a list or any command from a potentially huge set of commands without having to navigate through several dialog boxes or cascading menus.
  • Speech-activated macros let a user speak a natural word or phrase rather than use the keyboard or a command to activate a macro. For example, saying "Spell check the paragraph" is easier for most users to remember than the CTRL+F5 key combination.
  • Situational dialogs are possible between the user and the computer in which the computer asks the user, "What do you want to do?" and branches according to the reply. For example, the user might reply, "I want to book a flight from New York to Boston." The computer analyzes the reply, clarifies any ambiguous words ("Did you say New York?"), and then asks for any information that the user did not supply, such as "What day and time do you want to leave?"
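
To make the situational-dialog idea concrete, here is a minimal slot-filling sketch. The slot names and the crude pattern matching are invented for illustration; a real application would rely on the recognizer's grammars and a proper parser.

    # Slot-filling sketch for the flight-booking dialog described above.
    import re

    slots = {"origin": None, "destination": None, "departure": None}

    def parse_reply(reply):
        """Fill whatever slots the user's reply happens to mention."""
        m = re.search(r"from (\w+(?: \w+)?) to (\w+(?: \w+)?)", reply, re.IGNORECASE)
        if m:
            slots["origin"], slots["destination"] = m.group(1), m.group(2)

    def next_question():
        """Ask only for the information the user has not yet supplied."""
        if slots["origin"] is None:
            return "Where are you flying from?"
        if slots["departure"] is None:
            return "What day and time do you want to leave?"
        return "Booking the flight."

    parse_reply("I want to book a flight from New York to Boston")
    print(next_question())  # -> "What day and time do you want to leave?"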



Potential Applications for Speech Recognition


The specific use of speech recognition technology will depend on the application. Some target applications that are good candidates for integrating speech recognition include:

Games and Edutainment

Speech recognition offers game and edutainment developers the potential to bring their applications to a new level of play. With games, for example, traditional computer-based characters could evolve into characters that the user can actually talk to.

While speech recognition enhances the realism and fun in many computer games, it also provides a useful alternative to keyboard-based control, and voice commands provide new freedom for the user in any sort of application, from entertainment to office productivity.

Data Entry

Applications that require users to key paper-based data into the computer (such as database front ends and spreadsheets) are good candidates for speech recognition. Reading data directly to the computer is much easier for most users and can significantly speed up data entry.

While speech recognition technology cannot effectively be used to enter names, it can enter numbers or items selected from a small (fewer than 100 items) list. Some recognizers can even handle spelling fairly well. If an application has fields with mutually exclusive data types (for example, one field allows "male" or "female", another is for age, and a third is for city), the speech recognition engine can process the spoken input and automatically determine which field to fill in.
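
A small sketch of that routing idea follows. The field names, the age range, and the city list are invented; the point is only that a value which is valid for exactly one field can be placed without asking the user which field was meant.

    # Route a recognized value to the only field it can belong to.
    CITIES = {"boston", "new york", "seattle"}   # stand-in for a short pick list

    def route_value(text, record):
        value = text.strip().lower()
        if value in ("male", "female"):
            record["gender"] = value
        elif value.isdigit() and 0 < int(value) < 130:
            record["age"] = int(value)
        elif value in CITIES:
            record["city"] = value
        else:
            raise ValueError("value does not fit any field: " + text)
        return record

    record = {}
    for utterance in ("female", "42", "Boston"):
        route_value(utterance, record)
    print(record)  # {'gender': 'female', 'age': 42, 'city': 'boston'}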

Document Editing

This is a scenario in which one or both modes of speech recognition could be used to dramatically improve productivity. Dictation would allow users to dictate entire documents without typing. Command and control would allow users to modify formatting or change views without using the mouse or keyboard. For example, a word processor might provide commands like "bold", "italic", "change to Times New Roman font", "use bullet list text style," and "use 18 point type." A paint package might have "select eraser" or "choose a wider brush."
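
Commands like these can carry parameters, so a command-and-control grammar often pairs a phrase pattern with the value it supplies. The pattern list below is a hypothetical sketch, not the grammar format of any particular recognizer.

    # Interpret parameterized formatting commands spoken by the user.
    import re

    COMMAND_PATTERNS = [
        (re.compile(r"^change to (.+) font$", re.IGNORECASE), "font"),
        (re.compile(r"^use (\d+) point type$", re.IGNORECASE), "size"),
        (re.compile(r"^(bold|italic)$", re.IGNORECASE), "style"),
    ]

    def interpret(command):
        """Return a (property, value) pair for a spoken formatting command."""
        for pattern, prop in COMMAND_PATTERNS:
            m = pattern.match(command.strip())
            if m:
                return prop, m.group(1)
        return None

    print(interpret("change to Times New Roman font"))  # ('font', 'Times New Roman')
    print(interpret("use 18 point type"))               # ('size', '18')
    print(interpret("bold"))                            # ('style', 'bold')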




Speech Synthesis


Speech Synthesis, or text-to-speech, is the process of converting text into spoken language. This involves breaking down the words into phonemes; analyzing for special handling of text such as numbers, currency amounts, inflection, and punctuation; and generating the digital audio for playback. There are additional functions that the synthesizer performs, but Figure 2 below illustrates this high-level description of the process.


Figure 2 - Text-to-speech process flow
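
The "special handling of text" step in this flow is often called text normalization. The toy normalizer below expands bare digits and a simple dollar amount into words before phoneme conversion; the rules are deliberately minimal and purely illustrative.

    # Toy text normalization: expand numbers and currency before synthesis.
    UNITS = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]

    def say_digits(number_text):
        """Read an integer digit by digit (a real normalizer does far more)."""
        return " ".join(UNITS[int(d)] for d in number_text)

    def normalize(text):
        words = []
        for token in text.split():
            if token.startswith("$") and token[1:].isdigit():
                words.append(say_digits(token[1:]) + " dollars")
            elif token.isdigit():
                words.append(say_digits(token))
            else:
                words.append(token)
        return " ".join(words)

    print(normalize("your balance is $42"))  # "your balance is four two dollars"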


Software drivers called synthesizers, or text-to-speech voices, perform speech synthesis, handling the complexity of converting text and generating spoken language. A text-to-speech voice generates sounds similar to those created by human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. Although easy to understand, the voice produced by synthesis technology tends to sound less human than a voice reproduced by a digital recording.

Nevertheless, text-to-speech applications may be the better alternative in situations where a digital audio recording is inadequate or impractical. Generally, consider using text-to-speech when:
  • Audio recordings are too large to store on disk or expensive to record.
  • The application response requires short phrases.
  • The application cannot predict what it will need to communicate, or alternative responses vary too much to record and store all possibilities. For example, speaking the time is a good use for text-to-speech, because the effort and storage involved in recording and concatenating every possible time would be unmanageable.
  • The user prefers or requires audible feedback or notification. For example, audible proofreading of text and numbers helps the user catch typing errors missed by visual proofreading.
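
The time-of-day case is easy to sketch: the sentence is built at run time and handed to whatever synthesizer is available. The example below assumes the third-party pyttsx3 package, which is not mentioned in this article; any engine with a "speak this string" call would serve.

    # Speak dynamically generated text that could not be recorded in advance.
    import datetime
    import pyttsx3   # assumed third-party text-to-speech wrapper

    def speak_time():
        sentence = datetime.datetime.now().strftime("The time is %I:%M %p")
        engine = pyttsx3.init()   # use the default installed voice
        engine.say(sentence)      # queue the dynamically built sentence
        engine.runAndWait()       # block until playback finishes

    speak_time()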



Potential Applications for Speech Synthesis


The specific use of text-to-speech will depend on the application. Here are some potential applications:

Games and Edutainment

Text-to-speech allows the characters in an application to "talk" back to the user instead of displaying speech balloons. Although it is also possible to use digital recordings of the speech, an application could use text-to-speech instead of recordings in the following cases:
  • Concatenated word/phrase text-to-speech can replace recorded sentences; the application designer simply passes the desired sentence strings to the text-to-speech engine.
  • The inevitably non-human quality of synthesized text-to-speech makes it ideal for character voices that are supposed to be robots or aliens.
  • If the application cannot afford to have recordings of all the possible dialogs or if the dialogs cannot be recorded ahead of time, then text-to-speech is the only alternative.
