Speech Technology Overview
What Is Speech Technology?
In the mid- to late 1990s, personal computers started to become
powerful enough to make it possible for people to speak to them
and for the computers to speak back. Today, the technology is
still a long way from delivering natural, unstructured
conversations with computers that sound like humans, as envisioned
in 2001: A Space Odyssey and Star Trek.
However, speech technology is delivering some very real benefits
in real applications right now. For example:
- Many large
companies have started adding speech recognition to their IVR
systems. Just by phoning a number and speaking, users can, for
example, buy and sell stocks from Charles Schwab, check flight
information with United Airlines, or order goods from Office
Depot. The systems respond using a combination of pre-recorded
prompts and an artificially generated voice.
- Microsoft
Office XP users in the United States, Japan, and China can
dictate text to Microsoft Word or PowerPoint® documents.
Users can also issue commands and control menus by
speaking. For many users, particularly in Japan and China,
dictating is far quicker and easier than using a keyboard.
Office XP will speak back too. Excel, for example, reads back
text as the user enters it into cells. This saves users from
checking back and forth between the screen and paper.
The two key underlying technologies
behind these advances are speech recognition (SR) and
text-to-speech synthesis (TTS).
Speech Recognition
Speech recognition, or speech-to-text, involves capturing and
digitizing the sound waves, converting them to basic language
units or phonemes, constructing words from phonemes, and
contextually analyzing the words to ensure correct spelling for
words that sound alike (such as write and right). The figure below
illustrates this high-level description of the process.

Figure 1 - Speech recognition process flow
Recognizers, also referred to as speech recognition engines, are
the software drivers that convert the acoustic signal into a
digital signal and deliver recognized speech as text to your
application. Most recognizers support continuous speech, meaning
you can speak naturally into a microphone at the speed of most
conversations. Isolated or discrete speech recognizers require the
user to pause after each word, and are currently being replaced by
continuous speech engines.
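To make this concrete, the short Python sketch below captures one continuous utterance from a microphone and asks a recognition engine to turn it into text. It assumes the third-party SpeechRecognition package (with a PyAudio microphone backend) is installed; that package and its Google Web Speech backend are assumptions of the sketch, not part of the technology described above.

    # Capture one continuous utterance and convert it to text.
    # Assumes the third-party SpeechRecognition package and a PyAudio-backed microphone.
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    with sr.Microphone() as source:
        # Sample background noise briefly so the engine can set an energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        print("Speak now...")
        audio = recognizer.listen(source)          # capture and digitize the sound waves

    try:
        # The engine maps the audio to phonemes, words, and finally text.
        text = recognizer.recognize_google(audio)  # one of several available back ends
        print("Recognized:", text)
    except sr.UnknownValueError:
        print("The speech could not be understood.")
    except sr.RequestError as error:
        print("Recognition service unavailable:", error)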
Continuous speech recognition engines currently support two modes
of speech recognition:
- Dictation, in
which the user enters data by reading directly to the
computer.
- Command and
control, in which the user initiates actions by speaking
commands or asking questions.
Dictation mode allows users to dictate
memos, letters, and e-mail messages, as well as to enter data
using a speech recognition dictation engine. The possibilities for
what can be recognized are limited by the size of the recognizer's
"grammar" or dictionary of words. Most recognizers that
support dictation mode are speaker-dependent, meaning that
accuracy varies on the basis of the user's speaking patterns and
accent. To ensure accurate recognition, the application must
create or access a "speaker profile" that includes a
detailed map of the user's speech patterns used in the matching
process during recognition.
Command and control mode offers developers the easiest
implementation of a speech interface in an existing application.
In command and control mode, the grammar (or list of recognized
words) can be limited to the set of available commands, a far
narrower scope than that of continuous dictation grammars,
which must encompass nearly the entire dictionary. This provides
better accuracy and performance, and reduces the processing
overhead required by the application. The limited grammar also
enables speaker-independent processing, eliminating the need for
speaker profiles or "training" of the recognizer.
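A command-and-control grammar can be pictured as nothing more than a small set of legal phrases; anything the recognizer returns that is not in the set is simply ignored. The Python sketch below uses made-up command names purely to illustrate the idea.

    # A command-and-control grammar: only a handful of phrases are legal, which keeps
    # matching fast and speaker-independent. Command names are illustrative only.
    COMMAND_GRAMMAR = {
        "open file",
        "save file",
        "check spelling",
        "print document",
    }

    def match_command(recognized_text):
        """Return the matching grammar entry, or None if the phrase is out of grammar."""
        phrase = recognized_text.strip().lower()
        return phrase if phrase in COMMAND_GRAMMAR else None

    for utterance in ("Check spelling", "write me a novel"):
        command = match_command(utterance)
        print(utterance, "->", command if command else "not in grammar; ignored")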
Speech recognition technology enables developers to include the
following features in their applications:
- Hands-free
computing as an alternative to the keyboard, or to allow the
application to be used in environments where a keyboard is
impractical (e.g., small mobile devices, AutoPCs, or mobile
phones).
- A more
"human" computer, one users can talk to, may make
educational and entertainment applications seem more friendly
and realistic.
- Voice
responses to message boxes and wizard screens can easily be
designed into an application.
- Streamlined
access to application controls and large lists enables a user
to speak any one item from a list or any command from a
potentially huge set of commands without having to navigate
through several dialog boxes or cascading menus.
- Speech-activated
macros let a user speak a natural word or phrase rather than
use the keyboard or a command to activate a macro. For
example, saying "Spell check the paragraph" is
easier for most users to remember than the CTRL+F5 key
combination.
- Situational
dialogs are possible between the user and the computer in
which the computer asks the user, "What do you want to
do?" and branches according to the reply. For example,
the user might reply, "I want to book a flight from New
York to Boston." The computer analyzes the reply,
clarifies any ambiguous words ("Did you say New
York?"), and then asks for any information that the user
did not supply, such as "What day and time do you want to
leave?"
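The flight-booking exchange above amounts to slot filling: extract whatever the user already said, then ask only for what is still missing. The following Python sketch uses a made-up city list and prompts purely for illustration.

    # Toy situational dialog: fill origin, destination, and departure slots from the
    # user's reply, then prompt only for whatever is still missing.
    KNOWN_CITIES = ["new york", "boston", "chicago"]   # illustrative list

    def parse_flight_request(reply):
        text = reply.lower()
        slots = {"origin": None, "destination": None, "departure": None}
        for city in KNOWN_CITIES:
            if "from " + city in text:
                slots["origin"] = city
            if "to " + city in text:
                slots["destination"] = city
        return slots

    def next_prompt(slots):
        if slots["origin"] is None:
            return "Where are you leaving from?"
        if slots["destination"] is None:
            return "Where do you want to go?"
        if slots["departure"] is None:
            return "What day and time do you want to leave?"
        return "Booking your flight."

    slots = parse_flight_request("I want to book a flight from New York to Boston.")
    print(next_prompt(slots))   # -> What day and time do you want to leave?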
Potential Applications for Speech Recognition
The specific use of speech recognition technology will depend on
the application. Some target applications that are good candidates
for integrating speech recognition include:
Games and Edutainment
Speech recognition offers game and edutainment developers the
potential to bring their applications to a new level of play. With
games, for example, traditional computer-based characters could
evolve into characters that the user can actually talk to.
While speech recognition enhances the realism and fun in many
computer games, it also provides a useful alternative to
keyboard-based control, and voice commands provide new freedom for
the user in any sort of application, from entertainment to office
productivity.
Data Entry
Applications that require users to type paper-based data into
the computer (such as database front-ends and spreadsheets) are
good candidates for a speech recognition application. Reading data
directly to the computer is much easier for most users and can
significantly speed up data entry.
While speech recognition technology cannot effectively be used to
enter names, it can enter numbers or items selected from a small
list (fewer than 100 items). Some recognizers can even handle
spelling fairly well. If an application has fields with mutually
exclusive data types (for example, one field allows
"male" or "female", another is for age, and a
third is for city), the speech recognition engine can process the
command and automatically determine which field to fill in.
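Because those example fields accept values that can never be confused with one another, routing a recognized value is a simple lookup. The field names and city list in the Python sketch below are invented for illustration.

    # Route a recognized value to the only field it can belong to.
    # Field names and the city list are placeholders for this sketch.
    CITIES = {"toronto", "vancouver", "montreal"}

    def route_value(recognized_text):
        value = recognized_text.strip().lower()
        if value in ("male", "female"):
            return ("sex", value)
        if value.isdigit() and 0 < int(value) < 120:
            return ("age", int(value))
        if value in CITIES:
            return ("city", value)
        return None   # ambiguous or unrecognized; ask the user to repeat

    for spoken in ("female", "42", "Toronto", "pineapple"):
        print(spoken, "->", route_value(spoken))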
Document Editing
This is a scenario in which one or both modes of speech
recognition could be used to dramatically improve productivity.
Dictation would allow users to dictate entire documents without
typing. Command and control would allow users to modify formatting
or change views without using the mouse or keyboard. For example,
a word processor might provide commands like "bold",
"italic", "change to Times New Roman font",
"use bullet list text style," and "use 18 point
type." A paint package might have "select eraser"
or "choose a wider brush."
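One way to wire such phrases into an editor is a dispatch table that maps each entry in the command-and-control grammar to an application action. The editor actions in the Python sketch below are placeholders, not a real word-processor API.

    # Map recognized formatting phrases to editor actions.
    # The actions are stand-ins for real word-processor calls.
    COMMANDS = {
        "bold": lambda: print("selection set to bold"),
        "italic": lambda: print("selection set to italic"),
        "change to times new roman font": lambda: print("font changed to Times New Roman"),
        "use 18 point type": lambda: print("font size set to 18 point"),
    }

    def handle_command(recognized_text):
        action = COMMANDS.get(recognized_text.strip().lower())
        if action is not None:
            action()
        else:
            print("Command not in grammar:", recognized_text)

    handle_command("Change to Times New Roman font")
    handle_command("make it sparkle")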
Speech Synthesis
Speech Synthesis, or text-to-speech, is the process of converting
text into spoken language. This involves breaking down the words
into phonemes; analyzing for special handling of text such as
numbers, currency amounts, inflection, and punctuation; and
generating the digital audio for playback. There are additional
functions that the synthesizer performs, but Figure 2 below
illustrates this high-level description of the process.

Figure 2 - Text-to-speech process flow
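The special-handling step can be thought of as text normalization: expanding digits, currency amounts, and similar tokens into speakable words before phoneme conversion. The Python sketch below implements a deliberately tiny, made-up subset of those rules.

    # Tiny text-normalization sketch: expand a few token types into speakable words.
    # A real synthesizer front end handles far more (dates, abbreviations, inflection).
    ONES = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

    def spell_digits(token):
        # Naive digit-by-digit expansion, just to show where normalization fits.
        return " ".join(ONES[int(d)] for d in token)

    def normalize(text):
        words = []
        for token in text.split():
            if token.startswith("$") and token[1:].isdigit():
                words.append(spell_digits(token[1:]) + " dollars")
            elif token.isdigit():
                words.append(spell_digits(token))
            else:
                words.append(token)
        return " ".join(words)

    print(normalize("The total is $42 for 3 items"))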
Software drivers called synthesizers, or text-to-speech voices,
perform speech synthesis, handling the complexity of converting
text and generating spoken language. A text-to-speech voice
generates sounds similar to those created by human vocal cords and
applies various filters to simulate throat length, mouth cavity,
lip shape, and tongue position. Although easy to understand, the
voice produced by synthesis technology tends to sound less human
than a voice reproduced by a digital recording.
Nevertheless, text-to-speech applications may be the better
alternative in situations where a digital audio recording is
inadequate or impractical. Generally, consider using
text-to-speech when:
- Audio
recordings are too large to store on disk or expensive to
record.
- The
application response requires short phrases.
- The
application cannot predict what it will need to communicate or
alternative responses vary too much to record and store all
possibilities. For example, speaking the time is a good use
for text-to-speech, because recording and storing every
possible time would be impractical.
- The user
prefers or requires audible feedback or notification. For
example, audible proofreading of text and numbers helps the
user catch typing errors missed by visual proofreading.
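As a sketch of the time example, the Python snippet below formats the current time as a phrase and hands it to a synthesizer. It assumes the third-party pyttsx3 package (an offline text-to-speech wrapper) is installed; that package is an assumption of the sketch, not something described above.

    # Speak the current time with a synthesized voice instead of stored recordings.
    # Assumes the third-party pyttsx3 package (offline text-to-speech wrapper).
    import datetime
    import pyttsx3

    now = datetime.datetime.now()
    phrase = now.strftime("The time is %I:%M %p")   # e.g. "The time is 03:07 PM"

    engine = pyttsx3.init()     # use the platform's default text-to-speech voice
    engine.say(phrase)          # queue the phrase for synthesis
    engine.runAndWait()         # block until playback completes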
Potential Applications for Speech Synthesis
The specific use of text-to-speech will depend on the application.
Here are some potential applications:
Games and Edutainment
Text-to-speech allows the characters in an application to
"talk" back to the user instead of displaying speech
balloons. Although it is also possible to use digital recordings
of the speech, an application could use text-to-speech instead of
recordings in the following cases:
- Text-to-speech can replace concatenated word or phrase
recordings, because an application designer can simply pass the
desired sentence strings to the text-to-speech engine.
- The inevitably
non-human quality of synthesized text-to-speech makes it ideal
for character voices that are supposed to be robots or aliens.
- If the
application cannot afford to have recordings of all the
possible dialogs or if the dialogs cannot be recorded ahead of
time, then text-to-speech is the only alternative.