THE VOICE OF TECHNOLOGY
The Java Speech API
by Steven Meloan
In the 1986 Star Trek movie, "The Voyage Home," the
crew of the Enterprise traveled back in time to 20th-century
Earth. Audiences laughed uproariously as Scotty naively picked
up the mouse from a nearby PC, and commanded -- "Compu-tor,
Compu-tor!"
Just a little over ten years later,
however, no one's laughing anymore. Speech technology is
quickly becoming a commonplace means of getting data into and
out of computational systems. Speech synthesis systems are
reaching ever new levels of clarity and sophistication, and
speech recognition systems -- from industrial
command-and-control technologies, to dictation services -- are
achieving impressive new degrees of speed and accuracy. But
the world of speech technology is still one of proprietary,
platform-specific systems, filled with arcane APIs that,
particularly from a development perspective, require a
significant background in speech technology.
[Java Speech technology] has a very strong foundation -- the object-oriented approach that the Java programming language is built upon. - John Earle, Chant Inc.
The Java Speech API ("JSAPI"),
officially released this past October, offers a cutting-edge
alternative to this status quo -- bringing the many benefits
of the Java platform to the burgeoning world of speech
technology. The JSAPI 1.0 specification provides a standard,
secure, and cross-platform means of incorporating
state-of-the-art speech technology into Java technology-based
applets and applications -- for use on the desktop, in
telephony servers, and in small portable and embedded devices.
In the Beginning
The JSAPI specification was developed by Sun Microsystems,
Inc. in collaboration with leading speech technology
companies: Apple Computer, Inc., AT&T, Dragon Systems,
Inc., IBM Corporation, Novell, Inc., Philips Speech
Processing, and Texas Instruments Inc.
"We laid out five basic goals in the early design phase of
JSAPI," says Andrew Hunt, principal investigator for Sun
Microsystems' speech technology group. Those goals were:
- To provide support for speech synthesizers, and for both
command-and-control and dictation speech recognizers.
- To provide a robust, cross-platform, cross-vendor
interface to speech synthesis and speech recognition.
- To enable access to state-of-the-art speech technology.
- To support integration with other capabilities of the
Java platform, including the suite of Java Media APIs.
- To be simple, compact, easy to learn, and
internationalized (able to process languages in addition to
English).
Extending Speech Technology
"In developing the API specification, Sun's focus was on
enhancing developer productivity by making JSAPI easier to
learn and use," says Hunt. "Now that the specification is
released, Sun is working with speech technology companies to
provide implementations of the API. Because Sun does not own,
and is not developing, speech recognition or speech synthesis
technology, we have no current plans to release an API
implementation, but this does facilitate working with the
speech companies since we are not in direct competition."
...we had customers come to us and ask that our speech engines be made accessible through JSAPI [the Java Speech API]. - Tom Morse, Lernout & Hauspie
The Java Speech interface
specification emerged from a lengthy open development
process -- with major contributions from the partner companies
and other speech companies, feedback from the application
developer community, and months of public review and comment.
"We worked closely with Sun on the definition of JSAPI,"
says Steven De Gennaro, manager of enterprise conversational
systems for IBM. "As the definition evolved over various
releases, we tried to have an implementation available through
our alphaWorks site -- where we put out early previews of
technology -- running on top of our existing speech engines.
That was helpful to developers, because they could then write
applications against it, and we could find out what was good
about the spec, what needed work, and where we needed to
address design issues."
"Sun was very smart in how they approached bringing speech
technology to the Java platform," says John Earle, president
of Chant Inc. Chant is a software services company primarily
focused on speech-enabled applications. They, too, provided
feedback during the development of the Java Speech API. "Sun
had the advantage of being able to look out on the horizon and
see historically how some of these other speech technologies
had evolved," says Earle. "So, even though this is only
release 1.0 of the JSAPI specification, it's actually based
upon very mature technology. And it also has a very strong
foundation -- the object-oriented approach that the Java
programming language is built upon."
"Simply put," says Andrew Hunt, "Sun wanted to produce an
API that allowed the speech companies to put their latest and
greatest technologies onto the Java platform." The Java Speech
API is currently an extension to the Java platform -- part of
the Java Media APIs. These are a suite of class libraries
providing rich media and communications capabilities --
including audio/video playback, capture, conferencing, 2D and
3D graphics, animation, advanced imaging, and more.
"The current JSAPI-enabled engines are based around
existing speech technologies that the companies have,"
explains Hunt. "They take their existing proprietary APIs, and
wrap them in Java-technology based code to implement the Java
Speech API. This reduces costs to speech companies, as well as
time to market. And as the speech technology companies update
their existing engines, that technology becomes available to
the Java Speech technology community."
To better support natural speech capabilities,
two companion specifications accompany the Java Speech
API -- the Java Speech API Markup Language (JSML), and the
Java Speech API Grammar Format (JSGF). JSML, an XML-based
markup language used with speech synthesizers, lets developers
specify explicit pronunciations for words, phrases, acronyms,
and abbreviations, and control pauses, boundaries, and
emphasis. JSGF, meanwhile, provides a facility for defining
the grammars used by speech-recognition systems. "Both of
these specifications can also be applied separately from
JSAPI," adds Hunt, "and, in fact, several companies are now
using them separately from the Java platform."
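To make these concrete, here are two short, hypothetical fragments. The first is JSML markup of the kind a synthesizer might accept; the element names (sayas, emp, break) follow the early JSML drafts, and exact tags and attributes vary between versions of the specification:

```xml
<jsml>
  <para>
    <!-- Substitute a spoken form for an abbreviation. -->
    The <sayas sub="Java Speech A P I">JSAPI</sayas> specification
    was released in <sayas class="date">10/98</sayas>.
    <break size="medium"/>
    It is <emp>not</emp> tied to any one speech engine.
  </para>
</jsml>
```

The second is a small command-and-control grammar in JSGF syntax; the grammar and rule names here are invented for illustration:

```
#JSGF V1.0;

grammar commands;

// The public rule is the entry point the recognizer listens for.
public <command> = <action> <object> [please];

// Vertical bars separate alternatives; brackets mark optional words.
<action> = open | close | delete;
<object> = the window | the file | the message;
```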
Going To Market
Already, IBM and Lernout & Hauspie -- both respected
developers of speech synthesis and recognition software
engines -- have provided interfaces to their technologies
through the Java Speech API. "We were already following and
participating in the definition of JSAPI," says Tom Morse,
director of engineering for Lernout & Hauspie. "We hadn't
officially started to implement the interface ourselves,"
Morse reports, "but then we had customers come to us and ask
that our speech engines be made accessible through JSAPI -- so
it was really the needs of our customers that brought that
about." The company now has its TTS3000 text-to-speech
product enabled through JSAPI, as well as its ASR 1600, a
grammar-based, continuous speech-recognition engine.
In the meantime, IBM's "ViaVoice SDK for the Java
Environment" product implements most of JSAPI 1.0 for the
recognition and synthesis engines used in the company's
well-known ViaVoice product. It supports continuous dictation,
command-and-control, and speech synthesis (all in multiple
human languages), and is currently available for Microsoft
Windows and Windows NT. "Speech engines today are very
computationally intensive," explains De Gennaro. "They're all
native code. So while JSAPI provides an interface for
programmers, it still ultimately maps into the native code
underneath it." New Java Speech releases, with additional
capabilities, will continue to be made available through IBM's
alphaWorks site.
Most recently, IBM announced a brand new implementation of
JSAPI for their ViaVoice product line. "We've added Java
Speech technology support for our ViaVoice for Linux toolkit
that we released a few weeks ago," says De Gennaro. "This puts
Java Speech technology on yet another platform, and can be
downloaded from our alphaWorks site."
Chant, meanwhile, has long been active in facilitating
speech-enabled software development. The company's
object-oriented SpeechKit middleware supplies high-level
speech-oriented components for use in Java, C++, Visual Basic,
and other programming languages. "Before JSAPI was released,"
says Chant's Earle, "we hooked through the Java Native
Interface in order to provide developers who use the Java
platform with a way of voice-enabling their UIs to the speech
engines that are commonly available."
Now, the company has begun to incorporate the Java Speech
API into their tools of the speech technology trade. "After
JSAPI was released," says Earle, "we began integrating it
underneath our class technology. When using our product, it
will be transparent to the person writing in the Java
programming language whether they're even using a
JSAPI-enabled engine or not. We'll isolate them from that, so
they can still just concentrate on the application."
Into the Future
The next step in solidifying JSAPI's place as an
industry-wide speech standard is providing more speech-engine
options across more platforms. "We expect
that more Java technology implementations of speech engines
will appear over time," says IBM's De Gennaro, "and that more
native-code engines will become available on Java
technology-enabled platforms. That will provide the runtimes
necessary for greater numbers of Java Speech
technology-enabled applications."
Speech technology is already increasingly found on the
computer desktop -- in speech synthesis (text-to-speech) and
dictation. And desktop dictation systems now offer speed and
accuracy rivaling that of even the fastest typists. But many
of the driving forces for the proliferation of the JSAPI
specification may yet prove to be outside of the traditional
desktop world -- in telephony systems, Web browsers, and small
portable and embedded devices.
"Applications are starting to pop up all over the place
where you can replace touch-tone systems with voice," says
Andrew Hunt. "If you've ever been through ten menus of
touch-tone, you know how painfully slow that can be. But with
voice, you can speak hundreds, or thousands of words. So this
is another area in which Java technology is starting to have a
presence, by being a standard interface with which you can
deploy telephony applications across multiple platforms."
The main focus for us in the NT environment, beyond what we currently support, will be Java and JSAPI technology-based initiatives. - Chris Czekaj, Lucent Speech Solutions
In January of this year, Lucent
Technologies formed Lucent Speech Solutions, an organization
chartered to deliver speech recognition, text to speech, and
speech verification technologies to both internal and external
customers. "Bell Labs pioneered the development of the initial
algorithms for speech recognition and text to speech,"
explains Chris Czekaj, business development manager within
Lucent Speech Solutions. "Now," he says, "we're scaling them
down to deliver these same technologies and algorithms outside
the AT&T network -- while improving on them, from both a
hardware and a software perspective."
Customer-Driven Advances
Java technology and the Java Speech API specification are
poised to play a major role in this endeavor. "The main focus
for us, beyond what we currently support, in the NT
environment, will be Java and JSAPI technology-based
initiatives," says Czekaj. As with Lernout & Hauspie, this
push has been, in part, customer driven. "We've had several
potential clients that have expressed interest in such
capabilities," he says. "As a result, we're training our
developers and researchers in Java and JSAPI technologies, all
the fundamentals, in order to start developing our core
technologies in that environment."
Another huge growth area for speech technology is the
small, portable, and embedded system market -- tailor-made for
the scalability and cross-platform nature of Java technology.
"We're seeing a lot of interest in the Java platform on small
devices, and also on the server side," says De Gennaro. "To
provide speech services in both of those environments, it
makes sense to have a standard API."
"If you've got a computer the size of your cell phone,"
explains Hunt, "then you can't do keyboard entry with typed
input, because there's simply not enough space to put in a
keyboard. And in many instances, there's also not enough room
for a full display. In such instances, a microphone and a
speaker are a much more powerful and flexible input and output
method."
Speech Technology on Consumer Devices
There's a great fit between JSAPI and where we see speech going in general... - Steven De Gennaro, IBM
As an extension to the Java
platform, the Java Speech API can also be provided as an
extension to the PersonalJava and EmbeddedJava platforms
-- Java technology-based application environments designed
specifically to operate on constrained devices with limited
computing power and memory. "There's a great fit between JSAPI
and where we see speech going in general," says De Gennaro,
"-- conversational interfaces that involve small devices,
smart phones, and cell phones."
Such embedded systems will undoubtedly find their way into
a whole range of business applications -- from the medical
field, where hands-free, eye-free work may be required during
surgery, to the warehouse/industrial space, to the automobile,
to remote email and calendar access, and even for voice-driven
Web access, such as form filling.
And getting back to the desktop, speech technology may
increasingly act as an adjunct to the current traditional GUI
-- to speed configuration settings in software applications
("give me 12-point, Helvetica, bold"), and to notify users of
background activities such as printer status.
Facilitating Accessibility
Finally, speech technology will increasingly facilitate an
endless array of accessibility features for people with
disabilities -- a potential user base of 43 million. The IBM
Special Needs group has released a toolkit called the Self
Voicing Kit (SVK) that works with Java technology Project
Swing components to provide audio output features for people
with disabilities. The SVK is built on top of the Java Speech
API and the Java Accessibility API, and is available from
IBM's alphaWorks site.
A Greater Pool of Developers
Having been developed with the input and insight of leading
speech technology experts, the Java Speech API brings with it
not only cross-platform standardization, but also
industry-leading ease of use. "We've tried to enable average
developers, with no special background in speech technology,
to be able to learn the API, and to use it effectively," says
Hunt. "At the moment, there may be only a few thousand
experienced speech developers around the world," he
conjectures, "but probably hundreds of thousands who can
program graphical environments. Our goal is to provide a
technology basis on which we can start to increase the number
of people who write speech-enabled applications -- from the
thousands, into the hundreds of thousands, and even beyond."
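As a sketch of that simplicity, the minimal program below speaks one sentence through whichever JSAPI synthesizer is installed. It uses only classes defined by the JSAPI 1.0 specification (javax.speech and javax.speech.synthesis), but since Sun ships no implementation, it needs a vendor engine -- such as the IBM or Lernout & Hauspie products described above -- on the classpath before it will actually run:

```java
import java.util.Locale;

import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class HelloSpeech {
    public static void main(String[] args) throws Exception {
        // Ask Central for any installed synthesizer handling US English.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.US));

        synth.allocate();   // acquire the engine's resources
        synth.resume();     // leave the initial paused state

        // Queue plain text for speaking; JSML would go through speak().
        synth.speakPlainText("Hello from the Java Speech API.", null);

        // Block until the output queue drains, then release the engine.
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
        synth.deallocate();
    }
}
```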
See Also
Java Speech API (http://java.sun.com/products/java-media/speech/)
Chant Inc. (http://www.chant.net)
IBM's Speech For the Java Platform (http://www.alphaWorks.ibm.com/tech/speech)
IBM's ViaVoice (www.software.ibm.com/speech)
IBM's Beta ViaVoice SDK for Linux (www.software.ibm.com/is/voicetype/dev_home.html)
Lernout & Hauspie (http://www.lhs.com)
Lucent Speech Solutions (http://www.lucent.com/speech)