Cohen Acoustical, Inc. Los Angeles, CA
This paper reviews current technology for creating the illusion of three-dimensional sound. Examples of systems designed for headphone and loudspeaker presentation are discussed. In addition, we address the topics of idealized pinnae functions, audition environment, reproduction media, image robustness, localization, and spaciousness.
INTRODUCTION
Traditional recordings presented over both headphones and loudspeakers are spatially impoverished. We have known for many years that this is due to the inability of a microphone to pick up directional information in the same manner as the pinnae. Therefore, Intensity Stereo reproduction techniques do not offer a recreation of a "natural" sound field. Resulting recordings are clearly deprived of the spatial information that is part of our rich natural sonic environment.
An obvious aspect of what Gaver (1986) has termed 'everyday listening' is the fact that we live and listen in a three dimensional world. A primary advantage of the auditory system is that it allows us to monitor and identify sources of information from all possible locations. Hearing is not limited to the direction of gaze (see Wenzel, 1990), nor the limited aural viewing position or sweet spot of a conventional stereo system where the listener sits at one of the apexes of an equilateral triangle formed with the loudspeakers.
There are benefits to three dimensional presentation both in terms of headphone displays and over loudspeaker systems. Current research efforts have focused on two particular areas. The primary area is virtual acoustic displays for virtual environment systems; the second is signal processing for ambience and reverberation processors for the entertainment industry. McAdams (1983) has pointed out that we have just begun to understand that an important aspect of image processing is the distinguishing of different sound sources in order to be able to form images of the environment. The auditory system must be able to decide which elements belong together or come from the same source and which elements come from different sources.
Wenzel (1990) has written,
The advent of digital signal processing technology has finally made the creation of these sonically natural and realistic environments a possibility (Cohen, E.A., 1989a,b,c; Cohen, E.A., 1988).
Examples of Technology for Binaural Processing Over Headphones
The virtual auditory display (Wenzel et al., 1988) takes advantage of the capabilities of the human listener to identify the direction and distance of multiple sound sources. Martens (1989b) has written,
A number of research groups have taken similar approaches. They utilize Head Related Transfer Functions (HRTF's) to provide directional cues and then generate some form of ambience processing to simulate the listener's environment. A great deal of the effectiveness of the 3-dimensional simulation depends on the method used to derive the HRTF's and the quality of the ambience. HRTF's are derived from a variety of sources including normative mannequins, human subjects, and cadavers (Genuit, 1986; Kendall & Rogers, 1982; Kendall & Martens, 1984; Lehnert & Blauert, 1989; Posselt et al., 1986; Persterer, 1989; Wenzel et al., 1988).
Wenzel (1990) goes on to describe an approach developed at NASA Ames which,
The NASA-Ames project also includes the modeling of room acoustics as a component of its virtual auditory environment (Begault, 1987; Begault et al., 1990).
The Convolvotron
The Convolvotron was developed as part of the NASA Ames project. It is a high speed digital audio signal processing system designed to deliver three dimensional sound over headphones. It currently utilizes the HRTF's measured by Wightman and Kistler (1989a). The Convolvotron consists of a two-card set designed by Scott Foster of Crystal River Engineering for an industry compatible PC. The system is controlled by the host PC with calls to a library written in Microsoft C. 128 parallel multiply/accumulate/shift processors on the two-card set provide computational speeds on the order of 300 MIPS. The system uses a 50 kHz sampling rate and 16 bit conversion. In its mode for 3-D presentation, four independent sound sources are filtered with large time-varying filters that are intended to compensate for the head motion of the listener and/or the possible motion of audio sources. As the listener changes the position of his or her head, the perceived location of the sound source should remain constant (e.g., sound perceived to come from in front of the listener will change smoothly to the right side of the listener when he or she turns 90 degrees to the left).
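The core operation such a system performs, convolving a mono source with a left- and right-ear head-related impulse response (HRIR) pair, can be sketched as follows. This is a minimal illustration, not the Convolvotron's actual implementation; the 3-tap HRIRs below are hypothetical placeholders for measured responses.

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source to two ear signals by convolving it
    with a left/right HRIR pair.  A real system would select (and
    interpolate between) measured HRIRs for the source direction,
    updating them as the listener's head moves."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy example: an impulse "source" and made-up 3-tap HRIRs.
source = np.array([1.0, 0.0, 0.0, 0.0])
hl = np.array([0.9, 0.3, 0.1])    # hypothetical left-ear HRIR
hr = np.array([0.5, 0.2, 0.05])   # hypothetical right-ear HRIR
left_sig, right_sig = binauralize(source, hl, hr)
```

Because the source here is a unit impulse, each ear signal simply reproduces its HRIR; with real program material each output sample is a weighted sum of recent input samples.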
The AKG CAP 340 and Associated High Performance Digital Audio Signal Processing System
The heart of this system (Persterer, 1989a) is the use of filters to simulate the HRTF's, including the interaural time delay difference. The filter outputs are subsequently fed to headphones. Along with Kendall and Martens (1984), Persterer takes into account the importance of room reflections as they affect sound localization. He implements a delay and then assigns a direction by utilizing a dedicated filter pair.
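The delay-plus-filter-pair scheme described above can be sketched as follows. This is an illustrative simplification, not the CAP 340's actual signal path; the interaural time difference (ITD) and the 3-tap HRIRs are hypothetical values.

```python
import numpy as np

def directionalize(mono, itd_samples, hrir_ipsi, hrir_contra):
    """Sketch of directionalizing one sound (or one reflection):
    the far (contralateral) ear receives the source delayed by the
    interaural time difference, and each ear is then filtered with
    its own head-related impulse response."""
    near = np.convolve(mono, hrir_ipsi)
    delayed = np.concatenate([np.zeros(itd_samples), mono])
    far = np.convolve(delayed, hrir_contra)
    return near, far

src = np.array([1.0, 0.0, 0.0])          # impulse test source
near, far = directionalize(src, itd_samples=2,
                           hrir_ipsi=np.array([1.0, 0.4, 0.1]),
                           hrir_contra=np.array([0.6, 0.3, 0.1]))
```

The onset at the far ear arrives two samples later and attenuated, which is the basic lateralization cue; in a room simulation each reflection would get its own delay and filter pair.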
Persterer recognizes,
He accomplishes the required calculations by utilizing a 32-bit floating point processor with a cycle time of 100 ns. A host computer (an HP 9000/series 300) generates the processor controls and provides the user interface. The audio interface is capable of handling up to 32 inputs and outputs and offers a 48 kHz sampling rate. The delay module provides up to 40 milliseconds of delay.
Binaural mixing software (SPATMIX) has been developed for the CAP 340. It is structured for the binaural processing of up to 32 input signals, enabling sets of one direct sound and three reflections to be simulated. Special filters simulate the absorption properties of three materials.
The CAP 340 has been used by Theile (1990) and Rebscher (1990) for research on High Definition Stereo Television. They were investigating optimum loudspeaker configurations.
The Focal Point 3-D Audio System
The Focal Point 3-D Audio System, developed by Bo Gehring, is a Macintosh II application which uses a widely available, inexpensive Macintosh II accelerator card as its signal processor. The Focal Point system is intended for applications relating to virtual environments and future aircraft cockpits. The system's binaural technology is also based on head-related transfer functions. The sampling rate is 44 kHz, analog I/O is 16 bits, and the internal processing is 24 bits, producing computational accuracy of 144 dB. This system is modular and has at least four 3-D channels which can be separately placed and moved by mouse, keyboard and RS-232 port commands. Several sets of binaural HRTF's can be used at the same time. The Focal Point 3-D Audio System includes head-tracking and has a typical Macintosh interface, and, like the CAP 340, is designed with binaural mixing in mind.
Preliminary listening experiments have revealed large timbre differences dependent upon choice of pinnae sets (this illustrates the importance of the method of obtaining HRTF's). This in turn generates differences in localization characteristics. Both of these attributes have been observed to vary widely in different parts of a sphere modeling the auditory space.
A Three Dimensional Signal Processing System Designed for Both Headphone and Loudspeaker Reproduction
VS-l Spatial Sound Processor - Auris Corporation
The Auris Corporation is developing a 3-D spatial sound processor called the VS-1 (VS for "virtual space"). It can be used for headphone applications such as the virtual auditory display, and also in loudspeaker presentation. Here the intention is for the artist/engineer/producer using the system to be able to position individual sound elements within a simulated three dimensional space. They expect the sound elements would be monophonic tracks containing little or no spatial information, while the output would typically be stereo and designed for either loudspeaker or headphone reproduction. LCRS (left, center, right, surround) is another output option. There are two parts to the processor's acoustical simulation design. The first part captures the acoustics of the head, pinnae, and torso that are responsible for perceived direction. This provides the user with three dimensional panning through the full range of azimuth and elevation. The second part captures the acoustics of a user-specified room or environment. This environmental simulation also includes directionalizing acoustics and creates the illusion of the full three dimensional environment. The Auris group feels that the most important part of their modeling is that it captures the spatio-temporal distribution of sound in a natural environment. The time, intensity and direction of reflected sound change in response to the position of the sound source and the listener in the model room. The combination of directional and environmental simulations provides the user with control of source distance and environmental shape.
Virtually all of the signal processing algorithms in the VS-1 are designed for dynamic control by the user. The direction and distance of a sound source can be smoothly varied, and the processing is intended to automatically include acoustic features such as Doppler shift and air absorption. The dynamic steering of sound sources in three dimensions is implemented with time-varying filtering based on the continuous interpolation of directional transfer functions that are stored in the processor's memory. The environmental simulator contains elements similar to conventional reverberators, but the gain, delay and filtering of reflected sound are designed to be continually responsive to movement in the model environment. Discrete reflections are individually directionalized while the late reverberant field is spatially diffuse. Environmental sound processing will include static filters that capture absorption and transmission loss for walls, foliage, etc. Subsets of these environmental processing techniques could be used to process individual ambience tracks.
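The continuous interpolation of stored directional transfer functions can be sketched as a simple crossfade between the impulse responses of two adjacent measured directions. This is a minimal illustration of the idea, not the VS-1's actual algorithm; the 4-tap HRIRs are hypothetical.

```python
import numpy as np

def pan_hrir(hrir_a, hrir_b, frac):
    """Interpolate between the HRIRs of two adjacent stored
    directions.  frac = 0 gives direction A, frac = 1 gives
    direction B; sweeping frac steers the source smoothly
    between them without audible switching artifacts."""
    return (1.0 - frac) * hrir_a + frac * hrir_b

# Hypothetical HRIRs for two neighboring measured directions.
a = np.array([1.0, 0.5, 0.2, 0.1])
b = np.array([0.8, 0.6, 0.3, 0.0])
midway = pan_hrir(a, b, 0.5)   # filter for a direction between A and B
```

A practical system would interpolate across a grid of azimuths and elevations (and typically work on minimum-phase responses with a separately interpolated delay), but the principle of blending stored filters is the same.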
The spatial sound processor is designed for use with either headphones or loudspeakers. Each mode of reproduction has a specially optimized set of directional transfer functions stored in the processor's memory. For example, the headphone transfer functions are optimized for front/back discrimination. When the headphone filters are used in conjunction with a head-tracking device that communicates changes in head position to the processor, the result for the listener is immersion in an auditory virtual environment. The simulated reflected sound is designed to provide clear cues to the absolute spatial position of sound sources within the virtual environment. In typical home listening situations, only the listener located directly between the speakers will hear good center imagery. For other listeners seated in a variety of locations, the sound images collapse to the nearest loudspeaker due to the precedence effect. The spatial sound processor incorporates technology to maximize the stability of three-dimensional sound images for listeners closer to one loudspeaker than another. The processor includes environmental simulations which stabilize the relative position of the images by providing both distance and diffuse field cues. This technology is intended to allow spatially diffuse reverberant fields to surround all listeners.
The VS-1 is designed to provide a standard AES/EBU interface. Output is either standard analog stereo or optional LCRS for film production. Sampling rates are 44.1 or 48 kHz. It is also designed to be fully MIDI compatible, with SMPTE locking planned. It is also intended to be controllable from a Macintosh computer, a remote control panel, or a computer workstation.
Virtual Sound Source Localization Abilities
Both headphone and loudspeaker presentation systems require "the careful psychophysical evaluation of listener's ability to accurately localize the virtual or synthetic source" (Wenzel, 1990).
Several investigators have attempted to identify which features of measured head-related transfer functions (HRTF's) support particular directional distinctions (Blauert, 1969,70; Hebrank and Wright, 1975; Bloom, 1977; Watkins, 1978). But the complexity of the spectral profile presented to the ears has made it difficult to formulate a comprehensive model of human directional hearing cues for sound from any azimuth or elevation angle. The paper by Wenzel (1990) speaks to the issue of individual differences.
Blauert (1988) has suggested that for successful three dimensional sound presentation over headphones it is necessary to measure each potential listener's HRTF. However, as Wenzel (1990) notes,
Preliminary data (Wenzel et al., 1988b) suggest that using non-listener-specific transforms to achieve synthesis of localization cues is at least feasible. For experienced listeners, localization performance was only slightly degraded compared to a subject's inherent ability, even for the less robust elevation cues, as long as the transforms were derived from what one might call a "good localizer". Further, the fact that individual differences in performance, particularly for elevation, could be traced to acoustical idiosyncrasies in the stimulus suggests that it may eventually be possible to create a set of "universal transforms" by appropriate averaging (Genuit, 1986) and data reduction techniques (e.g., principal components analysis), or perhaps even by enhancing the spectra of empirically derived transfer functions (Durlach and Pang, 1986).
Martens (1987) used principal components analysis (PCA) on spectral variation between HRTF's in an attempt to reduce the amount of data necessary to specify the directionally dependent spectral cues. He found that effective transfer functions could be resynthesized from just a few principal components that captured simple distinctions such as front versus rear, and central versus lateral sound directions.
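The data-reduction step can be sketched as follows: stack the measured magnitude spectra as rows of a matrix, extract principal components, and resynthesize each spectrum from only the first few components. This is a generic PCA sketch (via the SVD), not Martens's actual procedure, and the spectra below are random stand-ins for real HRTF measurements.

```python
import numpy as np

# Hypothetical matrix of HRTF magnitude spectra:
# one row per measured direction, one column per frequency bin.
rng = np.random.default_rng(0)
spectra = rng.standard_normal((72, 128))

# PCA via the singular value decomposition of the mean-removed data.
mean = spectra.mean(axis=0)
u, s, vt = np.linalg.svd(spectra - mean, full_matrices=False)

# Resynthesize every spectrum from its first k principal components.
# Each direction is now described by k weights instead of 128 bins.
k = 5
approx = mean + (u[:, :k] * s[:k]) @ vt[:k]
```

With real HRTF data the leading components capture most of the directionally dependent variation, which is what makes resynthesis from "just a few" components plausible.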
Kendall and Martens (1984) created a complete sphere of simulated transfer functions using pole-zero approximations to measured HRTF's, but optimized for loudspeaker rather than headphone reproduction. Perceptual evaluation showed that their transfer functions could support 3-D spatial imagery over loudspeakers, but the filters produced timbral changes that were unacceptable for professional audio (Martens et al., 1986). Kendall, Wilde and Martens (1989) reported on a system that was relatively less sensitive to listener position and possessed improved timbral constancy. Their current system attempts to compensate for crosstalk without cancellation.
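For contrast with the compensation approach just described, the classical crosstalk cancellation idea can be sketched as inverting the 2x2 matrix of speaker-to-ear responses so that each binaural signal arrives only at its intended ear. This is a single-frequency sketch with hypothetical gain values, not any of the cited systems' implementations; real cancellers must invert frequency-dependent responses and are notoriously sensitive to listener position.

```python
import numpy as np

# Speaker-to-ear acoustic paths at one frequency:
# diagonal = same-side ear, off-diagonal = crosstalk to the far ear.
H = np.array([[1.0, 0.4],
              [0.4, 1.0]])   # hypothetical gains

# Pre-filtering the speaker feeds with the inverse of H delivers
# the desired binaural signals at the ears.
C = np.linalg.inv(H)

ears_desired = np.array([1.0, 0.0])   # signal only at the left ear
speakers = C @ ears_desired           # speaker feeds
ears_actual = H @ speakers            # what actually reaches the ears
```

The inversion works perfectly only at the position where H was measured, which is one reason the Kendall group pursued crosstalk compensation without full cancellation.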
For headphone presentation, Wenzel (1990) points out that,
In general, these data suggest that most listeners can obtain useful directional information from an auditory display without requiring the use of individually tailored HRTF's. However, a caveat is important here. The results described above are based on analyses in which errors due to front/back confusions were resolved for free field versus simulated free field stimuli.
Experienced listeners exhibited front/back confusion rates of about 5% versus 10%, and inexperienced listeners showed average rates of about 22% versus 39%. Although the reason for such confusions is not completely understood, they are probably due in large part to the static nature of the stimulus and the ambiguity resulting from the so-called "cone of confusion" (see Blauert, 1983). Several stimulus characteristics may help to minimize these errors. For example, the addition of dynamic cues correlated with head motion and well controlled environmental cues derived from models of room acoustics may improve the ability to resolve these ambiguities.
Begault and Wenzel at NASA-Ames are currently working on methods for disambiguating front and rear localized positions of HRTF-processed speech. They are in the process of gathering subjective responses to stimuli processed using a number of strategies, including the addition of early and late reflected energy, "boosted bands" (Blauert, 1983), and compensation for headphone transfer functions.
A good question to ask about 3-D auditory displays is how important it is to include reverberation in the headphone signals. As Martens (1989b) explained,
CONCLUSIONS
For the individual listener, the system providing the best-fitting pinnae functions will succeed in creating the most powerful illusion of three dimensional space. Nearly anyone with even a small personal computer and access to a set of head-related transfer functions will be able to achieve some degree of successful binaural listening. Most currently available commercial systems, however, can barely perform the basic computations necessary to calculate the pinnae functions themselves in real time. Faster algorithms and larger computer systems will be necessary for the successful simulation of the acoustic environment that is so important in faithful representation of a three dimensional auditory world. In addition, incorporating the important transformations of room reflections will be necessary for successful image formation.
Acknowledgements
The author would like to thank her colleagues, acoustical researchers, and musicians who have shared their information and opinions for their numerous contributions. In addition she would like to thank the following people for their comments, criticisms and discussions of the issues presented in this paper: Ioan Allen, Randy Begault, Scott Foster, Dave Griesinger, Bo Gehring, Tomlinson Holman, Gary Kendall, Bill Martens, Brian C.J. Moore, Alexander Persterer, Barb Wav and Beth Wenzel.
References
Blauert, J. (1983) Spatial Hearing, The MIT Press, Cambridge, MA
Blauert, J. (1969,70) Sound Localization in the Median Plane, Acustica, 22, 957-962.
Begault, Durand R. (1987) Control of Auditory Distance, unpublished Ph.D. dissertation, U.C. San Diego.
Begault, D. & Wenzel, E.M. (1990) Techniques and Applications for Binaural Sound Manipulation in Man-Machine Interfaces, NASA-Ames Technical memorandum TM 102279
Bly, S. (1982) Sound and Computer Information Presentation unpublished doctoral thesis, (UCRL-53282), Lawrence Livermore National Laboratory and University of California, Davis, CA
Bloom, P.J. (1977b) Creating source elevation illusions by spectral manipulation, J. Audio Eng. Soc., 25, 560-585.
Calhoun, G.L., Valencia, G. and Furness T.A. III (1987) Three Dimensional Auditory Cue simulation for Crew Station Design and Evaluation Proc. Hum. Fac. Soc., 31, 1398-1402.
Cherry, E.C. (1953) Some Experiments on the Recognition of Speech with One and Two Ears, J. Acoustical Society of America, 22, 61-62.
Cohen, E.A. (1988) Evolution and Innovation in Stereo Television, 1982 - Future, Paper presented at the 130th SMPTE Conference, New York.
Cohen, E.A. (1989) 3D Sound Fiction, Fantasy, and Fact, Paper presented at the 87th Audio Engineering Society Convention, New York
Cohen, E.A. (1989) 3D Sound for Film and Television, Paper presented at the 131st SMPTE Conference, Los Angeles
Cohen, E. (1989) Evaluation Methods and the Appropriate Use of 3-D Sound Processing for Cinema, Paper presented at the October, 1989 Uniatech Conference, Montreal, Canada.
Colquhoun, W.P. (1985) Evaluation of Auditory, Visual and Dual Mode Displays for Prolonged Sonar Monitoring in Repeated Sessions, Hum. Fact., 17, 425-437.
Deatherage, B.H. (1972) Auditory and Other Sensory Forms of Information Presentation, in H.P. Van Cott and R.G. Kincade (editors), Human Engineering Guide to Equipment Design (revised edition), Washington, D.C.: U.S. Government Printing Office, 123-160.
Doll, T.J., Girth, J.M., Engelman, W.R. and Folds, D.J. (1986) Development of Simulated Directional Audio for Cockpit Applications, USAF Report No. AAMRL-TR-86014.
Durlach, N.I. and Pang, X.D. (1986) Interaural Magnification, J. Acoustical Society of America, 80, 1849-1850.
Edwards, A.D.N. (1989) Sound Track: An Auditory Interface for Blind Users, Hum. Comp. Interact., 4, 45-66.
Gaver, W. (1986) Auditory Icons: Using Sound in Computer Interfaces, Hum. Comp. Interact., 2, 167-177.
Genuit, K. (1986) A Description of the Human Outer Ear Transfer Functions by Elements of Communication Theory, Proceedings of the 12th ICA, Toronto, paper B6-8.
Griesinger, David (1990) Binaural Techniques for Music Reproduction, Audio Engineering Society Conference on Psychoacoustics, May, 1990.
Kendall, G.S., & Martens, W.L. (1984) Simulating cues of Spatial Hearing in Natural Environments, Proceedings of the 1984 International Computer Music Conference, D. Wessel, Ed.
Kendall, G.S., & Martens, W.L. (1989) Production and Reproduction of Three-Dimensional, Spatial Sound for Stereo Loudspeakers. Paper presented at the Audio Engineering Society 87th Convention, New York, October, 1989.
Kendall, G.S. & Rogers (1982) The Simulation of Three Dimensional Localization Cues for Headphone Listening, Proceedings of the 1982 International Computer Music Conference.
Lehnert, H. and Blauert, J. (1989) A Concept for Binaural Room Simulation, ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y.
Loomis, J.M., Herbert, C. and Cicinelli, J.G. (1989) Active Localization of Virtual Sound Sources, submitted to J. Acoust. Soc. Am.
Martens, W.L. (1987) Principal Components Analysis and Resynthesis of Spectral Cues to Perceived Direction, Proceedings of the International Computer Music Conference 1987, S. Tipei & J. Beauchamp, Eds.
Martens, W.L., Kendall, G.S., Freed, D.J., Ludwig, M.D., & Karstens, R. (1986) The Evaluation of Digital Filters for Controlling Perceived Direction in Stereo Loudspeaker Reproduction. Paper presented at the Audio Engineering Society 81st Convention, Los Angeles, November, 1986.
Martens, W.L. (1989) Spatial Image Formation in Binocular Vision and Binaural Hearing, Conference Proceedings, Montreal.
O'Leary, A. and Rhodes, G. (1984) Cross Modal Effects on Visual and Auditory Object Perception, Perception and Psychophysics, 35, 565-569
Patterson, R.D. (1982) Guidelines for Auditory Warning Systems in Civil Aircraft. Civil Aviation Authority No. 82017 London.
Posselt, C., Schroter, J., Opitz, M., Divenyi, P., Blauert, J. (1986) Generation of Binaural Signals for Research and Home Entertainment, Proceedings of the 12th ICA, Toronto, paper B1-6.
Persterer, A. (1989a) A Very High Performance Digital Audio Signal Processing System, ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y.
Persterer, A. (1989b) A Very High Performance Digital Audio Signal Processing System, 13th International Congress on Acoustics, Yugoslavia
Rebscher, R. (1990) Enlarging the Listening Area by Increasing the Number of Loudspeakers. Paper presented at the 88th Audio Engineering Society Convention, Montreux, 1990.
Theile, G. (1990) On the Performance of Two-Channel and Multi-Channel Stereophony. Paper presented at the 88th Audio Engineering Society Convention, Montreux, 1990.
Warren, D.H., Welch, R.B. and McCarthy, T.J. (1981) The Role of Visual Auditory "Compellingness" in the Ventriloquism Effect: Implications for Transitivity Among the Spatial Senses. Perception and Psychophysics, 30, 557-564.
Watkins, A.J. (1978) Psychoacoustic aspects of synthesized vertical locale cues, J. Acoust. Soc. Amer., 63, 1152-1165.
Wenzel, E.M., Wightman, F.L., and Foster, S.H. (1988a) A Virtual Display System for Conveying Three-Dimensional Acoustic Information, Proceedings of the Human Factors Society, 32, 86-90.
Wenzel, E.M., Wightman, F.L., Kistler, D.J. and Foster, S.H. (1988b) Acoustic Origins of Individual Differences in Sound Localization Behavior, J. Acoust. Soc. Am., 84, S79.
Wenzel, E.M. (1990) Virtual Acoustic Displays, Conference on Human Machine Interfaces for Teleoperators and Virtual Environments, Santa Barbara, CA.
Wightman, F.L. and Kistler, D.J. (1989a) Headphone Simulation of Free-field Listening I: Stimulus Synthesis, J. Acoust. Soc. Am., 85, 858-867.
Wightman, F.L. and Kistler, D.J. (1989b) Headphone Simulation of Free-field Listening II: Psychophysical Validation, J. Acoust. Soc. Am., 85, 868-878.