Brief Introduction to Spatial Audio Reproduction

Dylan Menzies

Spatial audio reproduction is the term for systems and processes that can give the impression of sounds coming from different directions and locations around the listener. Ideally the experience should be indistinguishable from listening to real sources. Spatial audio was originally important for domestic music reproduction, but has now moved into many areas, including games, user interface displays, and virtual and augmented reality.

On the face of it, designing such systems seems fairly straightforward. In practice there are a wide variety of approaches and combinations, each with particular advantages and limitations. The reproduction process can also depend on how the sound scene is represented. All this can be quite confusing to the uninitiated. The aim here is to present the basic aspects carefully and show how they work together.

 

Headphones – Binaural

Headphones make it possible to inject sound directly into the ear, giving direct control of the binaural signals that pass to the inner ears. Source signals are modified by the scattering of the head and torso to form the binaural signals. This scattering can be captured in transfer functions, the Head Related Transfer Functions (HRTFs), and used to simulate the physical process in order to produce binaural signals that evoke images in the desired locations. In practice it is more common to work with the time domain form, the Head Related Impulse Response (HRIR). Note this is distinct from the most common use of headphones, the playback of stereo signals (see below). Some practical difficulties are that HRTFs can only be stored with limited resolution, and each person has different HRTFs. The resolution can be improved using specialised interpolation methods, some of which are based on methods first used in loudspeaker reproduction (see below). HRTFs can be chosen as a best fit across the population, or better, adapted for each individual from a database, possibly using geometric data taken from the head.
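In essence, binaural rendering amounts to convolving the source signal with a left and a right HRIR. A minimal sketch in Python/NumPy, using toy delay-and-attenuate impulse responses in place of measured HRIRs:

```python
import numpy as np

def binaural_render(source, hrir_left, hrir_right):
    """Convolve a mono source with left/right HRIRs to form binaural signals."""
    return np.convolve(source, hrir_left), np.convolve(source, hrir_right)

# Toy HRIRs standing in for measured data: the far ear receives a delayed,
# attenuated copy, crudely mimicking the ITD and head shadowing.
fs = 48000
hrir_l = np.zeros(64); hrir_l[0] = 1.0    # near (left) ear: no delay
hrir_r = np.zeros(64); hrir_r[30] = 0.7   # far (right) ear: delayed, quieter
src = np.random.randn(fs // 10)           # 0.1 s of noise as a test source
left, right = binaural_render(src, hrir_l, hrir_r)
```

A real system would select, and interpolate between, the measured HRIR pairs for the desired image direction.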

If the listener’s head is assumed stationary then the HRTFs needed for each image location are fixed. However in many applications, such as VR/AR, the head moves significantly and so must be tracked in order to calculate the correct transfer functions for the images. The resulting dynamic system, although simple in concept, is fairly complex and computationally demanding.

In natural scenes reflections from surfaces are important because they provide cues about source localisation and the environment. In headphone reproduction reflected signals must be reproduced directly in the headphones. It is common to record transfer responses that include the reflections following the direct signal, known as Binaural Room Impulse Responses (BRIRs). In this way complex reflections can be reproduced in a binaural signal without reproducing each reflection individually.

 

Loudspeakers – Panning

The Stereo system created by Blumlein is one of the oldest spatial reproduction systems, and still today the most common. Each source signal is mixed in phase into the left and right channels, and the image position is controlled by varying the left/right gain balance. This can be produced either by manual mixing or by recording a scene using a stereo microphone, whose capsule directivity functions implement the mixing directly. The binaural signals of the listener in this case are mixtures of the loudspeaker signals. If the listener is centrally located then in the low frequency region, below 1500 Hz, the binaural signals are similar but with a time offset depending on the gain balance. This mimics the offset that occurs for a real source, which depends on the source direction. The time offset is known as the Inter-aural Time Difference (ITD), and provides a strong cue for localisation. A useful feature of panning is that the image direction does not depend on the listener's head size.
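A common realisation of the gain balance is the constant-power (sine/cosine) pan law, sketched below; the mapping of the pan control onto an angle is one typical choice, not the only one:

```python
import numpy as np

def pan_gains(pan):
    """Constant-power stereo pan law. pan in [-1, 1]: -1 full left, +1 full right."""
    angle = (pan + 1.0) * np.pi / 4.0        # map [-1, 1] onto [0, pi/2]
    return np.cos(angle), np.sin(angle)      # (left gain, right gain)

gl, gr = pan_gains(0.0)    # centre image: equal gains, total power preserved
```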

Stereo panning can be extended to more complex loudspeaker systems. For a horizontal surround array of loudspeakers, such as 5.1, images can be panned pairwise, i.e. between the nearest two loudspeakers. Vector Base Amplitude Panning (VBAP) extends stereo panning to 3 dimensions by mixing between the 3 nearest loudspeakers for each image. VBAP is typically used in an Object-Based context, meaning that the sound scene is represented by its separate sound sources. Each source is reproduced separately by first working out which 3 loudspeakers to use, and then the gains. However it is also possible to use VBAP in a Channel-Based context, in which the scene is represented with a fixed number of channels that are decoded to the loudspeakers. Stereo is the simplest example of this.
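The VBAP gain computation reduces to a small linear solve: the target direction is expressed as a combination of the three loudspeaker direction vectors, and the weights are then normalised. A sketch, using a hypothetical triplet of two front loudspeakers at ±45° and one overhead:

```python
import numpy as np

def vbap_gains(triplet, target):
    """Solve target = g1*l1 + g2*l2 + g3*l3 for the gains, then normalise
    for constant total power. triplet: rows are unit loudspeaker direction
    vectors; target: unit vector towards the desired image."""
    L = np.asarray(triplet, dtype=float)
    g = np.linalg.solve(L.T, np.asarray(target, dtype=float))
    return g / np.linalg.norm(g)

spk = [[np.cos(np.pi/4),  np.sin(np.pi/4), 0.0],   # front left, +45 deg
       [np.cos(np.pi/4), -np.sin(np.pi/4), 0.0],   # front right, -45 deg
       [0.0, 0.0, 1.0]]                            # overhead
g = vbap_gains(spk, [1.0, 0.0, 0.0])               # image straight ahead
```

For the straight-ahead image the two front loudspeakers share the signal equally and the overhead loudspeaker is silent, as expected by symmetry.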

In the 1970s manufacturers started providing horizontal 4-loudspeaker Quadraphonic systems. It was found that pairwise panning could not produce clear lateral images. This provided some of the motivation for the Ambisonics methodology developed by Gerzon. The classical Ambisonic decoding process mixes the low frequency part of a source signal over all the loudspeakers, with negative as well as positive gains, so that the pressure and velocity of the incident target sound field are reproduced accurately, provided the head is in the sweet spot. This ensures that the low frequency ITD is accurate for any target image and listener head direction, so lateral images are solid, unlike with the pairwise approach.

Ambisonics is channel-based, like stereo. The encoding, B-format, is equivalent to the signals from ideal omni and figure-of-eight microphones. These can be easily decoded to a great variety of loudspeaker arrays, which makes B-format flexible as well as compact. Ambisonics and B-format can be extended naturally to encodings with more channels and higher resolution, known as Higher Order Ambisonics (HOA).

In many cases the listeners are distributed and few are in the sweet spot, for example in concerts. The low frequency cues are then confused, but the high frequency cues are improved by the greater resolution. There are a variety of Ambisonic decoding strategies depending on frequency range and application. The mode-matching method was mentioned above for a single listener, whereby components of the sound field are physically reproduced. However mode-matching is a poor choice for higher orders and frequencies because for many arrays the problem is badly conditioned. A popular alternative approach is to first define panning functions for each loudspeaker, based for example on VBAP. This ensures the energy for each source is localised on the array, and so is acceptable for listeners in different locations.
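First-order B-format encoding can be sketched directly; the 1/√2 gain on W below follows one common convention, and conventions (channel order, normalisation) vary between systems:

```python
import numpy as np

def encode_bformat(signal, azimuth, elevation):
    """Encode a mono source at (azimuth, elevation), in radians, into the
    four first-order B-format channels W, X, Y, Z."""
    W = signal / np.sqrt(2.0)                          # omni component
    X = signal * np.cos(azimuth) * np.cos(elevation)   # front-back figure-of-eight
    Y = signal * np.sin(azimuth) * np.cos(elevation)   # left-right figure-of-eight
    Z = signal * np.sin(elevation)                     # up-down figure-of-eight
    return W, X, Y, Z
```

A source straight ahead appears entirely in W and X, with Y and Z silent, matching the ideal microphone picture described above.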

Ambisonics and VBAP can be applied in binaural systems to generate virtual loudspeaker arrays that are then rendered to binaural signals. B-format can be easily transformed to match sound field rotation, which supports dynamic head rotation in binaural reproduction.
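This rotation is cheap for B-format: W and Z are unchanged under yaw, and X and Y mix through a plain 2-D rotation. A sketch (the sign convention for yaw is an assumption):

```python
import numpy as np

def rotate_bformat_yaw(W, X, Y, Z, yaw):
    """Rotate a first-order B-format scene about the vertical axis by yaw
    radians, e.g. to counter tracked head rotation. W and Z are invariant."""
    Xr = np.cos(yaw) * X - np.sin(yaw) * Y
    Yr = np.sin(yaw) * X + np.cos(yaw) * Y
    return W, Xr, Yr, Z
```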

From the foregoing discussion it will be clear that the simple panning strategies, Stereo and VBAP, have an essential defect: for fixed panning gains, the image direction depends on the orientation of the listener’s head. Depending on the configuration, the error between the reproduced and desired image directions can be small or large. Compensated Amplitude Panning (CAP) takes the head orientation into account so that the resulting ITD cue matches the target image. With only two loudspeakers it is possible to produce stable images in any direction, and for nearly any head orientation. Dynamic ITD cues make this possible: these are short histories of ITD which the hearing system correlates with changes of head orientation to deduce the image direction. CAP can be extended for near-field imaging by controlling the Inter-aural Level Difference (ILD) cue. This requires some simple filtering, so can be viewed as an extension of simple panning methods.
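The ITD cue that these methods control can be approximated with a spherical head model, for example the Woodworth formula; a minimal sketch (the head radius is a nominal value):

```python
import numpy as np

def itd_woodworth(azimuth, head_radius=0.0875, c=343.0):
    """Woodworth spherical-head ITD approximation, in seconds.
    azimuth: source direction relative to the median plane, in radians."""
    return (head_radius / c) * (azimuth + np.sin(azimuth))

# A fully lateral source gives roughly 0.66 ms, in line with measured ITDs.
itd_side = itd_woodworth(np.pi / 2)
```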

Loudspeakers – Sound Field Control

With enough loudspeakers, basic acoustic theory shows that it is possible to control the sound field over a region of space to a desired accuracy. If the sound field matches that of a target sound scene then listeners entering the region will experience the target. Mode-matched 1st order Ambisonic decoding provides sound field control over a region that can enclose a human head, up to around 700 Hz. The key detail here is that the incident sound field should be accurate where the significant scattering surfaces are placed.

Mode-matching with HOA, where possible, extends the sound field accuracy at the central point, and also the size of the surrounding region that achieves a nominal level of accuracy. With densely packed loudspeakers it is possible to control the whole region within the surrounding loudspeaker array, so that multiple listeners can move in this area. An extension of HOA, Near-Field HOA (NF-HOA), is needed to make this work correctly. NF-HOA takes into account the finite distance of the loudspeakers, which is particularly important for reconstruction near the loudspeakers. This comes at the cost of more complex decoding filters. A less costly and less accurate alternative for this case is Wavefield Synthesis.

Wavefield Synthesis is a sound field control method based on the ideal reproduction of a sound field on one side of a plane of continuous sources. Like HOA, in its basic form it is not localised. In practice the sources are discrete loudspeakers, usually arranged along a horizontal linear boundary that can be curved and possibly closed. After approximations are made, the driving function for a simple source located anywhere outside the array is very simple, consisting of one shared filter, and a gain and delay for each loudspeaker. An HOA encoding can be decoded in a wavefield style by first converting it to a plane wave encoding. In Local Wavefield Synthesis the reproduction effort is restricted to a sub-region by focusing on points around the sub-region, creating a virtual sub-array. This requires fewer loudspeakers than the default case where the whole region is controlled.
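The per-loudspeaker part of the simplified driving function is just a delay and a distance-dependent gain, as sketched below; the shared spectral pre-equalisation filter is omitted, and the 1/√r gain law is one common simplification:

```python
import numpy as np

def wfs_delays_gains(src_pos, spk_positions, c=343.0):
    """Per-loudspeaker delays (s) and gains for a virtual point source behind
    a linear array (simplified WFS driving function; the shared prefilter
    common to all loudspeakers is omitted)."""
    r = np.linalg.norm(np.asarray(spk_positions, float)
                       - np.asarray(src_pos, float), axis=1)
    return r / c, 1.0 / np.sqrt(r)     # propagation delays, distance gains

# A 9-element linear array on the x-axis, virtual source 3 m behind it.
spk = [[x, 0.0, 0.0] for x in np.linspace(-2.0, 2.0, 9)]
delays, gains = wfs_delays_gains([0.0, -3.0, 0.0], spk)
```

The loudspeaker nearest the virtual source fires last to leave... rather, it receives the smallest delay and the largest gain, so the emitted wavefronts combine into the curved wavefront of the virtual source.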

Pressure matching is another approach to sound field control, in which the target sound field pressure, and possibly velocity, are specified at several points. In general this approach produces complex filters. Distributed modal matching combines pressure matching with mode matching, by specifying modes at several points, and has some numerical advantages. The modes are consistent in the overlapping regions.
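At a single frequency, pressure matching amounts to a (usually regularised) least-squares solve. A sketch using free-field Green's functions as the assumed loudspeaker-to-point transfer functions; the geometry and regularisation value are purely illustrative:

```python
import numpy as np

def pressure_match(G, p_target, reg=1e-6):
    """Regularised least-squares solve of G d = p_target for complex
    loudspeaker weights d. G[m, n]: transfer from loudspeaker n to point m."""
    G = np.asarray(G, dtype=complex)
    A = G.conj().T @ G + reg * np.eye(G.shape[1])
    return np.linalg.solve(A, G.conj().T @ p_target)

k = 2.0 * np.pi * 500.0 / 343.0                        # wavenumber at 500 Hz
spk = np.array([[-1.0, 2.0], [0.0, 2.2], [1.0, 2.0]])  # loudspeaker positions (m)
pts = np.array([[-0.1, 0.0], [0.0, 0.1], [0.1, 0.0]])  # control points (m)
r = np.linalg.norm(pts[:, None, :] - spk[None, :, :], axis=2)
G = np.exp(-1j * k * r) / (4.0 * np.pi * r)            # free-field Green's functions
d = pressure_match(G, np.ones(3), reg=1e-12)
```

In practice the regularisation is increased where the geometry makes the inversion ill-conditioned, trading accuracy at the points for bounded loudspeaker effort.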

 

Loudspeakers – Transaural

Panning works primarily by controlling the ITD cue. The natural question arises: can loudspeakers be used to control the binaural signals completely, as headphones can? The transfer response from the loudspeakers to the ears is a matrix of HRTFs. If this can be inverted then the inverse can be applied to the target binaural signals to produce the loudspeaker signals. This is Transaural reproduction. It turns out the inversion can be quite successful, producing an experience like binaural reproduction over headphones. Two or more loudspeakers are needed. In the Stereo Dipole and Optimal Source Distribution (OSD) systems the angular separation of the loudspeakers is chosen to enable filters that have minimum phase distortion in the reproduced signals. In OSD several pairs are used, each reproducing one frequency band of the whole signal. Noise from room reflections also causes distortion. An alternative strategy is to create narrow beams towards the ears, reducing leakage into the room, although this increases phase distortion. In either case the system is implicitly designed to create maximum signal isolation between the two ears. This is difficult particularly at low frequencies, where the reproduction of the binaural signals suffers. This is bad news because the low frequency ITD is an important cue. An alternative approach is to use stereo panning or CAP for low frequencies, and transaural reproduction only for high frequencies where it is effective. CAP can be applied in either a channel-based or object-based context.
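At a single frequency, transaural reproduction reduces to inverting a 2×2 matrix of transfer functions (crosstalk cancellation). A toy sketch with a made-up transfer matrix standing in for measured HRTFs:

```python
import numpy as np

def crosstalk_canceller(H):
    """Invert the 2x2 loudspeaker-to-ear transfer matrix at one frequency.
    H[i, j]: transfer from loudspeaker j to ear i. Applying C = inv(H) to the
    target binaural pair yields loudspeaker signals reproducing it at the ears."""
    return np.linalg.inv(np.asarray(H, dtype=complex))

# Toy matrix: unit direct paths, weaker phase-shifted crosstalk paths.
H = np.array([[1.0, 0.5j],
              [0.5j, 1.0]])
C = crosstalk_canceller(H)
```

In a real system this inversion is performed per frequency bin, with regularisation where H becomes nearly singular, which is exactly the low frequency difficulty described above.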

Static transaural systems can be useful if the listener is in a known position, for example in a car, or in front of a desktop screen. Dynamic systems that respond to change of listener position and rotation are more useful, but much more complex, particularly because the inverse filter must be frequently recalculated. Such systems have been developed for virtual reality CAVE systems, and domestic spatial audio.