VAAE introduction

The Virtual Acoustics and Audio Engineering (VAAE) group is part of the Acoustics Group within the Faculty of Engineering and the Environment at the University of Southampton. Areas of interest include sound field control using dense loudspeaker arrays; 3-dimensional audio capture, representation and reproduction; auralization of acoustic spaces; psychoacoustics of spatial audio; and microphone arrays for aeroacoustic analysis.

We have several collaborations with industrial partners and welcome enquiries about new collaborations. Our facilities include large and medium-sized anechoic chambers, a 40-speaker 3D loudspeaker array, a general-purpose listening room, a reverberation room, and a large assortment of equipment including high-quality multichannel DA/AD converters, pre-amps and amplifiers, microphones, and loudspeakers.

Linking Ambisonics with Compensated Amplitude Panning

The most fun you can have with two loudspeakers

We have published a new paper in IEEE Signal Processing Letters that describes how Ambisonic encodings, made famous by 360 video, can be reproduced exactly and efficiently using an extension of the Compensated Amplitude Panning (CAP) reproduction algorithm.

CAP can be viewed as an extension of the panning processes often used for stereo and multichannel production. Instead of being static, the loudspeaker signals depend dynamically on the precise position and orientation of the listener’s head, which are measured using a tracking process. CAP works by activating subtle but powerful dynamic cues in the hearing system. It is able to produce stable images in any direction, including behind and above, using only two loudspeakers, which greatly increases the practicality of surround sound in some cases. The approach bypasses conventional cross-talk cancellation design, and overcomes previous limitations in providing accurate dynamic cues for full 3D reproduction.

The system will be demonstrated at the forthcoming AES Conference on Immersive and Interactive Audio in York, UK. Papers related to this work will also be presented at the AES Convention in Dublin, Ireland, and at ICASSP in Brighton, UK.

The photo shows the current experimental setup, using an HTC Vive tracker device, at the Institute of Sound and Vibration Research in Southampton. Eventually more practical tracking methods will be used, for example based on video or embedded devices.

The loudspeakers can be placed in a conventional stereo configuration; however, performance improves as the listener moves towards the centre line between the loudspeakers (heads 1-2 in the plan view shown below). Images can be placed in the distance or at particular points in space (dots below). The listener can walk around an image (e.g. image 2 below), or face away from the loudspeakers towards an image (head 3 to image 1), creating an augmented reality experience. Overhead images can even be produced at different positions (e.g. image 3).

CAP can be used to play back conventional multichannel audio by creating virtual loudspeakers to the side and rear. In addition, the Ambisonic decoder extension for CAP (BCAP) can reproduce Ambisonic encodings directly. Each encoding can produce a scene of any complexity, consisting of far and near images, represented by the outer and inner circles below. As the listener moves, the direction to each image stays fixed, which for far images enhances the impression of a fixed background. An Ambisonic encoding can be easily rotated, which can be used, for example, to simulate the background from inside a vehicle or a personal head-up display. The dots represent discrete images produced by instances of CAP.


The CAP system software will be made publicly available as part of the VISR software suite, which has already been released:

http://www.s3a-spatialaudio.org/


Other CAP-related papers are linked here:

Surround sound without rear loudspeakers: Multichannel compensated amplitude panning and ambisonics

A complex panning method for near-field imaging

A low frequency panning method with compensation for head rotation


This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) “S3A” Programme Grant EP/L000539/1, and the BBC Audio Research Partnership.

Demonstrations of CAP and other systems at the University of Southampton campus can be arranged on request.

Contact

d.menzies@soton.ac.uk           dylan.menzies1@gmail.com


VAAE at DAFx18

The DAFx audio technology conference took place this year in the idyllic seaside resort of Aveiro, Portugal. Dylan Menzies presented the paper “Surround Sound without Rear Loudspeakers: Multichannel Compensated Amplitude Panning and Ambisonics”. This extends previous work on compensated panning to three or more loudspeakers, using an efficient formulation. A comparison is made with Ambisonic reproduction using the same loudspeaker layout. The proceedings are available free from http://dafx2018.web.ua.pt

VAAE at AES 144 2018

From May 23rd-26th, the VAAE team participated in the 144th Audio Engineering Society (AES) convention held at the NH Milano Congress Center in Milan, Italy.

The AES is internationally known as the world’s primary audio technology society. Its members include a mix of audio professionals, engineers, musicians, developers, researchers, educators and students, all contributing to the greater body of knowledge dedicated to audio technology. These contributions are made through the AES journal (JAES), conference papers, engineering briefs, technical demonstrations, and more.

The VAAE made two contributions to AES 144 this year:

On the opening day, VAAE Research Fellow Dr. Andreas Franck presented a poster on the Binaural Synthesis Toolkit (BST), an “open source, portable, and extensible software library for real-time and offline binaural synthesis” currently being developed by himself and fellow VAAE staff member Giacomo Costantini. The poster summarised their more detailed engineering brief titled “An Open Realtime Binaural Synthesis Toolkit for Audio Research”, which has further coauthors Chris Pike (BBC R&D) and VAAE team leader Dr. Filippo Fazi. In addition to the poster, Dr. Franck demoed custom audio plugins that allow the BST to be used in a digital audio workstation (DAW). The demo highlighted the dynamic binaural rendering capabilities of the BST using headphones equipped with a low-cost head tracker that is supported directly by the toolkit.

On the final day of the conference, Dr. Filippo Fazi gave a talk on the paper “Stage Compression in Transaural Audio”, co-authored by himself and VAAE PhD student Eric Hamdan. The paper analyses the use of Tikhonov regularisation when creating cross-talk cancellation filters and how doing so leads to a phase difference error between the left- and right-ear binaural signals. In the low-frequency region, this phase difference error maps to a smaller interaural time difference (ITD) than intended. Given the salience of low-frequency ITD, this phenomenon indicates a potential shift of the perceived virtual source image to a more forward angular position. The authors call this shifting effect stage compression. In the picture below, Dr. Fazi can be seen giving the presentation on a nice Saturday morning in Milan!

Our work and the rest of the written work presented at AES 144 can be found here.

New article: ‘A Complex Panning Method for Near-Field Imaging’

https://ieeexplore.ieee.org/document/8352907/

Stereo is one of the oldest spatial reproduction methods. At VAAE we have been experimenting with how stereo capabilities can be expanded if the listener’s head rotation and position can be tracked. It turns out quite a bit can be done. Images can be stabilised and produced in any direction, not just between the loudspeakers. In this latest work, images are produced in the near field, within a metre of the listener. This is achieved by controlling two low-frequency localisation cues, the Inter-aural Time Difference (ITD) and the Inter-aural Level Difference (ILD); this is not possible without the tracking. Tracking systems are currently a bit clunky, but they are advancing rapidly, as there are applications in many areas.

VAAE at the Berlin Beamforming Conference (BeBeC) 2018

The VAAE Team will be participating in the 7th Berlin Beamforming Conference (BeBeC), starting tomorrow (5 March) at the Humboldt University of Berlin.

The BeBeC is a biennial conference on sound source localisation and quantification, with a strong emphasis on – you guessed it! – beamforming methods applied to phased array data. The BeBeC gathers academic and industry experts from around the world, with plenty of space in its two days for theoretical and practical considerations. Despite the strong emphasis on beamforming methods applied to aeroacoustics, there is plenty of general discussion, including automotive industry cases, novel beamforming methods and technologies, and general inverse methods.

The VAAE Team will be represented by Fabio Casagrande Hirono, who is giving a talk based on the paper “Wavenumber-Domain Method for Source Reconstruction from Far-Field Array Measurements in Flow”. All the papers published in all editions of the conference are open access and can be downloaded from the conference website; this year’s programme is now available, so go have a look!

Brief Introduction to Spatial Audio Reproduction

Dylan Menzies

Spatial audio reproduction is the term given to systems and processes that can give the impression of sounds coming from different directions and locations around the listener. Ideally this experience should be indistinguishable from listening to real sources. Spatial audio was originally important for domestic music reproduction, but has now moved into many areas, including games, user interface displays, and virtual and augmented reality.

On the face of it, designing such systems seems fairly straightforward. In practice there is a wide variety of approaches and combinations, each with particular advantages and limitations. The reproduction process can also depend on how the sound scene is represented. All this can be quite confusing to the uninitiated. The aim here is to carefully present the basic aspects and show how they work together.


Headphones – Binaural

Headphones make it possible to inject sound directly into the ear, giving direct control of the binaural signals that pass to the inner ears. Source signals are modified by the scattering of the head and torso to form the binaural signals. The scattering can be captured in transfer functions, the Head Related Transfer Functions (HRTFs), and used to simulate the physical process in order to produce binaural signals that evoke images in the desired locations. It is more common in practice to deal with the time-domain form, the Head Related Impulse Response (HRIR). Note this is distinct from the most common use of headphones, which is the playback of stereo signals (see below). Some practical difficulties are that HRTFs can only be stored with limited resolution, and that each person has different HRTFs. The resolution can be improved using specialised interpolation methods, some of which are based on methods first used in loudspeaker reproduction (see below). HRTFs can be chosen as a best fit across the population or, better, they can be adapted for each individual from a database, possibly using geometric data taken from the head.
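As a rough sketch of the static rendering step, the example below convolves a mono source with a left/right HRIR pair for the desired direction. It is an illustration only, not the implementation of any particular toolkit, and the HRIRs are crude placeholders standing in for measured data.

import numpy as np
from scipy.signal import fftconvolve

def render_binaural(source, hrir_left, hrir_right):
    """Return (left, right) ear signals for one static source direction."""
    return fftconvolve(source, hrir_left), fftconvolve(source, hrir_right)

# Placeholder HRIRs: a real system selects (or interpolates) a measured pair
# from an HRIR set for the desired source direction.
fs = 48000
source = np.random.randn(fs)                        # 1 s test signal
hrir_left = np.zeros(256);  hrir_left[5] = 1.0      # dummy: earlier, stronger at the left ear
hrir_right = np.zeros(256); hrir_right[25] = 0.7    # dummy: later, weaker at the right ear
left, right = render_binaural(source, hrir_left, hrir_right)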

If the listener’s head is assumed stationary, then the HRTFs needed for each image location are fixed. However, in many applications, such as VR/AR, the head moves significantly, so it must be tracked in order to calculate the correct transfer functions for the images. The resulting dynamic system, although simple in concept, is fairly complex and computationally demanding.

In natural scenes, reflections from surfaces are important because they provide cues about source localisation and the environment. In headphone reproduction, reflected signals must be reproduced directly in the headphones. It is common to record transfer responses that include the reflections following the direct signal, known as Binaural Room Impulse Responses (BRIRs). In this way complex reflections can be reproduced in a binaural signal without reproducing each reflection individually.


Loudspeakers – Panning

The stereo system created by Blumlein is one of the oldest spatial reproduction systems, and is still the most common today. Each source signal is mixed in phase into the left and right channels, and the image position is controlled by varying the left/right gain balance. This can be produced either by manual mixing or by recording a scene using a stereo microphone, in which case the directivity functions of the capsules implement the mixing directly. The binaural signals of the listener in this case are mixtures of the loudspeaker signals. If the listener is centrally located then, in the low-frequency region below 1500 Hz, the binaural signals are similar but with a time offset depending on the gain balance. This mimics the offset that occurs when a real source is presented, where the offset depends on source direction. The time offset is known as the Inter-aural Time Difference (ITD), and provides a strong cue for localisation. A useful feature of panning is that the image direction does not depend on the listener’s head size.
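For illustration, the sketch below applies one common constant-power gain law; the exact law varies between production systems, so treat this as an assumed example rather than a definitive formula.

import numpy as np

def stereo_pan_gains(pan):
    """pan in [-1, 1]: -1 is hard left, 0 is centre, +1 is hard right.
    Constant-power law, so g_left**2 + g_right**2 == 1."""
    theta = (pan + 1.0) * np.pi / 4.0
    return np.cos(theta), np.sin(theta)

g_left, g_right = stereo_pan_gains(0.5)             # image between centre and right
source = np.random.randn(48000)
left_channel, right_channel = g_left * source, g_right * source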

Stereo panning can be extended to more complex loudspeaker systems. For a horizontal surround array of loudspeakers, such as 5.1, images can be panned pairwise, i.e. between the nearest two loudspeakers. Vector Base Amplitude Panning (VBAP) extends stereo panning to three dimensions by mixing between three nearby loudspeakers for each image. VBAP is typically used in an Object-Based context, which means that a sound scene is represented by all the separate sound sources. Each source is reproduced separately by first working out which three loudspeakers to use, and then the gains. However, it is also possible to use VBAP in a Channel-Based context, in which the scene is represented with a fixed number of channels and these are decoded to the loudspeakers. Stereo is the simplest example of this.
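The VBAP gain computation for a single image can be sketched as a small linear solve: the image direction is expressed as a non-negative combination of the three loudspeaker unit vectors, and the resulting gains are normalised. The loudspeaker directions below are arbitrary examples, and the constant-power normalisation is one common choice among several.

import numpy as np

def vbap_gains(image_dir, speaker_dirs):
    """image_dir: unit vector (3,); speaker_dirs: three unit vectors as rows (3, 3).
    Returns three gains, normalised to constant power."""
    # Solve image_dir = g @ speaker_dirs for the gains g.
    g = np.linalg.solve(speaker_dirs.T, image_dir)
    if np.any(g < 0):
        raise ValueError("image lies outside this loudspeaker triplet")
    return g / np.linalg.norm(g)

speakers = np.array([[1.0,  0.3, 0.0],
                     [1.0, -0.3, 0.0],
                     [1.0,  0.0, 0.5]])
speakers /= np.linalg.norm(speakers, axis=1, keepdims=True)
image = np.array([1.0, 0.1, 0.2])
gains = vbap_gains(image / np.linalg.norm(image), speakers)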

In the 1970s manufacturers started providing horizontal 4-loudspeaker Quadraphonic systems. It was found that pairwise panning could not produce clear lateral images. This provided some of the motivation for the Ambisonics methodology developed by Gerzon. The classical Ambisonics decoding process mixes the low-frequency part of a source signal over all the loudspeakers, with negative as well as positive gains, so that the pressure and velocity of the incident target sound field are reproduced accurately, provided the head is in the sweet spot. This ensures that the low-frequency ITD is accurate for any target image and listener head direction, so lateral images are solid, unlike with the pairwise approach. Ambisonics is channel-based, like stereo. The encoding, B-format, is equivalent to the signals from ideal omni and figure-of-eight microphones. These can be easily decoded to a great variety of loudspeaker arrays, which makes B-format flexible as well as compact. Ambisonics and B-format can be extended naturally to encodings with more channels and higher resolution, called High Order Ambisonics (HOA). In many cases the listeners are distributed and few are in the sweet spot, for example at concerts. The low-frequency cues are then confused, but the high-frequency cues are improved by the greater resolution. There are a variety of Ambisonic decoding strategies depending on frequency range and application. The classical decoding just described is a mode-matching method for a single listener, whereby components of the sound field are physically reproduced. However, mode-matching is a poor choice for higher orders and frequencies because for many arrays the problem is badly conditioned. A popular alternative approach is to first define panning functions for each loudspeaker, based for example on VBAP. This ensures the energy for each source is localised on the array, and so the result is acceptable for listeners in different locations.
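A first-order B-format encoding of a single source can be sketched as below. Ambisonic channel ordering and normalisation conventions vary; the traditional W scaling of 1/sqrt(2) is assumed here purely for illustration.

import numpy as np

def encode_bformat(source, azimuth, elevation):
    """Encode a mono signal at (azimuth, elevation), in radians, into W, X, Y, Z."""
    w = source / np.sqrt(2.0)                        # omni component (traditional scaling)
    x = source * np.cos(azimuth) * np.cos(elevation) # figure-of-eight components
    y = source * np.sin(azimuth) * np.cos(elevation)
    z = source * np.sin(elevation)
    return np.stack([w, x, y, z])

s = np.random.randn(48000)
bformat = encode_bformat(s, azimuth=np.pi / 2, elevation=0.0)   # image to the left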

Ambisonics and VBAP can be applied in binaural systems to generate virtual loudspeaker arrays that are then decoded to binaural signals. B-format can be easily transformed to match sound field rotation, which supports dynamic head rotation in binaural reproduction.
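For example, a rotation of a first-order scene about the vertical axis only mixes the X and Y channels, which is one reason head-yaw compensation is cheap. The sketch below assumes the same channel convention as the encoding example above.

import numpy as np

def rotate_bformat_yaw(bformat, theta):
    """bformat: (4, N) array of W, X, Y, Z signals; theta: yaw rotation in radians."""
    w, x, y, z = bformat
    x_rot = np.cos(theta) * x - np.sin(theta) * y
    y_rot = np.sin(theta) * x + np.cos(theta) * y
    return np.stack([w, x_rot, y_rot, z])

bformat = np.random.randn(4, 48000)                          # stand-in B-format signals
compensated = rotate_bformat_yaw(bformat, -np.deg2rad(30.0)) # counter a 30 degree head yaw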

From the foregoing discussion it will be clear that the simple panning strategies, Stereo and VBAP, have an essential defect: for fixed panning gains, each image direction depends on the orientation of the listener’s head. Depending on the configuration, the error between the produced and desired image direction can be small or large. Compensated Amplitude Panning (CAP) takes the head orientation into account so that the resulting ITD cue matches the target image. With only two loudspeakers it is possible to produce stable images in any direction, and for nearly any head orientation. Dynamic ITD cues make this possible; these are short histories of ITD which the hearing system correlates with changes of head orientation to deduce image direction. CAP can be extended for near-field imaging by controlling the Inter-aural Level Difference (ILD) cue. This requires some simple filtering, so it can still be viewed as an extension of simple panning methods.

Loudspeakers – Sound Field Control

With enough loudspeakers, basic acoustic theory shows that it is possible to control the sound field over a region of space to a desired accuracy. If the sound field matches that of a target sound scene, then listeners entering the region will experience the target. Mode-matched first-order Ambisonic decoding provides sound field control, up to 700 Hz, over a region that can enclose a human head. The key detail here is that the incident sound field should be accurate where the significant scattering surfaces are placed.
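A mode-matching first-order decoder can be sketched as the pseudo-inverse of the matrix of encoding vectors evaluated at the loudspeaker directions. The layout and the encoding convention below are assumptions for illustration; practical decoders typically add frequency-dependent weighting.

import numpy as np

def encoding_vector(azimuth, elevation):
    """First-order encoding vector (same convention as the encoding sketch above)."""
    return np.array([1.0 / np.sqrt(2.0),
                     np.cos(azimuth) * np.cos(elevation),
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation)])

def mode_matching_decoder(speaker_directions):
    """speaker_directions: list of (azimuth, elevation) in radians.
    Returns D such that loudspeaker signals = D @ [W, X, Y, Z]."""
    Y = np.stack([encoding_vector(az, el) for az, el in speaker_directions], axis=1)  # (4, L)
    return np.linalg.pinv(Y)                                                          # (L, 4)

# Hypothetical horizontal square layout at zero elevation.
D = mode_matching_decoder([(np.pi / 4, 0), (3 * np.pi / 4, 0),
                           (-3 * np.pi / 4, 0), (-np.pi / 4, 0)])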

Mode-matching with HOA, where possible, extends the sound field accuracy at the central point, and also the size of the surrounding region that achieves a nominal level of accuracy. With densely packed loudspeakers it is possible to control the whole region within the surrounding loudspeaker array, so that multiple listeners can move in this area. An extension of HOA, Near-Field HOA (NF-HOA), is needed to make this work correctly. NF-HOA takes into account the finite distance of the loudspeakers, which is particularly important for reconstruction near the loudspeakers. This comes at the cost of more complex decoding filters. A less costly and less accurate alternative for this case is Wavefield Synthesis.

Wavefield Synthesis is a sound field control method based on the ideal reproduction of a sound field on one side of a plane of continuous sources. In its basic form it is not localised, like HOA. In practice the sources are discrete loudspeakers, usually arranged along a horizontal linear boundary that can be curved and possibly closed. After approximations are made, the driving function for a simple source located anywhere outside the array is very simple, consisting of one filter and a delay for each loudspeaker. An HOA encoding can be decoded in a wavefield style by first converting it to a plane wave encoding. In Local Wavefield Synthesis the reproduction effort is restricted to a sub-region, by focusing on points around the sub-region, creating a virtual sub-array. This requires fewer loudspeakers than the default case where the whole region is controlled.
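The per-loudspeaker part of the driving function can be sketched as a distance-dependent delay and weight, as below. The shared pre-equalisation filter and the usual tapering window are omitted, and the geometry and 1/sqrt(r) weighting are illustrative assumptions rather than a complete WFS implementation.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def wfs_delays_and_weights(source_pos, speaker_pos):
    """source_pos: (2,) virtual source behind the array; speaker_pos: (N, 2) positions.
    Returns per-loudspeaker delays in seconds and simplified amplitude weights."""
    r = np.linalg.norm(speaker_pos - source_pos, axis=1)
    delays = r / SPEED_OF_SOUND
    weights = 1.0 / np.sqrt(r)            # simplified distance weighting for illustration
    return delays, weights

speakers = np.stack([np.linspace(-2.0, 2.0, 16), np.zeros(16)], axis=1)  # linear array
delays, weights = wfs_delays_and_weights(np.array([0.0, -3.0]), speakers)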

Pressure matching is another approach to sound field control, in which the target sound field pressure and possibly velocity are specified at several points. In general this approach produces complex filters. Distributed modal matching combines pressure matching with mode matching, by specifying modes at several points, and has some numerical advantages. The modes are consistent in the overlapping regions.
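At a single frequency, pressure matching reduces to a regularised least-squares problem: find loudspeaker weights whose predicted pressures at the control points best match the target. The sketch below uses made-up free-field monopole transfer functions and a target plane wave purely for illustration.

import numpy as np

def pressure_matching_weights(G, p_target, beta=1e-3):
    """G: (M control points x N loudspeakers) complex transfer matrix at one frequency.
    p_target: (M,) desired pressures. beta: Tikhonov regularisation parameter."""
    N = G.shape[1]
    A = G.conj().T @ G + beta * np.eye(N)
    return np.linalg.solve(A, G.conj().T @ p_target)

k = 2 * np.pi * 500.0 / 343.0                                    # wavenumber at 500 Hz
speakers = np.stack([np.linspace(-2.0, 2.0, 8), np.full(8, 2.0)], axis=1)
points = np.stack([np.linspace(-0.2, 0.2, 5), np.zeros(5)], axis=1)
r = np.linalg.norm(points[:, None, :] - speakers[None, :, :], axis=2)
G = np.exp(-1j * k * r) / (4 * np.pi * r)                        # free-field monopoles
p_target = np.exp(-1j * k * points[:, 0])                        # target: plane wave along x
weights = pressure_matching_weights(G, p_target)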


Loudspeakers – Transaural

Panning works primarily by controlling the ITD cue. The natural question arises: can loudspeakers be used to control the binaural signals completely, as headphones can? The transfer response from the loudspeakers to the ears is a matrix of HRTFs. If this can be inverted, then the inverse can be applied to the target binaural signals to produce the loudspeaker signals. This is Transaural reproduction. It turns out the inversion can be quite successful, producing an experience like binaural reproduction over headphones. Two or more loudspeakers are needed. In the Stereo Dipole and Optimal Source Distribution (OSD) systems the angular separation of the loudspeakers is chosen to enable filters that have minimum phase distortion in the reproduced signals. In OSD several pairs are used, each reproducing one frequency band of the whole signal. Noise from room reflections also causes distortion. An alternative strategy is to create narrow beams towards the ears, reducing leakage into the room, although this increases phase distortion. In either case the system is implicitly designed to create maximum signal isolation between the two ears. This is difficult particularly at low frequencies, and reproduction of all binaural signals at these frequencies suffers. This is bad news because low-frequency ITD is an important cue. An alternative approach is to use stereo panning or CAP for low frequencies, and transaural reproduction only for high frequencies where it is effective. CAP can be applied in either a channel-based or object-based context.
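The inversion step can be sketched per frequency bin as a Tikhonov-regularised inverse of the 2x2 loudspeaker-to-ear matrix, echoing the regularisation discussed in the AES 144 paper above. The plant data here are random stand-ins rather than measured HRTFs, and the regularisation constant is an arbitrary example value.

import numpy as np

def crosstalk_cancellation_filters(H, beta=1e-2):
    """H: (K, 2, 2) plant matrices, one per frequency bin, mapping the two
    loudspeaker spectra to the two ear spectra. Returns (K, 2, 2) filters F
    such that loudspeaker spectra = F @ target binaural spectra."""
    F = np.empty_like(H)
    I = np.eye(2)
    for k_bin in range(H.shape[0]):
        Hk = H[k_bin]
        # Regularised least-squares inverse: (H^H H + beta I)^-1 H^H
        F[k_bin] = np.linalg.solve(Hk.conj().T @ Hk + beta * I, Hk.conj().T)
    return F

H = np.random.randn(257, 2, 2) + 1j * np.random.randn(257, 2, 2)  # stand-in plant data
F = crosstalk_cancellation_filters(H)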

Static transaural systems can be useful if the listener is in a known position, for example in a car or in front of a desktop screen. Dynamic systems that respond to changes of listener position and rotation are more useful, but much more complex, particularly because the inverse filter must be frequently recalculated. Such systems have been developed for virtual reality CAVE systems and for domestic spatial audio.