Thursday, September 12, 2019

AES Conference on Headphones, Part 2—ANC, MEMS, Manufacturing

I continue to share my notes from the recent AES Conference on Headphones. This post is about Active Noise Cancellation (ANC), Microelectromechanical (MEMS) technologies for speakers and microphones, and topics on headphones manufacturing, measurement, and modelling.

Active Noise Cancelling

Apparently, Active Noise Cancelling (ANC) is a big thing for the consumer market and is an interesting topic for research because it involves both acoustics and DSP. ANC technologies save our ears because they allow listening to music in noisy environments without blasting it at levels that damage hearing. Typically, ANC is implemented on closed headphones or earphones as their physical construction allows to attenuate some of the noise passively, especially at middle and high frequencies. Low frequency noise has to be eliminated using active technologies. Since this requires embedding electronics into headphones, even for wired models, it also gives headphone designers a good opportunity to add "sound enhancing" features like active equalization and crossfeed.

The obvious approach to active noise cancellation is to put a microphone on the outer side of the ear cup and generate an inverse sound via the speaker to cancel the noise at the eardrum. However, as there is always some leakage of the noise inside the ear, "slow" or too aggressive noise inversion will create unpleasant comb filter effect due to summing of the noise with its delayed inverted copy.

An interesting idea that helps to win some time for more precise noise cancelling is to capture the noise from the ear cup on the opposite ear, as in this case the acoustic wave of the noise will have to travel some extra distance over the head. However, as an engineer from Bose has explained to me, the actual results will depend on the plane of the sound wave with respect to the listener.

One consideration that has to be taken into account when generating an inverse noise is to avoid creating high peaks in the inverse transfer function from notches in the original function. The process that helps to avoid that is called "regularization". It is described in this AES paper from 2016.

Use of ANC puts additional requirements on the drivers used in the headphones. As low frequency noise needs the most attenuation, a high displacement speaker driver is required to produce an adequate counter pressure. This typically requires increasing the size of the driver which in turn increases the amount of distortion at higher frequencies. This paper contains an interesting analysis of these effects for two commercial drivers.

"Hear Through"

"Hear Through" is a commonly used term for describing the technology that presents the environmental sounds to the listener wearing noise cancelling headphones. This is achieved by playing the sound captured by the outer microphone into the corresponding ear (basically, performing a real-time "dummy head" sound field capture which I was describing in the previous post). The Sennheiser AMBEO headset and AKG N700NC headphones implement "hear through", however not perfectly according to my experience—the external sound has some coloration and some localization problems. Although, that doesn't affect the ability to understand speech, it still feels unnatural, and there is some ongoing research to make "hear through" more transparent.

According to the study described in the paper "Study on Differences between Individualized and Non-Individualized Hear-Through Equalization...", there are two factors that affect "transparency" of the played back sound. First, it is the fact that closed headphones act as a filter when interacting with the ear, and this filter has to be compensated. Second, it's already mentioned sound leakage. Because "hear through" involves sound capture, processing, and playing back, it has non-negligible latency that creates a comb filter with the original leaked noise. The researchers demonstrated that use of personal "hear through" equalization (HT EQ, specific both to the person and the headphones) can achieve very convincing reproduction. However, the acquisition of HT EQ parameters has to be performed in an anechoic room (similar to classic HRTF acquisition), and thus is not yet feasible for commercial products.

MEMS Technologies

I must admit, I've been completely unaware of this acronym before I attended the conference. But turns out, this technology isn't something new. A portable computer you use to read this article contains several MEMS microphones in it. The key point about this technology is that resulting devices are miniature and can be produced using the same technologies as the ones used for integrated circuits (IC). The resulting device is packaged into a surface mounted (SMD) component. The use of IC process means huge production volumes are easily possible, and the variation in components is quite low.

Initially I thought that MEMS means piezoelectric technology, but in fact any existing transducer technology can be used for engineering MEMS speakers and microphones: electret, piezo, and even electrostatic as was demonstrated in the paper "Acoustic Validation of Electrostatic All-Silicon MEMS-Speakers".

MEMS microphones are ubiquitous. The biggest manufacturer is Knowles. For example, their popular model SPH1642HT5H-1 that has high SNR and low THD costs less than $1 if bought in batches. Due to the miniature size MEMS microphones are omnidirectional across the whole audio range. Because of this fact I, was wondering whether MEMS microphones can be used for acoustic measurements. Turns out, researchers were experimenting with them for this purpose since 2003 (see this IEEE paper). However, the only commercially available MEMS measurement microphone I could find—from IK Multimedia—doesn't seem to provide a stellar performance.

Engineering a MEMS speaker is more challenging than a microphone due to miniature size. Apparently, the output sound level decays very quickly, so currently they can't be used for laptop or phone speakers. The only practical application for MEMS speakers at the moment is in-ear headphones where the effect of a pressure chamber in an occluded ear canal boosts their level a bit. A prototype of MEMS earphones was presented in the paper "Design and Electroacoustic Analysis of a Piezoelectric MEMS In-Ear Headphone". The earphone is very minimalist and can be made DIY because it basically consists of a small PCB with a soldered on MEMS speaker and a 3D-printed enclosure. The performance isn't satisfying yet, but there is definitely some potential.

Headphones Manufacturing, Measurement, and Modelling

This is a collection of notes that I've gathered from the workshop on "Practical Considerations of Commercially Viable Headphones" (a "workshop" format means that there was no paper submitted to AES), my chats with engineers from headphone companies, and conversations with the representatives of measurement equipment companies.

Speaker driver materials

The triangle of driver diaphragm properties:
The effect of the low mass is in a better sensitivity, as lower force is required to move the diaphragm. Good mechanical damping is needed for producing transients truthfully and without post-ringing. And the higher the rigidity, the more the diaphragm resembles a theoretical "piston", and thus has lower distortion.

In practice, it is hard to satisfy all of the properties. For example, the classical paper cone diaphragm has good rigidity but high mass. Rigid metal diaphragms can lack good damping and "ring". I would also add the fourth dimension here—the price. There are some diaphragms on the market that satisfy all three properties, but they are very expensive due to use of precise materials and a complicated manufacturing process.

Driver diaphragms for headphone speakers are typically produced from various polymers as they can be stamped easily. In terms of the resulting diaphragm quality, the following arrangement has been presented, from worst to best:
However, it looks like even better results are achieved with beryllium foil (used by Focal company), but these diaphragms are quite expensive.

If we step away from dynamic drivers, planar magnetic drivers are very well damped and have a lightweight diaphragm, they also move as a plane. The problem with their production is a high chance of defect, each speaker has to be checked individually, that’s why they mostly used in expensive hi-end headphones. Big companies like AKG, Beyerdynamic, Sennheiser, and Shure use classic dynamic drivers even on their flagship models.

Measurements and Tuning

Regarding the equipment, here is a measurement rig for over-ear and in-ear headphones from Audio Precision. It consists of APx555 Audio Analyzer, APx1701 Test Interface (basically a high-quality wide bandwidth amplifier), and AECM206 Test Fixture to put the headphones on.

APx555 is modular. The one in the picture is equipped with a Bluetooth module, and I've been told that it supports HD Audio protocols: AAC, aptX, and LDAC.

Besides AP's AECM206 test fixture, a full head and torso simulator (HATS), e.g. KEMAR from GRAS can be used. For earphone measurements it is sufficient to use an ear simulator as earphones do not interact with the pinna.

Companies Brüel & Kjær and Listen Inc also presented their measurement rigs and software. Prices on this equipment are on the order of tens of thousands of dollars, which is expected.

Measuring the headphones correctly is a challenging problem. There is a nice summary in these slides, courtesy of CJS Labs. The resulting frequency response curves can vary due to variations in placement of the headphones on the fixture. Usually, multiple trials are required with re-positioning of the headphones and averaging.

When measuring distortion, the first requirement is to perform it in quiet conditions to avoid external noise from affecting the results. Second, since measurement microphones are typically band-limited to the audio frequencies range, THD measurement at high frequencies can't be done adequately using the single tone method. Instead, non-linearity is measured using two tone method (intermodular distortion).

The usual "theoretical" target for headphones is to imitate the sound of good (linear) stereo loudspeakers. The deviation between frequency response of the headphones from the frequency response recorded from loudspeakers using a HATS simulator is called "insertion gain". Ideally, it should be flat. However, listening to the speakers can happen under different conditions: extremes are free field and diffuse field. So the real insertion gain of headphones is never flat, and it is usually tweaked according to the taste of the headphones designer. 

There is one interesting effect which occurs when using closed-back headphones or earphones. Due to ear canal occlusion, the sound level from headphones must be approximately 6 dB higher to create the same perceived loudness as from a loudspeaker. This is called “Missing 6 dB Effect”, and a full description can be found in this paper. Interestingly, the use of ANC could help with reducing the effects of occlusion, see the paper "The Effect of Active Noise Cancellation on the Acoustic Impedance of Headphones" which was presented on the conference.

Speaking of ANC, measuring its performance is yet another challenge due to the absence of industry-wide standards. This is explained in the paper "Objective Measurements of Headphone Active Noise Cancelation Performance".

Modelling and Research

Thanks to one of the attendants of the conference, I've learned about works of Carl Poldy (he used to work at AKG Acoustics, then at Philips), for example his AES seminar from 2006 on Headphone Fundamentals. It provides classical modelling approaches using electrical circuits and two-port ABCD model. The two-port model can be used for simulating in the frequency domain. Time domain simulation can be done using SPICE, see this publication by Mark Kahrs.

However, these modelling are more of "academic" nature. "Practical" modelling was presented by the representatives of COMSOL company. Their Multiphysics software can simulate creation of acoustic waves inside the headphones and how they travel through the ear's acoustic canals and bones. This was quite impressive.

Another interesting paper related to research, "A one-size-fits-all earpiece with multiple microphones and drivers for hearing device research" presents a device that can be used in hearables research. It consists of an ear capsule with two dynamic drivers and two microphones. It is called "Transparent earpiece", more details are available here.

Thursday, September 5, 2019

AES Conference on Headphones, Part 1—Binaural Audio

I was lucky to attend the AES Conference on Headphones held in San Francisco on August 27–29. The conference represents an interesting mix of research, technologies, and commercial products.
I learned a lot of new things and was happy to have conversations with both researchers and representatives of commercial companies that produce headphones and other audio equipment.

There were several main trends at the conference:
  • Binaural audio for VR and AR applications
    • HRTF acquisition, HCF
    • Augmented reality
    • Capturing of sound fields
  • Active noise cancellation
    • "Hear through"
  • MEMS technologies for speakers and microphones
    • Earphones and research devices based on MEMS
  • Headphone production: modeling, materials, testing, and measurements.
In this post I'm publishing my notes on binaural audio topics.

"Binaural audio" here means "true to life" rendering of spatial sounds in headphones. The task here is as follows—using headphones (or earphones) produce exactly the same sound pressure on the eardrums as if it was from a real object emitting sound waves from a certain position in a given environment. It is presumed that by doing so we will trick the brain into believing that this virtual sound source is real.

And this task is not easy! When using loudspeakers, commercially available technologies usually require use of multiple speakers located everywhere around the listener. Produced sound waves interact with the listener's body and ears, which helps the listener to determine positions of virtual sound sources. While implementing convincing surround systems is still far from trivial, anyone who had ever visited a Dolby Atmos theater can confirm that the results sound plausible.

HRTF Acquisition, HCF

When a person is using headphones, there is only one speaker per ear. Speakers are positioned close to the ears (or even inside ear canals), thus sound waves skip interaction with the body and pinnaes. In order to render correct binaural representation there is a need to use a personal Head-related Transfer Function (HRTF). Traditional approaches to HRTF acquisition require use half-spheres with speakers mounted around the person, or use moving arcs with speakers. Acquisition is done in an anechoic room, and measurement microphones are inserted into the person’s ear canals.

Apparently, this is not a viable approach for consumer market. HRTF needs to be acquired quickly and under "normal" (home) conditions. There are several approaches that propose alternatives to traditional methods, namely:
  • 3D scanning of the person's body using consumer equipment, e.g. Xbox Kinect;
  • AI-based approach that uses photos of the person's body and ears;
  • self-movement of a person before a speaker in a domestic setting, wearing some kind of earphones with microphones on them.
On the conference there were presentations and demos from Earfish and Visisonics. These projects are still in the stage of active research and offer individuals to try them in order to get more data. Speaking of research, while talking with one of the participants I've learned about structural decomposition of HRTF, where the transfer function is split it into separate parts for head, torso, and pinnaes which are combined linearly. This results in simpler transfer functions and shorter filters.

There was an interesting observation mentioned by several participants that people can adapt to “alien” HRTF after some time and even switch back and forth. This is why research on HRTF compensation is difficult—researchers often get used to a model even if it represents their own HRTF incorrectly. Researchers always have to ask somebody unrelated to check the model (similar problem in lossy codecs research—people get themselves trained to look for specific artifacts but might miss some obvious audio degradation). There is also a difficulty due to room divergence effect—when sounds are played via headphones in the same room they have been recorded in, they are perfectly localizable, but localization breaks down when the same sounds are played in a different room.

Although use of “generic” HRTFs is also possible, in order to minimize front / back confusion use of head tracking is required. Without head tracking, use of long (RT60 > 1.5 s) reverberation can help.

But knowing the person's HRTF constitutes only one half of the binaural reproduction problem. Even assuming that a personal HRTF has been acquired, it's still impossible to create exact acoustic pressure on eardrums without taking into account the headphones used for reproduction. Unlike speakers, headphones are not designed to have a flat frequency response. Commercial headphones are designed to recreate the experience of listening over speakers, and their frequency response curve is designed to be close to one of the following curves:
  • free field (anechoic) listening environment (this is less and less used);
  • diffuse field listening environment;
  • "Harman curve" designed by S. Olive and T. Welti (gaining more popularity).
And the actual curve is often neither of those, but rather is tuned to the taste of the headphone designer. Moreover, the actual pressure on the eardrums in fact depends on the person who is wearing the headphones due to interaction of the headphones with pinnae and ear canal resonance.

Thus, a complementary to HRTF is Headphone Compensation Function (HCF) which "neutralizes" headphone transfer function and makes headphone frequency response flat. As well as HRTF the HCF can be either "generic"—measured on a dummy head, or individualized for a particular person.

The research described in "Personalized Evaluation of Personalized BRIRs..." paper explores whether use of individual HRTF and HCF results in better externalization, localization, and absence of coloration for sound reproduced binaurally in headphones compared to real sound from a speaker. The results confirm that, however even with "generic" HRTF it's possible to achieve a convincing result if it's paired with a "generic" HCF (from the same dummy head). Turns out, it's not a good idea to mix individualized and generic transfer functions.

Speaking of commercial applications of binaural audio, there was a presentation and a demo of how Dolby Atmos can be used for binaural rendering. Recently a recording Henry Brant's symphony “Ice Field” was released on HD streaming services as a binaural record (for listening with headphones). The symphony was recorded using 100 microphones and then mixed using Dolby Atmos production tools.

It seems that the actual challenge while making this recording was to arrange the microphones and mix all those 100 individual tracks. The rendered "3D image" to my opinion isn't very convincing. Unfortunately, Dolby does not disclose the details of Atmos for Headphones implementation so it's hard to tell what listening conditions (e.g. what type of headphones) they target.

Augmented Reality

Augmented reality (AR) implementation is even more challenging than for virtual reality (VR) as presented sounds must not only be positioned correctly but also blend with environmental sounds and respect the acoustical conditions (e.g. the reverberation of the room, the presence of objects that block and / or reflect and / or diffract sound). That means, an ideal AR system must somehow "scan" the room for finding out its acoustical properties, and continue doing that during the entire AR session.

Another challenge is that AR requires very low latency, < 30 ms for the sound to be presented from human's expectation. The latter is tricky to define, as the "expectation" can come from different sources: a virtual rendering of an object in AR glasses, or a sound captured from a real object. Similarly to how video AR system can display virtual walls surrounding a person, and might need to modify a captured image for proper shading, an audio AR system would need to capture the sound of the voice coming from that person, process it, and render with reverberation from those virtual walls.

There was and interesting AR demo presented by Magic Leap using their AR glasses (Magic Leap One) and Sennheiser AMBEO headset. In the demo, the participant could “manipulate” virtual and recorded sound sources which also had AR presentations as pulsating geometrical figures.

Another example of AR processing application is “Active hearing”, that is, boosting certain sound sources analogous to cocktail party effect phenomenon performed by human brain, but done artificially. In order to make that possible the sound field must first be "parsed" by AI into sound sources localized in space. Convolutional Neural Networks can do that from recordings done by arrays of microphones or even from binaural recordings. 

Capturing of Sound Fields

This means recording environmental sounds so they can be reproduced later to recreate the original environment as close as possible. The capture can serve several purposes:
  • consumer scenarios—accompanying your photos or videos from vacation with realistic sound recordings from the place;
  • AR and VR—use of captured environmental sounds for boosting an impression of "being there" in a simulation;
  • acoustic assessments—capturing noise inside a moving car or acoustics of the room for further analysis;
  • audio devices testing—when making active audio equipment (smart speakers, noise cancelling headphones etc) it's important to be able to test it in a real audio environment: home, street, subway, without actually taking the device out of the lab.

The most straightforward approach is to use a dummy or real head with a headset that has microphones on it. Consumer-oriented equipment is affordable—Sennheiser AMBEO Headset costs about $200—but it usually has low dynamic range and higher distortion levels that can affect sound localization. Professional equipment costs much more—Brüel & Kjær type 4101-B binaural headset costs about $5000, and that doesn't include a microphone preamp, so the entire rig would cost like a budget car.

HEAD Acoustics offers an interesting solution called 3PASS where a binaural recording captured using their microphone surround array can later be reproduced on a multi-loudspeaker system in a lab. This is the equipment that can be used for audio devices testing. The microphone array looks like a headcrab. Here is the picture from HEAD's flyer:
When doing a sound field capture using a binaural microphone the resulting recording can’t be further transformed (e.g. rotated) which limits its applicability in AR and VR scenarios. For these, the sound field must be captured using an ambisonics microphone. In this case the recording is decomposed into spherical harmonics and can be further manipulated in space. Sennheiser offers AMBEO VR Mic for this, which costs $1300, but the plugins for initial audio processing are free.