Monday, July 10, 2023

Headphone Stereo Setup Improved, Part II

In Part I of this series of posts, I presented the architecture of my headphone audio processing chain, along with an overview of the research it is based upon. In Part II (this post), I'm presenting the test tracks that I use in the process of adjusting the parameters, and the framework and tools for understanding them. The description of the adjustment process thus slips to the upcoming Part III.

A Simplified Binaural Model of Hearing

Before we dive into the tracks, I would like to explain my understanding of the binaural hearing mechanism by presenting a simple model that I keep in my mind. Binaural hearing is a very complex subject, and I'm not even trying to get to the bottom of it. I have compiled together information from the following sources:

Note that the models presented in these sources differ from one another, and as usually happens in the world of scientific research, there can be strong disagreements between authors on some points. Nevertheless, there are a number of aspects on which most of them agree, and here is what I could distill:

From my understanding, after performing auto-correlation and per-band stabilization of auditory images for the signals in each ear, the brain attempts to match the information received from the left and the right ear in order to extract correlated information. Discovered inter-aural discrepancies in time and level allow the auditory system to estimate the position of the source, using learned HRTF data sets. Note that even for the same person there can be multiple sets of HRTFs. There is an understanding that there exist "near-field" and "far-field" HRTFs which can help in determining the distance to the source (see this AES paper for an example).

For any sound source for which the inter-aural correlation is not positive, there are two options:

  • If the sound has an envelope (that is, a period of onset and then a decay), its position will likely be "reset" to "inside the head." This applies both to uncorrelated and anti-correlated sounds. I'm not sure about the reason for the "resetting" of anti-correlated signals; for uncorrelated signals, however, it is pretty obvious, as no remote external sound source can produce unilateral audio images. So the brain decides that the source of the sound must be a bug near your ear, or maybe even inside it :)

  • If the sound lacks an envelope (a continuous noise or buzz, for example), it can remain "outside the head," however its position will not be determined. In the real world, I did encounter such cases in airports and shops, when a "secured" door left open somewhere far away makes a continuous ringing or beeping, and the sound kind of "floats" around in the air, unless you get yourself close enough to the source of the signal so that the inter-aural level difference can help in localizing it.

An important takeaway from this is that there are many parameters in the binaural signal that must be "right" in order for the hearing system to perceive it as "natural."

The Goniometer Tool

For me, the best tool for exploring properties of the correlation between the channels of a stereo signal is the goniometer. In its simplest form, it's a two-dimensional display which shows the combined output from the left and the right channels, in time domain. Basically, it visualizes the mid-side representation which I was discussing in the previous post. Usually the display is graded in the following way:

Even this simplest implementation can already be useful for checking whether the signal is "leaning" towards the left or the right, or whether there is too much uncorrelated signal. Below are renderings of stereo pink noise "steered" into various spatial directions. I have created these pictures based on views provided by the JS: Goniometer plugin bundled with the Reaper DAW:

The upper row is easy to understand. The interesting thing, though, is that while purely correlated or purely anti-correlated noise produces a nice line (because the samples in both channels always carry either exactly the same or strictly opposite values), a mix of correlated and anti-correlated noise sort of "blows up" and turns into a fluffy cloud. Also, when panning purely correlated or anti-correlated noise, it just rotates around the center, whereas panning the mix of correlated and anti-correlated noise looks like we are "squeezing" the cloud until it becomes really thin. Finally, starting with a correlated signal, adding a small delay to one channel destroys the correlation of higher frequencies, and what used to be a thin line becomes a cloud squeezed from the sides.
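To make the connection between channel correlation and the goniometer picture concrete, here is a minimal Python sketch. It uses synthetic white noise instead of pink (which doesn't change the correlation behavior) and computes both the zero-lag correlation and the goniometer coordinates for the cases described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c = rng.standard_normal(n)   # correlated component (same in L and R)
a = rng.standard_normal(n)   # anti-correlated component (sign-flipped in R)

def goniometer_points(left, right):
    """Rotate L/R by 45 degrees into side (x) and mid (y) coordinates,
    which is what a goniometer display plots."""
    x = (left - right) / np.sqrt(2)   # side
    y = (left + right) / np.sqrt(2)   # mid
    return x, y

def iccc(left, right):
    """Zero-lag inter-channel correlation coefficient."""
    return np.corrcoef(left, right)[0, 1]

# Purely correlated noise: all points fall on the vertical (mid) line.
x, y = goniometer_points(c, c)
print(round(iccc(c, c), 3), round(float(np.max(np.abs(x))), 3))   # 1.0 0.0

# Purely anti-correlated noise: all points fall on the horizontal (side) line.
x, y = goniometer_points(a, -a)
print(round(iccc(a, -a), 3), round(float(np.max(np.abs(y))), 3)) # -1.0 0.0

# Equal mix of both: ICCC near zero, the points fill a 2-D "cloud."
left, right = c + a, c - a
print(round(iccc(left, right), 2))   # close to 0
```

The "cloud" in the goniometer view is exactly the third case: neither the mid nor the side coordinate collapses to zero, so the points spread over the whole plane.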

To see the latter effect in more detail, we can use a more sophisticated goniometer implementation which, in addition to the time domain, also shows the decomposition in the frequency domain. For example, I use the free GonioMeter plugin by ToneBoosters. Below is the view of the same signal as in the bottom right corner of the previous picture:

The time-domain goniometer display is at the center (the same squeezed cloud), and to the left and right of it we can see a frequency-domain view of correlation and panning. This is the tool which I used to gain insight into the techniques used for stereo imaging of my test tracks.

Test Tracks

Now, finally, let's get to the tracks and how I use them. Overall, these tracks serve the same purpose as test images for adjusting visual focus in optical equipment. The important thing about some of them is that I know which particular effect the author or producer wanted to achieve, because it's explained either in the track itself, in the liner notes, or by the producer in some interview. With regular musical tracks we often don't know whether what we hear is the "artistic intent" or merely an artefact of our reproduction system. Modern producer-to-consumer technology chains like Dolby Atmos are intended to reduce this uncertainty; however, for traditional stereo records there are lots of assumptions that may or may not hold for the reproduction system being used, especially for headphones.

Left-Right Imaging Test

This is Track 10 "Introduction and Left-Right Imaging Test" from "Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 1". This track is interesting because apart from conventional "between the speakers" positions, it also contains "extreme left" ("off-stage left") and "extreme right" positions that span beyond speakers. This effect is achieved by adding anti-correlated signal to the opposite channel. Let's use the aforementioned GonioMeter plugin for that. This is the "center" position:

Midway between center and right:

Fully to the right, we can see that the inter-channel correlation across the frequency range is dropping to near zero or lower:

Off-stage right, the channels have entered the anti-correlated state; note that the panning indicator at the top of the time-domain view does not "understand" the psychoacoustic effect of this:

And for comparison, here is off-stage left—similarly anti-correlated channels, however the energy is now on the left side:

Considering the "extreme" / "off-stage" positions, we can see that although the stereo signal is panned to the corresponding side, the opposite channel is populated with an anti-correlated signal. Needless to say, the "off-stage" positions do not work with headphones unless some stereo-to-binaural processing is applied. The brain is unable to match the signals received from the left and the right ear, and "resets" the source position to "inside the head." Binaural processing adds the necessary leakage, thus allowing the brain to find similarities between the signals from the left and the right ears and derive the position.
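As a rough illustration of the technique described above, here is how such an "off-stage" position could be synthesized. The -0.5 gain for the inverted copy is my own arbitrary choice for the sketch, not a value taken from the Chesky recording:

```python
import numpy as np

rng = np.random.default_rng(1)
sig = rng.standard_normal(48_000)   # stand-in for a mono source

def iccc(l, r):
    """Zero-lag inter-channel correlation coefficient."""
    return np.corrcoef(l, r)[0, 1]

# Conventional hard-right pan: the signal exists in the right channel only.
hard_right = (np.zeros_like(sig), sig)

# "Off-stage right": the opposite (left) channel carries an attenuated,
# polarity-inverted copy of the source, pushing the image beyond the speaker.
off_stage_right = (-0.5 * sig, sig)

print(round(iccc(*off_stage_right), 3))   # -1.0: fully anti-correlated channels
```

Any nonzero inverted copy makes the zero-lag correlation exactly -1.0; what the gain controls is the inter-channel level difference, and thus how far "off-stage" the image appears over speakers.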

Following the binaural model I presented at the beginning of the post, the "extreme" left and right positions from the "Left-Right Imaging Test" can't be matched to a source outside of the head unless we "leak" some of that signal into the opposite ear, to imitate what happens when listening over speakers. However, if the room where the speakers are set up is too "live," these "off-stage" positions actually end up collapsing to "inside the head"! Also, adding too much reverb may make these extreme positions sound too close to the "normal" left and right positions, or even push them between the positions of the virtual speakers.

That's why I consider this track to be an excellent tool not only for testing binaural rendering, but also for discovering and fixing room acoustics issues.

Natural Stereo Imaging

This is Track 28 "Natural Stereo Imaging" from "Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 3" (another excellent sampler and set of test recordings). The useful part of this track is a live recording of a tom-tom drum naturally panned around the listener. I have checked how the "behind the listener" image is produced, and found that it also uses highly decorrelated stereo. This is "in front of the listener" (center):

And this is "behind the listener":

We can see that level-wise they are the same; however, the "behind the listener" image has negative inter-channel correlation. Needless to say, correct reproduction of this recording over headphones requires crossfeed. But there is another thing to pay attention to. As the drum moves around the listener, in a natural setting I would expect the image to stay at the same height. In headphones, this requires both correct equalization of the frontal and diffuse components, and some level of added reverberation in order to enrich the diffuse component with high frequencies. If the tuning isn't natural, the auditory image of the drum may change its perceived height while moving to the sides and behind the head; for example, it might suddenly start appearing significantly lower than when it was in front of the head.

Get Your Filthy Hands Off My Desert

This is track 7 or 8, depending on the edition, of Pink Floyd's "The Final Cut" album. The track is called "Get Your Filthy Hands Off My Desert" and contains a spectacular effect of a missile launched behind the head and exploding above the listener. The perceived height of the explosion helps to judge the balance between "dark" and "bright" tuning of the headphones.

Another good feature of the track is its spaciousness. As I understand it, the producer used the famous Lexicon 224 reverberation unit (a brainchild of Dr. David Griesinger) to build the sense of being in the middle of a desert.

The Truths For Sale (the ending)

This is the final half minute of Track 4 from the "Judas Christ" album by the gothic metal band Tiamat. For some reason it's not a track on its own, but it really could be. I must say that I have been listening to this album since it was released in 2002, but not until I started digging into headphone tuning did this fragment really stand out for me. It was a pleasant shock when I realized how externalized and enveloping it can sound. Similar to Brian Eno's music (see below), it's very easy to believe that the droning sound of the track is really happening around you.

Being part of a metal album, this track contains a lot of bass. Perhaps too much. It's a good test to see whether particular headphones are too heavy on the bass side. In that case, their resonances seriously diminish the sense of externalization because, thanks to the sensation of vibration, your brain realizes that the source of the sound is on your head. That's why this track complements the previous one well when checking the balance between low and high frequencies.

Spanish Harlem

Track 12 from the album "The Raven" by Rebecca Pidgeon is an audiophile staple. It's the famous "Spanish Harlem" track, which presents an acoustically recorded ensemble of instruments and a female vocal. I use it for checking the "apparent source width" and the localization of the instruments when comparing different processing tunings.

The producer of this record, Bob Katz, recommends checking for bass resonances by listening to the loudness of individual bass notes at the beginning of the track. Although his advice addressed subwoofer tuning, it applies to headphones as well, as they can also have resonances. Luckily, bass unevenness is much less of a concern with headphones.

Ambient 1: Music For Airports

This is Track 1 from "Ambient 1: Music For Airports" by Brian Eno. It doesn't have a real title, just a mark that it's track 1 on side 1 of the original vinyl issue of the album. This is an ambient track with sound sources floating around and lots of reverb; another very good example of the power of the Lexicon 224 reverb unit.

For me, this track is special because, with a more or less natural headphone tuning, it allows me to get into a state of being transported into the world built by the sound of the album. My brain starts to perceive the recorded sounds as real ones, and I get a feeling that I don't have any headphones in or on my ears. I think this happens because the sounds are somewhat "abstract," which makes it easier for the brain to believe that they actually exist around me in the room. Also, the sources are moving around, and this helps the brain build up a "modified" HRTF for this particular case.

It's interesting that after "priming" the auditory system with this track, all other tracks listened to in the same session also sound very natural. I can easily distinguish between tracks with good natural spaciousness and tracks that resemble "audio cartoons," in the sense that they lack any coherent three-dimensional structure. I suppose this state is the highest level of "aural awareness," which usually requires a room with controlled reverb and a very "resolving" speaker system. I'm glad that I can achieve it with just headphones.


Immaterial
I could easily use the entire album "Mine" by Architect (a project of Daniel Myer, also known for the Haujobb project) for the purpose of testing source placement and envelopment. This electronic album is made with solid technical knowledge of sound and an understanding of good spectral balance, and is a pleasure to listen to. However, I don't actually listen to this track myself during the tuning process. Instead, I render track 5, "Immaterial," through the processing chain after completing the tuning, in order to catch any clipping that may occur due to the extra gain resulting from equalization. Below are the short-term and overall spectral views of the track:

We can see that the track has a frequency profile closer to white noise than to pink noise; thus it features a lot of high-frequency content, that is, a lot of "air." That means if I tilt the spectrum of the processing chain in favor of high frequencies, this track has a higher chance of encountering clipping. The sound material on this album also uses quite massive synthesized bass. That's why it's a good track for validating that the gain of the processing chain is right across the entire spectrum.

Synthetic and Specially Processed Signals

I could actually list many more tracks that I briefly use for checking this or that aspect of the tuning, but we have to stop at some point.

While "musical" recordings are useful for checking general aspects of the tuning, in order to peek into the details we can use specially crafted sounds that represent, for example, a specific frequency band. Traditionally, such sounds are obtained from synthesizers or noise generators; however, I've found that processed "real" sounds tend to provide more stable results when judging the virtual source position.

In my process, I use recordings of percussion instruments: tambourine, bongos, and the snare drum. By themselves, they tend to occupy a certain subset of the audio spectrum, as we can see on the frequency response graph below (the snare drum is the green line, bongos are the red line, tambourine is the blue line):

However, to make them even more "surgical," I process them with a linear-phase band-pass filter and extract the required band. This of course makes the resulting sound very different from the original instrument; however, it preserves the envelope of the signal, and thus the ability of the brain to identify it. I use the critical bands of the Bark scale, as it has strong roots in psychoacoustics.
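A sketch of this kind of band extraction, assuming a linear-phase FIR band-pass built with SciPy; the Bark band (no. 14, roughly 2000-2320 Hz), the filter length, and the synthetic "percussion hit" are my own illustrative choices:

```python
import numpy as np
from scipy.signal import firwin, fftconvolve

fs = 48_000
lo, hi = 2000.0, 2320.0   # approximate edges of Bark critical band no. 14

# A symmetric (linear-phase) FIR band-pass kernel: constant group delay
# means the signal's envelope shape is preserved inside the band.
taps = firwin(2047, [lo, hi], pass_zero=False, fs=fs)

# A stand-in for a percussion recording: a decaying broadband noise burst,
# i.e. a signal with a clear onset-then-decay envelope.
rng = np.random.default_rng(2)
t = np.arange(fs) / fs
hit = rng.standard_normal(fs) * np.exp(-8 * t)

# Extract just the chosen Bark band from the "instrument."
band = fftconvolve(hit, taps, mode="same")
```

After filtering, essentially all of the remaining energy sits inside the chosen band, while the onset/decay envelope of the original burst is kept, which is the property the brain needs to identify the source.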

I took these instrument recordings from an old test CD called "Sound Check," produced in 1993 by Alan Parsons and Stephen Court. The CD contains a lot of good uncompressed and minimally edited recordings, and for me it stands alongside the demo/test CDs from Chesky Records.

Consumer Spatializers

So, I'm going down this DIY path; however, these days there exist very affordable spatializers built into desktop and mobile OSes that can do binaural playback of stereo, and even employ head tracking, after "magically" guessing your HRTF from photographs of your head and ears. For sure, I did try these; however, consumer-grade spatializers do not perform well on all my test tracks. For example, the "off-stage" positions from the Left-Right Imaging Test were not rendered correctly by any spatializer I tried; instead, they collapsed to inside the head. The closest to my expectations was the Apple spatializer for AirPods Pro in the "head tracking" mode; however, even in this case more or less correct positioning was observed for the right "off-stage" position only.

Yet another problem with the consumer-grade spatializers I tried is that, for lower latency, they tend to use minimum-phase filters, which distort the phase and group delay while applying magnitude equalization. This essentially kills the perception of the performance space, which I preserve in my processing chain by always using linear-phase filters. Each time I tried to substitute an LP filter with an MP equivalent (in terms of magnitude response), the reproduction got blurred and degraded into an essentially two-dimensional representation.

If I had the budget for it, I would go with a "proper" binaural spatializer like the Smyth Realizer. But I don't, and for me, making my own spatializer is the only viable way to get the sound I want.


It's a really long road to a natural reproduction of stereo records in headphones, and we are slowly getting there. In the process of making anything well, good tools are of paramount importance. I hope that the description of the goniometer and its application to the analysis of the described test tracks, as well as their intended use, was helpful. A lot more material will be covered in subsequent posts.

Sunday, June 4, 2023

On Mid/Side Equalization

After finishing my last post on the headphone DSP chain, I intended to write the second part, which should provide examples of adjusting the parameters of the chain effects for particular models of headphones. However, while writing it I encountered some curious behavior of the mid/side equalization module, and decided to figure out what's going on there and write about it.

Let's recall the last part of the DSP chain that I proposed previously. Note that I've changed the order in which the effects are applied; I will explain the reason at the end of the post:

The highlighted part is the pair of filters which apply diffuse-field-to-free-field (df-to-ff) or vice versa (ff-to-df) correction EQ curves to the mid and side components separately. To remind you, these are intended to help the brain disambiguate between "in front of the head" and "behind the head" audio source positions, with the goal of improving externalization. As I've found, well-made headphones likely need just one of the corrections applied. For example, if the headphones are tuned closer to the "diffuse field" target, then they should already reproduce "behind the head" and "around you" sources realistically; however, frontal sources could be localized "inside the head." For such headphones, applying the df-to-ff compensation to the "mid" component helps to "pull" frontal sources out of the head and put them in the front. Conversely, for headphones tuned with a preference for the "free field," it's beneficial to apply the ff-to-df correction to the "side" component of the M/S representation in order to make surrounding and "behind the head" sources be placed correctly in the auditory image.

Now, a discovery which surprised me was that the application of mid/side equalization affects the reproduction of unilateral (existing in one channel only) signals. A test signal sent to the left channel exclusively was creating a signal in the right channel as a result of passing through the mid/side equalizer. And that's with all cross-feeding turned off, of course. This caught me by surprise because I knew that converting between the stereo and mid/side representations should be lossless, which also assumes that no signals appear out of nowhere. So, what's going on here?

The Sinusoids Algebra

What I have realized is that all this behavior appears surprising at first only because the addition and subtraction of audio signals is in fact not very intuitive. In order to get a good grasp of it, I went through Chapter 4 of Bob McCarthy's book "Sound Systems: Design and Optimization". It provides a very extensive and insightful coverage with just a minimal amount of math. I think it's worth stating here some facts from it about summing two sinusoidal signals of the same frequency:

  1. When adding signals of different amplitudes, it's the amplitude of the louder signal and the difference between the amplitudes that matter the most. The phase of the weaker signal is of lesser significance. There is a formula expressing the maximum resulting level: Sum = 20*log10((A + B) / A), where A is the amplitude of the louder signal. Graphically, the resulting levels for in-phase signal summation look like this:

  2. Only when adding or subtracting signals that have similar amplitudes does their relative phase (we can also say the "phase shift") start to matter.

  3. There is no linear symmetry between the case when the two added signals of the same amplitude are in phase, and the case when they are completely out of phase. In the first case the amplitude doubles, whereas in the second case the signals fully cancel each other out. People who have ever tried building their own loudspeakers are well aware of this fact. This is the graphical representation of the resulting amplitude depending on the phase shift:
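The three facts above follow from the standard formula for the amplitude of a sum of two sinusoids of the same frequency: sqrt(A^2 + B^2 + 2*A*B*cos(phi)). A small Python check of the edge cases:

```python
import numpy as np

def sum_level_db(a, b, phase):
    """Level (dB relative to the louder signal of amplitude a) of the sum of
    two same-frequency sinusoids with amplitudes a >= b and phase offset `phase`."""
    amp = np.sqrt(a**2 + b**2 + 2*a*b*np.cos(phase))
    return 20*np.log10(amp / a) if amp > 0 else -np.inf

# Equal amplitudes, in phase: the amplitude doubles (+6 dB).
print(round(sum_level_db(1.0, 1.0, 0.0), 2))     # 6.02

# Equal amplitudes, inverted phase: full cancellation.
print(sum_level_db(1.0, 1.0, np.pi))             # -inf

# A much weaker signal (-20 dB): its phase barely matters.
print(round(sum_level_db(1.0, 0.1, 0.0), 2))     # 0.83
print(round(sum_level_db(1.0, 0.1, np.pi), 2))   # -0.92
```

The last two lines show fact 1 numerically: flipping the weak signal's phase swings the result by well under 2 dB, while for equal amplitudes the swing is from +6 dB to complete silence.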

Another thing worth understanding is how the inter-channel correlation coefficient (ICCC) depends on the relationship between the signals. This is the "correlation" gauge which we observe in plugins dealing with M/S manipulations. What plugins typically show is the correlation at "zero lag," that is, when there is no extra time shift between the signals (in addition to the shift they already have).

As a side note, the calculation of cross-correlation is often carried out in order to find how much one signal needs to be shifted in time against another in order to achieve the maximum match. That's why the "lag" is considered. By the way, here is a nice interactive visualization of this process by Jack Schaedler. However, for the purpose of calculating the ICCC we only consider the case of non-shifted signals, thus the lag is 0.

In the case of zero lag, the correlation can be calculated simply as a dot product of the two signals expressed as complex exponentials: A(t)*B̅(t), where B̅ denotes the complex conjugate of B. Since we deal with the same signal and its version shifted in phase, the frequency terms mutually cancel, and what we are left with is just the cosine of the phase shift. That should be intuitively clear: for signals in phase, that is, with no phase shift, the ICCC is cos(0)=1.0; for signals in quadrature (phase shift π/2) the correlation is cos(π/2)=0; and finally, when the first signal is phase-inverted compared to the second one, the ICCC is cos(π)=-1.0.

By the way, since we deal with a "normalized" correlation, that is, one having a value between -1.0 and 1.0, the ICCC does not depend on the relative amplitudes of the signals. Thus, for example, in-phase signals of the same amplitude have the same ICCC as in-phase signals with a relative level of -60 dB. Strictly speaking, when there is no signal of the matching frequency in the other channel, the correlation is not defined; however, for simplicity plugins show an ICCC of 0 in this case.
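These properties of the zero-lag ICCC are easy to verify numerically. The sketch below uses a 1 kHz sinusoid and the 140-degree phase shift that appears in an example later in this post:

```python
import numpy as np

fs, freq = 48_000, 1000.0
t = np.arange(fs) / fs
phi = np.deg2rad(140)

a = np.sin(2*np.pi*freq*t)          # left channel
b = np.sin(2*np.pi*freq*t + phi)    # right channel, phase-shifted by 140 deg

# Zero-lag correlation of two same-frequency sinusoids equals cos(phase shift).
print(round(np.corrcoef(a, b)[0, 1], 3), round(float(np.cos(phi)), 3))  # -0.766 -0.766

# Scaling one channel (here by -60 dB) does not change the normalized correlation.
print(round(np.corrcoef(a, 0.001*b)[0, 1], 3))                          # -0.766
```

This matches the ICCC=-0.766 figure quoted for the 140-degree example below, and demonstrates the amplitude independence stated above.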

ICCC and Mid/Side Signal Placement

From the previous fact, it can be seen that the ICCC does not fully "predict" how a pair of signals in the left and the right channel will end up being placed in the mid/side representation. That's because the ICCC only reflects their phase shift, while the result of summation also depends on their relative levels. For a better understanding of the relationship between the stereo and M/S representations we need a two-dimensional factor, and the visual representation of this factor is what the tool called a "goniometer" shows. I will use it when talking about my test signals and tracks in the next post.

To round up what we have just understood, let's consider the process of making the M/S representation of an input stereo signal. If we consider each frequency separately, then we can apply the facts stated above to each pair of individual sinusoids from the left and the right channel. This adds more details to a somewhat simplistic description I provided in the previous post.

If the sinusoid in one of the channels is much higher in amplitude than in the other channel, then both summation and subtraction will produce a signal which is very similar to the stronger source signal, and the weaker signal will only make a small contribution to the result, regardless of its relative phase.

That means a strong unilateral signal will end up in both the "mid" and the "side" components, minimally affected by the signal of the same frequency from the opposite stereo channel. Note that if we normalize the resulting amplitudes of the "mid" and "side" signals by dividing them by 2, we will actually see a signal of a lower amplitude there. Here is an illustration: an example stereo signal is on top, with the level of the right channel 12 dB lower. The resulting "Mid Only" and "Side Only" versions are below it:

In the case when there is no signal of the given frequency in the opposite channel, exactly the same signal will land in both M/S components, with the amplitude divided by 2. This is the picture from the previous post showing that for the two sinusoids in the middle of the top picture:

If both channels of the input stereo signal contain a signal of a particular frequency with close enough amplitudes, then the outcome depends on the relative phase between these signals. As we know, in the "extreme" cases of fully correlated or fully anti-correlated signals, only the mid or the side component will end up carrying this frequency (this was also shown in a picture in the previous post). For all the cases of the phase lying in between, the result gets spread between the mid and the side; below is an example for the case of a 140-degree phase offset (ICCC=-0.766), which results in a 12.6 dB reduction of the original signal level as a result of summation:

Note that the resulting sinusoids in the mid and the side channels have a phase shift from the signal in the left channel which is different both from what the signal in the right channel has, and from each other.

Since the process of decoding the stereo signal from the M/S representation is also done via addition and subtraction, the same sinusoid algebra applies to it as well.

What If We Change Just Mid or Side?

It's interesting that despite the fact that separate mid/side equalization is an old technique, used by both mixing and mastering engineers thanks to its benefits for the ear, its side effects on the signal are not described as widely. However, if you have read the previous section carefully, you now understand that making level and phase adjustments to the mid or the side component only will inevitably affect the outcome of "decoding" the M/S representation back into stereo.

For simplicity, let's focus on amplitude changes only; making changes to both amplitude and phase will cause even more complex effects when the signals get summed or subtracted. That means we apply a "linear phase" equalizer. We can use an equalizer which provides mid/side equalization directly, for example the "LP10" plugin by DDMF, or "thEQorange" by MAAT digital. However, in fact we can use any linear-phase equalizer which provides two independently controlled channels, because we can wrap it between two instances of the MSED plugin: the first one "encodes" stereo into the M/S representation, and the second one produces the stereo version back from the modified signal, as shown below:

(Even though MSED is completely free, if you want an alternative for some reason, there is also the "Midside Matrix" plugin by Goodhertz, also free.)

Since no equalizer can affect just a single frequency, instead of looking at sinusoids in the time domain we will switch to the frequency domain. My approach to testing here is to use the same log sweep in both channels, and modify either the amplitude or the relative phase of the second channel, as we did before. Then I capture what comes out in the left and the right channel after an EQ is applied separately to the mid or the side component.

I start with the case which had initially drawn my attention: a unilateral stereo signal (in the left channel only) to which we apply some equalization of the mid component. Let's see what the left and right channels contain after we apply a simple +10 dB, Q 5 gain with a 920 Hz center frequency to the mid component only:

As you can see, after this equalization a signal has indeed appeared in the right channel! Another interesting observation is that the level of the gain for the unilateral signal is actually less than +10 dB. That's because the gain that we applied to the mid component was combined with the unmodified (flat) signal from the side component. Only in the case when there is no side component at all (identical signals in the left and the right stereo channels) will the equalization of the mid component look like a regular stereo equalization. Certainly, it is good to be aware of that!
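The level arithmetic behind this observation can be reproduced without any plugin. The sketch below applies the +10 dB gain broadband to the mid component instead of the 920 Hz bell filter, which is enough to show both effects; the exact numbers apply only to this simplified broadband case:

```python
import numpy as np

def ms_encode(l, r):
    return 0.5*(l + r), 0.5*(l - r)

def ms_decode(m, s):
    return m + s, m - s

rng = np.random.default_rng(4)
left = rng.standard_normal(1000)
right = np.zeros(1000)                 # unilateral input: left channel only

mid, side = ms_encode(left, right)
gain = 10**(10/20)                     # +10 dB, applied broadband here
mid = mid * gain                       # as a stand-in for the 920 Hz bell

l_out, r_out = ms_decode(mid, side)

# The left channel gains less than +10 dB, because the boosted mid
# gets summed with the unmodified side component...
print(round(20*np.log10(np.std(l_out)/np.std(left)), 2))  # 6.37

# ...and a signal has appeared in the previously silent right channel.
print(round(20*np.log10(np.std(r_out)/np.std(left)), 2))  # 0.68
```

Working through the algebra: the left output is 0.5*(gain+1) times the input, and the induced right channel is 0.5*(gain-1) times the input, so for a broadband +10 dB mid boost the "leaked" signal is actually slightly louder than the original unilateral signal.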

By the way, I tried both LP10 and thEQorange, and their behavior is the same. Considering that LP10 costs just about $40, and thEQorange almost 15 times more, it's good to know that you can get away with the cheaper option unless you strongly prefer the UI of thEQorange.

Now, I was genuinely interested in seeing what my ff-to-df and df-to-ff mid/side equalization does to unilateral signals. Here are some examples comparing the effect on a fully correlated signal (shades of green) with the signal induced in the opposite channel for a unilateral input:

We can see that in some cases the levels of the signals induced in the opposite channel are significant and can be only 15 dB below the "main" signal. However, we need to recall that the FF/DF compensation comes after the cross-feed unit. That means we never really have unilateral stereo signals. To check what actually happens, I put the "direct" path processing in front of the FF/DF unit and used the same initially unilateral test signals. This is what I've got:

These curves definitely look less frightening. Thanks to crossfeed, any unilateral signal penetrates into the opposite channel.


What have we learned from this lengthy exploration? First, it soothed my worries about the side effects of mid/side equalization. Since I only use it with signals that are much more correlated than the edge case of a unilateral stereo signal, the side effects are not as significant, while the win from the FF/DF compensation is audibly perceivable.

Second, looking closer at what happens during the M/S equalization helped me to reveal and fix two issues with my initial chain topology:

  1. I reordered the units in the output chain, putting the FF/DF unit before the L/R band alignment unit. That's because I have realized that individual equalization of the left and the right channels inevitably affects the contents of the mid/side representation. For example, a signal which initially was identical between the left and the right channels will obviously lose this property after going through an equalizer which applies different curves to the left and the right channels.

  2. Since for the FF/DF correction I actually use the MConvolutionEZ plugin with a linear phase filter, I noticed that the approach of applying the convolution to the mid and side components recommended in the manual does not work well for my case. What MeldaProduction recommends is to chain two instances of MConvolutionEZ: one in "Mid" mode and one in "Side" mode. This in fact creates a comb filter, because the mid and side components are now processed with a one-sample delay between them and then summed (I did confirm that). So instead of doing that, I wrapped MConvolutionEZ between two instances of MSED (as I've shown above) and just use it in the regular "stereo" mode. This ensures that both mid and side are processed with no time difference.
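The comb-filtering mechanism in the second point can be illustrated with a toy numpy model (not the plugins themselves): if the side component ends up delayed by one sample relative to the mid before the two are decoded back to left/right, any signal with a side component acquires a frequency-dependent response. A single-sample mismatch puts the notch at the Nyquist frequency; a larger latency mismatch produces multiple notches across the band.

```python
import numpy as np

x = np.zeros(256)
x[0] = 1.0                   # impulse in the left channel only

M = x / 2                    # mid of a unilateral (left-only) signal
S = x / 2                    # side of the same signal

# The side path is delayed by one sample relative to the mid path
# before both are summed back into the "left" output.
S_delayed = np.roll(S, 1)
left = M + S_delayed

H = np.abs(np.fft.rfft(left))
print(H[0], H[-1])           # ~1.0 at DC, ~0.0 at Nyquist: a notch appeared
```

With zero delay, `H` would be flat at 1.0, which is why matching the mid and side processing latency exactly matters.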

I also considered whether it's possible to create a Mid/Side equalization which avoids processing sufficiently uncorrelated signals, in order to avoid the side effects described above. A search for "correlation-dependent band gain change" led me to a bunch of microphone beamforming techniques. Indeed, in beamforming we want to boost the frequency bands that contain correlated signals and diminish uncorrelated signals (noise). However, thinking about this a bit more, I realized that such processing becomes dependent on the signal, and thus isn't linear anymore. As we saw previously in my analysis of the approaches to automatic gain control, such signal-dependent processing can add significant levels of non-linear distortion. That's probably why even quite expensive mastering equalizers don't try to fight the side effects of mid/side equalization.

Tuesday, April 25, 2023

Headphone Stereo Setup Improved, Part I

Quite a long time has passed since I started experimenting with a DSP setup for headphone playback which achieves a more realistic reproduction of regular stereo records originally intended for speakers. This is similar to what "stereo spatialization" does. Since then, I have been experimenting with various settings for my DIY spatializer, aiming to make it more immersive and natural, and have learned new things along the way.

In this post, I would like to present an updated version of the processing chain along with the discussion of the underlying approach. Since there is a lot of material to cover, I decided to split the post into two parts. In the first part, I talk about relevant research and outline the processing pipeline. In the second part, I will tell you about the process of individual tuning of the setup.

New Considerations

My initial understanding was that I needed to model a setup of stereo speakers in a room. However, after reading more works by S. Linkwitz about stereo recording and reproduction: "Stereo Recording & Rendering—101", and "Recording and Reproduction over Two Loudspeakers as Heard Live" (co-authored with D. Barringer): part 1, part 2, I realized that a good stereo recording captures enough spatial information of a real or an artificially engineered venue. Although such a recording was mixed and mastered on speakers, and thus was intended for speaker reproduction, speaker playback is not the only way to reproduce it correctly. In fact, reproduction over stereo speakers has its own well-known limitations and flaws. Moreover, if the speakers and the room are set up in a way which works around and minimizes the effect of these flaws, the speakers "disappear" and we hear the recorded venue, not the speakers and the room. Thus, I realized, if I take my initial intention to the limit and strive to model this ideal speaker setup on headphones, then I just need to work towards reproducing the recorded venue itself on headphones, since an ideal speakers-room setup is "transparent" and only serves as a medium for the reproduction.

Clean Center Image

So, what is the fundamental flaw of speaker reproduction? As many audio engineers point out, there are various comb filtering patterns which occur as a result of the summation of fully or partially correlated delayed outputs from the left and right speakers. The delay occurs because the signal from the opposite speaker arrives at the ear a bit later than the signal from the "direct" speaker. There is a very detailed paper by Timothy Bock and D. B. Keele "The Effects Of Interaural Crosstalk On Stereo Reproduction And Minimizing Interaural Crosstalk In Nearfield Monitoring By The Use Of A Physical Barrier" (in 2 parts: part 1, part 2), published in 1986. Their modeling and measurements demonstrate that comb filtering increases with correlation, thus the center image, which is formed by fully correlated outputs, is the most affected one. Floyd Toole also expresses his concerns about the change of the timbre of the center image caused by comb filtering in his seminal book on sound reproduction, see Section 7.1.1.

The solution for crosstalk reduction used by Bock & Keele employed a physical barrier between the stereo speakers—remember, it was 1986 and high quality audio DSP was not nearly as affordable as it is these days. In fact, their solution was sort of a prototype for the family of DSP technologies which is now known as Ambiophonics. Floyd Toole, on the other hand, advocates for multi-speaker setups—the more speakers, the better—so that each source ideally gets its own speaker. This is where the mainstream "immersive audio" technology is heading.

With headphones, interaural crosstalk isn't a problem by design, especially for closed-back over-ears and IEMs, and the phantom center image is reconstructed flawlessly by our brain from the correlated signals of the left and right earphones. However, it is more difficult for the brain to match binaural signals that lack a sufficient degree of correlation. We need to help it by making such signals more coherent. Although this can also create some comb filtering, it's well under our control.

Mid/Side Processing

My takeaway from these considerations and experiments is that the center channel should be left intact as much as possible. What is the "center channel" in a stereo recording?—It's the sum of the left and right channels. In the audio engineering world, this is known as the "Mid" component of the "Mid/Side" representation. Note that "Mid" is actually more than just the center. If we consider what happens when we add left and right channels together (L+R), we can observe the following results:

  • fully correlated images—those identical in both channels—sum coherently and become twice as loud (+6 dB);
  • uncorrelated images—those that exist in the left or the right channel only—still remain, but they are "softer" than the center image;
  • reverse correlated (or anti-correlated) images—those that exist both in the left and the right channel but with reversed phase—disappear.

The "Side" channel which is created by subtracting one channel from another (L-R) produces a complementing signal and contains anti-correlated and uncorrelated images, and fully anti-correlated images dominate.

Note that the M/S representation is a "lossless" alternative to the traditional L/R representation. The elegance of this pair of representations is that the same way as we get M/S from L/R by summing and subtracting the channels, we get the L/R back from the M/S using the same operations:

  • M + S = (L + R) + (L - R) = 2L;
  • M - S = (L + R) - (L - R) = 2R.

Thus, assuming that the processing is carefully designed to avoid clipping the signal due to doubling of the amplitude, we can convert back and forth between stereo and Mid/Side as many times as we need.
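This round trip is easy to verify numerically; a minimal numpy check:

```python
import numpy as np

rng = np.random.default_rng(1)
L = rng.standard_normal(1024)   # arbitrary left channel
R = rng.standard_normal(1024)   # arbitrary right channel

# Encode: Mid = L + R, Side = L - R, as in the text.
M, S = L + R, L - R

# Decode with the same two operations; the factor of 2 is what
# must be compensated to avoid clipping.
L_back, R_back = (M + S) / 2, (M - S) / 2

print(np.allclose(L, L_back), np.allclose(R, R_back))
```

Up to floating-point rounding, the original channels come back exactly, which is what makes M/S a "lossless" alternative representation.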

Thanks to their simplicity and usefulness, M/S encoding and decoding are built-in tools of every DAW. However, to simplify my processing chain, I prefer to use a dedicated plugin by Voxengo called MSED. The benefit of using MSED is that it can work "inline": it encodes stereo as M/S, processes it, and then converts back to stereo. The classical trick for making the stereo image wider in headphones is to increase the level of the side component relative to the mid, see this paper for an example. We can also suppress the mid or side component entirely. This is how the stereo output looks in this case:

We can see that the "Mid Only" output is essentially the mid component duplicated to both the left and the right channels, so the two channels become fully correlated; effectively, this is what a "mono" signal is. The "Side Only" output, in contrast, is still a "stereo" signal in which the left and right channels are reverse correlated.

By looking at the waveforms above, we can confirm that we get the original signal back by summing "Mid Only" and "Side Only" tracks together. Thus, it is possible to apply different processing to these parts and be sure that we preserve all the aural information from the source recording.

Stereo Reverb Of The Real Room

Even during my initial experiments, I understood that a reverb must be used to increase envelopment and spaciousness. What I didn't fully understand back then was that the less correlated the reverb impulses of the left and right channels are, the better the reverb works for listener envelopment. This idea was explored by Schroeder in his works on reverb synthesis by DSP (see Section 3.4.4 in the "Applications of Digital Signal Processing to Audio and Acoustics" book). Correlated reverbs, in contrast, effectively create strong reflections, as if there were lots of hard surfaces around, and this sounds more like the ratcheting echo that we encounter in tunnels.

If you recall my older post, initially I was using a synthesized reverb produced by the NX TrueVerb plugin. Later I switched to a reverb that I extracted from the Fraunhofer MPEG-H authoring plugin; it is used by the plugin for rendering objects in the binaural mode (for headphones). This reverb sounds more natural and was seemingly recorded in some real room, because after looking at its spectrum I could see signs of room modes. The impulses of its left and right channels were decorrelated enough: the overall Inter-Channel Cross Correlation (ICCC), as reported by Acourate, is less than 12%. However, while listening to the reverb alone, I could still hear a slightly ratcheting echo—why is that?
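Acourate's exact ICCC computation isn't documented here, but a common proxy for it is the peak of the normalized cross-correlation between the two IR channels. A rough numpy sketch (the function name and lag window are my own choices):

```python
import numpy as np

def icc(left, right, max_lag=2000):
    """Peak of the normalized cross-correlation between two IR channels.

    A simple stand-in for an ICCC figure; identical channels give 1.0,
    independent channels give a value near 0.
    """
    n = len(left)
    denom = np.sqrt(np.sum(left**2) * np.sum(right**2))
    cc = [np.sum(left[max(0, -k):n - max(0, k)] *
                 right[max(0, k):n - max(0, -k)])
          for k in range(-max_lag, max_lag + 1)]
    return np.max(np.abs(cc)) / denom

# Two independent noise bursts stand in for a well-decorrelated reverb.
rng = np.random.default_rng(0)
a = rng.standard_normal(48000)
b = rng.standard_normal(48000)
print(icc(a, b, 100))   # well below 0.1 for independent noise
print(icc(a, a, 100))   # exactly 1.0 for identical channels
```

Note that a low overall ICCC between channels does not rule out strong reflections *within* each channel, which is what the per-channel autocorrelation check below reveals.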

I checked the autocorrelation of each channel in Audacity and found lots of strong reflections in them:

These reflections create comb filtering patterns that sound like the ratcheting effect. So I decided to try another reverb impulse—this time the actual reverb of my room, as measured using my stereo speakers. I had obtained these left and right channel impulses as a byproduct of tuning a desktop version of the LX Mini speakers with Acourate—another project to write about some time later. This reverb impulse response turned out to be much better, even though its ICCC figure was about 1% higher compared to the MPEG-H reverb. Take a look at the autocorrelations of the channels:

So, in fact, it was my mistake that I was shying away from using my room's actual reverb, considering it "non-ideal." And thanks to the fact that headphones create a more controlled environment, I could adjust the direct-to-reverb ratio to be anything I want. As a result, I created a reverb environment whose reverb level is even lower than the EBU requirements for a studio (EBU Tech 3276), as follows from the analysis displayed by Acourate for a room of the same dimensions as mine:

Note that the level of reverb depends on the headphones used, and this particular graph is for the setting for open-back headphones (Shure SRH1840).

This is an improvement over my initial processing setup which was only compliant with more "relaxed" recommendations for reverb times for a "music presentations" room (DIN 18041, see the picture in the older post here).

The important thing about preparing the impulse response of the reverb is to cut out the first strong impulse of the signal, leaving only late reflections and the reverb "tail." In the processing chain, the "direct" signal comes from another track. By separating the direct signal from the reverb, it becomes much easier to adjust the ratio between their levels, and this becomes important once we try using different types of headphones, more about this in the upcoming part II.
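This preparation step can be sketched as follows, assuming a simple peak-based split (the gap and fade lengths are made-up parameters; in practice the split point is chosen by inspecting the measured IR):

```python
import numpy as np

def reverb_tail(ir, fs, gap_ms=5.0, fade_ms=1.0):
    """Zero out the direct sound of an IR, keeping only the reverb tail.

    gap_ms: how much after the main peak to discard (a hypothetical
    default; the right split point depends on the measured room).
    """
    peak = int(np.argmax(np.abs(ir)))
    start = peak + int(gap_ms / 1000 * fs)
    tail = ir.copy()
    tail[:start] = 0.0
    # Short fade-in to avoid a click at the cut point.
    fade = int(fade_ms / 1000 * fs)
    tail[start:start + fade] *= np.linspace(0.0, 1.0, fade)
    return tail

fs = 48000
ir = np.zeros(fs)
ir[100] = 1.0               # direct sound peak
ir[5000:] = 0.001           # a crude stand-in for late reflections
out = reverb_tail(ir, fs)
print(out[100], out[10000])  # direct sound removed, tail preserved
```

The trimmed tail is then loaded into the convolver on the reverb tracks, while the untouched "direct" signal flows through its own track.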

The Effect Of Room Divergence On Externalization

Yet another finding has solidified my conclusion about the need of using a real room reverb. The paper "Creating Auditory Illusions with Binaural Technology" by Karlheinz Brandenburg et al., published in the collection of papers "The Technology of Binaural Understanding" edited by Jens Blauert and Jonas Braasch, describes an interesting experiment that explores the connection between "inside the head" localization (ITL) and the room reverberation impulses used for binaural simulation. The study confirms that use of a reverb impulse which matches the room provides better externalization, while a divergence between visually predicted and aurally experienced reverb conditions causes confusion. This is commonly referred to as the "room divergence effect." Since it's a psychoacoustic effect, the exact outcome is somewhat complicated and depends on many parameters.

My layman understanding is that the divergence effect is especially pronounced when using open-back headphones, since they don't provide any isolation from external sounds. Thus, unless the room where you are listening to the headphones is completely isolated from the world outside, you still hear the sounds from the "real" world, processed with the real room acoustics. This forms an "expectation" in the auditory system of how external sounds should sound. If the reverb used for headphone processing does not match this expectation, the brain gets confused, and it's more likely that the auditory image will collapse to ITL. Obviously, closed-backs and especially IEMs isolate better, so for them this effect might be less important. However, our eyes still see the room, and this can also create expectations about the reverb. Thus, using a real room reverb seems to improve the chances of experiencing externalization in headphones, compared to using an "abstract" modeled reverb.

Application Of The Room Reverb

Recalling my original intention to leave the center sources intact, applying the reverb might look like a contradictory requirement. However, with Mid/Side processing it's possible to have both: the idea is that we apply a stronger version of the room reverb to the Side output, and a softer (more attenuated) version to the Mid output.

Since the Side Only output from MSED already contains uncorrelated and reverse correlated signals, "fuzzing" it even more with an uncorrelated stereo reverb does not hurt. In fact, it only makes it better: more spacious and longer lasting, giving the hearing system a better opportunity to analyze the signal. To help the brain even more, we also apply cross-feed to the result. Since cross-feeding is essentially a more sophisticated version of summing the left and right channels, it has a similar effect: it amplifies correlated signals and suppresses reverse correlated ones. However, because in cross-feed the summing is weighted across the frequency spectrum, this effect is much weaker, so the application of cross-feed does not produce a fully correlated output, and this is what we want.
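The frequency-weighted summing can be illustrated with a toy cross-feed model. This is not RedLine Monitor's actual algorithm; the cutoff and gain are invented for the sketch. It shows that a correlated (in-phase) low-frequency tone is reinforced while an anti-correlated one is only partially cancelled, never fully.

```python
import numpy as np

def one_pole_lp(x, fc, fs):
    """One-pole low-pass that frequency-weights the cross-fed signal."""
    a = np.exp(-2 * np.pi * fc / fs)
    y = np.zeros_like(x)
    for i in range(1, len(x)):
        y[i] = (1 - a) * x[i] + a * y[i - 1]
    return y

def crossfeed(L, R, fs, fc=700.0, gain=0.5):
    """Naive cross-feed: add a low-passed, attenuated opposite channel.

    fc and gain are hypothetical values, not RedLine Monitor's settings.
    """
    return (L + gain * one_pole_lp(R, fc, fs),
            R + gain * one_pole_lp(L, fc, fs))

fs = 48000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 200 * t)      # low-frequency test tone

La, Ra = crossfeed(tone, -tone, fs)     # anti-correlated input: attenuated
Lc, Rc = crossfeed(tone, tone, fs)      # correlated input: reinforced
print(np.max(np.abs(La)) < np.max(np.abs(Lc)))
```

Because the cross-fed portion is filtered and attenuated, the anti-correlated content survives at a reduced level instead of disappearing, unlike in a plain L+R sum.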

When I listen to this "Side Only" reverb in headphones, the reproduction is fully externalized: it feels like I'm standing in front of the speakers and hearing them play. However, since I'm listening to the anti- and uncorrelated parts, the audio image is "smeared" and serves only the purpose of envelopment. For a better effect, the reverb used for the "Side Only" channel is massaged by a gentle Bessel low-pass filter with the corner frequency at 5 kHz. This simulates the natural shadowing of signals that come from the back.

Leaving the center channel completely devoid of reverberation makes it sound too "dry" and too close in headphones. That's why, in addition to the relatively strong room reverb applied to the Side Only output, I also apply a much weaker and more delayed room reverb to the "Mid Only" component of the input signal. The idea is that this delayed reverb should be unnoticeable to the "conscious" part of the hearing apparatus, and should only act as a spaciousness and distance hint for the lower layers of brain processing. Thus, this extra reverb mostly relies on the precedence effect, complementing the direct sound and reinforcing it, while still being perceived as a part of it (a.k.a. "fusion").

Listening to this "Mid Only" reverb in headphones, I hear a "focused" sound of the speakers in the room. That's because the signal is derived from a "mono" signal. However, the application of an uncorrelated stereo reverb "smears" it and adds some width. In order to find the desired delay and attenuation for the "Mid Only" reverb, I play a dry recording of some strong percussive instrument, like bongos, and increase the delay and reduce the level until I stop noticing the reverb. Yet, when I toggle the track with this extra reverb on and off, I can hear the change in the perceived distance to the bongos. If the delay is increased too much, it "breaks" the precedence effect and the reverb turns into an echo.

Diffuse And Free Field Equalization

A lot of discussions are devoted to the topic of recommended headphone tuning. There are two cases representing the poles of the discussion. Günther Theile in his highly cited work "Equalization of studio monitor headphones" argues that the diffuse field (DF) equalization is the only correct way to tune the headphones, since this way the headphones do not "favor" any particular direction and thus provide the most neutral representation of the reproduced sound. A similar point of view is expressed by the founder of Etymotic, Mead Killion in his old blog post.

On the other side, there is the idea that the headphones must be tuned to match the canonical 60-degree speaker setup, as measured in a free field (FF), or in an anechoic chamber. In practice, when listening to "raw" (non-binauralized) stereo in headphones, neither of these tunings works satisfactorily for the general audience, and headphone makers usually settle on some compromise which keeps listeners engaged, based on an "expert opinion" or studies. One well-known example is, of course, the Harman target curve. There is also an interesting systematic approach to blending the free and diffuse field curves based on the room acoustics, proposed in the paper with a rather long title "Free Plus Diffuse Sound Field Target Earphone Response Derived From Classical Room Acoustics Theory" by Christopher J. Struck. The main idea is to find the frequency at which the free field of the room turns into the diffuse field, and use that frequency as the "crossover" point between the FF and DF response curves.

Personally, I'm in the "diffuse field tuning" camp. This choice is rather logical if we aim for tonally neutral equipment. After all, we intend to apply any corrections in the digital domain and don't want to deal with undoing the "character" of the DAC, the amplifier, or the headphones that we use.

Returning to the paper by Brandenburg et al., another interesting finding it points out is that the source directions for which achieving externalization in headphones is the most difficult are the full frontal and the full backward ones (0 and 180 degrees in the median plane). The hypothesis is that this happens due to the well-known "front-back confusion" from the Duplex theory. I decided to help the brain resolve this confusion by giving correlated sounds an "FF-like" frequency shape, and their counterparts, anti-correlated sounds, a "DF-like" shape. In order to do that, I used the results of yet another paper, "Determination of noise immission from sound sources close to the ears" by D. Hammershøi and H. Møller. It provides averaged frequency shapes for FF and DF sound sources measured at various points of the ear: the blocked ear canal, the open ear canal, and the eardrum. Using the tabulated data from the paper, I could create "FF-to-DF" and "DF-to-FF" compensation curves. Below are the graphs of the "DF-to-FF" curves, marked with "BE", "OE", and "ED" for the measurement points listed above. The "FF-to-DF" curves can be obtained by inverting these graphs.
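The construction of the compensation curves boils down to subtracting one dB response from the other and negating the result to invert it. The numbers below are placeholders, not the tabulated data from the Hammershøi & Møller paper; substitute the real FF and DF responses for the chosen measurement point (BE, OE, or ED):

```python
import numpy as np

# Hypothetical placeholder values -- NOT the paper's tabulated data.
freqs = np.array([250, 500, 1000, 2000, 4000])   # Hz
ff_db = np.array([0.0, 1.0, 2.5, 9.0, 12.0])     # free-field response, dB
df_db = np.array([0.5, 1.5, 3.5, 7.0, 13.0])     # diffuse-field response, dB

# To make a DF-shaped signal sound FF-like, add the difference FF - DF:
df_to_ff = ff_db - df_db
# The opposite compensation is simply the inversion (negation in dB):
ff_to_df = -df_to_ff

print(dict(zip(freqs, df_to_ff)))
```

The resulting gain-per-frequency table is then loaded into the equalizer (or rendered into a linear phase FIR for a convolver).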

Since the paper uses averaged data, the curves are rather smooth except for the high frequency part starting at 6.3 kHz, which reflects the effect of the pinna filtering and the ear canal resonance. Thus, I decided to have two versions of each compensation curve: a complete one, and one which only goes up to 5 kHz. When applying the "complete DF-to-FF at the eardrum" curve to the "Mid Only" component, I could indeed make it sound more "frontal" (when using Shure SRH1840 headphones, at least), while applying the "low pass FF-to-DF at the eardrum" compensation to the "Side Only" component makes it more "enveloping."

The Effect of Adding Harmonic Distortion

Yet another surprising effect which I have discovered myself is how adding harmonic distortion affects the apparent source width (ASW). By adding the 2nd harmonic to the "Mid Only" reverb, I could make it sound more "focused," while adding the 3rd harmonic to the "Side Only" reverb makes it even wider. Just to reiterate: the harmonics are only added to the reverbs, not to the direct sound, thus the overall level of added harmonics is minimal.
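Reviver's actual algorithm isn't public, but the basic idea of adding a single low-order harmonic can be sketched with a polynomial waveshaper: squaring a signal produces mostly the 2nd harmonic (plus a DC offset), cubing it produces mostly the 3rd.

```python
import numpy as np

def add_harmonic(x, order, amount=0.05):
    """Crude polynomial waveshaper: x**2 adds mostly the 2nd harmonic,
    x**3 mostly the 3rd. A sketch, not Reviver's actual processing."""
    return x + amount * x ** order

fs = 48000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone, exactly 1000 cycles

y2 = add_harmonic(x, 2)            # adds energy at 2 kHz (and some DC)
y3 = add_harmonic(x, 3)            # adds energy at 3 kHz

spec2 = np.abs(np.fft.rfft(y2)) / len(y2)
spec3 = np.abs(np.fft.rfft(y3)) / len(y3)
print(spec2[2000], spec3[3000])    # new spectral lines at 2 kHz and 3 kHz
```

Since the signal is one second long, bin k of the spectrum corresponds to k Hz, so the new harmonics show up directly at bins 2000 and 3000.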

Since I don't entirely understand the nature of this effect, I will try to find more information on its possible cause later.

The Goal of Spatialization

After going through all these lengthy explanations, you may be wondering: what's the actual outcome of all this? After all, there are commercially available spatializers with no-hassle setup, head tracking, etc. Are there any benefits besides learning more about how the human auditory system works? I've done some comparisons with the spatialization available on the iOS platform, and I would claim that my DIY spatialization has higher quality. I compared the sound of the tracks which I use for tuning my setup via the iOS spatializer and via mine, and I find mine to be more precise and more realistic, allowing me to achieve a true state of "immersion."

It's an interesting experience which sort of comes and goes, and depends on the track and the headphones being used. After 12–15 minutes of listening, the brain gets accustomed to the reproduction and eventually starts "believing" that it actually hears the world created by the track. The headphones "disappear"—they feel no different from a hat: we "know" when wearing a hat that it's not the hat that creates the auditory world around us, and I do "know" in the "immersed" state that the surrounding sound does not originate from the headphones. The eyes start automatically following sound sources when they move, and I can really feel their presence. It's also super easy to switch my auditory attention from one object to another. It's really a sense of immersion, similar to the feeling of "transparent reproduction" of music via speakers—a sort of "audiophile nirvana."

So, yeah, for me it's more interesting to build my own setup, and I believe that I can make it sound more natural than affordably priced commercial solutions. It's a similar thing with speakers: sure, there exist a lot of really good speakers which may work fantastically out of the box, however some people, myself included, find it rewarding to build—if not design—their own.


OK, if you are still with me, let's take a look at the topology of the processing chain:

Let's go over the processing blocks of the chain. Note that the actual values for plugin parameters are only specified as an example. In the Part II of this post, I will go through the process of finding the suitable values for particular headphones and ears.

The "Input" block is just a convenience. Since my chain is "implemented" as a set of tracks in Reaper, having a dedicated input track makes it easier to switch the input source, try rendering a media file (via the master bus) in order to check for the absence of clipping, and apply attenuation if necessary.

The 3 processing blocks are wired in parallel, and in fact consist of the same set of plugins, just with different settings. The purpose of having the same set of plugins is to make time alignment easier. Although Reaper can compensate for the processing delay, sometimes this does not work right, and having the same set of plugins works more reliably.

The first processing block is for the "Direct" output. According to the principle of keeping the direct output as clean as possible, the only plugin which is engaged here is the cross-feed plugin 112dB RedLine Monitor which is set to the "classical" 60 deg speaker angle, no attenuation of the center, and emulation of distance turned off.

The "Side Reverb" block only processes the Side component, by toggling on the "Mid Mute" button on the Voxengo MSED plugin. As I mentioned above, the room reverb applied here was low-passed. The reverb is applied by MeldaProduction MConvolutionEZ. The cross-feed plugin uses a different setting than the "Direct" block—the center attenuation is set to the maximum, -3 dB and a slightly wider speaker angle: 70 deg is used. This is to avoid producing overly cross-correlated output. Then, also as explained above, the 3rd harmonic is added by using the Fielding DSP Reviver plugin.

The "Mid Reverb" block processes the Mid component only. It uses a whole version of the room reverb, with a higher delay. The cross-feed uses the same angle as the Direct output, for consistency, while the center attenuation is at -3 dB to produce more uncorrelated output. The Reviver plugin is set to add the 2nd harmonic.

The output from all 3 processing blocks is mixed together in different proportions. While the Direct output is left unattenuated, the reverb inputs are attenuated significantly. The actual values depend on the headphones used. Levels that are needed for open-back headphones are so low that the overall frequency response deviation from a flat line is within 1 dB.

The shaping of the signal that happens in the "Output" block is more significant. In fact, the whole purpose of the "Output" block is to adjust the output for the particular headphones. First, per-frequency left-right balance is corrected using the linear phase equalizer LP10 by DDMF—this is similar to the technique originally proposed by David Griesinger.

Then the Goodhertz Tone Control plugin is used to adjust the spectral tilt. The slopes are set to 0% both for bass and treble. This creates a very smooth tilt which practically does not affect the phase, and thus there is no need to switch the plugin into the "Linear Phase" mode. Note that although LP10 can also apply a spectral tilt, it's less flexible than what Tone Control can do. Finally, the MConvolutionEZ plugin, operating in "Mid" and "Side" modes, is used to apply "DF-to-FF" or "FF-to-DF" correction curves.

Obviously, linear phase plugins create significant latency, thus this setup is not intended for "real-time" playback. However, using the linear phase mode is worth it. I actually tried doing the headphone balance adjustments with a regular minimum phase equalizer, and the result was much "fuzzier." In fact, I can hear the same kind of "fuzziness" in the iOS spatializer running in the "head tracking" mode. It seems that minimum phase equalization with narrowband filters causes a significant increase in the ASW of sound sources.

What's Next

In the upcoming Part II of this post, I will provide steps on finding the right values to configure the processing components. These parameters are printed in italics on the processing chain scheme from the previous section.