Electronic Projects

Sunday, May 18, 2025

LXdesktop Headphone Auralization Tuning

This post continues the previous post LXdesktop Auralization with Ambisonics providing more details on the tuning of the chain that I have built, as well as some listening impressions of myself and from other people to whom I have demoed the setup.

Evaluation of the Initial Version by Listeners

People who were participating in the listening demo were already familiar with the "immersive audio" technology with its implementations on Android and iOS as well as how binaural renderings done by Dolby Atmos and MPEG-H authoring tools sound. They also understood how head tracking works, and why it is needed. Still, everyone listening was surprised with a very wide and externalized image provided by the ambisonic rendering. Another point many people noticed was that when rotating the head, the scene was rotating smoothly, with no perceivable "jumps" of the phantom center which usually present on traditional discrete channel spatializers as one turns their head in the direction of the left or right virtual speaker.

However, they have also noticed some drawbacks:

Some of the listeners have noted the "fuzziness" of the reproduced sound sources.
For some of the listeners the center image was feeling unnaturally elevated.
As people were switching between "raw" headphone playback (on Sennheiser HD600) and the binaural render, they noted that they wanted to get more bass for the latter.

I decided to take some time to address this feedback, and this had led me to making some improvements for my playback chain.

Improving Phantom Center Naturalness

The problem of the elevated phantom center frequently occurs when listening to binaural records in headphones. It may also occur when listening to regular stereo records with stereo speakers. Many people note that when panning a mono source using amplitude panning between the left and the right speaker, the trajectory of the rendered source may have a "rainbow" shape, meaning elevated center.

Although the manifestation of the problem is the same, the reasons for its occurrence are different for the speaker and the headphone playback. For the playback over stereo speakers commonly cited reasons are:

Presence of reflections which make the phantom center image to be perceived as more fuzzy compared to the acoustic sources that originate mostly from one speaker.
Speakers are often not placed to have their acoustic center at the eye (ear) level, and this introduces a vertical component into the recreated sound scene.
Other spectral colorations caused by acoustic interference may create more energy in the bands that are associated by the brain with vertical sources. In the absence of visual clues, the auditory system assumes that the sound from an invisible source comes somewhere from above.

For headphone playback, especially when doing "3D", "binaural" or "immersive", one of the common problems is the mismatch between the listener's own and the simulated HRTFs. Similar to the reason 3. for the speaker playback, mismatched HRTFs can also push more energy into the frequency bands associated by the brain with vertical source placement. This is a quote from the interview with the "father" of the Neumann KU-100 dummy head Stephan Peus describing this problem in the context of binaural recordings made with it:

We have also changed the "pitch angle" of the ears somewhat. In listening tests with the KU 81, it had been noticed that sound sources in the horizontal plane usually tended to be perceived slightly upward during reproduction. This is related to a characteristic “dip” in the horizontal frequency response of our outer ears. For every natural ear, that dip is at a slightly different frequency. This does not interfere with natural hearing, because we “adjust” the location of sound sources with the help of our eyes throughout our lives. If we are now given a certain configuration by the dummy head, we cannot correct visually. As it happened, the aforementioned dip in the horizontal frequency response of the KU 81 caused sound events from the front to be perceived as slightly shifted upward. In the KU 100, we therefore adjusted the angles of the ear cups relative to the vertical so that the imaging is now correct horizontally and vertically.

Now, imagine what happens when we simulate speaker playback over headphones, via HRTFs of a dummy head! I suppose, all of these problems combine and affect the perception of the phantom center even stronger. I can't fix the HRTF issue because in my processor I just use the HRTFs of the KU-100 head (via the IEM BinauralDecoder plugin). However, I was able to some extent fix the "fuzziness" of the phantom center.

My approach uses the same idea as speaker crosstalk cancellation, however in my case I did not have to use any actual XTC filters. First, let's recap the essence of my approach. Using a stereo speaker setup and an Ambisonics microphone I have captured transfer functions between each speaker and each microphone capsule in order to simulate real-time recording of the speakers by the microphone:

Now we can realize that if we render the left and the right speaker separately (each on its own track), on the output we will get ipsi- and contra-lateral signals for each of them separately—that gives us 4 channels: for each combination of left/right speaker and left/right ear. When mixing these signals for the binaural presentation we can control how much cross-talk we want to have in the end. First I have tried having no cross-talk at all—that's the ideal XTC! However this did not sound natural at all, resembling very much the regular "headphone sound", just with extra reverb. The resulting phantom center was very close to the face. I have found that attenuating ipsi-lateral paths by about 6 dB produces the most natural result and yields a very compact and clean sounding phantom center. Recall that HRTFs of the dummy head already incorporate some head shadowing, this is why the extra attenuation does not need to be excessive.

Did it fix the perceived center elevation problem? Yes, it did! However, fixing the phantom center this way had a negative effect on the width of the perceived sound scene—instead of appearing wide in front of me, the left and right sides have now collapsed close to my ears. Why is that? This has reminded me of the dilemma that people often mention on audio forums: why speaker listeners want to reduce the crosstalk between them, while headphone listeners often want to add crosstalk, via cross-feed circuits or plugins? The answer is—they are tackling different problems.

As I have noted earlier, the XTC is aimed to fix the coloration and fuzziness of the phantom center by attenuating ipsilateral audio paths from stereo speakers to ears, and making the sound waves achieving ears to have the characteristic closer to a "real" center source in front of the listener's head.

And the cross-feed mostly fixes reproduction of lateral sources in headphones. If we consider hard-panned dry sources that only exist in one of stereo channels, these sound very unnatural on headphones because only one ear receives the signal for them, resulting in "inside the head" localization. This is because lateral sound sources occurring outside of the listener will always be heard by both ears, with natural attenuation from head shadowing, and time of arrival difference.

So it seems that we need to decompose our stereo signal and separate "phantom center" signals from lateral signals. In multi-channel and object-based audio scene representations this decomposition is given, but for stereo sources we have to do some work. I decided to employ the approach similar to the one described in the post Headphone Stereo Improved, Part III. I separated the stereo stream into 3 components:

mostly correlated components: the "phantom center";
mostly uncorrelated components: lateral signals created by hard left/right amplitude panning;
the rest: components lacking strong correlation, or anti-correlated: the ambience.

I thought I could use a multichannel upmixer for this. However, after experimenting with the free SpecWeb tool set and inexpensive Waves UM225 I have realized that although upmixers use conceptually the same approach for components separation, their end goal is a bit different because their target is a multi channel speaker system. Thus they are designed to "spread" virtual sources softly between pairs of speakers, for example, the phantom center is also "translated" into some energy in the left and right channels, but I need to extract it in almost "solo" fashion. Also, in multichannel setups there are typically no dedicated channels for "ambience", and ambient components are also spread across all channels. It is possible that with some practice I could set up an upmixer to avoid spreading and do what I need, but decided to leave that for later.

So, instead I decided to use the Bertom "Phantom Center" plugin for this operation. But how to extract lateral sources? While the phantom center is composed of fully correlated components, the "residue"—the non-correlated components is composed from a mix of lateral and "diffuse" sounds. So I came up with the following topology for extracting lateral sounds, which uses both the "Phantom Center" and Mid-Side approach:

The idea is that if we invert one of the channels, and process the result via the "Phantom Center", it will extract the anti-correlated, "diffuse" components. This way we can separate them from lateral components and end up with 3 sound "streams" that I have enumerated above. To illustrate the result, here is how this plugin setup separates a set of Dirac pulses which correspond to different source positions, based on their correlation:

If you want to refresh your understanding of interchannel correlation, please refer to my old post On Mid/Side Equalization. Of course, this decomposition only works correctly for amplitude-panned sources, because the correlation meter in the "Phantom Center" plugin uses zero-lag setting, however in practice this approach yields good results for stereo records.

Note that in the end I have settled on 98% setting for the "Phantom Center" to avoid sharp transitions between the streams.

So, from this preprocessor we have 3 pairs of outputs (the aforementioned 3 streams). Each pair is processed independently, and moreover, each channel of the pair has its own speaker-to-binaural processing path, which yields 4 channels, thus at the output we have 12 channels, each of them representing a certain component of the stereo field, as rendered via particular speaker, on the path to each ear. This gives us full control on how to mix these components for binaural playback and allows us to use both XTC and cross-feed at the same time, applied to proper kinds of acoustic sources. I have ended up with following mixing matrix in Reaper:

From left to right, the first 4 channels are the center: Left speaker to left ear, and to right ear, then right speaker to left and right ear. Then the next 4 channels are the lateral components, in the same order, and the last 4 channels are the ones representing "ambience".

As a bonus, I had realized that by defeating the effect of head shadowing for ambient components by boosting ipsi-lateral paths I can achieve even better externalization of the virtual sound scene. In my previous spatializer I had achieved the same effect by just boosting uncorrelated components.

I have made the final adjustments to the balances by listening to correlated and anti-correlated pink noise, making sure that they both sound centered. I ended up wondering why the required interchannel balance is not symmetric, and my hypothesis is that first, of course use of non-individual HRTFs may cause this, and the second reason may be due to not fully symmetric speakers setup in the room (see my earlier posts on LXdesktop). In future I will try to correct that by making a better speaker setup.

Fixing the Bass

Of course, as audiophiles we always enjoy rich and deep bass, and headphone makers usually try to add more bass to their headphones. As I and other people have found while comparing "raw" stereo sound to the sound of the binaural rendering, the latter was lacking bass noticeably. That seemed strange to me, considering that I have a good subwoofer and I never feel the lack of bass when listening to my LXdesktop setup.

Trying just to boost up the bass of the binaural renderer's output led to excessive on the head vibration of headphone drivers which was ruining the externalization effect. This needs to be done in some other way, I decided.

After reading a bit more about the implementation of the BinauralDecoder, I noted that it uses the MagLS approach for interpolating between sampling points of measured HRTFs. This approach is intended to minimize the amplitude differences only. Although the authors say that this approach is only used starting at 2 kHz which may imply that interaural time differences are preserved for the frequencies below, I decided to check what will happen if I actually add them.

Since I have separate signal paths for the left and right ear, I decided to employ my "almost linear phase" ITD filters, and I was not disappointed—the sense of a good deep bass has returned back to my binaural renderer! It's interesting that these filters have flat amplitude, they do not boost the energy of the bass at all. Yet, somehow adding a correct shift in phase between the ipsi- and contra-lateral ear makes the bass sound to be perceived as stronger. While doing A/B switching between filter/no-filter configuration I have realized that maybe this phase shift allows the auditory system to "focus" on the bass and perceive it as coming from a compact source. Whereas mostly in-phase bass creates an impression of some ambient rumbling, and is not perceived to be as strong, even at the same energy level.

After some experiments with the cutoff frequency, I have ended up with 500 Hz for the "center" source, 750 Hz for the lateral components, and no ITD filtering for ambience. Raising the cutoff frequency or trying to apply the filter to the ambient component was causing moving of virtual sources closer to the face, and I did not like it.

One technical issue that use of this block creates is addition of latency. Since the filters are symmetric linear phase style, and they need good resolution in the bass region, they create a delay of 170 ms. And since they have to be placed after the BinauralDecoder, this latency affects the head tracking.

Adjusting the Tonality

It's never easy to tune an audio reproducing system to the ideal tonality (does one even exist?). My binaural renderer is of course not an exception to this rule. The first difficulty in obtaining the right tuning is that there are many uncertainties in how I had captured the speakers and the room, and also how the captured system is rendered using the binaural renderer and the headphones. The second reason is that the perceived tonality changes depending on how the brain perceives the location, the size, and the distance to virtual sources.

So we don't know how precise our measurements are, and we need to use our perceptions for tuning as well. However, in order to do that efficiently we would like to be able to make instant comparisons with some reference. One good reference I have found, thanks for Archimago's post, was the binaural version of the "Touch Yello" album released as a Dolby Atmos remaster in 2025 on a Blu-ray. The binaural version sounded quite good when listening with Sennheiser HD600 so I decided to use the frequency domain measurement of it as a reference for fine-tuning my binaural chain.

The Blu-ray contains both the stereo and the binaural versions, so I was able to measure frequency curves both for the binaural rendering of the stereo version via my chain, and of the original binaural version. Below is the result of comparison of ERB smoothed curves. The FR of the original binaural version is in blue and red, and the FR of my version is orange + light teal:

It can be seen that the official binaural version has somewhat V-shaped (raised both bass and treble, with a dip in the mids) tuning compared to my rendering. My initial plan was to try to match them as closely as possible. However, as I have quickly understood, due to the fact that my rendering sounds as being farther from the listener than the original binaural version, their spectral shapes can't be just the same. Instead, my approach was to find the regions where there is a significant difference and then try adjusting these bands while listening to the changes in tonality and perception. The goal was to obtain more natural tonality for my rendering and minimize the changes in perceived tonality when switching back and forth between mine and the "official" binaural versions.

In the end my version sounds more spacious and is better externalized, while the original binaural rendering sounds closer to the face and is much "denser". You can compare the results yourself by using YouTube and Google Drive links below (the Drive version uses AAC at 320 kbps while YouTube transcoded it into Opus 140 kbps). Note that although my rendering is done specifically for Sennheiser HD600, you can anyway use any reasonable headphones to check it. I even used Apple EarPods for some testing! Just one note—if you are listening on modern headphones that support "spatial audio", make sure you turn it off and just use the plain stereo mode:

(Of course, these are provided for educational or personal use only).

A question one can ask—what is it besides the frequency balance that makes the original binaural rendering to be perceived very close to the face, while my rendering sounds much more externalized? One reason I have found is the objective measurement of the interaural cross correlation (IACC). Below are two graphs comparing IACC for these two binaural versions for the time position about 0:30:

We can see that my rendering is much less correlated in the high frequency region starting from 2 kHz which corresponds better to a more spacious listening experience. IACC is one of the metrics used by acousticians for objectively comparing sound of different concert halls (see the book by Y. Ando and P. Cariani "Auditory and Visual Sensations") and different microphone setups (see the book by E. Pfanzagl-Cardone "The Art and Science of Surround and Stereo Recording").

Another "source of truth" for the tonality that I have found are recordings produced by Cobra Records. Whereas the Yello binaural production was a result of binaural rendering done from Dolby Atmos master, recordings done at Cobra are real acoustical recordings done simultaneously by conventional multi-mic setups for stereo and surround, and using the KU-100 head for the binaural version. If you recall, the IEM BinauralDecoder plugin is also based on KU-100 free field HRTFs, and thus comparing the rendering of Cobra's stereo records processed by my chain with their binaural versions makes quite a fair apples-to-apples comparison.

Unfortunately, I do not know for which headphones their binaural version is intended for. I can imagine it should be for some diffuse field equalized headphones. So, as an example, here is an excerpt from "Extemporize" piano album, where my chain is rendered for HD800, again both as YouTube and "offline" files:

One thing that I have noted when comparing my binaural rendering with Cobra's binaural recording is that the latter for some reason have left and right channels swapped, and I have fixed that for my comparison test. The difference between these recordings / renderings is more subtle than with Yello track—the stereo recording is really good by itself! Still, I hope you could get the similar experience of the sound moving away from the head when listening to the rendering via my processing chain.

For completeness, this is a similar comparison of ERB-smoothed frequency responses of these renderings:

Again, it's a pity that producers of these binaural recordings do not specify which headphone would they recommend for listening to them. The page at Cobra Records site says "any brand or style will work (nothing fancy required!)"—to me that sounds too generous. I know, you can hear a difference using any headphones, but for actually experiencing "being there" the tonal balance of the headphones used for binaural reproduction is very important.

Using More Headphones for Auralization

With the equalization which needs to be applied to the headphones in order to achieve the "natural" equalization for the immersive playback all of them start to have similar tonality. Yet, the impression that we have when listening to them is still not the same. Even with the same EQ target, different headphones create different listening impressions due to differences in their drivers and their interaction with our ears.

Another important aspect of headphones apart from how they sound is how they feel on the head: their clamping force, size of the earpads, the weight etc. I have found Sennheiser HD800 and Shure SRH1840 to be very comfortable for long listening sessions. However, unfortunately neither of them is on the list of headphones originally measured by B. Bernschütz on KU-100 and thus they are absent from the EQ list of the IEM BinauralDecoder plugin.

However, I'm lucky to have access to the state-of-the-art headphone measuring system B&K 5128, so I used it to derive EQ filters turning HD800 into HD600 which are on the list of the BinauralDecoder. Note that EQuing headphones to sound like some other model is actually a non-trivial task. At the last NAMM convention I had a conversation with a representative of TiTumAudio company that makes headphones that can imitate a number of commercial headphones. He noted that actually copying the sound of other headphones requires tweaks that are beyond simple LTI processing (that is, EQing). This is why I specifically have chosen HD600 as the target for HD800 as their drivers are probably the closest in their non-linear properties, compared to headphones from other makers. In a similar fashion, I use AKG K240DF (which are on the BinauralDecoder) list as the target for other headphones by AKG, Beyerdynamic DT990 as the target for other Beyers, and discontinued Shure SRH940 for more modern models.

One technical challenge I have encountered when creating filters for these conversions is that the measurements for the left and right ear of the B&K 5128 never match exactly. Instead of using an average between left and right ear, I decided to leave them different. However, in order to avoid distorting phase relationship between left and right channels, I have made these conversion filters linear phase (8k taps). Since that creates an extra delay, I put this correction block before the SceneRotator in order to reduce the latency of head tracking.

Complete Processing Chain

Summarizing the processing blocks mentioned in the previous sections, this is what I ended up with:

It may look big, but after all these are all only necessary components. Realistic binaural rendering is not an easy task!

Music Tracks Used for Evaluation

While iterating on my spatializer, and also during preparation to demoing it to other people, I have come up with a list of songs available on Apple Music that have good spatial properties. They represent different musical styles: classical, pop, electronic, metal, etc. They also represent different recording styles: acoustic recordings, engineered stereo recordings, some modern Dolby Atmos tracks (rendered into stereo).

Roughly, I can classify them into several categories based on which attributes of the reproduction they can help testing. Of course, some tracks do belong to several categories.

Great stereo acoustic recordings with natural vocals and instruments
- All Roads to the River / Breaking Silence by Janis Ian
- Also sprach Zarathustra, Op. 30, TrV 176 by Richard Strauss
- Grandmother / The Raven by Rebecca Pidgeon
- L'Égyptienne / Les Sauvages by Béatrice Martin
- No Flight Tonight / from Chesky Records 10 Best, by The Coryells
- Pipeline / Two Doors by Michael Shrieve
- The Firebird Suite (1919 Version): V. Infernal Dance of King Kaschey by Igor Stravinsky
- The Wrath of God: Pt. 1 by Sofia Gubaidulina
- Violin Concerto No. 1: III. Quarter note by Philip Glass
Engineered records with strong emphasis on spaciousness
- An Echo of Night / The Pearl by Brian Eno & Harold Budd
- An Ending (Ascent) / Apollo: Atmospheres and Soundtracks by Brian Eno
- Animal Genesis / Oxymore by Jean-Michel Jarre
- Barco / Insen by Ryuichi Sakamoto & Alva Noto
- Contrapunctus 8, A 3 / Laibachkunstderfuge by Laibach
- Day One (Interstellar Theme) / Interstellar (OST) by Hans Zimmer
- Get Your Filthy Hands Off My Desert / The Final Cut by Pink Floyd
- High Hopes / The Division Bell by Pink Floyd
- Resonance / Resonance by Boris Blank
- Ripples in the Sand / Dune (OST) by Hans Zimmer
- The Snake and the Moon / Spiritchaser by Dead Can Dance
- Troubled / Passion (The Last Temptation of Christ OST) by Peter Gabriel
Synthesized stereo scenes, not as "spacious" but still interesting
- Another One Bites the Dust / The Game by Queen
- Birds / Nameless by Dominique Fils-Aimé
- Bubbles / Wandering—EP by Yosi Horikawa
- Jeremiah Blues (Part 1) / The Soul Cages by Sting
- Me or Him / Radio K.A.O.S. by Roger Waters
- On the Run / The Dark Side of the Moon by Pink Floyd
- Rocket Man / Honky Château by Elton John
- Space Oddity / David Bowie (aka Space Oddity) by David Bowie
- The Invisible Man / The Miracle by Queen
- Voice of the Soul (1996 Demos) (Instrumental) / The Sound of Perseverance by Death
- What God Wants, Pt. I / Amused to Death by Roger Waters
Not quite "spacious" but with good "visceral impact"
- Dyers Eve / ...And Justice for All by Metallica
- Flint March / Small Craft On A Milk Sea by Brian Eno, Jon Hopkins & Leo Abrahams
- Heatmap / Warmech by Front Line Assembly
- Lie to Me / Some Great Reward by Depeche Mode
- Single Blip / Ssss by VCMG

When demoing, I realized that most of these tracks are unknown to most people. They were generally choosing either Hans Zimmer's songs, or Pink Floyd, and for some reason the Rocket Man was also popular.

Remaining Issues (Future Work)

Essentially, I have two kinds of problems: one kind stems from the fact that instead of using artificial models of speakers and the room I use a capture of real speakers in a real room, so any of its flaws get aggravated by the processing and the listening on headphones. The second kind of problems lies purely in the DSP domain and hopefully can be fixed more easily.

Fixing Speaker Setup and Room Asymmetry

As noted in the section "Improving Phantom Center Naturalness", my Ambisonic capture of the room is not perfectly balanced thus requiring some correction. This is not the problem of the capture process, but rather the fact that it captures the imperfections of my setup, and they get aggravated by headphone listening. Having a better, more symmetric setup should help.

Reducing Room Ringing

This issue I have encountered when I was listening to male vocals recording. This was opera tenor Joseph Calleja performing "I Lombard" by Verdi. I had experienced a very uncomfortable sensation of "ringing" and "compression" in the sound of Calleja's singing. I had compared my rendering to the original recording and noticed that some of these artefacts are already there, due to the reverberation of the hall when the recording was made. Then I listened on the speakers, and noted that these artefacts are even more pronounced due to the reverberation added by my room. I think the primary sources for these are comb filtering and flutter echo that interact with the harmonics of the singer's voice.

I have realized that if I had invited Joseph Calleja to actually sing in my room, I would likely hear this compression and ringing as well. I recalled that I actually can notice these artifacts when listening to live vocals while sitting in acoustically mediocre halls.

What can I do about that? Ideally, I would like to reduce the reverberation of my room captured by the IRs, and treat the reflections. However it's not easy to apply this cleanup post hoc on already captured IRs. I decided that next time I will probably put some sound absorbing materials at the back side of the microphone in order to produce a bit more "dead" IRs.

Achieving Better Quality of Stereo Field Decomposition

As noted in the paper by E. Vickers "Frequency-Domain Two- to Three-Channel Upmix for Center Channel Derivation and Speech Enhancement", frequency-domain audio processes may produce certain artifacts, often described as "musical noise" or "watery sound." This is indeed what I can hear when I decompose pink noise into correlated and anti- and uncorrelated components and then listen to each of them separately. When the processed sound gets combined back together, these artifacts are mostly masked, however they still may pop up when listening to music with lots of transients. Ideally, I would like to find a more "high fidelity" way of decomposing the stereo sound field.

I have contacted Tom from Bertom Audio regarding the artefacts produced by the Phantom Center plugin, and his answer was that unfortunately nothing can be done to the current version of the plugin to get rid of them completely. So a possible solution to this may be studying the way for achieving the same decomposition using one of those expensive high quality upmixing plugins.

Solving the Latency Problem of ITD filters

As I have previously noted, the ITD filters which are required for good bass reproduction when binaural rendering is done via IEM BinauralDecoder have noticeable latency. So I either need to find a binaural renderer for Ambisonics which produces similar inter-aural phase, or re-create the filters in some mixed-phase way, with much lower latency.

Wednesday, December 18, 2024

LXdesktop Auralization with Ambisonics

In this post I describe yet another approach for producing a more natural rendering of stereo recordings via headphones. Previously I was exploring the idea of adding the features of speaker playback that we usually find missing on headphones: the room reverb, the natural cross-feed between signals of the speakers when they reach ears, the effect of head and torso reflection, etc (see old posts like this one). The results were sounding to me more pleasing and more spatially "correct" than any commercial spatializer products I have tried. However, when staying in front of my LXdesktop setup while wearing headphones, and switching back and forth between playback over the speakers and the headphones, the difference is still obvious, with the speaker sound being so much wider and having more depth, while the playback via headphones is always sounding too close to the head, both in forward and lateral dimensions.

The new approach that I have tried and describe in this post is based on a multi-microphone capture of the sound field that the speakers and the room create, representing it as 3rd order Ambisonics, and then using binaural renderers. This rendering has ended up sounding surprisingly realistic, finally allowing to create with headphones an illusion that I'm listening to my speakers. Another interesting point that I will discuss is that, in some sense, this rendering can sound even better than the speakers used for making it.

Why Ambisonics?

Let's start with a simple question—why using Ambisonics at all, when the end result is still a two channel binaural rendering. Why go through this intermediate representation if we could just make a binaural capture of the speakers using a dummy head in the first place? We probably need to disambiguate what is meant by the "capture" here. Binaural recordings are a great way for capturing an event and that we want to listen to later. When we are listening back at the same place where the recording was taken, it usually sounds very convincing, as if we are hearing a real event. The cheapest available solution for such a capture is in-ear microphones, and I had tried using the mics by The Sound Professionals for this purpose.

The "event" that is being captured does not necessarily need to be a real performance, it can as well be a playback of a song over the speakers. The benefit of using speakers is that we can capture their properties and use them to listen back not only to this particular song as played by these speakers, but to any arbitrary song played by them. For that what we need is a capture of impulse responses (IRs) and apply them to a stereo recording via convolution. In the simplest case, we just need 2 pairs of IRs: one pair for left speaker to the left and to the right ear, and another pair for the right speaker. These IRs are usually called "BRIRs": "Binaural Room Impulse Responses."

The problem is that making a good capture of BRIRs using a live person's head is very challenging. The main difficulty is that due to noisiness of the environment and relatively long reverberation times of real rooms, the sweep signal used for the capture has to last for about 30–60 seconds, and the head must stay completely still during this time, otherwise the information about high frequencies will become partially incorrect.

As a side note, the same binaural capturing process is used for acquiring HRTFs of real persons, however, such an acquisition is done in anechoic chambers, thus the sweeps can be quite short—under 1 second, and there is usually some kind of a fixture used for preventing the subject's head from moving.

As another side note, I know that with Smyth Realiser they can actually capture BRIRs in customers' rooms in a "high fidelity" way, but it must be their "know how" and it is not trivial to match with a DIY approach.

Now, what if in our room, instead of a real person we use a dummy head, like Neumann KU100? That assumes it's OK for you to buy one—due to their "legendary" status in the audio recording industry they are very pricey. If you have got one, it is of course much easier to make a stationary setup because the artificial head is not moving by itself, it is not even breathing, thus capture of a good quality BRIRs is certainly possible.

However, the recorded BRIRs can not be further manipulated. For example, it is not possible to add rotations for head tracking, as that would require capturing BRIRs for every possible turn of the head (similar to how it's done when capturing HRTFs). So, another approach which is more practical, is to capture the "sound field" because it can be further manipulated, and then finally rendered into any particular representation, including simulation of a dummy head recording. This is what Ambisonics is about.

The Capture and Rendering Process

I've got a chance to get my hands on the Zylia Ambisonics microphone (ZM-1E). It looks like a sphere with the diameter slightly less than of a human head, and has 19 microphone capsules spread over the surface (that's much more than just two ears!). The Zylia microphone has a number of applications, including capturing of small band performances, recording sound of spaces for VR, and of course producing these calming recordings of ocean waves, rain, and of birds chirping in woods.

However, instead of going with this mic into woods or to a seashore, I stayed indoors and used it to record the output of my LXdesktop speakers. This way I could make ambisonics recordings that can be rendered into binaural, and even have support for head tracking. The results of preliminary testing in which I was capturing the playback of speakers have turned out to be very good. My method was consisting of capturing raw input from the ZM-1E—this produces a 19-channel recording, and this is what is called the "A-format" in the Ambisonics terminology. Then I could experiment with this recording by feeding it to the Ambisonics converter provided for free by Zylia. The converter plugin outputs 16 channels of 3rd order Ambisonics (TOA) which then can be used with one of the numerous Ambisonics processing and rendering plugins. The simplified chain I have initially settled upon is as follows:

A practical foundation for the binaural rendering of Ambisonics used by the IEM BinauralDecoder from the "IEM Plug-in Suite" is the result of the research work by B. Bernschütz which is described in the paper "A Spherical Far Field HRIR/HRTF Compilation of the Neumann KU 100". They have recorded IRs for the KU100 head which was rotated by a robot arm before a speaker in an anechoic chamber. Below is a photo by P. Stade from the project page:

This allowed producing IRs for the number of positions on a sphere that are enough for producing a realistically sounding high order Ambisonics rendering. The work also provides compensating filters for a number of headphones (more on that later), and among them, I have found AKG K240DF (the diffuse field compensated version) and decided to use it during preliminary testing.

Since the binaural rendering is done using a non-personalized HRTF, in order to help the auditory system to adapt, it's better to use head tracking. Thankfully, with Ambisonics the head tracking can be applied quite easily with a plugin which is usually called "Rotator". The IEM suite has one which is called SceneRotator. It needs to connect to the head tracking provider which is usually a standalone app via the protocol called OSC. I have found a free app called Nxosc which can use the Bluetooth WavesNX head tracker that I had bought a long time ago.

Somewhat non-trivial steps for using IEM SceneRotator with Nxosc are:

Assuming that Nxosc is already running and has been connected to the tracker, we need to connect SceneRotator to it. For that, click on the "OSC" text at the left bottom corner of the plugin window. This opens a pop-up dialog and in it, we need to enter 8000 into the Listen to port field, and then press the Connect button.
Since our setup is listener-centric, we need to "flip" the application of Yaw, Pitch, and Roll parameters by checking "Flip" boxes under the corresponding knobs.

As a result, the SceneRotator window should look like below, and the position controls should be moving according to the movements of the head tracker:

And once I made everything working... The reproduced renderings sounded astonishingly real! In the beginning, there were lots of moments when I was asking myself—"Am I actually listening to headphones? Or did I forget to mute the speakers?" This approach to rendering has turned out to be way more realistic than all my previous attempts, finally I have got the depth and the width of the sound field that matched the speaker experience—great!

A-format IRs

And now the most interesting part. After I had played enough with ambisonic captures of speaker playback, I decided that I want to be able to render any stereo recording like that. Sure, there are enough plugins now that can convert stereo signals into ambisonics and synthesize room reverb: besides the plugins from the IEM suite, there is SPARTA suite (also free), and products by dearVR. However, what they synthesize is some idealized speakers and perfect artificial rooms, but I wanted my real room, and my speakers—how do I do that?

I realized that essentially I had to get from two stereo channels to Ambisonics. Initially I was considering capturing IRs of the output from the Zylia's Ambisonics plugin for my speakers. However, I was not sure whether the processing that the plugin performs is purely linear. Although the book on Ambisonics by F. Zotter, M. Frank describes the theory of the transformation of A-format into B-format as a matrix multiplication and convolution (see Section 6.11), I was still not sure whether in practice Zylia does just that, or it also performs some non-linear operations like decorrelation in order to achieve a better sounding result.

Thus, I decided to go one level of processing up, and similar to BRIRs, capture the transfer function for each pair of my speakers and the microphone capsules of the Zylia mic. This is for sure an almost linear function, and it is fully under my control. However, that's a lot of IRs: 2 speakers by 19 microphones makes it 38 IRs in total! And in order to be correctly decoded by Zylia's plugin these IRs have to be relatively aligned both in time and in level, so that they represent correctly the sound waves that reach the spherical microphone and get scattered over its surface.

Thankfully, tools like Acourate and REW can help. I have purchased the multi-mic capture license for REW. This way, I could capture inputs to all 19 microphones from one speaker at once, and what's important, REW maintains the relative level between the captured IRs. However, maintaining time coherence is more tricky. The problem is that REW sets the "start" of the IR at the sample with the maximum value which usually corresponds to the direct sound of the speaker. However, for IRs captured by the mics on the rear part of the sphere, the power of the reflected sound from room walls can actually be higher than the power of the direct sound, thus the "start" position of the IR chosen by REW is not always correct. That's one problem.

Another problem is aligning the levels between the captures of the left and the right speaker. Since they are captured independently, and REW normalizes the captured IRs, the resulting levels between the left and the right speakers may be mismatched.

In order to solve both of these alignment problems, I have captured with the same Zylia setup periodic pulses played in turn by the left and the right speakers. Then by convolving the produced IRs with the original test signal, and comparing the result with the actual recorded signal, I could see whether the timing and level of IRs is correct.

I have performed this alignment manually, which of course was time-consuming. I think, if I ever want to do this again, I would try to automate this process. However, it was actually interesting to look at the waveforms because I could clearly see the difference between the waveforms that were actually captured in the room, and the ones obtained by a convolution with the captured IRs. They actually look very different, see an example below:

What we can see is that the pulse rendered via convolution (red trace) has much lower noise, and looks more regular than an actual capture (green trace). This is because the IR only captures the linear part of the acoustic transfer function. There is actually a whole story on how captures done with the logarithmic sweep relate to the parameters of the original system, I will leave it for future posts. In short, by rendering via the captured IR, we get rid of a lot of noise, and leave out any distortions. So in some sense, the produced sound is a better version of the original system which was used for creating the IRs.

After producing these IRs, I enhanced the rendering chain in order to be able to take a stereo input, and produce 20 (19 mic channels plus 1 silent) channels of audio which emulate the output from the Zylia's microphone, that is the A-format, which is then fed into the Zylia's ambisonics decoder (note that on the diagram we only count channels that actually have signal):

It was an unusual experience for me to create a 40-channel track in Reaper, that's in order to duplicate the left and the right stereo signals into 20 channels for applying the A-format IRs. However, it worked just fine. I truly appreciate robustness and reliability of Reaper!

One issue remained though. Since Zylia is not a calibrated measurement microphone, it imprints its own "sound signature" on the captured IRs. Even by looking at transfer functions of these IRs I could see that both bass and high frequencies have some slope which I did not observe while performing the tuning of LXdesktop. Thus, some correction was needed—how to do it properly?

Zylia Capture and Headphones Compensation

Yet another interesting result of the "Spherical Far Field HRIR/HRTF Compilation" paper is that if we take all frequency responses for all these positions captured by the KU100 head around the sphere, and average them, the result will deviate from the expected diffuse field frequency response by a certain amount, as we can see on the graph below, adapted from the paper, which is overlaid with the graph from the KU100 manual showing the response (RTA bars) of the KU100 in a diffuse field:

I used the actual diffuse field frequency response as the target for my correction. The test signal used in this case is uncorrelated stereo noise (zero interchannel correlation, see notes on ICCC in this post). Note that the "Headphone Equalization" in IEM BinauralDecoder has to be turned off during this tuning.

And this adjustment works, however applying it also "resets" the speaker target curve, so I had to re-insert it into the processing chain. With that, the final version of the processing chain looks like this:

Now if we turn back on the headphone equalization in the IEM BinauralDecoder, it will compensate both for the deviations from the diffuse field for the KU100 head and the selected headphones.

It's funny that although AKG K240DF is described by AKG as a diffuse field compensated headphone, in reality, however deviations as much as 3 dB from the diffuse field response across frequency bands is a norm. Usually the deviations exist on purpose, to help make listening to "normal" non-binaural stereo recordings more pleasant and in addition to produce what is known as the "signature sound" of the headphone maker.

Would Personalized HRTF Work Better?

I was actually surprised by the fact that use of non-personalized HRTFs (even non-human, since they were captured on a KU100 artificial head) works that well for giving the outside-of-head sound experience, and sufficiently accurate localization, including above and behind the head locations. I was wondering what an improvement would be if I actually had my own HRTF captured (I haven't yet, unfortunately) and could use it for rendering?

One thing I have noticed with the non-personalized HRTF is that tonal color may be inaccurate. A simple experiment to demonstrate this is to play mono pink noise and then pan it, with simple amplitude panning, to left and right positions. On my speakers, the noise retains its "color", but in my binaural simulation I can hear difference between the leftmost and the rightmost positions. Since I'm not a professional sound engineer, it's hard for me to equalize for this discrepancy precisely "by ear", and obviously, no measurement would help because this is how my auditory system compensates for my own HRTFs.

By the way, while seeking information for some details related to my project I have found a similar research effort called "Binaural Modeling for High-Fidelity Spatial Audio" done by Viktor Gunnarsson from Dirac which helped him to obtain a PhD degree recently. The work describes a lot of things that I was also thinking of, including the use of personalized HRTFs.

The fun part is that Viktor also used the same Zylia microphone, however unlike me, he also had access to a proper Dolby Atmos multichannel room, you can check out his pictures in this blog post. I hope that this research will end up being used in some product intended for end users, so that audiophiles can enjoy it :)

Music

Finally, this is a brief list of music that I had sort of "rediscovered" while testing my new spatializer.

"You Want it Darker" by Leonard Cohen which was released in 2016 shortly before his death. I'm not a big fan of Cohen, but this particular album is something that I truly like. The opening song creates a wide field for the choral, and Cohen's low voice drops into its net like a massive stone. It makes a big difference when in headphones I can perceive his voice as being at a hand distance before me, as if he is speaking to me, instead of hearing it right up to my face, which feels a bit disturbing. This track helps to evaluate whether the bass is not tuned up too much, as otherwise the vibration of headphones on your head can totally ruin the feeling of distance.

"Imaginary Being" by M-Seven, released in 2011. It was long before the Dolby Atmos was created, but still sound of this album delivers a great envelopment, and it feels nice being surrounded by its synthetic field. My two favorite tracks are "Gone" and "Ghetto Blaster Cowboy" which creates an illusion of being in some "western" movie.

"Mixing Colours" by Roger and Brian Eno, released in 2020 is another example of enveloping sound. Each track carries a name of a stone, and they create a unique timbre for each one of them. Also, Brian Eno is a master of reverb which he uses skillfully on all of his "ambient" albums, including this one, giving a very cozy feeling to the listener.

The self-titled album by Penguin Café Orchestra, released in 1996. I have discovered this band by looking at the list of releases done by the Brian Eno's experimental "Obscure Records" label. The music is rather "eccentric" and sometimes experimental—check the "Telephone and Rubber Band" track for example. However, overall it's a nice work, with lots of different instruments used, which are placed in the sound field around the listener.

"And Justice for All" by Metallica, released in 1988—not an obvious choice after the selection of albums above—this one was suggested by my son. After I have completed setting up my processing chain and wanted to have an independent confirmation that it really sounds like speakers, I have summoned my son and asked him to listen to any track from Apple Music. And he chose "Blackened". In the end, it has turned out to be a good test track for number reasons: first, since it's the "wall of sound" type of music, with lots of harmonic distortion, it's a great way to check whether the DSP processing chain is clean, or does it add any unpleasant effects ("digitis" as Bob Katz calls it) that would become immediately obvious. Second, again it's a good test to see if the amount of bass is right in the headphones. As one usually wants to listen to metal at "spirited" volume levels, any extra amplification of bass frequencies will start shaking the headphones, ruining the externalization perception. And the final reason is that listening in the "virtual studio" conditions that binauralized headphone sound creates allows to hear every detail of the song, so I got a feeling that during more quiet passages of "Blackened"—for example the one starting after the time position 2:50—I can hear traces of the sound of Jason Newsted's bass guitar which was infamously brought down during mixing of this album. Later I have confirmed that it's indeed the bass guitar by listening to a remix done by Eric Hill on YouTube which had brought its track back to a normal level.

Conclusions and Future Work

This experiment of creating a binaural rendering of my desktop speakers using an ambisonics microphone has demonstrated a good potential for achieving high fidelity realistic headphone playback at relatively low cost. In future this is what I would like to explore and talk about:

use of other models of headphones, especially those that are missing from the dataset captured by B. Bernschütz and thus do not present in the headphone equalization list of the IEM BinauralRenderer plugin;
additional elements of processing that improve the sound rendered via this chain;
use of other head trackers, like HeadTracker 1 by Supperware;
individualization of the headphone transfer function and binaural rendering;
ways to compare the headphone playback to speaker playback directly.

Also, one thing I would like to experiment on is doing processing of the left and the right speaker signals separately. Currently we mix the signals from the speakers at the input of the simulated Zylia microphone. However, since the microphone is significantly smaller and is less absorptive compared to a human head and torso, it does not achieve the same level of cross talk attenuation that I have on my physical speakers. So the idea is that we should try processing the input from the left and the right speakers separately, in parallel. And only after we get to the Ambisonics version of the signal, or maybe even to the binaural rendering, we mix these signals together, providing necessary cross-talk cancellation. I'm not sure how to do that in the Ambosonics representation, but in the binaural version it's pretty easy because we will end up having simulated BRIRs, that is, two pairs of two channel signals, so we can attenuate the contra-later signal as desired. In some sense, this approach simulates a barrier installed between the ears of the listener, but without the need to deal with side effects of an actual physical barrier.

Monday, August 26, 2024

LXmini Desktop Version (LXdesktop)—Part III: Binaural Tuning

This is another "technical note" about my experience of building and tuning a pair of desktop speakers with a DSP crossover, based on the original design of the LXmini by Siegfried Linkwitz. This post is about the aspect of tuning which helps to obtain the most natural presentation of the audio scene encoded as a stereo format. As Linkwitz itself explains in this talk "Accurate sound reproduction from two speakers in a living room", a stereo representation is no more than an illusion which only appears in the brain of the listener. However, this can be a rather realistic illusion. It's realistic when the listener is able to forget that the sound scene which he or she is hearing is created using the speakers. Ideally, the speakers themselves should not be registered by the auditory brain as the sound sources.

In the Q&A section of the talk, in particular, on this video fragment, somebody is asking Siegfried what is the recommended speaker setup for a small room. And he recommends putting the speakers wider, and to place the listening spot closer to them. That's in fact what I've done in my setup (see one of the previous posts which illustrates the arrangement). The idea behind this setup is to create sort of "giant headphones"—this characteristic is attributed in to the sense of envelopment that this setup can achieve. In fact, the sound of speakers located at some distance is more natural for our auditory brain than the sound from headphones because the sound from the speakers gets filtered by our natural HRTF, thus it's easier for the brain to decode it properly. However, our perception of sound from these "giant headphones" suffers both from a strong interaction between the speakers themselves—that's the crosstalk affecting the center image, and between the speakers and the room—this interaction produces reflections and additional reverb not existing on the recording—that's the "sound of the room."

The good part is that in my unusual setup the predominant dipole radiation pattern of the speakers is supposed to reduce the crosstalk without resorting to DSP tricks. And as for reflections, they can be filtered out by the auditory brain when they are sufficiently separated from the direct sound, and—that's another interesting point from Siegfried's talk—has the spectral content which is similar to the direct sound. The last topic is actually complex and different people have different views on it. However, the cross-talk cancellation is something that can be easily measured.

Cross-talk Cancellation

I have made two types of measurements: one is the usual log sweep which allows recreating the impulse response and window it as necessary, and another kind is a "steady state" measurement produced by taking "infinitely" averaged RTA of a pink noise. Both measurements are made using a "dummy head" technique, so they are binaural. However, since I don't have a proper head and torso simulator at home, I just use my own head and binaural microphones by The Sound Professionals built with the XLR outputs option so that they can be connected to the same sound card used to drive the speakers. I use REW for these captures, and I have purchased the "multi-mic input" option which is essential for this job since I want to record both the ipsi- and contra-lateral ear inputs at the same time.

The typical way to measure the effectiveness of the cross-talk cancellation (XTC) efficiency is to consider the measurement at the ipsilateral (closer to the speaker) ear and see by how much must it be attenuated in order to obtain the same result as the measurement at the contralateral (farther from the speaker, shadowed by the head) ear. The resulting frequency response curve is the profile of the effective attenuation.

So let's see. If we look at the steady state response, the XTC from my speaker arrangement is quite modest—around -10 dB in the best case. Below is the spectrum of the attenuation for the left and for the right ear:

However, if we look at the direct sound only, by applying a frequency-dependent window (FDW) of 8 cycles to the log sweep measurement, results look much better, showing consistent attenuation values between -20 and -10 dB. It works better for one ear due to asymmetry of the setup:

Note that deep notches as well as a couple of peaks are due to comb filtering from reflections and the effects of the dipole pattern itself. I must warn that just looking at what seems to the eye as the "average value" on the graph and taking this as the suppression efficiency measure may be self-deceiving. In order to calculate the actual negative gain of the XTC I have measured the difference in RMS level of pink noise filtered via the impulse responses of these natural attenuation filters. The results are somewhat modest: -4.7 dB for the sound of the left speaker and -4.9 dB for the sound of the right speaker.

For comparison of performance with DSP XTC solutions, I have checked the Chapter 5 of the book "Immersive Sound" which talks about BACCH filters. There is a graph of a similar measurement that I have done, they have made it using a Neumann KU-100 dummy head in a real (non-anechoic) room using speakers set up at 60 degrees, 2.5 meters distance from the head, with their filters turned on. The Figure 5.12 of the book presents the measured spectrum at the ipsi- and contra-lateral ears, and similarly they measure the effectiveness of the XTC by subtracting these. I have digitized their graphs and derived the attenuation curve, it is presented on the graph below as the brown solid curve, and I have changed my curves to dashed and dotted lines for readability purposes:

We can see that the BACCH XTC does a better job, except in the region of 7–10 kHz. Also note that since I have a single subwoofer there is no attenuation below 100 Hz. The author of the chapter calculates the level of attenuation as an average of the frequency response curve values across the audible spectrum, and their result is -19.54 dB. However, since I had digitized their graph, I could build a filter which implements it and measure the resulting decrease of the RMS level of ping noise, the same method that I used for my measurements. Measured this way, the effective gain of the BACCH XTC is -8.86 dB. This is still better than my result, but only by 4 dB. So I must admit, DSP can do better job than natural attenuation due to the speaker arrangement and the radiation pattern, however as we can see from the chapter text, building a custom XTC filter tailored to the particular setup is a challenging task, and there are many caveats that need to be taken care of.

Center Image Height

As I have explained in the section "Target Curve Adjustments" of the previous post, in order to provide correct rendering of the center image, the spectrum of the sound from the speakers which are on the sides of the listener must be corrected so that the phantom center image has the spectrum that a real center source would have. The paper by Linkwitz which I cited in that post contains necessary details. One good test for the correction is to make sure that a source which is intended to be at the ears (or eyes) height is actually heard this way. For that, I use the track called "Height Test"—track 46 from the "Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 2".

Merging Perceived Results of ITD and ILD Panning

Changing the spectrum of side images in the way described in the previous section also helps to reduce attention to the speakers, because now sounds coming from them do not have the spectral characteristic of a side source.

However, while listening to old engineered recordings (from 70s or earlier) that use "naive" hard panning of instruments entirely to the left or to the right by level adjustment only, I have noticed that this spectrum change is not enough for decoupling the sound of an instrument from the speaker which is playing it. Real acoustic recordings and modern tracks produced with Dolby Atmos sounded better. This was likely because modern panning techniques use both level and delay panning. They may actually use more—to get a full idea of what is possible I used a panning plugin called "PanPot" by GoodHertz.

While playing with a plugin using dry percussion sounds from the "Sound Check" CD produced by Alan Parsons and Stephen Court I have noticed that hard panned sounds using delay panning a perceived a bit "away" from the speakers while level panned sounds are perceived coming from the speaker itself. Schematically, it was perceived like this:

I decided to combine them. In order to move hard ILD panned sounds I use the "Tone Control" plugin, also by GoodHertz. It can do Mid/Side processing, and I switched it to the "Side only" mode. Recall from my previous post on Mid/Side Equalization that M/S decomposition does not completely split out the phantom center from the sides. However, it is good enough to tune hard panned sources.

I have prepared a test track which interleaves pure ILD and ILD+ITD panning of a dry sound of a snare drum. While listening to it, I was experimenting with the settings for the corner frequency, slope, and gain of the treble shelf, as well as with overall gain of the side component. The goal was to move the ILD panned source closer to the position of an ILD+ITD panned source, and at the same time not to change its perceived tonality too much. Obviously the results of panning using different techniques will never sound identically, however, I could come close enough. As a result, the sound scene has moved a bit away from me, behind the speakers plane:

I have pictured the scene as going beyond the speakers because this happens with some well-made recordings like the track 47 "General Image and Resolution Test" from "Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 2" where the sounds of doors being shut are rendered well beyond the speakers in a distance.

It's interesting that correction of the purely level-panned images really helped to decouple the speakers from the sound they are producing! I used tracks from the full OST to "The Holdovers" movie which feature a number of records produced in 60s and 70s. Note that as far as I know, the full version with all tracks is only available on vinyl—the usual issue with licensing on streaming services prevents them from offering all the tracks. And the producer of the OST decided not to bother themselves with offering a CD.

Banded Correlation and Decorrelation

Since my speaker system is not a dipole across the entire spectrum, and walls are located nearby, there was still some "unnaturalness" of the image, even though the quasi-anechoic frequency response looks correct. How can we do further tuning without noticeably affecting the frequency response? The trick is that we can change it depending on the correlation.

For example, while listening to the bass line of the "Spanish Harlem", I have noticed that the first note, which is mostly delivered by the subwoofer does not sound as strong as following notes, which are higher and are delivered by the main speakers. I did not want to raise the level of the sub, because I know it is at the right level, and listening to OSTs by Hans Zimmer proves that—I don't want the sub to be any louder :). Instead, my solution was to decrease level of the correlated component (the phantom center) in the frequency range served by the woofers—they are omnidirectional, thus their sound is reinforced by the walls. For that I used the "Phantom Center" plugin by Bertom Audio.

Another correlation tweaking needs to be done in the high frequency region, above 4.7 kHz. I took the track I often use—the left / right imaging test from "Chesky Records Jazz Sampler Vol. 1" and overlaid the "midway" position announcement with "off-stage" position. Initial lack of correlation due to somewhat excess of reverberation at high frequencies causes the off-stage announcement to sound either in the front, in the position similar to the "midway" position, or even "inside the head." By increasing the correlation I was able to move it to the intended location. However, having too much correlation causes the phantom center to become too strong and too narrow, which makes the "midway" position to collapse to close to the center. Thus, by hearing both announcements at the same time I can increase the correlation to the just right amount.

Finally, I used a set of 1/3 octave band-filtered dry mono recordings of percussion instruments converted to stereo: first with identical left and right channels, then with the right channel inverted. This is the same set of sounds that I used in this post about headphones. I compared how loud the correlated version sounds to relative to anti-correlated. It is expected that it should be of the same loudness or a bit louder, however I have found that in the region between 400 and 900 Hz anti-correlated sounds are perceived to be louder than correlated. Unlike my previous experience with traditionally arranged speakers, this time I was able to reduce loudness of anti-correlated sounds in this band.

This perceptual correction helps to reduce attention paid to the details in the sound produced by speakers that get amplified by the room too much. The sound becomes less fatiguing—that's yet another aspect of "naturalness." As Linkwitz has put it, it's better to make our brain to add missing details than trying to force it to remove extra details—that costs much more mental effort which manifests itself in exhaustion resulting from long listening sessions.

Processing Chain

The description of the tuning process has turned out to be a bit lengthy. Let's summarize it with a scheme of the filters that I put on the input path. They are inserted before the digital crossover and correction filters that I described in the Part II of these post series.

So, first there is the Tone Control applied to the "Side" part of the M/S decomposition, which is intended to move ILD-panned sounds a bit deeper down the virtual scene to match ITD+ILD-panned sounds. Then there go 3 instances of the Phantom Center plugin tuned at different frequency bands that perform the job of correcting the effects of the room-speakers interaction. I wish there was kind of an "equalizer" plugin that could apply phantom center balancing to multiple bands—Bertom Audio, take a note :)

Some Tracks to Listen To

Having achieved a good imaging through my speakers, I had re-listened to many albums that I have pinned down in my collection. Here are some tracks that I can recommend listening to:

"The Snake and The Moon" from the "Spiritchaser" album (2007) by Dead Can Dance. It starts with a buzzing sound rotating around the head. The rhythm of the song is set by an African drum pulsating at the center, and there are other instruments appearing at different distances and locations.
"The Fall of the House of Usher" multi-track piece from "Tales of Mystery and Imagination - Edgar Allan Poe" album (1976) by The Alan Parsons Project. Alan Parsons is known for his audio engineering excellence, and this debut album of his own band features rich and interesting combination of various kinds of instruments: synths, guitars, live choir etc. This album was at some point re-issued in a 5.1 version, but I still enjoy it in stereo.
"Spark the Coil" from "We Are the Alchemists" joint album (2015) by Architect, Sonic Area & Hologram is a rhythmical electronic piece with a very clear sound and precise instruments placement.
"Altered State" from the "360" (2010) album by Asura. Nice melodic piece with ethnic instruments and ambient electronics. The track produces good enveloping effect and delivers well positioned instruments.
"Fly On the Windscreen (Final)" from the "Black Celebration" (1986) album by Depeche Mode. Although the recording technique used here is not very sophisticated—one can hear that panning of some instruments is done with level only—it's interesting to listen into different kinds of reverbs applied to sound effects, instruments and vocals.
"Prep for Combat" from the "AirMech" game soundtrack released as the 2012 album by the industrial band Front Line Assembly. It uses rather powerful sounding electronic sounds that are panned dynamically and fly around the listener.
But of course, when we speak about powerful sound, it is hard to compete with Hans Zimmer whose "Hijack" track from the "Blade Runner 2049" OST (2017) is sending shivers down the spine and can be used as a great test of how well the subwoofer(s) is/are integrated with the main speakers.
"Zen Garden (Ryōan-ji Temple Kyoto)" from the classic "Le Parc" album from 1985 by Tangerine Dream starts with ambient sounds of wind and then adds gentle percussion sounds carefully panned along the virtual sound stage. I'm not sure which instruments are synthesized and which are real, but they all sound quite natural, with their locations well-defined.
And finally, one track which is not music but rather the recording of rain—featured at the very end of the movie "Memoria" (2021). I keep listening to it over and over again, especially in the evenings. It feels like you are sitting on a porch, watching rain and listening to delicate yet powerful rumbling of the thunder in the distance. It's funny that the track ends with someone coming up to the recording rig (you can feel their breathing) and turning it off—not sure why they did not cut this out during post-production, but it definitely enhances the feeling of realism of the recording :)