This post continues the previous post, LXdesktop Auralization with Ambisonics, providing more details on the tuning of the chain that I have built, as well as listening impressions from myself and from other people to whom I have demoed the setup.
Evaluation of
the Initial Version by Listeners
The people participating in the listening demo were already familiar with "immersive audio" technology and its implementations on Android and iOS, as well as with how the binaural renderings produced by Dolby Atmos and MPEG-H authoring tools sound. They also understood how head tracking works and why it is needed. Still, everyone listening was surprised by the very wide and externalized image provided by the ambisonic rendering. Another point many people noticed was that when rotating the head, the scene rotates smoothly, with no perceivable "jumps" of the phantom center, which are usually present in traditional discrete-channel spatializers as one turns their head toward the left or right virtual speaker.
However, they also noticed some drawbacks:
- Some of the listeners noted the "fuzziness" of the reproduced sound sources.
- For some listeners, the center image felt unnaturally elevated.
- When switching between "raw" headphone playback (on Sennheiser HD600) and the binaural render, people noted that they wanted more bass in the latter.
I decided to take some time to address this feedback, which led me to making several improvements to my playback chain.
Improving Phantom Center
Naturalness
The problem of the elevated phantom center frequently occurs when listening to binaural recordings in headphones. It may also occur when listening to regular stereo recordings over stereo speakers. Many people note that when a mono source is amplitude-panned between the left and the right speaker, the trajectory of the rendered source may have a "rainbow" shape, meaning an elevated center.
Although the manifestation of the problem is the same, the reasons for its occurrence are different for speaker and headphone playback. For playback over stereo speakers, commonly cited reasons are:
1. Presence of reflections, which makes the phantom center image be perceived as fuzzier compared to acoustic sources that originate mostly from one speaker.
2. Speakers are often not placed with their acoustic centers at the eye (ear) level, and this introduces a vertical component into the recreated sound scene.
3. Other spectral colorations caused by acoustic interference may create more energy in the bands that the brain associates with vertical sources. In the absence of visual cues, the auditory system assumes that the sound from an invisible source comes from somewhere above.
For headphone playback, especially when doing "3D", "binaural", or "immersive" audio, one of the common problems is the mismatch between the listener's own HRTFs and the simulated ones. Similar to reason 3 for speaker playback, mismatched HRTFs can also push more energy into the frequency bands associated by the brain with vertical source placement.
Here is a quote from an interview with Stephan Peus, the "father" of the Neumann KU-100 dummy head, describing this problem in the context of binaural recordings made with it:
We have also changed the "pitch angle" of the ears somewhat. In
listening tests with the KU 81, it had been noticed that sound sources
in the horizontal plane usually tended to be perceived slightly upward
during reproduction. This is related to a characteristic “dip” in the
horizontal frequency response of our outer ears. For every natural ear,
that dip is at a slightly different frequency. This does not interfere
with natural hearing, because we “adjust” the location of sound sources
with the help of our eyes throughout our lives. If we are now given a
certain configuration by the dummy head, we cannot correct visually. As
it happened, the aforementioned dip in the horizontal frequency response
of the KU 81 caused sound events from the front to be perceived as
slightly shifted upward. In the KU 100, we therefore adjusted the angles
of the ear cups relative to the vertical so that the imaging is now
correct horizontally and vertically.
Now, imagine what happens when we simulate speaker playback over headphones, via the HRTFs of a dummy head! I suppose all of these problems combine and affect the perception of the phantom center even more strongly. I can't fix the HRTF mismatch because my processor simply uses the HRTFs of the KU-100 head (via the IEM BinauralDecoder plugin). However, I was able to fix the "fuzziness" of the phantom center to some extent.
My approach uses the same idea as speaker crosstalk cancellation (XTC); however, in my case I did not have to use any actual XTC filters. First, let's recap the essence of my approach: using a stereo speaker setup and an Ambisonics microphone, I captured transfer functions between each speaker and each microphone capsule in order to simulate real-time recording of the speakers by the microphone:

Now we can see that if we render the left and the right speaker separately (each on its own track), at the output we get the ipsi- and contra-lateral signals for each of them separately; that gives us 4 channels, one for each combination of left/right speaker and left/right ear. When mixing these signals for the binaural presentation, we can control how much crosstalk we want to end up with. First I tried having no crosstalk at all: that's ideal XTC! However, this did not sound natural at all, very much resembling the regular "headphone sound", just with extra reverb. The resulting phantom center was very close to the face. I have found that attenuating the contra-lateral paths by about 6 dB produces the most natural result and yields a very compact and clean-sounding phantom center. Recall that the HRTFs of the dummy head already incorporate some head shadowing, which is why the extra attenuation does not need to be excessive.
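To make the routing concrete, here is a minimal sketch of this mix (in Python with NumPy; the channel layout and the 6 dB figure come from the text above, while the function name and structure are purely illustrative):

```python
import numpy as np

def mix_partial_xtc(ls_to_l, ls_to_r, rs_to_l, rs_to_r, xtc_db=-6.0):
    """Mix the four separately rendered speaker-to-ear signals into a
    binaural pair, attenuating the contra-lateral (crosstalk) paths.

    ls_to_l : left speaker  -> left ear  (ipsi-lateral)
    ls_to_r : left speaker  -> right ear (contra-lateral)
    rs_to_l : right speaker -> left ear  (contra-lateral)
    rs_to_r : right speaker -> right ear (ipsi-lateral)
    xtc_db  : extra attenuation of the crosstalk paths; 0 dB keeps all of
              the simulated crosstalk, -inf dB would be "ideal" XTC.
    """
    g = 10.0 ** (xtc_db / 20.0)          # -6 dB -> about 0.5 linear gain
    left_ear = ls_to_l + g * rs_to_l
    right_ear = rs_to_r + g * ls_to_r
    return np.stack([left_ear, right_ear])
```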
Did it fix the perceived center elevation problem? Yes, it did! However, fixing the phantom center this way had a negative effect on the width of the perceived sound scene: instead of appearing wide in front of me, the left and right sides now collapsed close to my ears. Why is that? This reminded me of the dilemma that people often bring up on audio forums: why do speaker listeners want to reduce crosstalk, while headphone listeners often want to add it, via cross-feed circuits or plugins? The answer is that they are tackling different problems.
As I noted earlier, XTC aims to fix the coloration and fuzziness of the phantom center by attenuating the contra-lateral audio paths from the stereo speakers to the ears, making the sound waves reaching the ears closer in character to those of a "real" center source in front of the listener's head.
Cross-feed, on the other hand, mostly fixes the reproduction of lateral sources in headphones. Consider hard-panned dry sources that exist in only one of the stereo channels: these sound very unnatural on headphones because only one ear receives their signal, resulting in "inside the head" localization. In contrast, a real lateral sound source outside of the listener is always heard by both ears, with natural attenuation from head shadowing and a time-of-arrival difference.
So it seems that we need to decompose our stereo signal and separate the "phantom center" signals from the lateral signals. In multi-channel and object-based audio scene representations this decomposition is given, but for stereo sources we have to do some work. I decided to employ an approach similar to the one described in the post Headphone Stereo Improved, Part III. I separated the stereo stream into 3 components:
- mostly correlated components: the "phantom center";
- mostly uncorrelated components: lateral signals created by hard
left/right amplitude panning;
- the rest: components lacking strong correlation, or anti-correlated:
the ambience.
I thought I could use a multichannel upmixer for this. However, after experimenting with the free SpecWeb tool set and the inexpensive Waves UM225, I realized that although upmixers use conceptually the same approach for component separation, their end goal is a bit different because they target a multi-channel speaker system. They are designed to "spread" virtual sources softly between pairs of speakers (for example, the phantom center is also "translated" into some energy in the left and right channels), whereas I need to extract it in an almost "solo" fashion. Also, in multichannel setups there are typically no dedicated channels for "ambience", so ambient components are also spread across all channels. It is possible that with some practice I could set up an upmixer to avoid this spreading and do what I need, but I decided to leave that for later.
So instead, I decided to use the Bertom "Phantom Center" plugin for this operation. But how do we extract the lateral sources? While the phantom center is composed of fully correlated components, the "residue" (the non-correlated components) is a mix of lateral and "diffuse" sounds. So I came up with the following topology for extracting lateral sounds, which uses both the "Phantom Center" plugin and the Mid-Side approach:

The idea is that if we invert one of the channels and process the result via the "Phantom Center", it will extract the anti-correlated, "diffuse" components. This way we can separate them from the lateral components and end up with the 3 sound "streams" that I enumerated above. To illustrate the result, here is how this plugin setup separates, based on their correlation, a set of Dirac pulses corresponding to different source positions:

If you want to refresh your understanding of interchannel correlation, please refer to my old post On Mid/Side Equalization. Of course, this decomposition only works correctly for amplitude-panned sources, because the correlation meter in the "Phantom Center" plugin uses a zero-lag setting; however, in practice this approach yields good results for stereo recordings. Note that in the end I settled on a 98% setting for the "Phantom Center" to avoid sharp transitions between the streams.
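Structurally, the decomposition can be sketched as follows (a rough Python illustration of the topology only: the extract_center callable stands in for the "Phantom Center" plugin, whose internals are not reproduced here, and the actual Reaper routing of course differs):

```python
def decompose_stereo(left, right, extract_center):
    """Split a stereo pair into center / lateral / ambience streams.

    extract_center(l, r) is a stand-in for the "Phantom Center" plugin:
    given a stereo pair, it returns the correlated ("center") component
    of each channel.
    """
    # 1. Correlated components -> the phantom center stream.
    center_l, center_r = extract_center(left, right)

    # 2. Remove the center; what remains is lateral plus diffuse content.
    resid_l, resid_r = left - center_l, right - center_r

    # 3. Invert one residual channel: anti-correlated ("diffuse") content
    #    becomes correlated and is picked up by the same extractor.
    amb_l, amb_r_inv = extract_center(resid_l, -resid_r)
    amb_r = -amb_r_inv

    # 4. The remainder of the residue is the hard-panned lateral stream.
    lat_l, lat_r = resid_l - amb_l, resid_r - amb_r
    return (center_l, center_r), (lat_l, lat_r), (amb_l, amb_r)
```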
So, from this preprocessor we get 3 pairs of outputs (the aforementioned 3 streams). Each pair is processed independently, and moreover, each channel of the pair has its own speaker-to-binaural processing path, which yields 4 channels per stream. Thus, at the output we have 12 channels, each representing a certain component of the stereo field, as rendered via a particular speaker, on the path to a particular ear. This gives us full control over how to mix these components for binaural playback and allows us to use both XTC and cross-feed at the same time, each applied to the proper kind of acoustic source. I ended up with the following mixing matrix in Reaper:

From left to right, the first 4 channels are the center: left speaker to left ear and to right ear, then right speaker to left and right ear. The next 4 channels are the lateral components, in the same order, and the last 4 channels represent the "ambience".
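Conceptually, the 12-to-2 mix boils down to a pair of per-stream gains for the ipsi- and contra-lateral paths. A sketch of that structure is below; only the -6 dB crosstalk attenuation for the center comes from the text, the other numbers are placeholders that merely show which knobs exist:

```python
import numpy as np

# Per-stream gains for the ipsi- and contra-lateral paths, in dB.
STREAM_GAINS_DB = {
    "center":   {"ipsi": 0.0, "contra": -6.0},  # partial XTC
    "lateral":  {"ipsi": 0.0, "contra":  0.0},  # keep crosstalk (cross-feed)
    "ambience": {"ipsi": 0.0, "contra": +3.0},  # counteract head shadowing
}

def db_to_lin(x):
    return 10.0 ** (x / 20.0)

def mix_streams(streams):
    """streams[name] = (ls_to_l, ls_to_r, rs_to_l, rs_to_r), i.e. the 4
    speaker-to-ear signals of that stream (12 channels in total).
    Returns the final 2-channel binaural mix."""
    left = right = 0.0
    for name, (ll, lr, rl, rr) in streams.items():
        gi = db_to_lin(STREAM_GAINS_DB[name]["ipsi"])
        gc = db_to_lin(STREAM_GAINS_DB[name]["contra"])
        left = left + gi * ll + gc * rl
        right = right + gi * rr + gc * lr
    return np.stack([left, right])
```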
As a bonus, I realized that by defeating the effect of head shadowing for the ambient components, that is, by boosting their contra-lateral paths, I can achieve even better externalization of the virtual sound scene. In my previous spatializer I achieved a similar effect by simply boosting the uncorrelated components.
I made the final adjustments to the balances by listening to correlated and anti-correlated pink noise, making sure that both sound centered. I was left wondering why the required interchannel balance is not symmetric. My hypothesis is that, first, the use of non-individual HRTFs may cause this, and second, it may be due to the not fully symmetric speaker setup in my room (see my earlier posts on LXdesktop). In the future I will try to correct this by building a better speaker setup.
Fixing the Bass
Of course, as audiophiles we always enjoy rich and deep bass, and headphone makers usually try to add more bass to their headphones. As I and other people found while comparing the "raw" stereo sound to the binaural rendering, the latter was noticeably lacking in bass. That seemed strange to me, considering that I have a good subwoofer and never feel a lack of bass when listening to my LXdesktop setup. Simply boosting the bass of the binaural renderer's output led to excessive on-the-head vibration of the headphone drivers, which ruined the externalization effect. This needed to be done some other way, I decided.
After reading a bit more about the implementation of the BinauralDecoder, I noted that it uses the MagLS approach for interpolating between the sampling points of the measured HRTFs. This approach is intended to minimize amplitude differences only. Although the authors say that it is only applied starting at 2 kHz, which may imply that interaural time differences are preserved for the frequencies below, I decided to check what would happen if I explicitly added them.
Since I have separate signal paths for the left and right ear, I decided to employ my "almost linear phase" ITD filters, and I was not disappointed: the sense of good deep bass returned to my binaural renderer! Interestingly, these filters have a flat amplitude response; they do not boost the energy of the bass at all. Yet, somehow, adding a correct phase shift between the ipsi- and contra-lateral ears makes the bass be perceived as stronger. While A/B switching between the filter and no-filter configurations, I realized that this phase shift perhaps allows the auditory system to "focus" on the bass and perceive it as coming from a compact source, whereas mostly in-phase bass creates an impression of ambient rumbling and is not perceived as being as strong, even at the same energy level.
After some experiments with the cutoff frequency, I ended up with 500 Hz for the "center" source, 750 Hz for the lateral components, and no ITD filtering for the ambience. Raising the cutoff frequency, or trying to apply the filter to the ambient component, moved the virtual sources closer to the face, which I did not like.
One technical issue that the use of this block creates is added latency. Since the filters are symmetric, linear-phase in style, and need good resolution in the bass region, they create a delay of 170 ms. And since they have to be placed after the BinauralDecoder, this latency affects head tracking.
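The latency follows directly from the symmetric structure: the group delay of a linear-phase FIR is half its length. As a quick, purely illustrative check (the sample rate and tap count are assumptions, not the actual filter parameters):

```python
fs = 48000        # assumed sample rate
n_taps = 16384    # hypothetical length giving good bass resolution
latency = (n_taps - 1) / 2 / fs   # ~0.17 s, on the order of the 170 ms above
```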
Adjusting the Tonality
It's never easy to tune an audio reproduction system to the ideal tonality (does one even exist?). My binaural renderer is of course no exception to this rule. The first difficulty in obtaining the right tuning is that there are many uncertainties in how I captured the speakers and the room, and also in how the captured system is rendered via the binaural renderer and the headphones. The second is that the perceived tonality changes depending on how the brain perceives the location, the size of, and the distance to the virtual sources.
So we don't know how precise our measurements are, and we need to use our perception for tuning as well. However, in order to do that efficiently, we would like to be able to make instant comparisons with some reference. One good reference I found, thanks to Archimago's post, was the binaural version of the "Touch Yello" album, released as a Dolby Atmos remaster in 2025 on a Blu-ray. The binaural version sounded quite good when listening with the Sennheiser HD600, so I decided to use a frequency-domain measurement of it as a reference for fine-tuning my binaural chain.
The Blu-ray contains both the stereo and the binaural versions, so I was able to measure frequency curves both for the binaural rendering of the stereo version via my chain and for the original binaural version. Below is a comparison of the ERB-smoothed curves. The FR of the original binaural version is in blue and red, and the FR of my version is in orange and light teal:

It can be seen that the official binaural version has a somewhat V-shaped tuning (raised bass and treble, with a dip in the mids) compared to my rendering. My initial plan was to try to match them as closely as possible. However, as I quickly understood, because my rendering sounds farther from the listener than the original binaural version, their spectral shapes can't simply be the same. Instead, my approach was to find the regions where there is a significant difference and then try adjusting these bands while listening to the changes in tonality and perception. The goal was to obtain a more natural tonality for my rendering and to minimize the change in perceived tonality when switching back and forth between my version and the "official" binaural one.
In the end, my version sounds more spacious and is better externalized, while the original binaural rendering sounds closer to the face and is much "denser". You can compare the results yourself by using the YouTube and Google Drive links below (the Drive version uses AAC at 320 kbps, while YouTube transcoded it into Opus at 140 kbps). Note that although my rendering is made specifically for the Sennheiser HD600, you can still use any reasonable headphones to check it; I even used Apple EarPods for some testing! Just one note: if you are listening on modern headphones that support "spatial audio", make sure you turn it off and use the plain stereo mode:
(Of course, these are provided for educational or personal use
only).
A question one can ask is: what is it, besides the frequency balance, that makes the original binaural rendering be perceived very close to the face, while my rendering sounds much more externalized? One answer I have found comes from an objective measurement of the interaural cross-correlation (IACC). Below are two graphs comparing the IACC of these two binaural versions around the 0:30 time position:


We can see that my rendering is much less correlated in the high-frequency region starting from 2 kHz, which corresponds to a more spacious listening experience. IACC is one of the metrics used by acousticians for objectively comparing the sound of different concert halls (see the book by Y. Ando and P. Cariani, "Auditory and Visual Sensations") and different microphone setups (see the book by E. Pfanzagl-Cardone, "The Art and Science of Surround and Stereo Recording").
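In the graphs above, IACC is evaluated per frequency band; for reference, here is a sketch of the usual broadband definition used in room acoustics (the peak of the normalized interaural cross-correlation within about a millisecond of lag), assuming a 48 kHz sample rate:

```python
import numpy as np

def iacc(left, right, fs=48000, max_lag_ms=1.0):
    """Interaural cross-correlation coefficient of a binaural pair: the
    maximum of the normalized cross-correlation over lags of +/- max_lag_ms.
    Band-limit the inputs first to obtain per-band values."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = 0.0
    for k in range(-max_lag, max_lag + 1):
        l_seg = left[max(0, -k): len(left) - max(0, k)]
        r_seg = right[max(0, k): len(right) - max(0, -k)]
        best = max(best, abs(np.sum(l_seg * r_seg)) / norm)
    return best
```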
Another "source of truth" for the tonality that I have found are
recordings produced by Cobra
Records. Whereas the Yello binaural production was a result of
binaural rendering done from Dolby Atmos master, recordings done at
Cobra are real acoustical recordings done
simultaneously by conventional multi-mic setups for stereo and
surround, and using the KU-100 head for the binaural version. If you
recall, the IEM BinauralDecoder plugin is also based on KU-100 free
field HRTFs, and thus comparing the rendering of Cobra's stereo records
processed by my chain with their binaural versions makes quite a fair
apples-to-apples comparison.
Unfortunately, I do not know which headphones their binaural version is intended for. I can imagine it should be some diffuse-field equalized headphones. So, as an example, here is an excerpt from the "Extemporize" piano album, where my chain is rendered for the HD800, again both as YouTube and "offline" files:
One thing I noticed when comparing my binaural rendering with Cobra's binaural recording is that the latter, for some reason, has the left and right channels swapped; I fixed that for my comparison test. The difference between these recordings/renderings is more subtle than with the Yello track; the stereo recording is really good by itself! Still, I hope you get a similar experience of the sound moving away from the head when listening to the rendering made via my processing chain.
For completeness, this is a similar comparison of ERB-smoothed
frequency responses of these renderings:

Again, it's a pity that the producers of these binaural recordings do not specify which headphones they would recommend for listening to them. The page at the Cobra Records site says "any brand or style will work (nothing fancy required!)", which to me sounds too generous. Yes, you can hear a difference using any headphones, but for actually experiencing "being there," the tonal balance of the headphones used for binaural reproduction is very important.
Using More Headphones
for Auralization
With the equalization that needs to be applied to headphones in order to achieve the "natural" tuning for immersive playback, all of them start to have a similar tonality. Yet the impression we get when listening to them is still not the same: even with the same EQ target, different headphones create different listening impressions due to differences in their drivers and in their interaction with our ears.
Another important aspect of headphones, apart from how they sound, is how they feel on the head: the clamping force, the size of the earpads, the weight, etc. I have found the Sennheiser HD800 and Shure SRH1840 to be very comfortable for long listening sessions. Unfortunately, however, neither of them is on the list of headphones originally measured by B. Bernschütz on the KU-100, and thus they are absent from the headphone EQ list of the IEM BinauralDecoder plugin.
However, I'm lucky to have access to a state-of-the-art headphone measurement system, the B&K 5128, so I used it to derive EQ filters that turn the HD800 into the HD600, which is on the BinauralDecoder list. Note that EQing headphones to sound like some other model is actually a non-trivial task. At the last NAMM convention I had a conversation with a representative of TiTumAudio, a company that makes headphones which can imitate a number of commercial headphones. He noted that truly copying the sound of other headphones requires tweaks that go beyond simple LTI processing (that is, EQing). This is why I specifically chose the HD600 as the target for the HD800: their drivers are probably the closest in their non-linear properties, compared to headphones from other makers. In a similar fashion, I use the AKG K240DF (which is on the BinauralDecoder list) as the target for other AKG headphones, the Beyerdynamic DT990 as the target for other Beyers, and the discontinued Shure SRH940 as the target for more modern Shure models.
One technical challenge I encountered when creating the filters for these conversions is that the measurements for the left and right ears of the B&K 5128 never match exactly. Instead of using an average of the left and right ears, I decided to leave them different. However, in order to avoid distorting the phase relationship between the left and right channels, I made these conversion filters linear-phase (8k taps). Since that creates an extra delay, I put this correction block before the SceneRotator so that it does not add to the latency of head tracking.
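For reference, here is a rough sketch of how such a conversion filter can be derived from two magnitude responses measured on the same rig (a simplified frequency-sampling design with an assumed 48 kHz sample rate; not the exact procedure or smoothing I used):

```python
import numpy as np

def conversion_filter(freqs, mag_src_db, mag_dst_db, fs=48000, n_taps=8192):
    """Linear-phase FIR that EQs the `src` headphone toward the `dst`
    headphone, built from their measured magnitude responses (one ear).

    freqs                  : measurement frequencies in Hz
    mag_src_db, mag_dst_db : magnitude responses in dB at those frequencies
    """
    n_bins = n_taps // 2 + 1
    grid = np.linspace(0.0, fs / 2.0, n_bins)
    diff_db = np.interp(grid, freqs, mag_dst_db - mag_src_db)
    mag = 10.0 ** (diff_db / 20.0)
    # Zero-phase prototype from the magnitude-only spectrum, then shift it
    # to the middle for a causal, symmetric (linear-phase) response.
    h = np.fft.irfft(mag, n=n_taps)
    h = np.roll(h, n_taps // 2) * np.hanning(n_taps)
    return h
```

A separate filter is computed for each ear, which preserves the measured left/right difference while keeping the interchannel phase relationship intact.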
Complete Processing Chain
Summarizing the processing blocks mentioned in the previous sections,
this is what I ended up with:

It may look big, but in the end all of these components are necessary. Realistic binaural rendering is not an easy task!
Music Tracks Used for
Evaluation
While iterating on my spatializer, and also while preparing to demo it to other people, I have come up with a list of songs available on Apple Music that have good spatial properties. They represent different musical styles: classical, pop, electronic, metal, etc. They also represent different recording styles: acoustic recordings, engineered stereo recordings, and some modern Dolby Atmos tracks (rendered into stereo).
Roughly, I can classify them into several categories based on which attributes of the reproduction they help to test. Of course, some tracks belong to several categories.
- Great stereo acoustic recordings with natural vocals and instruments
- All Roads to the River / Breaking Silence by Janis
Ian
- Also sprach Zarathustra, Op. 30, TrV 176 by Richard Strauss
- Grandmother / The Raven by Rebecca Pidgeon
- L'Égyptienne / Les Sauvages by Béatrice Martin
- No Flight Tonight / from Chesky Records 10 Best, by
The Coryells
- Pipeline / Two Doors by Michael Shrieve
- The Firebird Suite (1919 Version): V. Infernal Dance of King Kaschey
by Igor Stravinsky
- The Wrath of God: Pt. 1 by Sofia Gubaidulina
- Violin Concerto No. 1: III. Quarter note by Philip Glass
- Engineered records with strong emphasis on spaciousness
- An Echo of Night / The Pearl by Brian Eno &
Harold Budd
- An Ending (Ascent) / Apollo: Atmospheres and
Soundtracks by Brian Eno
- Animal Genesis / Oxymore by Jean-Michel Jarre
- Barco / Insen by Ryuichi Sakamoto & Alva
Noto
- Contrapunctus 8, A 3 / Laibachkunstderfuge by
Laibach
- Day One (Interstellar Theme) / Interstellar (OST)
by Hans Zimmer
- Get Your Filthy Hands Off My Desert / The Final Cut
by Pink Floyd
- High Hopes / The Division Bell by Pink Floyd
- Resonance / Resonance by Boris Blank
- Ripples in the Sand / Dune (OST) by Hans
Zimmer
- The Snake and the Moon / Spiritchaser by Dead Can
Dance
- Troubled / Passion (The Last Temptation of Christ
OST) by Peter Gabriel
- Synthesized stereo scenes, not as "spacious" but still interesting
- Another One Bites the Dust / The Game by Queen
- Birds / Nameless by Dominique Fils-Aimé
- Bubbles / Wandering—EP by Yosi Horikawa
- Jeremiah Blues (Part 1) / The Soul Cages by
Sting
- Me or Him / Radio K.A.O.S. by Roger Waters
- On the Run / The Dark Side of the Moon by Pink
Floyd
- Rocket Man / Honky Château by Elton John
- Space Oddity / David Bowie (aka Space Oddity) by
David Bowie
- The Invisible Man / The Miracle by Queen
- Voice of the Soul (1996 Demos) (Instrumental) / The Sound of
Perseverance by Death
- What God Wants, Pt. I / Amused to Death by Roger
Waters
- Not quite "spacious" but with good "visceral impact"
- Dyers Eve / ...And Justice for All by
Metallica
- Flint March / Small Craft On A Milk Sea by Brian
Eno, Jon Hopkins & Leo Abrahams
- Heatmap / Warmech by Front Line Assembly
- Lie to Me / Some Great Reward by Depeche Mode
- Single Blip / Ssss by VCMG
When demoing, I realized that most of these tracks are unknown to most people. They generally chose either Hans Zimmer's tracks or Pink Floyd, and for some reason "Rocket Man" was also popular.
Remaining Issues (Future
Work)
Essentially, I have two kinds of problems. One kind stems from the fact that instead of using artificial models of the speakers and the room, I use a capture of real speakers in a real room, so any of its flaws get aggravated by the processing and by listening on headphones. The second kind lies purely in the DSP domain and hopefully can be fixed more easily.
Fixing Speaker Setup
and Room Asymmetry
As noted in the section "Improving Phantom Center Naturalness", my Ambisonic capture of the room is not perfectly balanced, thus requiring some correction. This is not a problem of the capture process itself, but rather the fact that it captures the imperfections of my setup, which get aggravated by headphone listening. Having a better, more symmetric setup should help.
Reducing Room Ringing
I encountered this issue when listening to a recording of a male vocal: the opera tenor Joseph Calleja performing "I Lombardi" by Verdi. I experienced a very uncomfortable sensation of "ringing" and "compression" in the sound of Calleja's singing. I compared my rendering to the original recording and noticed that some of these artefacts are already there, due to the reverberation of the hall where the recording was made. Then I listened on the speakers and noted that these artefacts are even more pronounced due to the reverberation added by my room. I think their primary sources are comb filtering and flutter echo interacting with the harmonics of the singer's voice.
I realized that if I had invited Joseph Calleja to actually sing in my room, I would likely hear this compression and ringing as well. I recalled that I can indeed notice these artefacts when listening to live vocals while sitting in acoustically mediocre halls.
What can I do about that? Ideally, I would like to reduce the reverberation of my room captured in the IRs and treat the reflections. However, it's not easy to apply this cleanup post hoc to already captured IRs. I decided that next time I will probably put some sound-absorbing materials behind the microphone in order to produce somewhat more "dead" IRs.
Achieving
Better Quality of Stereo Field Decomposition
As noted in the paper by E. Vickers, "Frequency-Domain Two- to Three-Channel Upmix for Center Channel Derivation and Speech Enhancement", frequency-domain audio processing may produce certain artifacts, often described as "musical noise" or a "watery sound." This is indeed what I hear when I decompose pink noise into correlated, anti-correlated, and uncorrelated components and listen to each of them separately. When the processed sound is combined back together, these artifacts are mostly masked; however, they may still pop up when listening to music with lots of transients. Ideally, I would like to find a more "high-fidelity" way of decomposing the stereo sound field.
I contacted Tom from Bertom Audio regarding the artefacts produced by the Phantom Center plugin, and his answer was that, unfortunately, nothing can be done in the current version of the plugin to get rid of them completely. So a possible solution may be to study how to achieve the same decomposition using one of the expensive high-quality upmixing plugins.
Solving the Latency Problem of ITD Filters
As I previously noted, the ITD filters, which are required for good bass reproduction when the binaural rendering is done via the IEM BinauralDecoder, add noticeable latency. So I either need to find a binaural renderer for Ambisonics that produces similar inter-aural phase on its own, or re-create the filters in a mixed-phase fashion with much lower latency.