Stereo Setup with LXdesktop
Now let’s experiment with real speakers in a real room. For this experiment, I re-arranged my LXdesktop setup into a more conventional placement and orientation. Because my desktop setup sits close to the walls, I positioned the speakers to minimize strong reflections in the first 15 ms. The best configuration delayed the first strong reflection to 12 ms, resulting in a 42° speaker angle and a 74 cm listening distance. Below are ETC graphs for the left and the right speaker:
Next, I loaded the original LXmini filters into a convolution plugin hosted by Reaper. I used a physical center speaker, a Genelec 8331A, to find the right room curve, first equalizing it for a flat response. Then, starting from the “Harman target curve” that I obtained from ASR, I slightly adjusted the high-frequency roll-off to compensate for the near-field speaker location. The resulting “nearfield Harman” target curve looks like this:
I then used this curve to equalize both LXdesktop speakers using simple IIR filters generated by REW. I used a microphone oriented towards the ceiling in order to avoid a coincidental high-frequency bump for the center speaker:
(This photo is from early experiments, when the LXdesktop speakers were still in their original “wide” setup.) After experimenting with the Genelec, I realized that it sounds a bit different from the LXdesktop, likely due to the difference in radiation pattern. So I built a third LXdesktop speaker to eliminate this difference. The resulting test setup is below:
This is how all three speakers are tuned with respect to the target curve (with a 15-cycle Frequency-Dependent Window (FDW) applied):
As you can see, they are tuned to measure as closely as possible to one another.
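A quick note on the FDW for readers less familiar with it: a window of N cycles keeps N periods of each frequency, so its length at frequency f is N/f. The tiny sketch below (the function name is mine, and I'm assuming the window width is defined in total cycles, as in REW) shows how a 15-cycle window stays long in the bass and becomes very short in the treble:

```python
def fdw_window_ms(freq_hz: float, cycles: float = 15.0) -> float:
    """Length (ms) of a frequency-dependent window spanning `cycles` periods at freq_hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 * cycles / freq_hz


if __name__ == "__main__":
    for f in (100.0, 1000.0, 10000.0):
        print(f"{f:>7.0f} Hz: {fdw_window_ms(f):6.1f} ms")
```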
Notes on Loudness Equalization Approaches
A phantom center and a discrete center equalized to the same linear frequency response do not sound the same. We are interested in matching their perceived sound, not the frequency response of their linear model as measured by an omnidirectional microphone. Even if I equalize the centers based on a conventional frequency response measurement using an anthropometric KU-100 dummy head or binaural microphones, the perceived sound will still not match exactly. After all, non-linear effects, the radiation pattern, and the resulting room reflections play a very important role in the final perception of the sound image, and they are not the same for a physical center produced by a single speaker and a phantom center produced by a pair of speakers. The difference is even more pronounced in the diffuse field scenario.
We can try to approach the equalization in a controlled way from a subjective perspective. The method proposed by D. Griesinger for headphone equalization, and implemented in his Sonic Focus app, can also be employed for matching the perceived sound of speakers. In fact, it is even more straightforward for speakers. Griesinger’s idea is that a listener can hear short 1/3-octave-filtered bursts of pink noise and compare their loudness. For headphone tuning, the listener compares the loudness of the current band with that of a reference band (500 Hz), which results in an equal-loudness contour: a subjective EQ. Others and I have tried this approach (see this MSc thesis by T. Kinnunen). It requires significant effort and yields inconsistent results.
With speakers, however, we don't need to build a loudness contour. We can simply switch rapidly between the reference and target speakers and compare the loudness of each 1/3-octave band directly; I will refer to this later as the “modified Griesinger's method.” The results of this tuning are consistent, but it still requires time and effort to go over every 1/3-octave band and meticulously tune the equalizer for it. In addition, one needs to go back and forth multiple times, because the bands of human hearing do not necessarily match this fixed 1/3-octave quantization, and tuning one band may affect the perceived loudness of adjacent bands. Still, I like the overall idea of using perceived loudness as the metric, because it takes into account many factors such as speaker non-linearity, the effects of the room, and comb filtering.
Can we measure loudness objectively, similar to frequency response?
Of course we can, to some extent, because loudness is a very important metric for hearing safety, and acousticians work hard on making the measurement process as automatic and repeatable as possible. One of the methods that is an international standard is the Moore-Glasberg model (ISO 532-2), which can be applied to tones, broadband noises, and complex sounds with emphasized spectral components. The method simulates the highly non-linear, level-dependent compression of the cochlea. There is an implementation of this method in MATLAB’s Audio Toolbox: the acousticLoudness function. As you can see from the examples, for ISO 532-2 it is possible to obtain a nice graph of loudness per frequency, which looks similar to a frequency response but is in fact a psychoacoustic measurement. The paper by V. Gunnarsson, which I referred to in the first part of this post, also employs the ISO 532-2 loudness model for assessing the discrepancy between phantom and physical sound sources.
There is a catch: loudness is measured in sones, which cannot be directly converted into the decibel values needed for equalization. There are only rules of thumb, for example “a 10 dB increase in physical energy roughly doubles the perceived loudness.” But that depends on the frequency range we are working with and possibly on the room conditions. In addition, as I mentioned, the loudness model is itself non-linear (in order to simulate human hearing correctly), so the character of the sound largely affects the resulting loudness perception.
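The rule of thumb above is the sone/phon relationship: 1 sone corresponds to 40 phon, and loudness doubles for every 10 phon. A minimal sketch of that conversion (valid roughly above 40 phon; the function names are mine) shows how a loudness ratio could be turned into a level change. Note that it yields phon, which equals dB only where the equal-loudness contours run parallel, which is exactly why this is only a rule of thumb:

```python
import math


def phon_to_sone(level_phon: float) -> float:
    """Loudness (sone) from loudness level (phon): 1 sone at 40 phon, doubling every 10 phon."""
    return 2.0 ** ((level_phon - 40.0) / 10.0)


def sone_to_phon(loudness_sone: float) -> float:
    """Inverse of phon_to_sone (valid roughly above 40 phon)."""
    if loudness_sone <= 0:
        raise ValueError("loudness must be positive")
    return 40.0 + 10.0 * math.log2(loudness_sone)


def level_change_for_ratio(target_sone: float, current_sone: float) -> float:
    """Rule-of-thumb level change (phon, roughly dB) to turn current_sone into target_sone."""
    return 10.0 * math.log2(target_sone / current_sone)


if __name__ == "__main__":
    print(phon_to_sone(80.0))                            # 16.0
    print(round(level_change_for_ratio(12.0, 10.0), 2))  # 2.63
```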
Gunnarsson sidesteps the issue of translating sones into decibels for equalization by not actually using the loudness model to calculate the correction filters. Instead, he calculates the filters using the composite summed power spectra of both ear signals. Please refer to his paper for more details on this approach.
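As I understand the idea, it boils down to comparing the per-band composite power P_L + P_R between the reference and the test condition. Here is a minimal sketch under that reading (not Gunnarsson's actual code), assuming the per-band powers have already been extracted from the two ear signals, for example with 1/3-octave filters; the function names are hypothetical:

```python
import math
from typing import List, Sequence


def composite_level_db(left_power: Sequence[float], right_power: Sequence[float]) -> List[float]:
    """Per-band level (dB) of the summed power of both ear signals: 10*log10(P_L + P_R)."""
    return [10.0 * math.log10(pl + pr) for pl, pr in zip(left_power, right_power)]


def composite_correction_db(ref_left, ref_right, test_left, test_right) -> List[float]:
    """Per-band gain (dB) that brings the test condition's composite power to the reference."""
    ref = composite_level_db(ref_left, ref_right)
    test = composite_level_db(test_left, test_right)
    return [r - t for r, t in zip(ref, test)]


if __name__ == "__main__":
    # Toy band powers (linear units): the test condition is 3 dB low in the first band.
    print(composite_correction_db([1.0, 1.0], [1.0, 1.0], [0.5, 1.0], [0.5, 1.0]))
```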
Instead of using a proxy measurement for loudness, I decided to automate loudness matching itself. In some sense, this is similar to manual loudness matching by ear, as in the modified Griesinger’s method. But instead of doing it manually, we solve a minimization problem: we apply an equalizer and adjust the gains of its bands until the difference between the two loudness curves falls within the desired tolerance. Basically, this is still “matching by ear,” but performed by a computer.
There is another hurdle. The ISO 532-2 model expects a measurement microphone with a flat transfer function and internally applies its own pinna gain. But we are using a KU-100 dummy head to correctly capture stereo crosstalk, which means our measurement is already pre-emphasized by the KU-100's physical pinnae. Since the loudness model is non-linear and mimics the effects of masking and cochlear compression, this double pinna filtering corrupts the result in a non-obvious way. Unlike with a purely Linear Time-Invariant (LTI) system, we cannot assume that the pinna filter cancels out when we subtract the two curves.
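To illustrate the structure of such a matching loop, here is a toy sketch. The loudness model in it is a made-up stand-in (excitation leaking into neighbouring bands, followed by a compressive power law), not ISO 532-2, and the update step reuses the 10 dB per doubling rule of thumb as a deliberately conservative step size. All names are hypothetical:

```python
import math
from typing import List, Sequence


def toy_band_loudness(power: Sequence[float], spread: float = 0.1,
                      exponent: float = 0.23) -> List[float]:
    """Made-up stand-in for a loudness model (NOT ISO 532-2): each band's
    excitation leaks into its neighbours, then is compressed by a power law."""
    n = len(power)
    out = []
    for i in range(n):
        excitation = power[i]
        if i > 0:
            excitation += spread * power[i - 1]
        if i < n - 1:
            excitation += spread * power[i + 1]
        out.append(excitation ** exponent)
    return out


def match_loudness(ref_power: Sequence[float], src_power: Sequence[float],
                   tol_ratio: float = 1e-4, max_iter: int = 200) -> List[float]:
    """Hypothetical matcher: per-band gains (dB) for src so that its toy
    loudness matches ref's within tol_ratio."""
    target = toy_band_loudness(ref_power)
    gains_db = [0.0] * len(src_power)
    for _ in range(max_iter):
        eq_power = [p * 10.0 ** (g / 10.0) for p, g in zip(src_power, gains_db)]
        ratios = [t / c for t, c in zip(target, toy_band_loudness(eq_power))]
        if max(abs(r - 1.0) for r in ratios) < tol_ratio:
            break
        # Rule of thumb: +10 dB per doubling of loudness. For this toy model it
        # undershoots slightly, which keeps the iteration stable.
        gains_db = [g + 10.0 * math.log2(r) for g, r in zip(gains_db, ratios)]
    return gains_db


if __name__ == "__main__":
    gains = match_loudness([1.0, 2.0, 0.5, 1.0], [1.0, 1.0, 1.0, 1.0])
    print([round(g, 2) for g in gains])
```

Because this toy model operates on the same band powers that the EQ adjusts, its solution collapses to plain level matching; the differences between loudness and linear matching discussed in this post come from the real model, which this sketch does not reproduce.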
That means, before feeding any readings from the KU-100 into MATLAB’s acousticLoudness, we first need to remove the pinna filtering. Thankfully, there is an easy approach for this called free-field equalization (FF-EQ). We basically measure the transfer function of the KU-100 for a frontal sound source, smooth it, and use it as a calibration function. As a result, the output from a physical center is seen as a flat line. Well, not entirely flat, because we do not intend to remove the frontal notch fully, as this would inevitably introduce coloration and skew the results from the loudness model. After applying FF-EQ, the outputs from all other directions around the KU-100 are relative to the frontal source.
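A minimal sketch of the FF-EQ step, assuming the frontal KU-100 response is available as magnitudes in dB on a frequency grid. The 1/3-octave power smoothing and the boost cap (my hypothetical way of only partially filling the frontal notch) are assumptions; the actual smoothing and limits may differ:

```python
import math
from typing import List, Sequence


def octave_smooth_db(freqs: Sequence[float], mag_db: Sequence[float],
                     fraction: float = 3.0) -> List[float]:
    """Fractional-octave smoothing: power average over +/- 1/(2*fraction) octave."""
    half = 2.0 ** (1.0 / (2.0 * fraction))
    out = []
    for f0 in freqs:
        lo, hi = f0 / half, f0 * half
        powers = [10.0 ** (m / 10.0) for f, m in zip(freqs, mag_db) if lo <= f <= hi]
        out.append(10.0 * math.log10(sum(powers) / len(powers)))
    return out


def ffeq_correction_db(freqs: Sequence[float], frontal_db: Sequence[float],
                       max_boost_db: float = 6.0, fraction: float = 3.0) -> List[float]:
    """Hypothetical FF-EQ: invert the smoothed frontal dummy-head response,
    capping boosts so the frontal notch is only partially filled."""
    smoothed = octave_smooth_db(freqs, frontal_db, fraction)
    return [min(-s, max_boost_db) for s in smoothed]


if __name__ == "__main__":
    freqs = [20.0 * 2.0 ** (k / 12.0) for k in range(120)]  # semitone grid from 20 Hz
    # Toy frontal response: flat with a made-up deep notch between 8 and 10 kHz
    frontal = [-15.0 if 8000.0 < f < 10000.0 else 0.0 for f in freqs]
    corr = ffeq_correction_db(freqs, frontal, max_boost_db=6.0)
    print(round(max(corr), 2))
```

The resulting correction (in dB) would then be added to every ear recording before it is passed to the loudness model.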
I used Claude Code to help me write the automated loudness matching MATLAB script. Claude wrote a function loudnessmatch, which takes as parameters:
- the SPL of the “reference” FF-EQ’d binaural recording (because perceived loudness depends on it; recall the equal-loudness contours);
- the reference recording itself (for example, the recording from a physical center speaker);
- the recording of the source system that we need to equalize to the reference.
The function then uses a 29-band 1/3-octave IIR equalizer (similar to the one Griesinger’s DGSonicFocus app employs) and carefully minimizes the difference in calculated loudness between the reference and the EQ’d source recordings. The function outputs both the values of the equalizer bands and the impulse response of the filter.
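The script's exact filter design isn't shown here, so as an assumption, here is how such a 29-band equalizer could be built from peaking biquads following the RBJ Audio EQ Cookbook, with base-10 1/3-octave centers from about 25 Hz to 16 kHz. The band range, Q, and function names are my guesses:

```python
import cmath
import math
from typing import List, Sequence, Tuple

Biquad = Tuple[List[float], List[float]]  # (b, a), normalised so a[0] == 1


def third_octave_centers(n_bands: int = 29, first_index: int = -16) -> List[float]:
    """Base-10 1/3-octave centres 1000*10^(k/10); k = -16..12 spans ~25 Hz to ~16 kHz."""
    return [1000.0 * 10.0 ** (k / 10.0) for k in range(first_index, first_index + n_bands)]


def peaking_biquad(f0: float, gain_db: float, fs: float,
                   bandwidth_oct: float = 1.0 / 3.0) -> Biquad:
    """Peaking EQ section from the RBJ Audio EQ Cookbook."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    # Q from the analog bandwidth-in-octaves relation (about 4.32 for 1/3 octave)
    q = math.sqrt(2.0 ** bandwidth_oct) / (2.0 ** bandwidth_oct - 1.0)
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    return [x / a[0] for x in b], [x / a[0] for x in a]


def cascade_response_db(sections: Sequence[Biquad], fs: float, f: float) -> float:
    """Magnitude (dB) at frequency f of all sections in series."""
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 evaluated on the unit circle
    total = 0.0
    for b, a in sections:
        h = (b[0] + b[1] * z1 + b[2] * z1 ** 2) / (a[0] + a[1] * z1 + a[2] * z1 ** 2)
        total += 20.0 * math.log10(abs(h))
    return total


if __name__ == "__main__":
    fs = 48000.0
    centers = third_octave_centers()
    gains = [0.0] * len(centers)
    gains[16] = -2.0  # index 16 is k = 0, the 1 kHz band
    eq = [peaking_biquad(f, g, fs) for f, g in zip(centers, gains)]
    for f in (800.0, 1000.0, 1250.0):
        print(f, round(cascade_response_db(eq, fs, f), 2))
```

Each peaking section reaches exactly its set gain at its center frequency, but adjacent sections overlap, which is one reason the band gains have to be found iteratively rather than read off directly.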
Anechoic Simulation of Loudness Matching
Before moving on to measuring and adjusting loudness between physical speakers, I decided to run a quick test on the same Ambisonics simulation of the KU-100. It was interesting to compare the difference between impulse responses (the linear model) with the equalization needed to match loudness (the non-linear model of human hearing).
This is what we have for the physical vs. phantom center:
And this is for the “diffuse field” (actually, two rear speakers):
I used 80 dB SPL as the source level. In theory, when playing audio at a substantially different level (e.g., 65 dB SPL or 95 dB SPL), the loudness matching curve would change (again, see the equal-loudness contours). Thus, unlike purely linear matching, a loudness-matched EQ is optimized for a particular playback level.
As we can see, for the physical vs. phantom center, the loudness-matching curve has its midrange peak attenuated by about 2 dB and is smoother overall. For the “diffuse field,” loudness matching also smooths the dip and makes it slightly less pronounced. In addition, the 6–10 kHz region is boosted instead of attenuated as the frequency response measurement suggests. I suspect this has something to do with the equal-loudness contours. In any case, it is interesting to find out how its prediction compares with what I can hear.
Physical Center vs. Phantom Center on Speakers
Matching the frequency response of stereo speakers to a center speaker does not make the phantom center sound like a physical center. To confirm this, I matched their levels using a measurement microphone while playing a correlated signal from the stereo pair.
Then I played various mono recordings of acoustic instruments (from Alan Parsons’ “Sound Check” CD) and pink noise, and noticed obvious differences in the sound of the physical and the phantom center:
- the phantom center sounds wider;
- it gets automatically “anchored” to the center speaker even when the center is in fact muted (I had to check myself a couple of times at first; I was sure I had forgotten to mute the center);
- with my eyes closed, the phantom center may be perceived as elevated, depending on the sound being reproduced;
- the tonality is indeed different, which is especially obvious with the pink noise signal. Interestingly, the phantom center sounds “softer” and more pleasant to my ears. Maybe that’s why a lot of music producers still prefer it to a physical center. Looking ahead, this is caused by differences in IACC, which I will explore in subsequent posts.
So, yes, the “phantom image problem” is real: even with speakers that have a narrow, directional radiation pattern, the sound from the opposite speaker creates comb filtering. However, even a single physical center is not immune to comb filtering, because its sound reflects from nearby surfaces and from the listener’s torso! This is easy to confirm by stepping towards and away from the speaker. One may argue that this comb filtering is “natural,” but that does not help me “unhear” it!
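The comb filtering follows from simple arithmetic: when a signal sums with an equal-level copy delayed by Δt, notches appear where Δt is an odd number of half periods, i.e. at f = (2k + 1)/(2Δt). A sketch, assuming an equal-level copy (a weaker reflection gives shallower notches at the same frequencies) and c = 343 m/s. It also shows why stepping towards or away from the speaker changes what I hear: the path difference, and with it every notch, moves:

```python
from typing import List

SPEED_OF_SOUND = 343.0  # m/s, roughly at 20 °C


def comb_notches_hz(path_difference_m: float, max_freq_hz: float = 20000.0,
                    c: float = SPEED_OF_SOUND) -> List[float]:
    """Notch frequencies when a signal sums with an equal-level copy that travels
    path_difference_m further: cancellation where the delay is an odd number of half periods."""
    if path_difference_m <= 0:
        raise ValueError("path difference must be positive")
    delay = path_difference_m / c
    notches = []
    k = 0
    while True:
        f = (2 * k + 1) / (2.0 * delay)
        if f > max_freq_hz:
            break
        notches.append(f)
        k += 1
    return notches


if __name__ == "__main__":
    # Illustrative extra path lengths of a reflection, e.g. before and after leaning back
    for extra in (0.20, 0.25):
        print(extra, [round(f) for f in comb_notches_hz(extra, 5000.0)])
```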
Linear Model
First, let’s compare the result of the anechoic KU-100 simulation from the first part of the post with a measurement of a KU-100 in my room:
Below is the comparison between them. The KU-100 room measurement is averaged over the left and right ears, and I applied REW’s psychoacoustic smoothing to the result. I also included the Linkwitz EQ curve:
We can see that in the room, the 300–3000 Hz segment matches the Linkwitz EQ curve more closely than the anechoic simulation with an ideal point source does. The room obviously influences the bass region, making sources at different locations produce different sound pressure, so that region is uneven. The high-frequency bump is also present in the room measurement, whereas the Linkwitz EQ lacks it.
I also measured the same setup using in-ear microphones and derived the compensation the same way. The result is close to the KU-100 measurement, even though the KU-100 is just a head without a torso:
What about stereo phantom source to physical center compensation in a real room? For reference, I used graphs from the paper by B. Shirley et al., “The Effect of Stereo Crosstalk on Intelligibility…”, which Toole cites in his book (he also reuses their graph to illustrate the “phantom image problem” in a non-anechoic room). The graph below compares the following curves: the compensation needed for the KU-100 and for me in a real room, and Shirley’s result (recall that the graphs in my posts are all “EQ graphs”: where the original transfer function has a dip, my graph shows its inversion, a hump):
We can see that in my room the crosstalk interference dip lies in a different frequency range, only partially overlapping with Shirley’s result, and it is also wider.
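For completeness, here is the arithmetic behind these “EQ graphs” as a sketch. I'm assuming the two ears are averaged in the power domain (a plain dB average is the other common choice), and the EQ curve is the physical center response minus the phantom center response, so a dip in the phantom response becomes a hump. The function names and input values are illustrative:

```python
import math
from typing import List, Sequence


def power_average_db(left_db: Sequence[float], right_db: Sequence[float]) -> List[float]:
    """Average left and right ear responses in the power domain, point by point."""
    return [10.0 * math.log10((10.0 ** (a / 10.0) + 10.0 ** (b / 10.0)) / 2.0)
            for a, b in zip(left_db, right_db)]


def eq_curve_db(physical_db: Sequence[float], phantom_db: Sequence[float]) -> List[float]:
    """EQ to apply to the phantom center so it matches the physical center.
    A dip in the phantom response becomes a hump in this curve."""
    return [p - q for p, q in zip(physical_db, phantom_db)]


if __name__ == "__main__":
    # Toy three-point responses in dB (made-up numbers)
    physical = power_average_db([0.0, 0.0, 0.0], [0.0, -1.0, 0.0])
    phantom = power_average_db([0.0, -6.0, 0.0], [0.0, -7.0, 0.0])
    print([round(v, 2) for v in eq_curve_db(physical, phantom)])
```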
Loudness Model (Non-Linear)
Now for a more interesting and practical comparison: perceived loudness. I used a 5-second excerpt of pink noise recorded from each of the sources using the KU-100, at an SPL of 80 dB in front of the head. The noise was then free-field equalized (FF-EQ) using the compensation curve obtained during the anechoic experiment. After my loudness matching script produced the compensation EQ, I loaded it into Reaper and repeated the recording and measurement to verify that the EQ indeed compensates the modeled loudness difference. It all worked out quite smoothly! Below is the graph comparing the loudness alignment needed for the phantom center vs. the physical center for the anechoic simulation, the real KU-100 in my room, and also the linear measurement of the KU-100 in my room:
We can see that the loudness-based EQ is gentler. Interestingly, the loudness matching for the anechoic simulation looks more similar to the simple linear matching from the room measurements of the KU-100.
Then, since my head differs somewhat from the KU-100, I used DGSonicFocus to check the loudness of each band between the phantom and physical center as perceived by my ears. I noticed some slight differences and applied some compensation:
The differences are quite minor, so I conclude that the ISO 532-2 model works really well in this case.
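Setting the test level amounts to scaling the signal so that its RMS pressure is 20 µPa · 10^(80/20) = 0.2 Pa. A sketch, assuming the samples are already calibrated in pascals and using Paul Kellet's widely circulated “economy” filter as an approximate pink noise generator (the function names are mine):

```python
import math
import random
from typing import List, Sequence

P_REF = 20e-6  # Pa, reference pressure for dB SPL


def pink_noise(n: int, seed: int = 0) -> List[float]:
    """Approximate pink noise: white noise through Paul Kellet's 'economy' filter."""
    rng = random.Random(seed)
    b0 = b1 = b2 = 0.0
    out = []
    for _ in range(n):
        white = rng.gauss(0.0, 1.0)
        b0 = 0.99765 * b0 + white * 0.0990460
        b1 = 0.96300 * b1 + white * 0.2965164
        b2 = 0.57000 * b2 + white * 1.0526913
        out.append(b0 + b1 + b2 + white * 0.1848)
    return out


def rms(x: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in x) / len(x))


def spl_db(pressure_pa: Sequence[float]) -> float:
    """Unweighted broadband SPL of a pressure signal in pascals."""
    return 20.0 * math.log10(rms(pressure_pa) / P_REF)


def scale_to_spl(signal: Sequence[float], target_db_spl: float) -> List[float]:
    """Scale a signal (treated as pascals) so its RMS level equals target_db_spl."""
    target_rms = P_REF * 10.0 ** (target_db_spl / 20.0)  # 0.2 Pa for 80 dB SPL
    gain = target_rms / rms(signal)
    return [v * gain for v in signal]


if __name__ == "__main__":
    fs = 48000
    noise = scale_to_spl(pink_noise(5 * fs), 80.0)  # a 5 s excerpt
    print(f"{spl_db(noise):.2f} dB SPL, RMS {rms(noise):.3f} Pa")
```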
Listening Impressions
Of course, after this meticulous matching of the phantom center, it is interesting to hear how it sounds. Since we only considered perceived loudness and attempted to fix the spectral coloration, other factors can still contribute to perceived sonic differences.
Comparing full stereo recordings is pointless here; a single physical center cannot convey the intended width and depth of the mix. So I only checked mono dry recordings of vocals and instruments from that Alan Parsons’ test CD.
The most striking difference is, of course, the perceived width. A phantom center source is always perceived as wider than a physical speaker. For some instruments, especially those playing in the lower frequency range, this creates a much more pleasant feeling of envelopment. However, for physically smaller instruments like a flute or a tambourine, this widening sounds artificial and makes the sound image more blurred. A mono recording of a snare drum completely lost its dry character when played over the stereo speakers. I suppose such sharp transients interact more actively with the room than sounds of a more stationary character.
Another difference that is easy to spot is image stability. In my near-field setup, moving the head laterally completely changes the location of a phantom center source, while the physical center, unsurprisingly, remains stable.
A more subtle difference is that dry instruments played via the physical center speaker sound a bit “punchier,” with a more “in your face” sound. Whether that is good or not depends on whether one wants “realism” or “relaxation.”
As a side note, I think these differences make it clear why music producers (as opposed to movie dialogue and FX producers) express widely varying opinions on the use of a physical center in modern multichannel and object-based recordings. On the one hand, it improves the clarity and “presence” of the rendered source; on the other hand, it can be so “real” that it becomes disturbing and unpleasant. That is why supporting the discrete source with a phantom source, or playing with the “sound width” parameter in an object-based representation to achieve the desired character of the reproduced source, becomes an art in itself.
Finally, the loudness-based equalization almost completely fixed the problem of phantom center elevation. Even with my eyes closed, to avoid visual anchoring of the sound to the center speaker, both the physical and the phantom center sources are perceived at the same height. The only exception is the pink noise signal! This is a true acid test for source similarity: not only could the tonality difference due to comb filtering still be heard, but the perceived height did not match either, with the phantom center sounding elevated compared to the physical center. We have probably reached the limit of what can be achieved with a phantom source.
Diffuse Field Simulation on Stereo Speakers
The reference source was the same pair of stereo speakers, but facing the back of the KU-100 (I turned it around). The test signal was decorrelated pink noise at the same output level. Matching loudness in this case required a more divergent EQ curve. We saw this in the anechoic simulation, and in the actual room the result is almost the same. Let’s compare the EQ curve obtained in the room with the anechoic curve and the diffuse field compensation curve proposed by Gunnarsson:
We can see that the curve from Gunnarsson’s paper is indeed a diffuse field curve, because it lacks the head interference dip.
Making any direct comparison between the “real” source and its rendering via stereo speakers is quite hard in my setup for this case. However, I relied on Griesinger’s original method of creating an equal-loudness curve. I created one for the speakers physically located behind my head (sitting with my back turned to them) playing noise bands that were uncorrelated between left and right, as a “reference.” Then I turned to face them and corrected the algorithmically produced curve to yield a similar loudness contour. Here is my final curve:
I suppose a diffuse field (in my case, not even a proper diffuse field, since it does not surround me) is perceived a bit differently from point sources, and because of that the loudness model does not exactly match my personal perception.
Then I tried several tracks that feature off-stage, behind the back, and very diffuse sources:
- left/right imaging test (Track 10 from “Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 1”);
- tom-tom drum naturally panned around (Track 28 “Natural stereo imaging” from “Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 3”);
- music instruments performed all around the microphone (Track 47 “Generic Image and Resolution Test” from “Chesky Records Jazz Sampler & Audiophile Test Compact Disc, Vol. 2”);
- F16 and Tornado jets flying by overhead (Track 88 from the “Sound Check” CD by Alan Parsons and Stephen Court);
- Track 1 from “Ambient 1: Music For Airports” by Brian Eno;
- the recording of rain featured at the very end of the movie “Memoria” (2021).
The diffuse field compensation noticeably improves the presentation of “off-stage” sources and makes the overall sound more enveloping. I combined it with the phantom source EQ by putting the diffuse sound compensation into the Side channel of a Mid-Side equalizer. This might seem a bit unusual, because the Side channel by definition emphasizes anti-correlated rather than uncorrelated signals. However, commercial music producers typically avoid strongly anti-correlated signals because they a) sound really weird, and b) disappear completely when the stereo track is downmixed to mono. Thus, the Side channel typically contains just hard-panned left and right sources (with the right channel inverted) and weakly anti-correlated sources, that is, the ambience.
The thought about hard-panned sources led me to a realization that required a bit more experimentation.
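With the Mid = (L + R)/2, Side = (L − R)/2 convention, it is easy to check which signals end up in the Side channel: none of a perfectly correlated signal, all of an anti-correlated one, and half of both an uncorrelated pair and a hard-panned source. A quick numerical sketch of these cases (the function names are mine):

```python
import random
from typing import List, Sequence, Tuple


def mid_side(left: Sequence[float], right: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Mid = (L+R)/2 and Side = (L-R)/2 (one common normalisation)."""
    mid = [(a + b) / 2.0 for a, b in zip(left, right)]
    side = [(a - b) / 2.0 for a, b in zip(left, right)]
    return mid, side


def side_fraction(left: Sequence[float], right: Sequence[float]) -> float:
    """Share of the total Mid+Side energy that ends up in the Side channel."""
    mid, side = mid_side(left, right)
    e_mid = sum(v * v for v in mid)
    e_side = sum(v * v for v in side)
    return e_side / (e_mid + e_side)


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    b = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    cases = {
        "correlated (R = L)": (a, a),
        "anti-correlated (R = -L)": (a, [-v for v in a]),
        "uncorrelated": (a, b),
        "hard-panned left (R = 0)": (a, [0.0] * len(a)),
    }
    for name, (left, right) in cases.items():
        print(f"{name:26s} side fraction = {side_fraction(left, right):.3f}")
```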
The Effect of Mid-Side EQ on Hard-Panned Sources, and Up/Downmix Alternative
In commercial stereo recordings, it is not unusual to encounter “engineered” (as opposed to “live recorded”) music where the producer panned an instrument exclusively to one channel (also known as “hard panning”). Such mixes were common in the early days of stereo, for example on tracks like:
- “Anna (Go to Him)” by the Beatles from one of their first albums (1963);
- “Time Has Come Today” by The Chambers Brothers (1967);
- “Space Oddity” by David Bowie (1969).
But even stereo albums from the more mature recording era could use hard-panned instruments, at least for some elements. For example, in Madonna’s song “American Life” (from the album of the same name, 2003), we can hear a synth and a guitar hard-panned to the left and right channels, respectively, during some moments of the song.
What happens to such hard-panned sources when we employ our Mid-Side EQ for phantom center and diffuse sound compensation? If we used only the phantom center compensation, the answer would be trivial: a hard-panned source gets a reduced amount of the phantom center compensation, because it is applied only to half of the signal (the Mid component), while the Side component is left untouched. But if we also put our diffuse field compensation EQ into the Side channel, the result becomes less predictable. A linear-phase EQ only affects signal amplitude, but applying different EQs to two components of the signal and then summing them back introduces severe spatial bleeding and frequency-dependent panning shifts. For instance, this is what happens to hard-panned signals with our spectral correcting EQ (the effect is symmetric for the left and right channels):
We can see that this EQ introduces “bleeding” and coloration. The situation in the 4–5 kHz region, where the Side channel EQ has a deep, wide notch, is especially dire. Not good! Can this be fixed? One approach is to abandon diffuse field compensation, use phantom center compensation exclusively, and reduce the swings of the EQ as much as possible, balancing phantom center “correctness” against the effect on hard-panned sources.
Another approach is to use a more sophisticated stereo decomposition tool that actually understands the correlation between signals and performs what is called “Primary-Ambient Extraction” (PAE). In other words, use a modern multichannel upmixer (as opposed to older matrix-based upmixers). PAE algorithms typically use time-frequency masking (analyzing the Short-Time Fourier Transform), which allows them to dynamically route correlated energy to the center and uncorrelated energy to the sides on a per-frequency-bin basis. However, as we saw from my experiments on using upmixers to improve headphone rendering, these algorithms can add artifacts to complex dynamic signals (especially noise-like ones).
To check the differences between the Mid-Side approach and this one, I fired up HALO Upmixer and configured it for 5.0 upmixing with hard-panned center extraction, producing an “exact” upmix that can then be precisely downmixed back into stereo with an ITU-R BS.775-compliant downmixer, for example, Nugen’s own HALO Downmixer.
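The bleeding can be derived in two lines. For a hard-left source x (R = 0), Mid = Side = x/2. With real per-frequency gains H_M and H_S (zero-phase EQs, or linear-phase EQs with identical latency), the outputs are L′ = (H_M + H_S)x/2 and R′ = (H_M − H_S)x/2. The sketch below evaluates this (the function name is mine); note that as the Side gain goes to zero in a deep notch, both outputs approach −6 dB, i.e. the hard-panned source collapses into a half-level center image:

```python
import math
from typing import Tuple


def ms_eq_hard_left(mid_gain_db: float, side_gain_db: float) -> Tuple[float, float]:
    """Left output level and right-channel bleed (dB re. the input) for a hard-left
    source after Mid/Side EQ, assuming real, positive gains at this frequency."""
    hm = 10.0 ** (mid_gain_db / 20.0)
    hs = 10.0 ** (side_gain_db / 20.0)
    left = (hm + hs) / 2.0   # L' = M' + S', with M = S = x/2
    right = (hm - hs) / 2.0  # R' = M' - S'

    def to_db(g: float) -> float:
        return -math.inf if g == 0.0 else 20.0 * math.log10(abs(g))

    return to_db(left), to_db(right)


if __name__ == "__main__":
    for side_db in (0.0, -3.0, -10.0, -40.0):
        l_db, r_db = ms_eq_hard_left(0.0, side_db)
        print(f"Side EQ {side_db:+6.1f} dB -> left {l_db:6.2f} dB, right bleed {r_db:7.2f} dB")
```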
In this processing chain we apply the phantom center compensation to the extracted center channel, and the diffuse field compensation to the side left and side right channels, and then we do a downmix back into stereo:
I’ll call this Up-Downmix EQ, or “U/D EQ” for short. It works better in terms of reducing coloration of hard-panned sources. For comparison:
We can see that the coloration of hard-panned sources and their “bleeding” have been reduced significantly, but not eliminated entirely. Why? Here is why:
The PAE-based upmixer does not simply route the source left channel into the left channel of the 5.0 upmix (if we think about it, even a matrix-based upmixer would put the source L channel into both the L and C channels, because it derives C as L+R). Instead, a PAE upmixer spreads it among all three front channels (L, C, and R) with phase relationships that make them cancel completely in the right channel of the downmix. This assumes that the intermediate 5.0 representation is not tampered with.
However, since our purpose is precisely to apply EQ to some of those 5.0 channels, we disrupt this balance, and as a result the components from L, R, and C no longer cancel perfectly in the right channel of the downmix.
Note that HALO Upmixer’s behavior with hard-panned sources depends dramatically on whether the opposite channel contains true digital silence (literal zeros) or acoustic silence (e.g., a -120 dB copy of the primary channel). In the digital silence case, HALO Upmixer puts the signal (say, from the left channel) into the left front, center, and left rear channels, effectively creating the same problem for hard-panned sources as the Mid-Side EQ. However, if the opposite channel actually contains a copy of the primary channel, even one so attenuated that it will never be heard via an electro-acoustic system, the algorithm detects the correlation and spreads the signal across the front, as I described in the previous paragraphs. We can see this difference in HALO’s own visualization of the sound field:
As for the “bleed” that equalizing the intermediate 5.0 representation creates, it is really negligible. To confirm that, I used the KU-100 to measure a “clean” hard-panned channel and compared it with the result of the Up/Downmix processing. The results are very close:
This is likely the best we can do to leave hard-panned sources untouched while providing spectral compensation for both the phantom center and the diffuse field. I compared this upmix/downmix equalization approach with my initial Mid-Side EQ approach, and I think it sounds a bit “fuller” on the sides on tracks with lots of ambience, for example, the rain recording mentioned earlier. My hypothesis is that since side sources are less colored by the frontal source signature, they are indeed perceived as coming from the sides, while with M/S EQ they are more colored by the diffuse curve and their location becomes more ambiguous.
Another benefit of the upmix/downmix approach is that it allows more flexibility in controlling the resulting sound field. HALO Upmixer in particular allows changing the angles of the rear speakers, which effectively balances the energy between the virtual front and back, and it also includes a psychoacoustic shelving filter for the rear channels, which conveniently adjusts the perceived height of off-stage sources. Admittedly, to some extent we can emulate that in the Mid/Side EQ by adjusting the gain and shelving of the Side component, but this can bring its own issues.
For now, as I see it, we have reached the limit of how much phantom center reproduction can be compensated purely by spectral correction without eliminating the physical sources of the divergence: the comb filtering and the fact that the speakers are located to the sides of the listener instead of in front. It would be interesting to check what happens if we try to fix the root of the problem by employing true binaural rendering.
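A scalar toy model makes this concrete. Assume the commonly cited ITU-R BS.775 downmix coefficient of 1/√2 (−3 dB) for C and the surrounds, and a hypothetical upmix that splits a hard-left unit source into L, C, and R with R = −C/√2, so the right channel of the downmix cancels exactly. Scaling C by an EQ gain g then leaves a residual of (g − 1)·C/√2 in the right channel. HALO's real spread is frequency-dependent and not documented here, so this only illustrates the mechanism; the function names are mine:

```python
import math
from typing import Tuple

K = 1.0 / math.sqrt(2.0)  # -3 dB downmix coefficient for C and the surrounds


def bs775_downmix(l: float, r: float, c: float, ls: float, rs: float) -> Tuple[float, float]:
    """Stereo downmix of 5.0: Lo = L + K*C + K*Ls, Ro = R + K*C + K*Rs."""
    return l + K * c + K * ls, r + K * c + K * rs


def hard_left_downmix(center_share: float, center_gain: float) -> Tuple[float, float]:
    """Hypothetical L/C/R split of a hard-left unit source (not HALO's algorithm).
    R carries -K*C so that Ro cancels when C is left untouched; the EQ on C is
    modelled as a single real gain at one frequency."""
    c = center_share
    l = 1.0 - K * c  # makes Lo == 1 when center_gain == 1
    r = -K * c       # cancels the K*C term in Ro
    return bs775_downmix(l, r, center_gain * c, 0.0, 0.0)


if __name__ == "__main__":
    for g_db in (0.0, 1.0, 3.0):
        g = 10.0 ** (g_db / 20.0)
        lo, ro = hard_left_downmix(0.5, g)
        bleed = "silent" if abs(ro) < 1e-12 else f"{20.0 * math.log10(abs(ro)):.1f} dB"
        print(f"C EQ {g_db:+.0f} dB -> Lo {lo:.3f}, Ro bleed {bleed}")
```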