This post continues the previous post, LXdesktop Auralization with Ambisonics, providing more details on the tuning of the chain that I have built, as well as listening impressions from myself and from other people to whom I have demoed the setup.
Evaluation of
the Initial Version by Listeners
The people participating in the listening demo were already familiar with "immersive audio" technology and its implementations on Android and iOS, as well as with how the binaural renderings produced by Dolby Atmos and MPEG-H authoring tools sound. They also understood how head tracking works and why it is needed. Still, everyone listening was surprised by the very wide and externalized image provided by the ambisonic rendering. Another point many people noticed was that when rotating the head, the scene rotates smoothly, with no perceivable "jumps" of the phantom center, which are usually present in traditional discrete-channel spatializers as one turns their head toward the left or right virtual speaker.
However, they also noticed some drawbacks:
- Some of the listeners noted the "fuzziness" of the reproduced sound sources.
- For some listeners, the center image felt unnaturally elevated.
- When switching between "raw" headphone playback (on Sennheiser HD600) and the binaural render, people noted that they wanted more bass in the latter.
I decided to take some time to address this feedback, which led me to making several improvements to my playback chain.
Improving Phantom Center
Naturalness
The problem of the elevated phantom center frequently occurs when listening to binaural recordings in headphones. It may also occur when listening to regular stereo recordings over stereo speakers. Many people note that when a mono source is amplitude-panned between the left and the right speaker, the trajectory of the rendered source may have a "rainbow" shape, meaning an elevated center.
Although the manifestation of the problem is the same, the reasons for its occurrence are different for speaker and headphone playback. For playback over stereo speakers, commonly cited reasons are:
1. Presence of reflections, which makes the phantom center image be perceived as fuzzier compared to acoustic sources that originate mostly from one speaker.
2. Speakers are often not placed with their acoustic centers at the eye (ear) level, and this introduces a vertical component into the recreated sound scene.
3. Other spectral colorations caused by acoustic interference may create more energy in the bands that the brain associates with vertical sources. In the absence of visual cues, the auditory system assumes that the sound from an invisible source comes from somewhere above.
For headphone playback, especially when doing "3D", "binaural", or "immersive" audio, one of the common problems is the mismatch between the listener's own HRTFs and the simulated ones. Similar to reason 3 for speaker playback, mismatched HRTFs can also push more energy into the frequency bands associated by the brain with vertical source placement.
Here is a quote from an interview with Stephan Peus, the "father" of the Neumann KU-100 dummy head, describing this problem in the context of binaural recordings made with it:
We have also changed the "pitch angle" of the ears somewhat. In
listening tests with the KU 81, it had been noticed that sound sources
in the horizontal plane usually tended to be perceived slightly upward
during reproduction. This is related to a characteristic “dip” in the
horizontal frequency response of our outer ears. For every natural ear,
that dip is at a slightly different frequency. This does not interfere
with natural hearing, because we “adjust” the location of sound sources
with the help of our eyes throughout our lives. If we are now given a
certain configuration by the dummy head, we cannot correct visually. As
it happened, the aforementioned dip in the horizontal frequency response
of the KU 81 caused sound events from the front to be perceived as
slightly shifted upward. In the KU 100, we therefore adjusted the angles
of the ear cups relative to the vertical so that the imaging is now
correct horizontally and vertically.
Now, imagine what happens when we simulate speaker playback over headphones, via the HRTFs of a dummy head! I suppose all of these problems combine and affect the perception of the phantom center even more strongly. I can't fix the HRTF mismatch because my processor simply uses the HRTFs of the KU-100 head (via the IEM BinauralDecoder plugin). However, I was able to fix the "fuzziness" of the phantom center to some extent.
My approach uses the same idea as speaker crosstalk cancellation (XTC); however, in my case I did not have to use any actual XTC filters. First, let's recap the essence of my approach: using a stereo speaker setup and an Ambisonics microphone, I captured transfer functions between each speaker and each microphone capsule in order to simulate real-time recording of the speakers by the microphone:

Now we can see that if we render the left and the right speaker separately (each on its own track), at the output we get the ipsi- and contra-lateral signals for each of them separately; that gives us 4 channels, one for each combination of left/right speaker and left/right ear. When mixing these signals for the binaural presentation, we can control how much crosstalk we want to end up with. First I tried having no crosstalk at all: that's ideal XTC! However, this did not sound natural at all, very much resembling the regular "headphone sound", just with extra reverb. The resulting phantom center was very close to the face. I have found that attenuating the contra-lateral paths by about 6 dB produces the most natural result and yields a very compact and clean-sounding phantom center. Recall that the HRTFs of the dummy head already incorporate some head shadowing, which is why the extra attenuation does not need to be excessive.
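To make the routing concrete, here is a minimal sketch of this mix (in Python with NumPy; the channel layout and the 6 dB figure come from the text above, while the function name and structure are purely illustrative):

```python
import numpy as np

def mix_partial_xtc(ls_to_l, ls_to_r, rs_to_l, rs_to_r, xtc_db=-6.0):
    """Mix the four separately rendered speaker-to-ear signals into a
    binaural pair, attenuating the contra-lateral (crosstalk) paths.

    ls_to_l : left speaker  -> left ear  (ipsi-lateral)
    ls_to_r : left speaker  -> right ear (contra-lateral)
    rs_to_l : right speaker -> left ear  (contra-lateral)
    rs_to_r : right speaker -> right ear (ipsi-lateral)
    xtc_db  : extra attenuation of the crosstalk paths; 0 dB keeps all of
              the simulated crosstalk, -inf dB would be "ideal" XTC.
    """
    g = 10.0 ** (xtc_db / 20.0)          # -6 dB -> about 0.5 linear gain
    left_ear = ls_to_l + g * rs_to_l
    right_ear = rs_to_r + g * ls_to_r
    return np.stack([left_ear, right_ear])
```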
Did it fix the perceived center elevation problem? Yes, it did! However, fixing the phantom center this way had a negative effect on the width of the perceived sound scene: instead of appearing wide in front of me, the left and right sides now collapsed close to my ears. Why is that? This reminded me of the dilemma that people often bring up on audio forums: why do speaker listeners want to reduce crosstalk, while headphone listeners often want to add it, via cross-feed circuits or plugins? The answer is that they are tackling different problems.
As I noted earlier, XTC aims to fix the coloration and fuzziness of the phantom center by attenuating the contra-lateral audio paths from the stereo speakers to the ears, making the sound waves reaching the ears closer in character to those of a "real" center source in front of the listener's head.
Cross-feed, on the other hand, mostly fixes the reproduction of lateral sources in headphones. Consider hard-panned dry sources that exist in only one of the stereo channels: these sound very unnatural on headphones because only one ear receives their signal, resulting in "inside the head" localization. In contrast, a real lateral sound source outside of the listener is always heard by both ears, with natural attenuation from head shadowing and a time-of-arrival difference.
So it seems that we need to decompose our stereo signal and separate the "phantom center" signals from the lateral signals. In multi-channel and object-based audio scene representations this decomposition is given, but for stereo sources we have to do some work. I decided to employ an approach similar to the one described in the post Headphone Stereo Improved, Part III. I separated the stereo stream into 3 components:
- mostly correlated components: the "phantom center";
- mostly uncorrelated components: lateral signals created by hard
left/right amplitude panning;
- the rest: components lacking strong correlation, or anti-correlated:
the ambience.
I thought I could use a multichannel upmixer for this. However, after experimenting with the free SpecWeb tool set and the inexpensive Waves UM225, I realized that although upmixers use conceptually the same approach for component separation, their end goal is a bit different because they target a multi-channel speaker system. They are designed to "spread" virtual sources softly between pairs of speakers (for example, the phantom center is also "translated" into some energy in the left and right channels), whereas I need to extract it in an almost "solo" fashion. Also, in multichannel setups there are typically no dedicated channels for "ambience", so ambient components are also spread across all channels. It is possible that with some practice I could set up an upmixer to avoid this spreading and do what I need, but I decided to leave that for later.
So instead, I decided to use the Bertom "Phantom Center" plugin for this operation. But how do we extract the lateral sources? While the phantom center is composed of fully correlated components, the "residue" (the non-correlated components) is a mix of lateral and "diffuse" sounds. So I came up with the following topology for extracting lateral sounds, which uses both the "Phantom Center" plugin and the Mid-Side approach:

The idea is that if we invert one of the channels and process the result via the "Phantom Center", it will extract the anti-correlated, "diffuse" components. This way we can separate them from the lateral components and end up with the 3 sound "streams" that I enumerated above. To illustrate the result, here is how this plugin setup separates, based on their correlation, a set of Dirac pulses corresponding to different source positions:

If you want to refresh your understanding of interchannel correlation, please refer to my old post On Mid/Side Equalization. Of course, this decomposition only works correctly for amplitude-panned sources, because the correlation meter in the "Phantom Center" plugin uses a zero-lag setting; however, in practice this approach yields good results for stereo recordings. Note that in the end I settled on a 98% setting for the "Phantom Center" to avoid sharp transitions between the streams.
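Structurally, the decomposition can be sketched as follows (a rough Python illustration of the topology only: the extract_center callable stands in for the "Phantom Center" plugin, whose internals are not reproduced here, and the actual Reaper routing of course differs):

```python
def decompose_stereo(left, right, extract_center):
    """Split a stereo pair into center / lateral / ambience streams.

    extract_center(l, r) is a stand-in for the "Phantom Center" plugin:
    given a stereo pair, it returns the correlated ("center") component
    of each channel.
    """
    # 1. Correlated components -> the phantom center stream.
    center_l, center_r = extract_center(left, right)

    # 2. Remove the center; what remains is lateral plus diffuse content.
    resid_l, resid_r = left - center_l, right - center_r

    # 3. Invert one residual channel: anti-correlated ("diffuse") content
    #    becomes correlated and is picked up by the same extractor.
    amb_l, amb_r_inv = extract_center(resid_l, -resid_r)
    amb_r = -amb_r_inv

    # 4. The remainder of the residue is the hard-panned lateral stream.
    lat_l, lat_r = resid_l - amb_l, resid_r - amb_r
    return (center_l, center_r), (lat_l, lat_r), (amb_l, amb_r)
```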
So, from this preprocessor we get 3 pairs of outputs (the aforementioned 3 streams). Each pair is processed independently, and moreover, each channel of the pair has its own speaker-to-binaural processing path, which yields 4 channels per stream. Thus, at the output we have 12 channels, each representing a certain component of the stereo field, as rendered via a particular speaker, on the path to a particular ear. This gives us full control over how to mix these components for binaural playback and allows us to use both XTC and cross-feed at the same time, each applied to the proper kind of acoustic source. I ended up with the following mixing matrix in Reaper:

From left to right, the first 4 channels are the center: left speaker to left ear and to right ear, then right speaker to left and right ear. The next 4 channels are the lateral components, in the same order, and the last 4 channels represent the "ambience".
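Conceptually, the 12-to-2 mix boils down to a pair of per-stream gains for the ipsi- and contra-lateral paths. A sketch of that structure is below; only the -6 dB crosstalk attenuation for the center comes from the text, the other numbers are placeholders that merely show which knobs exist:

```python
import numpy as np

# Per-stream gains for the ipsi- and contra-lateral paths, in dB.
STREAM_GAINS_DB = {
    "center":   {"ipsi": 0.0, "contra": -6.0},  # partial XTC
    "lateral":  {"ipsi": 0.0, "contra":  0.0},  # keep crosstalk (cross-feed)
    "ambience": {"ipsi": 0.0, "contra": +3.0},  # counteract head shadowing
}

def db_to_lin(x):
    return 10.0 ** (x / 20.0)

def mix_streams(streams):
    """streams[name] = (ls_to_l, ls_to_r, rs_to_l, rs_to_r), i.e. the 4
    speaker-to-ear signals of that stream (12 channels in total).
    Returns the final 2-channel binaural mix."""
    left = right = 0.0
    for name, (ll, lr, rl, rr) in streams.items():
        gi = db_to_lin(STREAM_GAINS_DB[name]["ipsi"])
        gc = db_to_lin(STREAM_GAINS_DB[name]["contra"])
        left = left + gi * ll + gc * rl
        right = right + gi * rr + gc * lr
    return np.stack([left, right])
```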
As a bonus, I realized that by defeating the effect of head shadowing for the ambient components, that is, by boosting their contra-lateral paths, I can achieve even better externalization of the virtual sound scene. In my previous spatializer I achieved a similar effect by simply boosting the uncorrelated components.
I made the final adjustments to the balances by listening to correlated and anti-correlated pink noise, making sure that both sound centered. I was left wondering why the required interchannel balance is not symmetric. My hypothesis is that, first, the use of non-individual HRTFs may cause this, and second, it may be due to the not fully symmetric speaker setup in my room (see my earlier posts on LXdesktop). In the future I will try to correct this by building a better speaker setup.
Fixing the Bass
Of course, as audiophiles we always enjoy rich and deep bass, and headphone makers usually try to add more bass to their headphones. As I and other people found while comparing the "raw" stereo sound to the binaural rendering, the latter was noticeably lacking in bass. That seemed strange to me, considering that I have a good subwoofer and never feel a lack of bass when listening to my LXdesktop setup. Simply boosting the bass of the binaural renderer's output led to excessive on-the-head vibration of the headphone drivers, which ruined the externalization effect. This needed to be done some other way, I decided.
After reading a bit more about the implementation of the BinauralDecoder, I noted that it uses the MagLS approach for interpolating between the sampling points of the measured HRTFs. This approach is intended to minimize amplitude differences only. Although the authors say that it is only applied starting at 2 kHz, which may imply that interaural time differences are preserved for the frequencies below, I decided to check what would happen if I explicitly added them.
Since I have separate signal paths for the left and right ear, I decided to employ my "almost linear phase" ITD filters, and I was not disappointed: the sense of good deep bass returned to my binaural renderer! Interestingly, these filters have a flat amplitude response; they do not boost the energy of the bass at all. Yet, somehow, adding a correct phase shift between the ipsi- and contra-lateral ears makes the bass be perceived as stronger. While A/B switching between the filter and no-filter configurations, I realized that this phase shift perhaps allows the auditory system to "focus" on the bass and perceive it as coming from a compact source, whereas mostly in-phase bass creates an impression of ambient rumbling and is not perceived as being as strong, even at the same energy level.
After some experiments with the cutoff frequency, I ended up with 500 Hz for the "center" source, 750 Hz for the lateral components, and no ITD filtering for the ambience. Raising the cutoff frequency, or trying to apply the filter to the ambient component, moved the virtual sources closer to the face, which I did not like.
One technical issue that the use of this block creates is added latency. Since the filters are symmetric, linear-phase in style, and need good resolution in the bass region, they create a delay of 170 ms. And since they have to be placed after the BinauralDecoder, this latency affects head tracking.
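The latency follows directly from the symmetric structure: the group delay of a linear-phase FIR is half its length. As a quick, purely illustrative check (the sample rate and tap count are assumptions, not the actual filter parameters):

```python
fs = 48000        # assumed sample rate
n_taps = 16384    # hypothetical length giving good bass resolution
latency = (n_taps - 1) / 2 / fs   # ~0.17 s, on the order of the 170 ms above
```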
Adjusting the Tonality
It's never easy to tune an audio reproduction system to the ideal tonality (does one even exist?). My binaural renderer is of course no exception to this rule. The first difficulty in obtaining the right tuning is that there are many uncertainties in how I captured the speakers and the room, and also in how the captured system is rendered via the binaural renderer and the headphones. The second is that the perceived tonality changes depending on how the brain perceives the location, the size of, and the distance to the virtual sources.
So we don't know how precise our measurements are, and we need to use our perception for tuning as well. However, in order to do that efficiently, we would like to be able to make instant comparisons with some reference. One good reference I found, thanks to Archimago's post, was the binaural version of the "Touch Yello" album, released as a Dolby Atmos remaster in 2025 on a Blu-ray. The binaural version sounded quite good when listening with the Sennheiser HD600, so I decided to use a frequency-domain measurement of it as a reference for fine-tuning my binaural chain.
The Blu-ray contains both the stereo and the binaural versions, so I was able to measure frequency curves both for the binaural rendering of the stereo version via my chain and for the original binaural version. Below is a comparison of the ERB-smoothed curves. The FR of the original binaural version is in blue and red, and the FR of my version is in orange and light teal:

It can be seen that the official binaural version has a somewhat V-shaped tuning (raised bass and treble, with a dip in the mids) compared to my rendering. My initial plan was to try to match them as closely as possible. However, as I quickly understood, because my rendering sounds farther from the listener than the original binaural version, their spectral shapes can't simply be the same. Instead, my approach was to find the regions where there is a significant difference and then try adjusting these bands while listening to the changes in tonality and perception. The goal was to obtain a more natural tonality for my rendering and to minimize the change in perceived tonality when switching back and forth between my version and the "official" binaural one.
In the end, my version sounds more spacious and is better externalized, while the original binaural rendering sounds closer to the face and is much "denser". You can compare the results yourself by using the YouTube and Google Drive links below (the Drive version uses AAC at 320 kbps, while YouTube transcoded it into Opus at 140 kbps). Note that although my rendering is made specifically for the Sennheiser HD600, you can still use any reasonable headphones to check it; I even used Apple EarPods for some testing! Just one note: if you are listening on modern headphones that support "spatial audio", make sure you turn it off and use the plain stereo mode:
(Of course, these are provided for educational or personal use
only).
A question one can ask is: what is it, besides the frequency balance, that makes the original binaural rendering be perceived very close to the face, while my rendering sounds much more externalized? One answer I have found comes from an objective measurement of the interaural cross-correlation (IACC). Below are two graphs comparing the IACC of these two binaural versions around the 0:30 time position:


We can see that my rendering is much less correlated in the high-frequency region starting from 2 kHz, which corresponds to a more spacious listening experience. IACC is one of the metrics used by acousticians for objectively comparing the sound of different concert halls (see the book by Y. Ando and P. Cariani, "Auditory and Visual Sensations") and different microphone setups (see the book by E. Pfanzagl-Cardone, "The Art and Science of Surround and Stereo Recording").
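In the graphs above, IACC is evaluated per frequency band; for reference, here is a sketch of the usual broadband definition used in room acoustics (the peak of the normalized interaural cross-correlation within about a millisecond of lag), assuming a 48 kHz sample rate:

```python
import numpy as np

def iacc(left, right, fs=48000, max_lag_ms=1.0):
    """Interaural cross-correlation coefficient of a binaural pair: the
    maximum of the normalized cross-correlation over lags of +/- max_lag_ms.
    Band-limit the inputs first to obtain per-band values."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = 0.0
    for k in range(-max_lag, max_lag + 1):
        l_seg = left[max(0, -k): len(left) - max(0, k)]
        r_seg = right[max(0, k): len(right) - max(0, -k)]
        best = max(best, abs(np.sum(l_seg * r_seg)) / norm)
    return best
```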
Another "source of truth" for the tonality that I have found are
recordings produced by Cobra
Records. Whereas the Yello binaural production was a result of
binaural rendering done from Dolby Atmos master, recordings done at
Cobra are real acoustical recordings done
simultaneously by conventional multi-mic setups for stereo and
surround, and using the KU-100 head for the binaural version. If you
recall, the IEM BinauralDecoder plugin is also based on KU-100 free
field HRTFs, and thus comparing the rendering of Cobra's stereo records
processed by my chain with their binaural versions makes quite a fair
apples-to-apples comparison.
Unfortunately, I do not know which headphones their binaural version is intended for. I can imagine it should be some diffuse-field equalized headphones. So, as an example, here is an excerpt from the "Extemporize" piano album, where my chain is rendered for the HD800, again both as YouTube and "offline" files:
One thing I noticed when comparing my binaural rendering with Cobra's binaural recording is that the latter, for some reason, has the left and right channels swapped; I fixed that for my comparison test. The difference between these recordings/renderings is more subtle than with the Yello track; the stereo recording is really good by itself! Still, I hope you get a similar experience of the sound moving away from the head when listening to the rendering made via my processing chain.
For completeness, this is a similar comparison of ERB-smoothed
frequency responses of these renderings:

Again, it's a pity that the producers of these binaural recordings do not specify which headphones they would recommend for listening to them. The page at the Cobra Records site says "any brand or style will work (nothing fancy required!)", which to me sounds too generous. Yes, you can hear a difference using any headphones, but for actually experiencing "being there," the tonal balance of the headphones used for binaural reproduction is very important.
Using More Headphones
for Auralization
With the equalization that needs to be applied to headphones in order to achieve the "natural" tuning for immersive playback, all of them start to have a similar tonality. Yet the impression we get when listening to them is still not the same: even with the same EQ target, different headphones create different listening impressions due to differences in their drivers and in their interaction with our ears.
Another important aspect of headphones, apart from how they sound, is how they feel on the head: the clamping force, the size of the earpads, the weight, etc. I have found the Sennheiser HD800 and Shure SRH1840 to be very comfortable for long listening sessions. Unfortunately, however, neither of them is on the list of headphones originally measured by B. Bernschütz on the KU-100, and thus they are absent from the headphone EQ list of the IEM BinauralDecoder plugin.
However, I'm lucky to have access to a state-of-the-art headphone measurement system, the B&K 5128, so I used it to derive EQ filters that turn the HD800 into the HD600, which is on the BinauralDecoder list. Note that EQing headphones to sound like some other model is actually a non-trivial task. At the last NAMM convention I had a conversation with a representative of TiTumAudio, a company that makes headphones which can imitate a number of commercial headphones. He noted that truly copying the sound of other headphones requires tweaks that go beyond simple LTI processing (that is, EQing). This is why I specifically chose the HD600 as the target for the HD800: their drivers are probably the closest in their non-linear properties, compared to headphones from other makers. In a similar fashion, I use the AKG K240DF (which is on the BinauralDecoder list) as the target for other AKG headphones, the Beyerdynamic DT990 as the target for other Beyers, and the discontinued Shure SRH940 as the target for more modern Shure models.
One technical challenge I encountered when creating the filters for these conversions is that the measurements for the left and right ears of the B&K 5128 never match exactly. Instead of using an average of the left and right ears, I decided to leave them different. However, in order to avoid distorting the phase relationship between the left and right channels, I made these conversion filters linear-phase (8k taps). Since that creates an extra delay, I put this correction block before the SceneRotator so that it does not add to the latency of head tracking.
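For reference, here is a rough sketch of how such a conversion filter can be derived from two magnitude responses measured on the same rig (a simplified frequency-sampling design with an assumed 48 kHz sample rate; not the exact procedure or smoothing I used):

```python
import numpy as np

def conversion_filter(freqs, mag_src_db, mag_dst_db, fs=48000, n_taps=8192):
    """Linear-phase FIR that EQs the `src` headphone toward the `dst`
    headphone, built from their measured magnitude responses (one ear).

    freqs                  : measurement frequencies in Hz
    mag_src_db, mag_dst_db : magnitude responses in dB at those frequencies
    """
    n_bins = n_taps // 2 + 1
    grid = np.linspace(0.0, fs / 2.0, n_bins)
    diff_db = np.interp(grid, freqs, mag_dst_db - mag_src_db)
    mag = 10.0 ** (diff_db / 20.0)
    # Zero-phase prototype from the magnitude-only spectrum, then shift it
    # to the middle for a causal, symmetric (linear-phase) response.
    h = np.fft.irfft(mag, n=n_taps)
    h = np.roll(h, n_taps // 2) * np.hanning(n_taps)
    return h
```

A separate filter is computed for each ear, which preserves the measured left/right difference while keeping the interchannel phase relationship intact.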
Complete Processing Chain
Summarizing the processing blocks mentioned in the previous sections,
this is what I ended up with:

It may look big, but in the end all of these components are necessary. Realistic binaural rendering is not an easy task!
Music Tracks Used for
Evaluation
While iterating on my spatializer, and also while preparing to demo it to other people, I have come up with a list of songs available on Apple Music that have good spatial properties. They represent different musical styles: classical, pop, electronic, metal, etc. They also represent different recording styles: acoustic recordings, engineered stereo recordings, and some modern Dolby Atmos tracks (rendered into stereo).
Roughly, I can classify them into several categories based on which attributes of the reproduction they help to test. Of course, some tracks belong to several categories.
- Great stereo acoustic recordings with natural vocals and instruments
- All Roads to the River / Breaking Silence by Janis
Ian
- Also sprach Zarathustra, Op. 30, TrV 176 by Richard Strauss
- Grandmother / The Raven by Rebecca Pidgeon
- L'Égyptienne / Les Sauvages by Béatrice Martin
- No Flight Tonight / from Chesky Records 10 Best, by
The Coryells
- Pipeline / Two Doors by Michael Shrieve
- The Firebird Suite (1919 Version): V. Infernal Dance of King Kaschey
by Igor Stravinsky
- The Wrath of God: Pt. 1 by Sofia Gubaidulina
- Violin Concerto No. 1: III. Quarter note by Philip Glass
- Engineered records with strong emphasis on spaciousness
- An Echo of Night / The Pearl by Brian Eno &
Harold Budd
- An Ending (Ascent) / Apollo: Atmospheres and
Soundtracks by Brian Eno
- Animal Genesis / Oxymore by Jean-Michel Jarre
- Barco / Insen by Ryuichi Sakamoto & Alva
Noto
- Contrapunctus 8, A 3 / Laibachkunstderfuge by
Laibach
- Day One (Interstellar Theme) / Interstellar (OST)
by Hans Zimmer
- Get Your Filthy Hands Off My Desert / The Final Cut
by Pink Floyd
- High Hopes / The Division Bell by Pink Floyd
- Resonance / Resonance by Boris Blank
- Ripples in the Sand / Dune (OST) by Hans
Zimmer
- The Snake and the Moon / Spiritchaser by Dead Can
Dance
- Troubled / Passion (The Last Temptation of Christ
OST) by Peter Gabriel
- Synthesized stereo scenes, not as "spacious" but still interesting
- Another One Bites the Dust / The Game by Queen
- Birds / Nameless by Dominique Fils-Aimé
- Bubbles / Wandering—EP by Yosi Horikawa
- Jeremiah Blues (Part 1) / The Soul Cages by
Sting
- Me or Him / Radio K.A.O.S. by Roger Waters
- On the Run / The Dark Side of the Moon by Pink
Floyd
- Rocket Man / Honky Château by Elton John
- Space Oddity / David Bowie (aka Space Oddity) by
David Bowie
- The Invisible Man / The Miracle by Queen
- Voice of the Soul (1996 Demos) (Instrumental) / The Sound of
Perseverance by Death
- What God Wants, Pt. I / Amused to Death by Roger
Waters
- Not quite "spacious" but with good "visceral impact"
- Dyers Eve / ...And Justice for All by
Metallica
- Flint March / Small Craft On A Milk Sea by Brian
Eno, Jon Hopkins & Leo Abrahams
- Heatmap / Warmech by Front Line Assembly
- Lie to Me / Some Great Reward by Depeche Mode
- Single Blip / Ssss by VCMG
When demoing, I realized that most of these tracks are unknown to most people. They generally chose either Hans Zimmer's tracks or Pink Floyd, and for some reason "Rocket Man" was also popular.
Remaining Issues (Future
Work)
Essentially, I have two kinds of problems. One kind stems from the fact that instead of using artificial models of the speakers and the room, I use a capture of real speakers in a real room, so any of its flaws get aggravated by the processing and by listening on headphones. The second kind lies purely in the DSP domain and hopefully can be fixed more easily.
Fixing Speaker Setup
and Room Asymmetry
As noted in the section "Improving Phantom Center Naturalness", my Ambisonic capture of the room is not perfectly balanced, thus requiring some correction. This is not a problem of the capture process itself, but rather the fact that it captures the imperfections of my setup, which get aggravated by headphone listening. Having a better, more symmetric setup should help.
Reducing Room Ringing
I encountered this issue when listening to a recording of a male vocal: the opera tenor Joseph Calleja performing "I Lombardi" by Verdi. I experienced a very uncomfortable sensation of "ringing" and "compression" in the sound of Calleja's singing. I compared my rendering to the original recording and noticed that some of these artefacts are already there, due to the reverberation of the hall where the recording was made. Then I listened on the speakers and noted that these artefacts are even more pronounced due to the reverberation added by my room. I think their primary sources are comb filtering and flutter echo interacting with the harmonics of the singer's voice.
I realized that if I had invited Joseph Calleja to actually sing in my room, I would likely hear this compression and ringing as well. I recalled that I can indeed notice these artefacts when listening to live vocals while sitting in acoustically mediocre halls.
What can I do about that? Ideally, I would like to reduce the reverberation of my room captured in the IRs and treat the reflections. However, it's not easy to apply this cleanup post hoc to already captured IRs. I decided that next time I will probably put some sound-absorbing materials behind the microphone in order to produce somewhat more "dead" IRs.
Achieving
Better Quality of Stereo Field Decomposition
As noted in the paper by E. Vickers, "Frequency-Domain Two- to Three-Channel Upmix for Center Channel Derivation and Speech Enhancement", frequency-domain audio processing may produce certain artifacts, often described as "musical noise" or a "watery sound." This is indeed what I hear when I decompose pink noise into correlated, anti-correlated, and uncorrelated components and listen to each of them separately. When the processed sound is combined back together, these artifacts are mostly masked; however, they may still pop up when listening to music with lots of transients. Ideally, I would like to find a more "high-fidelity" way of decomposing the stereo sound field.
I contacted Tom from Bertom Audio regarding the artefacts produced by the Phantom Center plugin, and his answer was that, unfortunately, nothing can be done in the current version of the plugin to get rid of them completely. So a possible solution may be to study how to achieve the same decomposition using one of the expensive high-quality upmixing plugins.
Solving the Latency Problem of ITD Filters
As I previously noted, the ITD filters, which are required for good bass reproduction when the binaural rendering is done via the IEM BinauralDecoder, add noticeable latency. So I either need to find a binaural renderer for Ambisonics that produces similar inter-aural phase on its own, or re-create the filters in a mixed-phase fashion with much lower latency.