Sunday, May 10, 2026

Spectral Correction of Phantom Audio Sources, Part I

I understand that the title of this series of posts may sound too scientific, so first I would like to clarify what I’m talking about. Let’s state one of the most prominent challenges of audio reproduction. Complex sound scenes (for example, music performances by groups of people) always contain multiple sound sources. The number of individual sound sources in an orchestra is vastly bigger than the number of loudspeakers in anyone’s audio system. Even if we consider a small band and a surround sound system with multiple speakers, the problem still exists because the performers are not necessarily arranged at the same locations as the loudspeakers. In addition, in movies sound sources often change their location dynamically. Because of that, when we play a recording on an audio system, it has to create phantom audio sources originating from locations between speakers. The simplest speaker setup that allows creating phantom sources is the good old stereo, so that’s what we consider here. In the context of stereo playback, there are two problematic aspects: the phantom center and the reproduction of diffuse sound fields.

Phantom Center

The phantom center phenomenon is very “unnatural” in the sense that it can very rarely be achieved in nature, since it requires two almost identical, synchronized (in-phase) sources located symmetrically in front of the listener. Yet, for some reason our brain seems to handle it just fine. This is likely because, from a physics standpoint, the two in-phase waves sum at each ear into a pressure signal very similar to what a single frontal source would produce, and the brain successfully integrates this into a single auditory image. This is why the “synchronized” (in-phase) property of the sources is so important. If the signals are out of phase, the solid center image collapses into a spatially ambiguous, unlocalizable haze; and if they arrive at sufficiently different times, binaural fusion breaks completely, causing our brain to perceive them as two separate auditory events.

It’s interesting that even though the phantom center has been used in stereo recordings for such a long time, there are still ongoing debates among music production professionals on the use of a physical center channel vs. the phantom center, even with modern object-based formats such as Dolby Atmos. The main argument for the phantom center among audio producers is that the image of the audio source it creates is perceived as “fuzzier” and “warmer,” whereas a physical center is more “point-like” and can have a “sharper” character.

However, more technically-oriented audio professionals never get tired of pointing out one of the most widely recognized problems associated with the use of phantom images—the comb filtering effect. Since a phantom sound source is created by combining acoustic waves from two or more neighboring speakers, when the wave from each speaker arrives at the listener’s ear at a slightly different time, their sum produces both constructive and destructive interference. This means that some frequencies will be boosted and some attenuated. This, in turn, means the timbre of the phantom image will differ from that of the physical source we are trying to imitate.
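To make the effect concrete, here is a small numerical sketch of the comb pattern produced by summing a signal with its delayed copy. The 0.25 ms path difference is an assumed round figure; the actual value depends on head geometry and the speaker angle.

```python
import numpy as np

# Assumed path-length difference between the two speaker-to-ear paths;
# the real value depends on head size and speaker placement.
delay_s = 0.25e-3

f = np.linspace(20, 20000, 1000)
# Magnitude of the sum of two equal-level signals, one delayed by delay_s:
# |1 + exp(-j*2*pi*f*dt)| = 2*|cos(pi*f*dt)|
mag = np.abs(1 + np.exp(-2j * np.pi * f * delay_s))
mag_db = 20 * np.log10(np.maximum(mag, 1e-12))

# Constructive interference gives +6 dB; destructive interference gives
# deep notches, the first one at f = 1 / (2 * delay_s).
first_notch_hz = 1 / (2 * delay_s)
print(f"first notch near {first_notch_hz:.0f} Hz")
```

With a 0.25 ms difference the first notch lands at 2 kHz, right where the ear is quite sensitive, which is part of why the coloration is audible.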

The change of timbre may also affect the perceived location of the source: for example, it can appear to be elevated, especially in the absence of visual anchors. This is a psychoacoustical problem, a consequence of how the auditory system works. As demonstrated by J. Blauert, narrow-band signals are perceived at specific elevations depending heavily on their center frequency, regardless of the actual sound source location.

Besides being perceived as “warmer” and “fuzzier,” another notable difference in the perception of phantom vs. physical sources is the better stability of the latter when the listener moves around the acoustic space, or simply turns or tilts their head. A phantom source experiences a more dramatic change in its tonality because the comb filtering pattern changes immediately, and this affects the resulting tonal balance.

The well-known solutions for the comb filtering problem are:

  • For stereo setups, one approach to eliminating the coloration due to cross-talk is to attempt to cancel it. Cross-talk cancellation can be abbreviated both as ‘CTC’ and ‘XTC’; I will use the former acronym. There are different implementations of this approach, such as BACCH (see the description in the “Immersive Sound” book) and RACE.

  • For multichannel and Ambisonics setups the preferred approach is slight decorrelation of the physical components of a phantom source emitted by each speaker participating in its creation. That’s because there are many speakers around the listener, and tuning each pair of them for CTC becomes impractical.
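As an illustration of the recursive idea behind RACE, here is my simplified sketch (not Glasgal’s actual implementation; the delay and attenuation values are arbitrary placeholders): each output channel receives an inverted, attenuated, and delayed copy of the opposite output, so each cancellation wave is itself cancelled by the next recursion.

```python
import numpy as np

def race_ctc(left, right, delay_samples=8, attenuation_db=3.0):
    """Recursive crosstalk cancellation in the spirit of RACE (a simplified
    sketch): each output channel subtracts an attenuated, delayed copy of
    the opposite output, recursively, producing a decaying train of
    cancellation impulses."""
    g = 10 ** (-attenuation_db / 20)
    out_l = np.zeros_like(left, dtype=float)
    out_r = np.zeros_like(right, dtype=float)
    for n in range(len(left)):
        # Feedback taps come from the already-computed opposite output.
        fb_l = out_r[n - delay_samples] if n >= delay_samples else 0.0
        fb_r = out_l[n - delay_samples] if n >= delay_samples else 0.0
        out_l[n] = left[n] - g * fb_l
        out_r[n] = right[n] - g * fb_r
    return out_l, out_r
```

Feeding an impulse into the left input shows the characteristic alternating, decaying impulse train: the right output gets an inverted copy after one delay, the left output a doubly-attenuated positive copy after two delays, and so on.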

Besides the comb filtering, there is another interesting problem which affects the phantom center severely. Since the human HRTF differs for the frontal and lateral directions, the phantom center created by lateral speakers may have a different tonality from a frontal physical center simply because of the location mismatch: the brain assumes that the sound is arriving from the front of the listener, so it applies the inverse of the frontal HRTF, but this is the wrong filter because the acoustic waves actually arrive from the sides. S. Linkwitz thought about this problem and proposed using a shelving filter based on the spherical head model. Conversely, D. Griesinger argues that “the frequency response is nearly constant as a sound source moves from zero to ±30 degrees in the horizontal plane.” With all respect to him, I disagree with this statement—as we will see, frontal and side HRTFs are significantly different. There is also a more recent, very detailed study by V. Gunnarson (the paper “Spectral Correction of Audio Objects in Stereophonic Rendering” from 2024) which clearly shows that the phantom center is affected by the differences in HRTF, and that this can be corrected using equalization.

Diffuse Sources

For sure, the phantom center which represents the leading performer is very important in stereo sound reproduction. However, there is also a less noticeable but equally important component of the audio scene: the diffuse component which represents “the feeling of the space,” felt mostly unconsciously. In live performance recordings this component may jump to the listener’s attention when they hear applause. The applause originates from a widely spread source and is reinforced by the hall acoustics, creating a huge diffuse source with an enveloping feeling.

A stereo system trying to reproduce this diffuse source inevitably struggles. The listener’s room may help if it has enough diffusing surfaces and the speakers are located far enough from the listener, but that’s not always the case. Multichannel systems, by design, are much better at reproducing diffuse sources. However, as V. Gunnarson’s paper demonstrates, even multichannel setups benefit from some room-tailored correction for diffuse sound, and for a stereo system it’s really essential.

One example of such correction is the well-known “BBC dip”. This is a speaker equalization which technically was intended to smooth the transition between the woofer and tweeter, which can otherwise cause a “power hump” in the upper midrange, making the speaker sound overly aggressive or “bright” in a reflective environment. As judged by ear, this EQ was known to improve the “spaciousness” and “depth” of orchestral recordings. Both of these impressions are communicated to the listener via the diffuse sound field.

The Goals of My Exploration

In my hobbyist research I decided to explore the following questions:

  1. In the context of a stereo speaker setup, how is the difference between a physical and phantom center perceived? And what are the major contributing factors to this difference?

  2. How should an “ideal” phantom center sound? The ideal phantom center is achieved by making sure that the sound waves arriving at the listener’s ears from a pair of speakers are the same as from a real, physical center. This is hard to achieve in a domestic room due to the high level of reverberant, reflected sound. However, we can use earspeakers—a weird kind of headphones that do not block or even cover the ears (because that creates its own problems), but rather are suspended very close to the listener’s ears—in order to simulate speaker playback under anechoic conditions.

  3. What can be done in order to make the phantom center produced by stereo speakers sound similar to a physical center, or the ideal phantom center? Are the techniques of stereo speaker sound correction such as CTC and decorrelation actually effective for my speaker setup?

  4. Similar questions about the diffuse field reproduction. If I don’t use purposefully built diffusers in my room, how can the reproduced diffuse field be corrected in order to be perceived as more enveloping? Unlike the phantom center situation where the reference can be provided easily, creating a reference diffuse field in a domestic room is challenging.

  5. If we consider the often used psychoacoustic metric of Inter-Aural Correlation Coefficient IACC, how does it change between physical and phantom centers? The “Early IACC” (0–80 ms) is associated with “Apparent Source Width” (ASW), while the “Late IACC” (>80 ms) correlates heavily with “Listener Envelopment” (LEV). How is the IACC metric affected by phantom center correction? Also, can we improve the “feeling of space”?
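For reference, here is a sketch of how IACC can be computed from a measured binaural impulse response. The 0–80 ms default window matches the “Early IACC” convention mentioned above, and the ±1 ms lag search range follows the ISO 3382-1 convention.

```python
import numpy as np

def iacc(left, right, fs, t_start=0.0, t_end=0.080, max_lag_ms=1.0):
    """Inter-Aural Cross-correlation Coefficient over a time window of a
    binaural impulse response: the maximum of the normalized
    cross-correlation for inter-aural lags within +/- 1 ms."""
    i0, i1 = int(t_start * fs), int(t_end * fs)
    l, r = left[i0:i1], right[i0:i1]
    max_lag = int(max_lag_ms * 1e-3 * fs)
    norm = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.sum(l[lag:] * r[:len(r) - lag])
        else:
            c = np.sum(r[-lag:] * l[:len(l) + lag])
        best = max(best, abs(c) / norm)
    return best
```

Identical left/right signals give an IACC of 1 (narrow apparent source), while independent noise at the two ears gives a value near 0 (maximally diffuse).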

Simulating Phantom Center via Ambisonics Binaural

In order to avoid complications from the issues with room acoustics, let’s first evaluate the ideal anechoic case. In the past, researchers had to simulate the physics of the interaction of acoustic waves with a spherical head model, but these days we can perform a more realistic simulation using a binaural renderer. My preferred approach is to encode an acoustic scene containing left, right, and center speakers using Ambisonics and then render it via KU-100 HRTFs. For this purpose, I use the IEM Ambisonics plugins: MultiEncoder and BinauralDecoder, configured for 6th-order Ambisonics and connected as follows:

3 Channel Source --> MultiEncoder ----> BinauralDecoder --> 2 Channels
                      -42° azimuth (L)                        L/R Ear
                       42° azimuth (R)                        Signals
                        0° azimuth (C)

This setup simulates an anechoic chamber with the KU-100 being at the center of a circle with a 3.25 meter radius (this was the distance used by B. Bernschütz when capturing KU-100 HRTFs), with speakers placed at 0° and ±42° in the horizontal plane. I’ve chosen the 42° angle not because it’s “the answer to everything” but rather because it’s the same angle that I have in my desktop setup.

Side note: although Ambisonics is prone to “spatial aliasing” and in theory requires very high orders for reproducing correct magnitude and phase at high frequencies, the MagLS method used by BinauralDecoder produces a correct magnitude response (giving up on phase) at high frequencies even with relatively low Ambisonics orders.
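To get an intuition for why the order matters, here is a horizontal-only (circular harmonics) sketch. The spatial response of an in-phase order-N beam is a sum of cosines, and it narrows as the order grows; this is a 2D simplification for illustration only, while the IEM plugins operate on full 3D spherical harmonics with more refined decoder weightings.

```python
import numpy as np

def circular_response(order, src_az_deg, look_az_deg):
    """Response of an in-phase circular-harmonics beam of a given order,
    pointed at look_az_deg, to a plane wave arriving from src_az_deg.
    Normalized so that an on-axis source gives 1.0."""
    d = np.deg2rad(src_az_deg - look_az_deg)
    # Sum of cos(m*d) for m = 0..order; equals order+1 when d = 0.
    resp = sum(np.cos(m * d) for m in range(order + 1))
    return resp / (order + 1)

# A 6th-order beam aimed at the center source (0 deg) vs the +/-42 deg
# speaker directions of our simulated setup:
print(circular_response(6, 0, 0))   # on-axis: exactly 1.0
print(circular_response(6, 42, 0))  # off-axis: much smaller in magnitude
```

The same off-axis source leaks far more strongly into a 1st-order beam than into a 6th-order one, which is why low orders blur the spatial scene.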

Our goal here is to check out two things:

  1. What is the transfer function (EQ) for compensating the HRTF of a source at 42° on the side of the head to sound like a source in front of the head? This is to check the EQ curve suggested by S. Linkwitz.

  2. What is the EQ for compensating the stereo phantom center to sound like a real, physical center? This way we will double-check the existence of the “phantom image problem”—as Toole calls it, see section 4.3.2 in the 4th edition of the “Sound Reproduction” book, in particular Figure 4.4(d) which demonstrates the phantom center impairment due to stereo cross-talk in the anechoic case.

There are two things about these measurements and derivations that we need to keep in mind:

  • The KU-100 HRTFs captured by B. Bernschütz and used by BinauralDecoder are symmetric. This is not the case for any real KU-100, since its pinnae, although closely matched, are still not absolutely identical, and this slightly affects the measurements at high frequencies. But this symmetry actually simplifies our task, since we only need to consider a speaker on one side.

  • When comparing the physical and phantom center, their levels must be aligned. If we just align the levels of the speakers, the phantom center will have its bass boosted by 6 dB relative to the physical center because of coherent summing. Unlike sound waves at midrange frequencies, bass waves are largely unaffected by the presence of a head or even a full human torso, so they combine mostly in phase, which gives them a considerable boost. I suppose Toole and his colleagues took this fact into account, as their transfer function looks flat in the bass region.
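The arithmetic behind this level alignment is simple: two coherent (in-phase) signals sum in amplitude, while two uncorrelated signals only sum in power.

```python
import numpy as np

# Two in-phase unit-amplitude signals sum to amplitude 2:
coherent_db = 20 * np.log10(2.0)    # amplitude (coherent) sum
# Two uncorrelated signals of equal power sum to twice the power:
incoherent_db = 10 * np.log10(2.0)  # power (incoherent) sum
print(coherent_db, incoherent_db)
```

So with per-speaker levels matched, the phantom center’s bass sits about 6 dB hot relative to a single physical center, and that offset has to be removed before the two can be compared fairly.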

So, here is the answer to the first question about the difference between a side source and the frontal source, and I’ve overlaid it with the EQ curve suggested by Linkwitz:

As we see, in general his curve agrees with the physical measurement. I would not expect these curves to match completely, because Linkwitz was tuning his curve in a real room. However, I must note that his curve is missing an important energy bump above 11 kHz that I can actually hear when comparing the phantom vs. physical center by ear—more on that later.

This is the EQ graph demonstrating the impairment of the phantom center compared to the physical center. I have overlaid it with the transfer function graph from Toole, but I inverted his graph because he shows how the phantom center is impaired, while I’m showing how the sound of the physical center could be equalized in an anechoic chamber. Note that I totally understand that minimum-phase compensation, like traditional EQ, can’t overcome destructive wave interference; however, I’m using EQ curves instead of transfer function curves everywhere for consistency.

Note that Toole’s data is only up to 5 kHz. The exact location of the EQ “hump” for correcting the dip does not match, probably due to differences in the geometry of the KEMAR and KU-100, but the effect is very similar.

When I checked the correction curve for the phantom center in the anechoic case (D/R = ∞ dB) by Gunnarson, it shows a 2 dB decline in the bass region, and the text confirms that this compensates for the summing of bass from the frontal pair. That means we can’t compare these curves directly with our curves from the last picture.

As you can see, different methods for measuring or calculating compensation curves for the phantom center yield different results. There is even more disagreement about the diffuse field equalization.

Simulating Diffuse Field via Ambisonics Binaural

Simulating an enveloping diffuse field using two speakers is definitely more challenging than simulating a discrete center. It is not even entirely clear what our “reference sound source” should be. We can imagine an ideal isotropic diffuse field which envelops the listener from all directions, but this would be impossible to reproduce using a pair of speakers placed in front of the listener, even with acoustic help from a good listening room.

Another issue is with the diffuse field transfer function—it’s usually very smooth because it’s an average over all source directions on the sphere, whereas the frontal transfer function always has a deep notch somewhere between 8–10 kHz due to destructive wave interference. As I noted above, equalization can’t compensate for it, especially under ideal anechoic conditions. So it’s unlikely that it’s even possible to fully converge these transfer functions.

If we consider the original goal of the diffuse field spectral correction, starting from the “BBC dip,” we can see that its original purpose was to compensate for the difference between the acoustic space of a listening room and a concert hall (many thanks to the late Linkwitz for the scan). Gunnarson in his work proposes using the diffuse field compensation as a way to align the sound of speakers with “a reference ideal diffuse sound field” (as I mentioned, this is impossible for a stereo setup); however, he also mentions that the BBC dip was intended for the same purpose.

So in my Ambisonics simulation I tried a couple of things. First I tried creating a lot of sources behind the listener, spread across the entire rear hemisphere, each playing its own random pink noise. When listened to in headphones via BinauralDecoder, it sounded quite enveloping. However, trying to equalize the frontal sources to have the same spectral profile yielded unsatisfactory results.

Then I restricted the simulated diffuse field to two uncorrelated rear sources placed symmetrically to the front sources, that is, at ±138°. This configuration looks similar to the classic quadraphonic sound rectangle. I recalled that the main acoustic flaws of quadraphonic were its inability to create stable side sources, and the “hole in the middle” due to the wider angle of the front speakers, but its reproduction of surround diffuse images was just fine. Interestingly, when the front sources were equalized to have the same spectral profile as these rear sources, they started sounding much more like the “full rear hemisphere” setup that I tried initially.

For comparison, here are my compensation curve, the BBC dip, and the diffuse field compensation curve for stereo speakers (D/R = 0 dB) which Gunnarson sees as a “more detailed correction” than the former:

Indeed, we can see that Gunnarson’s curve (green) has the same dip as the BBC EQ curve (yellow), and in general it follows the trend of the KU-100 simulation curve (blue), albeit much more smoothly.

Notes on Equalization Approaches

In the previous sections we touched on the question of equalization of the center image. Let’s consider how it can be achieved in practice. If we just apply our hypothetical “phantom center EQ” to the entire stereo signal, it will inevitably affect all directions, not just the phantom center. Ideally, we need to apply our EQ to the phantom center only. For that, the recording ideally needs to be object-based, that is, composed of individual audio tracks with attached coordinates that are used by a renderer which is aware of the actual speaker configuration being used. However, as I checked with the MPEG-H authoring plugin (version 4.0.0 from 2020), it does not apply any spectral compensation to objects at the frontal location when rendering to stereo speakers.

As for binaural rendering, the situation is different. Since a binaural renderer employs some kind of HRTF, the results will be similar to our anechoic simulation in the previous section. However, the use of a non-matching HRTF with headphones lacking individual calibration can easily produce tonal colorations of its own. Because of that, some binaural renderers use more conservative curves. For example, the paper “A Practical Approach to the Use of Center Channel in Immersive Music Production” by K. Richard et al. compares the Dolby Atmos binaural renderer vs. “true” binaural rendering using KU-100 HRTFs. From the illustrations we can see that the Dolby binaural renderer uses much smoother curves that can be considered more like “head-related equalization” rather than actual HRTFs.

For non-object-based recordings (that is, the majority of commercial recordings), the only way to faithfully extract center objects is to perform some neural network-based (or “AI,” to say it more fashionably) stem extraction, and effectively re-synthesize the acoustic scene. But this process is too complicated for me to try, and likely has its own caveats. A more realistic approach is to apply some kind of upmixing into multichannel, at least into LCR, use the resulting center channel as an approximation of the center objects, process this center with the “phantom center EQ,” and downmix back into stereo. Or start with a multichannel mix in the first place.

The interesting thing is that even multichannel and object-based mixes can have “phantom center” sources. As K. Richard’s paper states, “phantom center images are often preferred over a discrete center, because of the added spaciousness, envelopment, etc.” For example, if we look at Pink Floyd’s “Dark Side of the Moon” 5.1 mix from 2003, and analyze the correlation between the left and right channels on the section of the “Time” track with leading vocals (“Ticking away the moments that make up a dull day”), we can see that all three channels—Left, Center, and Right—are mutually correlated, and the levels of Left and Right are actually higher than that of the Center:

That means the producer intentionally wanted to achieve that classic feeling of the phantom center vocal, but reinforced it a bit with the physical center in order to avoid leaving a hole in the middle if the listener has a wide, home theater-like setup. The spectra of the left and right channels are not corrected for HRTF and are practically identical to the center:

I would hypothesize that since the level of the center channel here is much lower than that of the left and right combined, it gets psychoacoustically integrated with the phantom center, and the resulting spectral discrepancy goes unnoticed. Similarly to humans, automatic stereo-to-surround upmixers also rarely pull all correlated components into the center channel (they can do that, but the user has to enforce this setting), spreading them instead across the front channels.

So, even the use of a multichannel source (be it an actual multichannel mix or an upmix of a stereo source) still requires some work to find the correlated components that form the phantom center acoustic image. But as I noted in the post on LCR upmixing, extracting three channels from two is an ill-posed problem. While modern upmixers are excellent, they rely on active steering and decorrelation, which inevitably alters the phase relationships of the original stereo mix, often introducing artifacts on complex or uncorrelated signals.

Paradoxically, the cheapest and most reliable tool—mid/side processing—can provide better fidelity, avoiding phase artifacts because it does not create any new channels. By simply summing the signals from the left and right channels of a stereo recording, we get a 6 dB boost for strongly correlated components. Note that it does not completely isolate the center, so our equalization will affect side-panned sources as well, just to a lesser degree.
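A tiny numerical check of that 6 dB figure, using two arbitrary sine components: one panned to the center (identical in both channels) and one panned hard left.

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs
center = np.sin(2 * np.pi * 440 * t)      # identical in both channels
hard_left = np.sin(2 * np.pi * 1000 * t)  # present in the left channel only

left = center + hard_left
right = center.copy()
mid = left + right  # the plain L+R sum discussed in the text

def component_gain_db(mix, ref):
    # Project the mix onto a reference component and report its level change.
    a = np.dot(mix, ref) / np.dot(ref, ref)
    return 20 * np.log10(abs(a))

print(component_gain_db(mid, center))     # correlated content: about +6 dB
print(component_gain_db(mid, hard_left))  # hard-panned content: about 0 dB
```

The center-panned component is doubled in the sum (+6 dB), while the hard-panned component passes at unity, so equalizing the sum mostly, but not exclusively, targets the correlated center.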

Many equalizers can work in the “M/S mode”—they transform left/right stereo into mid/side, apply the EQ to these signals, and then transform them back into stereo. However, if they use minimum-phase EQ filters (IIR filters being a typical example), the change in phase that these filters inevitably apply to the M/S components creates leakage between the channels during the reverse transformation to stereo, as I have illustrated previously. Thus, a much cleaner approach is to use a linear-phase M/S equalizer which only affects the magnitude of the signals. Note that it’s not without drawbacks either—linear-phase filtering adds substantial latency and may also introduce pre-ringing artifacts.
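Here is a numpy-only sketch of what such a linear-phase M/S equalizer does internally. The FIR is built by simple frequency sampling; the filter length and FFT size are arbitrary choices, and a real plugin certainly uses a more refined design.

```python
import numpy as np

def design_linear_phase_fir(freqs_hz, gains_db, fs, numtaps=1023):
    """Frequency-sampling design: interpolate the target magnitude on an
    FFT grid, take a zero-phase inverse FFT, center it, and window it.
    The result is symmetric, hence exactly linear-phase."""
    n_fft = 4096
    grid = np.linspace(0, fs / 2, n_fft // 2 + 1)
    pts_f = [0.0] + list(freqs_hz) + [fs / 2]
    pts_g = [gains_db[0]] + list(gains_db) + [gains_db[-1]]
    mag = np.interp(grid, pts_f, 10 ** (np.array(pts_g) / 20))
    h = np.fft.irfft(mag)                   # zero-phase: peak at index 0
    h = np.roll(h, numtaps // 2)[:numtaps]  # center the peak
    return h * np.hanning(numtaps)

def ms_eq_mid_only(left, right, fir):
    """Apply the FIR to the mid signal only; the filter's constant group
    delay is trimmed off so the untouched side signal stays aligned."""
    mid, side = (left + right) / 2, (left - right) / 2
    delay = len(fir) // 2
    mid_eq = np.convolve(mid, fir)[delay : delay + len(mid)]
    return mid_eq + side, mid_eq - side
```

Because the filter is symmetric, it changes only the magnitude of the mid signal; with a flat 0 dB target the chain is a clean passthrough, which is exactly the leakage-free behavior that minimum-phase M/S equalizers cannot guarantee.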

But linear-phase filtering is what I use in practice anyway. If the intended equalization is relatively simple (like the Linkwitz EQ or the BBC dip), then a plugin like ToneControl by Goodhertz can suffice. However, for a more “surgical” kind of EQ, I use LP10 by DDMF. Of course, there is always the option of using a generic convolution plugin with a custom-made linear-phase FIR filter, in case the 10 bands provided by LP10 are not enough, or when we need to optimize the latency.

Note that Linkwitz did not mention using anything like M/S EQ for his proposed filter. I suppose he was applying it to the whole stereo signal? The same goes for the “BBC dip,” which is also considered a “speaker EQ.” This makes these approaches more like tweaks to the room/speaker target curve, rather than actual phantom source correction.

I think this part was long enough, so I will stop here. In the next part of the post, we will explore how real stereo speakers behave in a real room.

Monday, March 2, 2026

On Terminals, Emacs, and AI Coding

Not an audio post, just some thoughts on programming hackery, see also my old similar posts On Keyboards and Long Live 16 Color Terminals.

I think we have now entered the "Golden Age" of programming and of using computers in general—all thanks to AI. That's because programming tasks that once took days can now be finished in a couple of hours. A personal example of this is my experience extending the Emacs editor. This is an ages-old "universal" editor—its original concept was created in 1976 at the MIT AI Lab, while the GNU Emacs program that I use today was created in the mid-80s. However, it is still popular among programmers and geeks thanks to its infinite possibilities for customization and for embracing new technologies. I use it every day and continue customizing it to my needs.

Customization of Emacs is done by writing Emacs Lisp code. If your need is simple, like creating a custom action or fixing some annoying behavior, a small code snippet usually suffices. If your need is more serious, like enabling code highlighting and completion for some obscure programming language, you have to create a relatively big chunk of code, which in the Emacs world is called a "package." If you are lucky, someone has already faced the same problem and written a package to solve it. In that case, the only thing you need to do is plug this package into your configuration.

If your need is more or less unique for some reason, then you have to write the code yourself. It's not super hard, but it's tedious, mainly because you need to study extensive Emacs APIs and consider how to express the solution in terms of list structure manipulation. Also, you need to write code that handles errors, and make sure that your solution works with acceptable speed. In order to accomplish all this, you used to have to comb through Emacs documentation and source code, and if nothing helped, resort to seeking help on online forums—this is what programming used to be. Since I usually had to deal with this stuff during my work hours, the thought of writing a complex Emacs extension in my spare time sparked no joy.

Enter the Age of AI. Now, if I need to solve some simple Emacs customization task, or fix an annoyance, I can simply ask an LLM, "In Emacs, how do I ... ?", and it comes back with a helpful answer and a snippet of Emacs Lisp code which does the thing I asked about. In this way, I quickly resolved minor Emacs "friction points" that had bugged me for years. One big annoyance still remained though—adding support for properly displaying output from semi-interactive build and install scripts and programs in the "shell" and "compilation" modes of Emacs. To explain this problem, I first need to give a brief lesson in computer history.

A Brief Course in Unix Terminals (and Editors) Evolution

The interface between humans and computers has evolved constantly since their inception. At first, humans had to program computers—that is, set them up for solving a particular problem—by patching cables between physical ports, or flipping myriad switches (examples from Wikipedia). And computers presented the results of their work using arrays of lights. After the next step of computer interface evolution, people could enter data into computers using manually perforated cards, and the computer could print the calculation results on long rolls of paper, using motorized versions of typewriters. Both of these kinds of interfaces were not very interactive, and required a good amount of planning ahead in order to avoid wasting precious compute time.

Interactivity was improved by combining the aforementioned motorized printing mechanisms with a typewriter-like keyboard (see this article about the legendary "Model 33" teletype). Finally, humans could type their commands or code in, and the computer could print the result immediately. You might have noticed that this is already very similar to how we interact with AI assistants on the Web today, except that teletypes were much noisier.

Crucially, even at this early stage, a distinction between "symbol" and "control" characters was already apparent. When you see a printed page, you only see the letters of the alphabet and punctuation—those are symbols. However, to produce this page, the typewriter operator (be that a human or a computer) also needed to use commands in order to drive typewriter actions like advancing the paper roll by one step, or moving the print head forward or back. Each of these commands arrives at the typewriter on the same line as the printable symbols, and is encoded using a control character.

Thus, when a computer sends the result to a typewriter terminal, symbol characters are interleaved with control characters in the same "stream" of data. The same goes for the human—although most of the typewriter keys are for inserting symbols, some keys like "carriage return" and "backspace" are for sending commands. The computer also has a bonus command called "bell," which was adopted from telegraphy (so it actually predates computers). This command originally rang a physical bell inside the teletype machine. Nowadays, the computer just emits a short beep in order to attract the operator's attention.

Despite being quite basic, this typewriter interface had already opened a way to create interactive text editor programs. The best-known line-oriented text editor is ed. Its interface was designed to be very minimalist and terse, in order to save paper. ed was called "the standard text editor" in the Unix OS—and by now this is a hacker's joke. In fact, ed is still supplied with most Unix-descended OSes, including Linux distributions.

Upon launching ed, you see no prompt; the program simply waits for your commands. A command is typically just one character, plus a parameter. If ed does not understand your command, it prints ?, and that's it. Since the file you are editing can be lengthy, ed does not reveal its contents—instead, you have to explicitly ask it to show a range of lines, and of course you cannot edit them "in place"; you need to enter a command for making each edit. Also, since on a typewriter it's not possible to correct typing mistakes in place, there was no concept of "line editing" using "cursor left / right" commands—you could only use "backspace" and then re-type part of your command, or discard the entire command line and re-type it from scratch. Needless to say, editing text files using these basic capabilities required very good memory, skills, and a lot of patience.

Nevertheless, it was the best user interface that programmers had at that time, and in fact the early Unix OS code was written using ed. As Brian Kernighan recalls in the "UNIX: A History and a Memoir" book, there were three main components that allowed development of Unix for the PDP-7 computer on the computer itself: an editor (and that was ed!), an assembler, and a kernel.

In the next evolutionary step, paper teletypes were replaced by "glass teletypes." These used CRT displays instead of paper, and their keyboards started resembling modern ones. Note that these early glass teletypes lacked a scrollback buffer, so lines that scrolled away were gone forever. In some sense, this was a downgrade from paper rolls. On the other hand, since typed characters appeared on a screen, it became possible to make in-place corrections to the typed command text, and even move the cursor left and right in order to correct typos in the middle of the command—no more retyping!

The capabilities provided by this new type of terminal spurred improvements to the ed editor. First, it got a command for showing a whole page of text from the file being edited, taking advantage of the silent nature of video terminals. This version was called em. Note that em still had to be conservative in what it displayed, because the connection line between the terminal and the computer was often painfully slow. Implementing in-place editing for a whole document—"visual" editing—was not yet possible.

A lot of standard Unix programs, like bash, cat, and du, still operate in a similar line-oriented mode and thus remain technically compatible with typewriter-style terminals. Emacs exploits this fact by emulating a "dumb" terminal (that's the name for a terminal that understands only the most basic control characters) in its "shell" mode. But since Emacs runs on a real computer, its shell mode is much smarter: it can hold the screen history of your entire session, and you can go back to any previous command, change it as needed, and send it again.

Around the same time, similar capabilities appeared in new generations of terminals that got their own CPU and RAM, and thus could hold in their memory much more than just one screen of text. They also got colors! The companies making them coined the term "smart terminal." Terminal technologies became a hot topic among technology companies (very much like AI these days), and there was a "Cambrian explosion" of terminal models, each with its own set of features.

These new features of smart terminals gave birth to a whole new set of control commands. Since the controlled display area had expanded from a single line into a two-dimensional array, commands appeared for positioning the cursor on the screen and for clearing and scrolling it. For compatibility, and due to technical constraints, these new commands were no longer single characters (as "backspace" and "carriage return" are), but entire sequences of characters, starting with the "escape" control character.
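As a concrete illustration, here is what a few such escape-prefixed commands look like. This is a minimal sketch using the well-known ECMA-48/"VT100-style" sequences that most later terminals converged on, not the commands of any particular historical terminal model:

```python
# A few common ANSI escape sequences (ECMA-48 / "VT100-style").
# Each starts with the escape character \x1b followed by '[' (the CSI).
ESC = "\x1b"

def move_to(row: int, col: int) -> str:
    """Sequence that positions the cursor at (row, col), 1-based."""
    return f"{ESC}[{row};{col}H"

clear_screen = f"{ESC}[2J"   # erase the whole screen
red_text = f"{ESC}[31m"      # switch foreground color to red
reset_attrs = f"{ESC}[0m"    # back to default attributes
```

For example, `print(red_text + "error" + reset_attrs)` renders the word in red on a terminal that understands these sequences, and `move_to(1, 1)` produces the seven raw bytes `\x1b[1;1H`.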

Terminal control commands were still sent inline with the printable characters. When a smart terminal saw a command, it processed it immediately. The result was usually a cursor position change, a change of the current color, bold text being enabled, or something similar (note a certain similarity with HTML, except that unlike HTML tags, control commands have no closing pair). Any character outside of a command, the terminal just printed. To get a sense of how many terminal models and command variants there were, take a look at the "terminal information database" here.

Finally, smart terminals made visual editing possible, and the ex editor (a descendant of em) was reworked into vi. The first versions of vi were implemented on top of ex running in visual mode. You could still use the same commands that ex inherited from ed, but you could also navigate and scroll through the document you were editing. Modern versions of vi still offer these two modes of operation.

Of course, visual versions of other existing UNIX system utilities started to appear: for example, top is a visual interactive version of ps (process listing), and more and less are visual pagers, offering an alternative to the line-oriented cat.

By the way, that terminal information database I mentioned above was not created just for lessons in computer history. In fact, it solved the problem of standardizing the "terminal zoo." When a visual program runs, it needs to know what the terminal is capable of, and also the exact control character sequence for each terminal command (remember that there were hundreds of smart terminal models). There was a library called curses which acted as a translator between a program and a terminal. Unfortunately, a lot of modern command-line scripts and utilities are unaware of this translation mechanism and use control sequences "blindly," assuming that the terminal can interpret them correctly. In part this works, because these days we use "terminal emulator" programs that typically support the same set of control sequences. But when this is not the case, the user starts seeing a flurry of cryptic sequences that begin with ^[ (the escape character) in the program's output.
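One crude way to deal with such ^[ garbage is to filter it out. Below is a simplistic sketch of such a filter: it matches only CSI-style sequences with a fixed regular expression, whereas properly terminfo-aware code would consult the curses/terminfo database instead:

```python
import re

# Matches CSI sequences such as "\x1b[31m" or "\x1b[2J": the escape
# character, '[', optional parameter/intermediate bytes, and one final
# byte. This is a simplification; it does not cover OSC sequences and
# other escape families.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_csi(text: str) -> str:
    """Remove CSI control sequences, leaving only printable text."""
    return CSI_RE.sub("", text)
```

For instance, `strip_csi("\x1b[31merror\x1b[0m")` returns just `error`. Filtering is lossy, of course—the whole point of the post is to interpret some of these sequences instead of discarding them.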

Besides the need for standardization, another interesting engineering aspect that emerged was the possibility of terminal virtualization. Since, as I mentioned previously, terminal control characters travel in the same stream as the program's output, and cursor control commands from the user travel in the same stream as the program's input, the standard UNIX mechanism for I/O stream redirection made it possible to emulate terminal behavior within a visual program. Normally a visual program assumes that it uses the entire screen of a physical terminal (the terminal provides its dimensions in rows and columns). But if one program launches another, it can redirect the I/O streams of the child process into itself and maintain a virtual terminal for it. For example, the parent program may run several child processes and maintain a virtual screen for each of them (this is what the utility called screen does). Or it can allocate a subsection of the terminal (half of the screen, for example) to a child process, and run two of them side by side. Programs of this kind are called "terminal multiplexers," with tmux being a well-known example.
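The mechanism underneath this is the pseudo-terminal. Here is a minimal sketch using the Python standard library (on a Unix-like system): the parent allocates a pty pair and attaches a child process to the slave end, so the child believes it talks to a real terminal, while everything it writes arrives at the master end, where a multiplexer could render it into a virtual screen:

```python
import os
import pty
import subprocess

# Allocate a pseudo-terminal pair: the child process will believe it
# talks to a real terminal, while we hold the controlling (master) end.
master, slave = pty.openpty()

subprocess.run(["echo", "hello"], stdout=slave)
os.close(slave)

output = os.read(master, 1024)  # what a multiplexer would render
os.close(master)

# The pty line discipline turns the child's "\n" into "\r\n",
# just as a real terminal device would.
```

After this runs, `output` is `b"hello\r\n"`—note the inserted carriage return, a small but telling sign that the child was indeed writing to something terminal-like rather than to a plain pipe.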

An interesting question emerges: since there is only one user, with only one keyboard, if a terminal multiplexer is running two visual programs side by side, which one receives the input? The answer is that the input is received by the parent program, which then sends it to the child process that currently has "focus." To give the user the ability to switch focus, or to send any other command to the parent program, they need to prefix their input with a "control sequence" (or "escape sequence"). For example, for screen, the default control sequence is Ctrl-a. When screen receives it, it knows not to relay what follows to the child process, but to interpret it itself.

The virtualization can be arbitrarily nested. For example, you can launch screen, and inside it launch another instance of it, but then you need to be careful with control sequences. screen has a "send escape" command: in order to send a control command to the nested screen, we first send this command, which is interpreted by the parent screen; once the parent executes it (and thus passes the escape further down), the nested screen enters its command interpretation mode. If you need to send control commands to the nested screen frequently, you should really change its default "escape sequence" (to Ctrl-b, for example) so that the parent screen passes your input through directly.

Another problem that virtual terminal programs solve is working around the fact that in Unix, a process can be "bound" to only one terminal. This reflected the normal use case of a user logging into the OS from a terminal; when they log out, all their processes are automatically terminated. The only way for a process to outlive its terminal is to "detach" from it and become a "daemon" process. In fact, most of the OS's own processes are daemons, so they can run even when no users are logged in. However, users cannot easily interact with a daemon—normally its output goes into a log file, and commands are sent to it using UNIX signals.

Terminal multiplexers / emulators opened up a new possibility: user session persistence. Since they launch the user's program under a virtual terminal, it is not bound to a physical terminal, and can run until the next system restart, like a daemon. However, since the multiplexer also has a user-visible part, the program can still interact with the user normally. And in fact, only that user-visible part of the terminal multiplexer gets terminated if the user's physical terminal is disconnected from the OS. When the user reconnects, they can re-attach to the already running screen session and continue their work. But this feature makes the implementation of screen and tmux rather complicated, because the user can reconnect using a different kind of terminal. Essentially, the terminal multiplexer needs to translate the terminal control sequences that the user's visual program sends to the virtual terminal into equivalent control sequences for the current physical terminal.

What is Wrong with Terminal Emulation in Emacs?

With that history in mind, I can explain the Big Friction Point that I had with running command-line programs under Emacs.

Emacs entered the text editor scene much later than vi, and it was designed from the start to be a visual editor, so it does not have a line-oriented mode like vi does. Moreover, since Emacs pretends to be an operating system in itself (another popular hackers' joke), it offers both dumb terminal emulation (the "shell" and "compilation" modes) and full smart ("visual") terminal emulation. That means you can run vi inside Emacs if you wish. The caveat is that visual terminal emulation in Emacs goes against two important principles of its design.

First, very much like the Macintosh OS design, Emacs strives for unified key mappings across its editing modes, to avoid interrupting the user's "mental flow." However, a vi instance running inside an Emacs virtual terminal still expects standard vi input, and it assumes that it has full control of the user's input and output. So, as discussed above using screen as an example, the terminal emulator in Emacs normally needs to capture the entire user input and send it to vi, and to interrupt that, the user needs to send some "escape" command to Emacs. Thus, visual terminal emulators in Emacs also need at least two "modes," and this is inconvenient because it breaks the user's normal key chord control. The built-in emulator called term calls these modes char (in which user input goes to the child process) and line (in which the child process receives no input and the user can manipulate its output using normal Emacs commands). If you are interacting with both the nested app and the rest of Emacs, you need to switch between them frequently.

The second aspect of visual terminal emulation that goes against Emacs design is that a program written for a smart terminal can have only one output view (recall that in the design of Unix, an interactive program can be bound to a single terminal only). Both tmux and screen do allow connecting multiple clients to the same session—this is often used for "live" or pair programming sessions—but since the program "sees" only one terminal (the virtual terminal that tmux or screen emulates), it can adjust its view to only one terminal size. The terminal multiplexer has only two viable choices for which size to report to the nested app: either use the size of the smallest of all connected physical terminals, or report the size of the "current" one and accept a corrupted visual state on physical terminals whose size does not match. This problem is somewhat of a corner case for traditional terminal multiplexers, but since Emacs naturally allows viewing the same text file (abstracted into a "buffer") in multiple views simultaneously, the situation arises quite naturally there. And in this case, a buffer associated with the terminal of a visual program can be displayed correctly in only one view of that buffer.

So, basically, the existing terminal emulation solutions for Emacs come in two kinds. The first kind emulates a dumb terminal (as "shell" mode does), which allows the program's output to be transformed into a normal Emacs text buffer (with color attributes, thanks to the ansi-color package). This, in turn, allows the buffer to be manipulated using standard Emacs editing commands, and displayed simultaneously in views of different sizes. But if the user runs a utility that uses more "advanced" terminal control sequences, its output can go awry. As I mentioned before, a lot of terminal-based utilities, including build tools and even the OS's own tools, do not query the terminal type and just assume that they can use arbitrary terminal display tricks for their fancy progress bars.

The second kind of Emacs terminal emulator provides full screen emulation. From what I have seen on various forums, a lot of Emacs users sidestep the problem of garbled output by resorting to this kind of emulation; that is, they run their shells and builds in full terminal emulators under Emacs. Some of these emulators, like eat, try to solve the keyboard input problem by providing a third, "hybrid" mode—eat calls it semi-char—where most keystrokes are sent to the child process, but some are interpreted by Emacs as usual. So the user may be able to stay in "semi-char" mode longer while interacting with an app running under Emacs, but as soon as they need to, say, copy some output from it, they have to engage in mode switching, which can disrupt their mental flow.

So we see that the hybrid solution from eat is applied on the user input side. My idea was to make a hybrid on the program's output side instead: to evolve shell mode into a third kind of terminal emulator, which still mostly provides dumb terminal emulation, but allows the child process to use a subset of control sequences for displaying fancy progress statuses. After all, once the long action carried out by the app completes, it normally erases all its intermediate output, and the result looks very much like the output of a good old line-oriented program. To put it another way, I don't need to run vi in my "evolved shell," but I do need to be able to run apt-get install and observe a normal-looking progress bar instead of colorful garbage interleaved with control sequences that the Emacs "shell mode" does not understand.
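To illustrate the idea, here is a toy sketch in Python (not the actual Emacs Lisp implementation): interpreting just the carriage return and backspace on the output side is already enough to turn a typical self-overwriting progress-bar stream into the clean final text a line-oriented buffer can hold:

```python
def render(stream: str) -> str:
    """Apply \r (return to line start) and \b (back one column) the way
    a terminal would, overwriting earlier characters in place."""
    lines, line, col = [], [], 0
    for ch in stream:
        if ch == "\n":                 # commit the current line
            lines.append("".join(line))
            line, col = [], 0
        elif ch == "\r":               # carriage return: column 0
            col = 0
        elif ch == "\b":               # backspace: one column left
            col = max(0, col - 1)
        else:                          # printable: overwrite or append
            if col < len(line):
                line[col] = ch
            else:
                line.extend(" " * (col - len(line)))
                line.append(ch)
            col += 1
    lines.append("".join(line))
    return "\n".join(lines)
```

With this, a stream like `"Progress: 10%\rProgress: 99%"` renders as the single final line `Progress: 99%`—exactly the "looks like line-oriented output once the action completes" behavior described above. The real extension has to handle a larger subset of sequences (colors, erase-to-end-of-line, and so on), but the principle is the same.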

By the way, as I read in materials about the commercially unsuccessful planned successor of Unix—the Plan 9 OS—its terminal (called 9term) was built around similar ideas: treating program output as buffers of text and dropping support for terminal control sequences completely. 9term behaved more like the Emacs shell or eshell modes, representing program output as a stream of text and making the terminal basically a text editor that lets the user work with both the command history and the program output as if they were a text file. In Plan 9, if someone needed a visual program, they had to make it a GUI app, and building "TUI" interfaces was considered a thing of the past.

The Helping Hand of AI

By then, I had already been using AI in "assistant" mode for quite a while. That means asking questions and then copy-pasting fragments of AI-generated code—basically a replacement for Web and forum searches. But true agentic AI coding (also known as "vibe coding") is completely different, and I had to start learning it.

Luckily, I had already experienced something similar for a couple of years, thanks to the reliance of modern tech companies on "vendors," or what is called "outsourcing." In this model, if a programming task can be delegated to another company, this is preferred over spending the company's own engineers' time on it. This is just basic cost reduction. Since outsourcing companies are located in geographical areas with lower labor costs, they can offer much lower hourly rates.

So I was already doing a lot of programming tasks by just formulating a high-level idea of what should be done and how the produced code should be tested, then reviewing the code contributions from my vendors and passing my comments back to them. Does that sound familiar? Right—this is very similar to how "vibe coding" works, with two major differences: the cycle with vendors is typically longer, measured in days, so you don't get that "vibe" feeling; and the capabilities of human vendor programmers used to be better than those of AI agents. I said "used to be" because with late-2025 models I noticed a big shift in their programming abilities.

I decided the time was right to unleash the power of those new AI models to finally fix my Big Annoying Thing with Emacs. Since I knew what I needed to achieve, I did not have to ask the AI to come up with an "implementation plan." Instead, I started by writing tests for my new Emacs mode extension. For this, I still used AI in assistant mode, and it was really helpful for constructing those pesky ANSI escape sequences for the scenarios I cared about. The LLM was also able to analyze the full output of an apt package installation session to find out which terminal control sequences it uses, and to create a script that emulates its output.

Having these tests, I established a "continuous integration" (CI) loop: loading the code of my extension into Emacs (at this point, there was no code yet), launching those test scripts, and comparing the results with "golden" outputs that I had produced with screen. Time to unleash fully autonomous AI coding!

This is where the real fun began. I was using Gemini CLI, and at first I made the mistake of letting it use version 2.5 of the LLM. It was really struggling, to the point that it could not even write syntactically correct Lisp code. Lisp syntax is very minimalist and consists of lots of parentheses that need to be balanced; surprisingly, Gemini 2.5 had big problems with that. It actually broke my CI loop at first, because it was writing code that caused Emacs to fail to load the module, or to hang completely. This was something I had never experienced with vendors (I told you—so far, humans were the better programmers). After I made my CI loop more resilient, the AI entered a loop of its own, endlessly trying to fix the Lisp syntax and never succeeding, eventually falling into a mode where it continuously streamed its looping chain of thought into my terminal. Having wasted a couple of hours on this, I was about to give up and was considering switching back to assistant-mode coding.

But then I did two things: I switched to the "latest preview" model of Gemini, which was 3 at that time, and, again with the help of AI, improved the project instructions, specifically insisting that the agent write "Parinfer-compatible" code and verify it thoroughly. This was a night-and-day improvement—the agent finally managed to fix almost all failing tests by writing correct code, and I started feeling good vibes.

Over a week, during my spare time, we finished the implementation to the point that I could actually run my build script in the Emacs "compilation" buffer, and the output looked exactly as it does on a terminal with full capabilities. During this period, I followed the usual principles of Test-Driven Development: always write the test first, make sure the change fixes it without regressing anything else, then refactor the code and the tests. So it's like a real engineering cycle, except that I only had to type in plain English—no more coding myself.

I also realized that an AI agent is capable not only of writing code and tests, but of actually investigating problems, and it can even write its own tools for that—like a real human programmer! At that point my feelings toward the agent shifted, and I started to consider it a colleague—at least a robotic colleague, something like the WALL-E robot, maybe. I still had to help this robot fix Emacs Lisp parenthesis issues sometimes, mainly because I wanted to save my time, and also my money.

Yes, one thing I would like to mention is the cost of this exercise. I ended up spending about $55 on inference, which is of course not a lot, but keep in mind that this wasn't a big project either. So when I read about huge projects that involve hierarchies of AI agents, I think they can burn a lot of money each day! Besides all the useful work that agents do, when a problem gets really hard for them, they can easily dwindle down into a "confusion spiral," and all of that is at your expense! So be careful—I really would not let a swarm of agents work without close human supervision.

Parting Thoughts

If you are interested, the resulting Emacs extension is here. I called it comint-9term to indicate that it extends the comint (command interpreter) mode of Emacs and delivers the spirit of the 9term terminal from the Plan 9 OS.

The code is complete now; I'm just planning to keep fixing any edge cases that I encounter. After all, as I explained above, this hybrid terminal scenario is a bit unusual and operates on the boundary between dumb and smart terminals, so some scripts or programs with a super-creative approach to progress display may cause issues. But the AI added a "tracing" sub-mode, so whenever that happens, I can grab a trace and give it to my AI agent for analysis.

From this "vibe coding" experience, and also from reading about other people's experiences, I think this new mode of human-computer interaction is here to stay. Even if some of the companies currently making "frontier" LLMs collapse for economic reasons, the technology is out there, and people will find ways to make it more economical and efficient.

AI is definitely the new way of writing computer programs, and I think it may change how we treat our phones (or TVs, or cars). Since the appearance of the first iPhone, it has always annoyed me that smartphones and tablets were treated as "embedded" devices, meaning you had to program them using a "real" computer, despite the fact that the CPU power of modern phones is orders of magnitude greater than that of the supercomputers of the 1980s, let alone personal computers. Compared to a Z80 (the heart of the ZX Spectrum), a modern phone is like a starship compared to a bicycle.

I know, one big obstacle to using your phone for programming was the absence of a real keyboard. Since phones do not have a convenient keyboard, writing a program for them in the "traditional" way was a pain. But not anymore! With AI, it's finally possible to write a program for your phone using only your phone (in theory, at least), by talking to an agent that builds and debugs the app for you. Thus, personal devices can become something like what home computers were for kids 40 years ago. So, despite all the "AI gloom" regarding its economic effects, I look into the future with great enthusiasm.

Saturday, November 22, 2025

LXmini Full Range Driver Alternatives

There is one question that has never escaped the back of my mind since I built my desktop version of LXmini—is it possible to fix the distortion issue with the full range driver? This issue was also the major negative point in Erin's review of LXmini.

Because the full range driver used in LXmini—the SEAS FU10RB—was released 15 years ago, I thought: perhaps there are better alternatives on the market, developed with modern materials and better technologies, that provide a more even response (without the prominent bump between 1 and 2 kHz) and hopefully achieve a 10 dB or better improvement in the overall distortion level. I checked distortion measurements for various drivers available at Zaph|Audio and at Erin's Audio Corner, but could not find any small "full range" drivers that come across as obviously superior to the SEAS FU10RB in terms of distortion. One notable exception is the drivers made by Purifi; however, with the current global trade situation, getting them in the USA is very costly.

The Candidates

Scanning through the available stock at MadiSound and Parts Express, I came up with a very short list of candidates for high fidelity full range 3"/4" drivers:

  • SEAS MU10RB-SL—this is the "midrange driver" of the LX521 speaker (I guess the "SL" suffix stands for the initials of S. Linkwitz). The fun fact is that it has the same specs as the FU10RB, except for the properties of the suspension. The suspension in the "midrange" version is stiffer, which limits the cone excursion, and as a result the driver has poorer bass. But bass is not an issue for its use in LXmini—there is a woofer for that—and I was thinking that perhaps the difference in suspension has a positive effect on reducing distortion (spoiler: not so much!). I haven't found any reputable measurements for this driver, so I decided to try to measure it myself.

  • MarkAudio MAOP-5. This driver looks a bit exotic—it has no "spider" suspension part (the corrugated fabric supporting the cone), so, in contrast to the MU10RB-SL, this driver has less suspension force than the FU10RB. I was a bit suspicious about the consequences of this design decision, but I found no distortion measurements that would reveal the effect of this approach, so it was interesting to try to measure this driver myself.

  • Tang Band W3-1878 also has an unusual look, thanks to its massive motor and a specially designed "phase plug" which reminds me of the grilles used on some measurement microphones. This driver was measured by the same Erin on a Klippel jig back in 2011, but the results were not cross-posted to his site. In the post, Erin mentions that his distortion measurements are "relative," which most probably means they are not calibrated to a specific SPL standard. Frankly, I did not understand how to read that particular distortion graph, but from Erin's own comments, the distortion is at a good level.

And that's basically it! I used two more drivers for comparison:

  • One is obviously the FU10RB itself, to make sure that its measurements are taken under the same conditions as those of the contenders.
  • And the midrange driver that I recovered from a broken Cambridge Audio Minx Go portable speaker. It is built as a "full cone," and the cone is made of paper. I used this one as an "anchor" in my measurements—it would be a miracle if it could actually beat any of the speaker drivers above, and such a "miracle" would mean that my measurements had gone wrong.

Measurement Setup

I used my QuantAsylum stack consisting of the original QA401 analyzer, the QA460 transducer driver, the QA492 microphone preamp (this model is relatively new), and an Earthworks M30 microphone. I powered both the QA460 and the QA492 from a portable Jackery battery because my mains power is rather noisy, and the laptop was also running on battery power. Still, I initially had some issues with mains-induced noise, which I asked about on the QuantAsylum forum. As that thread indicates, I traced the issue down to a poorly shielded USB cable that I used to power the QA492. Also, after a conversation with Matt of QuantAsylum, I obtained shorting BNC plugs and used them to cover any unused inputs on both the QA401 and the QA492. This reduced electrical noise to a minimum.

Since the Earthworks M30 goes beyond the standard 20 kHz range, I ran the measurements up to 35 kHz, using a 192 kHz sampling rate on the QA401. I used only the log sweep method, which is sufficient to get a basic understanding of the non-linearities of the measured system. I did not have enough spare time to run the stepped sine method with sufficient resolution, and since I was interested in making relative comparisons, the log sweep was fine.
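For reference, the excitation signal of the log sweep method—an exponential sine sweep, often called a Farina sweep—can be sketched in a few lines. This is a generic textbook formula sketch, not the QA401 software's actual implementation:

```python
import math

def log_sweep(f1: float, f2: float, duration: float, rate: int) -> list:
    """Exponential sine sweep from f1 to f2 Hz over `duration` seconds.

    The instantaneous frequency grows exponentially, so each octave
    gets an equal share of the sweep time, and the harmonics produced
    by the device under test separate cleanly when the recording is
    deconvolved into an impulse response.
    """
    L = duration / math.log(f2 / f1)     # sweep "rate" constant
    K = 2 * math.pi * f1 * L             # phase scaling factor
    n = int(duration * rate)
    return [math.sin(K * (math.exp(i / (rate * L)) - 1)) for i in range(n)]
```

For example, `log_sweep(40, 35000, 0.1, 192000)` produces a 0.1-second sweep covering the 40 Hz–35 kHz range at the 192 kHz sampling rate mentioned above.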

Impedance, Sensitivity, and Efficiency

First, I measured each driver's impedance, sensitivity, and acoustic radiation efficiency.

Measuring impedance is straightforward with the QA460. Below is the summary graph for all drivers:

The impedance plots are abbreviated as follows:

  • CA is the Cambridge Audio driver;
  • FU is the SEAS FU10RB (the original driver of LXmini);
  • MA is the Mark Audio MAOP-5;
  • MU is the SEAS MU10RB-SL (the bass-reduced version of FU10RB);
  • TB is the Tang Band W3-1878.

We can see that both the MU10RB-SL and the MAOP-5 have a nominal 4 Ohm impedance. The FU10RB is also nominally 4 Ohm, but its actual impedance is more like 6 Ohm. Both the W3-1878 and the CA driver are honest 8 Ohm drivers, and the CA, being a true "midrange" driver, has a steeply increasing impedance above the midrange.

Sensitivity was measured by applying a 1 kHz test tone at 1 V RMS to the driver under test and measuring the resulting SPL at a 1 meter distance. Due to the no-baffle mounting, the result is lower than the "official" sensitivity spec. The difference between the drivers is not very critical because I use a 100 Watt amplifier and listen to the speakers from 70–100 cm, so even an 8 Ohm driver can work well, assuming it is acoustically efficient. Below are my results:

Driver SPL dBA (1kHz/1V/1m)
Cambridge Audio (CA) 60
FU10RB (FU) 62
MAOP-5 (MA) 74
MU10RB-SL (MU) 68
W3-1878 (TB) 68

It's interesting that the MAOP-5, despite having the same impedance at 1 kHz as the MU10RB-SL, ends up being louder. Also, the FU10RB is quieter than the W3-1878 (TB), despite the former's lower impedance. Is it because the non-linearity in the 1–2 kHz region causes losses in acoustic transfer? The Cambridge Audio driver is unsurprisingly the quietest, because at 1 kHz its effective impedance is 16 Ohm.

And this brings us to acoustic radiation efficiency. In this test I checked the SPL level for the same 1 kHz test tone at the same 1 m distance, but this time the level of the electrical input for each driver was adjusted to achieve 105 dB SPL at a 5 mm distance from the driver's cone. Note that this is different from sensitivity, and characterizes the ability of the cone to work efficiently as an ideal piston.

Driver SPL dBA (1kHz/1m)
Cambridge Audio (CA) 66
FU10RB (FU) 72
MAOP-5 (MA) 73
MU10RB-SL (MU) 72
W3-1878 (TB) 65

Here, both SEAS drivers and the MAOP-5 show almost the same result, while the W3-1878 and the CA driver are 6 dB worse. The point of measuring radiation efficiency is that even if one driver has lower distortion than another, and both have the same sensitivity, the less efficient one still has to be driven with a higher voltage to achieve the same SPL at the listening position.
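To make this trade-off concrete, here is the simple dB arithmetic (a generic calculation, not a measurement): since SPL scales as 20·log10 of the drive voltage in the linear regime, closing a 6 dB efficiency gap requires roughly twice the voltage, and thus about four times the electrical power into the same impedance:

```python
def voltage_ratio(delta_spl_db: float) -> float:
    # SPL scales as 20*log10 of drive voltage, so closing a
    # delta_spl_db shortfall needs 10**(delta_spl_db / 20) x voltage
    return 10 ** (delta_spl_db / 20)

def power_ratio(delta_spl_db: float) -> float:
    # electrical power goes as voltage squared (into the same impedance)
    return voltage_ratio(delta_spl_db) ** 2
```

So for the 6 dB gap measured above, `voltage_ratio(6)` is about 2.0 and `power_ratio(6)` is about 4.0—which also means more heat in the voice coil and, potentially, more distortion at the same listening level.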

Impulse and Frequency Response

Now for the results from the log sweep. Since I was using the drivers in an open configuration, the same way they are mounted in LXmini, and my room is small, I measured the log sweep in close proximity to the driver cone—about 5 mm.

Below are impulse and step responses of CA and W3-1878 drivers:

We can see that the CA driver is seriously damped and its impulse decays quickly. While the W3-1878 is also well damped, it exhibits quicker back-and-forth motion, allowing for wider high-frequency extension.

It's interesting to compare the SEAS "siblings":

They look similar, yet the "midrange" MU10RB-SL exhibits more back-and-forth motion about 200 μs after the initial impulse; after that, its oscillations become less severe, and the overall step response ends up being almost the same.

And now the interesting part: this is the IR of the MAOP-5 driver, computed from a 40–35000 Hz sweep:

The lack of damping is apparent here. Is it—is it good? Certainly not. Initially I thought that perhaps this is one of those "exotic" drivers that promise "euphonic" non-linearities for a pleasant sound. I started experimenting with the test setup: first I swapped the QA460 amp for the amp I actually use in my LXdesktop setup, the QSC SPA4-100, but the IR stayed the same. Then I started playing with the parameters of the sweep and figured out that if I limit the high-frequency range to the standard 20 kHz, the ringing is gone:

The IR still has more fluctuations than the IRs of the other drivers, but at least there are no high-frequency modulations anymore. I suppose the stiff material of the driver's cone causes it to go into very high-frequency oscillations. Although these should be above human hearing, they can still cause more non-linear behavior when excited. For this driver, a low-pass filter is strictly required.

And below are pairwise comparisons of the frequency responses of all drivers, corresponding to these IRs. In fact, the response of the MAOP-5 driver does not change in the 40–20000 Hz range regardless of the sweep's upper bound frequency:

These are "nearfield" responses, so they are not very useful for evaluating a dipole. Still, we can see that the CA driver is indeed a "midrange" driver with a steep downward slope after 10 kHz, so it is clearly unsuitable for use in LXmini-based designs.

There is an expected difference between the FU10RB and MU10RB-SL in the bass response; otherwise they are indeed very similar. And finally, this is the comparison between the MAOP-5 and the FU10RB:

We can see that the overall shapes of the frequency responses are close. One interesting point is the high-frequency behavior of these full range drivers. Since all the cones operate in "break-up" mode, the cone material affects the response a lot. We can see that both the MAOP-5 and the W3-1878 have a null at about 9 kHz in this arrangement, while both SEAS drivers have two: near 5 and 7.5 kHz.

Distortion

Finally, the graphs we have been looking for. As I mentioned, the distortion measurement is derived from the same logsweep; I did not use the stepped sine method. However, I performed the sweep at two SPL levels (as measured near the driver cone): 105 dB and 96 dB. Below are the measurements for each driver at the higher level, showing the 2nd to 4th harmonics (the levels of the other harmonics are benign):

And the summary graph comparing them:

Looking at the graphs, almost all the drivers seem to be in the same league. I do not see an obvious winner, but I do see an obvious loser: the CA driver, again. As for the others: as we know, the FU10RB has higher distortion levels in the midrange, and the MU10RB-SL is not much better; it also has strange peaks between 2–3 kHz and 3–4 kHz, although they are very sharp and thus likely not audible. The MAOP-5 driver has issues in the 2–3 kHz region, while the W3-1878 looks the most linear, with the exception that its distortion increases seriously past 10 kHz.

This is a comparative graph at the 96 dB output level:

The peaks after 5 kHz seem to be a measurement artifact, as I see them for all the drivers; they are simply drowned in noise for the other drivers but clearly visible for the W3-1878, which was measured on a different day. I suppose that for cleaner results I would need to use the stepped sine method.
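For readers wondering how a single logsweep yields per-harmonic distortion at all: with an exponential sweep, the k-th harmonic produced by the driver deconvolves into a separate impulse response that precedes the linear one by a fixed time offset, so each harmonic can be windowed out and analyzed individually (Farina's method). The offset depends only on the sweep duration and its frequency span; a sketch (function name mine):

```python
import math

def harmonic_offset(k, sweep_len, f1, f2):
    """Time in seconds by which the k-th harmonic's impulse response
    precedes the linear (k = 1) response when deconvolving an
    exponential sweep from f1 to f2 Hz (Farina's method)."""
    return sweep_len * math.log(k) / math.log(f2 / f1)

# For a 10 s sweep from 20 Hz to 20 kHz, the 2nd harmonic's IR
# arrives about 1 second before the linear response.
```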

Conclusions

We can see that the MU10RB-SL variant is only slightly better than the original FU10RB, not significantly so. While the W3-1878 driver can be considered the winner from the distortion perspective, recall that it has lower acoustic radiation efficiency, which means I would need to drive it harder in order to achieve the desired loudness at the listening position. So it looks like, in order to make the final decision, I will need to build one sample of the LXdesktop with the MAOP-5 driver and one sample with the W3-1878, and compare them with my original LXdesktop speaker, with all samples tuned to the same target, of course. That should be a fun experiment, and I'm looking forward to it!