Monday, March 2, 2026

On Terminals, Emacs, and AI Coding

Not an audio post, just some thoughts on programming hackery. See also my older posts in a similar vein: On Keyboards and Long Live 16 Color Terminals.

I think we have now entered the "Golden Age" of programming, and of using computers in general—all thanks to AI. Programming tasks that once took days can now be finished in a couple of hours. A personal example of this is my experience extending the Emacs editor. This is an ages-old "universal" editor: its original concept was created in 1976 at the MIT AI Lab, while the GNU Emacs program that I use today was created in the mid-80s. Yet it remains popular among programmers and geeks thanks to its near-infinite customizability and its openness to new technologies. I use it every day and continue customizing it to my needs.

Customizing Emacs is done by writing Emacs Lisp code. If your need is simple, like creating a custom action or fixing some annoying behavior, a small code snippet usually suffices. If your need is more serious, like enabling code highlighting and completion for some obscure programming language, you need to create a relatively big chunk of code, which in the Emacs world is called a "package." If you are lucky, someone has already faced the same problem and written a package to solve it. In that case, the only thing you need to do is plug this package into your configuration.

If your need is more or less unique for some reason, then you have to write the code yourself. It's not super hard, but it is tedious, mainly because you need to study extensive Emacs APIs and figure out how to express the solution in terms of list-structure manipulation. You also need to write code that handles errors, and make sure that your solution runs at an acceptable speed. To accomplish all this, you used to have to comb through Emacs documentation and source code, and if nothing helped, resort to seeking help on online forums—this is what programming used to be. Since I usually had to deal with this stuff during my work hours, the thought of writing a complex Emacs extension in my spare time sparked no joy.

Enter the Age of AI. Now, if I need to solve some simple Emacs customization task, or fix an annoyance, I can simply ask an LLM, "In Emacs, how do I ... ?", and it comes back with a helpful answer and a snippet of Emacs Lisp code that does what I asked for. In this way, I quickly resolved minor Emacs "friction points" that had bugged me for years. One big annoying thing still remained though: adding proper display of the output from semi-interactive build and install scripts and programs in the "shell" and "compilation" modes of Emacs. To explain this problem, I first need to give a brief lesson in computer history.

A Brief Course in the Evolution of Unix Terminals (and Editors)

The interface between humans and computers has evolved constantly since their inception. At first, humans had to program computers—that is, set them up for solving a particular problem—by patching cables between physical ports or by flipping myriad switches (examples from Wikipedia). Computers, in turn, presented the results of their work using arrays of lights. After the next step in interface evolution, people could enter data into computers using hand-punched cards, and the computer could print the calculation results on long rolls of paper using motorized versions of typewriters. Neither of these interfaces was very interactive, and both required a good amount of planning ahead in order to avoid wasting precious compute time.

Interactivity was improved by combining the aforementioned motorized printing mechanisms with a typewriter-like keyboard (see this article about the legendary "Model 33" teletype). Finally, humans could type their commands or code in, and the computer could print the result immediately. You might have noticed that this is already very similar to how we interact with AI assistants on the Web today, except that teletypes were much noisier.

Crucially, even at this early stage, a distinction between "symbol" and "control" characters was already apparent. When you look at a printed page, you only see the letters of the alphabet and punctuation: those are symbols. However, to produce this page, the typewriter operator (be it a human or a computer) also needed some commands to drive typewriter actions, like advancing the paper roll by one step, or moving the print head forward or back. Each of these actions came to the typewriter over the same line as the printable symbols, encoded as a control character.

Thus, when a computer sends its result to a typewriter terminal, symbol characters are interleaved with control characters in the same "stream" of data. The same is true for the human: although most of the typewriter keys insert symbols, some keys, like "carriage return" and "backspace," send commands. The computer also has a bonus command called "bell," adopted from telegraphy (so it actually predates computers). This command originally rang a physical bell inside the teletype machine. Nowadays, the computer just emits a short beep to attract the operator's attention.
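To make this interleaving concrete, here is a small Python sketch (my own illustration, not anything from Emacs or the teletype era) that builds such a mixed stream and naively emulates how a typewriter-style terminal would act on a single line of it:

```python
# Control characters travel in the same stream as printable symbols.
CR = "\r"   # carriage return: move the "print head" back to column 0
BS = "\b"   # backspace: step one column to the left (without erasing)
BEL = "\a"  # bell: ring the (now virtual) bell; prints nothing

def render(stream: str) -> str:
    """Naively emulate a typewriter-style terminal for one line of output."""
    line, col = [], 0
    for ch in stream:
        if ch == BS:
            col = max(0, col - 1)       # move back without erasing
        elif ch == CR:
            col = 0                     # return to the start of the line
        elif ch == BEL:
            pass                        # audible only, no visible effect
        elif col == len(line):
            line.append(ch)             # type at the end of the line
            col += 1
        else:
            line[col] = ch              # overtype an existing character
            col += 1
    return "".join(line)

# "Hello", then backspace over the "o" and overtype "p!":
print(render("Hello" + BS + "p!" + BEL))  # -> "Hellp!"
```

On real paper, of course, overtyping produced two superimposed characters rather than a clean replacement; glass terminals, described below, made the replacement genuine.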

Despite being quite basic, this typewriter interface already opened the way to interactive text editor programs. The best-known line-oriented text editor is ed. Its interface was designed to be very minimalist and terse, in order to save paper. ed was called "the standard text editor" in the Unix OS, and calling it that is now a hacker's joke. In fact, ed is still supplied with most OSes that descend from Unix.

Upon launching ed, you see no prompt; the program simply waits for your commands. A command is typically just one character, plus parameters. If ed does not understand your command, it prints ?, and that's it. Since the file you are editing can be lengthy, ed does not reveal its contents; instead, you have to explicitly ask it to show a range of lines, and of course you cannot edit them "in place": you need to enter a command for each edit. Also, since on a typewriter it's not possible to correct typing mistakes in place, there was no concept of "line editing" with "cursor left / right" commands; you could only use "backspace" and then re-type part of your command, or discard the entire command line and re-type it from scratch. Needless to say, editing text files with these basic capabilities required very good memory, skills, and a lot of patience.

Nevertheless, it was the best user interface that programmers had at the time, and in fact the early Unix OS code was written using ed. As Brian Kernighan recalls in his book "UNIX: A History and a Memoir," there were three main components that allowed development of Unix for the PDP-7 computer on the computer itself: an editor (and that was ed!), an assembler, and a kernel.

In the next evolutionary step, paper teletypes were replaced by "glass teletypes." These used CRT displays instead of paper, and their keyboards started resembling modern ones. Note that these early glass teletypes lacked a scrollback buffer, so lines that scrolled off the screen were gone forever. In some sense, this was a downgrade from paper rolls. On the other hand, since typed characters appeared on a screen, it became possible to make in-place corrections to the typed command, and even move the cursor left and right to fix typos in the middle of it: no more retyping!

The capabilities of this new type of terminal spurred improvements to the ed editor. First, it gained a command for showing a whole page of text from the file being edited, taking advantage of the silent nature of video terminals. This version was called em. Note that em still had to be conservative in what it displayed, because the connection between the terminal and the computer was often painfully slow. Implementing in-place editing of a whole document—"visual" editing—was not yet possible.

A lot of standard Unix utilities, like bash, cat, and du, still operate in a similar line-oriented mode and thus remain technically compatible with typewriter-style terminals. Emacs exploits this fact by emulating a "dumb" terminal (another name for the kind of terminal that only understands basic cursor-moving commands) in its "shell" mode. But since Emacs runs on a real computer, its shell mode is much smarter: it can hold the screen history of your entire session, and you can go back to any previous command, change it as needed, and send it again.

Back then, similar capabilities had also appeared in new generations of terminals that got their own CPU and RAM, and thus could hold in memory much more than one screen of text. They also got colors! The companies making them coined the term "smart terminal." Terminal technologies became a hot topic among technology companies (very much like AI these days), and there was a "Cambrian explosion" of terminal models, each with its own set of features.

These new features of smart terminals gave birth to a whole new set of control commands. Since the controlled display area had expanded from a single line into a two-dimensional array, there were now commands to control the cursor position on the screen and to perform screen clearing and scrolling. For compatibility, and due to technical constraints, these new commands were not single characters anymore (as "backspace" and "carriage return" are), but entire sequences of characters, starting with the "escape" control character.

The terminal control commands were still sent inline with the printable symbols. When a smart terminal saw a command, it processed it immediately. This usually resulted in a cursor position change, a change of the current color, enabling a bold font, or something else (one can note some similarity with HTML here, except that unlike HTML tags, control commands do not have a closing pair). Any symbol character outside of a command the terminal just printed. To get a sense of how many terminal models and kinds of commands there were, take a look at the "terminal information database" here.
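For illustration, here is how a few of these escape sequences look when constructed by hand in Python. The helper names are my own; the sequences themselves are the standard ANSI/VT100 ones that most modern terminal emulators understand:

```python
ESC = "\x1b"          # the "escape" control character (0x1B)
CSI = ESC + "["       # Control Sequence Introducer of ANSI terminals

def cursor_to(row: int, col: int) -> str:
    """CUP: move the cursor to (row, col), 1-based."""
    return f"{CSI}{row};{col}H"

def sgr(*params: int) -> str:
    """SGR: set colors and text attributes (1 = bold, 31 = red, 0 = reset)."""
    return CSI + ";".join(map(str, params)) + "m"

CLEAR = CSI + "2J"    # ED: erase the whole display

# A fragment a full-screen program might emit: clear the screen, jump to
# the top-left corner, and print a bold red word.
fragment = CLEAR + cursor_to(1, 1) + sgr(1, 31) + "ERROR" + sgr(0)
print(repr(fragment))
```

Printing `fragment` to a compatible terminal would actually clear the screen; printing its `repr` instead shows the raw bytes, which is exactly the "cryptic `^[` garbage" you see when a terminal does not understand them.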

Finally, smart terminals created the possibility of visual editing, and the ex editor was reworked into vi. The first versions of vi were implemented by relying on ex running in visual mode. You could still use the same commands that ex had inherited from ed, but you could also navigate and scroll the document you were editing. Modern versions of vi still use these two modes of operation.

Of course, visual versions of other existing Unix system utilities started to appear; for example, top is a visual interactive version of ps (process listing), and more and less are visual pagers, offering an alternative to the line-oriented cat.

By the way, that terminal information database I mentioned above was not created just for lessons in computer history. In fact, it solved the problem of standardizing the "terminal zoo." When a visual program runs, it needs to know what the terminal is capable of, and also the exact control character sequence for each terminal command (remember that there were hundreds of smart terminal models). There was a library called curses which acted as a translator between a program and a terminal. Unfortunately, a lot of modern command-line scripts and utilities are unaware of this translation mechanism and use control sequences "blindly," assuming that the terminal will interpret them correctly. In part, this works because these days we use "terminal emulator" programs that typically understand the same set of control sequences. But when this is not the case, the user starts seeing a flurry of cryptic sequences that start with ^[ (the escape character) in the program's output.
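Python ships a binding to this very mechanism in its curses module. As a hedged sketch, this is roughly how a well-behaved program would ask the terminfo database for the right sequences instead of hard-coding them (assuming a terminfo entry for "xterm" is installed on the system):

```python
import curses

# Look up the control sequences for a particular terminal model ("xterm")
# in the terminfo database instead of hard-coding escape sequences.
curses.setupterm("xterm")

clear = curses.tigetstr("clear")  # the sequence that clears the screen
cup = curses.tigetstr("cup")      # parameterized "move cursor" template

if cup is not None:
    # tparm substitutes the parameters: row 5, column 10 (0-based here)
    move = curses.tparm(cup, 5, 10)
    print(repr(move))
print(repr(clear))
```

A program written this way would emit the correct bytes for any of those hundreds of terminal models, which is exactly the translation step that "blind" utilities skip.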

Besides the need for standardization, another interesting engineering aspect that emerged was the possibility of terminal virtualization. Since, as I mentioned previously, terminal control characters travel in the same stream as the program's output, and user cursor control commands travel in the same stream as the program's input, the standard Unix mechanism for I/O stream redirection made it possible to emulate terminal behavior within a visual program. Normally a visual program assumes that it uses the entire screen of a physical terminal (the terminal provides its dimensions in rows and columns). But if one program launches another, it can direct the I/O streams of the child process into itself and maintain a virtual terminal for it. For example, the parent program may run several child processes and maintain a virtual screen for each of them (this is what the utility called screen does). Or it can allocate a subsection of the terminal (half of the screen, for example) to a child process, and run two of them side by side. Virtual terminal manager programs of this kind are called "terminal multiplexers," with tmux being a well-known example.
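The building block for all of this on Unix is the pseudo-terminal. Here is a minimal Python sketch of the idea (assuming a Unix-like system): the parent holds the "master" end of the pair, while the child, attached to the "slave" end, genuinely believes it is talking to a real terminal:

```python
import os
import pty
import subprocess
import sys

# Allocate a pseudo-terminal pair: the parent reads through the "master"
# end everything the child writes to the "slave" end, which is the core
# trick behind terminal multiplexers like screen and tmux.
master, slave = pty.openpty()

child = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"],
    stdout=slave, stderr=slave, close_fds=True,
)
child.wait()
os.close(slave)  # the parent keeps only the master end

output = os.read(master, 1024).decode()
os.close(master)
print(output.strip())  # the child saw a (virtual) terminal: prints "True"
```

Had the parent used a plain pipe instead of a pty, the child's `isatty()` would have reported `False`, and a full-screen program would have refused to draw its interface at all.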

An interesting question emerges: since there is only one user, with only one keyboard, if a terminal multiplexer is running two visual programs side by side, which one receives the input? The answer: the input is received by the parent program, which then sends it to the child process that currently has "focus." To give the user the ability to switch focus, or to send any other command to the parent program, they need to prefix their input with a "control sequence" (or "escape sequence"). For example, for screen, the default control sequence is Ctrl-a. When screen receives it, it understands that it should not relay what follows to the child process, but rather interpret it itself.

The virtualization can be arbitrarily nested. For example, you can launch screen, and inside it launch another instance of it, but you need to be careful with control sequences. screen has a command for sending its escape character down to the child, so in order to send a control command to the nested screen, you first type the escape, which the parent screen interprets, followed by the command that forwards the escape further down; only after that does the nested screen enter its command interpretation mode. If you need to send control commands to the nested screen frequently, you should really change its default escape sequence (to Ctrl-b, for example) so that the parent screen passes it through directly.

Another problem that virtual terminal programs solve is working around the fact that in Unix, a process can be "bound" to only one terminal. This reflected the typical use case of a user logging into the OS from a terminal; when they log out, all their processes are automatically terminated. The only way for a process to outlive its terminal is to "detach" from it and become a "daemon" process. In fact, most of the OS's own processes are daemons, so they can run even when no users are logged in. However, users cannot easily interact with a daemon: normally, its output goes into a log file, and commands are sent to it using Unix signals.

Terminal multiplexers and emulators created a new possibility: user session persistence. Since they launch the user's program under a virtual terminal, it is not bound to a physical terminal and can run until the next system restart, like a daemon. However, since the multiplexer also has a user-visible part, that user program can still interact with the user normally. Only the user-visible part of the terminal multiplexer gets terminated if the user's physical terminal is disconnected from the OS. When the user reconnects, they can re-attach to the already running session of screen and continue their work. But this feature makes the implementation of screen and tmux rather complicated, because the user can reconnect using a different kind of terminal. The terminal multiplexer essentially needs to adapt the control sequences that the user's visual program sends to the virtual terminal into the equivalent control sequences of the current physical terminal.

What is Wrong with Terminal Emulation in Emacs?

With that history in mind, I can explain the Big Friction Point that I had with running command-line programs under Emacs.

Emacs entered the text editor scene much later than vi, and it was designed from the start to be a visual editor, so it does not have a line-oriented mode like vi does. Moreover, since Emacs pretends to be an operating system in itself (another popular hacker's joke), it offers both dumb terminal emulators (the "shell" mode and the "compilation" mode) and full smart (that is, visual) terminal emulators. That means you can run vi inside Emacs if you wish. The caveat with visual terminal emulation in Emacs is that it goes against two important principles of its design.

First, very much like the Macintosh OS design, Emacs strives for unified key mappings across its editing modes, to avoid interrupting the user's "mental flow." However, a vi instance running inside an Emacs virtual terminal still expects standard vi input, and it assumes that it has full control over the user's input and output. So, as discussed above using screen as an example, the terminal emulator in Emacs normally needs to capture the entire user input and send it to vi, and to interrupt that, the user needs to send some "escape" command to Emacs. Thus, visual terminal emulators in Emacs also need at least two "modes," and this is inconvenient because it breaks the user's normal key chord workflow. The built-in emulator called term calls these modes char (in which user input goes to the child process) and line (in which the child process receives no input and the user can manipulate its output using normal Emacs commands). If you are interacting with both the nested app and the rest of Emacs, you need to switch between them frequently.

The second way visual terminal emulation goes against the Emacs design is that a program written for a smart terminal can only have one output view (recall that in the Unix design, an interactive program can be bound to a single terminal only). Both tmux and screen do allow connecting multiple clients to the same session, which is often used for "live" or pair programming sessions, but since the program "sees" only one terminal—the virtual terminal which tmux or screen emulates—it can adjust its view to one terminal size only. The terminal multiplexer has only two viable choices for which terminal size to report to the nested app: either use the size of the smallest connected physical terminal for all of them, or report the size of the "current" one and accept a corrupted visual state on the physical terminals of non-matching size. This problem is somewhat of a corner case for traditional terminal multiplexers, but since Emacs allows viewing the same text file (which is abstracted into a "buffer") in multiple views simultaneously, the situation arises there quite naturally. And in this case, a buffer associated with the terminal of a visual program can be displayed correctly in only a single view.

So, basically, existing terminal emulation solutions for Emacs come in two kinds. The first kind simulates a dumb terminal (as "shell" mode does), which allows transforming the output of the program into a normal Emacs text buffer (with color attributes, thanks to the ansi-color package). This, in turn, allows the buffer to be manipulated using standard Emacs editing commands, and also allows displaying it simultaneously in views of different sizes. But if the user runs a utility that uses more "advanced" terminal control sequences, its output can go awry. As I mentioned before, a lot of terminal-based utilities, including build tools and even the OS's own tools, do not query the terminal type and just assume that they can use arbitrary terminal display tricks for their fancy progress bars.

The second kind of Emacs terminal emulator provides full screen emulation. From what I have seen on various forums, a lot of Emacs users sidestep the problem of garbled output by resorting to this kind of emulation; that is, they run their shells and builds in full terminal emulators under Emacs. Some of these emulators, like eat, try to solve the keyboard input problem by providing a third "hybrid" mode—eat calls it semi-char—where most keys are sent to the child process, but some are interpreted as usual by Emacs. So the user can perhaps stay in the "semi-char" mode for longer while interacting with an app running under Emacs, but as soon as they need to, say, copy some output from it, they have to engage in mode switching, which can be disruptive to their mental flow.

So we see that the hybrid solution from eat is applied on the user input side. My idea was to make a hybrid on the program's output side instead: that is, to evolve shell mode into a third kind of terminal emulator, which still mostly performs dumb terminal emulation, but allows the child process to use a subset of control sequences for displaying fancy progress statuses. After all, once the long action carried out by the app completes, it normally erases all of its intermediate output, and the result looks very much like the output of a good old line-oriented program. To put it another way, I don't need to run vi in my "evolved shell," but I do need to be able to run apt-get install and observe a normal-looking progress bar instead of colorful garbage interleaved with control characters that the Emacs "shell mode" does not understand.
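To sketch the idea (this is my illustrative Python model, not the actual Emacs Lisp of the extension): interpret only carriage return, line feed, "cursor up," and "erase line" against a plain text buffer, which is already enough for typical progress bar output to collapse into its final state:

```python
import re

# Matches ANSI CSI sequences such as ESC [ 2 A ("cursor up 2 lines").
CSI_RE = re.compile(r"\x1b\[([0-9;]*)([A-Za-z])")

def render(stream: str) -> str:
    """Render a stream onto a text buffer, honoring a small subset of
    control sequences: CR, LF, CUU (cursor up), EL (erase to end of line)."""
    lines, row, col, i = [[]], 0, 0, 0
    while i < len(stream):
        m = CSI_RE.match(stream, i)
        if m:
            params, final = m.groups()
            n = int(params) if params.isdigit() else 1
            if final == "A":                  # CUU: move the cursor up n lines
                row = max(0, row - n)
                col = min(col, len(lines[row]))
            elif final == "K":                # EL: erase from cursor to line end
                del lines[row][col:]
            # any other sequence is silently dropped in this sketch
            i = m.end()
        elif stream[i] == "\r":
            col = 0                           # back to the start of the line
            i += 1
        elif stream[i] == "\n":
            row += 1
            col = 0
            if row == len(lines):
                lines.append([])
            i += 1
        else:
            if col == len(lines[row]):
                lines[row].append(stream[i])  # append at end of line
            else:
                lines[row][col] = stream[i]   # overtype an existing character
            col += 1
            i += 1
    return "\n".join("".join(l) for l in lines)

# A progress bar that repeatedly overwrites itself collapses to its final state:
print(render("Progress: 10%\rProgress: 100%\nDone"))  # -> "Progress: 100%\nDone"
```

The result is still an ordinary, editable text buffer, unlike the fixed-size virtual screen a full terminal emulator would maintain.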

By the way, as I read in materials about the unsuccessful planned successor of Unix—the Plan 9 OS—its terminal (called 9term) was built around similar ideas: treating program output as buffers of text and dropping support for terminal control sequences completely. 9term behaved more like the Emacs shell or eshell modes, representing program output as a stream of text and making the terminal basically a text editor, which lets the user work both with the history of commands and with the program output as if they were a text file. In Plan 9, if someone needed a visual program, they had to make it a GUI app; building "TUI" interfaces was considered a thing of the past.

The Helping Hand of AI

By then, I had already been using AI in "assistant" mode for quite a while. That means asking questions and then copy-pasting fragments of AI-generated code; basically, it's a replacement for Web and forum searches. But true agentic AI coding (also known as "vibe coding") is completely different, and I had to start learning it.

Luckily, I had already experienced something similar for a couple of years, thanks to the reliance of modern tech companies on "vendors," or what is called "outsourcing." In this model, if a programming task can be delegated to another company, this is preferred over spending the company's own engineers' time on it. This is just a basic cost reduction effort: since outsourcing companies are located in geographical areas with lower labor costs, they can offer much lower hourly rates.

So, I was already doing a lot of programming tasks in a manner where I would just formulate some high-level idea of what should be done and how the produced code should be tested. Then I would review the code contributions from my vendors and pass my comments back to them. Does that sound familiar? Right, this is very similar to how "vibe coding" works, with two major differences: the cycle with vendors is typically longer, measured in days, so you don't get that "vibe" feeling, and the capabilities of human vendor programmers used to be better than those of AI agents. I say "used to be" because with the late 2025 models I noticed a big shift in their programming abilities.

I decided the time was right to unleash the power of those new AI models to finally fix my Big Annoying Thing with Emacs. Since I knew what I needed to achieve, I did not have to ask the AI to come up with an "implementation plan." Instead, I started by writing tests for my new Emacs mode extension. For this, I still used AI in assistant mode, and it was really helpful in constructing those pesky ANSI escape sequences for the scenarios I cared about. The LLM was also able to analyze the full output of an apt package installation session in order to find out which terminal control sequences it uses, and to create a script emulating its output.

Having these tests, I established a "continuous integration" (CI) loop in which I loaded the code of my extension into Emacs (at this point, there was no code yet), launched those test scripts, and compared the results with "golden" outputs which I had produced with screen. Time to unleash fully autonomous AI coding!

This is where the real fun began. I was using Gemini CLI, and at first I made the mistake of letting it use the 2.5 version of the LLM. It was really struggling, to the point that it could not even write syntactically correct Lisp code. Lisp syntax is very minimalist and consists of lots of parentheses that need to be balanced. Surprisingly, Gemini 2.5 had big problems with that. It actually broke my CI loop at first, because it kept writing code that caused Emacs to fail to load the module, or to hang completely. This was something I had never experienced with vendors (I told you, until recently humans were better programmers than AI). After I made my CI loop more resilient, the AI entered a loop of its own, endlessly trying to fix the Lisp syntax and never succeeding, eventually falling into a mode in which it continuously streamed its looping chain of thought into my terminal. Having wasted a couple of hours on this, I was about to give up and was considering switching back to the assistant mode of coding.

But then I did two things: I switched to the "latest preview" model of Gemini, which was 3 at that time, and, again with the help of AI, improved the project instructions, specifically insisting that the agent write "Parinfer-compatible" code and verify it thoroughly. This was a night-and-day improvement: the agent finally managed to fix almost all failing tests by writing correct code, and I started feeling good vibes.

Over a week, during my spare time, we finished the implementation to the point that I could really run my build script in the Emacs "compilation" buffer, and the output looked exactly as it does on a terminal with full capabilities. During this period, I followed the usual principles of Test-Driven Development: always write the test first, make sure the changes fix it and do not regress anything else, then refactor the code and the tests. So it's like a real engineering cycle, except that I only had to type in plain English: no more coding myself.

I also realized that the AI agent is capable not only of writing code and tests, but of actually investigating problems, and it can even write its own tools for that—like a real human programmer! At that point my feelings towards the agent shifted, and I started to consider it a colleague; at least, a robotic colleague, something like the WALL-E robot, maybe. I still had to help this robot sometimes with fixing Emacs Lisp parenthesis issues, mainly because I wanted to save my time, and also money.

Yes, one thing I would like to mention is the cost of this exercise. I ended up spending about $55 on inference, which is of course not a lot, but keep in mind that this wasn't a big project either. So when I read about huge projects that involve hierarchies of AI agents, I think they can burn a lot of money each day! Besides all the useful work that agents do, when a problem gets really hard for them, they can easily descend into a "confusion spiral," and all of that is at your expense! So be careful; I really would not let a swarm of agents work without close human supervision.

Parting Thoughts

If you are interested, the resulting Emacs extension is here. I called it comint-9term to indicate that it extends the comint (command interpreter) mode of Emacs, and that it delivers the spirit of the 9term terminal from the Plan 9 OS.

The code is complete now; I'm just planning to keep fixing any edge cases that I might encounter. After all, as I explained above, this hybrid terminal scenario is a bit unusual and operates on the boundary between dumb and smart terminals, so scripts or programs with super creative approaches to progress display may cause issues. But the AI added a "tracing" sub-mode, so whenever that happens, I can grab a trace and give it to my AI agent for analysis.

From this "vibe coding" experience, and also from reading about the experiences of other people, I think this new mode of human-computer interaction is here to stay. Even if some of the companies currently making "frontier" LLMs collapse for economic reasons, the technology is out there, and people will find ways to make it more economical and efficient.

AI is definitely the new way of writing computer programs, and I think it may change how we treat our phones (or TVs, or cars). Since the appearance of the first iPhone, it has always annoyed me that smartphones and tablets were treated as "embedded" devices, meaning you had to program them using a "real" computer, despite the fact that the CPU power of modern phones is orders of magnitude greater than that of the supercomputers of the 1980s, let alone the personal computers. Compared to a Z80 (the heart of the ZX Spectrum), a modern phone is like a starship compared to a bicycle.

I know, one big obstacle to using your phone for programming was the absence of a real keyboard. Since phones do not have a convenient keyboard, writing a program for them in the "traditional" way was a pain. But not anymore! Finally, with AI it's possible to write a program for your phone using only your phone (in theory, at least), by talking to an agent that builds and debugs the app for you. Thus, personal devices can become something like what home computers were for kids 40 years ago. So, despite all the "AI gloom" about its economic effects, I look into the future with great enthusiasm.

Saturday, November 22, 2025

LXmini Full Range Driver Alternatives

There is one question that has never escaped the back of my mind since I built my desktop version of the LXmini: is it possible to fix the distortion issue of the full range driver? This issue was also the major negative point in Erin's review of the LXmini.

Because the full range driver used in LXmini—the SEAS FU10RB—was released 15 years ago, I thought: perhaps there are better alternatives on the market, developed with modern materials and better technologies, that provide a more even frequency response (without the prominent bump between 1 and 2 kHz) and hopefully achieve a 10 dB or better improvement in the overall distortion level. I checked distortion measurements for various drivers available at Zaph|Audio and at Erin's Audio Corner, but could not find any small "full range" drivers that would come out as obviously superior to the SEAS FU10RB in terms of distortion. One notable exception is the drivers made by Purifi; however, with the current global trade situation, getting them in the USA is very costly.

The Candidates

Scanning through the available stock of Madisound and Parts Express, I came up with a rather short list of possible candidates for high fidelity full range 3"/4" drivers:

  • SEAS MU10RB-SL—this is the "midrange driver" of the LX521 speaker (I guess the "SL" suffix stands for the initials of S. Linkwitz). The fun fact is that it has the same specs as the FU10RB, except for the properties of the suspension. The suspension in the "midrange" version is stiffer, which limits the cone excursion, and as a result the driver has poorer bass. But bass is not an issue for its use in LXmini—there is a woofer for that—and I was thinking that perhaps the difference in suspension has a positive effect on distortion (spoiler: not so much!). I haven't found any reputable measurements for this driver, so I decided to try measuring it myself.

  • MarkAudio MAOP-5. This driver looks a bit exotic—it has no "spider" suspension part (the corrugated fabric supporting the cone), so in contrast to MU10RB-SL, this driver has less suspension force than FU10RB. I was a bit suspicious about the consequences of this design decision, but since I found no distortion measurements revealing the effect of this approach, it was interesting to measure this driver myself.

  • Tang Band W3-1878 also has an unusual look thanks to the massive motor and a specially designed "phase plug", which reminds me of the grilles used on some measurement microphones. This driver was measured by Erin on a Klippel jig back in 2011, but the results were not cross-posted to his site. In the post Erin mentions that his distortion measurements are "relative", which most probably means they are not calibrated to a specific SPL standard. Frankly, I did not understand how to read this particular distortion graph, but from Erin's own comments the distortion is at a good level.

And that's basically it! I have used two more drivers for comparison:

  • One is obviously the FU10RB itself, to make sure that its measurements are taken under the same conditions as those of the contenders.
  • And the midrange driver that I recovered from a broken Cambridge Audio Minx Go portable speaker. It is built as a "full cone", and the cone is made of paper. This one I used as an "anchor" in my measurements—it would be a miracle if it could actually beat any of the speaker drivers above, and such a "miracle" would mean that my measurements had gone wrong.

Measurement Setup

I used my QuantAsylum stack consisting of the original QA401 analyzer, QA460 transducer driver, QA492 microphone preamp (this model is relatively new), and an Earthworks M30 microphone. I powered both the QA460 and QA492 from a portable Jackery battery because my mains power is rather noisy, and the laptop was also running on battery power. Still, initially I had some issues with mains-induced noise, which I asked about on the QuantAsylum forum. As that thread indicates, I traced the issue down to a poorly shielded USB cable which I used to power the QA492. Also, after a conversation with Matt of QuantAsylum, I obtained shorting BNC plugs and used them to cover any unused inputs on both the QA401 and QA492. This reduced electrical noise to the minimum.

Since the Earthworks M30 goes beyond the standard 20 kHz range, I was running the measurements up to 35 kHz, using a 192 kHz sampling rate on the QA401. I was only using the log sweep method, which is sufficient to get a basic understanding of the non-linearities in the measured system. I did not have enough spare time to run the stepped sine method with sufficient resolution, and since I was interested in making relative comparisons, using the log sweep was fine.
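For reference, the log sweep and its inverse filter can be generated in a few lines. This is a minimal numpy sketch of the exponential sine sweep (Farina) method, not the code that the QuantAsylum software actually runs; the parameters mirror my measurement settings:

```python
import numpy as np

def log_sweep(f1=40.0, f2=35000.0, duration=5.0, fs=192000):
    """Exponential (log) sine sweep and its inverse filter (Farina's method)."""
    t = np.arange(int(duration * fs)) / fs
    R = np.log(f2 / f1)  # log of the frequency ratio
    sweep = np.sin(2 * np.pi * f1 * duration / R * (np.exp(t * R / duration) - 1))
    # The inverse filter is the time-reversed sweep with a -6 dB/octave
    # amplitude tilt; convolving the recorded response with it yields the
    # impulse response, with harmonic distortion products appearing as
    # separate "pre-echoes" ahead of the linear IR.
    inverse = sweep[::-1] * np.exp(-t * R / duration)
    return sweep, inverse

sweep, inverse = log_sweep()
```

Convolving the microphone recording with the inverse filter separates the harmonics in time, which is what makes distortion estimation from a single sweep possible.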

Impedance, Sensitivity, and Efficiency

First, for each driver I measured its impedance, sensitivity, and acoustic radiation efficiency.

Measuring impedance is straightforward with QA460. Below is the summary graph for all drivers:

The impedance plots are abbreviated as follows:

  • CA is the Cambridge Audio driver;
  • FU is the SEAS FU10RB (the original driver of LXmini);
  • MA is the Mark Audio MAOP-5;
  • MU is the SEAS MU10RB-SL (the bass-reduced version of FU10RB);
  • TB is the Tang Band W3-1878.

We can see that both MU10RB-SL and MAOP-5 have a nominal 4 Ohm impedance. The FU10RB is also nominally 4 Ohm, but its actual impedance is more like 6 Ohm. Both W3-1878 and the CA speaker are honest 8 Ohm drivers, with CA, being a true "midrange" driver, showing steeply increasing impedance above the midrange.

The sensitivity measurement was done by applying a 1 kHz test tone with 1 V RMS amplitude to the driver under test and measuring the resulting SPL from a 1 meter distance. Due to the no-baffle mounting, the result is lower than the "official" sensitivity spec. The difference between the drivers is not very critical because I use a 100 Watt amplifier and listen to the speakers from a 70–100 cm distance, so even an 8 Ohm driver can work well, assuming that it is efficient acoustically. Below are my results:

Driver SPL dBA (1kHz/1V/1m)
Cambridge Audio (CA) 60
FU10RB (FU) 62
MAOP-5 (MA) 74
MU10RB-SL (MU) 68
W3-1878 (TB) 68

It's interesting that MAOP-5, despite having the same impedance at 1 kHz as the MU10RB-SL, ends up being louder. Also, the FU10RB is quieter than W3-1878 (TB) even though the former has lower impedance. Is it because the non-linearity in the 1–2 kHz region causes losses in acoustic transfer? The Cambridge Audio driver is unsurprisingly the quietest because at 1 kHz it has an effective impedance of 16 Ohm.

And this brings us to acoustic radiation efficiency. In this test I measured the SPL for the same 1 kHz test tone at the same distance of 1 m, but this time the level of the electrical input for each driver was adjusted to achieve 105 dB SPL at a 5 mm distance from the driver's cone. Note that this is different from sensitivity, and characterizes the ability of the cone to work efficiently as an ideal piston.

Driver SPL dBA (1kHz/1m)
Cambridge Audio (CA) 66
FU10RB (FU) 72
MAOP-5 (MA) 73
MU10RB-SL (MU) 72
W3-1878 (TB) 65

Here, both SEAS drivers and the MAOP-5 show almost the same result, while both the W3-1878 and the CA driver are 6 dB worse. The point of measuring the radiation efficiency is that even if one driver has lower distortion than another, and both have the same sensitivity, the less efficient one still has to be driven with a higher voltage in order to achieve the same SPL at the listening position.
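To put numbers on that, the voltage penalty follows directly from the dB difference. A quick sketch in plain Python (nothing here is specific to my setup):

```python
def drive_voltage_ratio(delta_db):
    """Voltage multiplier needed to compensate an SPL deficit of delta_db."""
    return 10 ** (delta_db / 20)

# A driver that is 6 dB less efficient needs about twice the voltage
# (and, into the same impedance, about four times the power) for the same SPL.
ratio = drive_voltage_ratio(6)
```

So the 6 dB efficiency gap observed above roughly doubles the required drive voltage, which eats into amplifier headroom and may push the driver further into its non-linear region.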

Impulse and Frequency Response

Now the results from a log sweep. Since I was using the driver in an open configuration, the same way it is mounted in LXmini, and my room is small, I measured the log sweep in close proximity to the driver cone—about 5 mm.

Below are impulse and step responses of CA and W3-1878 drivers:

We can see that the CA driver is seriously damped and its impulse decays quickly. While the W3-1878 is also well damped, it exhibits quicker back and forth motion, allowing for a wider high frequency extension.

It's interesting to compare the SEAS "siblings":

They look similar, yet the "midrange" MU10RB-SL exhibits more back and forth motion about 200 μs after the initial impulse; after that the oscillations become less severe, and the overall step response ends up being almost the same.

And now the interesting part, this is the IR of MAOP-5 driver computed from a 40–35000 Hz sweep:

The lack of damping is apparent here. Is it—is it good? Certainly not. Initially I thought that perhaps this is one of those "exotic" drivers that promise "euphonic" non-linearities for a pleasant sound. I started experimenting with the test setup: first I swapped the QA460 amp for the same amp I'm actually using in my LXdesktop setup, the QSC SPA4-100, but the IR stayed the same. Then I started playing with the parameters of the sweep and figured out that if I limit the high frequency range to the standard 20 kHz, the ringing is gone:

The IR still has more fluctuations than the IRs of other drivers, but at least now there are no high frequency modulations. I suppose the stiff material of the driver's cone causes it to go into very high frequency oscillations. Although these should be above the range of human hearing, they can still cause more non-linear behavior when excited. For this driver, using a low-pass filter is strictly required.

And below are pairwise comparisons of frequency responses of all drivers corresponding to these IRs. In fact, the response of MAOP-5 driver does not change in the range of 40–20000 Hz regardless of the sweep's upper bound frequency:

These are "nearfield" responses so they are not super useful for evaluating a dipole. Still, we can see that CA driver is indeed a "midrange" driver with a steep downwards slope after 10 kHz, so it's clearly unsuitable for use in LXmini-based designs.

There is an expected difference between FU10RB and MU10RB-SL in the bass response, otherwise they are indeed very similar. And finally, this is the comparison between MAOP-5 and FU10RB:

We can see that the overall shapes of the frequency responses are close. One interesting point is the high frequency behavior of these full range drivers. Since all the cones operate in "break-up" mode, the material of the cone affects the response a lot. We can see that both MAOP-5 and W3-1878 have a null at about 9 kHz in this arrangement, while both SEAS drivers have two: near 5 and 7.5 kHz.

Distortion

Finally, the graphs we have been looking for. As I mentioned, the distortion measurement is derived from the same log sweep; I did not use the stepped sine method. But I ran the sweep at two SPL levels (as measured near the driver cone): 105 dB and 96 dB. Below are the measurements for each driver done at the higher level, showing the 2nd to 4th harmonics (the levels of other harmonics are benign):

And the summary graph comparing them:

Looking at the graphs, almost all the drivers seem to be in the same league—I do not see an obvious winner, but I do see an obvious loser: the CA driver again. As for the others: as we know, FU10RB has higher distortion levels in the midrange, and MU10RB-SL is not much better; it also has these strange peaks between 2–3 kHz and 3–4 kHz, although they are very sharp and thus likely not audible. The MAOP-5 driver has issues in the 2–3 kHz region, while the W3-1878 looks like the most linear, with the exception that its distortion seriously increases past 10 kHz.

This is a comparative graph at the 96 dB output level:

The peaks after 5 kHz seem to be a measurement artifact, as I see them for all drivers; it's just that they are drowned in noise for the other drivers but clearly visible for the W3-1878, which was measured on a different day. I suppose that for cleaner results I would need to use the stepped sine method.

Conclusions

We can see that the MU10RB-SL variation is only slightly better than the original FU10RB. While the W3-1878 driver can be thought of as the winner from the distortion level perspective, recall that it has lower acoustic radiation efficiency, which means I might need to drive it harder in order to achieve the desired loudness at the listening position. So, it looks like in order to make the final decision I will need to build one sample of LXdesktop with the MAOP-5 driver and one sample with the W3-1878, and compare them with my original LXdesktop speaker, with all samples tuned to the same target, of course. That should be a fun experiment, looking forward to it!

Saturday, September 20, 2025

Visualizing Phase Anomalies

In my unofficial contest of LCR upmixers I encountered multiple cases where the extracted center channel had audible anomalies. These could most often be described as "musical noise" or "underwater gurgling." This kind of artifact can also be heard when listening to audio content that has passed through low bitrate lossy codecs. One of the reasons these artifacts occur is that the audio signal gets processed in the frequency domain, and during the processing the original phase information is lost or degraded.

For my experiments, I was using two kinds of signals: pink noise and a simple two-instrument piece of music. From the information theory perspective they sit at the opposite ends of the spectrum: the noise is entirely chaotic and lacks any meaningful information, while music is highly organized on many layers, so let's consider these cases separately.

Pink Noise

One thing that I personally find interesting is realizing how important the relative phase of various frequency components is. For example, if we look just at the frequency spectrum of the phantom center signal extracted by Bertom Phantom Center 2 from uncorrelated stereo pink noise, we will see that the magnitude spectrum is in fact correct and matches the usual spectrum of a pink noise (maybe it is not as "smooth", but these are very minor irregularities):

Yet, an untrained ear can easily hear that this signal doesn't sound like "clean" pink noise and has many artifacts:

So all the issues are actually due to the phase component. But how to understand what is exactly wrong with it? I'm a "visual" person, so I like looking at graphs. However, the phase of audio signals is challenging to visualize. On its own, it's not nearly as intuitive as the magnitude spectrum. In fact, the visualization of the phase of real world signals is even less intuitive than the time domain (waveform) view.

In the particular case of the noise, the phase must be random, basically like the noise itself:

So if we look at the raw phase view of a "proper" pink noise and try to compare it with the phase of pink noise that has artifacts, we will not be able to see much of a difference. Getting to a visualization that works requires some understanding and creativity.

We can ask ourselves—what is the nature of the artifacts that we are observing? They are actually a product of our hearing system, which automatically tries to find patterns—repeating fragments—in any information it receives. This is normal because all the important sounds that we need to hear: voices, noises from other creatures, and the sounds of nature, also have patterns in them. In "clean" pink noise everything is very much shuffled, and the hearing system, unable to detect any patterns, just perceives it as "noise" (note that since we can name it using only one word, the entire "noise" phenomenon is actually just another sound pattern!).

Since it is possible to generate an infinite number of correct-sounding versions of pink noise—we can just run the random number generator over and over again—the presence of artifacts does not mean that we have "deviated" from some perfect condition of the phase of the noise signal. Instead, it simply means that the artifacts are periodic structures created due to corrupted phase information. Because of that, one way of trying to visualize these artifacts is to use some algorithm that looks for repeating information. One example of such an algorithm is a pitch detector. Fortunately, Audacity includes one, and it indeed shows something for the pink noise with artifacts. Check below:

On the top is the pitch spectrogram of clean pink noise; under it are spectrograms of the noise extracted by Bertom Phantom Center 2, UM225 in mode 6, and the Dolby Surround algorithm. We can see that the clean pink noise only shows some patterns at low frequencies, and these are actually just processing artifacts (I've seen patterns at low frequencies when examining a pure 1 kHz sinusoid). But!—the noise with actual audible artifacts shows patterns in the region of 500–3500 Hz, where the human ear is very sensitive, and that's what our ear is hearing.
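The idea of "searching for repeating structure" can also be sketched without Audacity. Below is a toy periodicity detector based on normalized autocorrelation (this is not Audacity's actual pitch algorithm, and the frequencies, lag bounds, and levels are all illustrative): for clean noise the score stays low, while a hidden periodic component raises it.

```python
import numpy as np

def periodicity_strength(frame, min_lag=20, max_lag=600):
    """Peak of the normalized autocorrelation over a lag range: near zero
    for clean noise, noticeably higher when the frame repeats itself."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                      # lag 0 normalized to 1
    return float(np.max(ac[min_lag:max_lag]))

rng = np.random.default_rng(1)
noise = rng.standard_normal(2048)
# The same noise with a buried 100 Hz tone (at a 48 kHz rate, period = 480)
tone_in_noise = noise + np.sin(2 * np.pi * 100 * np.arange(2048) / 48000)

s_clean = periodicity_strength(noise)
s_tonal = periodicity_strength(tone_in_noise)
```

An actual pitch spectrogram does this per short frame and per candidate lag, but the principle is the same: phase corruption that creates repeating structure shows up as autocorrelation peaks where clean noise has none.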

A Bit More on Phase

So I mentioned above that the phase is very non-intuitive, but I also mentioned that the phase is actually very important for proper signal reconstruction. I would like to expand and illustrate these ideas a bit more before we proceed to the analysis of musical signals.

First of all, let's separate the cases of impulse responses and usual musical signals. I'm bringing up impulse responses here because that's probably where you have most often seen phase graphs: in audio analysis tools like Room EQ Wizard (REW). You probably know that the value of the phase, since it's the angle of a periodic function, normally only goes from -180° to 180° and wraps there. For impulse responses, the continuity of the phase between adjacent FFT bins is very important. That's why phase views always include an "unwrap" function, which lines up an otherwise "jumpy" phase into a continuous line.
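For an impulse response, unwrapping is a one-liner in numpy. As a small made-up example, the phase of a pure delay wraps at ±180°, while the unwrapped version is a straight line whose slope is proportional to the delay:

```python
import numpy as np

N, delay = 1024, 64                 # illustrative FFT size and delay (samples)
ir = np.zeros(N)
ir[delay] = 1.0                     # impulse response of a pure delay

phase = np.angle(np.fft.rfft(ir))   # wrapped phase: sawtooth jumping at +/-pi
unwrapped = np.unwrap(phase)        # continuous line: -2*pi*delay*k/N
```

The wrapped plot is a meaningless-looking sawtooth, while the unwrapped one immediately reveals the delay, which is exactly why IR analysis tools offer the unwrap option.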

However, for usual musical signals phase unwrapping rarely makes any sense because transitions between FFT bins do not have to produce a continuous phase. Take the case of noise, for example—here the bins change completely independently of one another, and that's why trying to "unwrap" the phase of noise will not produce any meaningful visualization.

Yet, in signals that have some structure, for example in musical signals, there actually exist very important relationships between the phases of groups of bins, but not necessarily adjacent ones. If you recall, the FFT decomposes the source signal into a set of orthogonal sinusoids. Now, if we imagine adding these sinusoids in order to get our original signal back, we can realize that the relative phases of the sinusoids are very important for creating various shapes of the signal in the time domain. For example, let's consider a pulse which has an initial strong transient part. In order to create that part from a set of sinusoids, their phases must be aligned so that their peaks mostly coincide. As I explained in an older post, the result of summing sinusoids with similar amplitudes greatly depends on their relative phases. When phases are aligned, two sinusoids can produce a signal with at most a +6 dB boost, but if their phases are in an inverse relation, then they can cancel each other completely instead.
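This is easy to verify numerically. A small numpy sketch (the frequency and levels are arbitrary):

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 1000 * t)

aligned = a + np.sin(2 * np.pi * 1000 * t)           # phases aligned
opposed = a + np.sin(2 * np.pi * 1000 * t + np.pi)   # phases inverted

gain_db = 20 * np.log10(np.max(np.abs(aligned)))     # about +6 dB
residual = np.max(np.abs(opposed))                   # about 0: full cancellation
```

Identical magnitude spectra in both cases, yet one sum is twice as loud and the other is silence, purely due to relative phase.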

Below is an illustration of how a set of sinusoids forms a pulse signal when summed:

In this picture, we see the time domain view of the original signal—it has a base frequency of 128 Hz—and below it, the first 9 sinusoids (these contribute most of the signal's energy). We can also see the waveform which we get by summing these 9 sinusoids. It's not quite the original signal yet, but it is close enough already. If we kept adding the sinusoids specified by the remaining FFT bins, we would eventually reconstruct the source signal. It's interesting to see that the amplitudes of the basis sinusoids are quite small (0.02 in absolute value, or less), yet they manage to create a peak which reaches almost 0.4—20 times larger (!)—on the positive side, and goes below -0.4 on the negative side. In order to achieve this magnification it's very important to maintain the alignment between their phases!
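The same effect is easy to reproduce with numpy. In the sketch below (the pulse shape and sizes are made up, not the exact signal from the picture), a short pulse is analyzed with an FFT and only the DC term plus the first 9 bins are summed back as cosines. Individually tiny, they pile up into a much larger peak right where the pulse is, because their phases align there:

```python
import numpy as np

N = 1024
x = np.zeros(N)
x[100:110] = 0.4                     # a short rectangular pulse

X = np.fft.rfft(x) / N               # scaled so |X[k]| is a sinusoid amplitude
n = np.arange(N)

# Partial reconstruction: DC plus the first 9 bins, each restored as
# 2*|X[k]|*cos(2*pi*k*n/N + phase[k])
partial = np.full(N, X[0].real)
for k in range(1, 10):
    partial += 2 * np.abs(X[k]) * np.cos(2 * np.pi * k * n / N + np.angle(X[k]))
```

Each of the 9 components has an amplitude below 0.01, yet their sum peaks near the pulse position at almost ten times that, purely thanks to phase alignment.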

The problem is that the alignment itself is not possible to see with a "naked eye" as easily as, for example, we can see a fundamental and its harmonics on a magnitude graph. Phase alignment is much more "technical", in the sense that the values of the phases are relative to the phase of the corresponding basis sinusoid at sample 0, and they change at different speeds depending on the bin frequency. If we look at the usual frequency domain graphs, the magnitude and the phase, the phase part is not very "illustrative":

As another example, in the series of graphs below I'm shifting the pulse forward in time. Since its shape is obviously preserved, the relationship between the phases remains the same, yet the values of the phases are "jumping" around with no obvious pattern:

On the other hand, if we try to "play" with the phase values we can easily disrupt the phase alignment, and the pulse starts to "smear" or even changes its shape completely. In the examples below, I have tried several things: adding random shifts to the phases—this makes the signal "jittery"; replacing all phase values with zeroes—this got me a completely different, fully symmetric signal; and finally, I created a "minimum phase" version of the signal by making sure that it has the most energy in the beginning, like an acoustic pulse:
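The first two of these manipulations can be sketched with numpy (the pulse here is a stand-in, not the exact signal from the figures). Note that the magnitude spectrum stays identical in every case; only the phase changes:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
x = np.zeros(N)
x[100:110] = 0.4                       # an illustrative pulse

X = np.fft.rfft(x)
mag = np.abs(X)

# 1) Random phase shifts (DC and Nyquist kept untouched so the signal
#    remains real-valued) -> the pulse smears into "jitter":
shifts = rng.uniform(-np.pi, np.pi, X.shape)
shifts[0] = shifts[-1] = 0.0
jittery = np.fft.irfft(X * np.exp(1j * shifts), N)

# 2) All phases zeroed -> same magnitudes, but the energy collapses into a
#    symmetric cluster around sample 0:
zero_phase = np.fft.irfft(mag, N)
```

A minimum phase version would additionally require deriving the phase from the log-magnitude spectrum (via the Hilbert transform), which I leave out of this sketch.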

So, the phase of the signal is really, really important. But if looking at the raw phase graph does not really help us in detecting disruptions of the phase information, what should we use then? The answer is that we should use various derivatives of a spectrogram that take the phase information into account. A "classical" spectrogram only shows the magnitude, which, as we can see, means that we are throwing away half of the information about the signal. But some types of spectrograms incorporate phase information into the picture. For example, below is the "classical" spectrogram of the signals from the last example:

We can see the main problem of this visualization—the spectrogram view loses the information about the exact moment when the pulse happens. But if we use a "reassigned" spectrogram, the frequency-domain view becomes much sharper in the cases where the phase information is consistent. However, "mangled" (randomly shifted) phase produces a blurry image even on a reassigned spectrogram:

Now that we have some clues, let's look at our music signals.

Music Signals

With the uncorrelated pink noise signal, we were in a strange situation where a "reference" extracted center signal did not exist, because in theory there is no correlation between the channels and thus no "correlated phantom center" to extract. We could only compare the extracted center channel with some "theoretical" pink noise and look at the presence of patterns. However, in the case of music signals I do have the "source of truth"—the signals that I used to create my mix.

However, another consideration that we need to take into account is that none of the upmixers I tested, except "AOM Stereo Imager D," was able to separate the center instrument from the side instrument cleanly. In other words, the extracted center, instead of containing only the "centered" instrument (the guitar), also had the saxophone sound (which was panned hard left) mixed in. Similarly, the left channel had the saxophone with a mixed-in sound of the guitar. For example, comparing the original clean saxophone (bottom) with the processed version (top), we can see that new harmonics have been mixed into the original signal:

If we look at the extracted center channel (which contains the guitar), we actually can see some blurriness of the transients (at the top) compared to the original clean signal (at the bottom):

That indicates that the phase of the extracted signal is not as good as it was in the original signal. Even more drastic is the phase mangling in the right channel, which in the source stereo did not contain any hard panned instrument tracks and only carried the same part of the centered instrument as the left channel. After extraction, in the ideal case the right channel should have become empty, but instead it contained a very poor sounding mix of both instruments, although at a very low volume. For comparison purposes, I have normalized its level with the other channels. Looking at the reassigned spectrogram we can see a lot of blurriness, so it is no surprise that it sounds pretty artificial:

Conclusions

Looking at how hard it actually is to separate a stereo signal into components, I'm amazed by the capabilities of our hearing system, which can do it so well. Of course, the extraction techniques based on such low level parameters as cross-correlation can't achieve the same result because they do not "understand" the underlying source signals. Source separation (or stem separation) using neural networks trained on clean samples of various types of musical instruments and speech can produce much better results, especially if the reconstruction is able to create a natural phase—annotations to some of the tools often mention that.

As for my initial task of finding a visual representation for phase issues, I don't think I have fully succeeded. So far, I've only found representations that can illustrate a problem after it has been detected by ear. But I wouldn't rely on these visualizations alone, without listening, for judging the quality of any algorithm.