Tuesday, April 25, 2023

Headphone Stereo Setup Improved, Part I

It has been a long time since I started experimenting with a DSP setup for headphone playback that achieves a more realistic reproduction of regular stereo records originally intended for speakers. This is similar to what "stereo spatialization" does. Since then, I have been experimenting with various settings for my DIY spatializer, aiming to make it more immersive and natural, and have learned new things along the way.

In this post, I would like to present an updated version of the processing chain along with a discussion of the underlying approach. Since there is a lot of material to cover, I decided to split the post into two parts. In the first part, I talk about relevant research and outline the processing pipeline. In the second part, I will describe the process of tuning the setup for an individual listener.

New Considerations

My initial understanding was that I needed to model a setup of stereo speakers in a room. However, I then read more works by S. Linkwitz about stereo recording and reproduction: "Stereo Recording & Rendering—101", and "Recording and Reproduction over Two Loudspeakers as Heard Live" (co-authored with D. Barringer): part 1, part 2. I realized that a good stereo recording captures enough spatial information about a real or an artificially engineered venue, and although it was mixed and mastered on speakers, and thus with speaker reproduction in mind, speaker playback is not the only way to reproduce it correctly. In fact, reproduction over stereo speakers has its own well-known limitations and flaws. Moreover, if the speakers and the room are set up in a way that works around and minimizes the effect of these flaws, the speakers "disappear" and we hear the recorded venue, not the speakers and the room. Thus, I realized, if I take my initial intention to the limit and strive to model this ideal speaker setup on headphones, then I just need to work towards reproducing the recorded venue itself on headphones, since an ideal speakers-room setup is "transparent" and only serves as a medium for the reproduction.

Clean Center Image

So, what is the fundamental flaw of speaker reproduction? As many audio engineers point out, there are various comb filtering patterns which occur as a result of summation of fully or partially correlated delayed outputs from the left and right speakers. The delay occurs because the signal from the opposite speaker arrives at the ear a bit later than the signal from the "direct" speaker. There is a very detailed paper by Timothy Bock and D. B. Keele, "The Effects Of Interaural Crosstalk On Stereo Reproduction And Minimizing Interaural Crosstalk In Nearfield Monitoring By The Use Of A Physical Barrier" (in 2 parts: part 1, part 2), published in 1986. Their modeling and measurements demonstrate that comb filtering increases with correlation, thus the center image, which is formed by fully correlated outputs, is the most affected one. Floyd Toole also expresses his concerns about the change of the timbre of the center image caused by comb filtering in his seminal book on sound reproduction, see Section 7.1.1.

The solution for crosstalk reduction used by Bock & Keele employed a physical barrier between the stereo speakers—remember, it was 1986 and high quality audio DSP was not nearly as affordable as it is these days. In fact, their solution was a sort of prototype for the family of DSP technologies now known as Ambiophonics. Floyd Toole, on the other hand, advocates for multi-speaker setups—the more speakers, the better—so that each source ideally gets its own speaker. This is where the mainstream "immersive audio" technology is heading.

With headphones, interaural crosstalk isn't a problem by design—especially for closed-back over-ears and IEMs—and the phantom center image is reconstructed ideally by our brain from the correlated signals of the left and right earphones. However, it is more difficult for the brain to match binaural signals that lack a sufficient degree of correlation. We need to help it by making the signals more coherent. Although this can also create some comb filtering, it's well under our control.

Mid/Side Processing

My takeaway from these considerations and experiments is that the center channel should be left intact as much as possible. What is the "center channel" in a stereo recording?—It's the sum of the left and right channels. In the audio engineering world, this is known as the "Mid" component of the "Mid/Side" representation. Note that "Mid" is actually more than just the center. If we consider what happens when we add left and right channels together (L+R), we can observe the following results:

  • fully correlated images sum coherently and gain +6 dB (double the amplitude) over non-correlated ones;
  • uncorrelated images—those that exist in the left or the right channel only—still remain, but they are "softer" than the center image;
  • reverse correlated (or anti-correlated) images—those that exist in both the left and the right channel but with their phases reversed—disappear.

The "Side" channel which is created by subtracting one channel from another (L-R) produces a complementing signal and contains anti-correlated and uncorrelated images, and fully anti-correlated images dominate.

Note that the M/S representation is a "lossless" alternative to the traditional L/R representation. The elegance of this pair of representations is that, just as we get M/S from L/R by summing and subtracting the channels, we get L/R back from M/S using the same operations:

  • M + S = (L + R) + (L - R) = 2L;
  • M - S = (L + R) - (L - R) = 2R.

Thus, assuming that the processing is carefully designed to avoid clipping the signal due to doubling of the amplitude, we can convert back and forth between stereo and Mid/Side as many times as we need.
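To make this concrete, below is a minimal NumPy sketch of the round trip (just an illustration, not the code of any plugin); the 0.5 scaling is one way to account for the amplitude doubling mentioned above:

import numpy as np

def ms_encode(left, right):
    # Convert an L/R pair into Mid/Side, scaled by 0.5 to avoid clipping.
    return 0.5 * (left + right), 0.5 * (left - right)

def ms_decode(mid, side):
    # Convert Mid/Side back into L/R; with the 0.5 scaling above this
    # restores the original amplitudes exactly.
    return mid + side, mid - side

# Round-trip check on random "stereo" data.
rng = np.random.default_rng(0)
l, r = rng.standard_normal(1000), rng.standard_normal(1000)
m, s = ms_encode(l, r)
l2, r2 = ms_decode(m, s)
assert np.allclose(l, l2) and np.allclose(r, r2)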

Thanks to their simplicity and usefulness, M/S encoding and decoding are built-in tools of every DAW. However, to simplify my processing chain, I prefer to use a dedicated plugin by Voxengo called MSED. The benefit of using MSED is that it can work "inline", which means it encodes stereo as M/S, processes it, and then converts it back to stereo. The classical trick to make the stereo image wider in headphones is to increase the level of the side component relative to the mid; see this paper for an example. We can also suppress the mid or side component entirely. This is how the stereo output looks in this case:

We can see that the "Mid Only" output is essentially the mid component duplicated to both left and right channels, thus the left and right channels become fully correlated; effectively, this is what a "mono" signal is. The "Side Only" output, in contrast, is still a "stereo" signal in which the left and right channels are reverse correlated.

By looking at the waveforms above, we can confirm that we get the original signal back by summing "Mid Only" and "Side Only" tracks together. Thus, it is possible to apply different processing to these parts and be sure that we preserve all the aural information from the source recording.

Stereo Reverb Of The Real Room

Even during my initial experiments, I understood that a reverb must be used to increase envelopment and spaciousness. What I didn't fully understand back then was that the less correlated the reverb impulse responses of the left and right channels are, the better the reverb works for listener envelopment. This idea was explored by Schroeder in his works on reverb synthesis by DSP (see Section 3.4.4 in the "Applications of Digital Signal Processing to Audio and Acoustics" book). Correlated reverbs, in contrast, effectively create strong reflections—as if there were lots of hard surfaces around—and this sounds more like the ratcheting echo that we encounter in tunnels.

If you recall from my older post, initially I was using a synthesized reverb produced by the NX TrueVerb plugin. Later I switched to a reverb that I extracted from the Fraunhofer MPEG-H authoring plugin. This reverb is used by the plugin for rendering objects in the binaural mode (for headphones). It sounds more natural and was seemingly recorded in some real room, because after looking at its spectrum I could see signs of room modes. The impulses of its left and right channels were decorrelated enough—the overall Inter-Channel Cross Correlation (ICCC), as reported by Acourate, is less than 12%. However, while listening to the reverb alone, I could still hear a slightly ratcheting echo—why is that?

I checked the autocorrelation of each channel in Audacity and found lots of hard reflections in them:

These reflections create comb filtering patterns that sound like the ratcheting effect. So I decided to try another reverb impulse—this time the actual reverb of my room, as measured using my stereo speakers. I had obtained these left and right channel impulses as a byproduct of tuning a desktop version of the LX Mini speakers with Acourate—another project to write about some time later. This reverb impulse response turned out to be much better, while its ICCC figure was just about 1% higher than that of the MPEG-H reverb. Take a look at the autocorrelations of the channels:
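For those who want to reproduce these checks without Acourate or Audacity, here is a rough NumPy/SciPy sketch of both estimates (my simplified take, not the exact algorithms those tools use); the file name is a placeholder for the stereo reverb impulse response:

import numpy as np
from scipy.io import wavfile

def normalized_xcorr_peak(a, b):
    # Peak of the normalized cross-correlation between two channels;
    # a crude stand-in for the ICCC figure reported by Acourate.
    a, b = a - np.mean(a), b - np.mean(b)
    corr = np.correlate(a, b, mode="full")
    return np.max(np.abs(corr)) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def autocorrelation(x):
    # One-sided autocorrelation, normalized to 1 at zero lag.
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    return ac / ac[0]

fs, ir = wavfile.read("reverb_ir.wav")  # placeholder file name
left, right = ir[:, 0].astype(float), ir[:, 1].astype(float)

print("inter-channel correlation:", normalized_xcorr_peak(left, right))
# Strong secondary peaks in the autocorrelation point at discrete hard
# reflections, the ones responsible for the "ratcheting" sound.
print("largest secondary autocorrelation peak (L):",
      np.max(np.abs(autocorrelation(left)[int(0.001 * fs):])))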

So, in fact, it was my mistake to shy away from using my room's actual reverb, considering it "non-ideal." And thanks to the fact that headphones create a more controlled environment, I could adjust the direct-to-reverb ratio to be anything I want. As a result, I created a reverb environment which has even lower reverb than the EBU requirements for a studio (EBU 3276), as follows from the analysis displayed by Acourate for a room of the same dimensions as mine:

Note that the level of reverb depends on the headphones used, and this particular graph is for the setting for open-back headphones (Shure SRH1840).

This is an improvement over my initial processing setup, which was only compliant with the more "relaxed" recommendations for reverb times in a "music presentation" room (DIN 18041, see the picture in the older post here).

The important thing about preparing the impulse response of the reverb is to cut out the first strong impulse of the signal, leaving only late reflections and the reverb "tail." In the processing chain, the "direct" signal comes from another track. By separating the direct signal from the reverb, it becomes much easier to adjust the ratio between their levels, and this becomes important once we try using different types of headphones; more on this in the upcoming Part II.
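As a sketch of how such a "tail only" impulse response can be prepared (the file names and the split point are only examples; the direct peak is located and everything up to a few milliseconds after it is discarded):

import numpy as np
from scipy.io import wavfile

fs, ir = wavfile.read("room_ir_stereo.wav")  # placeholder file name
ir = ir.astype(np.float64)

# Locate the direct sound as the strongest peak across both channels.
direct_idx = int(np.argmax(np.max(np.abs(ir), axis=1)))

# Keep only what comes a few milliseconds after the direct peak
# (5 ms here is an arbitrary example value).
tail = ir[direct_idx + int(0.005 * fs):].copy()

# A short fade-in avoids a click at the cut point.
fade = np.linspace(0.0, 1.0, int(0.001 * fs))
tail[:len(fade)] *= fade[:, np.newaxis]

wavfile.write("room_reverb_tail.wav", fs, tail.astype(np.float32))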

The Effect Of Room Divergence On Externalization

Yet another finding has solidified my conclusion about the need to use a real room reverb. The paper "Creating Auditory Illusions with Binaural Technology" by Karlheinz Brandenburg et al., published in the collection of papers "The Technology of Binaural Understanding" edited by Jens Blauert and Jonas Braasch, describes an interesting experiment that explores the connection between "inside the head" (ITL) localization and the room reverberation impulses used for binaural simulation. The study confirms that the use of a reverb impulse which matches the room provides better externalization, while a divergence between visually predicted and aurally experienced reverb conditions causes confusion. This is commonly referred to as the "room divergence effect". Since it's a psychoacoustic effect, the exact outcome is somewhat complicated and depends on many parameters.

My layman understanding is that the divergence effect is especially pronounced when using open-back headphones, since they don't provide any isolation from external sounds. Thus, unless the room where you are listening to the headphones is completely isolated from the outside world, you still hear the sounds from the "real" world, processed with the real room acoustics. This forms an "expectation" in the auditory system of how external sounds should sound. If the reverb used for headphone processing does not match this expectation, the brain gets confused, and it's more likely that the auditory image will collapse to ITL. Obviously, closed-backs and especially IEMs isolate better, so for them this effect might be less of a concern. However, our eyes still see the room, and this can also create expectations about the reverb. Thus, using a real room reverb seems to improve the chances of experiencing externalization in headphones, compared to using an "abstract" modeled reverb.

Application Of The Room Reverb

Recalling my original intention to leave the center sources intact, applying the reverb might look like a contradictory requirement. However, with Mid/Side processing it's possible to have both—the idea is that we apply a stronger version of the room reverb to the Side output, and a softer (more attenuated) version to the Mid output.

Since the Side Only output from MSED already contains uncorrelated and reverse correlated signals, "fuzzing" it even more with an uncorrelated stereo reverb does not hurt. In fact, it only makes it better—more spacious and longer lasting, giving the hearing system a better opportunity to analyze the signal. To help the brain even more, we also apply cross-feed to the result. Since cross-feeding is essentially a more sophisticated version of summing the left and right channels, it has a similar effect: it amplifies correlated signals and suppresses reverse correlated ones. However, because in cross-feed the summing is weighted across the frequency spectrum, this effect is much weaker: the application of cross-feed does not produce a fully correlated output, and this is what we want.

When I listen to this "Side Only" reverb in headphones, the representation is fully externalized. When I stand in front of the speakers, it feels as if I hear them playing. However, since I'm listening to anti- and uncorrelated parts, the audio image is "smeared" and serves only the purpose of envelopment. For a better effect, the reverb used for the "Side Only" channel is massaged by a gentle Bessel low-pass filter with the corner frequency at 5 kHz. This simulates the natural shadowing of signals that come from the back.
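For reference, such a gentle low-pass can be expressed with SciPy as follows (a sketch only; in my chain the filtering is done by a plugin, and the filter order here is just an example):

from scipy import signal
from scipy.io import wavfile

fs, side_reverb = wavfile.read("room_reverb_tail.wav")  # placeholder name

# A 2nd-order Bessel low-pass at 5 kHz: a very gentle slope with an
# almost constant group delay in the passband.
sos = signal.bessel(2, 5000, btype="lowpass", output="sos", fs=fs)
side_reverb_lp = signal.sosfilt(sos, side_reverb.astype(float), axis=0)

wavfile.write("side_reverb_lp.wav", fs, side_reverb_lp.astype("float32"))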

Leaving the center channel completely devoid of reverberation makes it sound too "dry" and too close in headphones. That's why, in addition to the relatively strong room reverb applied to the Side Only output, I also apply a much weaker and more delayed room reverb to the "Mid Only" component of the input signal. The idea is that this delayed reverb should go unnoticed by the "conscious" part of the hearing apparatus, and only act as a spaciousness and distance hint to lower layers of brain processing. Thus, this extra reverb mostly relies on the precedence effect, complementing the direct sound and reinforcing it, while still being perceived as a part of it (a.k.a. "fusion").

Listening to this "Mid Only" reverb in headphones, I hear a "focused" sound of the speakers in the room. That's because the signal is formed from a "mono" signal. However, application of an uncorrelated stereo reverb "smears" it and adds some width. In order to find the desired delay and attenuation for the "Mid Only" reverb, I play a dry recording of some strong percussive instrument, like bongos, and increase the delay and reduce its level until I stop noticing the reverb. Yet, when I toggle the track with this extra reverb on and off, I can hear the change in the perceived distance to bongos. If the delay is increased too much, it "breaks" the precedence effect and the reverb turns into an echo.

Diffuse And Free Field Equalization

A lot of discussion is devoted to the topic of recommended headphone tuning. There are two positions representing the poles of the discussion. Günther Theile, in his highly cited work "Equalization of studio monitor headphones", argues that diffuse field (DF) equalization is the only correct way to tune headphones, since this way the headphones do not "favor" any particular direction and thus provide the most neutral representation of the reproduced sound. A similar point of view is expressed by the founder of Etymotic, Mead Killion, in his old blog post.

On the other side, there is the idea that headphones must be tuned to match the canonical 60-degree speaker setup, as measured in a free field (FF), or in an anechoic chamber. In practice, when listening to "raw" (non-binauralized) stereo in headphones, neither of these tunings works satisfactorily for the general audience, and headphone makers usually settle on some compromise which keeps listeners engaged, based on an "expert opinion" or studies. One well-known example is, of course, the Harman target curve. There is also an interesting systematic approach for blending the free and diffuse field curves based on the room acoustics, proposed in the paper with a rather long title "Free Plus Diffuse Sound Field Target Earphone Response Derived From Classical Room Acoustics Theory" by Christopher J. Struck. The main idea is to find the frequency where the free field of the room turns into the diffuse field, and use that frequency as the "crossover" point for the DF and FF response curves.

Personally, I'm in the "diffuse field tuning" camp. This choice is rather logical if we aim for tonally neutral equipment. After all, we intend to apply any corrections in the digital domain and don't want to deal with undoing the "character" of the DAC, the amplifier, or the headphones that we use.

Returning to the paper by Brandenburg et al., another interesting finding it points out is that the source directions for which achieving externalization in headphones is the most difficult are the full frontal and the full backward ones (0 and 180 degrees in the median plane). The hypothesis is that this happens due to the well-known "front-back confusion" from the Duplex theory. I decided to aid the brain in resolving this confusion by giving correlated sounds an "FF-like" frequency shape, and their counterparts—anti-correlated sounds—a "DF-like" shape. In order to do that, I used the results of yet another paper, "Determination of noise immission from sound sources close to the ears" by D. Hammershøi and H. Møller. It provides averaged frequency shapes for FF and DF sound sources measured at various points of the ear: the blocked ear canal, the open ear canal, and the eardrum. Using the tabulated data from the paper, I could create "FF-to-DF" and "DF-to-FF" compensation curves. Below are the graphs of the "DF-to-FF" curves, marked with "BE", "OE", and "ED" for the measurement points listed above. The "FF-to-DF" curves can be obtained by inverting these graphs.

Since the paper uses averaged data, the curves are rather smooth except for the high frequency part starting at 6.3 kHz, which reflects the effect of pinna filtering and the ear canal resonance. Thus, I decided to have two versions of each compensation curve: a complete one, and one which only goes up to 5 kHz. When applying the "complete DF-to-FF at the eardrum" curve to the "Mid Only" component, I could indeed make it sound more "frontal" (when using Shure SRH1840 headphones, at least). Applying the "low pass FF-to-DF at the eardrum" compensation to the "Side Only" component makes it more "enveloping."
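One way to turn such tabulated curves into a filter is to build a linear-phase FIR from a handful of (frequency, gain) points; below is a sketch where the numbers are made-up placeholders standing in for the actual data from the Hammershøi & Møller tables:

import numpy as np
from scipy import signal

fs = 48000

# Placeholder points only; the real "DF-to-FF at the eardrum" values must be
# taken from the tables in the Hammershøi & Møller paper.
freqs_hz = np.array([0, 200, 1000, 2000, 3000, 5000, fs / 2])
gains_db = np.array([0.0, 0.0, 1.5, 3.0, 2.0, 0.0, 0.0])

# firwin2 expects frequencies normalized to Nyquist and linear gains.
fir = signal.firwin2(1023, freqs_hz / (fs / 2), 10.0 ** (gains_db / 20.0))

# The resulting FIR can be saved as a WAV file and loaded into a convolver
# plugin such as MConvolutionEZ; inverting the gains (in dB) gives the
# complementary "FF-to-DF" correction.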

The Effect of Adding Harmonic Distortion

Yet another surprising effect which I have discovered myself is how adding harmonic distortion affects the apparent source width (ASW). By adding the 2nd harmonic to the "Mid Only" reverb, I could make it sound more "focused," while adding the 3rd harmonic to the "Side Only" reverb makes it even wider. Just to reiterate, the harmonics are only added to the reverbs, not to the direct sound, thus the overall level of added harmonics is minimal.
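In DSP terms, adding a small amount of a specific harmonic can be approximated with simple waveshaping; here is a sketch of what I mean (in my chain the harmonics are generated by the Reviver plugin, and the coefficients below are arbitrary examples):

import numpy as np

def add_second_harmonic(x, amount=0.02):
    # Asymmetric (x^2) waveshaping adds mostly even harmonics.
    return x + amount * x * x

def add_third_harmonic(x, amount=0.02):
    # Symmetric (x^3) waveshaping adds mostly odd harmonics.
    return x + amount * x ** 3

# Quick check on a 1 kHz sine sampled for one second: the processed signal
# now contains energy at 2 kHz (about -40 dB below the fundamental here).
fs, f = 48000, 1000
t = np.arange(fs) / fs
spectrum = np.abs(np.fft.rfft(add_second_harmonic(np.sin(2 * np.pi * f * t))))
print("2nd harmonic level (dB):", 20 * np.log10(spectrum[2 * f] / spectrum[f]))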

Since I don't entirely understand the nature of this effect, I will try to find more information on its possible cause later.

The Goal of Spatialization

After going through all these lengthy explanations, you may be wondering: what is the actual outcome of all this? After all, there are commercially available spatializers, with no-hassle setup, with head tracking, etc. Are there any benefits besides learning more about how the human auditory system works? I've done some comparison with the spatialization available on the iOS platform, and I would claim that my DIY spatialization has higher quality. I compared how the tracks which I use for tuning my setup sound via the iOS spatializer and via mine, and I find mine to be more precise and more realistic, allowing me to achieve a true state of "immersion."

It's an interesting experience which sort of comes and goes, and depends on the track and the headphones being used. After 12–15 minutes of listening, the brain gets accustomed to the reproduction and eventually starts "believing" that it actually hears the world created by the track. Headphones "disappear"—they feel no different from a hat—we "know" when wearing a hat that it's not the hat that creates the auditory world around us, and I do "know" in the "immersed" state that the surrounding sound does not originate from the headphones. The eyes start automatically following sound sources when they move, and I can really feel their presence. It's also super easy to turn my auditory attention from one object to another. It's really a sense of immersion, and it's similar to the feeling of "transparent reproduction" of music via speakers—sort of an "audiophile nirvana."

So, yeah, for me it's more interesting to build my own setup, and I believe that I can make it sound more natural than affordably priced commercial solutions. It's a similar thing with speakers. Sure, there exist a lot of really good speakers which may work fantastically out of the box; however, some people, myself included, find it rewarding to build—if not design—their own.

Topology

OK, if you are still with me, let's take a look at the topology of the processing chain:

Let's go over the processing blocks of the chain. Note that the actual values for plugin parameters are only specified as an example. In Part II of this post, I will go through the process of finding suitable values for particular headphones and ears.

The "Input" block is just a convenience. Since my chain is "implemented" as a set of tracks in Reaper, having a dedicated input track makes it easier to switch the input source, try rendering a media file (via the master bus) in order to check for the absence of clipping, and apply attenuation if necessary.

The 3 processing blocks are wired in parallel, and in fact consist of the same set of plugins, just with different settings. The purpose of having the same set of plugins is to make time alignment easier. Although Reaper can compensate for the processing delay, sometimes this does not work right, and having the same set of plugins works more reliably.

The first processing block is for the "Direct" output. According to the principle of keeping the direct output as clean as possible, the only plugin which is engaged here is the cross-feed plugin 112dB RedLine Monitor which is set to the "classical" 60 deg speaker angle, no attenuation of the center, and emulation of distance turned off.

The "Side Reverb" block only processes the Side component, by toggling on the "Mid Mute" button on the Voxengo MSED plugin. As I mentioned above, the room reverb applied here was low-passed. The reverb is applied by MeldaProduction MConvolutionEZ. The cross-feed plugin uses a different setting than the "Direct" block—the center attenuation is set to the maximum, -3 dB and a slightly wider speaker angle: 70 deg is used. This is to avoid producing overly cross-correlated output. Then, also as explained above, the 3rd harmonic is added by using the Fielding DSP Reviver plugin.

The "Mid Reverb" block processes the Mid component only. It uses a whole version of the room reverb, with a higher delay. The cross-feed uses the same angle as the Direct output, for consistency, while the center attenuation is at -3 dB to produce more uncorrelated output. The Reviver plugin is set to add the 2nd harmonic.

The outputs of all 3 processing blocks are mixed together in different proportions. While the Direct output is left unattenuated, the reverb outputs are attenuated significantly. The actual values depend on the headphones used. The levels needed for open-back headphones are so low that the overall frequency response deviation from a flat line is within 1 dB.

The shaping of the signal that happens in the "Output" block is more significant. In fact, the whole purpose of the "Output" block is to adjust the output for the particular headphones. First, per-frequency left-right balance is corrected using the linear phase equalizer LP10 by DDMF—this is similar to the technique originally proposed by David Griesinger.

Then the Goodhertz Tone Control plugin is used to adjust the spectral tilt. The slopes are set to 0% both for bass and treble. This creates a very smooth tilt which practically does not affect the phase, and thus there is no need to switch the plugin into the "Linear Phase" mode. Note that although LP10 can also apply a spectral tilt, it's less flexible than what Tone Control can do. Finally, the MConvolutionEZ plugin, operating in "Mid" and "Side" modes, is used to apply "DF-to-FF" or "FF-to-DF" correction curves.

Obviously, linear phase plugins create significant latency, thus this setup is not intended for "real-time" playback. However, using the linear phase mode is worth it. I actually tried doing headphone balance adjustments using a regular minimum phase equalizer, and the result was much "fuzzier." In fact, I can hear the same kind of "fuzziness" in the iOS spatializer running in the "head tracking" mode. It seems that minimum phase equalization with narrowband filters causes a significant increase in the ASW of sound sources.

What's Next

In the upcoming Part II of this post, I will provide steps on finding the right values to configure the processing components. These parameters are printed in italics on the processing chain scheme from the previous section.

Tuesday, November 15, 2022

Long Live 16 Color Terminals

This blog entry is about the process I went through while designing my own 16 color terminal scheme, as an improvement to "Solarized light". Since I invested some time in it, I decided to document it somewhere, just in case I need to go back and revisit things later.

What Is This All About

I need to give some introduction to terminals to ensure that I'm on the same page with readers. Terminals were one of the first ways to establish truly interactive communication between people and computers. You type a command, and the computer prints the result, or vice versa—the computer asks you "do you really want to delete this file?", and you type "y" or "n". The first terminals were essentially electric typewriters—noisy and slow, thus the conversation between computers and humans was really terse. However, even then interactive text editors had become technically feasible, take a look at the "standard text editor" of UNIX, ed. Later, so called "glass terminals" (CRT monitors with keyboards) arrived, opening the door to more "visual" and thus more productive interaction, and the "Editor war" began.

And basically, these visual terminals are what is still being emulated by all UNIX derivatives these days: the "text mode" of Linux, the XTerm program, the macOS Terminal app, countless 3rd party terminals, even browser-based terminals—these can run on any desktop OS. In fact, I use hterm for hosting the editor in which I'm preparing this text.

As the terminal technology evolved over time, it became more sophisticated. The capabilities of teletype terminals were very basic: print a character, move the caret left or right, go to the next line. "Glass terminals" enabled arbitrary cursor positioning, and then, with the advent of new hardware technologies, color was added. Since the evolution of hardware took time, color capabilities developed in steps: monochrome, 8 colors, 16 colors, 256 colors, and finally these days—"truecolor" (24-bit, 8 bits per color channel). Despite all the crowd excitement about the latter, I believe that "less is more," and the use of a restricted color set in text-based programs still has some benefits.

Before I go into details, one thing that I would like to clarify is the difference between the number of colors that are available to programs running under a terminal emulator (console utilities, editors with a text UI, etc), and the number of colors used by the terminal emulator program itself. The terminal program is in fact only limited by the color capabilities of the display. Even when a console utility outputs monochrome text, the terminal emulator can still use the full color capabilities of the display for nice-looking font rendering and for displaying semi-opaque overlays—the cursor being the simplest example. Thus, setting the terminal to the 16-color mode does not mean we go back to the 1980s in terms of picture quality. And unless one runs console programs that, for example, attempt to display full-color images using pseudo-graphic characters, or wants to use gradient backgrounds, it might go unnoticed that a 16-color terminal mode is in fact being used.

Getting Solarized

I remember the trend, popular among computer users, of avoiding exposure to the blue light of computer displays—maybe it is still a thing?—even Apple products offer the "Night Shift" feature. Users of non-Apple products got themselves yellow-tinted "computer" glasses or followed advice to turn down the blue component in the color settings of their monitors. The resulting image looks more like a page of a printed book when reading outdoors (if the tuning is done sensibly, not to the point where white becomes bright yellow), and probably puts less strain on the eyes. The same result on a terminal emulator can be achieved without any hardware tweaks by applying a popular 16-color theme called "Solarized light" by Ethan Schoonover.

I remember being hooked by the "signature" yellowish background color of this theme (glancing over the shoulders of my colleagues, a lot of people are). I never liked the "dark" version because it does not look like a paper page at all. So I was setting up all my terminal emulators to use "Solarized light", and was quite happy with the result.

However, at some point I noticed that color-themed code in my Emacs editor—I run it in the "non-windowed", that is, text mode under the aforementioned hterm terminal emulator—does not look like the screenshots on Ethan's page. Instead, C++ code, for example, looked like this (using some code from the Internet as an example):

I started digging for the cause of that and discovered that every "mature" mode of Emacs basically declares 4 different color schemes: for use with 16-color terminals and for 256- (actually, >88) color terminals, each having a version for dark and light terminal backgrounds. Sometimes, a scheme specific to 8-color terminals is also added. Below is an example from font-lock.el:

(defface font-lock-function-name-face
  '((((class color) (min-colors 88) (background light)) :foreground "Blue1")
    (((class color) (min-colors 88) (background dark))  :foreground "LightSkyBlue")
    (((class color) (min-colors 16) (background light)) :foreground "Blue")
    (((class color) (min-colors 16) (background dark))  :foreground "LightSkyBlue")
    (((class color) (min-colors 8)) :foreground "blue" :weight bold)
    (t :inverse-video t :weight bold))
  "Font Lock mode face used to highlight function names."
  :group 'font-lock-faces)

Thus, in reality, in my C++ example only the foreground and background text colors originate from the Solarized theme, and all other colors come from the 256-color scheme of the Emacs C++ mode. The names of the colors used in this case (like "LightSkyBlue" above) come from the "X11 palette", and there are many gradations and tints to choose from for every basic color.

In fact, this is one of the drawbacks of the 256- and true-color modes (in my opinion, of course)—apps have too much control over colors, and this leads to inconsistency. For me, too much effort would be required to go over all the Emacs modes that I use and ensure that their use of colors is mutually consistent. Whereas in the 16-color mode, not only do apps have to use a restricted set of colors, but the set itself is in fact a terminal-controlled palette. Thus, the app only specifies the name of the color it wants to use, for example "red", and then the terminal setup defines which exact tint of red to use. So, one day I switched my terminal to only allow 16 colors, and restarted Emacs...

...And I did not like the result at all! Yes, now I could see I was indeed using the palette of the "Solarized light" theme, but the result looked quite bleak. I took another look at the screenshots on Ethan's page and realized that to me the colors of the Solarized palette look more engaging on a dark background. I read that the point of Ethan's design was to allow switching between dark and light backgrounds with a minimal reshuffling of colors, while still having "identical readability." However, to my eyes "readability" wasn't the same as "looking attractive."

As I tried using the Solarized light palette for my usual tasks in Emacs, I found that it has a couple more shortcomings. Let's look at the palette:

One thing that bugged me is that the orange color does not look much different from red. I can see that even with color blocks, and with text the similarity increases to the point that, when looking at text colored in orange, I could not stop myself from perceiving it as red. People are not very good at recalling how "absolute" colors look; we are much better at comparing them when they are side by side.

Another serious problem was that there were not enough "background" colors for my text highlighting needs. I'm not sure about Vim users, but in Emacs I have a lot of uses for background highlights. I can enumerate them:

  • highlighting the current line;
  • text selection;
  • the current match of interactive search;
  • all other matches of interactive search;
  • character differences in diffs (highlighted over line differences);
  • highlighting of "ours", "theirs", and patch changes in a 3-way merge;
  • and so on.

Most of those highlights must have a color of their own so they don't hide each other when I combine them, and they must not make any of the text colors unreadable due to poor contrast. As an example, if I have colorized source code and I'm selecting text, I should still be able to see every symbol of it clearly. This is where the Solarized palette falls short, and I can easily explain why.

Color Engineering

One of the defining features of the Solarized palette is that it was created using the Lab color space. Previously, 16-color palettes were usually assigned colors by mapping the color number in binary form, from 0000 to 1111, onto a tuple of (intensity, red, green, blue) bits, without caring too much about how the resulting colors look to users. The Lab color space, in contrast, is modeled after human perception of color, and can help in achieving results which are more consistent and thus more aesthetically pleasing.

The first number in the Lab triad is the luminosity of the color. Let's look at the "official" palette definition in this model:

SOLARIZED L*A*B
--------- ----------
base03    15 -12 -12
base02    20 -12 -12
base01    45 -07 -07
base00    50 -07 -07
base0     60 -06 -03
base1     65 -05 -02
base2     92 -00  10
base3     97  00  10
yellow    60  10  65
orange    50  50  55
red       50  65  45
magenta   50  65 -05
violet    50  15 -45
blue      55 -10 -45
cyan      60 -35 -05
green     60 -20  65

We can see that most colors have luminosity in the range between 45 and 65; only base03 and base02 have low luminosity, and only base2 and base3 have high luminosity. Thus, these four colors are the only ones that can serve as backgrounds that work with any text color. Given that one of those 4 background colors is the actual background, only 3 remain—certainly not enough for my use case.

After considering these shortcomings, I decided to tweak the "Solarized light" palette to better suit my needs. Below is the list of my goals:

  1. Use colors that look more vivid with a light background.
  2. Make sure that no two colors look alike when used for text.
  3. Provide more background colors.

And the list of my non-goals, compared to Ethan's goals:

  1. No need to use text colors with a dark background.
  2. Can consider bold text as yet another text color.

In the design process I also used the Lab color space. Thanks to non-goal 1, I was able to lower the minimum luminosity down to 35. I made some of the colors more vivid by increasing color intensities—as a starting point I took some of the colors used by the 256-color scheme of the C++ mode in Emacs.

In order to make the orange color visually different from red, I created a gradient between red and yellow, and picked the orange tint which I perceived as "dividing" the two, in order to make it the tint most distant from both red and yellow.
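To illustrate, here is a sketch of how such a gradient can be sampled in Lab and converted to sRGB for a quick visual check (a straightforward D65 Lab-to-sRGB conversion; the endpoint Lab values are the red and yellow from my palette table below):

import numpy as np

def lab_to_srgb_hex(L, a, b):
    # D65 L*a*b* -> XYZ.
    def f_inv(t):
        return t ** 3 if t > 6 / 29 else 3 * (6 / 29) ** 2 * (t - 4 / 29)
    fy = (L + 16) / 116
    xyz = np.array([0.95047 * f_inv(fy + a / 500),
                    1.00000 * f_inv(fy),
                    1.08883 * f_inv(fy - b / 200)])
    # XYZ -> linear sRGB, then gamma encoding, clipped to the sRGB gamut.
    m = np.array([[ 3.2406, -1.5372, -0.4986],
                  [-0.9689,  1.8758,  0.0415],
                  [ 0.0557, -0.2040,  1.0570]])
    rgb = np.clip(m @ xyz, 0.0, 1.0)
    rgb = np.where(rgb <= 0.0031308, 12.92 * rgb, 1.055 * rgb ** (1 / 2.4) - 0.055)
    return "#" + "".join(f"{int(round(c * 255)):02x}" for c in rgb)

# Endpoints: the "Colorized" red (40, 55, 40) and yellow (60, 10, 65).
red, yellow = np.array([40.0, 55.0, 40.0]), np.array([60.0, 10.0, 65.0])
for t in np.linspace(0.0, 1.0, 9):
    L, a, b = (1 - t) * red + t * yellow
    print(f"t={t:.2f}  {lab_to_srgb_hex(L, a, b)}")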

I decided to reduce the number of gray shades in the palette. For non-colored text, I planned to use the following monotones:

  • base1 for darker than normal text;
  • base2 for lighter than normal, less readable text (I moved it to the "bright black" position in the palette);
  • bold normal text for emphasis.

And here comes a hack! I moved the normal text color (base00 in the "Solarized light" theme) out of the palette and made it the "text color" of the terminal. Remember when I said that the terminal emulator program does not have to restrict itself to 16 colors? Most contemporary terminal emulators allow defining at least 3 additional colors which do not have to coincide with any of the colors from the primary palette: the text color, the background color, and the cursor color. The first two are used "by default" when the program running in the terminal does not make any explicit color choice. Also, any program that does use colors can always reset the text color to this default.

Let's pause for a moment and do some accounting for colors that I have already defined in the 16 color palette:

  • 2 text colors (plus the text color in the terminal app);
  • 8 accent colors: red, orange, yellow, green, cyan, blue, magenta, and violet;
  • 1 background color (this is used when one needs to print text on a dark background, without dealing with the "reverse" text attribute, which usually looks like a disaster).

Thus we have 16 - 11 = 5, which means there are 5 color slots left for highlights; that's 2 more than in the "Solarized light" theme, and they are real colors, not shades of gray! Since I removed or moved away the shades of gray used by the original Solarized palette, I placed the highlights where the grays used to be, as "bright" versions of the corresponding colors.

When choosing color values for the highlights, I deliberately made them very bright (high luminosity values) to ensure good contrast with any color used for text. One difficulty with very bright colors is making them visually distinctive, to avoid confusing "light cyan" with "light gray", for example.
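One way to sanity-check this is to compute a contrast ratio between a candidate highlight background and each text color; here is a sketch using the WCAG relative luminance formula (not something from the Solarized toolchain, just a convenient yardstick), with hex values taken from the comparison table below:

def relative_luminance(hex_color):
    # WCAG relative luminance of an sRGB color given as "#rrggbb".
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast_ratio(fg_hex, bg_hex):
    hi, lo = sorted((relative_luminance(fg_hex), relative_luminance(bg_hex)),
                    reverse=True)
    return (hi + 0.05) / (lo + 0.05)

# The "Bright Yellow" highlight against the default text color and the red accent.
print(contrast_ratio("#657b83", "#ffeac7"))
print(contrast_ratio("#b12621", "#ffeac7"))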

This is the palette I ended up with, and its comparison with "Solarized light" (on the left):

And below is a comparison of the Lab values, along with a "web color" RGB triplet. Compared to the initial table I took from the Solarized page, I have rearranged the colors in palette order:

PALETTE        SOLARIZED L*A*B       COLORIZED   HEX
-------------- --------- ----------  ----------  -------
Black          base02    20 -12 -12  20 -12 -12  #043642
Red            red       50  65  45  40  55  40  #b12621
Green          green     60 -20  65  50 -45  40  #1f892b
Yellow         yellow    60  10  65  60  10  65  #b68900
Blue           blue      55 -10 -45  55 -10 -45  #268bd2
Magenta        magenta   50  65 -05  35  50 -05  #94245c
Cyan           cyan      60 -35 -05  55 -35  00  #249482
Light Gray     base2     92  00  10  92  00  10  #eee8d6
Bright Black   base03    15 -12 -12  65 -05 -02  #93a1a1
Bright Red     orange    50  50  55  55  35  50  #c7692b
Bright Green   base01    45 -07 -07  96 -10  25  #eef9c2
Bright Yellow  base00    50 -07 -07  94  03  20  #ffeac7
Bright Blue    base0     60 -06 -03  90  00 -05  #dfe3ec
Bright Magenta violet    50  15 -45  40  20 -65  #3657cb
Bright Cyan    base1     65 -05 -02  94 -08  00  #ddf3ed
White          base3     97  00  10  97  00  10  #fdf6e4
Text Color                           50 -07 -07  #657b83
Cursor Color                     Bright Magenta, opacity 40%

(Note that even for colors that retain their Lab values from Solarized, I may have provided slightly different RGB values compared to those you can find on Ethan's page. This could be due to small discrepancies in the color profiles used for conversion, and is unlikely to produce noticeable differences.)

Compared to the "Solarized light" palette, I have redefined 6 accent colors, and thrown away 2 "base" colors. I decided to name my palette "Colorized," both as a nod to "Solarized" which it is based on, and as a reference to the fact that it looks more colorful than its parent.

Emacs Customizations

Besides defining my own palette, I also had to make some tweaks in Emacs in order to use it to its full extent. While customizing the colors of the C/C++ mode, I made it visually similar to the 256-color scheme I was using before, but better tempered:

Shell Mode

It's a well-known trick to enable interpretation of ANSI escape sequences for setting colors in the "shell" mode of Emacs:

(require 'ansi-color)
(add-hook 'shell-mode-hook 'ansi-color-for-comint-mode-on)

What is less known is that we can then properly advertise this ability to terminal applications via the TERM variable by setting it to dumb-emacs-ansi. This is a valid termcap / terminfo entry; you can find it in the official terminfo source from the ncurses package.

Besides that, it's also possible to map these ANSI color sequences to terminal colors arbitrarily. For example, I mapped the "bright" colors onto bold text. This comes in handy both for the original Solarized palette and my Colorized one, because the "bright" colors in them are in reality not bright versions of the first 8 colors, so when apps try to use them the resulting output looks unreadable.

The full list of Emacs customizations is in this setup file. It's awesome that when using it, I naturally forget that only 16 colors (OK, to be fair, 17, if you recall the terminal text color hack) are being used. This way, I have proven to myself that the use of a "true color", or even a 256-color terminal, is not required for achieving good looks in terminal applications.

Conclusion

Big kudos to Ethan Schoonover for creating the original Solarized theme and explaining the rationale behind it. The theme is minimalist yet attractive, and proves that it's possible to achieve more with less.

Monday, September 5, 2022

MOTU: Multichannel Volume Control

Going beyond simple 2-channel volume control still presents a challenge, unfortunately. The traditional design is to view a multichannel audio interface as a group of multiple stereo outputs, and provide an independent volume control for each group, without an option to "gang" multiple outputs together. True multichannel output devices are normally associated with A/V playback, and indeed modern AVRs do offer flexible options for controlling the volume, including even support for active crossovers. For example, the Marantz AV7704 which I was using for some time has this option. However, AVRs usually have a large physical footprint.

Computer-based solutions are even more flexible, and in recent years they come in compact forms and fanless cases, making them a more attractive alternative to AVRs. I also used a PC running AcourateConvolver for a long time. I didn't mind that it applies attenuation in the digital domain, because it does that correctly, with proper dithering. However, Windows does not appear to be a hassle-free platform to me, because it always unceremoniously wants to update itself and restart exactly when you don't want it to.

After the Windows computer which I was using for AcourateConvolver broke, the solution I switched to was the RCU-VCA6A by RDL (Radio Design Labs), which seems to work reliably, does not want to update itself, and does not introduce any audible degradation into the audio path (at least, compared to the VCA built into the QSC amplifier). But still, it's an extra analog unit. Could we get rid of it?

It turns out, the solution had been right there all along, ever since I started my audio electronics hobby. It can be trivially done using the ingenious control software of the "MOTU Pro Audio" line of audio interfaces, which includes my MOTU Ultralite AVB. Interestingly, the first thing that I noticed when I bought this audio interface is that the control app runs in the browser, talking to a web server running on the card. This was in sharp contrast to the traditional approach of installing native apps on the host Mac or PC.

What I failed to realize, though, is that the control app actually has two layers. There is the visible UI part, but also the invisible server which provides a tree-like list of the audio card resources. Thus, in order to automate anything related to audio interface management, there is no need to work "through" the web app; one can talk to the server part directly. This is great, because any automation which tries to manipulate the UI is fragile by definition.

I've found the reference manual for MOTU's AVB Datastore API, which is indeed very simple, and any CLI web client can work with it. Another useful fact that I discovered by accident is that although the datastore server reports the range of accepted values for the trim level of analog outputs 1–6 as being only from -24 dB to 0 dB, it happily accepts lower values; thus the effective trim range of all analog outputs is the same, going down to -127 dB.

I decided to re-purpose some of the physical controls on the card itself to serve my needs. Since I have a dedicated headphone amplifier, I never use the phones output, thus the rotary knob and the associated trim level for the phones output can instead be used to control the trim level of the analog outputs. When turning the knob, the current volume level is displayed on the audio interface's LCD screen. This is needed because the knob is just a rotary encoder, not a real attenuator, thus there is no association between the current level and the knob's position. This fact actually makes it much less comfortable than the VCA volume pot which I made myself to use with the RCU-VCA6A. With the encoder it's convenient to perform small adjustments—a couple of dB up or down, but turning the volume all the way down requires too many turns. Thus, I also wanted to have a mute button. I decided that since I have never used the second mic input, I can use its "Pad" button to fulfill this role. The "Pad" button has a state, and it's lit when it's turned on.

In order to implement these ideas, I had to write a script. I decided not to use Python to avoid over-designing the solution; instead I turned to something really simple, stemming from the original UNIX systems of the 70s—a bash script. In fact, this solution is truly portable, as the POSIX subsystem is part of any modern operating system, including even Windows (via WSL).

The logic of the script is simple. I use the volume of the phones output as the source value for all analog outputs driving my desktop speakers: outputs 1–5. The volume of the output 6 is used to store the mute level. Thus, in practice it can be a full mute, or it can be just a "dim" (for example, -40 dB). The value of the pad for Mic 2, as I've mentioned before, is used to switch between normal and mute trim levels. This way, the control logic can be described as follows:

  1. Read the current values of the "main" and "mute" trims, and the value of the mute toggle switch.
  2. Depending on whether the device is supposed to be muted according to the switch, swap the values as needed.
  3. Apply the volume to the analog outputs.

Then the script uses the ETag polling technique to ask the server to report back when any of the values have changed as a result of a user action (this is also described in MOTU's manual). Then everything goes back to the start.
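As an illustration of the same idea in Python (my actual implementation is the bash script linked below): the resource paths, the device address, and the exact long-polling and write conventions are assumptions on my part; double-check them against MOTU's AVB Datastore API manual:

import requests

BASE = "http://ultralite-avb.local/datastore"  # hypothetical device address
PHONES_TRIM = "ext/obank/0/ch/0/trim"          # placeholder resource path

def get_value(path, etag=None, timeout=30):
    # Long-poll a datastore value: returns (new_etag, value), or (etag, None)
    # if nothing changed before the server's timeout (HTTP 304).
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(f"{BASE}/{path}", headers=headers, timeout=timeout)
    if resp.status_code == 304:
        return etag, None
    resp.raise_for_status()
    return resp.headers.get("ETag"), resp.json().get("value")

def set_value(path, value):
    # Write format is an assumption; consult MOTU's manual for the exact
    # verb and payload encoding.
    requests.patch(f"{BASE}/{path}", data={"json": '{"value": %s}' % value})

etag, _ = get_value(PHONES_TRIM)
while True:
    etag, value = get_value(PHONES_TRIM, etag)
    if value is not None:
        # Mirror the phones trim onto the speaker outputs (placeholder paths).
        for ch in range(5):
            set_value(f"ext/obank/1/ch/{ch}/trim", value)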

The full script code is here on GitHub; it's only about 70 lines of code. If needed, this way of controlling the MOTU interface can be extended to be fully remote.