Saturday, March 26, 2022

Modular Audio Processing System, Part I

Recently I finished assembling a system that I use for casual audio playback and for audio experiments. I decided to reflect on the process of choosing and tuning its components, and to mention other options I was considering along the way.

The goal of the system is to accept audio from various sources: files, mobile devices, and web browsers; process it to apply the necessary room/speaker correction and binaural rendering; and then play it on speakers and headphones. There are, of course, numerous ready-made commercial solutions for this, packing everything into one unit; however, my intention was to have a truly modular system where each component can be replaced and new functionality can be added if needed. Essentially, the system is built around a Mac computer with a pro soundcard. The challenging part was to figure out what additional equipment I need and how to organize it physically, so that it does not just lie in a pile on the table, entangled in a web of cables.

For a long time now I have stuck to the "half rack" (9.5-inch) equipment form factor. I have a couple of racks and some audio equipment that either was designed for this format or can easily be adapted to it. Here is what my current rack looks like:

Below is a schematic of the connections between the blocks:

As I've mentioned before, the heart of the system is a Mac—an old Mini from 2014, which I'm also using to type this post. I've highlighted the inputs and the outputs of the system. Some of the input and output ports are mounted on the back panel:

All other interconnections are hidden inside the rack, which makes the result look tidy, if not "professional." Now let's consider the system component by component.

Mac Mini + MOTU UltraLite AVB

These two components essentially make a single unit equivalent to a dedicated DSP system, but in my opinion, more flexible. I wrote about the capabilities of UltraLite before: here and here. The features that I use for my needs are:

  • The routing matrix, which is convenient for collecting various inputs: from applications, from external hardware, and via Ethernet. Note that I only use the digital inputs on the UltraLite in order to avoid adding noise.
  • AVB I/O is worth mentioning on its own because, I think, it's much more flexible than traditional point-to-point digital audio interfaces such as SPDIF and USB.
  • DSP with some basic EQ functionality. As I wrote in the earlier post on the speaker system setup, it's enough for simple corrections; however, it is incapable of "serious" linear-phase processing, which is done on the Mac.

As for the functionality where the UltraLite falls short, the Reaper DAW running on the Mac Mini fills in the gaps. Here I can run linear-phase FIR filters, linear-phase equalizers, crossfeed and reverberation for headphones, and create arbitrary audio delays for synchronizing audio outputs. Note that although the Mac Mini isn't a fanless computer, it stays perfectly quiet while running all this processing. Only occasionally does it briefly turn on the fan—this sounds like a loud exhale—and then it keeps itself silent for a while.
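
To give an idea of the kind of processing involved, here is a rough sketch in Python with NumPy/SciPy (not the actual Reaper plugins I use) of designing a linear-phase FIR correction filter from a target magnitude curve and adding a fixed delay for output synchronization; the frequencies, gains, and delay below are made up for illustration:

```python
import numpy as np
from scipy.signal import firwin2

fs = 48000

# Hypothetical correction curve: (frequency in Hz, linear gain) pairs.
freqs = [0, 40, 120, 1000, 8000, fs / 2]
gains = [0.5, 1.0, 0.9, 1.0, 1.1, 1.0]

# firwin2 produces a symmetric (linear-phase) FIR; more taps give finer
# low-frequency resolution at the cost of latency.
taps = firwin2(4095, freqs, gains, fs=fs)

def process(block, delay_ms=2.5):
    """Apply the correction filter and an extra alignment delay."""
    delay_samples = int(round(delay_ms * fs / 1000))
    out = np.convolve(block, taps)
    return np.concatenate([np.zeros(delay_samples), out])
```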

Having an AVB input is nice as it allows using other computers to provide audio. Although the Mac Mini runs Reaper alone with no glitches, launching a web browser inevitably introduces them—modern browsers are very heavy CPU-wise and perform a lot of disk I/O. That's why whenever I have to use a browser-based streaming client, I prefer to run it on a separate computer or a mobile device. The beauty of AVB is that, on a Mac at least, one does not need any extra hardware audio interfaces; the only thing needed is a Thunderbolt-to-Ethernet dongle.

I also use the UltraLite as a DAC—it has 8 line-level outputs. Despite the fact that MOTU's interfaces are considerably cheaper than functionally equivalent interfaces from RME, the quality of their analog outputs is on par, if not better. For example, below is a comparison of a THD+N measurement for a 0 dBFS 1 kHz tone at +13 dBu: the UltraLite's line out vs. the line out of an RME FireFace UCX (1st gen), both interfaces running at a 48 kHz sampling rate, as measured by an E-MU 0404 (this is to eliminate any possible bias from a card measuring itself via an analog loopback):

A note on the output level: RME offers selectable output levels for its line outputs with the following modes: -10 dBV, +4 dBu, and Hi Gain (see the tech specs). From measurements, I've found the +4 dBu mode to be the "sweet spot": the signal level is high enough, yet distortion is lower than in Hi Gain mode. On the MOTU UltraLite, the same output level is achieved by setting the output trim to -6 dB (that is, the full-volume output level on the UltraLite is equivalent to the Hi Gain mode on the FireFace).
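
For readers less familiar with these level units, here is a quick reference sketch for converting between dBu, dBV, and volts RMS (0 dBu is 0.775 V RMS, 0 dBV is 1 V RMS); the +13 dBu figure used above corresponds to roughly 3.5 V RMS:

```python
def dbu_to_vrms(dbu):
    return 0.7746 * 10 ** (dbu / 20)   # 0 dBu = 0.7746 V RMS (1 mW into 600 Ohm)

def dbv_to_vrms(dbv):
    return 1.0 * 10 ** (dbv / 20)      # 0 dBV = 1 V RMS

print(dbu_to_vrms(13.0))   # ~3.46 V RMS, the measurement level above
print(dbu_to_vrms(4.0))    # ~1.23 V RMS, "pro" nominal level
print(dbv_to_vrms(-10.0))  # ~0.32 V RMS, "consumer" nominal level
```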

As we can see from the direct comparison, the output from the MOTU is cleaner (this could also be seen by comparing tech specs, but I wanted to double-check; also, MOTU specifies the output THD+N as low as -112 dB, which I could not confirm). Note that the 2nd generation of the RME FireFace UCX declares better specs, but it's still much more expensive than the MOTU. The huge benefit of the RME interface is its rock-solid stability. With the MOTU I occasionally run into an issue where a high-pitched, signal-dependent noise appears, and fixing it requires rebooting the interface.

Lossless Wireless Output via Audreio

All the connections in my system use wires. However, sometimes it's nice to be able to drop the headphone wire, especially when I'm listening to music while doing things away from the computer. Of course, I did try a Bluetooth option, using a transmitter based on a Qualcomm chipset which supports the aptX HD codec—still lossy, but sounding good nevertheless. However, it seems that BT transmitters, despite what their marketing materials say about the connection range, are mostly designed for the case when the listener sits in front of a TV. Once there is no straight line of sight between the TX and the RX, or when I move a bit further away, the connection switches to a lower bitrate and finally degrades to the SBC codec, which sounds noticeably lossy.

Because of this limitation of Bluetooth I decided to use WiFi. WiFi stations have more powerful transmitters and can turn up the power or even beam the radio waves in a particular direction—whatever it takes to keep a high-bitrate connection with the receiver. I use my phone as an endpoint and run the Audreio plugin inside Reaper on the Mac to transmit audio to their app running on the phone. Since I'm not interested in low latency, I set the largest buffer size, and this works well with almost no dropouts as I move around my house. Unfortunately, further development of Audreio has been canceled; however, the last released app and plugin versions are stable, and I haven't run into any issues while using them.

Naturally, with lossless audio delivered to the phone, it would be unwise to use wireless headphones. Instead, I usually use the ER4SR by Etymotic, powered by a simple headphone DAC dongle from UGreen—we will check its performance later.

Drop + THX AAA 789 Headphone Amp

I picked this amp because it has balanced line inputs and a variable gain setting. I prefer balanced line inputs because there can be strong electromagnetic fields inside the rack due to the presence of power supplies. The Drop amplifier is indeed very linear, thanks to the THX circuit design.

Other options that I tried:

  • Built-in output on the MOTU UltraLite. This one I only use for "debugging" or quick A/B comparisons. MOTU's output has a relatively high output impedance, is not as linear as the Drop, and the volume control is not very convenient.
  • Phonitor Mini. I used these for a long time, and they were my reference for the crossfeed implementation. They also have balanced inputs and a dedicated mute switch, so you can leave the volume setting intact when you need a brief pause. However, both units that I had suddenly broke at some point.
  • AMB M3 (my build). Due to its high power, this amplifier is also very linear under normal listening conditions; however, it has two shortcomings: the lack of balanced inputs, and a high level of cross-talk between channels (not to be confused with crossfeed) due to its "active ground" design. More about this in my old post.

I don't consider a headphone amplifier to be a tool for making listening more "enjoyable." In my setup all psychoacoustic improvements are achieved by DSP processing. Thus, what is left to the headphone amp is to be as "transparent" as possible, and that means:

  • linearity and low noise,
  • close to zero output impedance, and
  • consistent left-right balance across the volume range.

It's interesting to compare the output of Drop's amplifier with the mobile UGreen DAC into the same resistive 33 Ohm load. The Drop is driven by the line out of the MOTU; the UGreen dongle is connected to an iPhone running the Audreio app. The signal source is the same—REW. Volumes are set to provide the same output level of about 2 dBu: for the UGreen this is maximum volume, while the Drop still has some room for increasing the volume, with the gain setting at I (minimal gain):

Unfortunately, there was some mains noise during the measurement, which degraded the calculated THD+N value. However, the THD figure is the same as we saw on the MOTU's line out directly, which is very good.

It's clear that the dongle is struggling at this output level. Better THD (fewer harmonics) is seen when the output volume is reduced to about 3/4; however, this lowers the signal-to-noise ratio. Still, I think for a $15 dongle the results are good. I was also impressed that the output impedance of the dongle is only 0.3 Ohms. I'm glad that the manufacturer follows good engineering practices.
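
As a side note, output impedance can be estimated without special gear: measure the output RMS voltage unloaded and then into a known resistive load, and solve the voltage divider. A minimal sketch (the voltages below are made-up numbers, not my actual readings):

```python
def output_impedance(v_unloaded, v_loaded, r_load):
    """Solve the source/load voltage divider for the source impedance."""
    return r_load * (v_unloaded / v_loaded - 1.0)

# hypothetical readings into the 33 Ohm load used above
print(output_impedance(v_unloaded=1.000, v_loaded=0.991, r_load=33.0))  # ~0.3 Ohm
```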

RDL RU-RCA6A Multichannel Volume Control

With the excellent hybrid analog-digital volume controls that modern audio interfaces offer, one could wonder whether there is still a need for an external analog volume control. Well, sometimes there is. In my topology, only some of the outputs from the UltraLite go to speakers, and there are more than two outputs that need to be controlled simultaneously. Surprisingly, doing this in a convenient manner is still a challenge! Usually volume controls in sound cards only work for stereo pairs of channels or for each individual channel. Even MOTU's excellent web interface offers no way to "bind" several channels into a group for volume control purposes.

That's why I decided to use an external volume control unit. Another valid use case is when there is a need to build a multichannel output out of multiple stereo DAC units. "Multichannel" does not necessarily mean surround sound; it might be a stereo setup with line-level "active" crossovers for bi- or tri-amping.

Side note: these days there are plenty of "multi-room" playback systems offering time and volume level synchronization while playing over computer wireless or wired networks. It's an interesting option to explore, but keep in mind that most of those consumer network protocols, like AirPlay, are limited to CD audio quality (44.1 kHz / 16 bit). I will talk about AirPlay in the next part of this post.

My specific requirement for the unit was the form factor. It's easy to find a multichannel AV "prepro," like the Marantz unit I wrote about some time ago; however, they are all made in the standard 19-inch "full rack width" format. Luckily, "pro" equipment in the right format exists, but unfortunately it's typically quite expensive. Two "mid-price" pieces of equipment that I could find are the RCA6A by Radio Design Labs (RDL) and the Volume8 by Sound Performance Lab (SPL—the same company that makes the Phonitors). I plan to do an in-depth comparison of these units some time later. For now, I can say that the Volume8 is all about audio quality, while the RCA6A was built with a focus on remote control options. The option I chose for controlling the RCA6A is a simple 10 kOhm pot, which I put in an enclosure:

The matte finish of the big black aluminum knob I found on Amazon pairs nicely with the plastic of the box.

What about the quality of the RCA6A? Yes, for the purist in me, it could be better. Below is the output from the RCA6A at unity gain, fed by the UltraLite's line output at the same output volume setting I was using for the THD+N measurement:

As we can see, the RCA6A adds some non-negligible odd harmonics. Note that the level of the 2nd harmonic, which originates from the UltraLite, remains the same, while the 3rd harmonic rises by almost 13 dB. The noise level goes up by 3 dB. However, because I still use inexpensive KRK Rokit monitors, I doubt that the RCA6A is the "bottleneck" in terms of audio quality.

To be continued

As shown in the photo of the rack and in the connection diagram, there are also a couple of digital audio units and a power supply. I will discuss them in the next post.

Saturday, December 18, 2021

YouTube Music and Intersample Peaks

A while ago I checked how Google Play Music and the playback chain following it handle intersample peaks. Since then, GPM was retired and replaced with YouTube Music (YTM), browsers got countless updates, and so on. Did the situation with digital headroom improve? I was prompted to check this after I tried using YTM in the Chrome browser on my Linux laptop and was disappointed with the quality of the output. Before that, I had been using YTM on other OSes, and it sounded fine. Is there anything wrong with Linux? I decided to find out.

I have updated my set of test files. I took the same test signal I used back in 2017: a stereo file where both channels carry a sine at a quarter of the sampling rate (11025 or 12000 Hz) with a phase shift of 45 degrees. The left channel has this signal normalized to 0 dBFS, which creates intersample overs peaking at about +3 dBFS; the right channel has it at half of full scale (6 dB down), which provides enough headroom and should survive any transformations:
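
Conceptually the signal boils down to the following sketch (Python with NumPy and the soundfile module; this is just an illustration of the construction—the dithering step is omitted here):

```python
import numpy as np
import soundfile as sf

fs = 44100                      # or 48000
n = fs * 10                     # 10 seconds
k = np.arange(n)

# Sine at fs/4 with a 45-degree phase offset: every sample lands on
# +/- sin(45 deg) ~= 0.707, so normalizing the samples to 0 dBFS pushes
# the true waveform peaks about 3 dB above full scale.
x = np.sin(np.pi / 2 * k + np.pi / 4)

left = x / np.abs(x).max()      # sample peaks at 0 dBFS, intersample overs ~+3 dBFS
right = left / 2                # 6 dB down: enough headroom to survive resampling

sf.write("isp_test_%d.wav" % fs, np.column_stack([left, right]), fs, subtype="PCM_24")
```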

I have produced a set of test files to include all the combinations of the following attributes:

  • sample rate: 44.1 and 48 kHz;
  • bit width: 16 and 24;
  • dither: none and triangular.

There are two things that I can validate using these signals: non-linearities introduced by clipping or compression of intersample peaks, and whether the inter-channel balance stays the same. For measuring non-linearities I used the THD+N measurement. Since the signal is at a quarter of the sampling rate, even the second harmonic is out of the frequency range, so the "harmonic distortion" part of this measurement doesn't make much sense; however, the "noise" part still does. There is a strong correlation between the look of the frequency response graph and the value of the THD+N.
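
To make the "noise part" idea concrete, here is roughly how these two metrics could be computed from a captured playback (a simplified sketch; REW's actual THD+N algorithm is more sophisticated about windowing and notch width):

```python
import numpy as np

def thd_n_db(x, fs, f0, notch_hz=100.0):
    """Ratio of everything-but-the-fundamental to the total signal, in dB."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    fundamental = np.abs(freqs - f0) < notch_hz / 2
    total = np.sqrt(np.sum(spec ** 2))
    residual = np.sqrt(np.sum(spec[~fundamental] ** 2))
    return 20 * np.log10(residual / total)

def channel_balance_db(left, right):
    """RMS level difference between the channels (should stay at 6 dB)."""
    rms = lambda s: np.sqrt(np.mean(s ** 2))
    return 20 * np.log10(rms(left) / rms(right))
```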

I have uploaded my test signals to YouTube Music and then measured the THD+N in the following clients:

  • the official web client running in recent stable versions of Chrome and FireFox on Debian Linux, macOS, and Windows,
  • and the official mobile apps running on an Android phone (Pixel 5) and iPad Air.

All the outputs were measured using a digital capture chain. For macOS and Windows I used a hardware loopback on the RME FireFace card. For Linux I used a Douk Audio USB-to-S/PDIF digital interface (Mini XMOS XU208), which was connected via an optical cable to the FireFace card. For mobile devices I used the iConnectAudio4 by iConnectivity, a dual USB sound card. The sound cards were configured at either 44.1 or 48 kHz.

Observations and Results

The first thing I noted was that YouTube Music stores audio tracks at a 44.1 kHz sample rate (this is confirmed by the "Encoding Specifications" in the YT tech support pages), and 48 kHz files get mercilessly resampled, clipping the channel with overs quite severely. This can easily be seen from the level difference between the L and R channels of the played-back signal—it's only 4.34 dB instead of 6 dB. Below is the spectrum of the 48 kHz test signal after it has gone through YTM's server guts:

Also, as can be seen from the graph, YTM does some "loudness normalization" by scaling the amplitude of the track down, likely after resampling it to 44.1 kHz. This causes the peaks on both channels to be down by about 11 dB. Actually, that's good, because it provides the needed headroom for any sample rate conversions happening after the tracks leave the YTM client.

As for the lossy compression, it actually doesn't add many artifacts, as we can see from this example:

Yes, there is a "bulb" around the original signal likely added due to the fact that the codec works in the frequency domain and has reduced resolution. However, the THD+N of this signal is just 3 dB down (-103.4 dB) from the 16-bit dithered original (-106.8 dB), and it's still on par with capabilities of good analog electronics. So, lossy codec is not on the list of my concerns for the content on YTM.

Desktop Clients

On desktop, the differences in the measurements depend only on the browser. However, the trouble with Linux is that both Chrome and FireFox always switch the output to 48 kHz as they start playing, even if the PulseAudio daemon is configured to use 44100 Hz for both the "default" and "alternative" sample rates. As we will see, this works out badly for Chrome and was likely the reason why I initially felt that something was going wrong with YTM on Linux.

Yet another interesting observation on the desktop is that when the browser does a bad job of resampling, turning the digital volume control down in the YTM client does not provide any extra headroom for the browser's processing. That was a bummer! Apparently, the order of the processing blocks has changed compared to Play Music, putting the digital attenuation after resampling, maybe because YTM uses some modern web audio API which gives the browser more control over media playback.

Here is a summary of THD+N measurements for Chrome and FireFox for the cases when the system output is either at the "native" sampling rate—44.1 kHz—or at 48 kHz. In the leftmost column are baseline numbers for the original dithered signal; measurements for the left and the right channel are separated by a slash:

Signal (baseline L / R)  | Chrome to 44.1  | Chrome to 48   | FireFox to 44.1 | FireFox to 48
24/44, -146.7 / -139.2   | -102.7 / -103.7 | -29.6 / -82.6  | -103.4 / -103.7 | -103.3 / -103.5
16/44, -106.8 / -95.5    | -102.9 / -97.8  | -29.6 / -82.5  | -102.1 / -97.8  | -102.3 / -97.6
24/48, -147.5 / -139.7   | -17.7 / -98.4   | -17.7 / -80.6  | -17.7 / -98.4   | -17.7 / -98.4
16/48, -106.7 / -95.6    | -17.7 / -89.4   | -17.7 / -79.7  | -17.7 / -89.5   | -17.7 / -89.3

As we can see here, Chrome doesn't do a good job when it has to resample the output to 48 kHz, so on Linux the only option is to use FireFox instead. And obviously, even FireFox can't undo the damage already done to the original 48 kHz signal with intersample overs.

My guess would be that the audio path in FireFox uses floating-point processing, which provides the necessary headroom, while Chrome still uses integer arithmetic.

Mobile Clients

Results from iOS are on par with FireFox, confirming that this is likely the best result we can achieve with YTM. Android adds more noise:

Signal (baseline L / R)  | Android to 44.1 | Android to 48  | iOS to 44.1     | iOS to 48
24/44, -146.7 / -139.2   | -92.9 / -92.2   | -92.9 / -92.2  | -102.8 / -102.2 | -102.8 / -102.2
16/44, -106.8 / -95.5    | -92.6 / -88.3   | -92.6 / -88    | -101.8 / -97.7  | -102 / -97.7
24/48, -147.5 / -139.7   | -17.7 / -92     | -17.7 / -92    | -17.7 / -98.5   | -17.7 / -98.5
16/48, -106.7 / -95.6    | -17.7 / -87.9   | -17.7 / -87.8  | -17.7 / -89.4   | -17.7 / -89.4

I had a chance to peek "under the hood" of the Pixel 5 by looking at the debug dump of the audio service. What I could see there is that extra sample rate conversions happen on the way from the YTM app to the USB sound card. The app creates audio tracks with a 44100 Hz sample rate. However, USB audio on modern Android phones is managed by the same SoC audio DSP used for the built-in audio devices, to bring down latency when using USB headsets. The DSP works at 48 kHz. Thus, even when the USB sound card is set to 44.1 kHz, the audio tracks from YTM first get upsampled to 48 kHz to reach the DSP, and then the DSP downsamples them back to 44.1 kHz for the sound card. I guess that on Apple devices either this pipeline is more streamlined, or every stage (including the DSP) uses calculations providing enough headroom.
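
The damage such a round trip can do to a signal with intersample overs is easy to reproduce offline. Below is a sketch of the 44.1 → 48 → 44.1 kHz conversion using SciPy; the clipping step stands in for any intermediate stage that lacks headroom (whether Android's DSP actually clips at that point is my assumption, not something I verified):

```python
import numpy as np
from scipy.signal import resample_poly

fs = 44100
k = np.arange(fs)
x = np.sin(np.pi / 2 * k + np.pi / 4)   # the fs/4 test tone, samples at +/-0.707
x /= np.abs(x).max()                    # sample peaks at 0 dBFS, true peaks ~+3 dBFS

up = resample_poly(x, 160, 147)         # 44.1 kHz -> 48 kHz (48000/44100 = 160/147)
clipped = np.clip(up, -1.0, 1.0)        # a headroom-less fixed-point stage clips the overs
down = resample_poly(clipped, 147, 160) # back to 44.1 kHz for the USB card

print("peak after round trip: %.2f dBFS" % (20 * np.log10(np.abs(down).max())))
```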

Conclusions

I think it is all pretty clear, but here is a summary of how to squeeze the best quality out of YouTube Music:

  • on desktop, when using Chrome (or Edge on Windows), set the output sampling rate to the native sample rate of YTM: 44.1 kHz; if that's not possible, use FireFox;
  • on Linux, always use FireFox instead of Chrome for running the YTM client, because even lowering the digital volume in the YTM client does not prevent clipping;
  • because YTM applies volume normalization, there is no need to worry about digital headroom on the DAC side;
  • any 48 kHz or higher content needs to be carefully resampled to 44.1 kHz before uploading to YTM to prevent damage from their sample rate conversion process.

Monday, November 1, 2021

Headphone Stereo Setup

After making a satisfying desktop stereo setup, I decided to do something similar with headphones. As I had discussed before, right out of the box no headphones sound convincing to me, simply due to physics and psychoacoustic issues that can't be solved using traditional headphone construction. As a result, it's just not possible to reproduce a stereo record intended for speakers and expect the instruments to be placed correctly in the auditory image, even on "flagship" headphones. I'm always puzzled when I encounter phrases like "rock-solid imaging" in headphone reviews, especially when accompanied by measurement graphs confirming that the left and the right earphones are perfectly matched. I don't know—perhaps the reviewer has a perfectly symmetric head and body, and ideally matched ears. For my aging ears I know that the right one is about 3 dB more sensitive than the left, so with perfectly matched headphone drivers the auditory image is naturally shifted slightly to the right.

On the other hand, in order to achieve convincing-sounding stereo reproduction in headphones, it's not necessary to go "full VR": measure the individual HRTF of the listener in an anechoic chamber, and then perform a physically correct simulation of speakers as virtual sound sources placed in front of the listener in a room, moving around as the listener's head moves. In fact, after trying Waves NX for some time, I've found that head tracking only creates an additional distraction, as it requires periodically resetting the "neutral" head position due to the headband shifting on the head. So I wanted something simpler, and I think I've found a good middle ground for myself with my setup.

In my headphone setup I follow the same principles as when setting up the desktop speakers—get the most important things right first, and then tune the rest, getting as close to "ideal" as possible but stopping when the cost of the next improvement becomes too high. However, the implementation of these principles is a bit different. There isn't as much "physical alignment" in a headphone setup as one has to do for speakers. The only thing I had to ensure was that the headphone amplifier stays linear and doesn't produce distortion. Most of the setup then happens on the DSP side. But even there, a distinction between "main" and "fine" tuning does exist.

As I had explained in my earlier post on headphone virtualization, reproduction over headphones lacks several components that we take for granted when listening over speakers:

  1. Room reverberation. This is a very important component which significantly supports the sound of the speakers themselves and also helps to place reproduced sources correctly in the auditory image. Acousticians love to talk about the "direct-to-reverb" sound ratio when considering opera halls and other venues, as this is one of the parameters that separates good-sounding spaces from bad-sounding ones.

  2. Acoustical leakage between speakers. This is considered a negative factor in VR-over-speakers applications, because for VR one needs to precisely control the sound delivered to each ear; however, stereo recordings actually rely on this acoustical leakage. Without it, sources that are hard-panned to one channel tend to "stick" to the earphone that is playing them, narrowing the soundstage considerably.

  3. Asymmetries in the human body and the hearing system. Listening over headphones makes the sounds arriving at the left and right ears very symmetric, and this confuses the auditory system. Also, with aging, the sensitivity of the ears becomes less and less symmetrical, requiring individual tuning of headphones.

To achieve more realistic reproduction over headphones, we need to replicate the effects of the factors listed above. Some headphone manufacturers have tried to do that in hardware, and we got products like the AKG K1000 "earspeaker" headphones, which I guess sound almost right for stereo records but are quite cumbersome to use, not to mention the price. A good pair of open over-ear headphones can also come close to naturalistic stereo reproduction because they allow for some inter-aural leakage as well as slight interaction with the room. However, closed over-ear headphones and IEMs are hopeless in this respect, and only electronic or digital solutions can help them produce a speaker-like stereo soundstage.

Before we dive into the details of my setup, here are the two main factors that are indicative for me when judging the correctness of headphone tuning:

  • The sound is localized outside of the head. The actual perceived distance still depends on the recording, and sometimes it feels like the vocals are still very close to your face—for lots of modern records that's in fact the "artist's intention"—however, a quick A/B comparison with unprocessed headphone sound makes it clear that although the sound appears to be close to the face, it's definitely not inside the head.

  • Every instrument can be heard distinctly, similar to how it sounds over well-tuned stereo speakers. By replicating the person's natural HRTF via headphone tuning, we "place" each frequency band correctly in the auditory image, and this allows the auditory system to separate auditory streams efficiently.

As a final analogy, putting on properly tuned headphones feels similar to wearing VR glasses—you feel "immersed" in the scene, as if you are peeking into it through some "acoustic window."

The Tuning Process

The process of headphone tuning can be separated into several phases:

  1. Simulate ideal reverberation conditions for the actual room we are listening in. Although we can simply capture the reverb of the room, it's usually far from "ideal" due to strong reflections. If you went all the way and built an ideal physical room—congratulations!—you can just use the captured response directly. Otherwise, one can build a great virtual version of their room instead.

  2. Adjust the crossfeed and the direct-to-reverb (D/R) ratio, making sure that phantom sources end up placed correctly, especially those in "extreme" positions—beyond the speakers. This tuning also moves the acoustic image out of the head.

  3. Tune the inter-aural frequency balance. This way we emulate the natural HRTF and any deficiencies of the individual's hearing apparatus that the brain has become accustomed to.

  4. Finally, as an optional step we can use time domain signal correction to ensure that the electrical signal reaching the headphones has properties close to those of an ideal low-pass filter.

As for the choice of headphones, my intention was to create a tuning for the Shure SRH-1540 closed-back headphones. These headphones are very comfortable to wear: lightweight, with negligible pressure on the head, and they don't cause my ears to sweat. However, their factory tuning is very V-shaped—a strange choice for "studio" headphones, by the way. I strongly prefer the tuning of headphones made by Audeze because it closely resembles the sound of properly tuned speakers (and I have confirmed that by measuring with in-ear microphones), but the weight of planar magnetic headphones literally drags my head down (I compared the weights in one of my previous posts), and their thick faux leather pads quickly turn my ears into hot dumplings. So I ended up using the Audeze EL-8 closed-back as a tool for tuning, but after finishing with it I put them back into their box.

Reverberation

The idea behind replicating the reverberation of the room is that once we enter a room our hearing system adapts to it, and uses reflected sounds and reverberation as a source of information for locating positions of sources. This happens unconsciously—we just "feel" that the sound source is out there, without actually "hearing" the reflected sounds, unless the delay is large enough to perceive them as echoes. Thus, when we replicate the reverberation of the room over headphones, this helps the auditory system to perceive the sounds we hear as happening around us, in the room.

I captured the reverberation of my room using the classic technique of recording the pop of a balloon. Then I took the TrueVerb plugin by Waves and tried to tune its parameters so that the resulting reverb matches the one I captured. Speaking of "ideal" reverberation—I liked the idea of an "ambechoic" room pioneered by G. Massenburg, which I read about in the book "Acoustic Absorbers and Diffusers". The physical implementation of "ambechoic" requires using a lot of wide-band diffusers in order to "break up" all the reflections while retaining the energy of the reflected sound. In the virtual recreation, I simply turned off the early reflections simulation and set the density of the emulated reverb tail to the maximum value, and this is what I got (ETC graph):

The first strong reflection (marked by the cursor) is created by Redline Monitor—more on that later. Note that the reverb tail still looks a bit spiky, but this is the best I could obtain from TrueVerb.
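
For those who want to plot an ETC from their own impulse response captures: it is simply the envelope of the IR on a dB scale. A minimal sketch (Smaart computes this internally, of course):

```python
import numpy as np
from scipy.signal import hilbert

def etc_db(ir):
    """Energy-time curve: the impulse response envelope, normalized, in dB."""
    envelope = np.abs(hilbert(ir))
    envelope /= envelope.max()
    return 20 * np.log10(np.maximum(envelope, 1e-6))
```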

I'm not very good at matching reverbs "by ear," so I used two tools: the IR measurements of Smaart V8 and the RT60 calculator of Acourate. The first has a good proprietary algorithm for finding the D/R ratio and overall decay times; the second conveniently shows decay times for each frequency band and can display tolerance curves from standards.
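
I don't know the exact algorithms these tools use, but the textbook way to get decay times from an impulse response is Schroeder backward integration. Here is a rough sketch that fits the -5…-25 dB range (a T20 estimate) and extrapolates to 60 dB; running it on band-filtered copies of the IR would give per-frequency decay times:

```python
import numpy as np

def rt60_schroeder(ir, fs):
    """Estimate RT60 from an impulse response via Schroeder backward integration."""
    decay = np.cumsum(ir[::-1] ** 2)[::-1]          # remaining energy at each moment
    decay_db = 10 * np.log10(decay / decay.max())
    t = np.arange(len(ir)) / fs
    fit = (decay_db <= -5) & (decay_db >= -25)      # T20 region
    slope, _ = np.polyfit(t[fit], decay_db[fit], 1) # dB per second
    return -60.0 / slope
```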

Below are side by side comparisons of ETC for the real vs. emulated rooms as shown by Smaart V8:

I tried to get them as close as TrueVerb's controls allowed me to. The early decay time (EDT) of the simulation is much shorter due to the absence of early reflections, but I don't think that's an issue. The RT60 time is 25% shorter—I was trying to make it the same as the room's, but there are limits on the granularity of TrueVerb's settings. However, this shorter time is still good according to the comparison graph below—it shows per-frequency decay times along with tolerance boundaries from the DIN 18041 standard for music reproduction, calculated by Acourate for a room of my size:

Although I didn't try matching reverbs "by ear," I still listened to them carefully, as measurements alone do not provide the full picture. During my early experiments I intended to use the built-in reverb of my MOTU soundcard—after all, it comes for free! However, despite looking similar on the measurement side, MOTU's reverb sounded horrible, with a very distinctive flutter echo. By the way, dry recordings of percussive instruments like castanets or bongos turned out to be excellent for revealing flaws in artificial reverbs.

Cross-feed and D/R ratio

TrueVerb was designed to be sufficient on its own for providing a stereo reverb and controlling its frequency response. However, the degree of control it provides wasn't enough for my specific needs. As a result, I ended up using the mono version of TrueVerb on two parallel buses and augmenting it with Redline Monitor and an equalizer. Here is the connection diagram:

Note that TrueVerb outputs the reverb tail only. This way, I've got full control over the stereo dispersion and the spectral shape of the reverb tail. After playing with different settings on Redline Monitor, I ended up with a 90-degree soundstage—this way the reverb sounds "enveloping," which was exactly my goal.

The direct sound is placed on a separate bus, with its own instance of Redline Monitor and its own set of crossfeed parameters. By altering the volume control on this bus I can change the direct-to-reverb ratio.
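
Conceptually, the routing boils down to two parallel buses mixed together. Here is a toy offline sketch of that structure (the crossfeed here is just a plain mixing matrix, nothing like Redline Monitor's actual algorithm, and the reverb is represented by convolution with a tail-only impulse response):

```python
import numpy as np

def toy_crossfeed(left, right, amount=0.3):
    # stand-in for Redline Monitor: bleed some of each channel into the other
    return (1 - amount) * left + amount * right, (1 - amount) * right + amount * left

def render(left, right, reverb_ir, direct_db=0.0, reverb_db=-12.0):
    d = 10 ** (direct_db / 20)
    r = 10 ** (reverb_db / 20)
    # direct bus: crossfeed only
    dl, dr = toy_crossfeed(left, right)
    # reverb bus: tail-only IR; in the real chain it gets its own, wider crossfeed
    rl = np.convolve(left, reverb_ir)[: len(left)]
    rr = np.convolve(right, reverb_ir)[: len(right)]
    # the two bus gains set the direct-to-reverb ratio (+12 dB with these defaults)
    return d * dl + r * rl, d * dr + r * rr
```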

On the Redline Monitor for the direct sound, I've pinned the "speaker distance" parameter to the minimum value above zero: 0.1 meter. What I've found is that zero distance doesn't provide convincing externalization; however, increasing the speaker distance adds a considerable combing effect—see my previous post about Redline Monitor for graphs. What I could see on the ETC graph is that turning up the "speaker distance" knob adds virtual reflections. Here I compare the settings of 0 m, 0.1 m, and 2.0 m distance:

I suppose the presence of reflections emulates the bounce of the sound off the mixing console (since Redline Monitor is intended for studios). As the "speaker distance" increases, the level of these reflections rises relative to the direct impulse. That makes sense—the further one moves away from the speakers, the closer the levels of the direct sound and the first reflection become. However, this increases the amplitude of the comb filtering ripples, so the minimum possible "speaker distance" is what we want to use. This setting keeps the emulated reflection at -26 dB below the level of the direct sound—an acceptable condition if we consider a real acoustic environment.
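
A quick back-of-the-envelope check of why -26 dB is acceptable: a single reflection with relative amplitude a produces comb-filter peaks and dips of 20·log10(1 ± a), so at -26 dB the ripple stays within about half a decibel:

```python
import numpy as np

a = 10 ** (-26 / 20)             # reflection 26 dB below the direct sound
peak = 20 * np.log10(1 + a)      # ~ +0.42 dB
dip = 20 * np.log10(1 - a)       # ~ -0.45 dB
```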

After fixing the speaker distance, I spent some time tweaking multiple parameters which turned out to be interconnected for the auditory system, since changing one had an effect on the others:

  • the soundstage width,
  • the attenuation of the center channel (both parameters are on the Redline Monitor), and
  • relative levels between the direct sound bus and the reverb bus (D/R ratio).

While tweaking them, I used stereo soundstage test tracks from Chesky Records demo CDs to ensure that sounds panned to the left and right positions sound in headphones as if they are indeed coming from the corresponding speaker, and that "extreme" left and right—beyond the speakers—are reproduced convincingly. I also used music tracks with a strong, energetic "in your face" mix (the album "Cut" by the industrial "supergroup" C-Tec) to ensure that I could push the vocals further away from my face.

I tried to avoid attenuating the reverb too much compared to the direct sound, as this dramatically decreases the perceived distance to the source. However, making the reverb too strong broke the perception of "extreme" left and right source positions, and so on. So finding the sweet spot for the combination of simulation parameters turned out to be a challenging task, and it actually gave me some intuitive understanding of how real speakers interact with a real room.

Aligning Amplitude Response

Basically, what I achieved through the previous stages is a virtual speaker setup in a virtual room with reverb similar to the one in my real room. Now I had to align the frequency response of that setup—as I hear it via the headphones—with the frequency response of my real speakers—as their sound reaches my ears. This process is often referred to as "headphone equalization." Traditionally it's done using a head and torso simulator, but I don't have one, so I used in-ear microphones on my own head—that's even better, because done this way the tuning becomes personal.

I used my Sennheiser Ambeo Headset for this task. I captured the amplitude response of the speakers in Smaart V8 with the Ambeo sitting in my ears. Then I captured the amplitude response of the EL-8s—also via the Ambeo—and it turned out to be quite close to the speakers—no surprise that I like the sound of the EL-8s so much. I must note that the positioning of centered banded noise was still wrong in the EL-8 headphones. So even if I'd chosen to stick with them, I would still have to do some personal tuning—more about this later.

Nevertheless, what I wanted was to tune my SRH-1540s. I started measuring them, and they turned out to be way off the speaker sound "baseline": too much bass and too much treble—the V-shaped tuning in action. So I started equalizing them "in real time"—by adjusting the equalizer. I used a linear-phase equalizer (LP10 by DDMF) to avoid altering the inter-aural time difference (ITD). This is because sharp EQ curves implemented with minimum-phase filters can significantly affect the phase and thus change the ITD, since the tuning for the left and right ears is not symmetric.
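
To illustrate why this matters, here is a sketch that designs a fairly sharp minimum-phase peaking biquad (the RBJ cookbook formula) and looks at its group delay. Values in the hundreds of microseconds are on the same order as natural interaural time differences, so applying such a filter to one ear only would indeed shift the ITD; the filter parameters below are arbitrary, not my actual EQ settings:

```python
import numpy as np
from scipy.signal import group_delay

def rbj_peaking(f0, gain_db, q, fs):
    """Minimum-phase peaking EQ biquad from the RBJ audio EQ cookbook."""
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

fs = 48000
b, a = rbj_peaking(f0=2000, gain_db=6.0, q=4.0, fs=fs)
w, gd = group_delay((b, a), w=4096, fs=fs)          # gd is in samples
print("max group delay: %.0f microseconds" % (gd.max() / fs * 1e6))
```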

After setting the amplitude response, I removed the Ambeo from my ears—what a relief!—and performed the final tuning touches to make sure that all frequency bands are positioned consistently within the auditory image. This is extremely important in order to avoid smearing the auditory images of individual instruments.

For this step of tuning I used test signals generated by the DGSonicFocus app by Dr. Griesinger. The app produces bands of noise centered between the channels. It can produce either correlated or decorrelated noise—I was using the latter option. When listening over correctly tuned speakers, these test signals create a phantom center image. Thanks to my initial amplitude correction of the headphone output, some of the bands were already placed correctly in the auditory image, but some still were not, mostly in the high-frequency range, because it's hard to tune the high-frequency region correctly from measurements alone—they tend to be too volatile. So I used my ears instead, and by applying peaking EQs in the same linear-phase equalizer I managed to "move" all the bands to the center.
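
I don't know how DGSonicFocus synthesizes its signals internally, but generating decorrelated band-limited noise for this kind of check is straightforward. A sketch of one band (play left to the left ear and right to the right ear, and listen for whether the band stays centered):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_noise(fs, seconds, f_lo, f_hi, seed):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(fs * seconds))
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, noise)
    return 0.5 * band / np.abs(band).max()

fs = 48000
left = band_noise(fs, 5, 800, 1120, seed=1)    # different seeds -> decorrelated noise
right = band_noise(fs, 5, 800, 1120, seed=2)   # same frequency band in both ears
```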

Below are the resulting EQ curves for SRH-1540. Note just how asymmetric they have to be in order to create a convincing auditory image for me over headphones:

I would compare this tuning process to making an individual pair of prescription glasses. Hopefully, with advances in consumer audio, it will become much easier some day.

Time-Domain Tuning (optional)

Since I really enjoy what the DSP filters produced by Acourate do for my speakers, I asked myself whether it's worth trying to apply Acourate to the headphone chain as well. After all, we are simulating speakers in a room, so why not try applying a room correction package to this simulation?

I did not plan on doing acoustic measurements at the ear entrance, as my equipment simply lacks the required precision. Instead, I decided to do the measurements at the analog electrical boundary by tapping into the headphone wires using my T-Cable. I temporarily bypassed the equalizer, since it's linear phase and its settings are asymmetric. From the measurements I found that the left and right outputs are almost identical, as I expected them to be in a proper audio chain. So, both the digital and the analog electrical chains are already almost perfect—is there really any room for improvement?

I ran Acourate's correction macros on these measurements, and it still managed to do something to the shape of the impulse response. Below is the difference: I think Acourate made it look more like a minimum-phase response—notice the deeper "sagging" of the amplitude after the initial peak:

Did this correction change anything? Not much in general; however, percussion instruments started sounding a bit different, I would say more towards the "natural" side. I loaded these corrections into a convolver plugin—adding it increased the latency, but not significantly, since I already had the linear-phase EQ plugin in the chain. Now I have a feeling that I'm really done with the setup.

Putting it all Together

For completeness, here is the full processing chain I use for headphone tuning. I run it in Ardour together with the DSP filters for the speaker tuning:

Note that I marked how the sections of the chain conceptually relate to simulated speaker reproduction. As I noted previously, instead of multiple plugins for the "Room" part I could potentially use just one good reverb plugin, but I haven't yet found an affordable one which would fit my needs.

Despite using lots of plugins, the chain is not heavy on computation, and Ardour takes no more than 15% of the CPU on my 2015 Mac mini (as measured by Activity Monitor), leaving the fan silent (and recall that Ardour also runs the speaker correction convolver).

Conclusions

Compared to setting up speakers, which was mostly done "by the book," setting up headphones required more experimenting and personal tweaking, but I think it was worth it. It would be interesting to do a similar setup for IEMs, although in that case doing the measurements needed for aligning with the speaker response will certainly be challenging.

Around the time I started doing these experiments, Apple announced support for Atmos and binaural rendering over headphones in their Music app. I tried listening to some Atmos-remastered albums over headphones on an iPhone. The impression was close to what I have achieved for stereo recordings with my headphone setup—the same feeling of natural instrument placement, a generally wider soundstage, and so on—definitely superior to regular stereo playback over headphones. I was impressed that Apple and Dolby have achieved this effect with non-personalized rendering! On the other hand, expecting every album to be remastered in Atmos is unrealistic, so it's good that I'm now able to listen to the original stereo versions on headphones with the same feeling of "presence" that Apple provides in Atmos remasters.