Sunday, August 22, 2021

Automatic Gain Control

This post is based on Chapter 11 of the awesome book "Human and Machine Hearing" by Richard Lyon. In this chapter Dr. Lyon goes deep into the mathematical analysis of the properties of the Automatic Gain Control circuit. I took a more "practical" route instead and did some experiments with a model I built in MATLAB based on the theory from the book.

What Is Automatic Gain Control?

The family of Automatic Gain Control (AGC) circuits originates from radio receivers, where such a circuit is needed to reduce the amplitude of the voltage received from the antenna when the signal is too strong. In the old days the circuit used to be called "Automatic Volume Control" (A.V.C.), as we can see in this book on radio electronics design ("The Radiotron Designer's Handbook"):

However, the earliest AGC circuits can be found in the human sensory system—they help to achieve the high dynamic range of our hearing and vision. In the hearing system the cochlea provides the AGC function.

The goal of AGC is to maintain a stable output signal level despite variations in the input signal level. The stability of the output is achieved by creating a feedback loop which "looks" at the level of the output signal, and makes necessary adjustments to the input gain of the signal at the entrance to the circuit. This is how this can be represented schematically:

Note that the "level" is a somewhat abstract property of the signal. What we need to understand is that "level" can be tied, based on our choice, either to the amplitude of the signal or to its power, and expressed either on a linear scale or on a logarithmic scale. There is also a somewhat arbitrary distinction between the "level" and the "fine temporal structure" of the signal. If we consider a speech signal, for example, it obviously has a high dynamic range due to the fast attacks of consonant sounds. However, in AGC we don't examine the signal at such a "microscopic" level. There is always a time constant which defines the speed of level variations that we want to preserve in the output signal.

We want the gain changes to be bound to the "slow" structure of the output signal, otherwise we will introduce distortions. The AGC Loop Filter is used to express the distinction between "fast" and "slow" by smoothing the measured level. The simplest way of smoothing is applying a low-pass filter (LPF). Although it's common to define the LPF in terms of its cut-off frequency, another possible way is to use the "time constant", which determines the former.
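To make the relation between the time constant and the cut-off frequency concrete, here is a small Python sketch (the post's model is in MATLAB; the function names here are mine):

```python
import math

def lpf_alpha(tau, fs):
    """Per-sample smoothing coefficient of a one-pole LPF with time
    constant tau (seconds) at sampling rate fs (Hz)."""
    return 1.0 - math.exp(-1.0 / (fs * tau))

def cutoff_from_tau(tau):
    """The -3 dB cut-off frequency implied by the time constant tau."""
    return 1.0 / (2.0 * math.pi * tau)

# A 10 ms time constant corresponds to a cut-off of about 16 Hz:
print(round(cutoff_from_tau(0.010), 1))  # -> 15.9
```

These are the numbers that will come up again later in the post when the AGC loop filter is described.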

AGC vs. Other Systems

There are two classes of systems that are similar to AGC in their function. The first class is comprised of systems controlled by feedback—these systems are studied extensively by the engineering discipline called "Control Theory". Schematically, a feedback-controlled system looks like this:

The big difference with AGC is that there is a "desired state" of the controlled system—this is what the control system is driving it towards. For example, in an HVAC system the reference is the temperature set by the user. In contrast, nobody sets the reference for an AGC circuit; instead, for any input signal that doesn't change for some time, the AGC circuit settles down on some corresponding output level which is referred to as "equilibrium."

Another class of systems that are similar to AGC are Dynamic Range Compressors, or just "Compressors", frequently used when recording from a microphone or an instrument in order to achieve a more "energetic" or "punchy" sound. The main difference of a compressor from an AGC is that compressors normally use the input signal for controlling their output—this approach is called "feed-forward". The design goal of a compressor is also different from the design goal of an AGC—since it is used to "energize" the sound, adding harmonic distortions is welcome. In AGC, on the other hand, the design goal is to keep the level of distortions to a minimum.

AGC Analysis Framework

The schematic representation of the AGC we have shown initially isn't very convenient for analysis since the "controlled system" is a complete black box. Thus, the book proposes to split the controlled system into two parts: the non-linear part, which applies compression to the input signal, and the linear part, which simply amplifies the compressed signal in order to bring it to the desired level. Note that since the compression factor is defined to be in the range [0..1], the compression always reduces the level of the input signal, sometimes considerably. Below is the scheme of the AGC circuit that we will use for analysis and modelling:

We label the outputs from the AGC blocks as follows:

  • a is the measured signal level;
  • b is the filtered level;
  • g is the compression factor.

In the book, Dr. Lyon uses the following function for calculating g from b:

g(b) = (1 - b/K)^K,  K ≠ 0

Below are graphs of this function for different values of K:

As the book says, the typical values used for K are +4 or -4.
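For a numeric feel of the curve, here is the same function in Python (a sketch; the post's actual model is in MATLAB):

```python
def g(b, K):
    """Compression factor g(b) = (1 - b/K)**K, K != 0 (after Lyon)."""
    assert K != 0
    return (1.0 - b / K) ** K

# For K = -4 the factor equals 1 for silence and shrinks as the
# filtered level b grows:
for b in (0.0, 0.5, 1.0, 2.0):
    print(b, round(g(b, -4), 4))
```

For K = -4 this evaluates to (1 + b/4)^-4, a smoothly decreasing function of the level b.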

AGC in Action

In order to provide a sense of the AGC circuit in action, I will show how the outputs from the blocks of the AGC change when it acts on an amplitude-modulated sinusoid. I used the same parameters for the input signal and the AGC circuit as in the book and obtained a result which looks very similar to Figures 11.9 and 11.10 in the book.

The input signal is a sinusoid of 1250 Hz considered over a period of 1000 samples at a sampling rate of 20 kHz (that's 50 ms). Below are the input and the output signals, shown on the same graph:

And this is how the outputs from the AGC loop blocks a, b, and g change:

The level analyzer is a half-wave rectifier, thus we see only the positive samples of the output signal as the a variable. This output is smoothed by an LPF with a cut-off frequency of about 16 Hz (a 10 ms time constant—that's one fifth of the modulation period), and the result is the b variable. Finally, the gain factor g is calculated using the compression curve with K = -4. The value of g never exceeds 1, thus to be able to see it on the graph together with a and b we have to "magnify" it. The book (and my model) uses a gain of 10 for the linear part of the AGC (designated as H) to bring the level of the output signal after compression on par with the level of the input signal.

My implementation of the AGC loop in MATLAB is rather straightforward. I decided to take advantage of "function handles", which are very similar to lambdas in other programming languages. The only tricky part is setting the initial parameters of the AGC loop. Due to the use of feedback, on the very first iteration the output isn't available yet. What I found after some experimentation is that we can start with zeroes for some of the loop variables and derive the values of the other variables from them. Then we need to "prime" the AGC loop by running it on a constant level input. After a number of iterations, the loop enters the equilibrium state. This is what the loop looks like:

function out = AGC(in, H, detector, lpf, gain)
    global y_col a_col b_col g_col;
    out = zeros(length(in), 4);
    out(1, b_col) = 0;
    out(1, g_col) = gain(out(1, b_col));
    out(1, y_col) = H * out(1, g_col) * in(1);
    out(1, a_col) = detector(out(1, y_col));
    for t = 2:length(in)
        y = H * out(t - 1, g_col) * in(t);
        a = detector(y);
        b = lpf(out(t - 1, b_col), a);
        g = gain(b);
        out(t, y_col) = y;
        out(t, a_col) = a;
        out(t, b_col) = b;
        out(t, g_col) = g;
    end
end

And these are the functions for the half-wave rectifier detector and the LPF:

hwr = @(x) (x + abs(x)) / 2;

% alpha is the smoothing coefficient of the LPF (derived from its time constant)
lpf = @(y_n_1, x_n) y_n_1 + alpha * (x_n - y_n_1);

In order to be able to visualize the inner workings of the loop, the states of the intermediate variables are included into the output as columns.
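For readers without MATLAB, the same loop can be sketched in Python (variable names follow the post; the parameter values here are mine). Priming the loop on a constant input shows it settling into equilibrium:

```python
def agc(x, H=10.0, K=-4.0, alpha=0.005):
    """One-stage AGC loop: half-wave rectifier level detector,
    one-pole LPF, and compression curve g(b) = (1 - b/K)**K."""
    detector = lambda v: (v + abs(v)) / 2.0   # half-wave rectifier
    gain = lambda b: (1.0 - b / K) ** K       # compression factor
    y_out, b, g = [], 0.0, gain(0.0)          # start from a zero level
    for s in x:
        y = H * g * s                         # apply the previous gain
        a = detector(y)                       # measured level
        b = b + alpha * (a - b)               # smoothed level
        g = gain(b)                           # new compression factor
        y_out.append(y)
    return y_out

# "Prime" the loop on a constant input: it settles to a stable output level.
out = agc([0.5] * 5000)
print(round(out[-1], 2))
```

After a few time constants of the LPF, consecutive output samples stop changing, which is exactly the equilibrium behavior described above.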

Since the resulting compression is relatively strong (the gain factor reaches at most the value of 0.1), we use the compensating gain H = 10. We can also see that the gain factor g depends on the input level. This leads to non-linearities in the output. Using MATLAB's thd function from the Signal Processing Toolbox we can actually measure them pretty easily on our sinusoid. Just as a reference, this is what the thd function measures and plots for the input sinusoid (only the 2nd and 3rd harmonics are shown):

And this is what it shows for the output signal from our simulation:

As we can see, there is a non-negligible 2nd harmonic being added due to non-linearity of the AGC loop.

Experiments with the AGC Loop

What happens if we change the level detector from a half-wave rectifier to a square law detector? In my model we simply need to replace the detector function with the following:

sqr_law = @(x) x .* x;

Below are the resulting graphs:

What changes dramatically here is the amount of compression. Since the square law "magnifies" differences between signal levels, high-level signals receive a significant compression. As a result, I had to increase the compensation gain H by 40 dB (that's a factor of 100).
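The effect of the detector choice on the measured level can be illustrated in a couple of lines of Python (a toy comparison, not the full model):

```python
hwr = lambda x: (x + abs(x)) / 2.0   # half-wave rectifier
sqr = lambda x: x * x                # square-law detector

# A 10x difference in amplitude stays 10x after the rectifier,
# but becomes a 100x difference in detected level after squaring:
lo, hi = 0.1, 1.0
print(round(hwr(hi) / hwr(lo)))  # -> 10
print(round(sqr(hi) / sqr(lo)))  # -> 100
```

The feedback loop therefore sees much larger level swings and compresses loud passages much harder, which is why the compensating gain H has to grow so much.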

The behavior of the gain factor g still depends on the level of the input signal, so the circuit still exhibits non-linear behavior. By looking at the THD graph we can see that in this case the THD is lower than that of the half-wave rectifier AGC loop, and the dominant harmonic has changed to the 3rd:

Another modification we can try is to change the time constant of the LPF. If we make the filter much slower, the behavior of the gain factor g becomes much more linear; however, the output signal is even less stable than the input signal:

On the other hand, if we make the AGC loop much "faster" by shifting the LPF corner frequency upwards, it suppresses the changes in the input signal very well, but at the cost of highly non-linear behavior of the gain factor g:

Can we achieve the higher linearity of the square law detector while still using the half-wave rectifier?

Multi-Stage AGC Loop

The solution dates back to an invention by Harold Wheeler, who used multiple vacuum tube gain stages for the radio antenna input. By using multiple stages, the compression can be increased gradually, and a stage with lower compression brings lower distortion. Let's recall our formula for the compression gain (making K an explicit parameter this time):

g(b, K) = (1 - b/K)^K,  K ≠ 0

We can see that by multiplying several functions that use a smaller value of K we can achieve an equivalent of a single function with a bigger (in absolute value) K:

(g(b, -1))^4 ≈ g(b, -4)

Actually, if we change the definition of g so that the divisor of b is fixed independently of the exponent K, we obtain exactly the same function.
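This remark is easy to verify numerically. Below is a quick Python check (my notation) of the variant where the divisor of b is fixed at -4 while the exponent is split across four stages:

```python
def g_div(b, div, K):
    """Compression curve with the divisor of b decoupled from the
    exponent: g(b) = (1 - b/div)**K."""
    return (1.0 - b / div) ** K

# Four first-order stages multiply into exactly the single
# fourth-order curve when both use the same divisor:
for b in (0.1, 0.5, 1.0, 2.0):
    single = g_div(b, -4, -4)
    cascade = g_div(b, -4, -1) ** 4
    print(round(single, 6), abs(single - cascade) < 1e-12)
```

With the modified definition the product of the four mild curves matches the single strong curve to within floating-point rounding, for every level b.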

We can also vary the time constants of each corresponding LPF filter. This is how this approach looks schematically:

Each "slower" outer AGC loop reduces the dynamic range of the output signal, reducing the amount of the compression that needs to be applied for abrupt changes by inner "faster" loops, and thus keeping the distortion low.

I used 3 stages with the following LPF filters:

This is how the input / output and the loop variables look like:

We can still maintain a low compensating gain H in this case, and the behavior of the gain factor g is now more linear, as we can see on the THD graph:

And here is the comparison of the outputs between the initial single stage approach with the square law and multi-stage approaches:

The multi-stage AGC yields a slightly less compressed output; however, it has fewer "spikes" on level changes.


It was interesting to explore the automatic gain control circuit. I've uploaded the MATLAB "Live" script here. I hope I can reimplement my MATLAB code in Faust to use it as a filter for real-time audio. AGC is very useful for virtual conference audio, since not all VC platforms offer gain control for participants, and when attending big conferences I often need to adjust the volume.

Wednesday, July 21, 2021

Desktop Stereo Setup

A couple of months ago we moved into our new house, and I had to spend some time setting up the sound in my new office room. This time I decided to focus on the listening experience while working—that is, while I'm at my standing desk. I'm lucky enough to be able to avoid using headphones most of the time. Thus the task was to create a good near-field stereo setup.

In my understanding, good stereo reproduction means achieving a clean separation between virtual sources and that feeling of being "enveloped" in the music. I want to perceive a wide soundstage which expands beyond the speakers. And I want to be able to almost feel the breathing of the vocalist.

What's great about a personal desktop setup is that there is clearly only one listening position, so it's much easier to optimize the sound field. What's harder, though, is that the speakers are at a very close distance, so it's not easy to make them "disappear."

The Equipment

I decided to start with the equipment that I already have and see how far I could progress with it. This is my hardware:

  • four 2-inch sound absorbers by GIK Acoustics (freestanding gobos);
  • a pair of KRK Rokit 5-inch near field monitors—the old 2nd generation released in 2008; I was doing some measurements of them a while ago;
  • one Rythmik F12G 12-inch subwoofer;
  • my faithful MOTU UltraLite AVB audio interface;
  • a Mac Mini late 2014 model.

I couldn't use my LXminis on a desktop because they are too tall, so I had to stick with KRKs.

The Philosophy

These days it's rather easy to take some good DSP room correction product and leave all the hard work of tuning the audio system to it, expecting "magic." However, I decided to stick with a somewhat different approach and make sure that the DSP correction is only "a cherry on top of the cake," meaning that before I apply it, I have already achieved the best possible result by other means.

I decided to proceed in the following steps:

  • make sure the acoustics of the room and the geometry of the setup are done right;
  • align the speakers as closely as possible using their built-in controls—both the KRK fronts and the sub offer some;
  • apply basic DSP treatments with PEQ filters of MOTU AVB;
  • finish with a speaker / room FIR filter correction done using Acourate.

This approach has the advantage that at every stage we have the best system possible so far, and we try to improve it at the next stage. Also, the first 3 stages do not require the system to be connected to a computer. I realized that this is a bonus after the Intel NUC machine which I used for running DSP died one day without warning.

The Space

My office room is rather small: approximately 3.39 x 3.18 x [2.5–3.6] meters, and is highly asymmetric—which is actually a problem. Below is its plan:

There are not many options for placing the desk. I decided to stay away from the windows and put my setup into the niche at the opposite end. There I mounted the sound absorbers on the walls surrounding the desk. This is what the setup looks like:

There are still some asymmetries (see the marks on the photo):

  • the ceiling is slanted;
  • the subwoofer is used instead of a speaker stand on the left;
  • there is a wall on the left, but an open space on the right.

The space behind my back (as I'm working at the desk) is completely untreated. More or less, the room uses the same concept as the LEDE design for audio control rooms. However, the amount of acoustic treatment at the front is certainly less than what they use in professional environments—it's a room in a home, after all.

By the way, placing the subwoofer on the desk was intentional. Following the principle of starting with physics, I decided to put it as close as possible to one of the front speakers in order to create a full-range speaker with almost no need to time-align them using delays. Although this makes the setup even more asymmetrical, the fact that the subwoofer only needs to cover the frequency range below 50 Hz makes it a minor problem.

Physical Alignment

So, placing the subwoofer was one thing. As a happy coincidence, placing the left monitor on it to create a full-range speaker also put the monitor at the correct height relative to my ears. I have often heard the advice that the tweeter of a multi-way speaker should be at ear height; however, I don't think it's completely correct. As Bob McCarthy points out in his excellent book, in a system where the high- and low-frequency speakers are time-aligned, placing the midpoint between them at ear level better preserves the alignment in the horizontal plane:

Needless to say, both the left and right speakers are set at the same height.
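A quick back-of-the-envelope check of McCarthy's point (the distances below are hypothetical, chosen just to illustrate the geometry):

```python
import math

# Listener 1 m away; tweeter and woofer 8 cm apart vertically.
L, sep = 1.0, 0.08

# Ears level with the tweeter: the woofer's path is longer.
d_tweeter = math.hypot(L, 0.0)
d_woofer = math.hypot(L, sep)
print(round((d_woofer - d_tweeter) * 1000, 1))  # extra path in mm -> 3.2

# Ears level with the midpoint: both paths are equal again.
d_tweeter = math.hypot(L, sep / 2)
d_woofer = math.hypot(L, sep / 2)
print(d_woofer == d_tweeter)  # -> True
```

A few millimeters of extra path is on the order of 10 microseconds of delay, enough to disturb the time alignment of the crossover at near-field distances; placing the midpoint at ear level removes the difference entirely.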

In the horizontal plane, the speakers and my head form the recommended equilateral triangle. Initially I tried "aiming" the speakers at the point immediately behind my back, thus forming the standard 60 degree angle. However, this created a very narrow soundstage. After some experiments I arrived at an arrangement with the speakers aimed at a point on the window far behind my back:

For precise aiming, I used laser distance meters placed on both speakers. The beams crossed very close to each other, also confirming that the speakers are aligned both vertically and horizontally:

The process of physical alignment was finished by placing the measurement microphone equidistant from the speakers, according to length measurements. Now the time had come for acoustic alignment by means of electronics.

Basic Acoustic Alignment and Measurements

This step included making sure that the front speakers are aligned with each other as much as possible, and that the left speaker is aligned and synchronized with the sub. At first I was only using the speakers' built-in controls, watching the result in real time using the dual-channel (transfer function) measurement in Smaart V8.

The KRK Rokits only offer two knobs: the overall volume and the volume of the high-frequency amplifier driving the tweeter. The controls of the Rythmik sub's amplifier are more sophisticated, allowing one to adjust the bandwidth of the sub and its delay, enable one PEQ filter, and much more.

Besides looking at real-time measurements in Smaart, I also made a traditional low-speed log sweep in Acourate, which provided a bit more insight into the problems I had with my setup.

First, by looking at the impulse and the step response of the KRK Rokits, it became apparent that the woofer's polarity is inverted:

I'm not sure why this was done by the speaker's designers. Since the woofer runs in inverted polarity compared to the bass reflex port, their counteraction creates a very steep roll-off at low frequencies below the speaker's operating range. And since the output from the woofer dominates in terms of delivered acoustic energy, it was also counteracting the subwoofer. The solution I ended up with was inverting the polarity of both front speakers using XLR phase inverters. This phase inversion isn't audible by itself, but it indeed helped to integrate the left speaker with the sub more easily.

The second finding was that the untreated part of the wall on the left, and the ceiling do create visible spikes on the ETC graph during the first 5 ms, which means they affect the direct sound of the speakers:

I confirmed where the spikes were coming from by temporarily covering the wall and the ceiling with blankets. But I couldn't do much about them for now.

Looking at the amplitude part of the frequency response, I could see that the fronts have a sharp roll-off below 50 Hz, and that adding and aligning the sub covers the missing low end down to 15 Hz:

We can also see that the direct sound from the tweeters is mostly aligned, but the midrange is spiky, both due to reflections and due to room modes, which can reach quite high frequencies in such a small room.

Correction with PEQs and Setting the Target Curve

After squeezing as much as possible out of the built-in controls of the speakers and fixing the polarity problem, I corrected the most egregious differences between the left and the right speakers using the peaking equalizers built into MOTU's DSP.

By the way, so far my target had been a flat frequency response. The reason is that with such wiggly responses it's easier to see their alignment while they are jumping up and down around a horizontal line. But this isn't the desired frequency response for listening, so next I started playing some music and adjusted the "target curve" by ear, also using the built-in PEQs of the MOTU.

Ideally, I would like to use a "tilting" filter for high frequencies; however, MOTU doesn't offer one. I managed to simulate the tilt using a combination of a shelving filter plus 2 PEQs to "pull" it up so that it starts looking like a slope:

I also reduced the bass a bit because these KRK monitors use a bass reflex port which produces a somewhat artificial "booming" bass that tends to mask all other frequencies.

Fine Correction Using Acourate DSP

Having the speakers mostly aligned and adjusted to the desired target curve, there is still room for some time-domain DSP correction. Let's look at the IR and the step response of the front speaker again:

As we can see, the drivers of the speaker are not time-aligned: there is the first small spike from the tweeter, followed by a bigger, polarity-inverted spike from the woofer, and then another positive spike from the bass reflex port. The resulting step response is far from "tight." This is something that we can fix using a zero-phase FIR filter produced by Acourate.

One thing that I don't like about the default filters produced by Acourate is the associated time delay. By default, Acourate produces a filter of 65536 samples with the peak in the middle. Applying such a filter adds a delay of 32768 samples—that's about 680 ms—and this doesn't account for processing delays; in practice the result is close to 1 second. The author of Acourate—Dr. Brüggemann—is well aware of this problem, so he added an option to produce much shorter filters of just 8192 samples, which is enough to achieve most of the corrective effect while keeping the latency relatively low.

Another technical issue was that since I no longer had a Windows PC, I had to apply these filters on a Mac. AudioVero's AcourateConvolver doesn't support the Mac, and I didn't want to try running it in a virtual machine either. Instead I ended up using the free Audio Unit convolver plugin by Home Audio Fidelity. The convolver works just fine with the WAV filters exported from Acourate. I only had to transform the mono WAV filter files into stereo, filling the right channel with zeroes in the filter for the left speaker, and the left channel in the filter for the right speaker:

This is because the HAF filters also support crosstalk cancellation, which we don't use here. The latency resulting from the FIR filters and the plugin is on the order of 250 ms (when using 48 kHz sampling rate), and it actually works fine with videos and video conferences.
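That mono-to-stereo transformation can be sketched with the Python standard library alone (assuming 16-bit WAV filter files; the function name is mine):

```python
import struct
import wave

def mono_filter_to_stereo(src, dst, to_left=True):
    """Turn a mono 16-bit WAV FIR filter into a stereo one, filling the
    unused channel with zeros (e.g. the left-speaker filter gets zeros
    in its right channel)."""
    with wave.open(src, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        fs, n = w.getframerate(), w.getnframes()
        samples = struct.unpack("<%dh" % n, w.readframes(n))
    frames = bytearray()
    for s in samples:
        pair = (s, 0) if to_left else (0, s)
        frames += struct.pack("<2h", *pair)
    with wave.open(dst, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(fs)
        w.writeframes(bytes(frames))
```

Acourate actually exports filters in higher-resolution formats as well; this sketch only covers the plain 16-bit case to show the idea of the zero-filled channel.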

So, what are the results? I took measurements of the corrected system once again. If we look in the time domain, the changes are all for the good:

The time difference between the drivers is now gone, and they produce a nice triangular step response, very close to the response of an "ideal" low-pass filter:

Changes in the frequency domain are less dramatic, that's because we have already done most of the heavy lifting at the previous stages:

The phase of the speakers actually got more linear starting from 600 Hz:

Note that I only corrected the front speakers. The sub operates at frequencies which are not easy to correct using a FIR filter of reduced length.

Achieved Left/Right Symmetry

If you recall, my setup isn't completely symmetrical physically, nor were these entry-level studio monitors accurately matched by KRK. However, as a result of this laborious setup process, the amplitude difference between the left and right front speakers is quite low, except for the low frequency range:

Acourate also calculates the Inter-Aural Cross-Correlation (IACC) coefficient between the impulse responses of the left and the right speaker. It does that for several time windows of varying duration: 10 ms, 20 ms, and 80 ms. The first two results mostly depend on the direct sound from the speakers and the early reflections; the last one depends on the reverberation in the room. Since the filters created by Acourate bring both speakers to the same target curve, at least the first two IACC figures are expected to increase with the correction. In my case the improvement was not very substantial:

IR Time    Before   After    Delta
0–10 ms    91.2%    91.7%    +0.5%
0–20 ms    80.3%    80.4%    +0.1%
0–80 ms    69.0%    69.8%    +0.8%
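For reference, IACC over a window is essentially the maximum of the normalized cross-correlation of the two impulse responses. Here is a naive Python sketch (my reading of the usual definition of the metric; Acourate's exact computation may differ):

```python
import math

def iacc(left, right, fs, window_ms, max_lag_ms=1.0):
    """Maximum of the normalized cross-correlation of two impulse
    responses over the first window_ms, for lags up to +/- max_lag_ms."""
    n = min(int(fs * window_ms / 1000.0), len(left), len(right))
    l, r = left[:n], right[:n]
    max_lag = int(fs * max_lag_ms / 1000.0)
    norm = math.sqrt(sum(x * x for x in l) * sum(x * x for x in r))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = sum(l[i] * r[i + lag] for i in range(n) if 0 <= i + lag < n)
        best = max(best, abs(acc) / norm)
    return best

# Two identical impulse responses are perfectly correlated:
click = [1.0, 0.5, 0.25] + [0.0] * 45
print(round(iacc(click, click, fs=48000, window_ms=1.0), 3))  # -> 1.0
```

Any asymmetry between the left and right responses—different reflections, mismatched drivers—pulls the figure below 100%.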

Right, that's less than a 1% improvement. However, numbers don't always tell the whole story. The time-domain correction done by Acourate definitely improved something in the sound—it has become more "transparent", reminding me of my LXminis. It has also become easier to subconsciously separate the sound sources in the recording, making the overall reproduction more natural.

The Costs of Corrections

To recap, I've aligned my desktop setup in 4 stages:

  • geometrical symmetry and acoustic treatment;
  • knobs on the speakers;
  • IIR filters in the MOTU AVB;
  • FIR filters by Acourate, applied in software DSP.

The first two stages add zero latency. Processing done by the sound card adds about 3–6 ms. After these first 3 steps the sound of the system was already good and enjoyable. The last correction added a substantial 250 ms delay; however, it did improve the "fineness" of the system. The law of diminishing returns is definitely at work here.

All steps except the first one could be handled by a sophisticated room / speaker correction system (like Acourate) in one go. Was it worth doing them one by one? For me, it was. First, doing the full correction with Acourate would require using the default long filters, bringing the latency to uncomfortably high figures. Second, by learning the problems of the system at each stage, we can think of how they can be fixed at the root—usually that's the most efficient solution.

What's Next?

So what are the remaining problems, and how can they be fixed? This is what I'm considering doing:

  1. Putting absorbers on the untreated wall and the ceiling to remove the remaining early reflections.
  2. Getting better front speakers. "Better" here means more point-source like. It could be either LXminis adapted for desktop use, or some coaxial speakers.
  3. Adding a second subwoofer to make the right speaker full-range, and thus achieving symmetry with the left one.
  4. Reducing the latency associated with the FIR correction by employing some hardware DSP.

Another interesting option for making this setup more "immersive" is to try to reduce cross-talk between the speakers. This again will require some serious DSP processing.

Saturday, March 27, 2021

Headphone Virtualization For Music Reproduction

This post is based on a presentation I gave to my colleagues. Here I try to explain why making commercial stereo recordings sound as good and natural on headphones as they can sound on well-tuned stereo speakers is not an easy task. This topic has much in common with the popular topics of "immersive" or "3D" sound on headphones, because essentially we want to reproduce a recording in a way that makes the listeners believe that they are actually out there with the performers and forget that they even have headphones on. However, this post deals specifically with the reproduction of commercial stereo recordings and does not touch the topics of AR/VR.

Reproduction on Speakers

First we need to provide some context about speaker playback. Let's start with the simplest case of a mono speaker located in a non-anechoic (that is, regular) room. Imagine you are listening to some sound, for example pink noise, played over this speaker. Although it's a very simple case, it demonstrates several important things. We understand that the physical sound (acoustic waves) needs to be received by the sensory system of our body—mainly the ears—and processed by our brain, and as a result we have a perception (or acoustic image) of the physical source formed in our mind. We also see the speaker, and the perceived sound source will be anchored in our mind, or localized, to the visual image of the speaker.

This audio perception has a lot of associated attributes in our mind. Some of them originate in the sound that is being reproduced by the speaker, like its loudness and timbre. Some of them are specific to the relative position of the speaker and the listener, and the properties of the room. Humans use both ears (binaural listening), and our brain manages to recognize the source in both audio inputs and derive the difference in sound levels and times of arrival (known as the Interaural Level and Time Differences, ILD and ITD) to roughly locate it in the horizontal plane of our mind's eye.
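As an illustration of the ITD part, here is Woodworth's classic spherical-head approximation in Python (the head radius and speed of sound are typical assumed values, not measurements):

```python
import math

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth's spherical-head approximation of the Interaural Time
    Difference for a source at a given azimuth (0 = straight ahead):
    ITD = r/c * (theta + sin(theta))."""
    th = math.radians(azimuth_deg)
    return head_radius / c * (th + math.sin(th))

# A source 90 degrees to the side arrives about 0.66 ms earlier
# at the near ear:
print(round(itd_woodworth(90) * 1000, 2))  # -> 0.66
```

Sub-millisecond arrival-time differences like this, together with the level differences, are what the brain turns into a perceived direction.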

Moreover, in a non-anechoic room there will be reflections from the walls and other objects, and the brain will extract information from the ILD and ITD of the reflected sounds to help us estimate the distance to the sound source and even the size of the room.

Moving to reproduction using two speakers makes it possible to provide even more cues to the brain and to create imaginary sound sources that are positioned away from the actual sound sources—the speakers. However, with two speakers the acoustical picture becomes more complicated. Obviously, each ear receives sound from both speakers and from wall reflections. With a good stereo setup the listener can forget about the existence of the speakers and completely disentangle them from the sound they are producing.

Over the long history of the development of stereo recording and playback, audio engineers have learned how to use the stereo speaker arrangement to create phantom sources located anywhere in the horizontal plane between the speakers, and even outside them. As a matter of fact, most commercially available stereo recordings were produced for playback over speakers.

The use of multi-channel systems, especially with height channels, helps to push the envelope even further and produce phantom sources anywhere around the listener. Unlike a stereo setup, where the perception of phantom sources can be quite sensitive to the listener's location, multi-channel systems handle even multiple listeners with ease. Anyone who has had a chance to visit a modern movie theater has experienced the wonders of this technology.


However, even on stereo systems some advanced sound engineers manage to create phantom sources located above the speakers, to the side of the listener, or in close proximity to them. These effects are achieved by applying frequency filtering which imitates the physical filters of the ear pinnae and the head. Some examples of tracks that I personally like are "Edge of Life" by Recoil and "One Step Behind" by Hol Baumann.

This brings us to the topic of the HRTF (Head-Related Transfer Function). It is used a lot in the context of AR/VR; however, for our particular topic what we need to understand is that there exist two filters. The first is the physical filter which sits between a sound source and the eardrum: the combination of the torso, the head, and the outer ear. It transforms any external sound in a way that greatly depends on the location of the source relative to the ear.

The second filter exists in our auditory system. It is quite complex: it uses information arriving at both ears, visual cues, and our learned experience of living with the physical filter of our body. Its goal is to "undo" the effect of the body filter, restoring the original timbre of the sound source, and to use the information from the body filter for locating the source.

A simple and efficient demonstration of this filter at work, as pointed out by S. Linkwitz, is turning one's head from side to side while listening to music. Although the sound that reaches the eardrums changes dramatically, the perceived timbre remains stable, and the sound source simply changes its position in the auditory image. However, the filter of the auditory system doesn't restore the timbre completely. If you compare the sound of ocean waves as heard facing them, and then with your back turned, the latter will noticeably lack the high-frequency boost that the pinnae of our ears add.

It is important to note that due to the asymmetry of human bodies the physical filters for the left and right ears are different, and so are the auditory system filters that counteract them. This asymmetry plays an important role, along with interaural time and level differences and room reflections, in locating sound sources and placing them correctly in the auditory image. As C. Poldy notes in his tutorial on headphones, "the interaural differences are unique for each individual and could not be a characteristic of the sound source." This allows humans (and other creatures) to derive the direction of a sound without rotating their heads.

A very simplified model of the HRTF filters at work (after D. Griesinger) is as follows:

The "Adaptive AGC" block helps to undo alterations of the frequency response caused by environmental conditions. This is similar to the "auto white balance" function of the human visual system. It helps to recover the natural timbre of familiar sources that have been altered, for example, by closely placed reflective surfaces.

Reproduction on Headphones

Now we put headphones on: what happens? Because the headphone drivers are located close to the ears, or even in the ear canal, the natural physical filter is partially bypassed and partially altered due to changes in the ear's physics: for example, a blocked ear canal, or new resonances introduced by the ear cups around the ear. Left and right headphone speakers are usually tuned to be symmetric. The combination of these factors presents misleading cues to the auditory system, and it can no longer use localization mechanisms beyond those relying on simple interaural level differences. As a result, the auditory image "resets" to an "inside the head" sensation.

Another difference from stereo speaker playback is that with headphones the left and right channels of the recording do not "leak" to the contralateral ears. This is a remarkably good property of headphone playback, and it is used a lot for creating immersive experiences; however, it deviates from the reproduction setup that stereo recordings are created for. Some recording techniques and artificial effects used for creating a wide auditory scene on stereo recordings inevitably stop working when played over headphones.

There exist several known approaches for bringing headphone playback closer to speaker reproduction. I must note that some of them are specific to stereo music reproduction—they are not needed for binaural recordings and binaural renderings of multi-channel and object-based audio programs.


Crossfeed

This is the technique that I explored a lot in the past; see my old posts about the Redline Monitor Plugin and the Phonitor Mini. Crossfeed is based on adding slightly delayed copies of the sound from the opposite channel to the direct channel. It is based on a simple spherical head model.

Adding a delayed copy of a signal to itself leads to comb filtering. It also occurs naturally in speaker playback and is likely taken into account by the brain when approximating the distances to audio sources. My opinion is that comb filtering should be kept to a minimum to avoid altering the timbre of the sound. For music playback I would prefer the least amount of comb filtering, even if it results in less externalization over headphones.
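The principle can be sketched as follows. This is a simplified model, not the actual Redline Monitor or Phonitor implementation; the delay and gain values are hypothetical, and a proper spherical head model would also low-pass the crossfed copy to mimic head shadowing:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal crossfeed sketch: each output channel is the direct signal plus
// a delayed, attenuated copy of the opposite channel.
struct Crossfeed {
    std::size_t delaySamples;  // interaural delay, e.g. ~0.3 ms => 13 samples at 44.1 kHz
    float gain;                // attenuation of the crossfed copy, e.g. -6 dB => 0.5

    void process(const std::vector<float>& inL, const std::vector<float>& inR,
                 std::vector<float>& outL, std::vector<float>& outR) const {
        const std::size_t n = inL.size();
        outL.assign(n, 0.0f);
        outR.assign(n, 0.0f);
        for (std::size_t i = 0; i < n; ++i) {
            outL[i] = inL[i];
            outR[i] = inR[i];
            if (i >= delaySamples) {
                outL[i] += gain * inR[i - delaySamples];  // right leaks into left
                outR[i] += gain * inL[i - delaySamples];  // left leaks into right
            }
        }
    }
};
```

Note that any content common to both channels passes through the "direct plus delayed copy" path, which is exactly what produces the comb filtering discussed above.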

Multi-channel Rendering

Rendering of multi-channel audio over headphones can be based on the same principle as crossfeed, but with a more realistic head model, since it also needs to take into account the natural suppression of high frequencies caused by the pinnae. It is likely that a binaural renderer for multi-channel audio relies on more realistic HRTFs. For example, below are the HRTF filters used by my Marantz AV7704 when playing a 5.1 multi-channel program through the headphone output in "Virtual" mode:

An interesting observation is that the center channel is rendered using an identity transfer function, although normally a frontal sound source would be affected by the HRTF, too.

The graphs above do not reveal how the simulation of acoustic leakage between speakers affects the output signal. On the graphs below, the test signal is played simultaneously into the front left and front right channels. In the time domain we see a delayed signal from the opposite channel (the ETC is shown for clarity):

And in the frequency domain this unsurprisingly causes ripples to appear:
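The spacing of these ripples follows directly from the leakage delay: summing a signal with a copy of itself delayed by D samples and scaled by g gives a magnitude response with notches repeating every fs/D Hz. A small sketch of this relationship (the values of g and D in the usage note are illustrative, not measured from the AV7704):

```cpp
#include <cassert>
#include <cmath>

// Magnitude response of y[n] = x[n] + g * x[n - D] at frequency f (Hz),
// for sampling rate fs: |H(f)| = sqrt(1 + g^2 + 2*g*cos(2*pi*f*D/fs)).
// Peaks occur where the cosine is +1, notches where it is -1, so the
// ripple pattern repeats every fs/D Hz.
double combMagnitude(double f, double g, unsigned D, double fs) {
    const double kPi = 3.14159265358979323846;
    const double phase = 2.0 * kPi * f * static_cast<double>(D) / fs;
    return std::sqrt(1.0 + g * g + 2.0 * g * std::cos(phase));
}
```

For example, with D = 13 samples at fs = 44100 Hz (about 0.29 ms of delay), the first notch falls near fs / (2 * D), roughly 1.7 kHz, and the ripples repeat about every 3.4 kHz.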

The headphone virtualizer in AV7704 doesn't go beyond simulating acoustic leakage and directional filtering. However, there is yet another big thing that could be added.


Reverberation

The rooms that we have at home rarely have extensive acoustic treatment similar to studios. Certainly, when setting up and tuning a speaker system in a room I try to minimize the impact of reflections during the first 25 ms or so (see my post about setting up LXmini in a living room). However, such a room is still "live" and has a long reverberation tail. The latter is obviously missing when playing over headphones. A slight amount of artificial reverb with controlled delay time and level helps to "liven up" headphone playback and adds more "envelopment" even to a stereo recording.
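As an illustration of the kind of processing involved, below is one classic building block of such an artificial tail: a Schroeder-style feedback comb filter. This is only a sketch; a usable reverb combines several of these with mutually prime delay lengths, plus allpass sections, and the parameters here are arbitrary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Schroeder-style feedback comb filter. The input recirculates through
// a delay line, decaying by the feedback factor on every pass, which
// produces an exponentially decaying "tail" of echoes.
class FeedbackComb {
public:
    FeedbackComb(std::size_t delaySamples, float feedback)
        : buffer_(delaySamples, 0.0f), feedback_(feedback) {}

    float process(float x) {
        const float delayed = buffer_[pos_];
        buffer_[pos_] = x + feedback_ * delayed;   // recirculate the tail
        pos_ = (pos_ + 1) % buffer_.size();
        return delayed;
    }

private:
    std::vector<float> buffer_;
    float feedback_;
    std::size_t pos_ = 0;
};
```

The decay time is easy to dial in: the tail loses 20*log10(feedback) dB with every pass through the delay line, so a longer delay or a feedback value closer to 1 gives a longer reverberation tail.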

The standard LEDE design of audio studios also allows for some diffuse sound coming from behind the listener. This sound, which is decorrelated from the direct sound of the speakers, helps to enhance the clarity of the reproduction. In fact, the more it is decorrelated, the better, since that minimizes comb filtering.

Headphones Equalization

These days measuring headphones is a popular hobby among tech-savvy audiophiles. What these measurements show is that no two models of headphones are tuned the same way. There are well-known "recommended" targets, like the Harman curve, or the diffuse-field curve, which strives to make the sound pressure delivered to the microphones of a head and torso simulator resemble the sound pressure they would receive in a room with a lot of random reflections. However, each designer tends to bring in some "voicing" to stand out from the crowd, and as a result one might have to go a long way to find headphones that satisfy their musical taste. I guess that if the customer's ears and body have dimensions similar to those of some good headphone designer, the customer could be quite happy with the tuning.

I had some fun trying audio plugins for cross-tuning headphones to make them sound similar to other models, however the outcome of these experiments was still somewhat unsatisfying. The only equalization which seems to be useful is the one which ensures that the headphones deliver a flat frequency response to the eardrums. This is a "ground zero" equalization on top of which one can start putting on HRTFs and preference tuning curves.

One problem with trying to achieve flat equalization by means of plugins is that the measurements they use were taken on a head and torso simulator and don't take into account how the headphones interact with my ears; thus the resulting tuning is not flat. It's not even balanced correctly, since my ears are not symmetric. It's very easy to demonstrate this by playing over headphones a mono signal of banded tone bursts or chirps across the audible range: the bursts move arbitrarily from left to right. This almost doesn't occur when playing the same signals over a tuned pair of stereo speakers, because their sound passes through the "outer" HRTF filter (the body), and the auditory system can find a matching pair of HRTFs for compensation. With headphones the matching pair of HRTFs cannot be found, so no compensation occurs.

This is actually a serious problem, and a lot of research related to HRTFs is devoted to finding ways of deriving a personalized HRTF without physically taking the subject into an anechoic chamber to measure it directly. However, for simulating stereo speakers, knowing the full HRTF (for sources in any direction) is not required. Still, some degree of personal headphone equalization is needed to achieve proper centering of mono images and to place the virtual speakers in front of the listener in the horizontal plane.

Head Tracking

There is another way of dealing with the lack of personal headphone equalization. Our hearing takes a lot of cues from other sensory systems: vision, motion, the sense of vibration, and from higher levels of the brain, all to compensate for missing or contradictory cues that our ears receive. By changing the sound according to head movements, e.g. with the use of some generic HRTFs, we can engage our adaptation mechanisms to start making sense of the changes they produce. Obviously, using a person's own HRTF would be ideal; however, providing auditory feedback for head movements relies on the ability of our brain to learn new things that are useful for survival.

Gaming-oriented headsets with head tracking, e.g. the Audeze Mobius, have been available for a long time already. Lately, mass consumer-oriented companies like Apple have also adopted head tracking technology for more realistic multi-channel audio reproduction over headphones, and a lot of other companies will undoubtedly follow suit.

What's Next?

I'm going to discuss how headphone virtualization is implemented in Waves Nx, and also my DIY approach based on D. Griesinger's ideas.

Saturday, January 23, 2021

Teensy Project: Talking ABC

As I had mentioned in my previous post, I was intending to build a talking Russian ABC for my daughter. It took me a lot of time to complete this project, and finally it's done:

This was an exciting, if somewhat exhausting, effort, and I've learned a couple of things along the way. Making this ABC myself also made me realize just how much complexity we take for granted in the everyday things that surround us. Talking toys these days cost $19–$39 and we consider them "cheap stuff." However, behind each of them there are likely days, if not weeks, of experimenting, designing, and testing. It's only thanks to mass production and to outsourcing manufacturing to China that we can enjoy them at such low costs.

The Design

In a nutshell, the design of the ABC toy is pretty obvious: there is an input (buttons), an output (speaker), and a microcomputer (Teensy) which ties things together. After my experiments with various audio options for Teensy, I settled on using the smaller version of Teensy (called 4.0) with the Audio Shield, which serves both as a DAC and as an SD card host controller, and the "Noisy Cricket" amplifier to drive a single 0.5 W speaker. The ABC is a standalone toy, so it must use a battery. I purchased a 2200 mAh, 3.7 V Li-Poly rechargeable battery and a charging board for it from SparkFun. That's all the electronics involved.

As for passive components, this toy needs buttons, a lot of them. The Russian alphabet has 33 letters, and I also needed 10 buttons for numbers and 2 buttons for changing the mode. The ABC pronounces either the name of a letter or the sound it stands for, along with the word in the picture:

In total, that's 45 pushbuttons. Finally, I needed a toggle switch to turn the toy on and off, and two LEDs: one to show that it's turned on, and another to show that the battery is charging. Charging is done via a micro-USB port. I've added another micro-USB port to extend the USB port of Teensy so it can be reprogrammed if needed without removing the back cover.

The number of pushbuttons used didn't allow wiring each one of them individually. Instead, I organized them into a grid. This is a somewhat crude schematic of the toy:

I'll explain how the pushbutton grid works in a dedicated section. Physically the toy is built like a big but slim rectangular box with the front panel hosting all the components.

I used two identical ABS sheets for the front and back panels. The frame is wooden and is attached permanently to the front panel. The toy is sturdy, if a bit heavy. The ABS sheets are black, and I needed to make them look friendly for a child, so I covered them with self-adhesive films and some decals. The film also covers the holes and the heads of the screws used to attach the components.

Input Grid

There are 45 buttons to monitor. Monitoring each of them individually would require the same number of digital input pins. Although Teensy 4.1 could potentially handle that, I was using a 4.0, and moreover some of its pins were reserved for communicating with the Audio Shield, leaving only about 15 for handling the buttons. Thus, some multiplexing was needed. The idea is that we don't try to catch each button press at all times, but rather query groups of buttons at periodic intervals. If the intervals are short enough, the discrete nature of the querying is not noticeable to humans.

This is the schematic I've ended up with:

I use a 7x7 grid connecting digital inputs and outputs of Teensy. We go row by row, setting the output level to "HIGH" and checking the signal level on each column. In order to minimize false triggering by static electricity, each input is connected to ground via a pulldown resistor. This works like a charm. The monitoring code is straightforward:

// Helper macro for the number of elements in an array:
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

// Pin numbers used for outputs and inputs:
const int outs[] = { ... };
const int ins[] = { ... };

void setup() {
  // Initial configuration: rows are outputs driven LOW,
  // columns are inputs pulled down to ground externally.
  for (unsigned int out = 0; out < ARRAY_SIZE(outs); ++out) {
    pinMode(outs[out], OUTPUT);
    digitalWrite(outs[out], LOW);
  }
  for (unsigned int in = 0; in < ARRAY_SIZE(ins); ++in) {
    pinMode(ins[in], INPUT);
  }
}

void loop() {
  // Scan the grid row by row.
  for (unsigned int out = 0; out < ARRAY_SIZE(outs); ++out) {
    digitalWrite(outs[out], HIGH);
    for (unsigned int in = 0; in < ARRAY_SIZE(ins); ++in) {
      if (digitalRead(ins[in])) {
        // Keypress detected at row `out`, column `in`.
      }
    }
    digitalWrite(outs[out], LOW);
  }
}

The actual code is a bit more complex due to the need to avoid restarting the sound if a button is accidentally pressed twice. The full code for this project is published here on GitHub.
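The lockout logic can be sketched like this. This is a simplified stand-in, not the exact code from the project; the class name and the millisecond-based interface are my own:

```cpp
#include <cassert>
#include <cstdint>

// A press of a key is accepted only if enough time has passed since the
// previous accepted press, so an accidental double press doesn't restart
// the sound that is already playing.
class KeyLockout {
public:
    explicit KeyLockout(int64_t lockoutMs) : lockoutMs_(lockoutMs) {}

    // Returns true if this press should start playback.
    bool accept(int64_t nowMs) {
        if (lastAcceptedMs_ >= 0 && nowMs - lastAcceptedMs_ < lockoutMs_) {
            return false;  // still within the lockout window
        }
        lastAcceptedMs_ = nowMs;
        return true;
    }

private:
    int64_t lockoutMs_;
    int64_t lastAcceptedMs_ = -1;  // -1 means "no press accepted yet"
};
```

On the actual Teensy, the timestamp would come from something like `millis()`; keeping the logic in a small class like this makes it testable on a desktop machine.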

Attaching Audio Shield to Teensy

The Audio Shield is designed to cover all the pins of Teensy; however, it doesn't actually use all of them. So instead of soldering a row of female pin headers onto the audio shield and full rows of male pins onto Teensy, I've ended up with the following arrangement:

I soldered male pins to the audio shield and cut off those unused by it. I soldered angled male pins to the contact holes on Teensy above the removed pins, and used plastic shims on the angles to make Teensy "float" above the shield. I called the resulting design "The Dreadnought" thanks to the gun-like pins on both sides of the board. There is also a double "tail" of pins at the back: the upper row is soldered into the holes to provide power to the Noisy Cricket, and the lower row is soldered to the pads on the bottom of Teensy for additional inputs.

This arrangement ended up slimmer than it would have been with the usual pairs of female/male pin rows, and it fits, even with some extra space, into the 3/4" height of the toy's internal compartment.

Tuning Audio Output

I tried my best to achieve a "transparent" sound for the toy's speaker; unfortunately, I fell short of that aim due to the natural limits of this speaker. Nevertheless, at least I found a very straightforward way of performing measurements through the entire Teensy / Audio Shield / Noisy Cricket stack, and also a way of quickly doing some DSP tuning using REW. Here are some technical details.

Initially, when thinking about measurements, I was considering Teensy as a regular consumer audio device (output only), which means it must be tested using the so-called "open loop" technique. This involves somehow delivering test signals to the device, playing them, recording the output, and then analyzing it "offline." This is a really tedious process that requires a lot of experience to iterate quickly.

Another problem with the "open loop" technique is that the playback device and the recording device are both digital yet unsynchronized, and this often produces artifacts when digitally processing the recording of the test signal, due to slight variations between the actual sampling rates.

However, I soon realized that Teensy is actually much more versatile than a regular microcontroller. First, it can act as a USB audio interface (see the details here), which means that the measurement application can work in real time, in a "closed loop" measurement mode which is much more productive than "open loop." In theory, with a good I2S audio I/O board connected to Teensy, it would be possible to run both playback and recording from a measurement microphone through Teensy. However, the microphone input on the Audio Shield was not designed for acoustic measurements, so an external audio card is required.

The external audio card needs a way to synchronize its clock with Teensy; otherwise, as I've mentioned, there is a high chance of getting a skewed measurement. One approach to syncing two USB audio devices is to use a feature of macOS, as I did previously for the Ambeo headset. However, a better way is to utilize the built-in S/PDIF output of Teensy. This is the diagram of the measurement loop I've ended up with:

Teensy provides the clock to the RME Fireface and handles playback; the Fireface handles the input from a measurement microphone. This arrangement has demonstrated solid correlation in Smaart, which means we are actually measuring the output of the system and can tweak it.

For tweaking, I preferred to use REW. Teensy Audio Library offers a biquad filter component which accepts raw coefficients, and REW is very handy for generating them. This was my workflow:

  1. Measure the response of the ABC using REW.
  2. Go to EQ dialog. Use "Generic" equalizer mode.
  3. Adjust the target curve and let REW calculate correction filters. If there are too many of them (the biquad component on Teensy allows only 4), disable them, and ask REW to optimize using only the remaining ones.
  4. Save the biquad coefficients to a file for 44.1 kHz sampling rate (only Generic equalizer in REW allows choosing the SR).
  5. Paste generated biquads into code, negating the signs for a1 and a2 coefficients.
  6. Update the sketch on Teensy.
  7. Restart REW since unfortunately the USB audio interface exposed by Teensy resets after reflashing and REW (and any other audio program) loses it.
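The negation in step 5 is needed, as far as I understand it, because of a difference in conventions: REW exports coefficients for the form y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2], while the Teensy biquad adds the feedback terms instead of subtracting them. A Direct Form I sketch using the "adding" convention:

```cpp
#include <cassert>

// Direct Form I biquad, with the feedback terms *added*. Coefficients taken
// from REW must therefore have a1 and a2 negated before being used here.
struct Biquad {
    double b0, b1, b2, a1, a2;  // a1, a2 already negated, "adding" style
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;

    double process(double x) {
        const double y = b0 * x + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
        x2 = x1; x1 = x;   // shift the input history
        y2 = y1; y1 = y;   // shift the output history
        return y;
    }
};
```

If the coefficients are pasted without the negation, the feedback gets the wrong sign and the filter response is wrong (and often unstable), which is easy to hear immediately.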

Below are frequency response graphs from before and after the tuning. The main problem with this speaker and enclosure is the dip at about 5 kHz followed by a huge scoop. This makes the overall sound telephone-like, and it's hard to do anything about it:

In the time domain, there is noticeable "boominess" in the low end:

Power Consumption

Since it's a battery-powered device, I wanted to make sure that it doesn't run out of battery too quickly. In order to measure power consumption, I first measured the actual voltage provided by the battery when powering Teensy: it was 4.1 VDC. Then I dialed this voltage in on a desktop power supply, powered the ABC from it, and checked the displayed current: it was 125 mA when idle and 150 mA when playing sounds. Given that the battery is rated at 2200 mAh, the toy can run for roughly 15 hours even with continuous playback.

I checked whether Teensy can turn itself off and found that it's only possible with external power-control circuitry. I didn't account for this in the initial design, so I decided to go without it. In fact, my daughter is disciplined enough to turn the toy off after using it, so there is really no need for this extra circuit.


So far, this was the longest project I have undertaken. Next time I would likely try to limit the time spent, as seeing no light at the end of the tunnel for a long time lowers one's morale. It was a great relief to finish this project.

The whole idea of using a microcontroller for audio automation seems very appealing, though. I can see how Teensy could be used in various audio devices. I would also like to use Teensy in some audio processing project, but first I need to figure out how to go beyond the default 44.1 kHz, 16-bit mode for audio processing.