Saturday, December 18, 2021

YouTube Music and Intersample Peaks

A while ago I was checking how Google Play Music and the playback chain following it were handling intersample peaks. Since then, GPM was retired and replaced with YouTube Music (YTM), browsers have received countless updates, and so on. Did the situation with digital headroom improve? I was prompted to check this after trying YTM in the Chrome browser on my Linux laptop and being disappointed with the quality of the output. Before that, I had been using YTM on other OSes, and it sounded fine. Is there anything wrong with Linux? I decided to find out.

I have updated my set of test files. I took the same test signal I used back in 2017: a stereo file where both channels carry a sine at a quarter of the sampling rate (11025 or 12000 Hz), with a 45-degree phase shift. The left channel has this signal normalized to 0 dBFS, which creates intersample overs peaking at about +3 dBFS; the right channel has the signal at half of full scale (6 dB down), which provides enough headroom and should survive any transformations:

I have produced a set of test files to include all the combinations of the following attributes:

  • sample rate: 44.1 and 48 kHz;
  • bit width: 16 and 24;
  • dither: none and triangular.
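For reference, here is a minimal MATLAB sketch of how one such file (24-bit, 44.1 kHz, no dither) could be generated; the file name and the duration are my own choices:

fs = 44100;                      % sample rate; use 48000 for the 48 kHz variants
n  = (0:10*fs-1)';               % 10 seconds of samples
x  = sin(pi/2 * n + pi/4);       % sine at fs/4 with a 45-degree phase offset
x  = x / max(abs(x));            % normalize the samples to 0 dBFS; the continuous
                                 % waveform now peaks at about +3 dBFS between samples
left  = x;                       % channel with intersample overs
right = 0.5 * x;                 % 6 dB down, enough headroom
audiowrite('isp_test_44_24.wav', [left right], fs, 'BitsPerSample', 24);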

There are two things that I can validate using these signals: non-linearities introduced by clipping or compression of intersample peaks, and whether the inter-channel balance stays the same. For measuring non-linearities I used the THD+N measurement. Since the signal is at a quarter of the sampling rate, even the second harmonic is out of the frequency range, so the "harmonic distortion" part of this measurement doesn't make much sense; the "noise" part, however, still does. There is a strong correlation between the look of the frequency response graph and the value of the THD+N.

I have uploaded my test signals to YouTube Music and then measured the THD+N in the following clients:

  • the official web client running in recent stable versions of Chrome and Firefox on Debian Linux, macOS, and Windows,
  • and the official mobile apps running on an Android phone (Pixel 5) and an iPad Air.

All the outputs were measured using a digital capture chain. For macOS and Windows I used a hardware loopback on an RME Fireface card. For Linux I used a Douk Audio USB-to-S/PDIF digital interface (Mini XMOS XU208) connected via an optical cable to the Fireface. For mobile devices I used the iConnectAudio4, a dual-USB sound card by iConnectivity. The sound cards were configured either at 44.1 or at 48 kHz.

Observations and Results

The first thing I noted was that YouTube Music stores audio tracks at a 44.1 kHz sample rate (this is confirmed by the "Encoding Specifications" on the YT tech support pages), and 48 kHz files get mercilessly resampled, clipping the channel with overs quite severely. This can be easily seen from the level difference between the L and R channels of the played back signal—it's only 4.34 dB instead of 6 dB. Below is the spectrum of the 48 kHz test signal after it has gone through YTM's server guts:

Also, as can be seen from the graph, YTM does some "loudness normalization" by scaling the amplitude of the track down, likely after resampling it to 44.1 kHz. This brings the peaks on both channels down by about 11 dB. Actually, that's good because it provides the headroom needed for any sample rate conversions happening after the tracks leave the YTM client.

As for the lossy compression, it actually doesn't add many artifacts, as we can see from this example:

Yes, there is a "bulge" around the original signal, likely due to the fact that the codec works in the frequency domain with reduced resolution. However, the THD+N of this signal is just 3 dB worse (-103.4 dB) than that of the 16-bit dithered original (-106.8 dB), and it's still on par with the capabilities of good analog electronics. So the lossy codec is not on my list of concerns for the content on YTM.

Desktop Clients

On desktop, the difference in the measurements only depends on the browser. However, the trouble with Linux is that both Chrome and Firefox always switch the output to 48 kHz as soon as they start playing, even if the PulseAudio daemon is configured to use 44100 Hz for both the "default" and "alternative" sample rates. As we will see, this works out badly for Chrome and was likely the reason why I initially felt that something was wrong with YTM on Linux.

Yet another interesting observation on the desktop is that when the browser does a bad job of resampling, bringing down the digital volume control in the YTM client does not provide any extra headroom for the browser's processing. That was a bummer! Apparently, the order of the processing blocks has changed compared to Play Music, putting the digital attenuation after resampling, maybe because YTM uses some modern web audio API which gives the browser more control over media playback.

Here is a summary of the THD+N measurements (in dB) for Chrome and Firefox when the system output is either at the "native" sampling rate of 44.1 kHz or at 48 kHz. The second column contains the baseline numbers for the original dithered signal; measurements for the left and right channels are separated by a slash:

Signal   Original          Chrome @44.1      Chrome @48       Firefox @44.1     Firefox @48
24/44    -146.7 / -139.2   -102.7 / -103.7   -29.6 / -82.6    -103.4 / -103.7   -103.3 / -103.5
16/44    -106.8 / -95.5    -102.9 / -97.8    -29.6 / -82.5    -102.1 / -97.8    -102.3 / -97.6
24/48    -147.5 / -139.7   -17.7 / -98.4     -17.7 / -80.6    -17.7 / -98.4     -17.7 / -98.4
16/48    -106.7 / -95.6    -17.7 / -89.4     -17.7 / -79.7    -17.7 / -89.5     -17.7 / -89.3

As we can see, Chrome doesn't do a good job when it has to resample the output to 48 kHz, thus on Linux the only option is to use Firefox instead. And obviously, even Firefox can't undo the damage already done to the original 48 kHz signal with intersample overs.

My guess would be that the audio path in Firefox uses floating-point processing, which creates the necessary headroom, while Chrome still uses integer arithmetic.

Mobile Clients

Results from iOS are on par with Firefox, confirming that this is likely the best result we can achieve with YTM. Android adds more noise:

Signal   Original          Android @44.1    Android @48      iOS @44.1         iOS @48
24/44    -146.7 / -139.2   -92.9 / -92.2    -92.9 / -92.2    -102.8 / -102.2   -102.8 / -102.2
16/44    -106.8 / -95.5    -92.6 / -88.3    -92.6 / -88      -101.8 / -97.7    -102 / -97.7
24/48    -147.5 / -139.7   -17.7 / -92      -17.7 / -92      -17.7 / -98.5     -17.7 / -98.5
16/48    -106.7 / -95.6    -17.7 / -87.9    -17.7 / -87.8    -17.7 / -89.4     -17.7 / -89.4

I had a chance to peek "under the hood" of the Pixel 5 by looking at the debug dump of the audio service. What I could see there is that extra sample rate conversions happen on the way from the YTM app to the USB sound card. The app creates audio tracks with a 44100 Hz sample rate. However, USB audio on modern Android phones is managed by the same SoC audio DSP used for the built-in audio devices, to bring down latency when using USB headsets. The DSP works at 48 kHz. Thus, even when the USB sound card is at 44.1 kHz, the audio tracks from YTM first get upsampled to 48 kHz to reach the DSP, and then the DSP downsamples them back to 44.1 kHz for the sound card. I guess on Apple devices either this pipeline is more streamlined, or everything (including the DSP) uses calculations that provide enough headroom.

Conclusions

I think it is all pretty clear, but here is a summary of how to squeeze the best quality out of YouTube Music:

  • on desktop, when using Chrome (or Edge on Windows), set the sampling rate of the output to YTM's native sample rate of 44.1 kHz; if that's not possible, use Firefox;
  • on Linux, always use Firefox instead of Chrome for running the YTM client, because even lowering the digital volume in the YTM client does not prevent clipping;
  • since YTM applies volume normalization, there is no need to worry about digital headroom on the DAC side;
  • any 48 kHz or higher content needs to be carefully resampled to 44.1 kHz before uploading to YTM to prevent damage from their sample rate conversion process.

Monday, November 1, 2021

Headphone Stereo Setup

After making a satisfying desktop stereo setup I decided to do something similar with headphones. As I had discussed before, right out of the box no headphones sound convincing to me, simply due to physics and psychoacoustics issues that can't be solved using traditional headphone construction. As a result, it's just not possible to reproduce a stereo record intended for speakers and expect the instruments to be placed correctly in the auditory image, even on "flagship" headphones. I'm always puzzled when I encounter phrases like "rock-solid imaging" in headphone reviews, especially when accompanied by measurement graphs confirming that the left and the right earphones are perfectly matched. I don't know—perhaps the reviewer has a perfectly symmetric head and body, and ideally matched ears—for my aging ears I know that the right one is about 3 dB more sensitive than the left one, so on perfectly matched headphone drivers I naturally have the auditory image shifted slightly to the right.

On the other hand, in order to achieve convincing stereo reproduction in headphones it's not necessary to go "full VR": measure the individual HRTF of the listener in an anechoic chamber, and then perform a physically correct simulation of speakers as virtual sound sources placed in front of the listener in a room and moving around as the listener's head moves. In fact, after trying to use Waves NX for some time, I've found that head tracking only creates an additional distraction, as it requires periodic resetting of the "neutral" head position due to the headband shifting on the head. So I wanted something simpler, and I think I've found a good middle ground with my setup.

In my headphone setup I follow the same principles as when setting up the desktop speakers—get the most important things right first, and then tune up the rest, getting as close to "ideal" as possible, but stopping when the cost of the next improvement becomes too high. However, the implementation of these principles is a bit different. There isn't as much "physical alignment" in the headphone setup as one has to do for speakers. The only thing I had to ensure is that the headphone amplifier stays linear and doesn't produce distortion. Then most of the setup happens on the DSP side. But even there, a distinction between "main" and "fine" tuning does exist.

As I had explained in my earlier post on headphone virtualization, reproduction over headphones lacks several components that we take for granted when listening over speakers:

  1. Room reverberation. This is a very important component which significantly supports the sound of the speakers themselves and also helps to place reproduced sources correctly in the auditory image. Acousticians love to talk about "direct-to-reverb" sound ratio when considering opera halls and other venues, as this is one of the parameters which separates good sounding spaces from bad sounding ones.

  2. Acoustical leakage between speakers. This is considered a negative factor in VR-over-speakers applications, because for VR one needs to precisely control the sound delivered to each ear; however, stereo recordings actually rely on this acoustical leakage. Without it, sources that are hard panned to one channel tend to "stick" to the headphone which is playing them, narrowing the sound stage considerably.

  3. Asymmetries in the human body and the hearing system. Listening over headphones makes the sounds coming into the left and right ears very symmetric, and this confuses the auditory system. Also, with aging, the sensitivity of the ears becomes less and less symmetrical, requiring individual tuning of headphones.

To achieve more realistic reproduction over headphones we need to replicate the effects of the factors listed above. Some headphone manufacturers have tried to do that in hardware, and we got products like the AKG K1000 "earspeaker" headphones, which I guess sound almost right for stereo records, but are quite cumbersome to use, not to mention the price. A good pair of open over-ear headphones can also come close to naturalistic stereo reproduction because they allow for some inter-aural leakage as well as slight interaction with the room. However, closed over-ear headphones and IEMs are hopeless in this respect, and only electronic or digital solutions can help them produce a speaker-like stereo soundstage.
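As an illustration of the second factor from the list above, here is a toy MATLAB sketch of digital crossfeed—feeding an attenuated, slightly delayed, and low-passed copy of each channel into the opposite ear. This is not how Redline Monitor (used later in this post) works internally; the parameter values and file names are my own assumptions:

[x, fs] = audioread('input.wav');          % a stereo track (file name is illustrative)
delaySamp = round(0.3e-3 * fs);            % ~0.3 ms of inter-aural delay
atten = 10^(-4.5/20);                      % leaked signal at about -4.5 dB
[b, a] = butter(1, 700 / (fs/2));          % gentle low-pass: the head shadows the highs
leakL = atten * filter(b, a, [zeros(delaySamp, 1); x(1:end-delaySamp, 2)]);
leakR = atten * filter(b, a, [zeros(delaySamp, 1); x(1:end-delaySamp, 1)]);
y = [x(:,1) + leakL, x(:,2) + leakR];
audiowrite('crossfed.wav', y / max(abs(y(:))), fs);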

Before we dive into the details of my setup, there are two main factors that are indicative for me when judging the correctness of headphone tuning:

  • The sound is localized outside of the head. The actual perceived distance still depends on the recording, and sometimes it feels that the vocals are still very close to your face—for lots of modern records that's in fact the "artist's intention"—however, by a quick A/B comparison with the unprocessed headphone sound one can tell that although the sound appears to be close to the face, it's definitely not inside the head.

  • Every instrument can be heard distinctively, similar to how it sounds over well-tuned stereo speakers. By replicating the natural HRTF of the person via headphone tuning we "place" each frequency band correctly in the auditory image, and this allows the auditory system to separate auditory streams efficiently.

As a final analogy, putting on properly tuned headphones feels similar to wearing VR glasses—you feel "immersed" into the scene, as if you are peeking into it via some "acoustic window."

The Tuning Process

The process of headphone tuning can be separated into several phases:

  1. Simulate ideal reverberation conditions for the actual room we are listening in. Although we could simply capture the reverb of the room, it's usually far from "ideal" due to strong reflections. If you went all the way and built an ideal physical room—congratulations!—you can just use the captured response directly. The rest of us can build a great virtual version of our room instead.

  2. Adjust the crossfeed and direct-to-reverb (D/R) ratio making sure that phantom sources end up placed correctly, especially those in "extreme" positions—outside the speakers. This tuning also moves the acoustic image out of the head.

  3. Tune the inter-aural frequency balance. This way we emulate the natural HRTF and any deficiencies of the individual's hearing apparatus that the brain got accustomed to.

  4. Finally, as an optional step we can use time domain signal correction to ensure that the electrical signal reaching the headphones has properties close to those of an ideal low-pass filter.

As for the choice of headphones, my intention was to create a tuning for the Shure SRH-1540 closed-back headphones. These headphones are very comfortable to wear: lightweight, with negligible pressure on the head, and not causing my ears to sweat. However, their factory tuning is too V-shaped—a strange choice for "studio" headphones, by the way. I strongly prefer the tuning of headphones made by Audeze because it closely resembles the sound of properly tuned speakers (and I have confirmed that by measuring with in-ear microphones), but the weight of planar magnetic headphones literally brings my head down (I compared the weights in one of my previous posts), and their thick faux leather pads quickly turn my ears into hot dumplings. So I ended up using the Audeze EL-8 Closed-Back as a tuning reference, but after finishing with it I put them back into their box.

Reverberation

The idea behind replicating the reverberation of the room is that once we enter a room, our hearing system adapts to it and uses reflected sounds and reverberation as a source of information for locating sound sources. This happens unconsciously—we just "feel" that the sound source is out there, without actually "hearing" the reflected sounds, unless the delay is large enough to perceive them as echoes. Thus, replicating the reverberation of the room over headphones helps the auditory system to perceive the sounds we hear as happening around us, in the room.
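For the case mentioned in the first phase above—when the captured room response is good enough to be used directly—the simplest way to apply it is convolution. Below is a minimal sketch of that path (I ended up emulating the reverb instead, as described next; the file names and the direct-to-reverb mix are my own assumptions):

[ir, fsIR] = audioread('room_ir.wav');           % captured room impulse response
[x, fs]    = audioread('dry_track.wav');         % a dry stereo track
assert(fs == fsIR, 'resample the IR to the track sample rate first');
wet = fftfilt(ir(:,1), x);                       % convolve each channel with the IR
mix = 0.8 * x + 0.2 * wet;                       % a crude direct-to-reverb balance
audiowrite('with_room.wav', mix / max(abs(mix(:))), fs);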

I captured the reverberation of my room using the classic technique of recording a pop of a balloon. Then I took the Waves TrueVerb plugin and tried to tune its parameters so that the resulting reverb matches the one I've captured. Speaking of "ideal" reverberation—I liked the idea of the "ambechoic" room pioneered by G. Massenburg, which I read about in the book "Acoustic Absorbers and Diffusers". The physical implementation of "ambechoic" requires using a lot of wide-band diffusers in order to "break up" all the reflections while retaining the energy of the reflected sound. In my virtual recreation, I simply turned off the early reflections simulation and set the density of the emulated reverb tail to the maximum value, and this is what I've got (ETC graph):

The first strong reflection (marked by the cursor) is created by the Redline Monitor, more on that later. Note that the reverb tail still looks a bit spiky, but this is the best I could obtain from TrueVerb.

I'm not very good at matching reverbs "by ear," so I used two tools: the IR measurements of Smaart V8 and the RT60 calculator of Acourate. The first has a good proprietary algorithm for finding the D/R ratio and overall decay times; the second shows decay times for each frequency band in a convenient form and can display tolerance curves from standards.

Below are side by side comparisons of ETC for the real vs. emulated rooms as shown by Smaart V8:

I tried to get them as close as TrueVerb's controls allowed me to. The early decay time (EDT) of the simulation is much shorter due to the absence of early reflections, but I don't think it's an issue. The RT60 time is 25% shorter—I was trying to make it the same as that of the room, however there are limits on the granularity of TrueVerb's settings. Still, this shorter time is acceptable according to the comparison graph below—it shows per-frequency decay times along with tolerance boundaries from the DIN 18041 standard for music reproduction, calculated by Acourate for a room of my size:

Although I didn't try matching the reverbs "by ear," I still listened to them carefully, as measurements alone do not provide the full picture. During my early experiments I was intending to use the built-in reverb of my MOTU sound card—after all, it comes for free! However, despite looking similar on the measurement side, MOTU's reverb sounded horrible, with a very distinctive flutter echo. By the way, dry recordings of percussive musical instruments like castanets or bongos turned out to be excellent for revealing flaws in artificial reverbs.

Cross-feed and D/R ratio

TrueVerb was designed to be sufficient on its own for providing a stereo reverb and controlling its frequency response. However, the degree of control it provides wasn't enough for my specific needs. As a result, I ended up using the mono version of TrueVerb on two parallel buses and augmenting it with Redline Monitor and an equalizer. Here is the connections diagram:

Note that TrueVerb outputs the reverb tail only. This way, I've got full control over the stereo dispersion and the spectral shape of the reverb tail. After playing with different settings on Redline Monitor I've ended up with a 90-degree soundstage—with it, the reverb sounds "enveloping," which was exactly my goal.

The direct sound is placed on a separate bus, with its own instance of Redline Monitor and its own set of cross-feed parameters. By altering the volume control on this bus I can change the direct-to-reverb ratio.

On the Redline Monitor for the direct sound I've pinned the "speaker distance" parameter to the minimum value above zero: 0.1 meter. What I've found is that zero distance doesn't provide convincing externalization, however increasing the speaker distance adds a considerable combing effect—see my previous post about Redline Monitor for graphs. What I could see on the ETC graph is that a non-zero "speaker distance" adds virtual reflections. Here I compare the settings of 0 meter, 0.1 meter, and 2.0 meter distance:

I suppose the presence of these reflections emulates the bounce of the sound off the mixing console (since Redline Monitor is intended for studios). As the "speaker distance" increases, the level of these reflections rises relative to the direct impulse. That's understandable—the further one moves away from the speakers, the more similar the levels of the direct sound and the first reflection become. However, this increases the amplitude of the comb filtering ripples, thus the minimum possible "speaker distance" is what we want to use. This setting keeps the emulated reflection at -26 dB below the level of the direct sound—an acceptable condition if we consider a real acoustic environment.
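As a back-of-the-envelope check (my own, not from any plugin documentation), a single reflection at -26 dB relative to the direct sound limits the comb-filtering ripples to under a decibel:

r = 10^(-26/20);                            % reflection amplitude relative to the direct sound
ripple_db = 20*log10((1 + r) / (1 - r))     % peak-to-trough ripple depth, about 0.9 dB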

After fixing the speaker distance I've spent some time tweaking multiple parameters which turned out to be interconnected for the auditory system, since changing one had an effect on the others:

  • the soundstage width,
  • the attenuation of the center channel (both parameters are on the Redline Monitor), and
  • relative levels between the direct sound bus and the reverb bus (D/R ratio).

While tweaking them I used stereo soundstage test tracks from Chesky Records demo CDs to ensure that sounds panned to left and right positions sound in headphones as if they are indeed coming from the corresponding speaker, and that "extreme" left and right—beyond the speakers—are reproduced convincingly. I also used music tracks with strong, energetic "in your face" mix (album "Cut" by the industrial "supergroup" C-Tec) to ensure that I could put the vocals further away from my face.

I tried to avoid attenuating the reverb too much compared to the direct sound as this dramatically decreases the perceived distance to the source. However, having the reverb too strong was breaking the perception of "extreme" left and right source positions, and so on. So finding the sweet spot for the combination of the simulation parameters turned out to be a challenging task and it actually gave me some intuitive understanding of how real speakers can interact with a real room.

Aligning Amplitude Response

Basically, what I have achieved through the previous stages is a virtual speaker setup in a virtual room with a reverb similar to the one in my real room. Now I had to align the frequency response of that setup—as I hear it via the headphones—with the frequency response of my real speakers—as their sound reaches my ears. This process is often referred to as "headphone equalization." Traditionally it's done using a head and torso simulator, but I don't have one, so I used in-ear microphones on my own head—which is even better because this way the tuning becomes personal.

I used my Sennheiser Ambeo Headset for this task. I captured the amplitude response of the speakers in Smaart V8 with the Ambeo sitting in my ears. Then I captured the amplitude response of the EL-8s—also via the Ambeo—and it turned out to be quite close to the speakers—no surprise that I like the sound of the EL-8s so much. I must note that the positioning of centered banded noise was still wrong in the EL-8s, so even if I'd chosen to stick with them I would still have to do some personal tuning—more about this later.

Nevertheless, what I wanted was to tune my SRH-1540s. I started measuring them, and they turned out to be way off the speaker sound "baseline": too much bass and too much treble—the V-shaped tuning in action. So I started equalizing them "in real time," by adjusting the equalizer while listening. I used a linear phase equalizer (LP10 by DDMF) to avoid altering the inter-aural time difference (ITD): sharp EQ curves implemented using minimum phase filters can significantly affect the phase and thus change the ITD, since the tuning for the left and right ears is not symmetric.
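As an aside, one way to get a starting point for such an equalization offline would be to compute the difference between the two measured responses. A rough sketch, assuming both measurements were exported as frequency/magnitude tables (the file names and the smoothing are my own choices; the actual tuning was done by ear and measurement in real time, as described):

spk = readmatrix('speakers_at_ear.csv');        % columns: frequency (Hz), magnitude (dB)
hp  = readmatrix('headphones_at_ear.csv');
f   = spk(:,1);
hpOnGrid = interp1(hp(:,1), hp(:,2), f, 'linear', 'extrap');
eqTarget = spk(:,2) - hpOnGrid;                 % dB of gain to apply at each frequency
eqTarget = smoothdata(eqTarget, 'movmean', 9);  % crude smoothing of measurement noise
semilogx(f, eqTarget); grid on;
xlabel('Frequency, Hz'); ylabel('EQ target, dB');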

After setting the amplitude response, I removed the Ambeo from my ears—what a relief!—and performed the final tuning touches to make sure that all frequency bands are positioned consistently within the auditory image. This is extremely important in order to avoid the spreading of auditory images of individual instruments.

For this step of tuning I used test signals generated by the DGSonicFocus app by Dr. Griesinger. The app produces bands of noise centered between the channels. It can produce either correlated or decorrelated noise—I was using the latter option. When listening over correctly tuned speakers, these test signals create a phantom center image. Thanks to my initial amplitude correction of the headphone output, some of the bands were already placed correctly in the auditory image, but some were not, mostly in the high-frequency range, because it's hard to tune the high-frequency region correctly from measurements alone—they tend to be too volatile. So I used my ears instead, and by applying peaking EQs in the same linear phase equalizer I managed to "move" all the bands to the center.

Below are the resulting EQ curves for SRH-1540. Note just how asymmetric they have to be in order to create a convincing auditory image for me over headphones:

I would compare this tuning process to making an individual pair of prescription glasses. Hopefully, with advances in consumer audio it will become much easier some day.

Time-Domain Tuning (optional)

Since I really enjoy what the DSP filters produced by Acourate do to my speakers, I asked myself whether it's worth trying to apply Acourate to the headphone chain. After all, we are simulating speakers in a room, so why not try applying a room correction package to this simulation?

I did not plan on doing acoustic measurements at the ear entrance, as my equipment simply lacks the required precision. Instead, I decided to do the measurements at the analog electrical boundary by tapping into the headphone wires using my T-Cable. I temporarily bypassed the equalizer since it's linear phase and its settings are asymmetric. From the measurements I found that the left and right outputs are almost identical, as I was expecting them to be on a proper audio chain. So, both the digital and the electrical analog chains are already almost perfect—is there really any room for improvement?

I ran Acourate's correction macros on these measurements, and it still managed to do something to the shape of the impulse response. Below is the difference—I think Acourate made it look more like a minimum-phase response; notice the deeper "sagging" of the amplitude after the initial peak:

Did this correction change anything? Not much in general, however percussion instruments started sounding a bit different, and I would say they moved towards the more "natural" side. I loaded these corrections into a convolver plugin—adding it increased the latency, but not significantly, since I already had the linear phase EQ plugin in the chain. Now I've got a feeling that I'm really done with the setup.

Putting it all Together

For completeness, here is the full processing chain I use for headphone tuning. I run it in Ardour together with the DSP filters for the speaker tuning:

Note that I marked how the sections of the chain conceptually relate to simulated speaker reproduction. As I noted previously, instead of multiple plugins for the "Room" part I could potentially use just one good reverb plugin, but I haven't yet found an affordable one which would fit my needs.

Despite using lots of plugins, the chain is not heavy on computation: Ardour takes no more than 15% of CPU on my 2015 Mac mini (as measured by Activity Monitor), leaving the fan silent (and recall that Ardour also runs the speaker correction convolver).

Conclusions

Compared to setting up speakers, which was mostly done "by the book," setting up headphones required more experimenting and personal tweaking, but I think it was worth it. It would be interesting to do a similar setup for IEMs, although doing the measurements for aligning with the speaker response will certainly be challenging in that case.

Around the time I started doing these experiments, Apple announced support for Atmos and binaural rendering over headphones in their Music app. I tried listening to some Atmos-remastered albums over headphones on an iPhone. The impression was close to what I have achieved for stereo recordings with my headphone setup—the same feeling of natural instrument placement, a generally wider soundstage, and so on—definitely superior to regular stereo playback over headphones. I was impressed that Apple and Dolby have achieved this effect over non-personalized headphones! On the other hand, expecting every album to be remastered in Atmos is unrealistic, so it's good that I'm now able to listen to the original stereo versions on headphones with the same feeling of "presence" that Apple provides in Atmos remasters.

Sunday, August 22, 2021

Automatic Gain Control

This post is based on Chapter 11 of the awesome book "Human and Machine Hearing" by Richard Lyon. In this chapter Dr. Lyon goes deep into mathematical analysis of the properties of the Automatic Gain Control circuit. I took a more "practical" route instead and did some experiments with a model I've built in MATLAB based on the theory from the book.

What Is Automatic Gain Control?

The family of Automatic Gain Control (AGC) circuits originates from radio receivers, where it is needed to reduce the amplitude of the voltage received from the antenna when the signal is too strong. In the old days the circuit used to be called "Automatic Volume Control" (A.V.C.), as we can see in a book on radio electronics design ("The Radiotron Designer's Handbook"):

However, the earliest AGC circuits can be found in the human sensory system—they help to achieve the high dynamic range of our hearing and vision. In the hearing system the cochlea provides the AGC function.

The goal of AGC is to maintain a stable output signal level despite variations in the input signal level. The stability of the output is achieved by creating a feedback loop which "looks" at the level of the output signal, and makes necessary adjustments to the input gain of the signal at the entrance to the circuit. This is how this can be represented schematically:

Note that the "level" is a somewhat abstract property of the signal. What we need to understand is that the "level" can be tied, based on our choice, either to the amplitude of the signal or to its power, and expressed either on a linear or on a logarithmic scale. There is also a somewhat arbitrary distinction between the "level" and the "fine temporal structure" of the signal. If we consider a speech signal, for example, it obviously has a high dynamic range due to the fast attacks of consonant sounds. However, in AGC we don't examine the signal at such a "microscopic" level. There is always a time constant which defines the speed of level variations that we want to preserve in the output signal.

We want the gain changes to be bound to the "slow" structure of the output signal, otherwise we will introduce distortion. The AGC Loop Filter is used to express the distinction between "fast" and "slow" by smoothing the measured level. The simplest way of smoothing is applying a low-pass filter (LPF). Although it's common to define an LPF in terms of its cut-off frequency, another possible way is to use the "time constant," which determines the cut-off frequency.
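For a one-pole smoother of the form used later in the model, the time constant, the smoothing coefficient, and the cut-off frequency are related as follows (a sketch with the values used below):

fs    = 20000;                    % sample rate of the model
tau   = 0.010;                    % 10 ms time constant
alpha = 1 - exp(-1 / (fs * tau))  % coefficient of y(n) = y(n-1) + alpha*(x(n) - y(n-1)), ~0.005
fc    = 1 / (2 * pi * tau)        % equivalent cut-off frequency, ~16 Hz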

AGC vs. Other Systems

There are two classes of systems that are similar to AGC in their function. The first class is comprised of systems controlled by feedback—these systems are studied extensively by the engineering discipline called "Control Theory". Schematically, a feedback-controlled system looks like this:

The big difference from AGC is that there is a "desired state" of the controlled system—this is what the control system is driving it towards. For example, in an HVAC system the reference is the temperature set by the user. In contrast, nobody sets the reference for an AGC circuit; instead, for any input signal that doesn't change for some time, the AGC circuit settles down at some corresponding output level which is referred to as "equilibrium."

Another class of systems similar to AGC are Dynamic Range Compressors, or just "Compressors," frequently used when recording from a microphone or an instrument in order to achieve a more "energetic" or "punchy" sound. The main difference between a compressor and an AGC is that compressors normally use the input signal for controlling their output—this approach is called "feed-forward." The design goal of a compressor is different from that of an AGC, too—since it is used to "energize" the sound, adding harmonic distortion is welcome, whereas the design goal in AGC is to keep the level of distortion to a minimum.

AGC Analysis Framework

The schematic representation of the AGC we have shown initially isn't very convenient for analysis since the "controlled system" is completely a black box. Thus, the book proposes to split the controlled system into two parts: the non-linear part, which applies compression to the input signal, and the linear part, which simply amplifies the compressed signal in order to bring it to the desired level. Note that since the compression factor is defined to be in the range [0..1], the compression always reduces the level of the input signal, sometimes considerably. Below is the scheme of the AGC circuit that we will use for analysis and modelling:

We label the outputs from the AGC blocks as follows:

  • a is the measured signal level;
  • b is the filtered level;
  • g is the compression factor.

In the book, Dr. Lyon uses the following function for calculating g from b:

g(b) = (1 - b / K)^K, K ≠ 0

Below are graphs of this function for different values of K:

As the book says, the typical values used for K are +4 or -4.
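A small sketch to reproduce such graphs for the typical values of K:

b = linspace(0, 1, 200);
plot(b, (1 - b / 4) .^ 4, b, (1 - b / -4) .^ -4);
legend('K = +4', 'K = -4');
xlabel('b (filtered level)'); ylabel('g (compression factor)'); grid on;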

AGC in Action

In order to provide a sense of the AGC circuit in action, I will show how the outputs from the blocks of the AGC change when it acts on an amplitude-modulated sinusoid. I used the same parameters for the input signal and the AGC circuit as in the book and obtained results which look very similar to Figures 11.9 and 11.10 there.

The input signal is a sinusoid of 1250 Hz considered over a period of 1000 samples at a sampling rate of 20 kHz (that's 50 ms). Below are the input and the output signals, shown on the same graph:

And this is how the outputs from the AGC loop blocks a, b, and g change:

The level analyzer is a half-wave rectifier, thus we see only positive samples of the output signal as the a variable. This output is smoothed by an LPF with a cut-off frequency of about 16 Hz (a 10 ms time constant—that's one fifth of the modulation period), and the result is the b variable. Finally, the gain factor g is calculated using the compression curve with K = -4. The value of g never exceeds 1, thus to be able to see it on the graph together with a and b we have to "magnify" it. The book (and my model) uses a gain of 10 for the linear part of the AGC (designated as H) to bring the level of the output signal after compression on par with the level of the input signal.

My implementation of the AGC loop in MATLAB is rather straightforward. I decided to take advantage of "function handles," which are very similar to lambdas in other programming languages. The only tricky part is setting the initial state of the AGC loop. Due to the use of feedback, the values for the very first iteration, where the output isn't available yet, need special treatment. What I've found after some experimentation is that we can start with zeroes for some of the loop variables and derive the values of the other variables from them. Then we need to "prime" the AGC loop by running it on a constant level input. After a number of iterations, the loop enters the equilibrium state. This is what the loop looks like:

function out = AGC(in, H, detector, lpf, gain)
    % Column indices into the output matrix are shared via globals.
    global y_col a_col b_col g_col;
    out = zeros(length(in), 4);
    % First iteration: start with a zero filtered level and derive the rest from it.
    out(1, b_col) = 0;
    out(1, g_col) = gain(out(1, b_col));
    out(1, y_col) = H * out(1, g_col) * in(1);
    out(1, a_col) = detector(out(1, y_col));
    for t = 2:length(in)
        y = H * out(t - 1, g_col) * in(t);  % apply the previous gain factor to the input
        a = detector(y);                    % measure the output level
        b = lpf(out(t - 1, b_col), a);      % smooth it with the loop filter
        g = gain(b);                        % derive the new compression factor
        out(t, y_col) = y;
        out(t, a_col) = a;
        out(t, b_col) = b;
        out(t, g_col) = g;
    end
end

And these are the functions for the half-wave rectifier detector and the LPF:

hwr = @(x) (x + abs(x)) / 2;

% alpha is the time constant of the LPF filter
lpf = @(y_n_1, x_n) y_n_1 + alpha * (x_n - y_n_1);

In order to be able to visualize the inner workings of the loop, the states of the intermediate variables are included in the output as columns.
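For reference, here is a usage sketch assembling the pieces with the parameters described above; the modulation parameters, the priming signal, and the column index assignments are my assumptions:

global y_col a_col b_col g_col;
y_col = 1; a_col = 2; b_col = 3; g_col = 4;

fs = 20000; N = 1000;
n  = (0:N-1)' / fs;
in = (1 + 0.5 * sin(2*pi*20*n)) .* sin(2*pi*1250*n);  % amplitude-modulated sinusoid

alpha    = 1 - exp(-1 / (fs * 0.010));       % ~16 Hz LPF, 10 ms time constant
detector = @(x) (x + abs(x)) / 2;            % half-wave rectifier
lpf      = @(y_n_1, x_n) y_n_1 + alpha * (x_n - y_n_1);
gain     = @(b) (1 - b / -4) .^ -4;          % compression curve with K = -4

prime = 0.5 * ones(2000, 1);                 % constant-level input to let the loop settle
out = AGC([prime; in], 10, detector, lpf, gain);   % H = 10
out = out(length(prime) + 1:end, :);         % drop the priming part
plot(n, in, n, out(:, y_col)); grid on;
legend('input', 'output');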

Since the resulting compression is relatively strong (the gain factor reaches at most the value of about 0.1), we use the compensating gain H = 10. We can also see that the gain factor g shows a dependency on the input level. This leads to non-linearities in the output. Using MATLAB's thd function from the Signal Processing Toolbox we can actually measure them pretty easily on our sinusoid. Just as a reference, this is what the thd function measures and plots for the input sinusoid (only the 2nd and 3rd harmonics are shown):

And this is what it shows for the output signal from our simulation:

As we can see, there is a non-negligible 2nd harmonic being added due to non-linearity of the AGC loop.

Experiments with the AGC Loop

What happens if we change the level detector from a half-wave rectifier to a square law detector? In my model we simply need to replace the detector function with the following:

sqr_law = @(x) x .* x;

Below are the resulting graphs:

What changes dramatically here is the amount of compression. Since the square law "magnifies" differences between signal levels, high-level signals receive significant compression. As a result, I had to increase the compensation gain H by 5 orders of magnitude (that's 40 dB).

The behavior of the gain factor g still depends on the level of the input signal, so the circuit still exhibits non-linear behavior. By looking at the THD graph we can see that in this case the THD is lower than that of the half-wave rectifier AGC loop, and the dominating harmonic has changed to the 3rd:

Another modification we can try is changing the time constant of the LPF. If we make the filter much slower, the behavior of the gain factor g becomes much more linear, however the output signal becomes even less stable than the input signal:

On the other hand, if we make the AGC loop much "faster" by shifting the LPF corner frequency upwards, it suppresses the changes in the input signal very well, but at the cost of highly non-linear behavior of the gain factor g:

Can we achieve the higher linearity of the square law detector while still using the half-wave rectifier?

Multi-Stage AGC Loop

The solution dates back to an invention by Harold Wheeler, who used vacuum tube gain stages for the radio antenna input. By using multiple stages, the compression can be increased gradually, and a stage with lower compression introduces lower distortion. If we recall our formula for the compression gain (making K an explicit parameter this time):

g(b, K) = (1 - b / K)^K, K ≠ 0

We can see that by multiplying several functions that use a smaller value of K we can achieve an approximate equivalent of a single function with a bigger (in absolute value) K:

(g(b, -1))^4 ≈ g(b, -4)

Actually, if we change the definition of g so that the divisor of b stays at the target K while each stage only applies a fraction of the exponent, the product of the stages gives exactly the same function.
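A quick numerical check of this statement (a sketch; the stage count of four matches the example above):

b = linspace(0, 1, 100);
per_stage = (1 - b / -4) .^ -1;     % divisor kept at K = -4, exponent -1 per stage
cascade   = per_stage .^ 4;         % four identical stages multiplied together
single    = (1 - b / -4) .^ -4;     % g(b, -4)
max(abs(cascade - single))          % zero up to rounding errors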

We can also vary the time constants of the corresponding LPF filters. This is how this approach looks schematically:

Each "slower" outer AGC loop reduces the dynamic range of the output signal, which reduces the amount of compression that the inner, "faster" loops need to apply to abrupt changes, and thus keeps the distortion low.

I used 3 stages with the following LPF filters:

This is how the input, the output, and the loop variables look:

We can still maintain a low compensating gain H in this case, and the behavior of the gain factor g is now more linear, as we can see on the THD graph:

And here is the comparison of the outputs of the initial single-stage approach with the square law detector and the multi-stage approach:

The multi-stage AGC yields a slightly less compressed output, however it has fewer "spikes" on level changes.

Conclusions

It was interesting to explore the automatic gain control circuit. I've uploaded the MATLAB "Live" script here. I hope I can reimplement my MATLAB code in Faust to use it as a filter for real-time audio. AGC is very useful for virtual conference audio, as not all VC platforms offer gain control for participants, and when attending big conferences I often need to adjust the volume.