Saturday, March 27, 2021

Headphone Virtualization For Music Reproduction

This post grew out of a presentation I gave to my colleagues. Here I try to explain why making commercial stereo recordings sound as good and natural on headphones as they can sound on well-tuned stereo speakers is not an easy task. The topic has much in common with the popular topics of "immersive" or "3D sound" on headphones, because essentially we want to reproduce a recording in a way that makes listeners believe they are actually out there with the performance and forget that they even have headphones on. However, this post deals specifically with the reproduction of commercial stereo recordings and does not touch on AR/VR.

Reproduction on Speakers

First we need to provide some context about speaker playback. Let's start with the simplest case: a mono speaker located in a non-anechoic (that is, regular) room. Imagine you are listening to some sound, for example pink noise, played over this speaker. Although it's a very simple case, it demonstrates several important things. The physical sound (acoustic waves) is received by the sensory system of our body—mainly the ears—and processed by our brain, and as a result a perception (or auditory image) of the physical source is formed in our mind. We also see the speaker, and the perceived sound source becomes anchored, or localized, to the visual image of the speaker.

This auditory perception has a lot of associated attributes in our mind. Some of them originate in the sound that is being reproduced by the speaker, like its loudness and timbre. Some of them are specific to the relative position of the speaker and the listener, and to the properties of the room. Humans use both ears (binaural listening), and our brain manages to recognize the source in both audio inputs and derive the difference in sound levels and times of arrival (known as Interaural Level and Time Difference, ILD and ITD) for roughly locating it in the horizontal plane of our mind's eye.

Moreover, in a non-anechoic room there are reflections from the walls and other objects, and the brain extracts information from the ILD and ITD of the reflected sounds to help us estimate the distance to the sound source and even the size of the room.

Moving to reproduction over two speakers makes it possible to provide even more cues to the brain and create imaginary sound sources that are positioned away from the actual sound sources—the speakers. However, with two speakers the acoustical picture becomes more complicated. Obviously, each ear receives sound from both speakers and from wall reflections. With a good stereo setup the listener can forget about the existence of the speakers and completely disentangle them from the sound they are producing.

Over the long history of the development of stereo recording and playback, audio engineers have learned how to use the stereo speaker arrangement to create phantom sources located anywhere in the horizontal plane between the speakers and even outside them. As a matter of fact, most commercially available stereo recordings were produced for playback over speakers.

The use of multi-channel systems, especially with height channels, helps to push the envelope even further and produce phantom sources anywhere around the listener. Unlike a stereo setup, where the perception of phantom sources can be quite sensitive to the listener's location, multi-channel systems handle even multiple listeners with ease. Anyone who has had a chance to visit a modern movie theater has experienced the wonders of this technology.

HRTF

However, even on stereo systems some advanced sound engineers manage to create phantom sources that are located above the speakers, to the side of the listener, or in close proximity to them. These effects are achieved by applying frequency filtering which imitates the physical filters of the ear pinnae and the head. Some examples of tracks that I personally like are "Edge of Life" by Recoil and "One Step Behind" by Hol Baumann.

This brings us to the topic of HRTF (Head-Related Transfer Function). It is used a lot in the context of AR/VR; however, for our particular topic what we need to understand is that there exist two filters. The first is the physical filter located between a sound source and the eardrum: the combination of the torso, head, and outer ear. It transforms any external sound in a way that greatly depends on the location of its source relative to the ear.

The second filter exists in our auditory system. It is quite complex: it uses information arriving at both ears, visual cues, and our learned experience of living with the physical filter of our body. Its goal is to "undo" the effect of the body filter, restoring the original timbre of the sound source, and to use the information from the body filter for locating the sound source.

A simple and efficient demonstration of this filter at work, as pointed out by S. Linkwitz, is turning one's head from side to side while listening to music. Although the sound that reaches one's eardrums changes dramatically, the perception of the timbre remains stable and the sound source just changes its position in the auditory image. However, the filter of the auditory system doesn't restore the timbre completely. If you compare the auditory image of the noise from ocean waves as heard facing them, and then with your back to them, the latter sound will noticeably lack the boost of high frequencies that our ears' pinnae add.

It is important to note that due to the asymmetry of human bodies the physical filters for the left and right ears are different, and so are the auditory system filters that counteract them. This asymmetry plays an important role, along with ILD, ITD, and room reflections, in locating sound sources and placing them correctly in the auditory image. As C. Poldy notes in his tutorial on headphones, "the interaural differences are unique for each individual and could not be a characteristic of the sound source." This allows humans (and other creatures) to derive the direction of a sound without rotating their heads.

A very simplified model of HRTF filters at work (after D. Griesinger) is as follows:

The "Adaptive AGC" block helps to restore alterations of frequency response due to environmental conditions. This is similar to "auto white balance" function of human's vision system. It helps to recover the natural timbre of familiar sources which are altered, for example, by closely placed reflective surfaces.

Reproduction on Headphones

Now we put headphones on—what happens? Because the headphone drivers are located close to the ears, or even in the ear canal, the natural physical filter is partially bypassed and partially altered due to changes in the ear's physics, for example, a blocked ear canal, or new resonances added by the presence of ear cups around the ear. Left and right headphone speakers are usually tuned to be symmetric. The combination of these factors brings misleading cues to the auditory system, and it can no longer use the localization mechanisms beyond those relying on simple interaural level differences. As a result, the auditory image "resets" to an "inside the head" sensation.

Another difference from stereo speaker playback is that in headphones the left and right channels of the recording do not "leak" to the contralateral ears. This is a remarkably good property of headphone playback and it is used a lot for creating immersive experiences; however, it deviates from the reproduction setup that stereo recordings are created for. Some recording techniques and artificial effects that are used for creating a wide auditory scene on stereo recordings inevitably stop working when played over headphones.

There exist several known approaches for bringing headphone playback closer to speaker reproduction. I must note that some of them are specific to stereo music reproduction—they are not needed for binaural recordings and binaural renderings of multi-channel and object-based audio programs.

Crossfeed

This is the technique that I explored a lot in the past; see my old posts about the Redline Monitor plugin and the Phonitor Mini. Crossfeed is based on adding slightly delayed and attenuated copies of the sound from the opposite channel to the direct channel. It is based on a simple spherical head model.

Adding a delayed copy of a signal to itself leads to comb filtering—it also occurs naturally in speaker playback and is likely taken into account by the brain when estimating distances between audio sources. My opinion is that comb filtering should be kept to a minimum to avoid altering the timbre of the sound. For music playback I would prefer the least amount of comb filtering, even if it results in less externalization over headphones.
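
To make the idea concrete, here is a minimal crossfeed sketch in plain C++ (not the algorithm of any particular plugin): each output channel receives the direct signal plus an attenuated, slightly delayed, and low-pass-filtered copy of the opposite channel, roughly imitating the leakage around a spherical head. The delay, gain, and filter values are only ballpark figures.

#include <vector>
#include <cstddef>

struct Crossfeed {
  // Roughly 0.3 ms of interaural delay at 48 kHz, -6 dB level, gentle low-pass.
  static constexpr size_t kDelaySamples = 14;
  static constexpr float kGain = 0.5f;
  static constexpr float kLowpass = 0.3f;  // one-pole smoothing coefficient

  std::vector<float> dlyL = std::vector<float>(kDelaySamples, 0.0f);
  std::vector<float> dlyR = std::vector<float>(kDelaySamples, 0.0f);
  size_t pos = 0;
  float lpL = 0.0f, lpR = 0.0f;

  void process(float* left, float* right, size_t n) {
    for (size_t i = 0; i < n; ++i) {
      // Read the delayed samples, then store the current ones into the delay lines.
      float dL = dlyL[pos];
      float dR = dlyR[pos];
      dlyL[pos] = left[i];
      dlyR[pos] = right[i];
      pos = (pos + 1) % kDelaySamples;
      // Low-pass the leaked signal: head shadowing removes high frequencies.
      lpL += kLowpass * (dL - lpL);
      lpR += kLowpass * (dR - lpR);
      // Mix: direct channel plus the attenuated, delayed, filtered opposite channel.
      float outL = left[i] + kGain * lpR;
      float outR = right[i] + kGain * lpL;
      left[i] = outL;
      right[i] = outR;
    }
  }
};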

Multi-channel Rendering

Rendering of multi-channel audio over headphones can be based on the same principle as crossfeed, but with a more realistic head model, since it also needs to take into account the natural suppression of high frequencies caused by the pinnae of the ears. It is likely that a binaural renderer for multi-channel audio relies on more realistic HRTFs. For example, below are the HRTF filters used by my Marantz AV7704 when playing a 5.1 multi-channel program into the headphone output in "Virtual" mode:

An interesting observation is that the center channel is rendered using an identity transfer function, although normally a frontal sound source will be affected by HRTF, too.
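
Setting this particular device aside, the underlying principle of binaural rendering is simple: each channel of the program is convolved with a pair of head-related impulse responses (left ear and right ear) measured for that channel's direction, and the results are summed into the two headphone feeds. A bare-bones sketch of that principle (plain, non-partitioned convolution; this is an illustration, not the AV7704's actual algorithm):

#include <vector>
#include <cstddef>

using Samples = std::vector<float>;

// Plain FIR convolution, adequate for short HRIRs.
static Samples convolve(const Samples& x, const Samples& h) {
  Samples y(x.size() + h.size() - 1, 0.0f);
  for (size_t i = 0; i < x.size(); ++i)
    for (size_t j = 0; j < h.size(); ++j)
      y[i + j] += x[i] * h[j];
  return y;
}

struct Hrir { Samples left, right; };  // impulse-response pair for one direction

// channels[i] is the i-th speaker feed, hrirs[i] the HRIR pair for its direction.
void renderBinaural(const std::vector<Samples>& channels,
                    const std::vector<Hrir>& hrirs,
                    Samples* outLeft, Samples* outRight) {
  for (size_t c = 0; c < channels.size(); ++c) {
    Samples l = convolve(channels[c], hrirs[c].left);
    Samples r = convolve(channels[c], hrirs[c].right);
    if (outLeft->size() < l.size()) outLeft->resize(l.size(), 0.0f);
    if (outRight->size() < r.size()) outRight->resize(r.size(), 0.0f);
    for (size_t i = 0; i < l.size(); ++i) (*outLeft)[i] += l[i];
    for (size_t i = 0; i < r.size(); ++i) (*outRight)[i] += r[i];
  }
}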

The graphs above do not reveal how the simulation of acoustic leakage between speakers affects the output signal. On the graphs below, the test signal is played simultaneously into the front left and front right channels. In the time domain we see a delayed signal from the opposite channel (the ETC is shown for clarity):

And in the frequency domain this unsurprisingly causes ripples to appear:

The headphone virtualizer in AV7704 doesn't go beyond simulating acoustic leakage and directional filtering. However, there is yet another big thing that could be added.

Reverberation

The rooms that we have at home rarely have extensive acoustic treatment similar to studios. Certainly, when setting up and tuning a speaker system in a room I try to minimize the impact of reflections during the first 25 ms or so—see my post about setting up the LXmini in a living room. However, such a room is still "live" and has a long reverberation tail. The latter is obviously missing when playing over headphones. A slight amount of artificial reverb with controlled delay time and level helps to liven up headphone playback and add more "envelopment", even for a stereo recording.
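
As a crude illustration of the idea (a placeholder for a proper reverb algorithm), one can add a low-level feedback delay to each channel, with different delay lengths so that the added tails are decorrelated between the ears; the delay lengths and gains below are arbitrary:

#include <vector>
#include <cstddef>

struct SimpleTail {
  std::vector<float> buf;
  size_t pos = 0;
  float feedback;
  float mix;

  SimpleTail(size_t delaySamples, float fb, float mx)
      : buf(delaySamples, 0.0f), feedback(fb), mix(mx) {}

  void process(float* x, size_t n) {
    for (size_t i = 0; i < n; ++i) {
      float delayed = buf[pos];
      // Recirculate: the delayed sound decays by `feedback` on each pass.
      buf[pos] = x[i] + feedback * delayed;
      pos = (pos + 1) % buf.size();
      // Add the tail at a low level so the direct sound stays dominant.
      x[i] += mix * delayed;
    }
  }
};

// Example: ~23 ms and ~29 ms delays at 48 kHz, gentle decay, -18 dB mix.
SimpleTail tailLeft(1103, 0.4f, 0.125f);
SimpleTail tailRight(1409, 0.4f, 0.125f);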

The standard LEDE design of audio studios also allows for some diffuse sound coming from behind the listener. This sound, which is decorrelated from the direct sound of the speakers, helps to enhance the clarity of the reproduction. In fact, the more it is decorrelated, the better, since that minimizes comb filtering.

Headphones Equalization

These days measuring headphones is a popular hobby among tech-savvy audiophiles. What these measurements show is that no two models of headphones are tuned the same way. There are well-known "recommended" target curves, like the Harman target curve or the diffuse-field target curve, which strive to make the sound pressure delivered to the microphones of a head and torso simulator resemble the sound pressure they would receive in a room with a lot of random reflections. However, each designer tends to bring in some "voicing" to stand out from the crowd, and as a result one might need to go a long way to find headphones that satisfy their musical taste. I guess, if the customer's ears and body have dimensions similar to those of some good headphone designer, the customer could be quite happy with the tuning.

I had some fun trying audio plugins for cross-tuning headphones to make them sound similar to other models; however, the outcome of these experiments was still somewhat unsatisfying. The only equalization which seems to be useful is the one which ensures that the headphones deliver a flat frequency response to the eardrums. This is a "ground zero" equalization on top of which one can start layering HRTFs and preference tuning curves.

One problem with trying to achieve flat equalization by means of plugins is that the measurements they use were taken on a head and torso simulator and don't take into account how the headphones interact with my own ears, thus the resulting tuning is not flat. It's not even balanced correctly, since my ears are not symmetric. It's very easy to demonstrate this by playing over headphones a mono signal of banded tone bursts of chirps spanning the audible range—the bursts move arbitrarily from left to right. This almost doesn't occur when playing the same signals over a tuned pair of stereo speakers, because their sound passes through the "outer" HRTF filter—the body—and the auditory system can find a matching pair of HRTFs for compensation. When using headphones, a matching pair of HRTFs cannot be found, thus no compensation occurs.

This is actually a serious problem, and a lot of research related to HRTFs is devoted to finding ways of deriving a personalized HRTF without physically taking the subject into an anechoic chamber to measure it directly. However, for simulating stereo speakers, knowing the full set of HRTFs (for sources in any direction) is not required. Still, some degree of personal headphone equalization is needed to achieve proper centering of mono images and to place the virtual speakers in front of the listener in the horizontal plane.

Head Tracking

There is another way of dealing with the lack of personal headphone equalization. Our hearing system takes a lot of cues from other sensory systems—visual, motion, the sense of vibration—and from higher levels of the brain, all to compensate for missing or contradictory cues received by the ears. By changing the sound according to head movements, e.g. using some generic HRTFs, we can engage our adaptation mechanism and start making sense of the changes they produce. Obviously, using a person's own HRTF would be ideal; however, providing auditory feedback for head movements relies on the ability of our brain to learn new things that are useful for survival.

Gaming-oriented headsets with head tracking, e.g. the Audeze Mobius, have been available for a long time already. And lately, mass consumer-oriented companies like Apple have also adopted head-tracking technology for more realistic multi-channel audio reproduction over headphones, and a lot of other companies will undoubtedly follow suit.

What's Next?

I'm going to discuss how headphone virtualization is implemented in Waves Nx, and also my DIY approach based on D. Griesinger's ideas.

Saturday, January 23, 2021

Teensy Project: Talking ABC

As I had mentioned in my previous post, I was intending to build a talking Russian ABC for my daughter. It took me a lot of time to complete this project, and finally it's done:

This was an exciting, if somewhat exhausting, effort, and I've learned a couple of things along the way. Making this ABC myself also made me realize just how much complexity we take for granted in the everyday things that surround us. Talking toys these days cost $19–$39 and we consider them "cheap stuff." However, behind each of them there are likely days if not weeks of experimenting, designing, and testing. It's only thanks to mass production and to outsourcing manufacturing to China that we can enjoy them at such low cost.

The Design

In a nutshell, the design of the ABC toy is pretty obvious: there is an input (buttons), an output (speaker), and a microcomputer (Teensy) which binds things together. After my experiments with various audio options for Teensy, I settled on using the smaller version of Teensy (called 4.0) with the Audio Shield, which serves both as a DAC and as an SD card host, and the "Noisy Cricket" amplifier to drive a single 0.5 W speaker. The ABC is a standalone toy, so it must run on a battery. I purchased a 2200 mAh, 3.7 V Li-Poly rechargeable battery and a charging board for it from Sparkfun. That's all the electronics involved.

As for passive components—this toy needs buttons—a lot of them. The Russian alphabet has 33 letters, and I also needed 10 buttons for numbers, and 2 buttons for changing the mode. The ABC either pronounces the name for a letter, or the sound it stands for, along with the word on the picture:

In total, that's 45 pushbuttons. Finally, I needed a toggle switch to turn the toy on and off, and two LEDs: one to show that it's turned on, and another to show that the battery is charging. Charging is done via a micro-USB port. I've added another micro-USB port to extend the USB port of Teensy so it can be reprogrammed if needed without removing the back cover.

The number of pushbuttons used didn't allow wiring each one of them individually. Instead, I organized them into a grid. This is a somewhat crude schematic of the toy:

I'll explain how the pushbutton grid works in a dedicated section. Physically the toy is built like a big but slim rectangular box with the front panel hosting all the components.

I used two identical ABS sheets for the front and the back panels. The frame is wooden and is attached permanently to the front panel. The toy is sturdy, if a bit heavy. The ABS sheets are black, so to make them look friendly for a child I covered them with self-adhesive films and some decals. The film also covers the holes and the heads of the screws used to attach the components.

Input Grid

There are 45 buttons to monitor. Monitoring each one of them individually would require the same number of digital input pins. Although Teensy 4.1 could potentially handle that, I was using the 4.0, and moreover, some of its pins were reserved for communicating with the Audio Shield board, leaving only about 15 for handling the buttons. Thus, there was a need for some multiplexing. The idea is that we don't try to catch the pressing of each button at all times, but rather query groups of them at periodic intervals. If the intervals are short, the discrete nature of querying is not noticeable to humans.

This is the schematic I've ended up with:

I use a 7x7 grid connecting digital inputs and outputs of Teensy. We go row by row, setting the output level to "HIGH" and checking the signal level on each column. In order to minimize false triggering by static electricity, each input is connected to ground via a pulldown resistor. This works like a charm. The monitoring code is straightforward:

// Pin numbers used for outputs (rows) and inputs (columns):
const int outs[] = { ... };
const int ins[] = { ... };

// Number of elements in a statically sized array.
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

void setup() {
  // Initial configuration: all row outputs driven low; columns are plain
  // inputs because each of them is tied to ground via an external pulldown.
  for (unsigned int out = 0; out < ARRAY_SIZE(outs); ++out) {
    pinMode(outs[out], OUTPUT);
    digitalWrite(outs[out], LOW);
  }
  for (unsigned int in = 0; in < ARRAY_SIZE(ins); ++in) {
    pinMode(ins[in], INPUT);
  }
}

void loop() {
  // Scan the grid row by row: drive one row high, read every column,
  // then bring the row back low before moving on to the next one.
  for (unsigned int out = 0; out < ARRAY_SIZE(outs); ++out) {
    digitalWrite(outs[out], HIGH);
    delay(10);
    for (unsigned int in = 0; in < ARRAY_SIZE(ins); ++in) {
      if (digitalRead(ins[in])) {
        // Keypress detected at row `out`, column `in`.
        break;
      }
    }
    digitalWrite(outs[out], LOW);
    delay(10);
  }
}

The actual code is a bit more complex due to the need to avoid restarting the sound if the button has been accidentally pressed twice. Full code for this project is published here on GitHub.
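
For illustration only (the real logic lives in the repository), the "don't restart the sound" behavior can be sketched as remembering which key started the current sound and ignoring a repeated press of the same key for a short lockout period; startSound() here is a hypothetical stand-in for the actual playback call:

void startSound(int row, int col);  // hypothetical helper, defined elsewhere

int lastRow = -1, lastCol = -1;
unsigned long lastPressMs = 0;
const unsigned long kRepeatLockoutMs = 500;

void onKeyDetected(int row, int col) {
  unsigned long now = millis();
  bool sameKey = (row == lastRow) && (col == lastCol);
  if (sameKey && (now - lastPressMs) < kRepeatLockoutMs) {
    return;  // treat it as contact bounce or an accidental double press
  }
  lastRow = row;
  lastCol = col;
  lastPressMs = now;
  startSound(row, col);
}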

Attaching Audio Shield to Teensy

The Audio Shield is designed to cover all the pins of Teensy; however, it doesn't actually use all of them. So instead of soldering a row of female pin receptacles to the audio shield and using full rows of male pins on Teensy, I've ended up with the following arrangement:

I soldered male pins to the audio shield and cut off those unused by it. I soldered angled male pins to the contact holes on Teensy above the removed pins. I used plastic shims on the angled pins to make Teensy "float" above the shield. I called the resulting design "The Dreadnought" thanks to the gun-like pins on both sides of the board. There is also a double "tail" of pins at the back: the upper row is soldered into the holes, for providing power to the Noisy Cricket, and the lower row is soldered to the pads on the bottom of Teensy for additional inputs.

This arrangement ended up slimmer than it would be with the usual pairs of female/male pin rows, and fitted, even with some extra space, into the 3/4" height of the toy's internal compartment.

Tuning Audio Output

I tried my best to achieve a "transparent" sound from the toy's speaker; unfortunately, I fell short of that aim due to the natural limits of this speaker. Nevertheless, at least I found a very straightforward way of performing measurements through the entire Teensy / Audio Shield / Noisy Cricket stack, and also a way of quickly doing some DSP tuning using REW. Here are some technical details.

Initially, when thinking about measurements, I was considering Teensy as a regular consumer audio device—output only—which means it must be tested using the so-called "open loop" technique. This involves somehow delivering test signals to the device, playing them, recording the result, and then analyzing it "offline." This is a really tedious process, requiring a lot of experience to iterate quickly.

Another problem with the "open loop" technique is that the playback device and the recording device are both digital, yet unsynchronized, and this often produces artefacts when digitally processing the recording of the test signal due to slight variations between the actual sampling rates.

However, I soon realized that Teensy is actually much more versatile than a regular microcontroller. First, it can act as a USB audio interface (see the details here), which means that the measurement application can work in real time, in a "closed loop" measurement mode which is more productive than "open loop." In theory, with a good I2S audio I/O board connected to Teensy it would be possible to run both playback and recording from a measurement microphone through Teensy. However, the microphone input on the Audio Shield was not designed for acoustic measurements, thus an external audio card is required.

The external audio card needs a way to synchronize its clock with Teensy. Otherwise, as I've mentioned, there is a high chance of getting a skewed measurement. One approach to syncing two USB audio devices is to use the feature of macOS, as I've done previously for Ambeo headset. However, a better way is to utilize the built-in SPDIF output on Teensy. This is the diagram of the measurement loop I've ended up with:

Teensy provides the clock to the RME Fireface and handles playback. The Fireface handles input from a measurement microphone. This arrangement has demonstrated solid correlation in Smaart, which means we are actually measuring the output of the system and can tweak it.

For tweaking, I preferred to use REW. The Teensy Audio Library offers a biquad filter component which accepts raw coefficients, and REW is very handy for generating them. This was my workflow:

  1. Measure the response of the ABC using REW.
  2. Go to EQ dialog. Use "Generic" equalizer mode.
  3. Adjust the target curve and let REW calculate correction filters. If there are too many of them (the biquad component on Teensy allows only 4 stages), disable the excess ones and ask REW to optimize using only the remaining ones.
  4. Save the biquad coefficients to a file for 44.1 kHz sampling rate (only Generic equalizer in REW allows choosing the SR).
  5. Paste the generated biquads into the code, negating the signs of the a1 and a2 coefficients (see the sketch after this list).
  6. Update the sketch on Teensy.
  7. Restart REW since unfortunately the USB audio interface exposed by Teensy resets after reflashing and REW (and any other audio program) loses it.
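
As a minimal sketch of step 5, assuming the Teensy Audio Library's AudioFilterBiquad object (the coefficient values below are placeholders, not the ones used in the toy, and a sine source stands in for the SD playback chain):

#include <Audio.h>
#include <Wire.h>
#include <SPI.h>

AudioSynthWaveformSine  source;      // stand-in for AudioPlaySdWav in the toy
AudioFilterBiquad       eq;          // up to 4 cascaded biquad stages
AudioOutputI2S          i2sOut;      // Audio Shield output
AudioControlSGTL5000    audioShield;
AudioConnection         patch1(source, 0, eq, 0);
AudioConnection         patch2(eq, 0, i2sOut, 0);

void setup() {
  AudioMemory(12);
  audioShield.enable();
  audioShield.volume(0.5);
  source.frequency(440);
  source.amplitude(0.2);
  // One REW "Generic" stage pasted as {b0, b1, b2, a1, a2}; note that the
  // signs of a1 and a2 are flipped relative to what REW exports.
  double stage0[5] = { 0.98, -1.85, 0.88, 1.86, -0.87 };  // placeholder values
  eq.setCoefficients(0, stage0);
}

void loop() {}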

Below are frequency response graphs before and after tuning. The main problem with this speaker / enclosure combination is the dip at about 5 kHz followed by a huge scoop. This makes the overall sound telephone-like, but it's hard to do anything about it:

In time domain, there is noticeable "boominess" in the low end:

Power Consumption

Since it's a battery-powered device, I wanted to make sure that it doesn't run out of battery too quickly. In order to measure power consumption, I first measured the actual voltage provided by the battery when powering Teensy: it was 4.1 VDC. Then I dialed this voltage on a desktop power supply, powered the ABC from it, and checked the current drawn: 125 mA when idle and 150 mA when playing sounds. Given that the battery is rated at 2200 mAh, the toy can run for roughly 14 hours (2200 mAh / 150 mA) of continuous use.

I checked whether Teensy can turn itself off, and found that this is only possible with external circuitry for power control. I didn't consider this in the initial design, so I decided to go without it. In fact, my daughter is disciplined enough to turn the toy off after using it, so there is really no need for this extra circuit.

Conclusions

So far, this was the longest project I have undertaken. Next time, I would likely try to limit the time spent, as seeing no light at the end of the tunnel for a long time lowers your morale. It was a great relief to have this project finished.

The whole idea of using a microcontroller for doing audio automation seems very appealing though. I can see how Teensy can be used in various audio devices. I would also like to use Teensy in some audio processing project, but I need first to figure out how to go beyond the default 44.1 kHz, 16-bit mode for audio processing.

Sunday, October 18, 2020

Audio Output on Teensy 4.x Boards

I remember learning about Teensy for the first time several years ago from a colleague. He was using the Google WALT device for measuring audio latency, and WALT is based on the Teensy-LC board. Back then, this board impressed me with its tiny size while still providing a lot of features. Its processing power is very modest though—Teensy-LC is based on an ARM Cortex-M0+ processor running at 48 MHz.

Recently I started a project of a talking ABC board for my daughter and decided to check what progress Teensy had made. I was very impressed to learn that the latest (4th) version of Teensy employs a much beefier ARM Cortex-M7 processor running at 600 MHz! This board is more powerful than the desktop computer I was using 25 years ago, at a fraction of the cost of that PC, and with the footprint of a USB memory stick.

Note that Teensy is a microcontroller board, which means it doesn't run an operating system. This is what makes Teensy different from a Raspberry Pi, for example. This fact has a lot of advantages: first, Teensy boots instantly; second, all the processing power of its CPU is available to your own app. It also means that the board can't be used for general PC tasks like checking Facebook. However, Teensy can be used for more exciting things, like building your own interactive toy.

In my case I needed Teensy to play an audio clip (the pronunciation of a letter) in response to pressing a button. Sounds easy, right? However, one thing I needed to figure out was how to play audio on Teensy in the first place. What I've learned is that Teensy 4.x offers a lot of ways to do that. In this post I'm comparing various ways of making sound on Teensy.

Teensy 4.0 vs 4.1

Every Teensy generation comes in two flavors: small and slightly bigger. Below is the photo of Teensy 4.1 (top) and 4.0 (bottom):

Both boards use the same processor, which means their basic capabilities are the same. However, the bigger size means more I/O pins are available. Also, it's possible to add more memory to Teensy 4.1 by soldering additional chips to its back side. For my project, the important difference is that Teensy 4.1 has an SD card slot, whereas 4.0 only provides pins for one. I plan to use the SD card for storing sound samples—the board's flash memory is unfortunately too small for them. Storing samples on an SD card also simplifies their deployment, as I can simply write them from a PC.

The audio capabilities of 4.0 and 4.1 are thus the same, so I will be referring to the board simply as "Teensy 4" or "4.x".

Teensy Audio Library

From the programming side Teensy is compatible with the Arduino family of microcontrollers. The same Arduino IDE is used for compiling the code, and the same I/O and processing libraries can be employed.

Teensy also has a dedicated Audio Library which in my opinion is very interesting. The library has a companion Audio System Design Tool which allows designing an audio processing chain really quickly by drag'n'drop, and then exporting it to the Arduino project.

Being a visual tool, the Audio System Design Tool allows exploring the capabilities of the library without the need to go through a lot of documentation to get started. The documentation is built into the tool. The only drawback of the docs is that they are too short, although this is partially compensated by numerous example programs.

The audio capabilities described below are all based on the objects provided by the Teensy Audio Library.

Output Power Requirements

My plan is to use an 8 Ohm, 0.5 W speaker from Sparkfun for audio output in my project. Thus, I'm comparing the output of audio amplifiers using an 8 Ohm resistive load while looking for 2 VRMS output voltage (since sqrt(0.5 W × 8 Ohm) = 2 V, approx. 6 dBV). The goal is to achieve as "clean" an output as possible.

Built-in Analog Output (MQS)

The chip that Teensy 4 is based on offers an analog output solution called MQS, for "Medium Quality Sound" (not to be confused with Mastering Quality Sound, which has the same abbreviation). MQS on Teensy 4 is a proprietary technology of the chip maker (NXP). It allows connecting a small speaker or headphones to the chip pins directly, without any external output network.

Note that revision 3 of the Teensy board had a built-in 12-bit DAC. MQS implements a 16-bit DAC with a small Class-D amplifier. However, neither is very good. To me, the "medium" in MQS is a stretch coined by the marketing department, and perhaps it would be fairer to call it "LQS" for "Low Quality Sound".
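
For reference, a bare-bones sketch that drives the MQS pins with a 1 kHz sine looks like this (the object names are from the Teensy Audio Library, with AudioOutputMQS being, as far as I can tell, the relevant output object; the measurement setup itself is not shown):

#include <Audio.h>

AudioSynthWaveformSine  sine1;
AudioOutputMQS          mqs;   // built-in "Medium Quality Sound" output
AudioConnection         patchL(sine1, 0, mqs, 0);
AudioConnection         patchR(sine1, 0, mqs, 1);

void setup() {
  AudioMemory(10);
  sine1.frequency(1000);
  sine1.amplitude(0.5);
}

void loop() {}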

Let's first take a look at a simple 1 kHz sine wave in time domain:

It definitely looks very jaggy to me. Another problem, detected using a DVM, is a high DC offset: 1.64 VDC. It doesn't show up on the graph because the audio analyzer is AC-coupled. This amount of DC offset can pose a problem for line inputs and even for some speakers.

Another drawback of MQS is that the chip doesn't provide muting for the power-on thump. This can be worked around by adding a relay—after all, it's trivial to control one using a PWM output pin—however, if you have to use external parts anyway, I would recommend using an external DAC instead.

Below are a couple more measurement graphs revealing the shocking simplicity of this output. First, as we can see on the frequency domain graph, there is no output reconstruction filter, so we can see the mirror image of the original 1 kHz tone and its first harmonic between 42–44 kHz, followed by a direct copy. That means the DAC/amp on the chip likely uses a 44.1 kHz sampling rate.

Frequency response in the audible range is rather flat:

When I tried to achieve the required 0.5 W into 8 Ohm I could only squeeze out 1/100 of that (note that the graph is A-weighted):

In my opinion, due to the absence of any filtering, the lack of turn-on click protection, the high DC offset, and the low power, MQS output should only be used during development and testing—it's indeed convenient that a speaker can be attached directly to the board for a quick sound check.

External Output Devices via I2S

Since the built-in analog output has serious limitations, I started looking for external boards. Thankfully, Teensy supports I2S input and output. Teensy actually supports plenty of these interfaces, offering great possibilities for multi-channel audio I/O.

For my project mono output is enough. I tried a couple of inexpensive external boards to check how much the audio output improves compared to the built-in output.

MAX98357A DAC/Amp

I bought a breakout board from Sparkfun to try this IC. The datasheet calls the chip a "PCM Class D Amplifier with Class AB Performance." Note that it's a mono amplifier which either sums its stereo input or uses only one of the two input channels.

Hooking it up to Teensy is extremely easy. One needs to connect the clocks—LRCLK and BCLK—to the corresponding pins on Teensy, then connect the I2S data (OUT1x; I used OUT1A), and of course power, which can also be sourced from Teensy. Then just use the i2s or i2s2 output block in the Audio System Design Tool. There is no volume control on this breakout board; only the amplifier gain can be changed.
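
The software side is equally minimal—compared to the MQS sketch above, only the output object changes (AudioOutputI2S for the first I2S port, AudioOutputI2S2 for the second); a sine source is again used here just as a test signal:

#include <Audio.h>

AudioSynthWaveformSine  sine1;
AudioOutputI2S          i2s1;   // LRCLK, BCLK, and data pin assignments are fixed by the library
AudioConnection         patchL(sine1, 0, i2s1, 0);
AudioConnection         patchR(sine1, 0, i2s1, 1);

void setup() {
  AudioMemory(10);
  sine1.frequency(1000);
  sine1.amplitude(0.5);
}

void loop() {}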

The MAX98357A IC can accept a variety of sampling rates and bit depths; however, Teensy normally produces a 44.1 kHz / 16-bit audio signal. Looking at the white noise output we can see that the MAX IC employs a proper brickwall audio band filter on the DAC side:

The frequency response in the audible range is rather flat:

Another good feature of the IC compared to MQS is proper output muting on power on to prevent pops. The speaker output has almost no DC offset.

As for jitter, the IC seems to employ clever synchronization tricks. Initially, after powering on, the jitter is gross and the noise floor is very high:

However, after 5–15 sec the IC seems to stabilize its input and drastically improve its output quality:

The MAX98357A was able to deliver the required 0.5 W albeit with a 10% distortion (this graph was obtained with the amplifier configured for 12 dB gain):

It's interesting that the 5th harmonic is dominating.

Considering the price of the chip, I would say that the MAX98357A is a good choice if only mono output is needed; it has a lot of advantages over the MQS output.

Audio Adapter Board

Since the times of Teensy 3, its creators have been offering an "audio shield" board which is designed to cover the smaller version of Teensy completely. Due to some changes in pin assignments on Teensy 4, the design of the Audio Adapter Board was updated.

The audio part of the Adapter board is based on the SGTL5000 chip which in addition to ADC/DAC and amplifiers also offers some basic DSP functionality.

The Adapter Board has line input and output, a mic input, and a headphone output. It uses the I2S interface for communicating with Teensy. The board also offers an SD card slot and a controller for it. Note that although Teensy 4 has an on-chip SD card controller, there is no SD card slot on the 4.0 board, and adding one requires soldering a cable to the corresponding pads on the back side of the board, because the pin spacing and overall space are too tight for soldering an SD card socket directly. Thus, for a Teensy-based audio project it might be beneficial to attach the audio shield, as it provides both analog audio I/O and an SD card slot.

One thing many users of this board have noted is that it must be connected to Teensy using very short wires. The reason is the use of an additional (compared to the MAX98357A) high-frequency master clock input (MCLK) which runs at a frequency of several MHz.

The resulting jitter of the DAC is quite low, with jitter components staying more than 94 dB below the carrier signal:

Surely, the SGTL5000 chip is advanced enough to have protection against the power-on thump. The level of distortion is tolerable (since it's a line output, I connected it directly to the analyzer's input):

Note the noise peak at 60 Hz. I'm pretty sure it's the result of insufficient shielding on this board, because the measurement was taken using the differential input of the analyzer, which normally cancels out any EMF noise induced on the probe wires.

The headphone output of the adapter board isn't powerful enough to drive the load required for my project. So in addition to the adapter board an external power amplifier has to be used.

External Analog Amplifiers

I've tested two boards from Sparkfun: a mono Class-D amp, and a classic Class-AB amplifier named "Noisy Cricket". These amplifiers can be connected to the line output of the audio adapter board.

Mono Class-D Amp (TPA2005D1)

This is a low-power IC amplifier for which Sparkfun offers a breakout board. It uses a rather old chip, the TPA2005D1 from Texas Instruments, which advertises 10% THD on its spec sheet.

And indeed it does have a 10% THD+N when driven up to the required output power (the graph is A-weighted):

Note that I tested this chip on its own, providing an input from the audio analyzer and powering it from a bench power supply. Despite being tested under these "laboratory" conditions, the chip didn't show stellar performance. I also tried supplying a differential input from the analyzer, and raising the supply voltage up to the accepted maximum of 5.5 VDC, but this didn't improve its performance.

It's interesting, though, that being an unfiltered Class-D amplifier with a 250 kHz switching frequency, this chip offers enough bandwidth to serve the full range of the QA401 DAC at a 192 kHz sampling rate:

So it seems that there shouldn't be a big difference in audio quality between using the MAX98357A chip via I2S directly and using the TPA2005D1 via the line output of the audio shield.

Noisy Cricket (LM4853)

This is another IC amplifier from TI on a good quality breakout board by Sparkfun, which even includes a volume control. The IC is LM4853 amplifier chip (not just an op-amp). It can work either as a stereo amplifier, or as a mono amplifier in bridged mode.

The spec sheet of the LM4853 shows much better distortion figures than that of the TPA2005D1. I configured the board in mono mode and tested it in the same setup as the TPA2005D1: powered from a bench power supply (at 3.4 VDC) and driven by the QA401 signal generator. The results were much better:

The 3rd harmonic is 50 dB below the carrier level. For my toy project this is good enough.

Looking at the frequency response, we see some roll-off in the bass range, but I'm pretty sure that the speaker I'm going to use can't go that low anyway, so it's not a big deal:

So, Noisy Cricket is a good choice for me. Hopefully I will be able to achieve close to natural voice reproduction on my talking ABC.

Conclusions

Although boards based on Class-D chips are more compact and likely consume less power, when using a speaker of classic cone construction it seems better to use the combination of a DAC and a classic Class-AB amplifier, built from the Audio Adapter Board and the Noisy Cricket.

I'm putting a big rechargeable battery into this talking ABC, so higher power consumption isn't a problem for me. An additional convenience of using the audio shield comes from the fact that it has mounting holes and an SD card slot for storing audio samples.

If a more sensitive speaker could be used which requires less driving power, then an alternative solution is to use Teensy 4.1 which already has the SD card slot on board, and connect the MAX98357A DAC/Amp chip to Teensy's I2S output.


Bonus: Built-in Digital Output—S/PDIF

I have moved this section to the end because this output finds no application in my project. However, it's a new feature of Teensy 4 which also might be useful sometimes.

On previous generations of the board, thanks to the efforts of the Audio Library contributors, it was possible to emit signals in the S/PDIF and ADAT formats programmatically. The nice thing about the hardware support added in Teensy 4 is that it consumes less power and frees the CPU for more interesting tasks.

The hardware S/PDIF output is as simple to use as MQS—it only requires connecting an RCA output to the board pins. This output only supports the Audio CD format: 44.1 kHz, 16-bit. I must note that although the built-in S/PDIF worked for me on Teensy 4.0, on its bigger sibling 4.1 the S/PDIF sampling rate for some reason kept setting itself to 48 kHz, which made it unusable since the Teensy Audio Library doesn't seem to support that rate. Thus, I could only test the built-in S/PDIF on Teensy 4.0.

Apparently, with a digital connection there are no concerns about filtering or non-linearity in the analog domain. One thing I was curious to check was the amount of jitter. I hooked up Teensy 4 to the S/PDIF input of an RME Fireface UCX interface and then used the same J-Test 44/16 test signal generated using REW 5.20. The RME was set to use the S/PDIF clock. I played the same J-Test signal on Teensy and via USB ASIO to be able to compare them. Here is what I've got—the blue graph is from USB, the red one is from Teensy:

As we can see, the output of Teensy has much stronger jitter-induced components around the carrier frequency, whereas there are practically none for RME's own output.

Note that the peaks on the left side (up to 6.5 kHz) are some artefact of using a 16-bit test signal on a 24-bit device (the RME). I tried another DAC (Cambridge Audio DacMagic Plus), another computer, switched from PC to Mac, and tried the 16-bit J-Test sample from the HydrogenAudio forum, but these spikes on the left were always there as long as I was using a 16-bit J-Test signal, and they were completely gone with a 24-bit test signal. I suspect there must be something in the process of expanding a 16-bit signal to 24 bits that makes them appear.