Saturday, June 15, 2024

LXmini Desktop Version (LXdesktop)—Part I: Build

I would like to make a couple of posts regarding my project which was going on for a long time—a couple of years, actually—and finally got to a satisfying conclusion. As I wrote earlier (see this post and this post) I did build LXmini speakers in their original version, and then was experimenting a bit with different crossover settings and tunings. Then after a house move, I could not use them as floor speakers anymore, and decided to make a more radical experiment and turn them into a desktop near field version. In this post I will talk about the build differences versus the original design by S. Linkwitz, and in subsequent post (or posts?) I will explain the tuning process.

Major Differences

Let's start with the two most important changes.

Height Reduction

The first obvious difference is of course the reduced height of the speakers, since they have to stand on the desktop now. I always work standing, and this puts my head a bit higher above the surface of the desktop compared to a seated position. This is actually good because the vertical stand of LXmini also doubles as the enclosure for the woofer, and thus it shouldn't be too small in order to provide enough spring back action for the woofer driver. I think there is no need to remind everyone that LXmini are designed around simple plumbing pipes. When choosing the height for the vertical stand pipe, I was aiming to get the space between the bottom of the full-range driver and the woofer at my eye level:

The idea behind this requirement is that ears are about at the eye level, and in order to preserve the distances from both drivers to the ear as the head moves closer to or further from the speaker, the ear and both drivers must form an isosceles triangle:

Preserving the distances helps to maintain the time delay between the drivers and the ear, which is important for preserving acoustical summing of sound waves from the drivers.

The length of my vertical stand pipes ended up to be 43 cm (17 inches). Obviously, the change of the working volume for the woofer driver changes its frequency response, so I could not use the original equalizer settings. However, this wasn't a problem for me since I knew that I will end up using my own tuning anyway.

The Speaker Connector

Another major change was in the speaker connector. I'm a big fan of SpeakON connectors by Neutrik. Since LXmini uses line-level crossovers and an amplifier per driver, its speaker cable has 4 wires. SpeakONs exist in various versions including the 4-pole—the model NL4FC. SpeakONs are much more convenient and also safer than traditional speaker posts for "banana" plugs which the original LXmini design uses. The height of those posts and plugs allowed putting them at the bottom of the speaker, however SpeakON connectors are significantly longer and would not fit under the speaker. Thus, I had to move connector to the back side of the speaker. This also makes the connection process a lot easier.

The receptacle for the SpeakON at the rear side also serves as a lock holding the vertical pipe and its stand together. Because of that, only one complementing bolt is required on the front side.

Minor Tweaks

Besides these major changes, I also had a couple of ideas how to make the assembly process more convenient.

Threaded Bolt Holes

The first idea was to use 1/4-20 button cap Allen head bolts everywhere possible, and eliminate the need for nuts by threading the holes in pipes instead. This change makes the process of assembling the top part of the speaker much easier. For example, one step of the assembly procedure requires centering the full-range driver at the end of the horizontal pipe by means of 3 of 4 screws. The original design uses nuts screwed to the bottom of bolts—just enough to make a flat surface:

Aligning these bolts around the driver's magnet is a cumbersome procedure due to the strong interaction between them. In my design, the threaded boltholes prevent movement of the bolts, and screwing them in while preserving alignment becomes much easier:

Another fiddly step of the assembly procedure is alignment of the horizontal pipe to make the surface of the full-range driver to be at the right angle to the surface of the woofer. The horizontal pipe rests on 3 screws that have to be adjusted. The heads of two of them are easily accessible:

The third one in the original design must be a long one, and I did not have such a long bolt, so instead I used a short one and was adjusting it via a hole at the bottom of the woofer's plate:

Pipe Clamps

I have removed the top pipe clamp since it is purely decorative, and it can vibrate at high frequencies, creating undesired resonance. The bottom pipe clamp is functional since it holds the rubber collar on top of the vertical pipe. In order to prevent its possible vibration, first I have put a piece of rubber between the loose end of the clamp steel stripe and the rest of it. Then I used a wire in order to secure the loose end as much as possible:

Alignment Post (Not the best idea :)

Finally, this is the change which did not work as expected. I usually use laser guides while setting up speakers. Since surfaces that we are surrounded with in our homes are never ideal, the only true guides for setting up things straight are gravity and lasers. With regular rectangular boxes of convenient speakers, putting a laser level on top of them is a trivial task, but with the round pipes of the LXmini design it's not. So, my idea was to add a threaded post on top of the speaker in order to mount a laser level on it.

The idea did not work out because although the laser level holds well on the post, it's practically impossible to set it up aligned properly with the speaker itself. So, I had to abandon this idea and instead put marks at the baseplate of the speaker, and this is where I put the laser level during the set-up.

Overall Look

This is how the desktop setup looked like initially (yes, the monitor has exact square aspect ratio :):

Note that it is recommended to put dipoles at least 1 meter from the real wall, however in the desktop setup this is not possible. Instead, I have acoustic absorbers mounted around the desk.

After living with this classical 60 degrees stereo setup for a while, I decided to try to improve imaging by using the directional property of dipoles in order to suppress the sound going into the opposite (ipsilateral) ear. The speakers got moved closer to my side of the desk, and were rotated away from me:

Below is a schematic diagram, a view from above:

The process of configuring the speakers for this setup will be explained in the next post, stay tuned!

Sunday, March 10, 2024

Headphone Equalization For The Spatializer

As I have mostly finished my lengthy experiments with different approaches to stereo spatialization and finally settled up on the processing chain described in this post, I could pour my attention into researching the right tuning for the headphones. Initially I was relying on the headphone manufacturers to do the right thing. What I was hoping for, is that I could take headphones properly tuned to either "free field" or "diffuse field" target, and apply just some minor corrections in my chain. You can read about my attempts to implement "free to diffuse" and "diffuse to free field" transforms applied to the Mid/Side representation of a stereo track.

Why Factory Headphone Tuning Can't Be Used

However, since in the last version of my spatializer I actually came closer to simulating a dummy head recording, I realized that using any factory equalization of any headphones simply does not work—it always produces an incorrect timbre. The reason is that my spatializer simulates speakers and the room, as well as torso reflections, and even some aspects of the pinna filtering—by applying different EQ to "front" (correlated) and "back" (anti-correlated) components of a stereo recording. "3D audio" spatializers (they are also called "binaural synthesis systems") that use real HRTFs come even closer to simulating a dummy head recording. That means, if you play the spatialized audio over headphones with their original, say, "free field target" tuning, you will apply similar human body-related filtering twice: once by the headphones themselves, and then also by the spatializer.

At some point I used to think that using headphones calibrated to the diffuse field target is fine, because they are "neutral" (do not favor any particular direction), but this assumption is also false. That's because diffuse field equalization is an average over HRTFs for all directions—thus it's still based on the human body-related filter. This is why more advanced "3D sound" renderers have a setting where you specify the headphones being used—by knowing that, they can reverse the factory tuning. I also experimented with AirPods Pro and confirmed that they change their tuning when switching between "stereo" and "spatialized" modes.

By the way, in the reasoning above I use dummy head recordings and synthetic "3D audio" spatializatoin interchangeably, but is it conceptually the same thing if we view it from the listening aspect? I think so. Dummy head recordings are taken using a real person, or a head and torso simulator (HATS) with microphones mounted at the ear canal entrances. Sometimes it's not a complete HATS, but just a pair of ears placed at the distance similar to the diameter of a human head. In any case, the microphones placed into these ears block the ear canal (in the case of HATS the canal might not even exist). Now, if we turn our attention to HRTF-based 3D audio rendering, HRTFs are also typically measured with mics placed at the blocked ear canals of a person (see the book "Head-Related Transfer Function and Acoustic Virtual Reality" by K. Iida as an example). So basically, a dummy head recording captures some real venue (or maybe a recording playing over speakers in a room), while a "3D audio" system renders a similar experience from "dry" audio sources by applying HRTF filters and room reverb IRs to them.

It's actually a useful insight because it means that we can use the same equalization of headphones for all of these: dummy head recordings, "realistic" spatializers, and my stereo spatializer. Essentially, I can decouple the tuning of my spatializer from the tuning of the headphones, and validate the latter using other spatial audio sources. And I am also able to compare these different sources directly, switching between them instantly, which provides ideal conditions for doing A/B or blind comparisons.

Hear-Through Transfer Function

The problem is, finding correct equalization for playing any of the spatial audio sources over headphones is actually a bit tricky. As I've mentioned above, first we need to "erase" the factory tuning of the headphones, and then re-calibrate them. But to what target? If we think about it, it must be a transfer function from an entrance of a blocked ear canal to the eardrum. This actually sounds like a familiar problem. For example, if we consider AirPods Pro, by their design they isolate the user from outside noises, however one can enable the "transparency" mode which adds the outer world back in. In the audio engineering industry this is also known as the "hear-through" function.

There is a good paper "Insert earphone calibration for hear-through options" by P. Hoffmann, F. Christensen, and D. Hammershøi on the transfer function of a "hear-through" device. The TF they use schematically looks like this:

The curve starts flat, then it begins to elevate at 300 Hz, reaching a peak between 2–3 kHz, then a sharper peak between 8–9 kHz. It also has a downward slope after 4 kHz. Note that in the paper there is a discrepancy between the curve shown on the Fig. 2 and the target curve for headphone tunings which appears on the Fig. 4.

In practice, we should consider this curve as an approximation only, since for each particular earphone and ear canal the center frequencies of the peaks and their amplitudes will be different. The previously published paper "Sound transmission to and within the human ear canal" by D. Hammershøi, and H. Møller shows that rather high individual variation exists between subjects when the TF from the outside of the ear canal to the ear drum is measured.

Also note that this TF applies to in-ear monitors (IEMs) only. The paper describes re-calibration of different IEMs to this target. Some of them comply well with re-tuning, some are not. For my experiment, I decided to use EarPods because they are comfortable—I have no physical ear fatigue whatsoever, they use a low distortion dynamic driver (see measurements on RTINGS.com), and they are cheap! Plus, you can put them into and out of your ear very easily, unlike true IEM designs that require some fiddling in order to achieve good sealing which blocks outside noises. Obviously, the lack of this sealing with EarPods is one of their drawbacks. After putting them in, you still hear everything outside, and what's worse, the timbre of outside sounds is changed, although since we are not using them for augmented reality purposes, this is not an issue. Yet another annoyance is that the 3.5 mm connector is rather flimsy, however, I have worked around that by wrapping it into heat-shrink tubing:

These drawbacks are not really serious—since we are not implementing actual hear-through for augmented reality, there is no concern about hearing modified sounds from the outside, and we will partially override them with music. However, here is the main drawback of EarPods: due to their shape, it is not easy to make a DIY matching coupler for them. Basically, you need a coupler imitating a human ear. Luckily, I now have access to a B&K head 5128-B. Thanks to its human-like design, EarPods can be easily mounted on it and I could obtain reliable measurements.

Yet another challenge comes from the fact that I wanted to create an individual hear-through calibration. That means, I wanted to be able to adjust the peaks of the hear-through target on the fly. These peaks actually depend on the resonances of the ear canal and the middle ear, which are all individual. The B&K head has its own resonances, too. Thus, even if I knew my individual HT target, calibrating to it on a B&K head might not give me what I want due to interaction between an EarPod and the ear canal of the B&K head.

So I came up with the following approach:

  1. Tune EarPods to "flat", as measured by the B&K head.
  2. Figure out (at least approximately) the TF of the B&K ear canal and microphone.
  3. Add the TF from item 2 back to the "flat" calibration TF.
  4. On top of the TF from item 3, add the generic "hear-through" (HT) TF, and adjust its peaks by ear.

Why does this approach work? Let's represent it symbolically. These are the TFs that we have:

  • B&K's own TF, that is, how the head's mic sees a "flat" low-impedance sound source at the ear canal entrance: P_bk;
  • EarPods, as measured by B&K: P_epbk;
  • Hear-Through TF—according to the paper it transforms a source inside a blocked ear canal into a source at the non-blocked ear canal entrance: P_ht.

We can note that P_epbk can be approximated by P_eb * P_bk, where P_eb is the TF that we could measure by coupling an EarPod directly to a microphone calibrated to a flat response. What we intend to do is to remove P_eb, and replace it with a TF that yields P_ht when EarPods are coupled with my ear canal. That is, our target TF is:

P_tg = P_ht / P_eb

When I calibrate EarPods to yield a flat response on the B&K head, I get an inverse of P_epbk, that is, 1 / P_epbk, and we can substitute 1 / (P_eb * P_bk) instead. Thus, to get P_tg we can combine the TFs as follows:

P_tg = P_ht * 1 / (P_eb * P_bk) * P_bk = P_ht * P_bk * 1 / P_ebbk

And that's exactly what I have described in my approach above. I must note that all these representations are of course approximate. Exact positions of transfer function peaks shift as we move the sound source back and forth in the ear canal, and move outside of it—one can see real measurements in the "Sound transmission to and within the human ear canal" paper I've mentioned previously. That's why it would be naive to expect that we can obtain an exact target TF just from measurements, however they are still indispensable as the starting point.

Approximating B&K 5128 Transfer Function

So, the only thing which has to be figured out is the P_bk transfer function. As I understand it, without a purposefully built earphone which is combined with a microphone inside the ear, this is not an easy task. The problem is that the SPL from the driver greatly depends on the acoustic load. For example, if we use a simplest coupler—a silicon tube between the earphone and a measurement mic, the measured SPL will depend on the amount of air between them, and the dependency is not trivial. What I mean, it's not just changing resonances and power decay due to distance increase.

I checked B&K's own technical report "The Impedance of Real and Artificial Ears" with lots of diagrams for impedance and transfer functions. I also tried some measurements on my own using low distortion dynamic driver IEM (the Truthear Zero:Red I mentioned in my previous posts). The main understanding that I have achieved is that the ear canal and microphone of the B&K head exhibit a tilt towards high frequencies. I was suspecting that because the EarPods tuned to "flat" on the B&K (the 1 / (P_eb * P_bk) TF) did sound very "dark." So I ended up with creating a similar tilt, and the actual amount was initially tuned based on listening, and then corrected when measuring the entire rig of EarPods via P_tg on the same B&K head.

Individual Hear-Through Tuning

As I've already mentioned before, the hear-through TF from the paper could only be considered as a guideline because of individuality of ear canals and the actual depth of insertion of earphones into them.

In order to figure out the individual resonances of the left and right ears I used pink noise rendered into one stereo channel and spatialized via my processing chain. I had two goals: get rid of the "hollow" sound (pipe resonance), and make sure that the pink noise sounds like a coherent point source. When the tuning is not right, it can sound as if consists of two components, with the high frequency band placed separately in the auditory space.

For this tuning process, I have created a prototype HT TF in a linear phase equalizer (DDMF LP10), and then was tweaking center frequencies of the peaks and their amplitudes. My final settings for the left and right ears look like this:

As you can see, compared to the "generic" TF pictured before, some changes had to be made. However, they are only needed to compensate for the features that got "erased" by the "flat" tuning. I had to restore back some bass, and also diminish the region around 6 kHz to avoid hearing sharp sibilants. This is how the final tuning looks like when EarPods are measured by the B&K head:

It looks rather similar to the target described in the "Insert earphone calibration..." paper. Note that the difference in the level of bass between the left and the right ear are due to my inability to achieve the same level sealing on both artificial ears— I must say, coupling EarPods is really non-trivial. Yet another difference is addition of dips around 60 and 70 Hz, as otherwise I could feel the vibration of the drivers in my ears.

Listening Impressions

As I have mentioned above, since we can consider dummy head recordings, "3D audio" renderings, and my spatializer as equally footed sources which require the same "hear-through" earphone calibration, I started with the first two.

I have made some dummy head recording using MS-TFB-2-MKII binaural mics by Sound Professionals mounted on my own head. I tried both just recording sounds of my home, and also recording pink noise played over stereo speakers (left only, right only, correlated PN, anti-correlated PN, and correlated PN from the back). For the "3D audio" rendering I used Anaglyph plugin. This is a free and simple to use plugin which allows rendering a mono audio source via different HRTFs, and apply different kinds of room reverbs.

I must say, I was impressed by the realism achieved! When playing sources recorded or rendered in front of me, I was sure the sound comes from the laptop sitting on my knees. Note that in Anaglyph I have set the source height at -15° for achieving this effect.

Playing sources recorded from the back was a bit of a surprise—I have experienced a strong back-front reversal with my own dummy head recording. However, the rendering by Anaglyph and my own spatialization chain produced much more realistic positioning. Thus, either Sound Professional's mics have some flaw which makes behind sources less "real," or "reality" is not what we want to achieve with these renderings— perhaps the brain requires stronger "clues" to avoid FB reversals, and these clues for some reason are not so strong in a dummy head recording.

But other than that, the dummy head recordings done with these mics are very convincing. If you play a record back exactly at the same location where it was recorded, the mind automatically connects sounds with the objects around. Or you will experience a bit confusing feeling if the visual scene has changed, say, a person which was talking on the recording has left, and now the reproduced sound comes from an empty place.

I also noted how much the feeling of distance and location "collapses" when a dummy head recording is played in another setting—another room for example. Our visual channel indeed has a strong dominance over the audio channel, and seeing a wall at the location where the sound is "coming from" in the recording completely changes the perception of the source's distance to the listener.

Also, the feeling of the distance is greatly affected by the loudness of playback, and perceived dynamic range. Thus, don't expect a lot of realism from simple studio-produced stereo recordings, even when rendering them via a "spatialization" engine.

Fine-Tuning the Spatializer

Using these realisting reference recordings and renderings I was able to fine tune my spatializer. Basically, I used pink noise positioned at different locations, and was quickly switching between my spatializer and a dummy head recording, or a rendering by Anaglyph. This was much more precise and easier than trying to compare the sound in headphones with sound in speakers! Even if you train yourself to put the headphones very quickly on and off, and switching the speakers off and on at the same time, this still does not happen quick enough. I've heard from James "JJ" Johnston that switching time must be no longer than 200 ms. With instant switching, I could compare both positioning of sources and their tonality. Note that these attributes are not independent because the auditory system basically "derives" this information from incomplete data which it receives from ears.

I noticed that the perceived distance to the sources rendered by my spatializer is less than the distance of "real" or "3D audio" sources. When switching between my spatializer and a DH recording, or Anaglyph, the source was literally jumping between "far" and "near" while staying at the same lateral angle and height. Increasing the level of the reverb in my spatializer puts the source a bit further away, at the cost of cleanness. Similar thing with Anaglyph: with reverb turned off the sound is perceived as being very close to the head. Yet, reverb adds significant tonal artifacts. Since I'm not doing a VR, I decided that it's better to have a cleaner sound closer to the face. In fact, a well produced recording has a lot of reverb on its own, which helps to "feel" the sources as if they are far away.

As a result of these comparisons, I have adjusted the level of the cross-feed, and adjusted directional bands the Mid/Side equalization for better rendering of "frontal" and "behind the back" sources. As I have mentioned above, the sources from behind ended up sounding more "realistic" with my rendering, compared to my own dummy head recordings.

Another interesting observation was that my spatialization produces more "heavier" sounding pink noise than a dummy head recording. I actually noticed the same thing with Apple's spatializer. I think this heavier tonality is better for spatializing music, and maybe movies as well, so I decided not to attempt to "fix" that.

Bonus

While researching this topic, I found a paper by Bob Schulein and Dan Mapes-Riordan with a very long title "The Design, Calibration and Validation of a Binaural Recording and Playback System for Headphone and Two-Speaker 3D-Audio Reproduction" on their approach to doing dummy head recordings. The paper also suggests a headphone equalization approach which is similar to what I ended up using. In fact, the "Transfer function T2" from the paper has a lot in common with the "hear-through" target function I used. From the paper, I came to Bob's YouTube channel which features some useful dummy head calibration recordings, as well as recordings of music performances, both captured by the mannequin called Dexter which Bob built.

Some notes on these recordings. I could hear that the tuning applied to the recording is different on older vs. newer videos—it is indeed hard to come up with the right tuning! Then, on the "calibration" recordings where Bob walks around Dexter, I was experiencing front-back reversals as well—that's when listening to recordings with my eyes closed. Thus, one either needs to always look at the visual—this is what Bob proposes, or incorporate head tracking into the reproduction process. However, using head tracking requires replacing in the recording process the dummy head with ambisonics mic, which captures the sound field in a more "generic" form, and then rendering this ambisonics capture via a "3D audio" renderer.

Another weak point of "old style" dummy head recordings that I have noticed is that minor creaking and coughing sounds coming from the audience immediately distract attention from the performance. This happens unconsciously, probably because you don't see the sources of these creaks, and this forces you to look into the direction of the sound as it can be potentially dangerous. When sitting in a concert hall, this unconscious switching of the attention happens much less because you have a wider field of view, compared to the video recording. Whereas, in stereo recordings the performance is captured using mics that are set close to the performers, so all sounds from the auditory can be suppressed if needed during the mixing stage.

Conclusions

I'm really glad I have solved the "last mile" problem of the stereo spatialization chain. Also, the headphone tuning I came up with allows listening to dummy head and "3D audio" recordings delivered with a great degree of realism. It's also interesting that it turns out, for music enjoyment we don't always need the reproduction to be "real." Much like a painting or a photograph, not mentioning movies, actually represents some "enhanced" rendering of reality and reveal the author's attitude to it. Sound engineers also work hard on every recording to make it sound "real" but at the same time interesting. Thus, the reproduction chain must not ignore this aspect.

Sunday, February 4, 2024

The Math of The ITD Filter and Thoughts on Asymmetry

In a recent post titled "(Almost) Linear Phase Crossfeed" I only explained my intents for creating these filters and the effect I achieved with them. In this post I'll explain the underlying math and go through the MATLAB implementation.

I called this post "The Math of The ITD Filter" for the reason I think this naming is more correct. This is because the purpose of this all-pass filter is just to create an interaural time delay (ITD). Thus, it's not a complete cross-feed or HRTF filter but rather a part of it. As mentioned in the paper "Modeling the direction-continuous time-of-arrival in head-related transfer functions" by H. Ziegelwanger and P. Majdak, an HRTF can be decomposed into three components: minimum phase, frequency-dependent TOA (time-of-arrival), and excess phase. As authors explain, the "TOA component" is just another name for the ITD psychoacoustic cue, and that is what our filters emulate.

Time-Domain vs. Frequency-Domain Filter Construction

Since there is a duality between time and frequency domains, any filter or a signal can be constructed in one of them and transformed into another domain as needed. The choice of the domain is a matter of convenience and of the application area. If you are looking for concrete examples, there is an interesting discussion in the paper "Transfer-Function Measurement with Sweeps" by S. Müller and P. Massarani about time-domain and frequency-domain approaches for generation of log sweeps used in measurements.

Constructing in the time domain looks more intuitive at the first glance. After all, we are creating a frequency dependent time delay. So, in my initial attempts I experimented with splitting the frequency spectrum into bands using linear phase crossover filters, then delaying these bands, and combining them back. This is an easy and intuitive way for creating frequency-dependent delays, however stitching these differently delayed bands back does not happen seamlessly. Because of abrupt changes of the phase at connection points, the group delay behaves quite erratically. I realized that I need to learn how to create smooth transitions.

So my next idea was to create the filter by emulating its effect on a measurement sweep, and then deriving the filter from this sweep. This is easy to model because we just need to delay parts of the sweep that correspond to affected frequencies. Since in a measurement sweep each frequency has its own time position, we know which parts we need to delay. "To delay" in this context actually means "alter the phase." For example, if the 1000 Hz band has to be delayed by 100 μs, we need to delay the phase of that part of the sweep by 0.1 * 2π radians. That's because a full cycle of 1 kHz is 1 ms = 1000 μs, so we need to hold it back by 1/10 of the cycle.

This phase delay can be created easily while we are generating a measurement sweep in the time domain. Since we "drive" this generation, we can control the delay and make it changing smoothly along the transition frequency bands. I experimented with this approach, and realized that manipulation of the phase has its own constraints—I have explained these in my previous post about this filter. That is, for any filter used for real-valued signals, we have to make to phase to start from 0 at 0 Hz (DC), and end with zero phase at the end of our frequency interval (half of the sampling rate). The only other possibility is to start with π radians and end with π—this creates a phase-inverting filter.

This is how this requirement affects our filter. Since we want to achieve a group delay which is a constant over some frequency range, it implies that the phase shift must change with the constant rate on that range (this is by the definition of the group delay). That means, the phase must be constantly decreasing or increasing, depending on the sign of the group delay. But this descent (or ascent) must start somewhere! Due to the restriction I stated in the previous paragraph, we can't have a non-zero phase at 0 Hz. So, I came to an understanding that I need to build a phase shift in the ultrasound region, and then drive the phase back to 0 across the frequency region where I need to create the desired group delay. Phase must change smoothly, as otherwise its derivative, the group delay, will jump up and down.

This realization finally helped me to create the filter with the required group delay behavior. However, constructing the filter via the time domain creates artifacts at the ends of the frequency spectrum (this is explained in the paper on measurements sweeps). Because of that, modern measurement software usually uses sweeps generated in the frequency domain. These sweeps are also more natural for performing manipulations in the frequency domain, which can be done fast by means of direct or inverse FFTs. So, after discussing this topic on the Acourate forum, I've got advice from its author Dr. Brüggemann to construct my filters in the frequency domain.

Implementation Details

Creating a filter in the frequency domain essentially means generating a sequence of complex numbers corresponding to each frequency bin. Since we generate an all-pass filter, the magnitude component is always a unity (zero amplification), and we only need to generate values to create phase shifts. However, the parameters of the filter we are creating are not phase shifts themselves but rather their derivatives, that is, group delays.

This is not a big problem, though, because mostly our group delay is a constant, thus it's a derivative of a linear function. The difficult part is to create a smooth transition of the phase between regions. As I have understood from my time-domain generation attempts, a good function for creating smooth transitions is the sine (or cosine), because its derivative is the cosine (or sine) which is essentially the same function, but phase-shifted. Thus, the transitions of both the phase and the group delay will behave similarly.

So, I ended up with two functions. The first function is:

φ_main(x) = -gd * x

And its negated derivative is our group delay:

gd = -φ_main'(x)

The "main" function is used for the "main" interval where the group delay is the specified constant. And another function is:

φ_knee(x) = -gd * sin(x)

Where gd is the desired group delay in microseconds. By choosing the input range we can get an ascending slope which transitions into a linear region, or a descending slope.

Besides the group delay, another input parameter is the stop frequency for the time shift. There are also a couple of "implicit" parameters of the filter:

  • the band for building the initial phase shift in the ultrasound region;
  • the width of the transition band from the time shift to zero phase at the stop frequency.

Graphically, we can represent the phase function and the resulting group delay as follows (the phase is in blue and the group delay is in red):

Unfortunately, the scale of the group delay values is quite large. Here is a zoom in on the region of our primary interest:

For the initial ramp up region, I have found that a function sin(x) + x gives a smoother "top" compared to a sine used alone. So there is the third function:

φ_ramp(x) = (sin(x) +x) / π

Note that this function does not depend on the group delay. What we need from it is to create the initial ramp up of the phase from 0 to the point where it starts descending.

Getting from the phase function to the filter is rather easy. The complex number in this case is . The angle φ is in radians, thus the values produced by our φ_main and φ_knee, must be multiplied by .

Since we work with discrete signals, we need to choose how many bins of the FFT to use. I use 65536 bins for the initial generation, this has enough resolution in the low frequency region. And the final filters can be made shorter by applying a window (more details are in the MATLAB section below).

Verification in Audio Analyzers

Checking the resulting group delay of the filter is possible in the MATLAB itself, as I have shown in the graphs above. However, to double-check, we want to make an independent verification using some audio analyzing software like Acourate or REW. For that, we need to apply inverse FFT to the FFT we have created, and save the resulting impulse response as a WAV file. Below is a screenshot from FuzzMeasure (I used it because it can put these neat labels on the graph):

I loaded two impulses, both ending at 2150 Hz: one has a -50 μs group delay (red graph), another 85 μs (blue graph). The absolute values of the group delay displayed on the graph (170+ ms) are not important because they are counted from the beginning of the time domain representation, thus they depend on the position of the IR peak. What we are interested in are the differences. We can see that at the flat region past 2 kHz the value is 170.667 ms for both graphs. Whereas, for the blue graph the value at ~200 Hz is 170.752 ms. The difference between these values is 85 μs. For the red graph, the difference is 617-667 = -50 μs. As we can see, this agrees with the filter specification.

In REW, the group delay is relative to the IR peak. However, it has a different issue. Being a room acoustics software, REW by default applies an asymmetric window which significantly truncates the left part of the impulse. This works great for minimum-phase IRs, however since the IR of our filter is symmetric, this default weighting creates as significant ripple at low frequencies:

In order to display our filter correctly, we need to choose the Rectangular window in the "IR windows" dialog, and expand it to maximum both on the left and the right side:

After adjusting windowing this way, the group delay is displayed correctly:

We can see that the group delays are the same as before: -50 μs and 85 μs. Thus, verification confirms that the filters do what is intended, but we need to understand how to use audio analyzing software correctly.

MATLAB Implementation Explained

Now, knowing that our filter works, let's understand how it is made. The full code of the MATLAB script is here, along with a couple of example generated filters.

The main part is the function called create_itd_filter. This is its signature:

function lp_pulse_td = create_itd_filter(...
    in_N, in_Fs, in_stopband, in_kneeband, in_gd, in_wN)

The input parameters are:

  • in_N: the number of FFT bins used for filter creation;
  • in_FS: the sampling rate;
  • in_stopband: the frequency at which the group delay must return to 0;
  • in_kneeband: the frequency at which the group delay starts its return to 0;
  • in_gd: the group delay for the "head shadowing" region;
  • in_wN: the number of samples in the generated IR after windowing.

From these parameters, the script derives some more values:

    bin_w = in_Fs / in_N;
    bin_i = round([1 18 25 in_kneeband in_stopband] ./ bin_w);
    bin_i(1) = 1;
    f_hz = bin_i .* bin_w;

bin_w is the FFT bin width, in Hz. Using it, we calculate the indexes of FFT bins for the frequencies we are interested in. Let's recall the shape of our group delay curve:

Note that 1 Hz value is only nominal. In fact, we are interested the first bin for our first point, so we set its index explicitly, in order to avoid rounding errors. And then we translate bin indexes back to frequencies (f_hz) by multiplying them by the bin width. Use of frequencies that represent the centers of bins is the usual practice for avoiding energy spillage between bins.

Next, we define the functions for the phase, either directly, or as integrals of the group delay function:

    gd_w = @(x) -in_gd * x; % f_hz(3)..f_hz(4)
    syms x;
    gd_knee_f = -in_gd * (cos(pi*((x - f_hz(4))/(f_hz(5)-f_hz(4)))) + 1)/2;
    gd_knee_w = int(gd_knee_f);
    gd_rev_knee_f = -in_gd * cos(pi/2*((x - f_hz(3))/(f_hz(3)-f_hz(2))));
    gd_rev_knee_w = int(gd_rev_knee_f);

gd_w is the φ_main function from the previous section. It is used between the frequency points 3 and 4. gd_knee_f is the knee for the group delay, its integral is the φ_knee function for the phase shift. As a reminder, this knee function is used as a transition between the constant group delay (point 4) and zero delay (point 5). We run the cosine on the interval [0, π], and transform it, so that it yields values in the range from 1 to 0, allowing us to descent from in_gd to 0.

But we also need to have a knee before our constant group delay (from the point 2 to the point 3), to ramp it up from 0. Ramping up can be done faster, so we run the cosine function on the interval [-π/2, 0]. The resulting range goes naturally from 0 to 1, no need to transform it.

Now we use these functions to go backwards from the point 5 to the point 2:

    w = zeros(1, in_N);
    w(bin_i(4):bin_i(5)) = subs(gd_knee_w, x, ...
        linspace(f_hz(4), f_hz(5), bin_i(5)-bin_i(4)+1)) - ...
        subs(gd_knee_w, x, f_hz(5));
    offset_4 = w(bin_i(4));
    w(bin_i(3):bin_i(4)) = gd_w(...
        linspace(f_hz(3), f_hz(4), bin_i(4)-bin_i(3)+1)) - ...
        gd_w(f_hz(4)) + offset_4;
    offset_3 = w(bin_i(3));
    w(bin_i(2):bin_i(3)) = subs(gd_rev_knee_w, x, ...
        linspace(f_hz(2), f_hz(3), bin_i(3)-bin_i(2)+1)) - ...
        subs(gd_rev_knee_w, x, f_hz(3)) + offset_3;
    offset_2 = w(bin_i(2));

Since gd_knee_w and gd_rev_knee_w are symbolic functions, we apply them via the MATLAB function subs. When going from point to point, we need to move each segment of the resulting phase vertically to align the connection points.

Now, the interesting part. We are at the point 2, and our group delay is 0. However, the phase shift is not 0, you can check the value of offset_2 in the debugger if you wish. Since the interval from the point 1 to the point 2 lies in the ultrasonic region, the group delay there is irrelevant. What is important is to drive the phase shift so that it is equal to 0 in the point 1. This is where the φ_ramp function comes handy. It has a smooth top which connects nicely to the top of the knee, and it yields 0 at the input value 0.

    ramp_w = @(x) (x + sin(x)) / pi;
    w(bin_i(1):bin_i(2)) = offset_2 * ramp_w(linspace(0, pi, bin_i(2)-bin_i(1)+1));

Then we fill our the values of FFT bins by providing our calculated phase to the complex exponential:

    pulse_fd = exp(1i * 2*pi * w);

Then, in order to produce a real-valued filter, the FFT must be symmetric relative to the center frequency (see this summary by J. O. Smith, for example). So, for example, the FFT (Bode plot) of the filter may look like this:

Below is the code that performs necessary mirroring:

    pulse_fd(in_N/2+2:in_N) = conj(flip(pulse_fd(2:in_N/2)));
    pulse_fd(1) = 1;
    pulse_fd(in_N/2+1) = 1;

Now we switch to the time domain using inverse FFT:

    pulse_td = ifft(pulse_fd);

This impulse has its peak in the beginning. In order to produce a linear phase filter, the impulse must be shifted to the center of the filter:

    lp_pulse_td = circshift(pulse_td, in_N/2-pidx+1);

And finally, we apply the window. I used REW to check which one works best, and I found that the classic von Hann window does the job. Here we cut and window:

    cut_i = (in_N - in_wN) / 2;
    lp_pulse_td = lp_pulse_td(cut_i:cut_i+in_wN-1);
    lp_pulse_td = hann(in_wN)' .* lp_pulse_td;

That's it. Now we can save the produced IR into a stereo WAV file. I use a two channel file because I have found that having a bit of asymmetry works better for externalization:

lp_pulse_td1 = create_itd_filter(N, Fs, stopband, kneeband, gd1, wN);
lp_pulse_td2 = create_itd_filter(N, Fs, stopband, kneeband, gd2, wN);
filename = sprintf('itd_%dusL_%dusR_%dHz_%dk_%d.wav', ...
    fix(gd1 * 1e6), fix(gd2 * 1e6), stopband, wN / 1024, Fs / 1000);
audiowrite(filename, [lp_pulse_td1(:), lp_pulse_td2(:)], Fs, 'BitsPerSample', 64);

These IR files can be used in any convolution engine. They are also can be loaded into audio analyzer software.

Application and Thoughts on Asymmetry

For a practical use, we need to create 4 filters: a pair for ipsi- and contra-lateral source, and these are different for the left and right ear. Why the asymmetry? The paper I referenced in the introduction of this post, and other papers suggest that ITD of real humans are not symmetrical. My own experiments with using different delay values also confirm that asymmetric delays feel more natural. It's interesting that even the delay of arrival between ipsi- and contra-lateral ears can be made slightly different for the left and right direction. Maybe this originates from the fact that we never hold our head ideally straight. The difference is quite small anyway. Below are the delays that I ended up using for myself:

  • ipsilateral left: -50 μs, right: -60 μs;
  • contralateral left: 85 μs, right: 65 μs.

As we can see, difference between the left hand source path and the right hand source path is 10 μs (135 μs vs. 125 μs). This corresponds approximately to a 3.4 mm sound path, which seems to be reasonable.

Note that if we take the "standard" inter-ear distance of 17.5 cm, that gives us roughly a 510 μs TOA delay. However, this corresponds to a source which is 90° from the center. While, for stereo records we should use just about one third of this value. Also, your actual inter-ear distance can be smaller. In any case, since this is just a rough model, it makes sense to experiment and try using different values.

Regarding the choice of the frequencies for the knee- and stop-bands. With the increase of the frequency the information from the phase difference becomes more and more ambiguous. Various sources suggest that the ambiguity in the time information starts at about 1400 Hz. My own brief experiments suggest that if we keep the ITD along the whole audio spectrum, it makes determination of source position more difficult for narrowband sources starting from 2150 Hz. Thus, I decided to set the transition band between those two values.