So, here I am, doing some musical work for my choir. In essence, my “job” is simple: take bad-quality sheet music, transcribe it into good sheet music, and export audio files for each singer so they can rehearse their part. It doesn’t sound too complicated, right?
Well, as has been explained earlier, I’m a strong believer in automation — mostly because I hate tedious tasks. I can’t really automate the sheet music transcription, but everything around it can be semi-automated. And from that premise, I’ve written a utility called Librarian.
The mistake #
There actually already is a Librarian utility, one I wrote about a year ago when faced with a similar problem. However, for some reason I decided to write that utility entirely in `bash`. Now, `bash` is a useful tool, but it is an awful programming language, and my 800+ line mess is a testament to that.
While a refactor from that mess to Python is ongoing, I have a more pressing issue — I have some new audio I want to get out, and I need to master it. The `ffmpeg` commands I’m using for that are hidden somewhere in the old `librarian` utility, and now I have to find them and tweak them.
The goal #
The current/old audio mastering works as follows: for each `.wav` file in the directory, do an analysis and apply a compressor, then re-analyze, normalize, and encode as MP3. Simple as.
Notably, this means that each audio track for each voice contains only that voice. Feedback I have gotten repeatedly is that it’d be nice to hear the rest of the music, but at a lower volume, so it’s easier to follow along. It makes sense — hearing the other voices is better than just listening to silence and trying to count beats in your head.
The new idea, then, is to mix all the other audio tracks into each voice, but at a much lower volume. This presents at least two issues that I’m currently aware of: first, I need to set up `ffmpeg` so that it’s no longer dealing with just one file at a time, but rather an arbitrary number of files at once. Secondly, I need to find a way to pan the voices apart in stereo — I’m not a sound tech, so don’t come at me if I’m butchering the nomenclature. The sound gets… really strange, honestly, when more than one voice is singing the same note. Some kind of weird interference, I guess, leading to a lot of very sharp overtones.
The solution, I’ve found, is to spread the voices apart in stereo, but I can’t do that during the export from the notation software. For starters, I work in bulk, and tuning the panning dials for all the instruments (read: voices) in every piece gets really dull — it also means the setup is vulnerable to human error, were I to forget to do that tuning. Secondly, I want each voice file to have “their” voice panned dead center — compensating for weird panning after the fact seems like way more trouble than it’s worth.
Current state of the art #
Working with `ffmpeg` is not exactly trivial, so let’s see what I managed to concoct all those months ago.
```bash
# MASTER: ffmpeg a single file
# This is where all the ffmpeg settings live
master_run_ffmpeg() {
    name="$1"
    # Derive path components
    stem=$(dirname "${name}")
    filename=$(basename "${name}")
    # Strip file ending, if it exists
    wav_ext=".wav"
    if [[ "${filename}" == *"${wav_ext}" ]]; then
        filename="${filename%"${wav_ext}"}"
    fi
    # Create temporary directory
    td=$(mktemp -d)
    # Call ffmpeg - first pass analysis
    loudnorm_args=$(master_analysis_to_args "${filename}" "${stem}" "${td}")
    # Call ffmpeg - first pass: loudness normalization + compression
    ffmpeg \
        -i "${stem}/${filename}.wav" \
        -af "${loudnorm_args},acompressor=threshold=-12dB:ratio=2:attack=0.2:release=1" \
        -codec:a pcm_s16le \
        -y \
        "${td}/${filename}.wav"
    # Call ffmpeg - second pass analysis
    loudnorm_args=$(master_analysis_to_args "${filename}" "${td}" "${td}")
    # Call ffmpeg - second pass mastering
    ffmpeg \
        -i "${td}/${filename}.wav" \
        -af "${loudnorm_args}" \
        -codec:a libmp3lame \
        -qscale:a 2 \
        -y \
        "${stem}/${MASTERDIR}/${filename}.mp3"
}
```
This is a shortened version of the main mastering function — I’ve cut out some logging and other chaff. As we can see, we’re depending on another function called `master_analysis_to_args`…
```bash
# MASTER: Do loudnorm analysis, return argument string
master_analysis_to_args() {
    name="$1"
    stem="$2"
    td="$3"
    analysis=$(master_loudnorm_analysis "${name}" "${stem}" "${td}")
    m_i=$(echo "${analysis}" | grep "input_i" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_tp=$(echo "${analysis}" | grep "input_tp" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_lra=$(echo "${analysis}" | grep "input_lra" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_thresh=$(echo "${analysis}" | grep "input_thresh" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    loudnorm_args="$(master_loudnorm_args "$m_i" "$m_tp" "$m_lra" "$m_thresh")"
    echo "$loudnorm_args"
}
```
Well, that function only wraps two more.
```bash
# MASTER: loudnorm analysis on a file, returning the full report output
master_loudnorm_analysis() {
    name="$1"
    stem="$2"
    td="$3"
    # Set the report target
    report="${td}/report.log"
    export FFREPORT="file=${report}"
    # Call ffmpeg - analysis
    ffmpeg \
        -report \
        -i "${stem}/${name}.wav" \
        -af "loudnorm=print_format=json" \
        -vn -sn -dn \
        -f null /dev/null
    # Read the report
    data=$(tail -n 25 "${report}")
    # Destroy the report
    rm -f "${report}"
    # Return
    echo "${data}"
}
```
```bash
# MASTER: Compile loudnorm argument string
master_loudnorm_args() {
    m_i="$1"
    m_tp="$2"
    m_lra="$3"
    m_thresh="$4"
    loudnorm_args="loudnorm=linear=true"
    loudnorm_args="${loudnorm_args}:measured_I=$m_i"
    loudnorm_args="${loudnorm_args}:measured_LRA=$m_lra"
    loudnorm_args="${loudnorm_args}:measured_TP=$m_tp"
    loudnorm_args="${loudnorm_args}:measured_thresh=$m_thresh"
    loudnorm_args="${loudnorm_args}:i=-12"
    echo "${loudnorm_args}"
}
```
Okay, so there we are. The `master_loudnorm_args` function just mangles the data extracted by `master_analysis_to_args` into an argument string `ffmpeg` can use. The real analysis is done in `master_loudnorm_analysis`.
Process flow #
The first step is an audio analysis using the `loudnorm` filter.
Audio analysis #
`ffmpeg -report` tells `ffmpeg` to dump its whole terminal output to a file, which we have specified as `${td}/report.log` by setting the `FFREPORT` environment variable, per the docs.
The `-i` flag just feeds in a `.wav` file, so nothing special there.
-af "loudnorm=print_format=json"
is the interesting bit. -af
applies a “filtergraph” to the audio stream (see the
docs), which for us is the
loudnorm
filter. It does a “EBU
R128 loudness normalization”, whatever that is, but when just given the print_format
argument, I guess it does an analysis without changing the input. The output, having
trimmed off a lot of ffmpeg
’s chaff, looks like this:
```json
{
    "input_i" : "-40.11",
    "input_tp" : "-24.19",
    "input_lra" : "20.00",
    "input_thresh" : "-51.96",
    "output_i" : "-24.50",
    "output_tp" : "-7.71",
    "output_lra" : "13.30",
    "output_thresh" : "-35.41",
    "normalization_type" : "dynamic",
    "target_offset" : "0.50"
}
```
Please don’t think I just know this stuff. I’m looking it up in the docs as I go, trying to reverse-engineer my own code, and making a mental note to maybe add more comments in future.
The references to the documentation are just as much for my benefit as yours.
`-vn -sn -dn` seems to just disable the video, subtitle, and data streams. `ffmpeg` really is a multitool, isn’t it.
Finally, `-f null /dev/null` sets the output format to `null` and sends any output to `/dev/null`. Makes sense for an analysis stage where we just care about the JSON data.
Then the report file is read, reduced to its last 25 lines (via `tail -n 25`), and passed through a series of `grep`s and `sed` regexes, until the output configuration string is produced, looking like:

```
loudnorm=linear=true:measured_I=-40.11:measured_LRA=20.00:measured_TP=-24.19:measured_thresh=-51.96:i=-12
```
This is then passed back to the main mastering function.
First stage: Compressor #
The first step in the actual mastering process is to apply a compressor. The call itself looks like:
```bash
ffmpeg \
    -i "${stem}/${filename}.wav" \
    -af "${loudnorm_args},acompressor=threshold=-12dB:ratio=2:attack=0.2:release=1" \
    -codec:a pcm_s16le \
    -y \
    "${td}/${filename}.wav"
```
Some of this is looking familiar. We take in the base `.wav` file, apply loudness normalization with the `loudnorm` parameters we just got out, apply the `acompressor` filter with some reasonably chosen magic numbers (both filters live in a single `-af` chain, since `ffmpeg` only uses the last `-af` when given several), and export it as a `pcm_s16le` (16-bit) `.wav` file, overwriting if there’s something in the way (thanks to the `-y` flag).
Second stage: Amplification #
Now, the just-generated new `.wav` file is passed through analysis again — the same as above — in order to do a final, second pass to normalize the loudness.
```bash
ffmpeg \
    -i "${td}/${filename}.wav" \
    -af "${loudnorm_args}" \
    -codec:a libmp3lame \
    -qscale:a 2 \
    -y \
    "${stem}/${MASTERDIR}/${filename}.mp3"
```
This looks pretty similar, except that we’re just applying the `loudnorm` filter and we’re encoding the output as MP3 with a quality of “2”, which (as far as I can tell) maps to LAME’s VBR quality level 2, a high-quality variable-bitrate setting.
And that’s supposedly it.
Improving #
So, for starters, the other audio needs to be mixed in. In the test file I’m working with, that’s 10 different files — two choirs of four voices each plus two solos. Normally, I just export a “combined” audio file from my notation software, but now we’re doing it by hand, since we want to fiddle around with panning.
Speaking of: panning. Each of the 10 tracks needs to be panned in a unique way. For the new Librarian utility, I want to auto-generate the panning numbers, but for now I can go with just picking numbers out of a hat.
That’s not the whole story, though. The `pan` filter is more powerful than just tweaking a panning knob, but is consequently harder to grasp. So I guess step one is to figure out how to pan an input file.
Panning #
Leaving aside how perception works, to the best of my understanding, panning stereo audio 50% right means leaving the right channel untouched and reducing the left channel by 50%. There may be more Fancy Mathematics™️ involved, but let’s pretend there’s a linear relationship.
Given that, it seems that panning works something like this:

```bash
ffmpeg -i test.wav \
    -af "pan=stereo|FL=0.5*FL|FR=FR" \
    test_pan.wav
```
This just takes the input audio and scales the left channel (`FL`) by 0.5 while leaving the right channel alone. Testing that, it seems to work as expected.
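One way to sanity-check the result, beyond just listening, is `ffmpeg`’s `astats` filter, which prints per-channel statistics. A minimal check on the file from the example above might look like:

```bash
# Print per-channel stats (peak/RMS levels, etc.) for the panned file.
# With FL scaled by 0.5, the left channel should read roughly 6 dB lower.
ffmpeg -i test_pan.wav -af astats -f null /dev/null
```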
Choosing the panning values #
The way I see it, there are a few ways of assigning these values.
- Hardcoding works up to a point, and that point is any time I have more (or fewer) than exactly four tracks (most men’s choir music will be two tenors and two basses).
- Random selection is something I was toying with, mostly because hardcoding is doomed to fail. One could motivate it by saying that it “mimics the way a real choir/orchestra sounds”, but really, it’s just to avoid having to deal with hardcoding.

  The problem here is that while I want the voices panned out in stereo, I still want them reasonably centered — it wouldn’t make sense for some voices to be coming out of just one of the audio channels. That means I can’t do a uniform random distribution over the \([0,1]\) interval.

  I was toying even more with the idea of doing random sampling on a normal distribution with something like \(\mu = 0.5\) and \(\sigma = \frac{1}{5}\), where I’d treat 0.5 as “dead center”, then pan out accordingly. The issue with this approach is that, of course, most samples will lie pretty close to dead center — some may even be dead center. I considered using an offset to push the samples to the left or right, but then the chance of ending up dead center still persists.
- The method I think I’ve settled on is a round-robin approach (there’s a small sketch of it after this list). I create some predefined list of coefficients — maybe \([0.9, 0.85, 0.8, 0.75, \dots, 0.5]\). Then, I sort the audio files lexicographically and iterate over them. The first file gets the first coefficient panned left, the second file gets the first coefficient panned right, the third gets the second coefficient panned left, and so on.

  Not only is this deterministic, it also (probably) preserves a reasonable balance in the mixed audio between the channels. It’ll also lead to “symmetry”, which I’m usually after — if the 1st Bass track is panned 20% left (0.8 coefficient on the right channel), the 2nd Bass track will (most likely) be panned 20% right, since it ought to come next in the sorted list.

  If, for some unholy reason, I were to run out of coefficients, I can just start over with the same list. The odds of this causing some weird audio interference are slim to none. In the final utility, I probably also want some method to exclude one or more tracks from panning — e.g. an accompanying piano or church organ or such — but that’s not a problem for now.
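To make that concrete, here’s a minimal sketch of the round-robin assignment, assuming the source files sit as `.wav` files in the current directory. The coefficient list is just the placeholder one from above, and the `echo` stands in for the eventual `ffmpeg` call.

```bash
# Round-robin pan assignment; purely illustrative.
coefficients=(0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5)

i=0
for file in *.wav; do  # glob expansion is already sorted lexicographically
    coeff="${coefficients[$(( (i / 2) % ${#coefficients[@]} ))]}"
    if (( i % 2 == 0 )); then
        # Even index: pan left, i.e. attenuate the right channel
        pan="pan=stereo|FL=FL|FR=${coeff}*FR"
    else
        # Odd index: pan right, i.e. attenuate the left channel
        pan="pan=stereo|FL=${coeff}*FL|FR=FR"
    fi
    echo "${file}: ${pan}"
    i=$((i + 1))
done
```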
Mixing #
Now that we can pan out the audio tracks, let’s try to mix them together. The `amix` filter seems to be a reasonably straightforward way of doing that.
```bash
ffmpeg -i test_left_pan.wav \
    -i test_right_pan.wav \
    -filter_complex "amix=inputs=2" \
    mixed.wav
```

seems to do the trick.
I also noticed that the `amix` filter has support for setting weights, which means I don’t have to de-gain the “background” tracks as a separate filter — as long as the “hero” track is the first input, all other inputs get the second weight.
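In other words, a “hero” mix could look something like the sketch below. The filenames and the 0.25 weight are placeholders I’d tune by ear; the point is just that the first weight goes to the first input and the second weight to everything else.

```bash
# Hypothetical hero mix: the "hero" voice is the first input at full weight,
# all remaining (panned) inputs share the second weight.
ffmpeg -i tenor1_pan.wav \
    -i tenor2_pan.wav \
    -i bass1_pan.wav \
    -i bass2_pan.wav \
    -filter_complex "amix=inputs=4:weights=1 0.25" \
    tenor1_hero.wav
```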
Order of operations #
Having solved that, I now have a reasonably good idea of what the new script should do — and I need it to be a script; doing this by hand is incredibly tedious.
- Take all the component `.wav` files and export new, panned versions. Make a note of the number of tracks — we’ll need that info for the `amix` filter.
- Do a “straight” mix of all panned tracks into the “general” track.
- Apply the earlier analyze -> compress -> analyze -> normalize chain to the general track.
- Do the “hero” mixes for each (or some subset) of the component files.
- Apply the processing chain to each of those, too.
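Spelled out as a shell skeleton, those steps might look roughly like the sketch below. `pan_track` and `hero_mix` are hypothetical helpers wrapping the `pan` and `amix` commands from earlier, and `master_run_ffmpeg` is the old mastering function, assumed to be sourced.

```bash
#!/usr/bin/env bash
# Very rough sketch of the planned flow; helper names are placeholders.
set -euo pipefail

tracks=(*.wav)
mkdir -p panned hero

# 1. Export panned versions of every component track
for t in "${tracks[@]}"; do
    pan_track "${t}" "panned/${t}"
done

# 2. Straight mix of all panned tracks into the "general" track
inputs=()
for t in "${tracks[@]}"; do
    inputs+=(-i "panned/${t}")
done
ffmpeg "${inputs[@]}" \
    -filter_complex "amix=inputs=${#tracks[@]}" \
    -y general.wav

# 3. Analyze -> compress -> analyze -> normalize the general track
master_run_ffmpeg general.wav

# 4 & 5. Build a "hero" mix per voice, then run the same chain on each
for t in "${tracks[@]}"; do
    hero_mix "${t}" "hero/${t}"
    master_run_ffmpeg "hero/${t}"
done
```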
If I can get that in place, I can do the rest by hand, while continuing to develop the new Librarian.
But not today. It’s taken me most of today to get this far and my brain is mush.