So, here I am, doing some musical work for my choir. In essence, my “job” is simple: take bad-quality sheet music, transcribe it into good sheet music, and export audio files for each singer so they can rehearse their part. It doesn’t sound too complicated, right?
Well, as has been explained earlier, I’m a strong believer in automation — mostly because I hate tedious tasks. I can’t really automate the sheet music transcription, but everything around it can be semi-automated. And from that premise, I’ve written a utility called Librarian.
The mistake #
There actually already is a Librarian utility, one I wrote about a year ago when faced with a similar problem. However, for some reason I decided to write that utility entirely in `bash`. Now, `bash` is a useful tool, but it is an awful programming language, and my 800+ line mess is a testament to that.
While a refactor from that mess to Python is ongoing, I have a more pressing issue — I have some new audio I want to get out, and I need to master it. The `ffmpeg` commands I’m using for that are hidden somewhere in the old `librarian` utility, and now I have to find them and tweak them.
The goal #
The current/old audio mastering works as follows: for each `.wav` file in the directory, do an analysis and apply a compressor, then re-analyze, normalize, and encode as MP3. Simple as.
Notably, this means that each audio track for each voice contains only that voice. Feedback I have gotten repeatedly is that it’d be nice to hear the rest of the music, but at a lower volume, so it’s easier to follow along. It makes sense — hearing the other voices is better than just listening to silence and trying to count beats in your head.
The new idea, then, is to mix all the other audio tracks into each voice, but at a much lower volume. This presents at least two issues that I’m currently aware of: first, I need to set up `ffmpeg` so that it’s no longer dealing with just one file at a time, but rather an arbitrary number of files at once. Secondly, I need to find a way to pan the voices apart in stereo — I’m not a sound tech, so don’t come at me if I’m butchering the nomenclature. The sound gets… really strange, honestly, when more than one voice is singing the same note. Some kind of weird interference, I guess, leading to a lot of very sharp overtones.
The solution, I’ve found, is to spread the voices apart in stereo, but I can’t do that during the export from the notation software. For starters, I work in bulk, and tuning the panning dials for all the instruments (read: voices) in every piece gets really dull — it also means the setup is vulnerable to human error, were I to forget to do that tuning. Secondly, I want each voice file to have “their” voice panned dead center — compensating for weird panning after the fact seems like way more trouble than it’s worth.
Current state of the art #
Working with `ffmpeg` is not exactly trivial, so let’s see what I managed to concoct all those months ago.
```bash
# MASTER: ffmpeg a single file
# This is where all the ffmpeg settings live
master_run_ffmpeg() {
    name="$1"
    # Derive path components
    stem=$(dirname "${name}")
    filename=$(basename "${name}")
    # Strip file ending, if it exists
    wav_ext=".wav"
    if [[ "${filename}" == *"${wav_ext}" ]]; then
        filename="${filename%"${wav_ext}"}"
    fi
    # Create temporary directory
    td=$(mktemp -d)
    # Call ffmpeg - first pass analysis
    loudnorm_args=$(master_analysis_to_args "${filename}" "${stem}" "${td}")
    # Call ffmpeg - first pass: loudness normalization + compression
    ffmpeg \
        -i "${stem}/${filename}.wav" \
        -af "${loudnorm_args},acompressor=threshold=-12dB:ratio=2:attack=0.2:release=1" \
        -codec:a pcm_s16le \
        -y \
        "${td}/${filename}.wav"
    # Call ffmpeg - second pass analysis
    loudnorm_args=$(master_analysis_to_args "${filename}" "${td}" "${td}")
    # Call ffmpeg - second pass mastering
    ffmpeg \
        -i "${td}/${filename}.wav" \
        -af "${loudnorm_args}" \
        -codec:a libmp3lame \
        -qscale:a 2 \
        -y \
        "${stem}/${MASTERDIR}/${filename}.mp3"
}
```
This is a shortened version of the main mastering function — I’ve cut out some logging and other chaff. As we can see, we’re depending on another function called `master_analysis_to_args`…
```bash
# MASTER: Do loudnorm analysis, return argument string
master_analysis_to_args() {
    name="$1"
    stem="$2"
    td="$3"
    analysis=$(master_loudnorm_analysis "${name}" "${stem}" "${td}")
    m_i=$(echo "${analysis}" | grep "input_i" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_tp=$(echo "${analysis}" | grep "input_tp" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_lra=$(echo "${analysis}" | grep "input_lra" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    m_thresh=$(echo "${analysis}" | grep "input_thresh" | sed 's/[[:space:]]*".*" : "\(.*\)",/\1/')
    loudnorm_args="$(master_loudnorm_args "$m_i" "$m_tp" "$m_lra" "$m_thresh")"
    echo "$loudnorm_args"
}
```
Well, that function only wraps two more.
```bash
# MASTER: loudnorm analysis on a file, returning the full report output
master_loudnorm_analysis() {
    name="$1"
    stem="$2"
    td="$3"
    # Set the report target
    report="${td}/report.log"
    export FFREPORT="file=${report}"
    # Call ffmpeg - analysis
    ffmpeg \
        -report \
        -i "${stem}/${name}.wav" \
        -af "loudnorm=print_format=json" \
        -vn -sn -dn \
        -f null /dev/null
    # Read the report
    data=$(tail -n 25 "${report}")
    # Destroy the report
    rm -f "${report}"
    # Return
    echo "${data}"
}
```
```bash
# MASTER: Compile loudnorm argument string
master_loudnorm_args() {
    m_i="$1"
    m_tp="$2"
    m_lra="$3"
    m_thresh="$4"
    loudnorm_args="loudnorm=linear=true"
    loudnorm_args="${loudnorm_args}:measured_I=$m_i"
    loudnorm_args="${loudnorm_args}:measured_LRA=$m_lra"
    loudnorm_args="${loudnorm_args}:measured_TP=$m_tp"
    loudnorm_args="${loudnorm_args}:measured_thresh=$m_thresh"
    loudnorm_args="${loudnorm_args}:i=-12"
    echo "${loudnorm_args}"
}
```
Okay, so there we are. The `master_loudnorm_args` function just mangles the data extracted by `master_analysis_to_args` into an argument string `ffmpeg` can use. The real analysis is done in `master_loudnorm_analysis`.
Process flow #
The first step is an audio analysis using the `loudnorm` filter.
Audio analysis #
`ffmpeg -report` tells `ffmpeg` to dump its whole terminal output to a file, which we have specified as `${td}/report.log` by setting the `FFREPORT` environment variable, per the docs.
The `-i` flag just feeds in a `.wav` file, so nothing special there.
-af "loudnorm=print_format=json"
is the interesting bit. -af
applies a “filtergraph” to the audio stream (see the
docs), which for us is the
loudnorm
filter. It does a “EBU
R128 loudness normalization”, whatever that is, but when just given the print_format
argument, I guess it does an analysis without changing the input. The output, having
trimmed off a lot of ffmpeg
’s chaff, looks like this:
```json
{
    "input_i" : "-40.11",
    "input_tp" : "-24.19",
    "input_lra" : "20.00",
    "input_thresh" : "-51.96",
    "output_i" : "-24.50",
    "output_tp" : "-7.71",
    "output_lra" : "13.30",
    "output_thresh" : "-35.41",
    "normalization_type" : "dynamic",
    "target_offset" : "0.50"
}
```
Please don’t think I just know this stuff. I’m looking it up in the docs as I go, trying to reverse-engineer my own code, and making a mental note to maybe add more comments in future.
The references to the documentation are just as much for my benefit as yours.
`-vn -sn -dn` seems to just disable the video, subtitle, and data streams. `ffmpeg` really is a multitool, isn’t it.
Finally, `-f null /dev/null` sets the output format to `null` and sends any output to `/dev/null`. Makes sense for an analysis stage where we just care about the JSON data.
Then the report file is read, reduced to its last 25 lines (via `tail -n 25`), and passed through a series of `grep`s and `sed` regexes, until the output configuration string is produced, looking like:

```
loudnorm=linear=true:measured_I=-40.11:measured_LRA=20.00:measured_TP=-24.19:measured_thresh=-51.96:i=-12
```
This is then passed back to the main mastering function.
First stage: Compressor #
The first step in the actual mastering process is to apply a compressor. The call itself looks like:
```bash
ffmpeg \
    -i "${stem}/${filename}.wav" \
    -af "${loudnorm_args},acompressor=threshold=-12dB:ratio=2:attack=0.2:release=1" \
    -codec:a pcm_s16le \
    -y \
    "${td}/${filename}.wav"
```
Some of this is looking familiar. We take in the base `.wav` file, apply loudness normalization with the `loudnorm` parameters we just got out, apply the `acompressor` filter with some reasonably chosen magic numbers (both filters live in a single `-af` chain, since `ffmpeg` only uses the last `-af` when given several), and export it as a `pcm_s16le` (16-bit) `.wav` file, overwriting if there’s something in the way (thanks to the `-y` flag).
Second stage: Amplification #
Now, the just-generated new `.wav` file is passed through analysis again — the same as above — in order to do a final, second pass to normalize the loudness.
```bash
ffmpeg \
    -i "${td}/${filename}.wav" \
    -af "${loudnorm_args}" \
    -codec:a libmp3lame \
    -qscale:a 2 \
    -y \
    "${stem}/${MASTERDIR}/${filename}.mp3"
```
This looks pretty similar, except that we’re just applying the `loudnorm` filter and we’re encoding the output as MP3 with a quality of “2”, which (as far as I can tell) maps to LAME’s VBR quality level 2, a high-quality variable-bitrate setting.
And that’s supposedly it.
Improving #
So, for starters, the other audio needs to be mixed in. In the test file I’m working with, that’s 10 different files — two choirs of four voices each plus two solos. Normally, I just export a “combined” audio file from my notation software, but now we’re doing it by hand, since we want to fiddle around with panning.
Speaking of: panning. Each of the 10 tracks needs to be panned in a unique way. For the new Librarian utility, I want to auto-generate the panning numbers, but for now I can go with just picking numbers out of a hat.
That’s not the whole story, though. The `pan` filter is more powerful than just tweaking a panning knob, but is consequently harder to grasp. So I guess step one is to figure out how to pan an input file.
Panning #
Leaving aside how perception works, to the best of my understanding, panning stereo audio 50% right means leaving the right channel untouched and reducing the left channel by 50%. There may be more Fancy Mathematics™️ involved, but let’s pretend there’s a linear relationship.
Given that, it seems that panning works something like this:

```bash
ffmpeg -i test.wav \
    -af "pan=stereo|FL=0.5*FL|FR=FR" \
    test_pan.wav
```
This just takes the input audio and scales the left channel (`FL`) by 0.5 while leaving the right channel alone. Testing that, it seems to work as expected.
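One way to sanity-check the result, beyond just listening, is `ffmpeg`’s `astats` filter, which prints per-channel statistics. A minimal check on the file from the example above might look like:

```bash
# Print per-channel stats (peak/RMS levels, etc.) for the panned file.
# With FL scaled by 0.5, the left channel should read roughly 6 dB lower.
ffmpeg -i test_pan.wav -af astats -f null /dev/null
```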
Choosing the panning values #
The way I see it, there are a few ways of assigning these values.
- Hardcoding works up to a point, and that point is any time I have more (or fewer) than exactly four tracks (most men’s choir music will be two tenors and two basses).
- Random selection is something I was toying with, mostly because hardcoding is doomed to fail. One could motivate it by saying that it “mimics the way a real choir/orchestra sounds”, but really, it’s just to avoid having to deal with hardcoding.

  The problem here is that while I want the voices panned out in stereo, I still want them reasonably centered — it wouldn’t make sense for some voices to be coming out of just one of the audio channels. That means I can’t do a uniform random distribution over the \([0,1]\) interval.

  I was toying even more with the idea of doing random sampling on a normal distribution with something like \(\mu = 0.5\) and \(\sigma = \frac{1}{5}\), where I’d treat 0.5 as “dead center”, then pan out accordingly. The issue with this approach is that, of course, most samples will lie pretty close to dead center — some may even be dead center. I considered using an offset to push the samples to the left or right, but then the chance of ending up dead center still persists.
- The method I think I’ve settled on is a round-robin approach (there’s a small sketch of it after this list). I create some predefined list of coefficients — maybe \([0.9, 0.85, 0.8, 0.75, \dots, 0.5]\). Then, I sort the audio files lexicographically and iterate over them. The first file gets the first coefficient panned left, the second file gets the first coefficient panned right, the third gets the second coefficient panned left, and so on.

  Not only is this deterministic, it also (probably) preserves a reasonable balance in the mixed audio between the channels. It’ll also lead to “symmetry”, which I’m usually after — if the 1st Bass track is panned 20% left (0.8 coefficient on the right channel), the 2nd Bass track will (most likely) be panned 20% right, since it ought to come next in the sorted list.

  If, for some unholy reason, I were to run out of coefficients, I can just start over with the same list. The odds of this causing some weird audio interference are slim to none. In the final utility, I probably also want some method to exclude one or more tracks from panning — e.g. an accompanying piano or church organ or such — but that’s not a problem for now.
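To make that concrete, here’s a minimal sketch of the round-robin assignment, assuming the source files sit as `.wav` files in the current directory. The coefficient list is just the placeholder one from above, and the `echo` stands in for the eventual `ffmpeg` call.

```bash
# Round-robin pan assignment; purely illustrative.
coefficients=(0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5)

i=0
for file in *.wav; do  # glob expansion is already sorted lexicographically
    coeff="${coefficients[$(( (i / 2) % ${#coefficients[@]} ))]}"
    if (( i % 2 == 0 )); then
        # Even index: pan left, i.e. attenuate the right channel
        pan="pan=stereo|FL=FL|FR=${coeff}*FR"
    else
        # Odd index: pan right, i.e. attenuate the left channel
        pan="pan=stereo|FL=${coeff}*FL|FR=FR"
    fi
    echo "${file}: ${pan}"
    i=$((i + 1))
done
```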
Mixing #
Now that we can pan out the audio tracks, let’s try to mix them together. The `amix` filter seems to be a reasonably straightforward way of doing that.
```bash
ffmpeg -i test_left_pan.wav \
    -i test_right_pan.wav \
    -filter_complex "amix=inputs=2" \
    mixed.wav
```

seems to do the trick.
I also noticed that the `amix` filter has support for setting weights, which means I don’t have to de-gain the “background” tracks as a separate filter — as long as the “hero” track is the first input, all other inputs get the second weight.
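In other words, a “hero” mix could look something like the sketch below. The filenames and the 0.25 weight are placeholders I’d tune by ear; the point is just that the first weight goes to the first input and the second weight to everything else.

```bash
# Hypothetical hero mix: the "hero" voice is the first input at full weight,
# all remaining (panned) inputs share the second weight.
ffmpeg -i tenor1_pan.wav \
    -i tenor2_pan.wav \
    -i bass1_pan.wav \
    -i bass2_pan.wav \
    -filter_complex "amix=inputs=4:weights=1 0.25" \
    tenor1_hero.wav
```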
Order of operations #
Having solved that, I now have a reasonably good idea of what the new script should do — and I need it to be a script; doing this by hand is incredibly tedious.
- Take all the component `.wav` files and export new, panned versions. Make a note of the number of tracks — we’ll need that info for the `amix` filter.
- Do a “straight” mix of all panned tracks into the “general” track.
- Apply the earlier analyze -> compress -> analyze -> normalize chain to the general track.
- Do the “hero” mixes for each (or some subset) of the component files.
- Apply the processing chain to each of those, too.
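Spelled out as a shell skeleton, those steps might look roughly like the sketch below. `pan_track` and `hero_mix` are hypothetical helpers wrapping the `pan` and `amix` commands from earlier, and `master_run_ffmpeg` is the old mastering function, assumed to be sourced.

```bash
#!/usr/bin/env bash
# Very rough sketch of the planned flow; helper names are placeholders.
set -euo pipefail

tracks=(*.wav)
mkdir -p panned hero

# 1. Export panned versions of every component track
for t in "${tracks[@]}"; do
    pan_track "${t}" "panned/${t}"
done

# 2. Straight mix of all panned tracks into the "general" track
inputs=()
for t in "${tracks[@]}"; do
    inputs+=(-i "panned/${t}")
done
ffmpeg "${inputs[@]}" \
    -filter_complex "amix=inputs=${#tracks[@]}" \
    -y general.wav

# 3. Analyze -> compress -> analyze -> normalize the general track
master_run_ffmpeg general.wav

# 4 & 5. Build a "hero" mix per voice, then run the same chain on each
for t in "${tracks[@]}"; do
    hero_mix "${t}" "hero/${t}"
    master_run_ffmpeg "hero/${t}"
done
```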
If I can get that in place, I can do the rest by hand, while continuing to develop the new Librarian.
But not today. It’s taken me most of today to get this far and my brain is mush.