[FFmpeg-user] FFmpeg AAC encoder produces harsh noise on specific voice

Sun Aug 10 00:45:25 EEST 2025

This phoneme in particular will probably always encode poorly on simplistic
AAC implementations like libavcodec. Why? More on that later.

Others already suggested libfdk_aac - and after testing that encoder with
your samples, it's definitely the right choice (i.e. it sounds fine to me at
128k). More details on the Fraunhofer FDK AAC encoder here:
https://ffmpeg.org/ffmpeg-codecs.html#libfdk_005faac - and a sample of its
output at 128k using your "input2" source is attached.
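
For anyone who wants to repeat the FDK test, here is a minimal sketch. It
assumes an ffmpeg build configured with --enable-libfdk-aac (GPL builds
also need --enable-nonfree); the output file name is hypothetical, and this
is not necessarily the exact command used for the attached sample. The
helper only prints the command, so nothing is encoded until you run it.

```shell
#!/bin/sh
# Print (not run) a libfdk_aac encode command for one input file.
# The output name pattern (<input>_fdk_<rate>.m4a) is made up for this sketch.
fdk_cmd() {
    in=$1; rate=${2:-128k}
    printf 'ffmpeg -i %s -c:a libfdk_aac -b:a %s %s\n' \
        "$in" "$rate" "${in%.wav}_fdk_${rate}.m4a"
}

# True only if ffmpeg is installed and lists the libfdk_aac encoder.
has_fdk() {
    command -v ffmpeg >/dev/null 2>&1 &&
        ffmpeg -hide_banner -encoders 2>/dev/null | grep -q libfdk_aac
}

if has_fdk; then
    fdk_cmd input2.wav 128k
else
    echo "this ffmpeg build has no libfdk_aac encoder" >&2
fi
```

Pipe the printed line to `sh` to actually encode.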

It's clear you've hit one of the many poorly handled corner cases of this
AAC implementation. If you're curious why, read on.

---

First, I'd recommend some experimentation: toggle the coder models
available ("aac_coder"), then also toggle aac_pns, aac_tns, and aac_ltp, and
listen for whether the character of the error changes. Details here:
https://ffmpeg.org/ffmpeg-codecs.html#aac
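
One way to organize that experiment is sketched below. It prints one
command per combination of -aac_coder (twoloop, fast), -aac_pns and
-aac_tns; the output names are hypothetical. I left aac_ltp out of the
sketch because I'm not sure every ffmpeg version still accepts it; add it
by hand if your build does.

```shell
#!/bin/sh
# Print one native-AAC encode command per coder / PNS / TNS combination.
# Nothing is encoded here; pipe the output to `sh` to run the commands.
matrix() {
    in=$1; rate=${2:-128k}
    for coder in twoloop fast; do
        for pns in 0 1; do
            for tns in 0 1; do
                # Hypothetical naming scheme so the variants are easy to A/B.
                out="${in%.wav}_${coder}_pns${pns}_tns${tns}.m4a"
                printf 'ffmpeg -y -i %s -c:a aac -b:a %s -aac_coder %s -aac_pns %s -aac_tns %s %s\n' \
                    "$in" "$rate" "$coder" "$pns" "$tns" "$out"
            done
        done
    done
}

matrix input2.wav 128k
```

That gives eight files per input at one bitrate, which is still few enough
to compare by ear.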

As to why this signal is so badly represented by "twoloop", we need to
actually look at the signal we've encountered and understand what
it represents. Interestingly, this particular sound presents a relatively
simple time domain character, but is rather complex in the frequency
domain. What we have here is a textbook example of:
https://en.wikipedia.org/wiki/Cyclostationary_process - mixed with a flavor
of https://en.wikipedia.org/wiki/Frequency_comb - which, taken together,
present a unique problem for any block based MDCT codec scheme: to
coherently describe the subtle time domain components of a strongly
modulated signal, in a purely block-based frequency transformed domain.

Let's examine this signal's major features, looking at "input2" here, since
it's the longest and simplest example in your set:

-the formant pitch is ~274 Hz
-an in-phase high frequency burst occurs at *half* that frequency - around
137 bursts/sec, roughly one every 7.3 msec
-the modulated burst is "ringing" around 4700 Hz
-the formant and harmonics have a slow downwards frequency drift, along
with short-term trills and warble
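
To put those rates next to the codec's block sizes, here is a small
arithmetic sketch. The 274 Hz, 137/sec and 4700 Hz figures are from the
list above; the 48000 Hz sample rate is my assumption (the thread doesn't
state it), and 1024/128 samples are the standard AAC-LC long/short block
lengths.

```shell
#!/bin/sh
# period_ms RATE_HZ -> duration of one cycle, in milliseconds
period_ms() { awk -v f="$1" 'BEGIN { printf "%.2f\n", 1000 / f }'; }

# block_ms SAMPLES SAMPLE_RATE -> duration of one block, in milliseconds
block_ms() { awk -v n="$1" -v sr="$2" 'BEGIN { printf "%.2f\n", 1000 * n / sr }'; }

# bursts_per_block SAMPLES SAMPLE_RATE BURSTS_PER_SEC -> bursts in one block
bursts_per_block() {
    awk -v n="$1" -v sr="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * r / sr }'
}

SR=48000   # assumed sample rate; not stated anywhere in the thread

echo "pitch period (274 Hz):     $(period_ms 274) ms"
echo "burst period (137/sec):    $(period_ms 137) ms"
echo "ringing period (4700 Hz):  $(period_ms 4700) ms"
echo "long block (1024 samples): $(block_ms 1024 "$SR") ms"
echo "short block (128 samples): $(block_ms 128 "$SR") ms"
echo "bursts per long block:     $(bursts_per_block 1024 "$SR" 137)"
```

Under that assumption a long block holds just under three bursts, and not
a whole number of them, so the burst positions keep sliding relative to
the block boundaries from one frame to the next.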

This all adds up to create a situation in which high frequency bands are
"sparse" in an absolute energy sense (relative to the formant pitch), but
which present ever-so-slight differences over short time scales (block
lengths, even if dynamic, will never be in-phase with the signal features).
These differences prevent the twoloop algorithm from making
*consistent*-sounding decisions, which is why we hear swish/flutter/chirpy
noises at almost any rate for signals of this type. Important decisions
like "is this part of the signal a transient?", "do these coefficients
contain enough entropy to matter?", or "should we substitute noise?" will
radically alter the character of the reproduced signal, especially over the
course of the signal's evolution.
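
If you'd rather see this than only hear it, a spectrogram of the source
next to one of the encodes should make the unstable high band visible. A
minimal sketch using the showspectrumpic filter; the image size and the
.png names are arbitrary choices of mine, and the input names are the ones
from your commands.

```shell
#!/bin/sh
# spec_cmd FILE -> print an ffmpeg command that renders FILE's spectrogram
# to a PNG with the same base name (e.g. input2.wav -> input2.png).
spec_cmd() {
    in=$1
    printf 'ffmpeg -y -i %s -lavfi showspectrumpic=s=1024x512 %s.png\n' \
        "$in" "${in%.*}"
}

spec_cmd input2.wav          # the source
spec_cmd output2_128k.m4a    # the native-AAC encode from the original post
```

Pipe the output to `sh`, then flip between the two images and look at the
region around the ringing frequency.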

Why? Well, “twoloop” in FFmpeg’s native AAC encoder is a classic
rate–distortion search and quantizer allocation scheme. It optimizes
scalefactors per codebook, and across bands (two nested loops), on top of
FFmpeg’s psychoacoustic masking model. It then employs the usual AAC tools
(block switching, M/S and intensity stereo, PNS, and TNS) in its RD loop.
It implements neither high-band envelope detection nor cross-band “carrier
vs. envelope” tools like SBR/PS, or like those we find in AC3. In contrast,
libfdk-aac does, and it employs a more complete hybrid, contextual
psychoacoustic masking and ATH model. It also supports the usual, more
complex AAC profiles (HE-AAC v1/v2, ELD/LD) and includes an “afterburner”
analysis-by-synthesis refinement. Even if one isn't using the HE-AAC v2
options, FDK still employs the various refinements needed for the fancy
stuff in LC operation.
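
To hear what those FDK features contribute, something like the following
could be tried. It is only a sketch: it assumes a build with libfdk_aac,
uses that wrapper's -afterburner and -profile:a aac_he options, and the
bitrates and output names are my own picks, not settings anyone in this
thread has tested.

```shell
#!/bin/sh
# Print libfdk_aac commands for three variants of one input:
# LC with the afterburner off, LC with it on, and HE-AAC v1 (LC + SBR).
fdk_variants() {
    in=$1; base=${in%.wav}
    for ab in 0 1; do
        printf 'ffmpeg -y -i %s -c:a libfdk_aac -b:a 128k -afterburner %s %s_lc_ab%s.m4a\n' \
            "$in" "$ab" "$base" "$ab"
    done
    # HE-AAC is meant for lower rates, so 64k is used here (an arbitrary pick).
    printf 'ffmpeg -y -i %s -c:a libfdk_aac -profile:a aac_he -b:a 64k %s_he_64k.m4a\n' \
        "$in" "$base"
}

fdk_variants input2.wav
```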

For comparison, I attached some 128k, 64k, and 48k AC3 encodes - you'll
hear how even this stone-age codec scheme makes better decisions, and
degrades more gracefully, than the current twoloop AAC RD algorithm. Here,
the major contributing factor in AC3's ability to code this signal "better"
than twoloop AAC lies in its explicit use of "carrier precombination"
(read:
https://www.fast-and-wide.com/images/stories/White_papers/ac3_multichannel_decoder.pdf),
which nicely handles cases like yours. It works by separating the
subband "carrier" signal from its "envelope" after input decomposition by
the filterbank. This has the audible effect of preserving interrelated
time-domain features of the ~137 "tone bursts" per second in your sample,
while still providing coding gain vs. the source PCM.
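
The AC3 comparison can be reproduced along these lines with ffmpeg's
native ac3 encoder. The output names follow the attached files
(input2-128.ac3 and so on), but take the commands as a sketch rather than
the exact invocations behind those attachments.

```shell
#!/bin/sh
# Print ac3 encode commands for one input at 128k, 64k and 48k.
ac3_cmds() {
    in=$1; base=${in%.wav}
    for kbps in 128 64 48; do
        printf 'ffmpeg -y -i %s -c:a ac3 -b:a %sk %s-%s.ac3\n' \
            "$in" "$kbps" "$base" "$kbps"
    done
}

ac3_cmds input2.wav
```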

HTH,

-Tk

On Sat, Aug 9, 2025 at 4:26 AM Agent 45 <jackatmg at gmail.com> wrote:

> Hello, FFmpeg team,
>
> I'm encountering a consistent issue when encoding voice with FFmpeg AAC
> encoder.
> At low and medium bitrates the encoded output contains noticeable and
> sometimes harsh noise when encoding specific vocals.
> These noise gradually reduce as the bitrate increases.
>
> I’ve attached all files (input and encoded outputs).
> Here are the commands used, ffmpeg version 7.1.1:
>
> ffmpeg -i input1.wav -c:a aac -b:a 128k output1_128k.m4a
> ffmpeg -i input2.wav -c:a aac -b:a 128k output2_128k.m4a
> ffmpeg -i input3.wav -c:a aac -b:a 128k output3_128k.m4a
>
> ffmpeg -i input1.wav -c:a aac -b:a 192k output1_192k.m4a
> ffmpeg -i input2.wav -c:a aac -b:a 192k output2_192k.m4a
> ffmpeg -i input3.wav -c:a aac -b:a 192k output3_192k.m4a
>
> ffmpeg -i input1.wav -c:a aac -b:a 256k output1_256k.m4a
> ffmpeg -i input2.wav -c:a aac -b:a 256k output2_256k.m4a
> ffmpeg -i input3.wav -c:a aac -b:a 256k output3_256k.m4a
>
> # Observations:
>
> - All 128k versions contain harsh noise, and almost the same if increase
> the bitrate to 160k
>
> - `output1_192k.m4a`: noise at around 0.27s
> - `output2_192k.m4a`: No obvious noise detected
> - `output3_192k.m4a`: Mild noise at around 0.05s and some noise still
> present from 0.3s
>
> - `output1_256k.m4a`: noise at around 0.27s
> - `output2_256k.m4a`: No obvious noise detected
> - `output3_256k.m4a`: Mild noise around at around 0.05s
>
> - No noise detected when increased to 320k
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input2-64.ac3
Type: audio/ac3
Size: 4352 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-user/attachments/20250809/5375eb59/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input2-128-fd.aac
Type: audio/aac
Size: 9798 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-user/attachments/20250809/5375eb59/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input2-128.ac3
Type: audio/ac3
Size: 8704 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-user/attachments/20250809/5375eb59/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input2-48.ac3
Type: audio/ac3
Size: 3264 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-user/attachments/20250809/5375eb59/attachment-0003.bin>