[FFmpeg-user] FFmpeg AAC encoder produces harsh noise on specific voice

Sun Aug 10 16:06:51 EEST 2025

Thank you for the detailed explanation and suggestions.
Changing either -aac_coder (to fast) or -aac_pns (to disable) significantly
reduces the metallic noise in my tests, so I’ll continue experimenting with
these options. If that still doesn’t give satisfactory results in some
cases, I’ll also try alternative encoders like libfdk_aac as you suggested.
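
For anyone following along, the variants I compared can be listed with a
small script like the one below. It is only a sketch: it prints the commands
instead of running them, the output file names are made up for illustration,
and the libfdk_aac line assumes an FFmpeg build that includes that encoder.

```shell
#!/bin/sh
# Print (do not run) one encode command per variant so the list can be
# reviewed first; pipe the output to `sh` to actually encode.
# Assumptions: input2.wav is the sample from this thread, the output names
# are illustrative, and libfdk_aac is only present in builds that enable it.
src=input2.wav
rate=128k

variants() {
  echo "ffmpeg -i $src -c:a aac -aac_coder twoloop -b:a $rate out_twoloop_$rate.m4a"
  echo "ffmpeg -i $src -c:a aac -aac_coder fast -b:a $rate out_fast_$rate.m4a"
  echo "ffmpeg -i $src -c:a aac -aac_pns 0 -b:a $rate out_nopns_$rate.m4a"
  echo "ffmpeg -i $src -c:a libfdk_aac -b:a $rate out_fdk_$rate.m4a"
}

variants
```

The coder is spelled out in the first line so the baseline does not depend
on whatever default a given FFmpeg version uses.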

I also hope posting on both the mailing list and code.ffmpeg.org hasn’t
caused any inconvenience — I only recently joined and wasn’t aware both
were still active.

On Sun, Aug 10, 2025 at 05:45 Anton Kapela <tkapela at gmail.com> wrote:

> This phoneme in particular will probably always encode poorly on simplistic
> AAC implementations like libavcodec. Why? More on that later.
>
> Others already suggested libfdk_aac - and after testing that coder with
> your samples, it's definitely the right choice (i.e., it sounds fine to me
> at 128k).
> More deets on the Fraunhofer FDK AAC coder here:
> https://ffmpeg.org/ffmpeg-codecs.html#libfdk_005faac - and a sample of its
> output at 128k using your "input2" source, attached.
>
> It's clear you've hit one of the many poorly handled corner cases of this
> AAC implementation. If you're curious why, read on.
>
> ---
>
> First, I'd recommend some experimentation: toggling the coder models
> available ("aac_coder"), and then also toggle aac_pns, aac_tns, aac_ltp;
> listen for whether the character of the error changes. Details here:
> https://ffmpeg.org/ffmpeg-codecs.html#aac
>
> As to why this signal is so badly represented by "twoloop", we need to
> look at the signal itself and understand what it represents.
> Interestingly, this particular sound has a relatively simple time-domain
> character but is rather complex in the frequency domain. What we have here
> is a textbook example of a cyclostationary process
> (https://en.wikipedia.org/wiki/Cyclostationary_process) mixed with a flavor
> of frequency comb (https://en.wikipedia.org/wiki/Frequency_comb). Taken
> together, these present a unique problem for any block-based MDCT codec:
> coherently describing the subtle time-domain components of a strongly
> modulated signal in a purely block-based, frequency-transformed domain.
>
> Let's examine this signal's major features, looking at "input2" here, since
> it's the longest and simplest example in your set:
>
> -the formant pitch is ~274 Hz
> -an in-phase high frequency burst occurs at *half* that frequency: about
> 137 bursts/sec, roughly one every 7.3 msec
> -the modulated burst is "ringing" around 4700 Hz
> -the formant and harmonics have a slow downwards frequency drift, along
> with short-term trills and warble
>
> This all adds up to create a situation in which high frequency bands are
> "sparse" in an absolute energy sense (relative to the formant pitch), but
> which present ever-so-slight differences over short time scales (block
> lengths, even if dynamic, will never be in-phase with the signal features).
> These differences prevent the twoloop algorithm from making
> *consistent*-sounding decisions, which is why we hear swish/flutter/chirpy
> noises at almost any rate for signals of this type. Important decisions
> like "is this part of the signal a transient?", "do these coefficients
> contain enough entropy to matter?", or "should we substitute noise?" will
> radically alter the character of the reproduced signal, especially over
> the course of the signal's evolution.
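
To put rough numbers on the block-length point, here is a small sketch that
compares the burst rate quoted above with AAC-LC block durations. The 48000 Hz
sample rate is an assumption (the thread does not state it); the 1024-sample
long-block and 128-sample short-block advances are the standard AAC-LC values.
The function name is made up for illustration.

```shell
#!/bin/sh
# Compare the burst period with AAC-LC block durations.
# Assumptions: sample rate of 48000 Hz (not stated in the thread); each long
# block advances by 1024 samples and each short block by 128 samples.
# Usage: burst_stats <sample_rate_hz> <bursts_per_second>
burst_stats() {
  awk -v sr="$1" -v bursts="$2" 'BEGIN {
    printf "burst period: %.2f ms\n", 1000 / bursts
    printf "long block (1024): %.2f ms, %.2f bursts per block\n", 1024000 / sr, bursts * 1024 / sr
    printf "short block (128): %.2f ms, %.2f bursts per block\n", 128000 / sr, bursts * 128 / sr
  }'
}

burst_stats 48000 137
```

Under these assumptions it reports a burst period of about 7.30 ms against a
21.33 ms long block (about 2.92 bursts per block) and a 2.67 ms short block
(about 0.37 bursts per block). Neither ratio is a whole number, so the bursts
land at a different position in each successive block.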
>
> Why? Well, “twoloop” in FFmpeg’s native AAC encoder is a classic
> rate–distortion search and quantizer allocation scheme. It optimizes
> scalefactors per codebook, and across bands (two nested loops), on top of
> FFmpeg’s psychoacoustic masking model. It then employs the usual AAC tools
> (block switching, M/S and intensity stereo, PNS, and TNS) in its RD loop.
> It does not implement high-band envelope detection nor cross-band “carrier
> vs. envelope” tools like SBR/PS, or like we find in AC3. In contrast,
> libfdk-aac does—and employs a more complete hybrid, contextual
> psychoacoustic masking and ATH model. It also has support for the usual,
> more complex AAC profiles (HE-AAC v1/v2, ELD/LD), including an
> “afterburner” analysis-by-synthesis refinement. Even when the HE-AAC v2
> options are not in use, FDK still applies the refinements that those modes
> rely on, including in LC operation.
>
> For comparison, I attached some 128k, 64k, and 48k AC3 encodes. You'll
> hear how even this stone-age codec makes better decisions, and degrades
> more gracefully, than the current twoloop AAC RD algorithm. Here, the
> major contributing factor in AC3's ability to code this signal "better"
> than twoloop AAC lies in its explicit use of "carrier precombination"
> (read:
> https://www.fast-and-wide.com/images/stories/White_papers/ac3_multichannel_decoder.pdf
> ), which nicely handles cases like yours. This is possible by separating
> the subband "carrier" signal from its "envelope" after input decomposition
> by the filterbank. This has the audible effect of preserving interrelated
> time-domain features of the ~137 "tone bursts" per second in your sample,
> while still providing coding gain vs. the source PCM.
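
A comparison like this can probably be reproduced with FFmpeg's own ac3
encoder. The sketch below only prints the commands; the source file and
output names are assumptions for illustration, and whether the lowest rates
are accepted may depend on the channel layout of the input.

```shell
#!/bin/sh
# Print one AC3 encode command per bitrate mentioned in the comparison.
# Assumptions: input2.wav as the source; output names are illustrative.
ac3_cmds() {
  for rate in 128k 64k 48k; do
    echo "ffmpeg -i input2.wav -c:a ac3 -b:a $rate input2_ac3_$rate.ac3"
  done
}

ac3_cmds
```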
>
> HTH,
>
> -Tk
>
>
> On Sat, Aug 9, 2025 at 4:26 AM Agent 45 <jackatmg at gmail.com> wrote:
>
> > Hello, FFmpeg team,
> >
> > I'm encountering a consistent issue when encoding voice with the FFmpeg
> > AAC encoder.
> > At low and medium bitrates, the encoded output contains noticeable and
> > sometimes harsh noise on specific vocals.
> > The noise gradually decreases as the bitrate increases.
> >
> > I’ve attached all files (input and encoded outputs).
> > Here are the commands used, ffmpeg version 7.1.1:
> >
> > ffmpeg -i input1.wav -c:a aac -b:a 128k output1_128k.m4a
> > ffmpeg -i input2.wav -c:a aac -b:a 128k output2_128k.m4a
> > ffmpeg -i input3.wav -c:a aac -b:a 128k output3_128k.m4a
> >
> > ffmpeg -i input1.wav -c:a aac -b:a 192k output1_192k.m4a
> > ffmpeg -i input2.wav -c:a aac -b:a 192k output2_192k.m4a
> > ffmpeg -i input3.wav -c:a aac -b:a 192k output3_192k.m4a
> >
> > ffmpeg -i input1.wav -c:a aac -b:a 256k output1_256k.m4a
> > ffmpeg -i input2.wav -c:a aac -b:a 256k output2_256k.m4a
> > ffmpeg -i input3.wav -c:a aac -b:a 256k output3_256k.m4a
> >
> > # Observations:
> >
> > - All 128k versions contain harsh noise, and the result is almost the
> > same if the bitrate is increased to 160k
> >
> > - `output1_192k.m4a`: noise at around 0.27s
> > - `output2_192k.m4a`: No obvious noise detected
> > - `output3_192k.m4a`: Mild noise at around 0.05s and some noise still
> > present from 0.3s
> >
> > - `output1_256k.m4a`: noise at around 0.27s
> > - `output2_256k.m4a`: No obvious noise detected
> > - `output3_256k.m4a`: Mild noise at around 0.05s
> >
> > - No noise detected when increased to 320k
> > _______________________________________________
> > ffmpeg-user mailing list
> > ffmpeg-user at ffmpeg.org
> > https://ffmpeg.org/mailman/listinfo/ffmpeg-user
> >
> > To unsubscribe, visit link above, or email
> > ffmpeg-user-request at ffmpeg.org with subject "unsubscribe".
> >