[FFmpeg-devel] [PATCH 1/2] decode: add ff_decode_skip_samples function

Lynne dev at lynne.ee
Sat Nov 4 22:33:46 EET 2023


Nov 4, 2023, 17:22 by derek.buitenhuis at gmail.com:

> Hi,
>
> I'm going to opine a bit here, and also comment on the mov/MP4 patch[0] that accompanies
> this set.
>
> This is for both historical purposes, and to distill IRC logs into something more
> digestible for others on the list to gain context on the issue, so apologies for
> re-treading ground.
>
> On 10/30/2023 5:09 AM, Lynne wrote:
>
>> This is a convenience function, which is required to be called by decoders
>> needing to skip samples every time.
>> It automatically creates and increments side data.
>>
>> The idea is to get rid of skip_samples eventually and replace it with this
>> function.
>>
>
> So there is a lot to cover here, and a lot of nuance to how things should be handled,
> that I want to list out before discussing the specific changes of these two sets of
> patches. A bit of a 'state of the world'.
>
> The goal of this patchset seems to be to aid in gapless playback and correct seeking
> in AAC streams (or later, MDCT-style audio codecs more generally), by properly
> skipping initial priming samples (which include pre-roll), pre-roll (both normal, and extra
> incurred due to SBR) when seeking, and, though not covered in these sets, I'll mention,
> end padding.
>
> First a note on terminology: 'Algorithmic delay' as it is being used here is not quite
> correct, for two reasons:
>  * Latency and pre-roll (or roll distance) are separate things. Opus, for example,
>  can have a latency as low as 2.5ms, but pre-roll is always at least 80ms - they
>  are different things which serve different purposes, and I confirmed this with
>  people who definitely know more about audio than me[1]. Pre-roll is often larger
>  than latency, and the values stored in file metadata reflect this.
>

Algorithmic delay is an industry standard term for those who do codec research,
and is accurate. Algorithmic delay == your definition of latency.
(Latency is not a completely valid term technically, as latency has a blurry definition
outside of real-time situations, which decoding may or may not be.)
It is not equal to pre-roll.

It is a codec-dependent value, defined by the sum of all delay that happens within
each algorithm (block) of a codec.
It is exactly defined to:
 - 1024 samples for AAC (21.3̅ milliseconds), due to the MDCT overlap process.
 - 120 samples for Opus (2.5 milliseconds), due to the MDCT overlap process.
 - 320 samples for Siren (G.719, 6.6̅ milliseconds), due to the MDCT overlap process.
 - Equal to the frame size for Vorbis, due to it having flexible blocks.
 - 1024 samples for ATRAC9 (21.3̅ milliseconds), due to the MDCT overlap process.
 - 256 samples for AC-3 (5.3̅ milliseconds), due to the MDCT overlap process.
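For illustration only, the fixed values above can be written as a lookup. This is a
sketch, not FFmpeg code: the enum and function names are hypothetical stand-ins, not
the real AVCodecID values.

```c
#include <assert.h>

/* Hypothetical codec identifiers for the codecs listed above. */
enum codec {
    CODEC_AAC,
    CODEC_OPUS,
    CODEC_SIREN,   /* G.719 */
    CODEC_ATRAC9,
    CODEC_AC3,
};

/* Returns the codec's algorithmic delay in samples, or a negative
 * value when the delay is stream-dependent (e.g. Vorbis, where it
 * equals the frame size). */
static int algorithmic_delay_samples(enum codec c)
{
    switch (c) {
    case CODEC_AAC:    return 1024;
    case CODEC_OPUS:   return 120;
    case CODEC_SIREN:  return 320;
    case CODEC_ATRAC9: return 1024;
    case CODEC_AC3:    return 256;
    default:           return -1;
    }
}
```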

Quoting from a few official documents:

RFC6716 "Definition of the Opus Audio Codec":
"The inverse MDCT implementation has no special characteristics.  The input is N frequency-domain samples and the output is 2*N time-domain samples, while scaling by 1/2.  A "low-overlap" window reduces the algorithmic delay."

ISO/IEC 14496-3, the bible itself on the very problem we're trying to solve:
"To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms at 48 kHz, it uses a frame length of 512 or 480 samples (compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC)."

G.719:
"The codec operates on 20 ms frames, and the algorithmic delay end-to-end is 40 ms. The encoder input and decoder output are sampled at 48 kHz."
(this one gives the algorithmic delay as the sum of both the decoder and the encoder's delays during real-time operation).


>  * Pre-roll, or roll distance, are the industry standard terms. Making up our own
>  terms because we disagree is silly and stubborn, and makes it harder on API
>  users trying to use the API correctly, or understand our code.
>

You call me silly and stubborn for trying to use a standard term for a mathematically
defined value upon which to base other values (your definition of pre-roll, which contains
the algorithmic delay), so that there is no disagreement in what we talk about.
I defined the algorithmic delay earlier in our discussion and outlined the reason to do so.
And disagreement in what we talk about is exactly what happened.

With that out of the way, I will not stop calling what I have just defined 'algorithmic delay'.
Nor do I understand why, after I have said 'algorithmic delay' so many times, you
keep reading 'pre-roll', and keep trying to interpret every value as pre-roll, even
in the chat log you posted.

Now, let me begin by defining that the number of samples you necessarily have to strip
from the start of any MDCT-based audio codec is exactly equal to its algorithmic delay.
**Including Opus**.
The output audio signal will be timing-accurate compared to the input signal,
as long as the encoder did not add any extra samples.

Continuing, let me define that the exact number of samples that have to be stripped
from any MDCT-based audio codec after a seek is likewise exactly equal to its algorithmic delay.
**Including Opus**, in case the encoder signals that each coarse energy value is
independent and not delta/inter/differentially coded.

End of mathematically defined values.
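As a sketch of what stripping exactly the algorithmic delay looks like in practice
(hypothetical names, not the actual ff_decode_skip_samples implementation): given how
many samples the decoder has already produced, compute how many samples of the current
frame still fall inside the delay and must be dropped.

```c
#include <assert.h>

/* Number of samples of the current frame that still belong to the
 * algorithmic delay. Once 'produced_before' reaches the delay,
 * nothing more is dropped. */
static int samples_to_drop(long produced_before, int frame_samples,
                           int algorithmic_delay)
{
    long remaining = algorithmic_delay - produced_before;
    if (remaining <= 0)
        return 0;
    return remaining < frame_samples ? (int)remaining : frame_samples;
}
```

For AAC (1024-sample frames, 1024-sample delay), this drops exactly the whole first
frame and nothing afterwards.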

In case the encoder signals that coarse energy levels are to be signalled as deltas between
each frame, the number of frames needed varies, as each value will, as a result of a guided
random walk, converge closer to the encoder's value with each frame.
The threshold at which one decides that the value is close enough is SUBJECTIVE, and it also
varies not with time, but with the number of frames. Opus supports variable frame sizes.
Some may generally state that the amount of time needed for convergence at 20 millisecond
frames is 80 milliseconds (4 packets), but if the frame size the encoder uses is 2.5 milliseconds,
then a much lower amount of time would be needed to get a subjectively identical level of
convergence, perhaps 6 frames (15 milliseconds).
I keep saying subjective, as there is no exact mathematical definition for when the coarse
energy levels converge to exactly the levels the encoder has. If a specially crafted stream is
used, this could be guaranteed to never happen, even within what I've been referring to
as a subjective threshold.
This is from someone who knows more about Opus than anyone else except maybe those
who created it.
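To make the frames-versus-time point concrete, purely illustrative arithmetic:

```c
#include <assert.h>

/* Convergence of delta-coded coarse energy is counted in frames, not
 * wall-clock time, so the same number of frames covers much less time
 * when frames are short. */
static double convergence_window_ms(int n_frames, double frame_ms)
{
    return n_frames * frame_ms;
}
```

4 frames of 20 ms span 80 ms, while 6 frames of 2.5 ms span only 15 ms.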

Allow me to define pre-roll with a definition that matches yours:
Pre-roll shall be the number of samples that need to be skipped at the start of decoding,
and it shall not be lower than the algorithmic delay.
Pre-roll shall contain any additional encoder-side induced delay.
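Under this definition, pre-roll can be sketched as follows (hypothetical names; the
2112-sample example is the priming value commonly seen in iTunes-encoded AAC, i.e.
1024 samples of algorithmic delay plus 1088 samples of encoder-added padding):

```c
#include <assert.h>

/* Pre-roll is the algorithmic delay plus any additional encoder-side
 * delay, so it can never be below the algorithmic delay itself. */
static int preroll_samples(int algorithmic_delay, int extra_encoder_delay)
{
    if (extra_encoder_delay < 0)
        extra_encoder_delay = 0;
    return algorithmic_delay + extra_encoder_delay;
}
```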


> Next, a quick breakdown of the AAC situation, in terms of how it is stored,
> what we support, and the state of the ecosystem and types of files that exist:
>  * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
>  the pre-roll. We should not guess priming or end padding, as no matter what we do,
>  it'll be wrong, and any value will be a cargo culted hack value.
>

The number of samples to strip off would be 1024, the algorithmic delay, to match
*every other codec we have*:
the exact, mathematically defined extra audio samples the decoder produces.
I will not accept calling this a "cargo culted hack value". It is, as I demonstrated earlier,
an exactly defined value that you can learn by reading the spec.
It is not a guess, it is a guarantee. It is a guarantee that any encoder which does not
insert any extra samples of padding will meet, and the result will be that, compared to
the input signal pre-encoding, no extra samples will be present at the start.
This is not a fantasy, this is a reality when using our aac encoder.


>  * MP4 - there are two places to store this metadata - one standard, and one proprietary
>  Apple way. There are, separately, two ways to signal priming length when SBR is present.
>

I would like to add that there is a third way - that nothing is signalled. Given what you have
said about MP4 timelines, those require that nothing is stripped off from audio data.
From my point of view, this is a special case, not a general case valid for all containers.


>  * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
>  and end padding data. We support reading only priming data from this at the moment,
>  and we set skip samples based on this. This is 'iTunes style' metadata.
>  * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
>  priming samples, and, optionally, end padding. End padding may also be trimmed by reducing
>  the sample duration of the last packet in the stts box. Pre-roll is stored in the sgpd
>  box with the 'roll' type, which signals the roll distance as a number of packets;
>  for example, -1 indicates you should decode and discard the samples of 1 packet before
>  beginning playback. Notably, this allows the sgpd box to also be used for video like
>  periodic intra refresh H.264. libavformat does not currently parse or export this info,
>  but even if we did, converting number of packets to audio samples can get hairy.
>  * Notably, since in MP4, the edit list represents the exact presentation-level info,
>  when no edit list, or an edit list starting at 0, is present, no samples, not even
>  pre-roll, should be trimmed - all players in the wild handle this properly, and it
>  has been standard practice among streaming services for >10 years to not output
>  the AAC frames representing priming samples at all (even if there is a small
>  quality hit). This is what the patch at [0] is addressing.
>  * My personal opinion is that since priming samples include any inherent delay already,
>  if we do not know how many priming samples there are, we should not trim anything
>  from the start of the file, regardless of format. I am keen on hearing others' Opinions(TM)
>  here, particularly Anton and Martin (sorry for name dropping :)).
>

For MP4-only, I would be inclined to agree. For anything else, I will disagree.
Everyone else, including users, expects no additional samples to be present at the
start, and assumes so.


>  * Further complicating matters is the fact that, again thanks to Apple, there are a lot
>  of broken files around, since iTunes expects files to *not* include additional delay incurred
>  by SBR in their edit list / priming info, even though, by spec, you are supposed to. This
>  leads to the unfortunate case where you have tons of files in the wild that both do, and
>  do not include SBR delay in their edit lists, and there is no way of detecting when this is
>  the case. I do not have a good solution to this other than proposing a flag somewhere to
>  switch between behaviors.
>
> Aside, for Opus, we incorrectly read this info from the codec specific box instead of the sgpd box...
> but we only ever put it in a write-only field.
>
> Now, on to the patches / implementation (opinions warning):
>
>  * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
>  or 'algorithmic delay', included). No other decoders or players in the world do this, to my
>  knowledge, and violating the principle of least surprise because we think we're slightly more
>  correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
>  and we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
>  [0] redundant.
>

I would be willing to accept this for MP4 and maybe ADTS.
But I will not accept failing to strip off samples that are mathematically defined to be
redundant, which we strip off for every other codec with a known amount, for anything else,
if no data is present to indicate how many samples to strip.
Our encoder matches this, and so does every single other audio codec. Just because everyone
else introduces additional padding with their encoder does not mean that it's a guess.
It is a defined *minimum* value that any well-written encoder can meet.


>  * I don't think hardcoding a value of 1024 (or 960) is wise, or necessarily even correct. For example,
>  all HE-AAC MP4s I have signal a seek pre-roll of 2048 samples. Ideally we would pass through seek
>  pre-roll somehow - but I am not sure of a good way in our existing setup - the write-only codecpar
>  field for seek pre-roll is in samples, which is kind of incompatible with the way the sgpd box stores
>  pre-roll as a number of packets. We could set it based on the duration of the first packet(s), assuming
>  all audio codecs will have only 1 packet duration (frame size), or we could add a new field. Possibly,
>  we could leave the hardcoded values in place, only for seeking, inside e.g. ADTS.
>

The value is not hardcoded. If side data is present, the side data will be used.
The value I have put there is equal to the algorithmic delay, which, like I mentioned
earlier, is a minimum.
The encoder may know more, so I would accept any container value as long as it is
larger than the algorithmic delay.
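That precedence could be sketched like so (hypothetical names, not the patch's actual
code): prefer the container/side-data value when it is present and at least the
algorithmic delay, otherwise fall back to the delay itself.

```c
#include <assert.h>

/* Use the signalled value when present and sane; the algorithmic
 * delay is the mathematical minimum and the fallback. */
static int effective_skip(int have_container_value, int container_value,
                          int algorithmic_delay)
{
    if (have_container_value && container_value >= algorithmic_delay)
        return container_value;
    return algorithmic_delay;
}
```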


>  * This kind of brings me to: I don't really think using the same side data for both priming and pre-roll
>  is a great idea. Clearly this set of problems is already confusing enough to many, and conflating the
>  two in the API is a sure way to only make it worse.
>  * If we are going to discard at seek, we need to take this into account in the demuxer's seek code, or
>  we will always be $preroll number of samples off where the user actually wanted to seek and decode.
>

Correct. We have to take this into account while demuxing.


> I am almost certain I missed even more nuance, and hopefully Martin or Anton can chime in, or I remember more.
>
> But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with
> something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing
> to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in,
>

I did agree to this, and you did see it before you sent this.


> it should be able to sit in master for a good long time before being in a release. As I understand it, FATE is
> already unhappy, and it shouldn't be treated as being a problem with FATE vs the set.
>
> Lastly, some time this weekend/week, I will labour to create a more extensive set of AAC files we can use to
> test, and use in FATE.
>


