[FFmpeg-devel] [PATCH 1/2] decode: add ff_decode_skip_samples function

Sun Nov 5 01:02:26 EET 2023

Hi,

Just following up on this - I'm sorry I haven't been able to look at the 
proposed patchset myself quite in detail yet.

My prime concern is about the requests to have this merged into the 
upcoming 6.1 release; that's way too soon IMO.

These patches do change aspects of how these things behave, that have been 
working the same for a very long time, so there are all sorts of potential 
subtle breakage, or (incorrect or not) assumptions being broken, across 
libavcodec and its users.

On Sat, 4 Nov 2023, Derek Buitenhuis wrote:

> Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
> what we support, and the state of the ecosystem and types of files that exist:
>    * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
>      the pre-roll. We should not guess priming or end padding, as no matter what we do,
>      it'll be wrong, and any value will be a cargo culted hack value.

I share this concern; all the various encoders I've seen have used a 
different amount of priming samples, so guessing it will be bound to be 
wrong in a lot of the cases.

>    * MP4 - there are two places to store this metadata - one standard, and one proprietary
>      Apple way. There are, separately, two ways to signal priming length when SBR is present.
>       * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
>         and end padding data. We support reading only priming data from this at the moment,
>         and we set skip samples based on this. This is 'iTunes style' metadata.
>       * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
>         priming samples, and, opionally end padding. End padding may also be trimmed by reducing
>         the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
>         box with the 'roll', type, which signals the roll distance as a number of packets;
>         for example, -1 indicates you should decode an discard the samples of 1 packet before
>         beginning plaback. Notably, this allows the sgpd box to also be use for video like
>         periodic intra refresh H.264. libavformat does not current parse or export this info,
>         but even if we did, converting number of packets to audio samples can get hairy.
>           * Notably, since in MP4, the edit list represents the exact presentation-level info,
>             when no edit list, or an edit list startiing at 0 is present, no samples, not even
>             pre-roll should be trimmed - all players in the wild handle this properly, and it
>             has been standard practice among streaming services for >10 years to not output
>             the AAC frames representing priming samples at all (even if there is a small hit
>             quality). This is what the patch at [0] is addressing.

FWIW, MP4 isn't the only container where this might be relevant; AAC is 
frequently used in muxes together with video in FLV and MKV and others as 
well.

In the case of FLV, I'm not aware of any metadata that signals how much to 
trim off, so essentially we can't do it by guessing. On the producing 
side, this is handled by shifting the timestamps so the audio track, which 
would be starting at -<delay>, ends up starting at 0, and the video track 
ends up starting at +<audiodelay> instead.

In that case, if we trim off the priming samples (based on a guess as 
that's all we have?), I guess that'd lead us to both tracks starting at 
+<delay> (i.e. not affecting sync). As long as it doesn't change sync, I 
guess it can be tolerable.

To avoid all these effects, producers of muxed files can work around this 
in many ways. For many years, I've been doing the trick of skipping the 
first <delay> samples of input to the audio encoder, so that after 
accounting for that, I have both audio and video tracks starting at 0.0, 
without the decoder needing to do anything - working the same across all 
players, good and bad.

If we suddenly start decoding such files with the audio track starting at 
+<delay>, I guess it'll be ok for sync, but it's a mildly surprising 
change, but hopefully any reasonable player based on libavcodec would 
still not freak out by it.

>    * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
>      or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
>      knowledge, and violating the principal of least surpise because we think we're slightly more
>      correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
>      an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
>      [0] redundant.

Yes, with raw ADTS there's really no good way of getting this right, other 
than plain guessing, and there's no single universally correct guess 
AFAIK.

(And even if we have a qualified guess for the amount of encoder priming, 
we have even less knowledge about how much to trim off at the end, if 
we're aiming at proper gapless playback.)

For MP4 there's at least a couple ways of signalling it.

> But also, given all of this, I think we need to deeply consider how we 
> approach this, so we don't end up with something that only covers 
> certain cases (and I am sure I forgot more cases). To that end, I do not 
> think rushing to get a patchset that can change sync on all AAC files in 
> existence into 6.1 is wise. Even when this does go in, it should be able 
> to sit in master for a good long time before being in a release.

+1. This has the potential to be surprising in many different cases, and 
may need a bunch of follow up patches to sort out cases found later. It 
definitely should sit in git master for a some time before ending up in a 
release - not be slipped into 6.1 the week before the release.

// Martin