[FFmpeg-user] Towards better trims & concatenations

Tue Jan 9 01:01:48 EET 2024

On 1/8/24 08:08, Rob Hallam wrote:
> On Mon, 8 Jan 2024 at 12:37, Mark Filipak <markfilipak.imdb at gmail.com> wrote:
>>
>> On 1/8/24 07:16, Rob Hallam wrote:
>>> On Mon, 8 Jan 2024 at 12:07, Mark Filipak <markfilipak.imdb at gmail.com> wrote:
>>>
>>>> For example, if 'v' (video) and 'a' (audio) packets go from
>>>> v-a-a-a-a-v-a-a-a-a-v... to
>>>> a-a-a-a-a-a-a-a-v-v-v..., then something's wrong, eh? That's the kind of difference I'm seeing
>>>> between the two versions of 01.mp4.
>>>
>>> Forgive me for jumping in in the middle here, but is that strictly
>>> true?

Is what true? Is it true that the audio packets are bunched up, out of time sequence, and pushed to 
the front? Yes, it's true. That's why the MPV player has difficulty and doesn't start at 
00:00:00.000. Part of that problem is that, for some unknown reason, ffmpeg creates one time_base 
for frame packets and a different time_base for audio packets. It seems to me that that's just 
looking for trouble.
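
As an illustration of the time_base point, here is a minimal Python sketch. The two time bases (1/30000 
for video, 1/48000 for audio) and the tick values are made-up examples, not values read from 01.mp4. 
It only shows that timestamps from streams with different time bases can't be compared until each is 
converted to seconds.

```python
from fractions import Fraction

# Hypothetical time bases; a real file may use different ones.
VIDEO_TB = Fraction(1, 30000)
AUDIO_TB = Fraction(1, 48000)

def to_seconds(ticks: int, time_base: Fraction) -> Fraction:
    """Convert a timestamp counted in time_base ticks to exact seconds."""
    return ticks * time_base

video_pts = 1001   # one frame at 30000/1001 fps, counted in video ticks
audio_pts = 1024   # one 1024-sample audio frame at 48 kHz, counted in audio ticks

video_s = to_seconds(video_pts, VIDEO_TB)
audio_s = to_seconds(audio_pts, AUDIO_TB)

# The raw tick counts look similar, but the times they represent differ.
print(float(video_s))
print(float(audio_s))
```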

>>> Honest question, perhaps the spec says that they should be
>>> identical.

There is no spec that defines how to trim and concatenate.

>> Sorry, I don't understand you. Are you asking if I'm lying? I doubt it, but I don't know the
>> antecedent of "that". Also, when you wrote "the spec", what spec did you have in mind?
> 
> For clarity, I wasn't accusing you of lying...

For clarity, I didn't think you were, and said so.

>... and it certainly wasn't my
> intention to imply that; my apologies if it sounded that way!
> 
> The 'that' in the above-quoted case was your example of packets-
> clearly they are ordered differently, something has changed and
> perhaps it shouldn't have changed.

There is no 'perhaps' about it.

> I wondered if there was a practical
> difference; to go back to the multiplication example, if you get 120
> either way, does it matter if you do 3*4*10 versus 10*3*4 ? Sometimes
> it does matter -- like in cases of floating-point maths -- but  I am
> wondering if ffmpeg here is producing something that appears different
> but looks and sounds the same.

I address this further down.

> I didn't have a particular spec in mind, but candidates would be
> ffmpeg specs...

FFmpeg has specs? I'd surely like to see them.

> ...and/or specs for the container and codec formats in use-
> ie does this behaviour contradict those.

I parse VOBs. I don't know the structures of M2TSs or MP4s or MKVs or anything else. But they all 
work off packet headers (e.g., PESs (packetized elementary streams)) that contain the structure and 
the settings that made the packet's payload what it is. There's no usage spec. Packet headers 
contain DTS, PTS, DAR, width, height, etc. Packet headers don't 'specify' how applications should 
create and maintain a valid packet table, nor do they specify packet table access methods. The specs 
just show structure. The MPEG-2 systems spec (H.222.0) goes a little further when it describes a virtual 
decoder machine for MPEG transport streams. That machine is a simple outline of how DTS & PTS work to 
render time-ordered presentations from packets that are received out of time order. Illustrating such 
a small aspect of such a large procedure is like illustrating how the sun works by lighting a match. 
It's an important part, and the decoder model is good as far as it goes, but the rest is left up to 
the application and the specification is silent about that.
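
The DTS & PTS idea can be sketched in a few lines of Python. The packet list is an invented example 
of a tiny group of pictures (I, P, B, B) with timestamps in arbitrary ticks; it is not taken from any 
real file. Packets arrive in DTS (decode) order and are shown in PTS (presentation) order.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    kind: str  # 'I', 'P' or 'B'
    dts: int   # decode timestamp, arbitrary ticks
    pts: int   # presentation timestamp, arbitrary ticks

# Invented example, listed in the order the packets are received.
received = [
    Packet("I", dts=0, pts=1),
    Packet("P", dts=1, pts=4),
    Packet("B", dts=2, pts=2),
    Packet("B", dts=3, pts=3),
]

decode_order = sorted(received, key=lambda p: p.dts)
presentation_order = sorted(received, key=lambda p: p.pts)

print("decode: ", "".join(p.kind for p in decode_order))
print("present:", "".join(p.kind for p in presentation_order))
```

The P-frame is decoded before the two B-frames that depend on it, but it is presented after them.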

>>> In much the same way a*b*c is equivalent to b*a*c, does the order of
>>> packets necessarily matter if the output is perceptually the same?

Yes, time order matters. If two videos are perceptually the same, then they're the same; they have 
the same internals. You can't move frames or audio samples around without it being perceived. Things 
can get so bad that players drop packets. Is that perceivable? Yes, at some level of probing, it is.

The frames and samples and chapters and subtitles are Legos. If you take the peak of a Lego building 
off and stick it onto the side of the building, is that perceivable?

This is not brain surgery. It's Legos.

Oh, I think I see your difficulty, Rob. "a*b*c" happens at one instant. It doesn't matter in 
what order the multiplication happens because it's all in a single instant. With video frames, order 
matters. Frames are separated in time -- out of order is visible.

>> The packets are in PTS order. Does the order of the packets matter? No, it's the order of the PTSs
>> that matters.
>>
>>> If the output is not perceptually the same, or there are timing issues
>>> / desync / other problems as a result then I can see that being a
>>> potentially important bug.
>>
>> The MPV player misbehaves for all 6 of the sons. The starting running time is not "00:00:00.000".
> 
> Does it matter that the starting running time is not "00:00:00.000" ?

Yes.

> I presume it does, otherwise you might not be raising this issue; but
> in my ignorance I can see the possibility that the reported starting
> running time is a 'cosmetic' issue rather than a functional one.

Trimming errors are wrecking concatenations. If DTSs & PTSs aren't smooth and continuous at the 
join, bad things happen. By that I don't mean that packets have to be in PTS order. They are about 
half the time, and PTS-DTS varies between I-frames, P-frames, and B-frames in order to allow the 
decoder time to decode and do the interframe correlations -- motion vectors and all that stuff. But 
the trimming has to take PTS into account so that the cut happens in the right spot with no leftover 
packets that shouldn't be there. That apparently isn't happening, and I have the proof.
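
A rough way to look for a bad join is to check that DTS advances by a constant step across it. The 
Python sketch below assumes constant-frame-rate video, and the DTS values are invented for 
illustration. find_dts_problems is a hypothetical helper, not part of ffmpeg.

```python
def find_dts_problems(dts, expected_step):
    """Return (index, previous, current) for every DTS step that is not expected_step."""
    problems = []
    for i in range(1, len(dts)):
        if dts[i] - dts[i - 1] != expected_step:
            problems.append((i, dts[i - 1], dts[i]))
    return problems

# Invented DTS values in ticks; one frame is assumed to last 1001 ticks.
clip_a = [0, 1001, 2002, 3003]
clip_b_good = [4004, 5005, 6006]
clip_b_bad = [3003, 4004, 5005]   # leftover packet: repeats the last DTS of clip_a

print(find_dts_problems(clip_a + clip_b_good, 1001))  # []
print(find_dts_problems(clip_a + clip_b_bad, 1001))   # [(4, 3003, 3003)]
```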

> I am
> happy to be corrected and educated, which is partly why I am still
> subscribed to this ML.
> 
> I ask these questions because "ffmpeg produces output that plays
> incorrectly" is a different bug to "ffmpeg produces output that plays
> correctly but has a different file structure". Both are bugs, but it's
> worth being clear to which one the issues have identified belong so
> you and devs are on the same page.

To state it clearly, Rob: two MP4s, for example, that both play correctly have the same structure. If 
one of them has a different structure, then one of them does not play correctly, and that can be seen 
and/or heard. v-a-a-a-a-v-a-a-a-a-v versus a-a-a-a-a-a-a-a-v-v-v is my poor portrayal of such a 
difference that I am actually seeing.
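
One way to put a number on that difference is the longest run of consecutive packets from the same 
stream. This Python sketch uses the two patterns above. The run-length measure is only an 
illustrative assumption, not something ffmpeg or any player reports.

```python
from itertools import groupby

def longest_run(pattern: str) -> int:
    """Length of the longest run of consecutive packets from the same stream."""
    kinds = pattern.split("-")
    return max(len(list(group)) for _, group in groupby(kinds))

interleaved = "v-a-a-a-a-v-a-a-a-a-v"
bunched = "a-a-a-a-a-a-a-a-v-v-v"

print(longest_run(interleaved))  # 4
print(longest_run(bunched))      # 8
```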

>>> PS I've been following along as I am also interested in cutting and
>>> re-joining- my first query to this ML was about whether there's a way
>>> to chop off the starts and ends of some clips, add transitions and
>>> re-encode those short overlapping bits, and then join them back on to
>>> their parent clips to avoid having to re-encode the whole lot

To be frank, Rob, if you want to help yourself, you may want to help me. I published my procedure. 
Duplicate it and apply it to some of the videos you've had problems with. Learn how to use the 
framecrc muxer ('-f framecrc') and the showinfo filter ('-vf showinfo'). It will take you a while, 
but it will be time well spent. It will demystify a lot for you. I'll be here to help if you like.
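
As a starting point, here is a minimal Python sketch that reads framecrc-style text and prints the 
packet interleaving pattern. It assumes each data line begins with stream index, dts, pts (comma 
separated) and that lines starting with '#' are headers; check that against the output of your own 
ffmpeg build. The sample text is invented, and treating stream 0 as video and stream 1 as audio is 
also an assumption.

```python
def parse_framecrc(text):
    """Return a list of (stream_index, dts, pts) from framecrc-style text.

    Assumes data lines start with: stream index, dts, pts, ...
    and that lines starting with '#' are header lines.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split(",")]
        rows.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return rows

# Invented sample, not real ffmpeg output.
SAMPLE = """\
#tb 0: 1/30000
#tb 1: 1/48000
0,          0,          0,     1001,    12345, 0x00000001
1,          0,          0,     1024,      400, 0x00000002
1,       1024,       1024,     1024,      400, 0x00000003
0,       1001,       1001,     1001,     2345, 0x00000004
"""

rows = parse_framecrc(SAMPLE)
# Assumption: stream 0 is video, any other stream is audio.
pattern = "-".join("v" if stream == 0 else "a" for stream, _, _ in rows)
print(pattern)  # v-a-a-v
```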

The developers are interested in streaming methods and using them to get consulting jobs. That's as 
it should be because everyone needs to make a living. To get them to pay attention to this 'troll', 
I need allies.

Rob, video is not brain surgery. It's Legos.

-- Mark.