[FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame for subtitle handling
Soft Works
softworkz at hotmail.com
Sun Dec 12 04:21:42 EET 2021
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Daniel
> Cantarín
> Sent: Sunday, December 12, 2021 12:39 AM
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
>
> > One of the important points to understand is that - in case of subtitles,
> > the AVFrame IS NOT the subtitle event. The subtitle event is actually
> > a different and separate entity. (...)
>
>
> Wouldn't it qualify then as a different abstraction?
>
> I mean: instead of avframe.subtitle_property, perhaps something along the
> lines of avframe.some_property_used_for_linked_abstractions, which in
> turn lets you access some proper Subtitle abstraction instance.
>
> That way, devs would not need to defend AVFrame, and Subtitle could
> have whatever properties it needs.
>
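For illustration, the two shapes being discussed could be sketched as below. All struct and field names here are hypothetical, chosen only to contrast the designs; they are not the patchset's actual API.

    #include <stdint.h>

    /* Patchset-style approach (sketch): subtitle properties are embedded
     * directly in the frame, next to the transport timestamp. */
    typedef struct FrameWithEmbeddedSubs {
        int64_t pts;          /* transport timestamp, in the link time_base */
        int64_t subtitle_pts; /* display start of the carried event, AV_TIME_BASE_Q */
    } FrameWithEmbeddedSubs;

    /* Suggested alternative (sketch): the frame only links a separate
     * subtitle object, which owns all subtitle-specific properties. */
    typedef struct SubtitleEvent {
        int64_t  pts;                 /* display start, AV_TIME_BASE_Q */
        uint32_t start_display_time;  /* offsets relative to pts, in ms */
        uint32_t end_display_time;
    } SubtitleEvent;

    typedef struct FrameWithLinkedSubs {
        int64_t        pts;    /* transport timestamp only */
        SubtitleEvent *event;  /* shared (possibly refcounted) subtitle object */
    } FrameWithLinkedSubs;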
> I see there's AVSubtitle, as you mention:
> https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html
>
> Isn't it less socially problematic to just link an instance of AVSubtitle,
> instead of adding a subtitle timing property to AVFrame?
> IIUC, that AVSubtitle instance could live in filter context, and be linked
> by the filter doing the heartbeat frames.
>
> Please note I'm not saying the property is wrong, or even that I understand
> the best way to deal with it, but that I recognize some social problem here.
> Devs don't like that property, that's a fact. And, technical or not, it
> seems to be a problem.
>
> > (...)
> > The chairs are obviously AVFrames. They need to be numbered monotonically
> > increasing - that's the frame.pts. Without increasing numbering, the
> > transport would get stuck. We are filling the chairs with copies
> > of the most recent subtitle event, so an AVSubtitle could be repeated,
> > for example, 5 times. It's always the exact same AVSubtitle event
> > sitting in those 5 chairs. The subtitle event always has the same
> > start time (subtitle_pts), but each frame has a different pts.
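A minimal sketch of this "chairs" relationship (hypothetical struct and field names, not the patchset's actual API): the frame pts keeps increasing so the transport keeps moving, while the carried event's start time stays fixed.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    typedef struct SubEvent { int64_t subtitle_pts; const char *text; } SubEvent;
    typedef struct Frame    { int64_t pts; const SubEvent *event; } Frame;

    int main(void) {
        /* One subtitle event, repeated across five heartbeat frames. */
        const SubEvent ev = { 9000000, "Hello" }; /* start time, AV_TIME_BASE_Q */
        for (int64_t pts = 100; pts < 105; pts++) {
            Frame f = { pts, &ev };  /* frame pts increases monotonically... */
            printf("frame pts=%"PRId64"  subtitle_pts=%"PRId64"  text=%s\n",
                   f.pts, f.event->subtitle_pts, f.event->text);
            /* ...while the event's start time stays the same */
        }
        return 0;
    }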
>
> I can see AVSubtitle has a "start_display_time" property, as well as a
> "pts" property "in AV_TIME_BASE":
>
> https://ffmpeg.org/doxygen/trunk/structAVSubtitle.html#af7cc390bba4f9d6c32e391ca59d117a2
>
> Is it too much trouble to reuse that while persisting an AVSubtitle instance
> in filter context? I guess it could even be used in decoder context.
>
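For reference, this is the timing AVSubtitle already carries, as defined in libavcodec/avcodec.h (comments abbreviated):

    typedef struct AVSubtitle {
        uint16_t format;             /* 0 = graphics */
        uint32_t start_display_time; /* relative to packet pts, in ms */
        uint32_t end_display_time;   /* relative to packet pts, in ms */
        unsigned num_rects;
        AVSubtitleRect **rects;
        int64_t pts;                 /* same as packet pts, in AV_TIME_BASE */
    } AVSubtitle;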
> I also see a quirky property in AVFrame: "best_effort_timestamp"
> https://ffmpeg.org/doxygen/trunk/structAVFrame.html#a0943e85eb624c2191490862ececd319d
> Perhaps the "various heuristics" it claims to apply could be extended,
> this time related to a linked AVSubtitle, so that an extra property is not
> needed?
>
>
> > (...)
> > Considering the relation between AVFrame and subtitle event as laid out
> > above, it should be apparent that there's no guarantee of any specific
> > relation between the subtitle_pts and the pts of the frame that
> > carries it. Such a relation _can_ exist, but doesn't have to.
> > It can easily happen that the frame pts is just increased by 1
> > on subsequent frames. The time_base may change from filter to filter
> > and may be oriented towards the transport of the subtitle events, which
> > might have nothing to do with the subtitle display time at all.
>
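A sketch of why the two clocks can drift apart, using the real av_rescale_q() helper; the pass-through function and the link time_bases are made-up examples. The transport pts follows whatever time_base each filter link negotiates, while a subtitle_pts kept in AV_TIME_BASE_Q never moves:

    #include <stdint.h>
    #include <libavutil/avutil.h>       /* AV_TIME_BASE_Q */
    #include <libavutil/mathematics.h>  /* av_rescale_q() */

    /* Hypothetical pass-through: the frame crosses a link whose time_base
     * differs, so its transport pts must be rescaled... */
    static int64_t cross_link(int64_t pts, AVRational in_tb, AVRational out_tb) {
        return av_rescale_q(pts, in_tb, out_tb);
    }
    /* ...whereas a subtitle_pts fixed in AV_TIME_BASE_Q is left untouched,
     * so the numeric distance between the two values means nothing. */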
> This confuses me.
> I understand the difference between filler frame pts and subtitle pts.
> That's ok.
> But if transport timebase changes, I understand subtitle pts also changes.
>
> I mean: "transport timebase" means "video timebase", and if subs are synced
> to video, then that sync needs to be maintained. If subs are synced, then
> their timing is never independent. And if they're not synced, then their
> AVFrame is independent from video frames, and thus does not need any extra
> prop.
>
> Here's what I do right now with the filler frames. I'm talking about current
> ffmpeg with no subs frames in lavfi, and real-time conversion from dvbsub
> to WebVTT using OCR. What I do is quite dirty:
> - Change FPS to a low value, let's say 1.
> - Apply OCR to the dvb sub, using vf_ocr.
> - Read the metadata downstream, writing VTT to a file or pipe output.
>
> As there's no sub frame capability in lavfi, I can't use the vtt encoder
> downstream. Therefore, the output is raw C string and file manipulation.
> And given that I first set the FPS to 1, I have 1 line per second, no
> matter the timestamp of the subs, the video, or the filler frame. The
> point then is to check for text diffs instead of pts to detect the frame
> nature. I can even naively just emit the frame's pts once per second with
> the same text, and with empty lines when there's no text, without caring
> about the frame nature (filler or not).
>
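The text-diff detection described above can be done against the real frame-metadata API: vf_ocr exports its result under the "lavfi.ocr.text" metadata key. A minimal sketch (the caller and buffer handling are simplified, and the helper itself is hypothetical):

    #include <stdio.h>
    #include <string.h>
    #include <libavutil/frame.h>
    #include <libavutil/dict.h>

    /* Returns 1 when the OCR'd text changed since the previous frame,
     * which is what signals a new subtitle line in this workflow. */
    static int subtitle_text_changed(const AVFrame *frame,
                                     char *prev, size_t prev_size) {
        AVDictionaryEntry *e =
            av_dict_get(frame->metadata, "lavfi.ocr.text", NULL, 0);
        const char *text = e ? e->value : "";
        if (!strcmp(text, prev))
            return 0;                          /* same event on a filler frame */
        snprintf(prev, prev_size, "%s", text); /* remember the new line */
        return 1;
    }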
> There's a similar behaviour when dealing with CEA-608: I need to check text
> differences instead of any pts, as the inner workings of these captions are
> more related to video than to subs. In my filters, I assume that the frame
> PTS is correct.
>
> I understand the idea behind PTS, I get that there's also DTS, and so I can
> see that there could be a use case where another timing is needed. But I
> still don't see the need for this particular extra timing, as the distance
> between subtitle_pts and filler.pts does not mean something downstream like
> "now clear the current subtitle line". What will happen if there's no
> subtitle_pts is that the same line will still be active, and will only
> change when there's an actual subtitle difference. So, I believe this
> value is more theoretically useful than factual.
>
> I understand that there are subs formats that need precise start and end
> timing, but I fail to see the case where that timing avoids the need for
> text-difference checking, be it in a filter or an encoder. And if filters
> or encoders naively use PTS, then the filler frames would not break
> anything: they will repeatedly show the same text line, at the current FPS
> speed. And if the sparseness problem is finally solved by your logic
> somehow, and there's no need for filler frames, then there's also no need
> for subtitle_pts, as pts would actually be fine.
>
> So, I'm confused, given that you state this property is very important.
> Would you please tell us some actual, non-theoretical use case for the prop?
>
>
> >
> > Also, subtitle events are sometimes duplicated. When we would convert
> > the subtitle_pts to the time_base that is negotiated between two filters,
> > then it could happen that multiple copies of a single subtitle event have
> > different subtitle_pts values.
> >
>
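The duplication issue in the quoted paragraph can be reproduced with av_rescale_q() directly: pushing the same AV_TIME_BASE_Q value through two differently-timed links (made-up time_bases and values below) yields copies that no longer agree once rounding strikes:

    #include <stdio.h>
    #include <inttypes.h>
    #include <libavutil/avutil.h>       /* AV_TIME_BASE_Q */
    #include <libavutil/mathematics.h>  /* av_rescale_q() */

    int main(void) {
        int64_t subtitle_pts = 1234567;      /* one event, in AV_TIME_BASE_Q */
        AVRational tb_a = { 1, 90000 };      /* e.g. an MPEG-TS-like link */
        AVRational tb_b = { 1, 1000 };       /* e.g. a millisecond link */
        /* Each copy converted into "its" link time_base and back: */
        int64_t a = av_rescale_q(av_rescale_q(subtitle_pts, AV_TIME_BASE_Q, tb_a),
                                 tb_a, AV_TIME_BASE_Q);
        int64_t b = av_rescale_q(av_rescale_q(subtitle_pts, AV_TIME_BASE_Q, tb_b),
                                 tb_b, AV_TIME_BASE_Q);
        printf("%"PRId64" vs %"PRId64"\n", a, b); /* 1234567 vs 1235000 */
        return 0;
    }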
> If it's repeated, doesn't it have a different pts?
> I get repeated lines from time to time, but they have slightly different
> PTS.
>
> "Repeated event" != "same event".
> If you check for repeated events, then you're doing some extra checking,
> as I noted with the "text difference checks" in previous paragraphs, and so
> PTS is not ruling all the logic. Otherwise, in the worst-case scenario you
> get the same PTS twice, which will cause some frame to be discarded. And in
> the most likely scenario, you get two identical frames with different PTS,
> which actually changes nothing in the viewer's experience.
>
> >
> > Besides that, there are practical considerations: The subtitle_pts
> > is almost nowhere needed in any other time_base than AV_TIME_BASE_Q.
> >
> > All decoders expect it to be like this, as do all encoders and filters.
> > Conversion would need to happen all over the place.
> > Every filter would need to take care of rescaling the subtitle_pts
> > value (when the time_base differs between in and out).
> >
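What "conversion all over the place" would look like in practice, roughly: if subtitle_pts lived in the link time_base, every filter's pass-through path would need an extra step like this sketch (the helper is hypothetical; in_tb/out_tb stand for the in- and out-link time_bases):

    #include <stdint.h>
    #include <libavutil/rational.h>     /* av_cmp_q() */
    #include <libavutil/mathematics.h>  /* av_rescale_q() */

    /* Hypothetical per-filter chore that keeping subtitle_pts in
     * AV_TIME_BASE_Q avoids: rescale on every link crossing. */
    static int64_t forward_subtitle_pts(int64_t subtitle_pts,
                                        AVRational in_tb, AVRational out_tb) {
        if (av_cmp_q(in_tb, out_tb))    /* time_bases differ */
            return av_rescale_q(subtitle_pts, in_tb, out_tb);
        return subtitle_pts;
    }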
>
> I'm not well versed enough in ffmpeg/libav to understand that.
> But I'll tell you what: do you think it is possible for you to do a
> practical test?
> I mean this:
> - Take some short video example with dvbsubs (or whatever graphical format).
> - Apply graphicsub2text, converting to webvtt, srt, or something.
> - Do the same, but with subtitle_pts taken away from AVFrame.
>
> Let's compare both text outputs.
> I propose text because it is easier to share. But if you can think of any
> other practical example like this, it's also welcome. The point is to
> understand the relevance of subtitle_pts by looking at the problem of not
> having it.
>
> If it's no big deal, then screw it: you take it away, devs get pleased,
> and everybody in the world gets the blessing of having subtitle frames in
> lavfi. If it is a big deal, then the devs should understand.
I'm afraid the only reply I have to this is:
- Take my patchset
- Remove subtitle_pts
- Get everything working
(all example command lines in filters.texi)
=> THEN start talking
The same goes for everybody else who keeps saying it can be
removed and that it's an unnecessary duplication.
The stage is yours...
Kind regards,
softworkz