[FFmpeg-devel] Subtitles for GSoC
Nicolas George
george at nsup.org
Wed Mar 9 12:33:28 CET 2016
On nonidi 19 Ventôse, year CCXXIV, Clement Boesch wrote:
> I added this task for a previous OPW (and maybe GSoC, I can't remember). I'm
> unfortunately not available for mentoring (it takes too much time, energy and
> responsibility). I can, though, provide standard help as a developer.
Same goes for me.
> So, yeah, currently the subtitles are decoded into an AVSubtitle
> structure, which holds one or several AVSubtitleRect (AVSubtitle.rects[N]).
>
> For graphic subtitles, each rectangle contains a paletted buffer and its
> position, size, ...
>
> For text subtitles, the ass field contains the text in ASS markup: indeed,
> we consider ASS markup to be the best (or least bad) superset, supporting
> almost every style that any other subtitle format has, so it is used as
> the "decoded" form for all text subtitles. For example, the SubRip decoder
> (SubRip being the "codec", i.e. the markup you find in SRT files) will
> transform "<i>foo</i>" into "{\i1}foo{\i0}".
There is a serious problem with using a common markup: the levels and
sources of styling.
ASS already has two levels: style directly embedded in the dialogue text as
markup, and style that applies to the whole dialogue event, selected by the
"style" field. There are extra levels: a fallback provided by the
application when the format does not provide styling, and possibly a
fallback provided by the library if the application has none. Formats that
rely on CSS for styling may have even more complex structures.
Obviously, handling all this in a completely generic way is not possible.
But at least we should try to ensure that transcoding from one format to
another with similar capabilities, and especially to the same format, does
not completely lose the structure.
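For illustration, the two ASS levels side by side in an abridged excerpt:
the whole-event style is selected through the Style field ("Sign"), while
{\i1}...{\i0} is markup embedded directly in the dialogue text:

    [V4+ Styles]
    Style: Sign,Arial,36,...
    [Events]
    Dialogue: 0,0:01:00.00,0:01:05.00,Sign,,0,0,0,,{\i1}Hello{\i0} world.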
> - they are defined in libavcodec, and we do not want libavfilter to
> depend on libavcodec for a core feature (we have a few filters
> depending on it, but that's optional). As such, libavutil, which
> already contains AVFrame, is a much better place for this.
Personally, I have no problem with lavfi depending on lavc; I want to merge
all the libraries anyway.
> When these issues are sorted out, we can finally work on the integration
> within libavfilter, which is yet another topic where other developers
> might want to comment. For instance, I'm not sure what the state of
> dealing with the sparseness of subtitles is. Nicolas may know :)
Handling sparse streams should not be that difficult. See below.
> Anyway, there are multiple ways of dealing with the previously mentioned
> issues.
>
> The first one is to create an AVSubtitle2 or something in libavutil,
> copying most of the current AVSubtitle layout but making sure the user
> allocates it with av_subtitle_alloc() or whatever, so we can add fields
> and extend it (mostly) at will.
>
> The second one, which I have been wondering about these days, is to try to
> hold the subtitle data in the existing AVFrame structure. We would, for
> example, have frame->extended_data[N] (currently used by audio frames to
> hold the channels) point to instances of a newly defined rectangle
> structure. Having the subtitles in AVFrame might greatly simplify the
> future integration within libavfilter, since AVFrame is already supported
> there for audio and video. This needs careful thinking, but it might
> be doable.
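To make that second idea concrete, a purely hypothetical sketch (the
structure name and layout are made up for illustration, nothing like it
exists yet):

    #include <libavutil/buffer.h>

    /* One entry per rectangle, pointed to by frame->extended_data[i],
     * the same way audio frames use extended_data[] for channel planes. */
    typedef struct SubtitleRect {
        int          x, y, w, h;   /* placement of a bitmap rectangle     */
        AVBufferRef *bitmap;       /* refcounted paletted buffer, or NULL */
        char        *ass;          /* ASS markup for text rects, or NULL  */
    } SubtitleRect;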
I think the AVFrame approach is best, especially for lavfi integration.
Here are the issues I can think of:
1. Sparseness: sparseness is not a problem in itself, only in combination
with continuous streams. For example, imagine you have the following
subtitles:
0:01:00 -> 0:01:05 Hello world.
0:59:00 -> 0:59:05 Good bye.
and you want to render them on the video. When it gets a video frame at 1:06,
the rendering filter needs the 59:00 event to know that there is no subtitle
to render until then. With external subtitles, that is not a problem, because
the event arrives immediately. But if the subtitles are muxed with the video,
it will only arrive after 58 minutes of video, which would all have to be
buffered as decoded frames. Completely unrealistic.
Players get away with it by having interleaving constraints: if the video has
reached timestamp 1:06, then the other streams should have reached it too, or
will reach it very soon. Then we only need to buffer a few frames' worth of
video to account for the muxing delay.
We can do the same thing in lavfi using heartbeat frames and a sync
filter.
A heartbeat frame is a frame that says "we are at pts 1:06 and there is
nothing new". Pure subtitle filters will just ignore and forward them;
filters that must sync with a continuous stream use them to make progress.
The heartbeat frames need to be generated. Only the application knows where
the subtitles come from: demuxed from the same file as the video, from a
stand-alone file, etc. We can provide infrastructure to help with that: a
filter with a video input and a subtitle input, and a subtitle output; it
forwards the subtitles unchanged but also generates heartbeat frames
according to the timestamps of the video stream. Applications would be
expected to insert this sync filter automatically when video and subtitles
are demuxed from the same file.
Am I being clear?
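As a rough sketch of the mechanism (hypothetical: no heartbeat frame type
exists in lavfi today), the sync filter would do something like this for
every video frame it sees:

    #include <libavutil/frame.h>

    /* Emit an empty subtitle frame that only carries a timestamp, meaning
     * "we are at this pts and there is nothing new". */
    static AVFrame *make_heartbeat(int64_t video_pts)
    {
        AVFrame *hb = av_frame_alloc();
        if (!hb)
            return NULL;
        hb->pts = video_pts;   /* no rectangles, no payload */
        return hb;
    }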
2. Duration: some combinations of subtitle and container formats have a
reliable duration known immediately; other combinations have packets marking
the end of the previous subtitle.
I think we should always go with a pair of start+end frames. If the
duration is known and reliable, then the decoder should output both at
once; we need an API that allows several output frames from one input
packet, which is not really a problem for a new API.
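A sketch of how that could look, assuming a send/receive-style decode API
(avcodec_send_packet() / avcodec_receive_frame()) that can return several
frames per packet; handle_subtitle_event() is a made-up placeholder:

    #include <libavcodec/avcodec.h>

    /* Made-up placeholder for whatever consumes the decoded event. */
    static void handle_subtitle_event(const AVFrame *frame);

    /* One input packet with a known, reliable duration yields two output
     * frames at once: a start frame carrying the rectangles, then an empty
     * end frame whose pts is start + duration. */
    static void decode_one(AVCodecContext *dec_ctx, AVPacket *pkt)
    {
        AVFrame *frame = av_frame_alloc();
        if (!frame || avcodec_send_packet(dec_ctx, pkt) < 0)
            goto end;
        while (avcodec_receive_frame(dec_ctx, frame) >= 0) {
            handle_subtitle_event(frame);
            av_frame_unref(frame);
        }
    end:
        av_frame_free(&frame);
    }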
3. Overlap: consider the following script:
0:01:02 -> 0:01:08 I want to say hello.
0:01:06 -> 0:01:09 Silence!
If it comes from ASS, it means that "Silence!" will be displayed alongside
the previous dialogue line. If it comes from SRT, then it is slightly
invalid, and the demuxer or decoder should have fixed it internally by
adjusting 0:01:08 to 0:01:06. For now, we do not do that.
With start+end frames, this is not really an issue. With overlapping
allowed, the script gives the following frames:
0:01:02 start event_1 "I want to say hello."
0:01:06 start event_2 "Silence!"
0:01:08 end event_1
0:01:09 end event_2
Without overlapping:
0:01:02 start event_1 "I want to say hello."
0:01:06 end event_1
0:01:06 start event_2 "Silence!"
0:01:09 end event_2
A de-overlap filter to prepare outputs that do not support overlapping
should be very easy to implement.
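A sketch of that filter's core rule, assuming the hypothetical start/end
subtitle frames described above (the emit_* helpers and event ids are made
up):

    #include <stdint.h>

    /* Made-up helpers standing for "send a start/end frame downstream". */
    static void emit_start(int64_t id, int64_t pts);
    static void emit_end(int64_t id, int64_t pts);

    static int     event_open = 0;
    static int64_t open_id;

    /* Core rule: if a new event starts while another is still open, close
     * the open one at the new start time (0:01:08 becomes 0:01:06 in the
     * example above), and drop its original end when it arrives later. */
    static void on_subtitle_frame(int is_start, int64_t id, int64_t pts)
    {
        if (is_start) {
            if (event_open)
                emit_end(open_id, pts);
            emit_start(id, pts);
            open_id    = id;
            event_open = 1;
        } else if (event_open && id == open_id) {
            emit_end(id, pts);
            event_open = 0;
        }
        /* else: the end of an event already closed early -- drop it */
    }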
4. Subtitle type: bitmap versus text. We want to detect inconsistencies as
early as possible. Imagine encoding a movie to high-quality x265 with
just a few on-screen signs overlaid. It already took 48 hours encoding
the first half hour of the movie, and then... crash, because text subtitles
cannot be overlaid without first being rasterized.
Also, maybe we want to insert the rasterizing filter automatically. This
feels like the format negotiation for pix_fmt. Actually, for bitmap
subtitles, the pixel format needs to be negotiated as well.
Are there other attributes that may need negotiating? Styled or unstyled
maybe?
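A tiny sketch of what could be negotiated, by analogy with pix_fmt
negotiation; the kind values mirror libavcodec's AVSubtitleType, but the
negotiation mechanism itself is hypothetical:

    /* The "kind" values mirror libavcodec's AVSubtitleType. */
    enum SubKind { SUB_KIND_BITMAP, SUB_KIND_TEXT, SUB_KIND_ASS };

    /* A hypothetical overlay filter would advertise bitmap-only subtitle
     * input, so graph configuration can insert a rasterizing filter
     * automatically, or fail before the encode starts instead of 48 hours
     * into it. For bitmap subtitles the pixel format of the rectangles
     * would be negotiated as well, like pix_fmt on video links. */
    static const enum SubKind overlay_sub_input_kinds[] = { SUB_KIND_BITMAP };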
5. Global data: ASS has styles that apply to the whole file. At the
lavf/lavc boundary, they are handled as codec extradata, but lavfi does
not have such a thing. I suppose they could be handled as frames by
themselves.
6. Integration with refcounting. When rendering with libass, in particular,
any text event is converted into a bunch of glyphs, and libass shares the
glyph alpha masks: two 'A's are rendered as two glyphs with different
coordinates and possibly colors, but both point to the same alpha mask. For
efficient processing (e.g. uploading OpenGL textures), we need to preserve
that structure.
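A sketch of what preserving that sharing could look like: ASS_Image and its
fields are the real libass API, while carrying the bitmaps as AVBufferRef
entries of a subtitle frame is the hypothetical part:

    #include <ass/ass.h>
    #include <libavutil/buffer.h>

    /* The bitmaps belong to libass; never free them from the wrapper. */
    static void no_free(void *opaque, uint8_t *data)
    {
        (void)opaque; (void)data;
    }

    static void wrap_glyphs(ASS_Image *img)
    {
        for (; img; img = img->next) {
            /* Several glyphs may point to the same img->bitmap; a proper
             * scheme would keep one AVBufferRef per distinct bitmap so the
             * sharing survives (e.g. one OpenGL texture per alpha mask). */
            AVBufferRef *mask = av_buffer_create(img->bitmap,
                                                 img->h * img->stride,
                                                 no_free, NULL,
                                                 AV_BUFFER_FLAG_READONLY);
            /* ... attach mask together with dst_x, dst_y and color to a
             * rectangle entry of the subtitle frame ... */
            av_buffer_unref(&mask);
        }
    }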
Another big piece of work for lavfi integration is refactoring
media-(in)dependent code. I am thinking in particular of media-independent
filters: setpts, settb, split, etc. For now, we have to duplicate the filter
structures: setpts/asetpts, settb/asettb, split/asplit. If one more media
type is added, it starts to become impractical. If several are added (I do
want filtering of data frames), it becomes awful. We would need a way to
use settb directly on any kind of stream.
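A purely hypothetical illustration of what that could look like: the pad
declaration below follows the existing internal lavfi pattern, but nothing
currently lets a pad declare "any media type":

    /* Inside libavfilter (internal headers), roughly the existing pattern: */
    #include "avfilter.h"
    #include "internal.h"

    static int filter_frame(AVFilterLink *inlink, AVFrame *frame);

    /* Hypothetical: AVMEDIA_TYPE_UNKNOWN repurposed to mean "any media
     * type", so one settb definition would serve video, audio, subtitles
     * and data frames alike. */
    static const AVFilterPad settb_inputs[] = {
        {
            .name         = "default",
            .type         = AVMEDIA_TYPE_UNKNOWN,
            .filter_frame = filter_frame,
        },
        { NULL }
    };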
> But again, these are ideas, which need to be discussed and experimented. I
> don't know if it's a good idea for a GSoC, and I don't know who would be
> up for mentoring.
>
> It's nice to finally see some interest in this topic, though.
Same for me.
Regards,
--
Nicolas George