[FFmpeg-devel] Enhancement layers in FFmpeg

Soft Works softworkz at hotmail.com
Mon Aug 1 16:17:12 EEST 2022



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Niklas Haas
> Sent: Monday, August 1, 2022 1:25 PM
> To: ffmpeg-devel at ffmpeg.org
> Subject: [FFmpeg-devel] Enhancement layers in FFmpeg
> 
> Hey,
> 
> We need to think about possible ways to implement reasonably-transparent
> support for enhancement layers in FFmpeg (SVC, Dolby Vision, ...).
> There are more open questions than answers here.
> 
> From what I can tell, these are basically separate bitstreams that carry
> some amount of auxiliary information needed to reconstruct the
> high-quality bitstream. That is, they are not independent, but need to
> be merged with the original bitstream somehow.
> 
> How do we architecturally fit this into FFmpeg? Do we define a new
> codec ID for each (common/relevant) combination of base codec and
> enhancement layer, e.g. HEVC+DoVi, H.264+SVC, ..., or do we
> transparently handle it for the base codec ID and control it via a
> flag? Do the enhancement layer packets already make their way to the
> codec, and if not, how do we ensure that this is the case?
> 
> Can the decoder itself recursively initialize a sub-decoder for the
> second bitstream? And if so, does the decoder apply the actual
> transformation, or does it merely attach the EL data to the AVFrame
> somehow in a way that can be used by further filters or end users?

From my (rather limited) vantage point, my thoughts are these:

When decoding these kinds of sources, a user will typically want to do
not only the processing in hardware but the decoding as well.

I think we cannot realistically expect any of the hw decoders to add
support for this in the near future. As we cannot modify those
ourselves, the only way to do such processing would be a hardware
filter. The EL data would need to be attached to the frames as some
kind of side data (or similar) and be uploaded (internally) by the
hw filter, which would then apply it.
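To make the idea concrete, here is a minimal, self-contained sketch of
that flow in C. None of these names are real FFmpeg API (the real
counterpart would be the AVFrameSideData mechanism); the types and
functions below are purely hypothetical and only mirror the pattern:
the decoder attaches the raw EL payload to each output frame, and a
downstream hw filter looks it up and would upload/apply it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for AVFrameSideData / AVFrame -- not real API. */
typedef enum {
    SIDE_DATA_DOVI_EL /* hypothetical: raw enhancement-layer payload */
} SideDataType;

typedef struct SideData {
    SideDataType type;
    uint8_t *data;
    size_t size;
} SideData;

typedef struct Frame {
    SideData *sd; /* at most one side-data entry in this toy model */
} Frame;

/* Decoder side: attach the EL bytes without touching the base frame. */
static int attach_el(Frame *f, const uint8_t *el, size_t size)
{
    SideData *sd = malloc(sizeof(*sd));
    if (!sd)
        return -1;
    sd->type = SIDE_DATA_DOVI_EL;
    sd->data = malloc(size);
    if (!sd->data) {
        free(sd);
        return -1;
    }
    memcpy(sd->data, el, size);
    sd->size = size;
    f->sd = sd;
    return 0;
}

/* Filter side: look the payload up; upload + apply would happen here. */
static const SideData *get_el(const Frame *f)
{
    return (f->sd && f->sd->type == SIDE_DATA_DOVI_EL) ? f->sd : NULL;
}
```

The point of the sketch is that the base frame passes through the
pipeline unchanged; only the consumer that actually knows how to apply
the EL (the hw filter) interprets the attached bytes.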


(I have no useful thoughts for sw decoding) 


> (What about the case of Dolby Vision, which iirc requires handling the
> DoVi RPU metadata before the EL can be applied? What about instances
> where the user wants the DoVi/EL application to happen on GPU, e.g.
> via libplacebo in mpv/vlc?)

IMO it would be desirable if both of these things could be done in a
single operation.

> How does this metadata need to be attached? A second AVFrame reference
> inside the AVFrame? Raw data in a big side data struct?

As long as it doesn't have its own format, its own start time,
resolution, duration, color space/transfer/primaries, etc.,
I wouldn't say that it's a frame.
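The two options from the quoted mail could be sketched as the following
(entirely hypothetical) struct shapes; the declarations only illustrate
the trade-off being argued here, they are not proposed API:

```c
#include <stdint.h>
#include <stddef.h>

struct Frame; /* hypothetical stand-in for AVFrame */

/* Option A: a second frame reference. This implies the EL has its own
 * pixel format, dimensions, timestamps, etc. -- which, as argued above,
 * it does not really have. */
struct FrameRefOption {
    struct Frame *el_frame; /* hypothetical sub-frame carrying EL data */
};

/* Option B: raw EL bitstream bytes in a side-data blob. The payload is
 * opaque to generic code; only the consumer (hw filter, libplacebo, ...)
 * interprets it, so no frame-like properties need to be invented. */
struct RawSideDataOption {
    uint8_t *el_data; /* raw enhancement-layer / RPU bytes */
    size_t   el_size;
};
```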

Best regards,
softworkz
