[FFmpeg-devel] RFC: new packed pixel formats (machine vision)
Diederick C. Niehorster
dcnieho at gmail.com
Tue Oct 22 23:33:12 EEST 2024
Hi Martin,
Thanks for writing in!
On Tue, Oct 22, 2024 at 11:41 AM martin schitter <ms+git at mur.at> wrote:
>
>
>
> On 22.10.24 08:50, Diederick C. Niehorster wrote:
> >> I want to pick up a discussion i started last week
> >> (https://ffmpeg.org/pipermail/ffmpeg-devel/2024-October/334585.html)
> >> in a new thread, with the relevant information nicely organized. This
> >> is about adding pixel formats common in machine vision to ffmpeg
> >> (though i understand some formats may also be used by cinema cameras),
> >> and supporting them as input formats in swscale so that it becomes
> >> easy to use ffmpeg for machine vision purposes (I already have such
> >> software, it will be open-sourced in good time, but right now there is
> >> a proprietary conversion layer from Basler i need to replace (e.g. by
> >> this proposal)).
>
> most of your points do not look so much machine-learning or
> computer-vision specific, but more like typical/traditional video tech
> peculiarities. More ML-related obstacles come into play if you have to
> support optimized calculations with uncommon small bit sizes, etc. But
> most of your described issues should be solvable easily by already
> available features of ffmpeg, if I'm not wrong.
I am writing about machine vision, not machine learning or computer
vision, so there are no uncommon small bit sizes; we're dealing with
8-, 10-, and 12-bit components here.
Where possible, I already map to the matching ffmpeg format; the
problem I am running into is that there isn't one for some of the
common machine vision pixel formats.
While this could be fixed with an encoder, that would complicate their
use in ffmpeg. Having them instead as pixel formats supported by
swscale as inputs makes them far more generally useful, and enables
easily passing these formats to many of ffmpeg's encoders via an
auto-negotiated/inserted scale filter.
In the previous discussion, Lynne also indicated that the inclusion of
such formats is in scope for ffmpeg, as there are also cinema cameras
that produce some of them.
> >> Example formats are 10- and 12-bit Bayer formats, where the 10-bit
> >> variant cannot currently be represented in an AVPixFmtDescriptor,
> >> as the effective bit depth for the red and blue channels would be
> >> 2.5 bits, but component depths must be integers.
>
> As bits will always be distinct entities, you don't need more than
> simple natural numbers to describe their placement and count precisely.
An AVPixFmtDescriptor encodes the effective number of bits. Here is the
descriptor for the 8-bit Bayer formats already included with ffmpeg:
#define BAYER8_DESC_COMMON \
        .nb_components= 3, \
        .log2_chroma_w= 0, \
        .log2_chroma_h= 0, \
        .comp = {          \
            { 0, 1, 0, 0, 2 }, \
            { 0, 1, 0, 0, 4 }, \
            { 0, 1, 0, 0, 2 }, \
        }
Note that the green component is denoted as having 4 bits, and the red
and blue as 2 bits. That is because there is only one blue and one red
sample per four pixels, and one green sample per two pixels, leading
to _effective bit depths_ of 8/4 = 2 for red and blue, and 8/2 = 4 for
green.
For 10-bit Bayer, this leads to 10/4 = 2.5 effective bits for red and
blue. Hence the proposal I made.
> ffmpeg already supports the AV_PIX_FMT_FLAG_BITSTREAM to switch some
> description fields from byte to bit values. That's enough to describe
> the layout of most pixel formats -- even those packed ones, which are
> not aligned to byte or 32-bit borders. You just have to use bit size
> values for the step and offset struct members.
Lynne indicated that AV_PIX_FMT_FLAG_BITSTREAM is only for 8-bit and
32-bit aligned formats; here I'm dealing with unaligned formats.
An option could be to relax the restriction that
AV_PIX_FMT_FLAG_BITSTREAM formats need to be 8-bit or 32-bit aligned,
but that would be a backwards-incompatible change with significant
repercussions not only for the ffmpeg codebase but also for user code.
It is better to have a new flag for the new situation.
> But there is another common case, which is indeed not describable with
> ffmpeg's current struct: color components can be composed out of
> separated MSb and LSb parts at different places in the component
> sequence -- similar to the color examples BayerRG12g40 and BayerRG12g24
> in your linked examples. Although these examples are indeed a little
> bit more complex, because they may describe arrangements which differ
> between even and odd lines. The bit packing for 10- and 12-bit data in
> DNxUncompressed entails a similar issue, by packing all LSb information
> as one block at the end of every scan line.
I think these are less common, the one exception being some GigE
cameras that pack the MSBs of multiple components after the 8 LSBs of
each. I think these formats are sufficiently complicated (and
different from the rest) that I would handle them with an encoder once
they come up. (I currently do not have GigE cameras, only USB3 Vision
and CoaXPress cameras, so these are not a target I can currently
support; this may change given the generic library I'm looking to
eventually provide.)
> For the simple case of just separated MSb and LSb locations within an
> otherwise simply repeating pixel bit group, it could be solved by
> extending the description in a similar way as used in the RGBALayout
> description sequence of MXF -- see G.2.40/p174 of
> https://pub.smpte.org/latest/st377-1/st377-1-2019.pdf
This looks like a very flexible spec. It would, however, also require
totally overhauling/replacing AVPixFmtDescriptors, which is a no-go.
> More complex arrangements should IMHO simply be converted by
> application-specific handling to more common formats, rather than
> getting an overly complex ffmpeg pixel description.
>
> >> Other example formats are 10bit gray
> >> formats where multiple values are packed without padding over multiple
> >> bytes (e.g. 4 10-bit pixels packed into 5 bytes, so not aligned to 16
> >> or 32 bits).
>
> That's no problem, as already explained.
As discussed above, AV_PIX_FMT_FLAG_BITSTREAM is not the right flag for them.
> The unpacking of this kind of data to more sparse 16-bit aligned
> structures can be handled very efficiently by using PDEP intrinsics of
> modern CPUs, as long as the order of components fits. Component order
> swapping is unfortunately a slightly more inefficient operation in the
> case of packed image data, while it can be solved much more easily in
> the case of planar data arrangements by pointer swaps.
Thanks, I've learned something :)
> >> Here a proposal for how these new formats could be encoded into
> >> AVPixFmtDescriptor, so that these can then be used in ffmpeg/swscale.
>
> I think swscale and the internal processing of ffmpeg should not
> support an endless number of arbitrary pixel formats, but be focused
> on a really useful minimal set of required base formats.
As argued above, having native support for common pixel formats (of
which there are many) makes ffmpeg versatile, and enables most of
ffmpeg's functionality to be used with most of these pixel formats.
Having only a small set complicates the use of all the other formats.
> I would look at Vulkan's pixel format list as a modern example of a
> more systematic list of elementary pixel data storage variants.
> (https://docs.vulkan.org/spec/latest/chapters/formats.html)
>
> >> - AV_PIX_FMT_FLAG_BITPACKED_UNALIGNED which indicates formats that are
> >> bit-wise packed in a way that is not aligned on 1, 2 or 4 bytes (e.g.
> >> 4 10-bit values in 5 bytes). This flag is needed because
> >> AV_PIX_FMT_FLAG_BITSTREAM
> >> formats are aligned to 8 or 32 bits, ...
>
> Is this really the case?
See above.
> But in general you should better describe byte/32-bit aligned
> bitpacked formats by using explicit "fill" (X, etc.) pseudo
> components; then you can simply indicate aligned and unaligned groups
> by the actual sum of defined bits resp. the remainder of a division by
> the alignment bit size count.
I assume that with fill/X you mean padding, like some of the formats
in ffmpeg have. That would not work here, as it would change the
definition of a component. gray10p (as I called it) only has one
component, but in this scheme it would have five pseudo components (so
five color channels that would then have to be interleaved into one?),
which 1) isn't what components mean in an AVPixFmtDescriptor, and 2)
we can only have up to 4.
> I hope, that's at least inspiring food for thought... ;)
Thanks for engaging, Martin; I've learned something.