[FFmpeg-devel] RFC: new packed pixel formats (machine vision)
martin schitter
ms+git at mur.at
Wed Oct 23 03:32:18 EEST 2024
On 22.10.24 22:33, Diederick C. Niehorster wrote:
> I am writing about machine vision, not machine learning or computer
> vision. So there are no uncommon small bit sizes, we're dealing with
> 8bit, 10bit, 12bit components here.
Sorry -- I'm a sloppy reader/writer, especially when I'm in a hurry.
> Where possible, I already map to the matching ffmpeg format; the
> problem I am running into is that there isn't one for some of the
> common machine vision pixel formats.
> While this can be fixed with an encoder, that would complicate their
> use in ffmpeg. Having them instead as pixel formats supported by
> swscale as inputs makes them far more generally useful, and makes it
> easy to pass these formats to many of ffmpeg's encoders using an
> auto-negotiated/inserted scale filter.
I'm not a big fan of this auto-negotiated format handling, because the
actual code which handles this task looks utterly unreadable to me --
full of exceptions and complicated switches in the code flow, which
also hinder more efficient processing.
But OK, it is a comfortable and simple solution for simple demands,
and it may even help reduce the bugs which would otherwise appear more
frequently when writing new, more application/format-specific handlers.
> In the previous discussion, Lynne also indicated that the inclusion of
> such formats is in scope for ffmpeg, as there are also cinema cameras
> that produce some of them.
Yes, he pointed to the already available Bayer format entries.
They obviously work differently from the other pixel format
description entries. I still don't really grasp how exactly they work.
CFA sensor data is usually structured as a one-channel pixel matrix,
similar to a monochrome image. The colors are only calculated later,
in the debayering process. At the beginning you have only this
one-channel matrix of values plus an additional description of the CFA
arrangement used -- i.e. the locations of the differently colored
sensels in relation to each other.
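To make that concrete, here is a minimal sketch of how I picture raw
CFA data -- all names are made up for illustration, nothing here is
taken from the ffmpeg headers:

    #include <stdint.h>

    enum CFAPattern {
        CFA_RGGB,   /* R G / G B in each 2x2 cell */
        CFA_BGGR,
        CFA_GRBG,
        CFA_GBRG,
    };

    struct CFAFrame {
        uint16_t *samples;       /* one value per sensel, no color planes */
        int width, height;
        int linesize;            /* in samples */
        enum CFAPattern pattern; /* where the R/G/B sensels sit */
    };

    /* The color of a sensel follows from its coordinates plus the
     * pattern; e.g. for CFA_RGGB: (even row, even col) = R,
     * (odd, odd) = B, everything else G. */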
The colored graphics in your linked documents are therefore a little
misleading: if you really differentiated the colored sensels already
at this stage, you would have to describe different data patterns for
odd and even lines in the case of a typical image sensor...
>>>> Example formats are 10 and 12 bit Bayer formats, where the 10 bit
>>>> ones currently cannot be represented in AVPixFmtDescriptors, as
>>>> the effective bit depth for the red and blue channels is 2.5 bits,
>>>> but component depths should be integers.
At least in the case of all ordinary pixel arrangement description
entries, the values are not just useful metadata for further
calculations, but real descriptions of where to find the actual data
-- i.e. which bytes/bits to pick out of the raw data stream.
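For a byte-aligned format like RGB24 the descriptor literally tells
you where each component byte lives. A simplified sketch (ignoring
shift/depth masking and bitstream formats):

    #include <stdint.h>

    static uint8_t read_comp(const uint8_t *plane, int linesize,
                             int x, int y, int step, int offset)
    {
        /* step = bytes per pixel group, offset = byte of this component */
        return plane[y * linesize + x * step + offset];
    }

RGB24 has comp[] = { {0,3,0,0,8}, {0,3,1,0,8}, {0,3,2,0,8} }, so the
green value of pixel (x,y) is read_comp(data, linesize, x, y, 3, 1).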
>> As bits will always be distinct entities, you don't need more than
>> simple natural numbers to describe their placement and amount precisely.
> An AVPixFmtDescriptor encodes the effective number of bits. Here the
> descriptor for 8 bit bayer formats already included with ffmpeg:
> #define BAYER8_DESC_COMMON \
>         .nb_components = 3, \
>         .log2_chroma_w = 0, \
>         .log2_chroma_h = 0, \
>         .comp = {           \
>             { 0, 1, 0, 0, 2 }, \
>             { 0, 1, 0, 0, 4 }, \
>             { 0, 1, 0, 0, 2 }, \
>         }
> Note that the green component is denoted as having 4 bits, and the
> red and blue as 2 bits. That is because there is only one blue and
> one red sample per 4 pixels, and one green sample per 2 pixels,
> leading to _effective bit depths_ of 8/4 = 2 for red and blue, and
> 8/2 = 4 for green.
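For readers who, like me, have to look this up: if I read pixdesc.h
correctly, the five values in each comp entry map onto the
AVComponentDescriptor fields in this order:

    { plane, step, offset, shift, depth }

So { 0, 1, 0, 0, 2 } means plane 0, a step of 1 byte, offset 0, no
shift, and an (effective) depth of 2 bits.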
This definition is so different from the more ordinary pixel
descriptions that I hardly understand why they are mixed together at
all. An additional list with more RAW/CFA-specific description fields
would IMHO be a much more suitable solution.
>> ffmpeg already supports the AV_PIX_FMT_FLAG_BITSTREAM flag to switch
>> some description fields from byte to bit values. That's enough to
>> describe the layout of most pixel formats -- even those packed ones
>> which are not aligned to byte or 32bit borders. You just have to use
>> bit size values for the step and offset struct members.
>
> Lynne indicated that AV_PIX_FMT_FLAG_BITSTREAM is only for 8bit and
> 32bit aligned formats. Here I'm dealing with unaligned formats.
I'm sure Lynne is more familiar with this code base and knows it much
better than I do, but I would guess that this limitation is caused
more by the automatic unpacking mechanism than by the pixel format
description itself.
An interesting target for a code contribution -- making the raw data
reading even more complex, unreadable and a little bit slower again. ;)
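For what it's worth, the unpacking itself is not hard. Here is a
sketch for a fully packed 10-bit mono format, assuming LSB-first
packing as in GigE Vision's Mono10p (4 pixels in 5 bytes) --
illustrative only, not ffmpeg code:

    #include <stdint.h>
    #include <stddef.h>

    static void unpack_mono10p(const uint8_t *src, uint16_t *dst,
                               int npixels)
    {
        for (int i = 0; i < npixels; i++) {
            size_t bitpos = (size_t)i * 10; /* step is 10 bits, not bytes */
            size_t byte   = bitpos >> 3;
            int    shift  = bitpos & 7;     /* always even for 10-bit steps */
            /* each 10-bit sample spans exactly two bytes here */
            uint16_t v = src[byte] | (src[byte + 1] << 8);
            dst[i] = (v >> shift) & 0x3FF;
        }
    }

The real work is teaching the generic read/write machinery to deal
with steps and offsets that are no longer byte multiples.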
> An option could be to relax the restriction that
> AV_PIX_FMT_FLAG_BITSTREAM needs to be 8bit or 32bit aligned, but that
> would be a backwards-incompatible change with significant
> repercussions not only for the ffmpeg codebase, but also for user
> code. It is better to have a new flag for the new situation.
I don't know whether this would really cause so much trouble.
> I think these are less common, the one exception being some GigE
>> For the simple case of just separated MSb and LSb locations within
>> an otherwise simply repeating group of pixel bits, it could be
>> solved by extending the description in a similar way as the
>> RGBALayout description sequence of MXF -- see G.2.40/p174 of
>> https://pub.smpte.org/latest/st377-1/st377-1-2019.pdf
>
> This looks like a very flexible spec. It would however also require
> totally overhauling/replacing AVPixFmtDescriptors, which is a no go.
I definitely do not want to suggest rewriting vital parts of ffmpeg in
this manner, but it's important to keep these more modern approaches
in mind.
Most of the more recently specified description schemes for wide
ranges of uncompressed video image data use this kind of more complex,
variable-length sequence of component descriptions, instead of just
the very simple traditional four-channel schema.
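Translated into C, such a scheme boils down to something like the
following -- my reading of the RGBALayout idea, with hypothetical
names; check the spec before relying on the component codes:

    #include <stdint.h>

    struct ComponentEntry {
        char    code;   /* e.g. 'R', 'G', 'B', 'A', 'F' for fill, 0 = end */
        uint8_t depth;  /* component size in bits */
    };

    /* A v210-like packing -- three 10-bit samples plus 2 fill bits
     * per 32-bit word -- would then simply be: */
    static const struct ComponentEntry v210_word[] = {
        { 'U', 10 }, { 'Y', 10 }, { 'V', 10 }, { 'F', 2 }, { 0, 0 },
    };

No fixed channel count, and fill can appear wherever the format
actually puts it.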
>> I think swscale and the internal processing of ffmpeg should not
>> support an endless amount of arbitrary pixel formats, but be focused
>> on a really useful minimal set of required base formats.
>
> As argued above, having native support for common pixel formats (of
> which there are many) makes ffmpeg versatile, and enables most of
> ffmpeg's functionality to be used with most of these pixel formats.
> Having only a small set complicates the use of all the other formats.
There are good arguments for both variants.
I can only tell you how I think about this topic -- but I may be wrong!
>> But in general you should rather describe byte/32bit-aligned
>> bit-packed formats by using explicit "fill" (X, etc.) pseudo
>> components; then you can simply distinguish aligned and unaligned
>> groups by the actual sum of the defined bits, resp. the remainder of
>> a division by the alignment bit size.
>
> I assume that with fill/X you mean padding, like some of the formats
> in ffmpeg have.
Isn't padding always used just at one end, touching the alignment
boundary, while fill may be specified multiple times, anywhere in the
sequence?
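And with fill listed explicitly, alignment falls out of simple
arithmetic over the declared bit sizes -- a sketch of the check I had
in mind:

    static int is_aligned(const int *depths, int n, int align_bits)
    {
        int bits = 0;
        for (int i = 0; i < n; i++)
            bits += depths[i];   /* fill counts like any other entry */
        return bits % align_bits == 0;
    }

A v210-style group { 10, 10, 10, 2 } sums to 32, i.e. 32-bit aligned;
a bare 10-bit mono group { 10 } is not byte-aligned, and only four
packed samples (40 bits) reach a byte boundary again.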
> That would not work here, as that would change the definition of a
> component. gray10p (as I called it) only has one component, but in
> this scheme it would have five pseudo components (so five color
> channels that would then have to be interleaved into one?), which
> 1) isn't what components mean in an AVPixFmtDescriptor and 2) we can
> only have up to 4.
Yes -- this 4-channel schema is indeed very limiting!
I don't have a better solution, but additional, more specialized
description lists for groups of similar structure (like RAW CFA data)
could perhaps help. And I really think that more specific processing
for entries described by those additional lists would also help to
reduce the complexity of the affected code infrastructure and make the
resulting separate modules more efficient.
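Very roughly, such a separate CFA-specific table could look like this
-- everything here is hypothetical, just to make the idea concrete:

    struct CFAFormatDesc {
        const char *name;   /* e.g. "bayer_rggb10p" */
        int sample_bits;    /* 8, 10, 12, ... per sensel */
        int packed;         /* 1 = samples bit-packed without padding */
        int tile_w, tile_h; /* CFA repeat cell, 2x2 for classic Bayer */
        char tile[4];       /* cell colors, row-major, e.g. "RGGB" */
    };

    static const struct CFAFormatDesc cfa_formats[] = {
        { "bayer_rggb10p", 10, 1, 2, 2, { 'R', 'G', 'G', 'B' } },
        { "bayer_bggr12p", 12, 1, 2, 2, { 'B', 'G', 'G', 'R' } },
    };

A debayer or unpack module would then only need this small table
instead of squeezing CFA semantics into the generic descriptor.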
Martin