[FFmpeg-devel] [PATCH 10/14] swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats

Sat Sep 28 19:10:00 EEST 2024

Hi Christophe,

On Sat, Sep 28, 2024 at 3:21 PM Christophe Gisquet
<christophe.gisquet at gmail.com> wrote:
> Le lun. 23 sept. 2024 à 16:19, Ramiro Polla <ramiro.polla at gmail.com> a écrit :
> > For bit depths <= 14, amax is 16-bit and offset is 32-bit.
> > For bit depths > 14, amax is 32-bit and offset is 64-bit.
>
> [...]
>
> > -static void chrRangeToJpeg_c(int16_t *dstU, int16_t *dstV, int width)
> > +static void chrRangeToJpeg_c(int16_t *dstU, int16_t *dstV, int width,
> > +                             int amax, int coeff, int64_t _offset)
> >  {
> > +    int offset = _offset;
> >      int i;
> >      for (i = 0; i < width; i++) {
> > -        dstU[i] = (FFMIN(dstU[i], 30775) * 4663 - 9289992) >> 12; // -264
> > -        dstV[i] = (FFMIN(dstV[i], 30775) * 4663 - 9289992) >> 12; // -264
> > +        dstU[i] = (FFMIN(dstU[i], amax) * coeff + offset) >> 14;
> > +        dstV[i] = (FFMIN(dstV[i], amax) * coeff + offset) >> 14;
> >      }
> >  }
> >
> > -static void chrRangeFromJpeg_c(int16_t *dstU, int16_t *dstV, int width)
> > +static void chrRangeFromJpeg_c(int16_t *dstU, int16_t *dstV, int width,
> > +                               int amax, int coeff, int64_t _offset)
> >  {
> > +    int offset = _offset;
> >      int i;
> >      for (i = 0; i < width; i++) {
> > -        dstU[i] = (dstU[i] * 1799 + 4081085) >> 11; // 1469
> > -        dstV[i] = (dstV[i] * 1799 + 4081085) >> 11; // 1469
> > +        dstU[i] = (dstU[i] * coeff + offset) >> 14;
> > +        dstV[i] = (dstV[i] * coeff + offset) >> 14;
> >      }
> >  }
> >
> > -static void lumRangeToJpeg_c(int16_t *dst, int width)
> > +static void lumRangeToJpeg_c(int16_t *dst, int width,
> > +                             int amax, int coeff, int64_t _offset)
> >  {
> > +    int offset = _offset;
> >      int i;
> >      for (i = 0; i < width; i++)
> > -        dst[i] = (FFMIN(dst[i], 30189) * 19077 - 39057361) >> 14;
> > +        dst[i] = (FFMIN(dst[i], amax) * coeff + offset) >> 14;
> >  }
>
> I'm a bit surprised by some of these formulas and the range you assert
> above. Somehow you make it clear by casting the offset parameters to
> narrower types, so all of the following is a non-issue.
>
> So, maybe some cases are special and so on, and you are working within
> some API constrain, but the code excerpts bother me a bit:
> * If dst as input is 16 bits, I don't see why amax would really be an
> int (except native integer etc)
> * If dst[i] = anything >> 14, with dst[i] 16 bits, you want "anything"
> to be at most of 30 bits of dynamics. offset doesn't sound like it
> could be of a wider range, and not 64 bits.
> * Any version that does treat things as 64 bits computations, will
> essentially halve the likely throughput of the SIMD, besides
> potentially using way slower computations and restricting these to
> fewer archs.
>
> I expect you know that, but I'd have expected an intermediate type of
> functions with parameters eg '(int16_t *dst, int width, int16_t amax,
> int16_t coeff, int32_t _offset)'
> Maybe to avoid issues with the weird Win ABI where the MSBs can be
> garbage, that can indeed be '(int16_t *dst, ptrdiff_t width, int amax,
> int16_t coeff, int32_t _offset)'.

I think you missed that the internal representation (the contents of
dst) is 15 bits for bit_depth <= 14, and 19 bits for bit_depth > 14.
It is confusing, but when I say bit depths > 14, I mean input bit
depths > 14, which have an internal representation of 19 bits (and
therefore the contents of dst are int32_t). Look at chrRangeToJpeg16_c
for example.

For bit depth > 14, I had to raise the precision of the calculation in
order to respect the mpeg range. In this case, the shift is 18 bits
instead of 14, and the offset becomes greater than 32 bits.

These are the actual values calculated for each case (hopefully my
mail client doesn't mangle them):
lum from  8 coeff  14071 (0x0036f7) offset     33553280 ( 0x0001fffb80)
chr from  8 coeff  14393 (0x003839) offset     33528960 ( 0x0001ff9c80)
lum to    8 coeff  19078 (0x004a86) offset    -39092480 (-0x0002548100)
chr to    8 coeff  18652 (0x0048dc) offset    -38215680 (-0x0002472000)
lum from 10 coeff  14030 (0x0036ce) offset     33544640 ( 0x0001ffd9c0)
chr from 10 coeff  14350 (0x00380e) offset     33554880 ( 0x00020001c0)
lum to   10 coeff  19134 (0x004abe) offset    -39204096 (-0x0002563500)
chr to   10 coeff  18707 (0x004913) offset    -38332416 (-0x000248e800)
lum from 12 coeff  14020 (0x0036c4) offset     33535520 ( 0x0001ffb620)
chr from 12 coeff  14340 (0x003804) offset     33538080 ( 0x0001ffc020)
lum to   12 coeff  19148 (0x004acc) offset    -39232000 (-0x000256a200)
chr to   12 coeff  18720 (0x004920) offset    -38338560 (-0x0002490000)
lum from 14 coeff  14017 (0x0036c1) offset     33549698 ( 0x0001ffed82)
chr from 14 coeff  14337 (0x003801) offset     33550338 ( 0x0001fff002)
lum to   14 coeff  19151 (0x004acf) offset    -39223936 (-0x0002568280)
chr to   14 coeff  18723 (0x004923) offset    -38332416 (-0x000248e800)
lum from 16 coeff 224260 (0x036c04) offset   8589631520 ( 0x01fffb6020)
chr from 16 coeff 229380 (0x038004) offset   8589672480 ( 0x01fffc0020)
lum to   16 coeff 306429 (0x04acfd) offset -10041292800 (-0x025681f800)
chr to   16 coeff 299589 (0x049245) offset  -9817128960 (-0x0249258000)

Therefore, for bit depths <= 14, we have:
- shift: 14
- dst: 15 bits
- coeff: 15 bits unsigned
- offset: 27 bits signed

And for bit_depths > 14, we have:
- shift: 18
- dst: 19 bits
- coeff: 19 bits unsigned
- offset: 35 bits signed

The API is indeed a bit confusing because it is shared for all bit
depths. It should be:
(int16_t *dst, int width, uint16_t coeff, int32_t offset) for bit_depth <= 14
and
(int32_t *dst, int width, uint32_t coeff, int64_t offset) for bit_depth > 14

In the code we cast dst from int16_t * to int32_t * in the 16-bit
functions, and cast offset from int64_t to int (it could be int32_t)
in the non-16-bit functions. Suggestions on how to make this clearer
without added complexity or performance hits are welcome.

By the way, the v2 patchset removed the need for amax by saturating
the output instead of limiting the input, but your points are still
valid.

Ramiro