[FFmpeg-devel] [RFC/PATCH] bitpacked_dec: Optimization for bitpacked_dec decoder performance

Fri May 12 18:26:44 EEST 2023

On Thu, May 11, 2023 at 6:20 PM Marton Balint <cus at passwd.hu> wrote:
> Actually the cached bitstream reader was faster here than the manual
> approach:
>
> ./ffmpeg -stream_loop 128 -threads 1 -f bitpacked -pix_fmt yuv422p10le -s 3840x2160 -c:v bitpacked -i source.yuv -pix_fmt yuv422p10le -f null none -loglevel error
>
> Old code:
>
> 821050920 decicycles in bitpacked,       1 runs,      0 skips
> 815402160 decicycles in bitpacked,       2 runs,      0 skips
> 814108410 decicycles in bitpacked,       4 runs,      0 skips
> 814213800 decicycles in bitpacked,       8 runs,      0 skips
> 815048325 decicycles in bitpacked,      16 runs,      0 skips
> 812866713 decicycles in bitpacked,      32 runs,      0 skips
> 809186523 decicycles in bitpacked,      64 runs,      0 skips
> 808317601 decicycles in bitpacked,     128 runs,      0 skips
>
> With the patch:
>
> 379879920 decicycles in bitpacked,       1 runs,      0 skips
> 387491580 decicycles in bitpacked,       2 runs,      0 skips
> 397720260 decicycles in bitpacked,       4 runs,      0 skips
> 389581560 decicycles in bitpacked,       8 runs,      0 skips
> 381820635 decicycles in bitpacked,      16 runs,      0 skips
> 379791675 decicycles in bitpacked,      32 runs,      0 skips
> 379246303 decicycles in bitpacked,      64 runs,      0 skips
> 379221671 decicycles in bitpacked,     128 runs,      0 skips
>
> Old code and #defined CACHED_BITSTREAM_READER 1
>
> 345122280 decicycles in bitpacked,       1 runs,      0 skips
> 343663020 decicycles in bitpacked,       2 runs,      0 skips
> 343372680 decicycles in bitpacked,       4 runs,      0 skips
> 342554535 decicycles in bitpacked,       8 runs,      0 skips
> 340816522 decicycles in bitpacked,      16 runs,      0 skips
> 340225672 decicycles in bitpacked,      32 runs,      0 skips
> 340283520 decicycles in bitpacked,      64 runs,      0 skips
> 339643105 decicycles in bitpacked,     128 runs,      0 skips

I don't have a good explanation for this.  I could speculate that some
of it comes down to the processor architecture, how much onboard cache
it has, gcc version (and what sort of optimization/vectorization it
does, if any), etc.  In my case I was testing on Haswell and Skylake
(both with 12MB cache) with gcc 4.8.

I would welcome feedback from others.

Looking at the code in libavcodec/get_bits.h, it might also be worth
trying #define LONG_BITSTREAM_READER, as that might speed things up
as well for such large files.
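
For anyone following along who hasn't looked at what a cached reader
changes, here is a rough standalone sketch of the general idea: keep
unread bits in a 64-bit word and only touch memory when that word runs
low, instead of going back to the buffer on every read.  This is not
the code in get_bits.h; CachedReader, cached_read(), naive_read() and
the unpack10_*() helpers are names I made up for illustration, and the
byte-at-a-time refill is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader for illustration; NOT FFmpeg's GetBitContext. */
typedef struct {
    const uint8_t *buf;
    size_t size;     /* bytes in buf */
    size_t pos;      /* next byte to load into the cache */
    uint64_t cache;  /* unread bits, left-aligned (MSB first) */
    unsigned bits;   /* number of valid bits in cache */
} CachedReader;

static void cached_init(CachedReader *r, const uint8_t *buf, size_t size)
{
    r->buf = buf; r->size = size; r->pos = 0; r->cache = 0; r->bits = 0;
}

/* Top up the cache one byte at a time until it is full or input ends. */
static void cached_refill(CachedReader *r)
{
    while (r->bits <= 56 && r->pos < r->size) {
        r->cache |= (uint64_t)r->buf[r->pos++] << (56 - r->bits);
        r->bits += 8;
    }
}

/* Read n bits (1..32), MSB first; reads past the end yield zero bits. */
static uint32_t cached_read(CachedReader *r, unsigned n)
{
    if (r->bits < n)
        cached_refill(r);
    uint32_t v = (uint32_t)(r->cache >> (64 - n));
    r->cache <<= n;
    r->bits = r->bits > n ? r->bits - n : 0;
    return v;
}

/* Reference reader: goes back to the buffer for every single bit. */
static uint32_t naive_read(const uint8_t *buf, size_t *bitpos, unsigned n)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++, (*bitpos)++)
        v = (v << 1) | ((buf[*bitpos >> 3] >> (7 - (*bitpos & 7))) & 1);
    return v;
}

/* Unpack count big-endian 10-bit samples with the cached reader. */
static void unpack10_cached(const uint8_t *src, size_t size,
                            uint16_t *dst, size_t count)
{
    CachedReader r;
    assert(count * 10 <= size * 8);
    cached_init(&r, src, size);
    for (size_t i = 0; i < count; i++)
        dst[i] = (uint16_t)cached_read(&r, 10);
}

/* Same unpacking with the bit-by-bit reference reader. */
static void unpack10_naive(const uint8_t *src, size_t size,
                           uint16_t *dst, size_t count)
{
    size_t bitpos = 0;
    assert(count * 10 <= size * 8);
    for (size_t i = 0; i < count; i++)
        dst[i] = (uint16_t)naive_read(src, &bitpos, 10);
}
```

Both helpers should produce identical samples for the same input, so
the naive one can serve as a correctness check.  Which approach is
faster on a given machine is exactly the open question in this thread
and needs to be measured rather than assumed.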

Devin

-- 
Devin Heitmueller, Senior Software Engineer
LTN Global Communications
o: +1 (301) 363-1001
w: https://ltnglobal.com  e: devin.heitmueller at ltnglobal.com