[FFmpeg-devel] Some optimization on JPEG decoding

Tue Jun 26 18:19:58 CEST 2007

Hi all,

  Here are some simple ideas I've implemented in my local copy of 
libavcodec, which might be of interest to you.

Concerning JPEG decoding, I've added support for thumbnail decoding.
The idea is to decode only the DC coefficient of each DCT block, and 
produce a picture which is 8 times smaller in both width and height.

The new thumbnail decoding uses its own decode_block, which ignores the 
AC part of the DCT.
It also uses its own decode_scan function, which shortcuts the iDCT call 
into a simple "*ptr = dcVal >> 3;".
As a result, classic decoding of a 5MP JPEG picture takes 110ms (averaged 
over 272 frames) on my computer (plus the downsampling, not included), 
while the new thumbnail decoding takes only 55ms (averaged over 272 
frames).
So, if you need to generate thumbnails quickly, this is clearly a good 
optimization (50% less computation time).
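
A minimal sketch of the DC-only shortcut described above, not the actual 
patch. It assumes the dequantized DC coefficient of every 8x8 block is 
already available in an array, and writes one thumbnail pixel per block. 
The names dc_thumbnail_plane and clip_u8, the +128 level shift and the 
clamp are assumptions made for illustration; a decoder that already 
folds the level shift into its DC value would drop the +128 and end up 
with just "dcVal >> 3".

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/*
 * Hypothetical helper: build a 1/8-scale plane from per-block DC values.
 * dc[]     : dequantized DC coefficient of each 8x8 block, raster order
 * blocks_w : number of blocks per row; blocks_h: number of block rows
 * dst      : output plane, one pixel per block, 'stride' bytes per row
 * Assumes the JPEG level shift (128) has NOT been folded into dc[].
 */
static void dc_thumbnail_plane(const int16_t *dc, int blocks_w, int blocks_h,
                               uint8_t *dst, int stride)
{
    assert(dc && dst && blocks_w > 0 && blocks_h > 0 && stride >= blocks_w);
    for (int y = 0; y < blocks_h; y++)
        for (int x = 0; x < blocks_w; x++)
            /* DC / 8 is the block mean; equals ">> 3" for values >= 0 */
            dst[y * stride + x] = clip_u8(dc[y * blocks_w + x] / 8 + 128);
}
```

No iDCT is called at all here: each 8x8 block collapses to its mean 
value, which is where the 8x reduction in width and height comes from.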

The other idea I've implemented speeds up JPEG decoding in the current 
code.
The current code does (pseudo code):
   1) for all macro blocks
      1) Is it progressive ?
          1) Ok, decode block
          2) Not ok, decode block
      2) Is it progressive ?
          1) Ok, idct_put
          2) Not ok, idct_add

My code does:
1) Is it progressive ?
    1) Ok, for all macro blocks
        1) decode blocks (plural here, my code currently does 32
           blocks in a batch)
        2) idct_put
    2) Not ok, for all macro blocks
        1) decode blocks (plural here, my code currently does 32
           blocks in a batch)
        2) idct_add
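
The restructured loop above could be sketched in C roughly as follows. 
This is only an illustration under stated assumptions, not the code from 
my branch: decode_row_batched, toy_decode_block and toy_idct_put are 
hypothetical names, the toy functions merely stand in for the real 
entropy decoder and iDCT, and the blocks are assumed to sit side by side 
in one row of the destination plane. The caller picks the iDCT variant 
(put or add) once, so the progressive test is no longer made per block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BATCH  32   /* blocks per batch, as in the mail */
#define COEFFS 64   /* coefficients per 8x8 block */

typedef int  (*decode_fn)(void *ctx, int16_t block[COEFFS]);
typedef void (*idct_fn)(uint8_t *dst, int stride, int16_t block[COEFFS]);

/* Toy stand-in for the entropy decoder: only sets a DC value. */
struct toy_ctx { int next_dc; int decoded; };

static int toy_decode_block(void *opaque, int16_t block[COEFFS])
{
    struct toy_ctx *c = (struct toy_ctx *)opaque;
    block[0] = (int16_t)c->next_dc;
    c->next_dc += 8;
    c->decoded++;
    return 0;
}

/* Toy stand-in for idct_put: writes the block mean to one pixel only. */
static void toy_idct_put(uint8_t *dst, int stride, int16_t block[COEFFS])
{
    (void)stride;
    dst[0] = (uint8_t)(block[0] / 8);
}

/*
 * Decode nb_blocks blocks in batches: first entropy-decode up to BATCH
 * blocks, then run the iDCT over that same batch.
 * Returns 0 on success, -1 if a block fails to decode.
 */
static int decode_row_batched(void *ctx, decode_fn decode, idct_fn idct,
                              uint8_t *dst, int stride, int nb_blocks)
{
    int16_t coeffs[BATCH][COEFFS];   /* one batch of coefficient blocks */

    assert(decode && idct && dst && nb_blocks >= 0);
    for (int done = 0; done < nb_blocks; done += BATCH) {
        int n = nb_blocks - done < BATCH ? nb_blocks - done : BATCH;

        /* pass 1: decode the whole batch back to back */
        memset(coeffs, 0, sizeof(coeffs[0]) * (size_t)n);
        for (int i = 0; i < n; i++)
            if (decode(ctx, coeffs[i]) < 0)
                return -1;

        /* pass 2: iDCT the same blocks while they are still in cache */
        for (int i = 0; i < n; i++)
            idct(dst + (done + i) * 8, stride, coeffs[i]);
    }
    return 0;
}
```

A progressive picture would call this with the idct_put-style function 
and a non-progressive one with the idct_add-style function, following 
the pseudo code above.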

The 1.1.1 part decodes 32 DCT blocks sequentially (so the processor can 
keep the 32 DCT blocks in cache), and part 1.1.2 performs 32 iDCTs 
sequentially (again, this clearly improves cache locality).
The modification brought the decoding time down to 92ms (averaged over 
272 frames) on my computer, against 110ms before. This is a 16% speedup.
I've tried different batch sizes, and 32 works well (32 blocks take 
exactly 4096 bytes).
I think the same idea could be applied to other codecs as well.
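
For reference, the 4096-byte figure can be checked with a few lines of 
C. This assumes 64 coefficients per block stored as 16-bit integers, 
which matches the number quoted above; batch_bytes is a hypothetical 
helper, not something from libavcodec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of coefficient storage for a batch of n 8x8 blocks,
 * assuming 16-bit coefficients (an assumption for this sketch). */
static size_t batch_bytes(int n_blocks)
{
    assert(n_blocks >= 0);
    return (size_t)n_blocks * 64 * sizeof(int16_t);
}
```

With these assumptions, batch_bytes(32) is 32 * 64 * 2 = 4096, and every 
doubling of the batch size doubles the cache footprint.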

I've also tried decoding all the DCT blocks first, then doing all the 
IDCTs, in 2 separate passes. There was no speed increase, as the DCT 
coefficients take twice the space of the current picture plane, so we 
soon run out of cache.
It might however be of interest for performing the IDCT on the GPU (if 
anyone is interested, I should still have some code for this).
From NVidia's own tests, the IDCT on the GPU takes 20 times less time 
than the CPU version, so it might finally be worth the doubled memory 
requirement.

If anyone is interested, please mail me and I'll send my changes.
BTW, my branch differs from the current SVN version, and I haven't even 
tried to comply with whatever the coding style of the moment is.
I clearly don't have time to rewrite the file multiple times, like last 
time. If you are in the mood to do it, you're welcome to.

Regards,

-- 
Cyril RUSSO