[FFmpeg-devel] Some optimization on JPEG decoding
Cyril Russo
cyril.russo
Tue Jun 26 18:19:58 CEST 2007
Hi all,
Here are some simple ideas I've implemented in my local copy of
libavcodec, which might be of interest to you.
Concerning JPEG decoding, I've added support for thumbnail decoding.
The idea is to decode only the DC info from the DCT, and produce a picture
which is 8 times smaller in width and height.
The new thumbnail decoding uses its own decode_block, which ignores the
AC part of the DCT.
It also uses its own decode_scan function, which shortcuts the iDCT call
into a simple "*ptr = dcVal >> 3;".
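To illustrate the shortcut, here is a minimal sketch (not my actual patch) of a DC-only thumbnail pass. With the 8x8 JPEG DCT, the DC coefficient is 8 times the block average, hence the shift by 3. The names dc_to_sample and dc_thumbnail are made up for the example, the clamping is an addition, and it assumes the DC values are already dequantized and already include the +128 level shift.

```c
#include <stdint.h>

/* Hypothetical helper: turn one DC coefficient into one thumbnail sample.
 * Assumption: dc_val is dequantized and already carries the level shift. */
static uint8_t dc_to_sample(int dc_val)
{
    if (dc_val < 0)              /* clamp low, also avoids shifting a negative int */
        return 0;
    int v = dc_val >> 3;         /* DC is 8x the block average */
    return v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical helper: write one sample per 8x8 block.
 * dc holds blocks_w * blocks_h values in row-major order. */
static void dc_thumbnail(uint8_t *dst, int dst_stride,
                         const int *dc, int blocks_w, int blocks_h)
{
    for (int by = 0; by < blocks_h; by++)
        for (int bx = 0; bx < blocks_w; bx++)
            dst[by * dst_stride + bx] = dc_to_sample(dc[by * blocks_w + bx]);
}
```

Each 8x8 block collapses to a single sample, which is where the factor 8 in width and height comes from.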
As a result, classic decoding of a 5MP JPEG picture takes 110ms (averaged
over 272 frames) on my computer (plus the downsampling, not included), while
the new thumbnail decoding takes only 55ms (averaged over 272 frames).
So, if you need to generate thumbnails quickly, this is clearly a good
optimization (50% less computation time).
The other idea I've implemented speeds up regular JPEG decoding in the
current code.
Current code does (pseudo code):
1) for all macro blocks
   1) Is it progressive ?
      1) Ok, decode block
      2) Not ok, decode block
   2) Is it progressive ?
      1) Ok, idct_put
      2) Not ok, idct_add
My code does:
1) Is it progressive
   1) Ok, for all macro blocks
      1) decode blocks (plural here, current code does 32 blocks in a batch)
      2) idct_put
   2) Not ok, for all macro blocks
      1) decode blocks (plural here, current code does 32 blocks in a batch)
      2) idct_add
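The restructured loop can be sketched roughly as below. This is an illustration and not the patch itself: decode_block_stub, idct_put_stub and idct_add_stub are hypothetical stand-ins for the real libavcodec routines, and selecting the IDCT once through a function pointer is just one way to hoist the progressive test out of the loop (two separate loops work as well). The put/add mapping follows the pseudo code above.

```c
#include <stdint.h>
#include <string.h>

#define BATCH  32   /* blocks decoded before any IDCT runs */
#define COEFFS 64   /* coefficients per 8x8 block */

typedef void (*idct_fn)(uint8_t *dst, int stride, const int16_t *block);

/* Hypothetical stand-in for the entropy decoder: DC-only block. */
static void decode_block_stub(int16_t *block, int index)
{
    memset(block, 0, COEFFS * sizeof(*block));
    block[0] = (int16_t)(index * 8);
}

/* Hypothetical stand-in for idct_put: overwrite the 8x8 area. */
static void idct_put_stub(uint8_t *dst, int stride, const int16_t *block)
{
    uint8_t v = (uint8_t)(block[0] >> 3);
    for (int y = 0; y < 8; y++)
        memset(dst + y * stride, v, 8);
}

/* Hypothetical stand-in for idct_add: accumulate with saturation. */
static void idct_add_stub(uint8_t *dst, int stride, const int16_t *block)
{
    int v = block[0] >> 3;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            int s = dst[y * stride + x] + v;
            dst[y * stride + x] = s > 255 ? 255 : (uint8_t)s;
        }
}

/* Decode one row of nb_blocks 8x8 blocks, BATCH blocks at a time. */
static void decode_row_batched(uint8_t *dst, int stride,
                               int nb_blocks, int progressive)
{
    int16_t blocks[BATCH][COEFFS];   /* 32 * 64 * 2 = 4096 bytes */
    /* The progressive test is done once, outside the block loop. */
    idct_fn idct = progressive ? idct_put_stub : idct_add_stub;

    for (int first = 0; first < nb_blocks; first += BATCH) {
        int n = nb_blocks - first < BATCH ? nb_blocks - first : BATCH;
        for (int i = 0; i < n; i++)          /* phase 1: decode the batch */
            decode_block_stub(blocks[i], first + i);
        for (int i = 0; i < n; i++)          /* phase 2: IDCT the batch */
            idct(dst + (first + i) * 8, stride, blocks[i]);
    }
}
```

The batch buffer is small enough to stay in cache between the two phases, which is the whole point of the change.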
The 1.1.1 part decodes 32 DCT blocks sequentially (so the processor can
keep the 32 DCT blocks in cache), and part 1.1.2 performs 32 iDCTs
sequentially (again, this clearly improves cache locality).
The modification improved the decoding time from 110ms to 92ms (averaged
over 272 frames) on my computer. This is a 16% speedup.
I've tried different DCT sequence sizes, and 32 is quite good (32 blocks
take exactly 4096 bytes).
I think the same idea could be applied to other codecs as well.
I've also tried decoding all the DCT blocks first and then performing all
the IDCTs, in 2 separate passes. There was no speed increase, as the DCT
coefficients take twice the space of the current picture plane, so we soon
run out of cache.
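To put numbers on the two working sets, here is a small sketch. The helper names are hypothetical, and it assumes 16-bit coefficients (consistent with 32 blocks taking 4096 bytes) and 8-bit samples; the 2592x1944 size used below is only an example of a roughly 5MP plane.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 8x8 blocks covering a plane. */
static size_t plane_blocks(size_t width, size_t height)
{
    return ((width + 7) / 8) * ((height + 7) / 8);
}

/* Hypothetical helper: bytes of coefficient storage for nb_blocks
 * blocks, assuming 64 coefficients of 16 bits each per block. */
static size_t coeff_bytes(size_t nb_blocks)
{
    return nb_blocks * 64 * sizeof(int16_t);
}
```

With these, coeff_bytes(32) is 4096 bytes for one batch, while coeff_bytes(plane_blocks(2592, 1944)) is 10077696 bytes, twice the 5038848 bytes of the 8-bit plane itself.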
It might be of interest, however, to perform the IDCT on the GPU (if
anyone is interested, I should still have some code for this).
According to NVidia's own tests, the IDCT on the GPU takes 20x less time
than the CPU version, so it might finally be worth the double memory
requirement.
If anyone is interested, please mail me, I'll send my changes.
BTW, my branch is different from the current SVN version, and I haven't even
tried to comply with whatever the coding style of the moment is.
I clearly don't have time to rewrite the file multiple times, like last
time. If you are in the mood to do it, you're welcome to.
Regards,
--
Cyril RUSSO