[FFmpeg-devel] Some optimization on JPEG decoding
Cyril Russo
cyril.russo
Wed Jun 27 10:53:50 CEST 2007
Michael Niedermayer wrote:
> Hi
>
> On Tue, Jun 26, 2007 at 06:19:58PM +0200, Cyril Russo wrote:
>
>> Hi all,
>>
>> Here are some simple ideas I've implemented in my local copy of
>> libavcodec which might be of interest to you.
>>
>> Concerning JPEG decoding, I've added support for thumbnail decoding.
>> The idea is to decode only the DC info from the DCT, and produce an image
>> that is 8 times smaller in width and height.
>>
>> The new thumbnail decoding uses its own decode_block, which ignores the
>> AC part of the DCT.
>> It also uses its own decode_scan function, which shortcuts the iDCT call
>> into a simple "*ptr = dcVal >> 3;".
>> As a result, classic 5MP JPEG picture decoding takes 110ms (averaged over
>> 272 frames) on my computer (plus the downsampling, not included), while
>> the new thumbnail decoding takes only 55ms (averaged over 272 frames).
>> So, if you need to generate thumbnails quickly, this is clearly a good
>> optimization (50% less computation time).
>>
>
> IIRC lowres mode is already supported in jpeg, if you have some improvements
> for that they are welcome
>
>
Sure, and I've used that. However, when the lowres factor is 3 (meaning
1/8th of width and height), you are in the case where you can ignore the
AC part of the DCT, because downsampling a DCT block to 1/8th keeps only
the DC part anyway.
The IDCT is even simpler if you know that all AC coefficients are zero,
because the iDCT can then simply be computed as dcValue / 8 (a little math
required, but easy to demonstrate; see the sketch below).
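For reference, a short sketch of that math, using the standard 2-D inverse
DCT from the JPEG spec:

    f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\,C(v)\,F(u,v)\,
             \cos\frac{(2x+1)u\pi}{16}\,\cos\frac{(2y+1)v\pi}{16}

With every AC coefficient zero, only the u = v = 0 term survives, and since
C(0) = 1/\sqrt{2}:

    f(x,y) = \frac{1}{4}\cdot\frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\cdot F(0,0)
           = \frac{F(0,0)}{8}

for every pixel of the 8x8 block, which is exactly the ">> 3" in the code below.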
I don't think lowres alone works correctly, though (unless I'm missing
something, idct_put has no idea about the lowres value, it is given a
linesize which is not shifted down by lowres, and so it will overflow the
picture buffer?).
In my code, I've removed the idct_put call, since when the lowres factor
is 3 there is no need for an inverse DCT at all. I haven't tried with it
enabled, though.
My code does this:
/* Decode one block and dequantize it, keeping only the DC coefficient.
   The AC coefficients are still parsed (to keep the bitstream in sync),
   but their values are discarded. */
static int decode_block_tn(MJpegDecodeContext *s, DCTELEM *block,
                           int component, int dc_index, int ac_index,
                           int16_t *quant_matrix)
{
    int code, i, val;

    /* DC coef */
    val = mjpeg_decode_dc(s, dc_index);
    if (val == 0xffff) {
        av_log(s->avctx, AV_LOG_ERROR, "error dc\n");
        return -1;
    }
    val = val * quant_matrix[0] + s->last_dc[component];
    s->last_dc[component] = val;
    block[0] = val;

    /* AC coefs: parsed but not stored */
    i = 0;
    {OPEN_READER(re, &s->gb)
    for(;;) {
        UPDATE_CACHE(re, &s->gb);
        GET_VLC(code, re, &s->gb, s->vlcs[1][ac_index].table, 9, 2)

        /* EOB */
        if (code == 0x10)
            break;
        i += ((unsigned)code) >> 4;
        if(code != 0x100){
            code &= 0xf;
            if(code > MIN_CACHE_BITS - 16){
                UPDATE_CACHE(re, &s->gb)
            }
            {
                int cache = GET_CACHE(re, &s->gb); // Not sure what this does; if it does nothing, it is probably even faster to remove it
            }
            LAST_SKIP_BITS(re, &s->gb, code)
            if (i >= 63) {
                if(i == 63){
                    break;
                }
                av_log(s->avctx, AV_LOG_ERROR, "error count: %d\n", i);
                return -1;
            }
        }
    }
    CLOSE_READER(re, &s->gb)}
    return 0;
}
/* Decode one scan when the CodecID is MJPEGTN. Overall speed gain is more
   than 50% compared to classic decoding (not counting the downsampling
   step, which is avoided entirely). */
static int mjpeg_decode_one_scan_tn(MJpegDecodeContext *s, int id)
{
    int mb_x, mb_y;
    int c = s->comp_index[id];

    for(mb_y = 0; mb_y < s->mb_height; mb_y++) {
        for(mb_x = 0; mb_x < s->mb_width; mb_x++) {
            uint8_t *ptr;
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            /* only block[0] (the DC coefficient) is ever used, so clearing
               the first element is enough */
            memset(s->block, 0, sizeof(s->block[0]));
            if (decode_block_tn(s, s->block, id,
                                s->dc_index[0], s->ac_index[0],
                                s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                dprintf("error y=%d x=%d\n", mb_y, mb_x);
                return -1;
            }
            ptr = s->picture.data[c] + (((s->linesize[c] * mb_y * 8) +
                                         mb_x * 8) >> s->avctx->lowres);
            if (s->interlaced && s->bottom_field)
                ptr += s->linesize[c] >> 1;
            *ptr = (uint8_t)(s->block[0] >> 3);
        }
    }
    return 0;
}
/* Optimized version of the scan decoder for thumbnail decoding (doesn't
   support progressive JPEG). Changes w.r.t. the original version are the
   decode_block_tn call in place of decode_block, and the
   "*ptr = dcVal >> 3" part in place of the s->dsp.idct_put call. */
static int mjpeg_decode_scan_tn(MJpegDecodeContext *s, int nb_components,
                                int ss, int se, int Ah, int Al)
{
    int i, mb_x, mb_y;
    int EOBRUN = 0;

    if(Ah) return 0; /* TODO decode refinement planes too */

    for(mb_y = 0; mb_y < s->mb_height; mb_y++) {
        for(mb_x = 0; mb_x < s->mb_width; mb_x++) {
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            for(i=0;i<nb_components;i++) {
                uint8_t *ptr;
                int n, h, v, x, y, c, j;
                n = s->nb_blocks[i];
                c = s->comp_index[i];
                h = s->h_scount[i];
                v = s->v_scount[i];
                x = 0;
                y = 0;
                for(j=0;j<n;j++) {
                    memset(s->block, 0, sizeof(s->block));
                    if (decode_block_tn(s, s->block, i,
                                        s->dc_index[i], s->ac_index[i],
                                        s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                        av_log(s->avctx, AV_LOG_ERROR, "error y=%d x=%d\n", mb_y, mb_x);
                        return -1;
                    }
                    // av_log(s->avctx, AV_LOG_DEBUG, "mb: %d %d processed\n", mb_y, mb_x);
                    ptr = s->picture.data[c] +
                          (((s->linesize[c] * (v * mb_y + y) * 8) +
                            (h * mb_x + x) * 8) >> s->avctx->lowres);
                    if (s->interlaced && s->bottom_field)
                        ptr += s->linesize[c] >> 1;
                    //av_log(NULL, AV_LOG_DEBUG, "%d %d %d %d %d %d %d %d \n", mb_x, mb_y, x, y, c, s->bottom_field, (v * mb_y + y) * 8, (h * mb_x + x) * 8);
                    *ptr = s->block[0] >> 3;
                    if (++x == h) {
                        x = 0;
                        y++;
                    }
                }
            }
            /* (< 1350) buggy workaround for Spectralfan.mov, should be fixed */
            if (s->restart_interval && (s->restart_interval < 1350) &&
                !--s->restart_count) {
                align_get_bits(&s->gb);
                skip_bits(&s->gb, 16); /* skip RSTn */
                for (i=0; i<nb_components; i++) /* reset dc */
                    s->last_dc[i] = 1024;
            }
        }
    }
    return 0;
}
The end of ff_mjpeg_decode_sos now looks like this:
    }else{
        if (!s->thumbnail || s->progressive)
        {
            if (nb_components > 1){
                if (mjpeg_decode_scan(s, nb_components, predictor, ilv,
                                      prev_shift, point_transform) < 0)
                    return -1;
            }
            else if (nb_components){
                if (s->avctx->codec_id == CODEC_ID_MJPEGFH){
                    if (mjpeg_decode_one_scan_fh(s, 0) < 0) return -1;
                } else{
                    if (mjpeg_decode_one_scan(s, 0) < 0) return -1;
                }
            }
        } else{
            if (nb_components > 1){
                if (mjpeg_decode_scan_tn(s, nb_components, predictor,
                                         ilv, prev_shift, point_transform) < 0)
                    return -1;
            }
            else if (nb_components){
                if (mjpeg_decode_one_scan_tn(s, 0) < 0)
                    return -1;
            }
        }
    }
I've added a new member to MJpegDecodeContext, which is "int thumbnail;".
I could have done without it by testing "codec_id == CODEC_ID_MJPEGTN &&
lowres == 3", but I thought it was cleaner to have a flag that users could
know about and inspect (a small sketch follows).
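To illustrate (the field name is the one mentioned above, its placement
inside the struct is arbitrary here), the change amounts to:

    /* in MJpegDecodeContext */
    int thumbnail;   /* set when the codec was opened as CODEC_ID_MJPEGTN */

    /* equivalent test without the flag, repeated at every use site:
       s->avctx->codec_id == CODEC_ID_MJPEGTN && s->avctx->lowres == 3 */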
>
>>
>> The other idea I've implemented is about speeding up JPEG decoding in
>> the current code.
>> The current code does (pseudo code):
>> 1) for all macro blocks
>>    1) Is it progressive?
>>       1) Yes: decode block
>>       2) No: decode block
>>    2) Is it progressive?
>>       1) Yes: idct_put
>>       2) No: idct_add
>>
>> My code does:
>> 1) Is it progressive?
>>    1) Yes: for all macro blocks
>>       1) decode blocks (plural here, the new code does 32 blocks in a batch)
>>       2) idct_put
>>    2) No: for all macro blocks
>>       1) decode blocks (plural here, the new code does 32 blocks in a batch)
>>       2) idct_add
>>
>>
>>
>>
>
> if it's clean (no code duplication but rather uses always_inline) and faster
> then it's welcome
>
The code here uses duplication, because I didn't want to break the existing
code. It looks like this:
/* Batch Huffman decoding, then batch IDCTing.
   Doesn't work with progressive JPEG yet; needs to be implemented for that
   case, but I don't have any progressive files to test the code with. */
static int mjpeg_decode_one_scan_fh(MJpegDecodeContext *s, int id)
{
    int mb_x = 0, mb_y = 0, i;
    int c = s->comp_index[id];
    const int nbPreProcess = 32;
    /* gcc rejects sizeof() * nbPreProcess even though it's const;
       note: static buffer, so this is not reentrant */
    static uint8_t someBlocks[sizeof(s->block) * 32];
    int nbBlocks = s->mb_height * s->mb_width;

    while (nbBlocks)
    {
        int limit = nbPreProcess < nbBlocks ? nbPreProcess : nbBlocks;
        memset(someBlocks, 0, sizeof(someBlocks));
        /* first pass: entropy-decode up to 32 blocks back to back */
        for (i = 0; i < limit; i++){
            if (decode_block(s, (DCTELEM*)&someBlocks[i * sizeof(s->block)], id,
                             s->dc_index[0], s->ac_index[0],
                             s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                dprintf("error y=%d x=%d\n", mb_y, mb_x);
                return -1;
            }
        }
        /* second pass: run the IDCT on the same 32 blocks while they are
           still in cache */
        for (i = 0; i < limit; i++){
            uint8_t *ptr;
            if (mb_x == s->mb_width) { mb_x = 0; mb_y++; }
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            ptr = s->picture.data[c] + (((s->linesize[c] * mb_y * 8) +
                                         mb_x * 8) >> s->avctx->lowres);
            s->dsp.idct_put(ptr, s->linesize[c],
                            (DCTELEM*)&someBlocks[i * sizeof(s->block)]);
            mb_x++;
            nbBlocks--;
        }
    }
    return 0;
}
>
>
>> The 1.1.1 part decodes 32 DCT blocks sequentially (so the processor can
>> keep the 32 DCT blocks in cache), and the 1.1.2 part performs 32 iDCTs
>> sequentially (again, this clearly improves cache coherency).
>> The modification improved the decoding time to 92ms (averaged over 272
>> frames) on my computer. This is a 16% speedup.
>> I've tried different batch sizes, and 32 is quite good (32 blocks take
>> exactly 4096 bytes).
>> I think the same idea could be applied to other codecs as well.
>>
>> I've also tried performing all the block decoding first, then all the
>> IDCTs, as two separate passes over the whole frame. There was no speed
>> increase, as the DCT coefficients take twice the space of the picture
>> plane, so we soon fall out of cache.
>> It might be of interest however to perform the IDCT on the GPU (if
>> anyone is interested, I should still have some code about this).
>> From NVidia's own tests, the IDCT on the GPU takes 20x less time than
>> the CPU version, so it might finally be worth the double memory
>> requirement.
>>
>> If anyone is interested, please mail me, I'll send my changes.
>>
>
> you can post them here
>
Other changes are (in ff_mjpeg_decode_sof):
    if (s->thumbnail && (width != s->width / 8 || height != s->height / 8))
    {
        av_freep(&s->qscale_table);

        s->width = width / 8;
        s->height = height / 8;
        s->interlaced = 0;

        /* test interlaced mode */
        if (s->first_picture &&
            s->org_height != 0 &&
            s->height < ((s->org_height * 3) / 4))
        {
            s->interlaced = 1;
            s->bottom_field = s->interlace_polarity;
            s->picture.interlaced_frame = 1;
            s->picture.top_field_first = !s->interlace_polarity;
            height *= 2;
        }

        s->avctx->lowres = 3;
        avcodec_set_dimensions(s->avctx, width, height);
        s->qscale_table = av_mallocz((width+15)/16);
        s->first_picture = 0;
    }
    else if (width != s->width || height != s->height) {
and (at the end of ff_mjpeg_decode_init):

    s->thumbnail = (avctx->codec_id == CODEC_ID_MJPEGTN);
    return 0;
I don't provide a diff, as it would include very specific (unrelated)
changes from my own branch, or I would have to edit the patch by hand,
with all the caveats that implies.
You'll need to add the new codec IDs: CODEC_ID_MJPEGTN for thumbnail
decoding and CODEC_ID_MJPEGFH for full Huffman decoding (a rough usage
sketch follows).
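For completeness, here is a rough usage sketch of the thumbnail path. It
assumes CODEC_ID_MJPEGTN has been added to enum CodecID and an AVCodec
entry for it has been registered (mirroring the existing mjpeg_decoder
entry); the helper name is mine, and error handling is only sketched:

#include "avcodec.h"

static int decode_thumbnail(uint8_t *jpeg_buf, int jpeg_size, AVFrame *out)
{
    AVCodec        *codec;
    AVCodecContext *ctx;
    int got_picture = 0;

    avcodec_register_all();
    codec = avcodec_find_decoder(CODEC_ID_MJPEGTN);
    if (!codec)
        return -1;
    ctx = avcodec_alloc_context();
    if (avcodec_open(ctx, codec) < 0)
        return -1;
    /* ff_mjpeg_decode_sof forces lowres = 3 in thumbnail mode, so 'out'
       ends up width/8 x height/8 */
    if (avcodec_decode_video(ctx, out, &got_picture, jpeg_buf, jpeg_size) < 0
        || !got_picture)
        return -1;
    avcodec_close(ctx);
    av_free(ctx);
    return 0;
}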
BTW, it's an experimental solution, hence the code duplication, until
someone wants to have a look at it and decides it's worth merging into
the official code.
I couldn't get the ffmpeg application to understand my JPEG files (or I
don't understand which option to provide; something has been broken since
my last checkout, I guess, as this used to work), so I test my code as a
shared DLL under Windows with a profiling application.
>
>
>> BTW, my branch is different from the current SVN version, and I haven't
>> even tried to comply with whatever coding style is current at the moment.
>> I clearly don't have the time to rewrite the file multiple times, like
>> last time. If you are in the mood to do it, you're welcome.
>>
>
> well if someone (you or someone else) does provide clean patches which
> pass review then they are welcome
> messy patches though won't reach svn no matter what great improvements
> they provide, you can fork ffmpeg and learn on your own why applying
> messy patches is a very very bad idea
>
Sure, patches that break code are a PITA (I maintain my own repository
for my job too).
I usually don't try to keep in sync with the official branches (I clearly
can't duplicate your effort).
But the speed improvement was significant enough that I thought it would
be better for everyone to profit from the change.
I haven't tried applying both techniques to thumbnail decoding; that will
probably be the next step.
BTW, splitting the entropy decoding from the IDCT in every codec is a path
worth exploring, as it improves cache coherency on the CPU, would probably
allow GPU IDCT decoding in the future (which, I know, ffmpeg doesn't
provide yet), and could easily be threaded to speed things up on multicore
CPUs (see the sketch below).
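To make the threading point concrete, here is a minimal sketch (none of
this is existing FFmpeg API; the struct and names are made up): once a
frame's blocks have been entropy-decoded into a flat coefficient array,
the IDCT pass has no dependency between blocks, so it can be split across
cores:

#include <pthread.h>
#include <stdint.h>

typedef struct IdctJob {
    int16_t (*blocks)[64];     /* entropy-decoded, dequantized coefficients */
    uint8_t *dst;              /* destination plane */
    int      linesize;
    int      first, count;     /* range of 8x8 blocks this job handles */
    int      blocks_per_row;
    void   (*idct_put)(uint8_t *dest, int linesize, int16_t *block);
} IdctJob;

static void *idct_worker(void *arg)
{
    IdctJob *job = arg;
    int i;
    for (i = job->first; i < job->first + job->count; i++) {
        int bx = i % job->blocks_per_row;
        int by = i / job->blocks_per_row;
        job->idct_put(job->dst + (by * 8) * job->linesize + bx * 8,
                      job->linesize, job->blocks[i]);
    }
    return NULL;
}

/* usage: fill two IdctJob structs covering the first and second half of the
   frame's blocks, pthread_create() one of them, run the other in the calling
   thread, then pthread_join(). */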
The decode_block_tn code contains a GET_CACHE macro I don't understand.
If it is not required, removing it would simplify the entropy decoding
even further.
Sincerely,
--
Cyril RUSSO