[FFmpeg-devel] Some optimization on JPEG decoding
Cyril Russo
cyril.russo
Wed Jun 27 10:53:50 CEST 2007
Michael Niedermayer wrote:
> Hi
>
> On Tue, Jun 26, 2007 at 06:19:58PM +0200, Cyril Russo wrote:
>
>> Hi all,
>>
>> Here are some simple ideas I've implemented in my local copy of
>> libavcodec which might be of interest to you.
>>
>> Concerning JPEG decoding, I've added support for thumbnail decoding.
>> The idea is to decode only the DC info from the DCT, and produce an image
>> that is 8 times smaller in width and height.
>>
>> The new thumbnail decoding uses its own decode_block, which ignores the
>> AC part of the DCT.
>> It also uses its own decode_scan function, which shortcuts the iDCT call
>> into a simple "*ptr = dcVal >> 3;".
>> As a result, classic 5MP JPEG picture decoding takes 110ms (averaged over
>> 272 frames) on my computer (plus the downsampling, not included), while
>> the new thumbnail decoding takes only 55ms (averaged over 272 frames).
>> So, if you need to generate thumbnails quickly, this is clearly a good
>> optimization (50% less computation time).
>>
>
> IIRC lowres mode is already supported in jpeg, if you have some improvements
> for that they are welcome
>
>
Sure, and I've used that. However, when the lowres factor is 3 (meaning
1/8th of width and height), you are in the case where you can ignore the
AC part of the DCT, because downsampling a DCT block to 1/8th keeps only
the DC part anyway.
The IDCT is even simpler if you know that all AC coefficients are zero,
because the iDCT can then simply be computed as dcValue / 8 (a little math
required, but easy to demonstrate; see the sketch below).
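For reference, a short sketch of that math, using the standard 2-D inverse
DCT from the JPEG spec:

    f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\,C(v)\,F(u,v)\,
             \cos\frac{(2x+1)u\pi}{16}\,\cos\frac{(2y+1)v\pi}{16}

With every AC coefficient zero, only the u = v = 0 term survives, and since
C(0) = 1/\sqrt{2}:

    f(x,y) = \frac{1}{4}\cdot\frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\cdot F(0,0)
           = \frac{F(0,0)}{8}

for every pixel of the 8x8 block, which is exactly the ">> 3" in the code below.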
I don't think lowres alone works correctly, though (unless I'm missing
something, idct_put has no idea about the lowres value, it is given a
linesize which is not shifted down by lowres, and so it will overflow the
picture buffer?).
In my code, I've removed the idct_put call, since when the lowres factor
is 3 there is no need for an inverse DCT at all. I haven't tried with it
enabled, though.
My code does this:
/* Decode one block and dequantize it, keeping only the DC coefficient.
   The AC coefficients are still parsed (to keep the bitstream in sync),
   but their values are discarded. */
static int decode_block_tn(MJpegDecodeContext *s, DCTELEM *block,
                           int component, int dc_index, int ac_index,
                           int16_t *quant_matrix)
{
    int code, i, val;

    /* DC coef */
    val = mjpeg_decode_dc(s, dc_index);
    if (val == 0xffff) {
        av_log(s->avctx, AV_LOG_ERROR, "error dc\n");
        return -1;
    }
    val = val * quant_matrix[0] + s->last_dc[component];
    s->last_dc[component] = val;
    block[0] = val;

    /* AC coefs: parsed but not stored */
    i = 0;
    {OPEN_READER(re, &s->gb)
    for(;;) {
        UPDATE_CACHE(re, &s->gb);
        GET_VLC(code, re, &s->gb, s->vlcs[1][ac_index].table, 9, 2)

        /* EOB */
        if (code == 0x10)
            break;
        i += ((unsigned)code) >> 4;
        if(code != 0x100){
            code &= 0xf;
            if(code > MIN_CACHE_BITS - 16){
                UPDATE_CACHE(re, &s->gb)
            }
            {
                int cache = GET_CACHE(re, &s->gb); // Not sure what this does; if it does nothing, it is probably even faster to remove it
            }
            LAST_SKIP_BITS(re, &s->gb, code)
            if (i >= 63) {
                if(i == 63){
                    break;
                }
                av_log(s->avctx, AV_LOG_ERROR, "error count: %d\n", i);
                return -1;
            }
        }
    }
    CLOSE_READER(re, &s->gb)}
    return 0;
}
/* Decode one scan when the CodecID is MJPEGTN. Overall speed gain is more
   than 50% compared to classic decoding (not counting the downsampling
   step, which is avoided entirely). */
static int mjpeg_decode_one_scan_tn(MJpegDecodeContext *s, int id)
{
    int mb_x, mb_y;
    int c = s->comp_index[id];

    for(mb_y = 0; mb_y < s->mb_height; mb_y++) {
        for(mb_x = 0; mb_x < s->mb_width; mb_x++) {
            uint8_t *ptr;
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            /* only block[0] (the DC coefficient) is ever used, so clearing
               the first element is enough */
            memset(s->block, 0, sizeof(s->block[0]));
            if (decode_block_tn(s, s->block, id,
                                s->dc_index[0], s->ac_index[0],
                                s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                dprintf("error y=%d x=%d\n", mb_y, mb_x);
                return -1;
            }
            ptr = s->picture.data[c] + (((s->linesize[c] * mb_y * 8) +
                                         mb_x * 8) >> s->avctx->lowres);
            if (s->interlaced && s->bottom_field)
                ptr += s->linesize[c] >> 1;
            *ptr = (uint8_t)(s->block[0] >> 3);
        }
    }
    return 0;
}
/* Optimized version of the scan decoder for thumbnail decoding (doesn't
   support progressive JPEG). Changes w.r.t. the original version are the
   decode_block_tn call in place of decode_block, and the
   "*ptr = dcVal >> 3" part in place of the s->dsp.idct_put call. */
static int mjpeg_decode_scan_tn(MJpegDecodeContext *s, int nb_components,
                                int ss, int se, int Ah, int Al)
{
    int i, mb_x, mb_y;
    int EOBRUN = 0;

    if(Ah) return 0; /* TODO decode refinement planes too */

    for(mb_y = 0; mb_y < s->mb_height; mb_y++) {
        for(mb_x = 0; mb_x < s->mb_width; mb_x++) {
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            for(i=0;i<nb_components;i++) {
                uint8_t *ptr;
                int n, h, v, x, y, c, j;
                n = s->nb_blocks[i];
                c = s->comp_index[i];
                h = s->h_scount[i];
                v = s->v_scount[i];
                x = 0;
                y = 0;
                for(j=0;j<n;j++) {
                    memset(s->block, 0, sizeof(s->block));
                    if (decode_block_tn(s, s->block, i,
                                        s->dc_index[i], s->ac_index[i],
                                        s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                        av_log(s->avctx, AV_LOG_ERROR, "error y=%d x=%d\n", mb_y, mb_x);
                        return -1;
                    }
                    // av_log(s->avctx, AV_LOG_DEBUG, "mb: %d %d processed\n", mb_y, mb_x);
                    ptr = s->picture.data[c] +
                          (((s->linesize[c] * (v * mb_y + y) * 8) +
                            (h * mb_x + x) * 8) >> s->avctx->lowres);
                    if (s->interlaced && s->bottom_field)
                        ptr += s->linesize[c] >> 1;
                    //av_log(NULL, AV_LOG_DEBUG, "%d %d %d %d %d %d %d %d \n", mb_x, mb_y, x, y, c, s->bottom_field, (v * mb_y + y) * 8, (h * mb_x + x) * 8);
                    *ptr = s->block[0] >> 3;
                    if (++x == h) {
                        x = 0;
                        y++;
                    }
                }
            }
            /* (< 1350) buggy workaround for Spectralfan.mov, should be fixed */
            if (s->restart_interval && (s->restart_interval < 1350) &&
                !--s->restart_count) {
                align_get_bits(&s->gb);
                skip_bits(&s->gb, 16); /* skip RSTn */
                for (i=0; i<nb_components; i++) /* reset dc */
                    s->last_dc[i] = 1024;
            }
        }
    }
    return 0;
}
The end of ff_mjpeg_decode_sos now looks like this:
    }else{
        if (!s->thumbnail || s->progressive)
        {
            if (nb_components > 1){
                if (mjpeg_decode_scan(s, nb_components, predictor, ilv,
                                      prev_shift, point_transform) < 0)
                    return -1;
            }
            else if (nb_components){
                if (s->avctx->codec_id == CODEC_ID_MJPEGFH){
                    if (mjpeg_decode_one_scan_fh(s, 0) < 0) return -1;
                } else{
                    if (mjpeg_decode_one_scan(s, 0) < 0) return -1;
                }
            }
        } else{
            if (nb_components > 1){
                if (mjpeg_decode_scan_tn(s, nb_components, predictor,
                                         ilv, prev_shift, point_transform) < 0)
                    return -1;
            }
            else if (nb_components){
                if (mjpeg_decode_one_scan_tn(s, 0) < 0)
                    return -1;
            }
        }
    }
I've added a new member to MJpegDecodeContext, which is "int thumbnail;".
I could have done without it by testing "codec_id == CODEC_ID_MJPEGTN &&
lowres == 3", but I thought it was cleaner to have a flag that users could
know about and inspect (a small sketch follows).
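To illustrate (the field name is the one mentioned above, its placement
inside the struct is arbitrary here), the change amounts to:

    /* in MJpegDecodeContext */
    int thumbnail;   /* set when the codec was opened as CODEC_ID_MJPEGTN */

    /* equivalent test without the flag, repeated at every use site:
       s->avctx->codec_id == CODEC_ID_MJPEGTN && s->avctx->lowres == 3 */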
>
>>
>> The other idea I've implemented is about speeding up JPEG decoding in
>> the current code.
>> The current code does (pseudo code):
>> 1) for all macro blocks
>>    1) Is it progressive?
>>       1) Yes: decode block
>>       2) No: decode block
>>    2) Is it progressive?
>>       1) Yes: idct_put
>>       2) No: idct_add
>>
>> My code does:
>> 1) Is it progressive?
>>    1) Yes: for all macro blocks
>>       1) decode blocks (plural here, the new code does 32 blocks in a batch)
>>       2) idct_put
>>    2) No: for all macro blocks
>>       1) decode blocks (plural here, the new code does 32 blocks in a batch)
>>       2) idct_add
>>
>>
>>
>>
>
> if it's clean (no code duplication but rather uses always_inline) and faster
> then it's welcome
>
The code here uses duplication, because I didn't want to break the existing
code. It looks like this:
/* Batch Huffman decoding, then batch IDCTing.
   Doesn't work with progressive JPEG yet; needs to be implemented for that
   case, but I don't have any progressive files to test the code with. */
static int mjpeg_decode_one_scan_fh(MJpegDecodeContext *s, int id)
{
    int mb_x = 0, mb_y = 0, i;
    int c = s->comp_index[id];
    const int nbPreProcess = 32;
    /* gcc rejects sizeof() * nbPreProcess even though it's const;
       note: static buffer, so this is not reentrant */
    static uint8_t someBlocks[sizeof(s->block) * 32];
    int nbBlocks = s->mb_height * s->mb_width;

    while (nbBlocks)
    {
        int limit = nbPreProcess < nbBlocks ? nbPreProcess : nbBlocks;
        memset(someBlocks, 0, sizeof(someBlocks));
        /* first pass: entropy-decode up to 32 blocks back to back */
        for (i = 0; i < limit; i++){
            if (decode_block(s, (DCTELEM*)&someBlocks[i * sizeof(s->block)], id,
                             s->dc_index[0], s->ac_index[0],
                             s->quant_matrixes[ s->quant_index[c] ]) < 0) {
                dprintf("error y=%d x=%d\n", mb_y, mb_x);
                return -1;
            }
        }
        /* second pass: run the IDCT on the same 32 blocks while they are
           still in cache */
        for (i = 0; i < limit; i++){
            uint8_t *ptr;
            if (mb_x == s->mb_width) { mb_x = 0; mb_y++; }
            if (s->restart_interval && !s->restart_count)
                s->restart_count = s->restart_interval;
            ptr = s->picture.data[c] + (((s->linesize[c] * mb_y * 8) +
                                         mb_x * 8) >> s->avctx->lowres);
            s->dsp.idct_put(ptr, s->linesize[c],
                            (DCTELEM*)&someBlocks[i * sizeof(s->block)]);
            mb_x++;
            nbBlocks--;
        }
    }
    return 0;
}
>
>
>> The 1.1.1 part decodes 32 DCT blocks sequentially (so the processor can
>> keep the 32 DCT blocks in cache), and the 1.1.2 part performs 32 iDCTs
>> sequentially (again, this clearly improves cache coherency).
>> The modification improved the decoding time to 92ms (averaged over 272
>> frames) on my computer. This is a 16% speedup.
>> I've tried different batch sizes, and 32 is quite good (32 blocks take
>> exactly 4096 bytes).
>> I think the same idea could be applied to other codecs as well.
>>
>> I've also tried performing all the block decoding first, then all the
>> IDCTs, as two separate passes over the whole frame. There was no speed
>> increase, as the DCT coefficients take twice the space of the picture
>> plane, so we soon fall out of cache.
>> It might be of interest however to perform the IDCT on the GPU (if
>> anyone is interested, I should still have some code about this).
>> From NVidia's own tests, the IDCT on the GPU takes 20x less time than
>> the CPU version, so it might finally be worth the double memory
>> requirement.
>>
>> If anyone is interested, please mail me, I'll send my changes.
>>
>
> you can post them here
>
Other changes are (in ff_mjpeg_decode_sof):
    if (s->thumbnail && (width != s->width / 8 || height != s->height / 8))
    {
        av_freep(&s->qscale_table);

        s->width = width / 8;
        s->height = height / 8;
        s->interlaced = 0;

        /* test interlaced mode */
        if (s->first_picture &&
            s->org_height != 0 &&
            s->height < ((s->org_height * 3) / 4))
        {
            s->interlaced = 1;
            s->bottom_field = s->interlace_polarity;
            s->picture.interlaced_frame = 1;
            s->picture.top_field_first = !s->interlace_polarity;
            height *= 2;
        }

        s->avctx->lowres = 3;
        avcodec_set_dimensions(s->avctx, width, height);
        s->qscale_table = av_mallocz((width+15)/16);
        s->first_picture = 0;
    }
    else if (width != s->width || height != s->height) {
and (at the end of ff_mjpeg_decode_init):

    s->thumbnail = (avctx->codec_id == CODEC_ID_MJPEGTN);
    return 0;
I don't provide a diff, as it would include very specific (unrelated)
changes from my own branch, or I would have to edit the patch by hand,
with all the caveats that implies.
You'll need to add the new codec IDs: CODEC_ID_MJPEGTN for thumbnail
decoding and CODEC_ID_MJPEGFH for full Huffman decoding (a rough usage
sketch follows).
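For completeness, here is a rough usage sketch of the thumbnail path. It
assumes CODEC_ID_MJPEGTN has been added to enum CodecID and an AVCodec
entry for it has been registered (mirroring the existing mjpeg_decoder
entry); the helper name is mine, and error handling is only sketched:

#include "avcodec.h"

static int decode_thumbnail(uint8_t *jpeg_buf, int jpeg_size, AVFrame *out)
{
    AVCodec        *codec;
    AVCodecContext *ctx;
    int got_picture = 0;

    avcodec_register_all();
    codec = avcodec_find_decoder(CODEC_ID_MJPEGTN);
    if (!codec)
        return -1;
    ctx = avcodec_alloc_context();
    if (avcodec_open(ctx, codec) < 0)
        return -1;
    /* ff_mjpeg_decode_sof forces lowres = 3 in thumbnail mode, so 'out'
       ends up width/8 x height/8 */
    if (avcodec_decode_video(ctx, out, &got_picture, jpeg_buf, jpeg_size) < 0
        || !got_picture)
        return -1;
    avcodec_close(ctx);
    av_free(ctx);
    return 0;
}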
BTW, it's an experimental solution, hence the code duplication, until
someone wants to have a look at it and decides it's worth merging into
the official code.
I couldn't get the ffmpeg application to understand my JPEG files (or I
don't understand which option to provide; something has been broken since
my last checkout, I guess, as this used to work), so I test my code as a
shared DLL under Windows with a profiling application.
>
>
>> BTW, my branch is different from the current SVN version, and I haven't
>> even tried to comply with whatever coding style is current at the moment.
>> I clearly don't have the time to rewrite the file multiple times, like
>> last time. If you are in the mood to do it, you're welcome.
>>
>
> well if someone (you or someone else) does provide clean patches which
> pass review then they are welcome
> messy patches though won't reach svn no matter what great improvements
> they provide, you can fork ffmpeg and learn on your own why applying
> messy patches is a very very bad idea
>
Sure, patches that break code are a PITA (I maintain my own repository
for my job too).
I usually don't try to keep in sync with the official branches (I clearly
can't duplicate your effort).
But the speed improvement was significant enough that I thought it would
be better for everyone to profit from the change.
I haven't tried applying both techniques to thumbnail decoding; that will
probably be the next step.
BTW, splitting the entropy decoding from the IDCT in every codec is a path
worth exploring, as it improves cache coherency on the CPU, would probably
allow GPU IDCT decoding in the future (which, I know, ffmpeg doesn't
provide yet), and could easily be threaded to speed things up on multicore
CPUs (see the sketch below).
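To make the threading point concrete, here is a minimal sketch (none of
this is existing FFmpeg API; the struct and names are made up): once a
frame's blocks have been entropy-decoded into a flat coefficient array,
the IDCT pass has no dependency between blocks, so it can be split across
cores:

#include <pthread.h>
#include <stdint.h>

typedef struct IdctJob {
    int16_t (*blocks)[64];     /* entropy-decoded, dequantized coefficients */
    uint8_t *dst;              /* destination plane */
    int      linesize;
    int      first, count;     /* range of 8x8 blocks this job handles */
    int      blocks_per_row;
    void   (*idct_put)(uint8_t *dest, int linesize, int16_t *block);
} IdctJob;

static void *idct_worker(void *arg)
{
    IdctJob *job = arg;
    int i;
    for (i = job->first; i < job->first + job->count; i++) {
        int bx = i % job->blocks_per_row;
        int by = i / job->blocks_per_row;
        job->idct_put(job->dst + (by * 8) * job->linesize + bx * 8,
                      job->linesize, job->blocks[i]);
    }
    return NULL;
}

/* usage: fill two IdctJob structs covering the first and second half of the
   frame's blocks, pthread_create() one of them, run the other in the calling
   thread, then pthread_join(). */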
The decode_block_tn code contains a GET_CACHE macro I don't understand.
If it is not required, removing it would simplify the entropy decoding
even further.
Sincerely,
--
Cyril RUSSO