[FFmpeg-devel] [PATCH] Heavy optimization of IFF decoder

Tue Apr 27 22:54:08 CEST 2010

Sebastian Vater a ?crit :
> Michael Niedermayer a ?crit :
>   
>> On Tue, Apr 27, 2010 at 09:59:13PM +0200, Sebastian Vater wrote:
>>   
>>     
>>> Ronald S. Bultje a ?crit :
>>>     
>>>       
>>>> Hi,
>>>>
>>>> On Tue, Apr 27, 2010 at 3:05 PM, Sebastian Vater
>>>> <cdgs.basty at googlemail.com> wrote:
>>>>   
>>>>       
>>>>         
>>>>> M?ns Rullg?rd a ?crit :
>>>>>     
>>>>>         
>>>>>           
>>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>>       
>>>>>>           
>>>>>>             
>>>>>>>  {
>>>>>>>      GetBitContext gb;
>>>>>>>      unsigned int i;
>>>>>>>      const unsigned b = (buf_size * 8) + bps - 1;
>>>>>>> +    const unsigned b32 = b & ~3;
>>>>>>>      init_get_bits(&gb, buf, buf_size * 8);
>>>>>>> -    for(i = 0; i < b; i++) {
>>>>>>> +    for(i = 0; i < b32; i += 4) {
>>>>>>> +        const uint32_t v = decodeplane8_tab[plane][get_bits(&gb, 4)];
>>>>>>> +        AV_WN32A(dst+i, AV_RN32A(dst+i) | v);
>>>>>>> +    }
>>>>>>>
>>>>>>>         
>>>>>>>             
>>>>>>>               
>>>>>> I suggest using a local variable here, like this:
>>>>>>
>>>>>>     const uint32_t *lut = decodeplane8_tab[plane];
>>>>>>     [...]
>>>>>>     uint32_t v = lut[get_bits(...)];
>>>>>>
>>>>>> I don't trust gcc to do that on its own.
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>             
>>>>> Didn't change anything, i.e. still 20% slower...what now?
>>>>>     
>>>>>         
>>>>>           
>>>> With that, it becomes from 20% slower to 3-4% faster for me. Check
>>>> your performance again, or increase the amount of cycles in
>>>> decodeplane8() (as Mans suggested earlier in this thread).
>>>>   
>>>>       
>>>>         
>>> Well, do you know what I think now?
>>>
>>> It's just that my GCC 4.2.4 doesn't optimize that as well as
>>> yours...updated patch attached!
>>>     
>>>       
>> or you are on x86_32 while mans/ronald are on x86_64 and gcc runs out
>> of registers (i didnt look at the asm so i could be totally wrong)
>>
>> [..]
>>   
>>     
>>> -    for(i = 0; i < b; i++) {
>>> +    for(i = 0; i < b32; i += 4) {
>>> +        const uint32_t v = lut[get_bits(&gb, 4)];
>>> +        AV_WN32A(dst+i, AV_RN32A(dst+i) | v);
>>> +    }
>>>     
>>>       
>> speaking of registers
>> for(dst; dst<end; dst+=4)
>>
>> would reduce the number of variables by 1
>> also
>>  const uint32_t v = AV_RN32A(dst) | lut[get_bits(&gb, 4)];
>>  AV_WN32A(dst, v);
>> might be worth a try
>>
>> and make sure dst is 4 byte aligned
>>   
>>     
>
> Great! It works now like a charm now for me, too.
> I also got the 3% speedup ;)
>
> Patch attached!
>
> But could you try out and compare the new patch, if it doesn't cause any
> side effects?
> Load several 8bpp IFF files and see if something is truncated, etc.
>
> I'm far too tired right now, and just falling to bed!
> Good night and nice dreams!
>   

Just figured out that the (buf_size * 8) + bps - 1 expression can be
simplified to (buf_size * 8), at least it didn't change Heart.ILBM.

New patch attached fixing this! But that's really the very last one for
today!

-- 

Best regards,
                   :-) Basty/CDGS (-:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: iff-opt-dp8.patch
Type: text/x-patch
Size: 1871 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100427/9fbb134a/attachment.bin>