[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Thu Apr 29 16:12:31 CEST 2010

Sebastian Vater <cdgs.basty at googlemail.com> writes:

> M?ns Rullg?rd a ?crit :
>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>
>>   
>>> M?ns Rullg?rd a ?crit :
>>>     
>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>
>>>>   
>>>>       
>>>>> M?ns Rullg?rd a ?crit :
>>>>>     
>>>>>         
>>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>>
>>>>>>   
>>>>>>       
>>>>>>           
>>>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>>>         
>>>>>>>             
>>>>>> That's a separate thing.
>>>>>>       
>>>>>>           
>>>>> Separate in what way? What did you mean exactly?
>>>>>     
>>>>>         
>>>> Separate from the LUT byte order.
>>>>
>>>>   
>>>>       
>>>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>>>
>>>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>>>> improvement ;-)
>>>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>>>     
>>>>>>>         
>>>>>>>             
>>>>>> Those macros don't do any calculations.  All they do is some magic to
>>>>>> avoid type aliasing errors.
>>>>>>       
>>>>>>           
>>>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>>>     
>>>>>         
>>>> Why on earth would you do that?
>>>>
>>>>   
>>>>       
>>>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>>>     
>>>>>         
>>>> That would fall apart completely on 32-bit machines.  I doubt any
>>>> speedup you might see on 64-bit is worth the added complexity of
>>>> doing it conditionally.  Just leave it as 32-bit.
>>>>
>>>>   
>>>>       
>>>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>>>         
>>>>>>>             
>>>>>> Should, but doesn't.
>>>>>>       
>>>>>>           
>>>>> With the way I meant above, it should...I'll test that now, but without
>>>>> a completed table and tell you what it does.
>>>>>     
>>>>>         
>>>> Believe me, it doesn't.  GCC is terrible with 64-bit data on 32-bit
>>>> machines.  Do not tempt it.
>>>>   
>>>>       
>>> I just tried, but unfortunately, I messed my last mail with the
>>> benchmarks completely up, so here again in a clean manner:
>>>
>>> 32-bit 4 bits per loop:
>>>     
>>
>> Still using get_bits?
>
> Yes, shall I compare with byte operand modes, also?

Never measure more than one change at a time.

>>> 64-bit with got rid of GetBitContext, containing the following code:
>>> static void decodeplane8(uint8_t *dst,
>>>                          const uint8_t *buf,
>>>                          const unsigned buf_size,
>>>                          const unsigned bps,
>>>                          const unsigned plane)
>>> {
>>>     START_TIMER;
>>>     const uint8_t *end = dst + (buf_size * 8);
>>>     const uint64_t *lut = plane8_lut[plane];
>>>     for(; dst < end; dst += 8) {
>>>         const uint64_t v  = AV_RN64A(dst) | lut[*buf++];
>>>         AV_WN64A(dst, v);
>>>     }
>>>     STOP_TIMER("decodeplane8");
>>> }
>>>
>>> Which outputs the following diassembly (from START/STOP_TIMER):

[...]

>> That looks _VERY_ inefficient.
>>   
> No, there's some part of START_TIMER:
> The for (; dst < end; dst++) is just:
>
>      5b0:       0f b6 75 00             movzbl 0x0(%ebp),%esi
>      5b4:       83 c5 01                add    $0x1,%ebp
>      5b7:       8b 4c 24 58             mov    0x58(%esp),%ecx
>      5bb:       8b 5f 04                mov    0x4(%edi),%ebx
>      5be:       8b 07                   mov    (%edi),%eax
>      5c0:       8b 54 f1 04             mov    0x4(%ecx,%esi,8),%edx
>      5c4:       0b 04 f1                or     (%ecx,%esi,8),%eax
>      5c7:       09 da                   or     %ebx,%edx
>      5c9:       89 07                   mov    %eax,(%edi)
>      5cb:       89 57 04                mov    %edx,0x4(%edi)
>      5ce:       83 c7 08                add    $0x8,%edi
>      5d1:       39 7c 24 5c             cmp    %edi,0x5c(%esp)
>      5d5:       77 d9                   ja     5b0 <decode_frame_ilbm+0x300>
>
> That looks very well.

START_TIMER is just an rdtsc instruction, no more.  What's the rest of
the code?

I'm surprised that gcc doesn't totally fuck up this bit of code.  It
usually adds a ton of useless load/stores, multiplies by constant zero
etc.

Nevertheless, I suspect most of the improvement you saw was due to
dropping get_bits(), not the 64-bit table.

-- 
M?ns Rullg?rd
mans at mansr.com