[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Thu Apr 29 15:47:05 CEST 2010

Sebastian Vater <cdgs.basty at googlemail.com> writes:

> M?ns Rullg?rd a ?crit :
>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>
>>   
>>> M?ns Rullg?rd a ?crit :
>>>     
>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>
>>>>   
>>>>       
>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>         
>>>> That's a separate thing.
>>>>       
>>> Separate in what way? What did you mean exactly?
>>>     
>>
>> Separate from the LUT byte order.
>>
>>   
>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>
>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>> improvement ;-)
>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>     
>>>>>         
>>>> Those macros don't do any calculations.  All they do is some magic to
>>>> avoid type aliasing errors.
>>>>       
>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>     
>>
>> Why on earth would you do that?
>>
>>   
>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>     
>>
>> That would fall apart completely on 32-bit machines.  I doubt any
>> speedup you might see on 64-bit is worth the added complexity of
>> doing it conditionally.  Just leave it as 32-bit.
>>
>>   
>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>         
>>>> Should, but doesn't.
>>>>       
>>> With the way I meant above, it should...I'll test that now, but without
>>> a completed table and tell you what it does.
>>>     
>>
>> Believe me, it doesn't.  GCC is terrible with 64-bit data on 32-bit
>> machines.  Do not tempt it.
>>   
>
> I just tried, but unfortunately, I messed my last mail with the
> benchmarks completely up, so here again in a clean manner:
>
> 32-bit 4 bits per loop:

Still using get_bits?

> 64-bit with got rid of GetBitContext, containing the following code:
> static void decodeplane8(uint8_t *dst,
>                          const uint8_t *buf,
>                          const unsigned buf_size,
>                          const unsigned bps,
>                          const unsigned plane)
> {
>     START_TIMER;
>     const uint8_t *end = dst + (buf_size * 8);
>     const uint64_t *lut = plane8_lut[plane];
>     for(; dst < end; dst += 8) {
>         const uint64_t v  = AV_RN64A(dst) | lut[*buf++];
>         AV_WN64A(dst, v);
>     }
>     STOP_TIMER("decodeplane8");
> }
>
> Which outputs the following diassembly (from START/STOP_TIMER):
>      544:       0f 31                   rdtsc
>      546:       89 54 24 60             mov    %edx,0x60(%esp)
>      54a:       8b 5c 24 60             mov    0x60(%esp),%ebx
>      54e:       31 d2                   xor    %edx,%edx
>      550:       c7 44 24 64 00 00 00    movl   $0x0,0x64(%esp)
>      557:       00
>      558:       8b 74 24 64             mov    0x64(%esp),%esi
>      55c:       89 de                   mov    %ebx,%esi
>      55e:       bb 00 00 00 00          mov    $0x0,%ebx
>      563:       89 5c 24 60             mov    %ebx,0x60(%esp)
>      567:       01 44 24 60             add    %eax,0x60(%esp)
>      56b:       8b 44 24 44             mov    0x44(%esp),%eax
>      56f:       89 74 24 64             mov    %esi,0x64(%esp)
>      573:       11 54 24 64             adc    %edx,0x64(%esp)
>      577:       2b 84 24 9c 00 00 00    sub    0x9c(%esp),%eax
>      57e:       39 c8                   cmp    %ecx,%eax
>      580:       76 02                   jbe    584 <decode_frame_ilbm+0x2d4>
>      582:       89 c8                   mov    %ecx,%eax
>      584:       8b 74 24 50             mov    0x50(%esp),%esi
>      588:       8d 04 c6                lea    (%esi,%eax,8),%eax
>      58b:       89 44 24 5c             mov    %eax,0x5c(%esp)
>      58f:       8b 44 24 4c             mov    0x4c(%esp),%eax
>      593:       c1 e0 07                shl    $0x7,%eax
>      596:       05 00 00 00 00          add    $0x0,%eax
>      59b:       89 44 24 58             mov    %eax,0x58(%esp)
>      59f:       8b 44 24 5c             mov    0x5c(%esp),%eax
>      5a3:       39 c6                   cmp    %eax,%esi
>      5a5:       73 30                   jae    5d7 <decode_frame_ilbm+0x327>
>      5a7:       8b ac 24 9c 00 00 00    mov    0x9c(%esp),%ebp
>      5ae:       89 f7                   mov    %esi,%edi
>      5b0:       0f b6 75 00             movzbl 0x0(%ebp),%esi
>      5b4:       83 c5 01                add    $0x1,%ebp
>      5b7:       8b 4c 24 58             mov    0x58(%esp),%ecx
>      5bb:       8b 5f 04                mov    0x4(%edi),%ebx
>      5be:       8b 07                   mov    (%edi),%eax
>      5c0:       8b 54 f1 04             mov    0x4(%ecx,%esi,8),%edx
>      5c4:       0b 04 f1                or     (%ecx,%esi,8),%eax
>      5c7:       09 da                   or     %ebx,%edx
>      5c9:       89 07                   mov    %eax,(%edi)
>      5cb:       89 57 04                mov    %edx,0x4(%edi)
>      5ce:       83 c7 08                add    $0x8,%edi
>      5d1:       39 7c 24 5c             cmp    %edi,0x5c(%esp)
>      5d5:       77 d9                   ja     5b0 <decode_frame_ilbm+0x300>
>      5d7:       0f 31                   rdtsc

That looks _VERY_ inefficient.

-- 
M?ns Rullg?rd
mans at mansr.com