[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder
Måns Rullgård
mans
Thu Apr 29 15:47:05 CEST 2010
Sebastian Vater <cdgs.basty at googlemail.com> writes:
> M?ns Rullg?rd a ?crit :
>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>
>>
>>> M?ns Rullg?rd a ?crit :
>>>
>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>
>>>>
>>>>
>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>
>>>> That's a separate thing.
>>>>
>>> Separate in what way? What did you mean exactly?
>>>
>>
>> Separate from the LUT byte order.
>>
>>
>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>
>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>> improvement ;-)
>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>
>>>>>
>>>> Those macros don't do any calculations. All they do is some magic to
>>>> avoid type aliasing errors.
>>>>
>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>
>>
>> Why on earth would you do that?
>>
>>
>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>
>>
>> That would fall apart completely on 32-bit machines. I doubt any
>> speedup you might see on 64-bit is worth the added complexity of
>> doing it conditionally. Just leave it as 32-bit.
>>
>>
>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>
>>>> Should, but doesn't.
>>>>
>>> With the way I meant above, it should...I'll test that now, but without
>>> a completed table and tell you what it does.
>>>
>>
>> Believe me, it doesn't. GCC is terrible with 64-bit data on 32-bit
>> machines. Do not tempt it.
>>
>
> I just tried, but unfortunately, I messed my last mail with the
> benchmarks completely up, so here again in a clean manner:
>
> 32-bit 4 bits per loop:
Still using get_bits?
> 64-bit with got rid of GetBitContext, containing the following code:
> static void decodeplane8(uint8_t *dst,
> const uint8_t *buf,
> const unsigned buf_size,
> const unsigned bps,
> const unsigned plane)
> {
> START_TIMER;
> const uint8_t *end = dst + (buf_size * 8);
> const uint64_t *lut = plane8_lut[plane];
> for(; dst < end; dst += 8) {
> const uint64_t v = AV_RN64A(dst) | lut[*buf++];
> AV_WN64A(dst, v);
> }
> STOP_TIMER("decodeplane8");
> }
>
> Which outputs the following diassembly (from START/STOP_TIMER):
> 544: 0f 31 rdtsc
> 546: 89 54 24 60 mov %edx,0x60(%esp)
> 54a: 8b 5c 24 60 mov 0x60(%esp),%ebx
> 54e: 31 d2 xor %edx,%edx
> 550: c7 44 24 64 00 00 00 movl $0x0,0x64(%esp)
> 557: 00
> 558: 8b 74 24 64 mov 0x64(%esp),%esi
> 55c: 89 de mov %ebx,%esi
> 55e: bb 00 00 00 00 mov $0x0,%ebx
> 563: 89 5c 24 60 mov %ebx,0x60(%esp)
> 567: 01 44 24 60 add %eax,0x60(%esp)
> 56b: 8b 44 24 44 mov 0x44(%esp),%eax
> 56f: 89 74 24 64 mov %esi,0x64(%esp)
> 573: 11 54 24 64 adc %edx,0x64(%esp)
> 577: 2b 84 24 9c 00 00 00 sub 0x9c(%esp),%eax
> 57e: 39 c8 cmp %ecx,%eax
> 580: 76 02 jbe 584 <decode_frame_ilbm+0x2d4>
> 582: 89 c8 mov %ecx,%eax
> 584: 8b 74 24 50 mov 0x50(%esp),%esi
> 588: 8d 04 c6 lea (%esi,%eax,8),%eax
> 58b: 89 44 24 5c mov %eax,0x5c(%esp)
> 58f: 8b 44 24 4c mov 0x4c(%esp),%eax
> 593: c1 e0 07 shl $0x7,%eax
> 596: 05 00 00 00 00 add $0x0,%eax
> 59b: 89 44 24 58 mov %eax,0x58(%esp)
> 59f: 8b 44 24 5c mov 0x5c(%esp),%eax
> 5a3: 39 c6 cmp %eax,%esi
> 5a5: 73 30 jae 5d7 <decode_frame_ilbm+0x327>
> 5a7: 8b ac 24 9c 00 00 00 mov 0x9c(%esp),%ebp
> 5ae: 89 f7 mov %esi,%edi
> 5b0: 0f b6 75 00 movzbl 0x0(%ebp),%esi
> 5b4: 83 c5 01 add $0x1,%ebp
> 5b7: 8b 4c 24 58 mov 0x58(%esp),%ecx
> 5bb: 8b 5f 04 mov 0x4(%edi),%ebx
> 5be: 8b 07 mov (%edi),%eax
> 5c0: 8b 54 f1 04 mov 0x4(%ecx,%esi,8),%edx
> 5c4: 0b 04 f1 or (%ecx,%esi,8),%eax
> 5c7: 09 da or %ebx,%edx
> 5c9: 89 07 mov %eax,(%edi)
> 5cb: 89 57 04 mov %edx,0x4(%edi)
> 5ce: 83 c7 08 add $0x8,%edi
> 5d1: 39 7c 24 5c cmp %edi,0x5c(%esp)
> 5d5: 77 d9 ja 5b0 <decode_frame_ilbm+0x300>
> 5d7: 0f 31 rdtsc
That looks _VERY_ inefficient.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list