[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder
Måns Rullgård
mans
Thu Apr 29 16:12:31 CEST 2010
Sebastian Vater <cdgs.basty at googlemail.com> writes:
> M?ns Rullg?rd a ?crit :
>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>
>>
>>> M?ns Rullg?rd a ?crit :
>>>
>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>
>>>>
>>>>
>>>>> M?ns Rullg?rd a ?crit :
>>>>>
>>>>>
>>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>>>
>>>>>>>
>>>>>> That's a separate thing.
>>>>>>
>>>>>>
>>>>> Separate in what way? What did you mean exactly?
>>>>>
>>>>>
>>>> Separate from the LUT byte order.
>>>>
>>>>
>>>>
>>>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>>>
>>>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>>>> improvement ;-)
>>>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Those macros don't do any calculations. All they do is some magic to
>>>>>> avoid type aliasing errors.
>>>>>>
>>>>>>
>>>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>>>
>>>>>
>>>> Why on earth would you do that?
>>>>
>>>>
>>>>
>>>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>>>
>>>>>
>>>> That would fall apart completely on 32-bit machines. I doubt any
>>>> speedup you might see on 64-bit is worth the added complexity of
>>>> doing it conditionally. Just leave it as 32-bit.
>>>>
>>>>
>>>>
>>>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>>>
>>>>>>>
>>>>>> Should, but doesn't.
>>>>>>
>>>>>>
>>>>> With the way I meant above, it should...I'll test that now, but without
>>>>> a completed table and tell you what it does.
>>>>>
>>>>>
>>>> Believe me, it doesn't. GCC is terrible with 64-bit data on 32-bit
>>>> machines. Do not tempt it.
>>>>
>>>>
>>> I just tried, but unfortunately, I messed my last mail with the
>>> benchmarks completely up, so here again in a clean manner:
>>>
>>> 32-bit 4 bits per loop:
>>>
>>
>> Still using get_bits?
>
> Yes, shall I compare with byte operand modes, also?
Never measure more than one change at a time.
>>> 64-bit with got rid of GetBitContext, containing the following code:
>>> static void decodeplane8(uint8_t *dst,
>>> const uint8_t *buf,
>>> const unsigned buf_size,
>>> const unsigned bps,
>>> const unsigned plane)
>>> {
>>> START_TIMER;
>>> const uint8_t *end = dst + (buf_size * 8);
>>> const uint64_t *lut = plane8_lut[plane];
>>> for(; dst < end; dst += 8) {
>>> const uint64_t v = AV_RN64A(dst) | lut[*buf++];
>>> AV_WN64A(dst, v);
>>> }
>>> STOP_TIMER("decodeplane8");
>>> }
>>>
>>> Which outputs the following diassembly (from START/STOP_TIMER):
[...]
>> That looks _VERY_ inefficient.
>>
> No, there's some part of START_TIMER:
> The for (; dst < end; dst++) is just:
>
> 5b0: 0f b6 75 00 movzbl 0x0(%ebp),%esi
> 5b4: 83 c5 01 add $0x1,%ebp
> 5b7: 8b 4c 24 58 mov 0x58(%esp),%ecx
> 5bb: 8b 5f 04 mov 0x4(%edi),%ebx
> 5be: 8b 07 mov (%edi),%eax
> 5c0: 8b 54 f1 04 mov 0x4(%ecx,%esi,8),%edx
> 5c4: 0b 04 f1 or (%ecx,%esi,8),%eax
> 5c7: 09 da or %ebx,%edx
> 5c9: 89 07 mov %eax,(%edi)
> 5cb: 89 57 04 mov %edx,0x4(%edi)
> 5ce: 83 c7 08 add $0x8,%edi
> 5d1: 39 7c 24 5c cmp %edi,0x5c(%esp)
> 5d5: 77 d9 ja 5b0 <decode_frame_ilbm+0x300>
>
> That looks very well.
START_TIMER is just an rdtsc instruction, no more. What's the rest of
the code?
I'm surprised that gcc doesn't totally fuck up this bit of code. It
usually adds a ton of useless load/stores, multiplies by constant zero
etc.
Nevertheless, I suspect most of the improvement you saw was due to
dropping get_bits(), not the 64-bit table.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list