[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Thu Apr 29 14:57:43 CEST 2010

M?ns Rullg?rd a ?crit :
> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>
>   
>> Michael Niedermayer a ?crit :
>>     
>>> On Wed, Apr 28, 2010 at 02:23:07PM +0200, Sebastian Vater wrote:
>>>   
>>>       
>>>> Sebastian Vater a ?crit :
>>>>     
>>>>         
>>>>> I have fixed the wrong IFF decoding issue in the IFF decoder.
>>>>>
>>>>> The reason is that the IFF docs say that each line in the BODY chunk has
>>>>> it's width rounded up to next 16-bit boundary, such that each new line
>>>>> begins on a word boundary (address divisible by 2).
>>>>>
>>>>> Please review and apply.
>>>>>
>>>>> I will do the heavy optimization stuff now based on this.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>> Heavy optimization for decodeplane8 done. Patch attached.
>>>>
>>>> Please note that I used a different IFF file for benchmarking this (one
>>>> which was displayed incorrectly before fix non-rounding on word boundary
>>>> patch applied).
>>>>
>>>> This image also has a larger width and thus the decodeplane8 function is
>>>> called more often per line (should yield more accurate results).
>>>>
>>>>  iff.c |   37 ++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 34 insertions(+), 3 deletions(-)
>>>> c47530e98c6fc28fa943f51b8b5d14b64cafa5c5  iff-decoder-fix-heavy-dp8.patch
>>>>     
>>>>         
>>> just wanted to say iam still fine with this patch once its tested on
>>> little & big & the other devels are ok with it too
>>>
>>> [...]
>>>   
>>>       
>> Bad news here...
>>
>> I just got a report from the guy who did sent the IFF files which were
>> decoded wrong and he said that it doesn't work on big-endian.
>>
>> I assume that the tables have to be swapped around. Hence, what's the
>> ffmpeg prefered way of checking if it's big or little endian (I just
>> will then do sth. like):
>> #ifdef BIG_ENDIAN
>>   // declare table for BE here
>> #else
>>   // declare table for LE here
>>     
>
> I think we can do better.  Please stand by.
>   

Yes, we can. :)
(No, I'm not Obama *gg*).

Just got the idea, we can get rid of the GetBitContext
completely...Instead of reading 4 bits, we simply read a byte:
const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);

Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
and one with & 0x0F, or we get even rid of this and create a lut table
with 256 entries using AV_WN64A / AV_RN64A ;-)

The advance here is that on a 64 bit CPU we get another nice speed
improvement ;-)
If we avoid calculations with AV_RN64A etc. gcc just should use 2
registers on 32-bit CPU and that's it.

Since we got rid of GetBitContext that shall be no problem ;-)

Whatever way we take, I think it's better to create the tables based on
endianess instead of doing this in the loop, because this adds an extra
bit swap instruction in the inner loop if the target CPU has wrong
alignment, which I'ld prefer to avoid. ;-)

-- 

Best regards,
                   :-) Basty/CDGS (-