[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Thu Apr 29 20:08:44 CEST 2010

Ronald S. Bultje a ?crit :
> Hi,
>
> On Thu, Apr 29, 2010 at 10:37 AM, Sebastian Vater
> <cdgs.basty at googlemail.com> wrote:
>   
>>    START_TIMER;
>>    const uint8_t *end = dst + (buf_size * 8);
>>    const uint32_t *lut = plane8_lut[plane];
>>    for(; dst < end; dst += 8) {
>>        const uint8_t lut_offsets = *buf++;
>>        uint32_t v  = AV_RN32A(dst) | lut[lut_offsets >> 4];
>>        uint32_t v2  = AV_RN32A(dst + 4) | lut[lut_offsets & 0x0F];
>>        AV_WN32A(dst, v);
>>        AV_WN32A(dst + 4, v2);
>>    }
>>    STOP_TIMER("decodeplane8");
>>     
> [..]
>   
>>      52:       0f b6 06                movzbl (%esi),%eax
>>      55:       83 c6 01                add    $0x1,%esi
>>     
>
> const uint8_t lut_offsets = *buf++;
>
>   
>>      58:       8b 53 04                mov    0x4(%ebx),%edx
>>     
>
> read dst+4
>
>   
>>      5b:       89 c1                   mov    %eax,%ecx
>>     
>
> move lut_offset from eac->ecx
>
>   
>>      5d:       83 e1 0f                and    $0xf,%ecx
>>     
>
> ecx has lower 4 bits
>
>   
>>      60:       0b 14 8f                or     (%edi,%ecx,4),%edx
>>     
>
> edx |= table
>
>   
>>      63:       c0 e8 04                shr    $0x4,%al
>>     
>
> eax has higher 4 bits
>
>   
>>      66:       0f b6 c0                movzbl %al,%eax
>>     
>
> Err...?
>   

Zero extends al to eax in order to use eax as offset for next instruction.

>   
>>      69:       8b 04 87                mov    (%edi,%eax,4),%eax
>>     
>
> table value
>
>   
>>      6c:       09 03                   or     %eax,(%ebx)
>>     
>
> read/write dest (note how this is a single instruction).
>
>   
>>      6e:       89 53 04                mov    %edx,0x4(%ebx)
>>     
>
> write dst+4 (note how this isn't a single instruction)
>
>   
>>      71:       83 c3 08                add    $0x8,%ebx
>>      74:       39 dd                   cmp    %ebx,%ebp
>>      76:       77 da                   ja     52 <decodeplane8+0x52>
>>     
>
> loop end
>
> You probably see from the disassembly you could remove another few
> instructions by doing the calculations serially instead of parallel.
> For example, it reads/writes dst in a single instruction, but uses
> several instructions for dst+4. Serializing this might improve the
> code another few instructions.
>
> for() {
> uint32_t v;
> v = AV_RN32A(..) | lut[x>>4];
> AV_WN32A(dst, v);
> v = AV_RN32A(..) | lut[x&0xf];
> AV_WN32A(dst, v+4);
> }
>
> Ideally this becomes (12 instructions instead of original 14):
>
> movzbl (%esi),%eax
> add    $0x1,%esi
> mov    %eax,%ecx
> shr    $0x4,%eax
> and    $0xf,%ecx
> mov    (%edi,%eax,4),%eax
> mov    (%edi,%ecx,4),%ecx
> or     %eax,(%ebx)
> or     %ecx,(%ebx, 4) <- not sure if that's possible
> add    $0x8,%ebx
> cmp    %ebx,%ebp
> ja     52 <decodeplane8+0x52>
>   

The question remains if it's better to use a 8-bit or 4-bit table with
>> and &.

For me it seems that using 8-bit table is slower at initial steps but
becomes faster in the next turns (very probably due to more cache misses).

I think the speed advantage is proportional to image size, i.e. bigger
images benefit more from a 8-bit table while smaller ones more from a
4-bit table.

Since speed gain is more important for larger images, the 8-bit table is
probably the way to go. What do you think?

-- 

Best regards,
                   :-) Basty/CDGS (-: