[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder

Thu Apr 29 16:37:58 CEST 2010

M?ns Rullg?rd a ?crit :
> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>
>   
>> M?ns Rullg?rd a ?crit :
>>     
>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>
>>>   
>>>       
>>>> M?ns Rullg?rd a ?crit :
>>>>     
>>>>         
>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> M?ns Rullg?rd a ?crit :
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>>> Sebastian Vater <cdgs.basty at googlemail.com> writes:
>>>>>>>
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>>>> Just got the idea, we can get rid of the GetBitContext
>>>>>>>> completely...Instead of reading 4 bits, we simply read a byte:
>>>>>>>> const uint8_t lut_offsets = *buf++; // instead of get_bits(gb,4);
>>>>>>>>         
>>>>>>>>             
>>>>>>>>                 
>>>>>>> That's a separate thing.
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> Separate in what way? What did you mean exactly?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Separate from the LUT byte order.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>>>> Then we do loop unrolling by 8 and do two accesses to lut one with >> 4
>>>>>>>> and one with & 0x0F, or we get even rid of this and create a lut table
>>>>>>>> with 256 entries using AV_WN64A / AV_RN64A ;-)
>>>>>>>>
>>>>>>>> The advance here is that on a 64 bit CPU we get another nice speed
>>>>>>>> improvement ;-)
>>>>>>>> If we avoid calculations with AV_RN64A etc.
>>>>>>>>     
>>>>>>>>         
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Those macros don't do any calculations.  All they do is some magic to
>>>>>>> avoid type aliasing errors.
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> Yes, I know, but I meant stuff like (lut0[...] << 32ULL) | lut1[...];
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Why on earth would you do that?
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> But this isn't necessary if we use an 8-bit table storing uint64_t's...
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> That would fall apart completely on 32-bit machines.  I doubt any
>>>>> speedup you might see on 64-bit is worth the added complexity of
>>>>> doing it conditionally.  Just leave it as 32-bit.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>>>> gcc just should use 2 registers on 32-bit CPU and that's it.
>>>>>>>>         
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Should, but doesn't.
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> With the way I meant above, it should...I'll test that now, but without
>>>>>> a completed table and tell you what it does.
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Believe me, it doesn't.  GCC is terrible with 64-bit data on 32-bit
>>>>> machines.  Do not tempt it.
>>>>>   
>>>>>       
>>>>>           
>>>> I just tried, but unfortunately, I messed my last mail with the
>>>> benchmarks completely up, so here again in a clean manner:
>>>>
>>>> 32-bit 4 bits per loop:
>>>>     
>>>>         
>>> Still using get_bits?
>>>       
>> Yes, shall I compare with byte operand modes, also?
>>     
>
> Never measure more than one change at a time.
>
>   
>>>> 64-bit with got rid of GetBitContext, containing the following code:
>>>> static void decodeplane8(uint8_t *dst,
>>>>                          const uint8_t *buf,
>>>>                          const unsigned buf_size,
>>>>                          const unsigned bps,
>>>>                          const unsigned plane)
>>>> {
>>>>     START_TIMER;
>>>>     const uint8_t *end = dst + (buf_size * 8);
>>>>     const uint64_t *lut = plane8_lut[plane];
>>>>     for(; dst < end; dst += 8) {
>>>>         const uint64_t v  = AV_RN64A(dst) | lut[*buf++];
>>>>         AV_WN64A(dst, v);
>>>>     }
>>>>     STOP_TIMER("decodeplane8");
>>>> }
>>>>
>>>> Which outputs the following diassembly (from START/STOP_TIMER):
>>>>         
>
> [...]
>
>   
>>> That looks _VERY_ inefficient.
>>>   
>>>       
>> No, there's some part of START_TIMER:
>> The for (; dst < end; dst++) is just:
>>
>>      5b0:       0f b6 75 00             movzbl 0x0(%ebp),%esi
>>      5b4:       83 c5 01                add    $0x1,%ebp
>>      5b7:       8b 4c 24 58             mov    0x58(%esp),%ecx
>>      5bb:       8b 5f 04                mov    0x4(%edi),%ebx
>>      5be:       8b 07                   mov    (%edi),%eax
>>      5c0:       8b 54 f1 04             mov    0x4(%ecx,%esi,8),%edx
>>      5c4:       0b 04 f1                or     (%ecx,%esi,8),%eax
>>      5c7:       09 da                   or     %ebx,%edx
>>      5c9:       89 07                   mov    %eax,(%edi)
>>      5cb:       89 57 04                mov    %edx,0x4(%edi)
>>      5ce:       83 c7 08                add    $0x8,%edi
>>      5d1:       39 7c 24 5c             cmp    %edi,0x5c(%esp)
>>      5d5:       77 d9                   ja     5b0 <decode_frame_ilbm+0x300>
>>
>> That looks very well.
>>     
>
> START_TIMER is just an rdtsc instruction, no more.  What's the rest of
> the code?
>   

I looked at START/STOP_TIMER macros, you're right regarding START_TIMER,
but STOP_TIMER adds much more, I guess that gcc moved some code around
it...remember that gcc also inlined decodeplane8...so it probably sees
more freedom to shuffle stuff around.

> I'm surprised that gcc doesn't totally fuck up this bit of code.  It
> usually adds a ton of useless load/stores, multiplies by constant zero
> etc.
>   

Well on this really primitive usage of 64-bit ints I'ld REALLY be
surprised if it would have fucked up it.

Maybe gcc just starts to fuck up with 64-bit stuff when it runs out of
registers, which it doesn't here.
> Nevertheless, I suspect most of the improvement you saw was due to
> dropping get_bits(), not the 64-bit table.
>   

Well, just did another benchmark with just removing GetBitContext, here is:
basty at euler:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
  built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  configuration:
  libavutil     50.14. 0 / 50.14. 0
  libavcodec    52.66. 0 / 52.66. 0
  libavformat   52.61. 0 / 52.61. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libswscale     0.10. 0 /  0.10. 0
[IFF @ 0x8b2feb0]Estimating duration from bitrate, this may be inaccurate
Input #0, IFF, from '../patches/MRLake.iff':
  Duration: N/A, bitrate: N/A
    Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
737:700, 90k tbr, 90k tbn, 90k tbc
20200 dezicycles in decodeplane8, 1 runs, 0 skips
14940 dezicycles in decodeplane8, 2 runs, 0 skips
12680 dezicycles in decodeplane8, 4 runs, 0 skips
11840 dezicycles in decodeplane8, 8 runs, 0 skips
10565 dezicycles in decodeplane8, 16 runs, 0 skips
9917 dezicycles in decodeplane8, 32 runs, 0 skips
9530 dezicycles in decodeplane8, 64 runs, 0 skips
9373 dezicycles in decodeplane8, 128 runs, 0 skips
9413 dezicycles in decodeplane8, 255 runs, 1 skips
9313 dezicycles in decodeplane8, 511 runs, 1 skips
9287 dezicycles in decodeplane8, 1023 runs, 1 skips
9267 dezicycles in decodeplane8, 2047 runs, 1 skips
9267 dezicycles in decodeplane8, 4095 runs, 1 skips
   1.30 A-V:  0.000 s:0.0 aq=    0KB vq=    0KB sq=    0B f=0/0   0/0

Code for this was:
static void decodeplane8(uint8_t *dst,
                         const uint8_t *buf,
                         const unsigned buf_size,
                         const unsigned bps,
                         const unsigned plane)
{
    START_TIMER;
    const uint8_t *end = dst + (buf_size * 8);
    const uint32_t *lut = plane8_lut[plane];
    for(; dst < end; dst += 8) {
        const uint8_t lut_offsets = *buf++;
        uint32_t v  = AV_RN32A(dst) | lut[lut_offsets >> 4];
        uint32_t v2  = AV_RN32A(dst + 4) | lut[lut_offsets & 0x0F];
        AV_WN32A(dst, v);
        AV_WN32A(dst + 4, v2);
    }
    STOP_TIMER("decodeplane8");
}

That means the initial execution is way faster (but I guess it's the
fact that the 8-bit table requires longer to be filled into L1 cache).

Disassembly output of this:
       b:       0f 31                   rdtsc
       d:       89 54 24 30             mov    %edx,0x30(%esp)
      11:       89 c7                   mov    %eax,%edi
      13:       8b 44 24 30             mov    0x30(%esp),%eax
      17:       c7 44 24 34 00 00 00    movl   $0x0,0x34(%esp)
      1e:       00
      1f:       8b 54 24 34             mov    0x34(%esp),%edx
      23:       31 ed                   xor    %ebp,%ebp
      25:       89 c2                   mov    %eax,%edx
      27:       b8 00 00 00 00          mov    $0x0,%eax
      2c:       89 44 24 30             mov    %eax,0x30(%esp)
      30:       01 7c 24 30             add    %edi,0x30(%esp)
      34:       89 54 24 34             mov    %edx,0x34(%esp)
      38:       11 6c 24 34             adc    %ebp,0x34(%esp)
      3c:       c1 64 24 54 06          shll   $0x6,0x54(%esp)
      41:       8b 7c 24 54             mov    0x54(%esp),%edi
      45:       8d 2c cb                lea    (%ebx,%ecx,8),%ebp
      48:       81 c7 00 00 00 00       add    $0x0,%edi
      4e:       39 eb                   cmp    %ebp,%ebx
      50:       73 26                   jae    78 <decodeplane8+0x78>
      52:       0f b6 06                movzbl (%esi),%eax
      55:       83 c6 01                add    $0x1,%esi
      58:       8b 53 04                mov    0x4(%ebx),%edx
      5b:       89 c1                   mov    %eax,%ecx
      5d:       83 e1 0f                and    $0xf,%ecx
      60:       0b 14 8f                or     (%edi,%ecx,4),%edx
      63:       c0 e8 04                shr    $0x4,%al
      66:       0f b6 c0                movzbl %al,%eax
      69:       8b 04 87                mov    (%edi,%eax,4),%eax
      6c:       09 03                   or     %eax,(%ebx)
      6e:       89 53 04                mov    %edx,0x4(%ebx)
      71:       83 c3 08                add    $0x8,%ebx
      74:       39 dd                   cmp    %ebx,%ebp
      76:       77 da                   ja     52 <decodeplane8+0x52>
      78:       0f 31                   rdtsc

The loop itself:
      52:       0f b6 06                movzbl (%esi),%eax
      55:       83 c6 01                add    $0x1,%esi
      58:       8b 53 04                mov    0x4(%ebx),%edx
      5b:       89 c1                   mov    %eax,%ecx
      5d:       83 e1 0f                and    $0xf,%ecx
      60:       0b 14 8f                or     (%edi,%ecx,4),%edx
      63:       c0 e8 04                shr    $0x4,%al
      66:       0f b6 c0                movzbl %al,%eax
      69:       8b 04 87                mov    (%edi,%eax,4),%eax
      6c:       09 03                   or     %eax,(%ebx)
      6e:       89 53 04                mov    %edx,0x4(%ebx)
      71:       83 c3 08                add    $0x8,%ebx
      74:       39 dd                   cmp    %ebx,%ebp
      76:       77 da                   ja     52 <decodeplane8+0x52>

I would say the inner loop is almost the same except the and and shift.

So now for the 8-bit table using 32-bits using following code:
static void decodeplane8(uint8_t *dst,
                         const uint8_t *buf,
                         const unsigned buf_size,
                         const unsigned bps,
                         const unsigned plane)
{
    START_TIMER;
    const uint8_t *end = dst + (buf_size * 8);
    const uint32_t *lut0 = plane8_lut[plane];
    const uint32_t *lut1 = plane8_lut[plane+64];
    for(; dst < end; dst += 8) {
        const uint8_t lut_offsets = *buf++;
        uint32_t v  = AV_RN32A(dst) | lut0[lut_offsets];
        uint32_t v2  = AV_RN32A(dst + 4) | lut1[lut_offsets];
        AV_WN32A(dst, v);
        AV_WN32A(dst + 4, v2);
    }
    STOP_TIMER("decodeplane8");
}

basty at euler:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
  built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
  configuration:
  libavutil     50.14. 0 / 50.14. 0
  libavcodec    52.66. 0 / 52.66. 0
  libavformat   52.61. 0 / 52.61. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libswscale     0.10. 0 /  0.10. 0
[IFF @ 0x8b2feb0]Estimating duration from bitrate, this may be inaccurate
Input #0, IFF, from '../patches/MRLake.iff':
  Duration: N/A, bitrate: N/A
    Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
737:700, 90k tbr, 90k tbn, 90k tbc
49680 dezicycles in decodeplane8, 1 runs, 0 skips
34380 dezicycles in decodeplane8, 2 runs, 0 skips
21900 dezicycles in decodeplane8, 4 runs, 0 skips
15655 dezicycles in decodeplane8, 8 runs, 0 skips
11572 dezicycles in decodeplane8, 16 runs, 0 skips
9616 dezicycles in decodeplane8, 32 runs, 0 skips
8590 dezicycles in decodeplane8, 64 runs, 0 skips
8087 dezicycles in decodeplane8, 128 runs, 0 skips
8163 dezicycles in decodeplane8, 256 runs, 0 skips
7845 dezicycles in decodeplane8, 512 runs, 0 skips
7797 dezicycles in decodeplane8, 1024 runs, 0 skips
7741 dezicycles in decodeplane8, 2048 runs, 0 skips
7722 dezicycles in decodeplane8, 4095 runs, 1 skips
   1.30 A-V:  0.000 s:0.0 aq=    0KB vq=    0KB sq=    0B f=0/0   0/0

Disassembly:
       b:       0f 31                   rdtsc
       d:       89 54 24 30             mov    %edx,0x30(%esp)
      11:       89 c7                   mov    %eax,%edi
      13:       8b 44 24 30             mov    0x30(%esp),%eax
      17:       c7 44 24 34 00 00 00    movl   $0x0,0x34(%esp)
      1e:       00
      1f:       8b 54 24 34             mov    0x34(%esp),%edx
      23:       31 ed                   xor    %ebp,%ebp
      25:       89 c2                   mov    %eax,%edx
      27:       b8 00 00 00 00          mov    $0x0,%eax
      2c:       89 44 24 30             mov    %eax,0x30(%esp)
      30:       01 7c 24 30             add    %edi,0x30(%esp)
      34:       89 54 24 34             mov    %edx,0x34(%esp)
      38:       11 6c 24 34             adc    %ebp,0x34(%esp)
      3c:       c1 64 24 54 06          shll   $0x6,0x54(%esp)
      41:       8d 2c cb                lea    (%ebx,%ecx,8),%ebp
      44:       8b 7c 24 54             mov    0x54(%esp),%edi
      48:       8b 4c 24 54             mov    0x54(%esp),%ecx
      4c:       81 c7 00 00 00 00       add    $0x0,%edi
      52:       81 c1 00 10 00 00       add    $0x1000,%ecx
      58:       39 eb                   cmp    %ebp,%ebx
      5a:       73 22                   jae    7e <decodeplane8+0x7e>
      5c:       8d 74 26 00             lea    0x0(%esi),%esi
      60:       0f b6 06                movzbl (%esi),%eax
      63:       83 c6 01                add    $0x1,%esi
      66:       8b 53 04                mov    0x4(%ebx),%edx
      69:       0f b6 c0                movzbl %al,%eax
      6c:       0b 14 81                or     (%ecx,%eax,4),%edx
      6f:       8b 04 87                mov    (%edi,%eax,4),%eax
      72:       09 03                   or     %eax,(%ebx)
      74:       89 53 04                mov    %edx,0x4(%ebx)
      77:       83 c3 08                add    $0x8,%ebx
      7a:       39 dd                   cmp    %ebx,%ebp
      7c:       77 e2                   ja     60 <decodeplane8+0x60>
      7e:       0f 31                   rdtsc

for loop:
      60:       0f b6 06                movzbl (%esi),%eax
      63:       83 c6 01                add    $0x1,%esi
      66:       8b 53 04                mov    0x4(%ebx),%edx
      69:       0f b6 c0                movzbl %al,%eax
      6c:       0b 14 81                or     (%ecx,%eax,4),%edx
      6f:       8b 04 87                mov    (%edi,%eax,4),%eax
      72:       09 03                   or     %eax,(%ebx)
      74:       89 53 04                mov    %edx,0x4(%ebx)
      77:       83 c3 08                add    $0x8,%ebx
      7a:       39 dd                   cmp    %ebx,%ebp
      7c:       77 e2                   ja     60 <decodeplane8+0x60>

-- 

Best regards,
                   :-) Basty/CDGS (-: