[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization

Wed Jul 2 12:37:35 CEST 2008

Laurent Desnogues wrote:
> On Wed, Jul 2, 2008 at 11:45 AM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> [...]
>>>> /**********************/
>>>> int q();
>>>>
>>>> void f1(int n)
>>>> {
>>>>     while (--n >= 0) {
>>>>         q();
>>>>     }
>>>> }
>>>>
>>>> void f2(int n)
>>>> {
>>>>     while (n--) {
>>>>         q();
>>>>     }
>>>> }
>>>> /**********************/
>>>
>>> Any half-decent compiler should generate the same code for those two
>>> functions.
>>
>> That's not true, just because these two functions are not identical.
>> Hint: what happens if you pass -1 or any other negative value to these
>> functions?
>>
>>> GCC for ARM generates a slightly different, but equivalent, setup sequence,
>>> and the loops are exactly the same.
>>
>> In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
>> the following assembly output, which is definitely better for 'f1' (3
>> instructions in the inner loop instead of 4):
>>
>> 00000000 <f1>:
>>   0:   e92d4010        stmdb   sp!, {r4, lr}
>>   4:   e2504001        subs    r4, r0, #1      ; 0x1
>>   8:   48bd8010        ldmmiia sp!, {r4, pc}
>>   c:   ebfffffe        bl      0 <q>
>>  10:   e2544001        subs    r4, r4, #1      ; 0x1
>>  14:   5afffffc        bpl     c <f1+0xc>
>>  18:   e8bd8010        ldmia   sp!, {r4, pc}
>>
>> 0000001c <f2>:
>>  1c:   e92d4010        stmdb   sp!, {r4, lr}
>>  20:   e2504001        subs    r4, r0, #1      ; 0x1
>>  24:   38bd8010        ldmccia sp!, {r4, pc}
>>  28:   e2444001        sub     r4, r4, #1      ; 0x1
>>  2c:   ebfffffe        bl      0 <q>
>>  30:   e3740001        cmn     r4, #1  ; 0x1
>>  34:   1afffffb        bne     28 <q+0x28>
>>  38:   e8bd8010        ldmia   sp!, {r4, pc}
>>
>> I'm curious, what is the output of your compiler?
>
> CSL 2007q3 and 2008q1 both generate this:
>
> 00000000 <f2>:
>    0:   e92d4070        push    {r4, r5, r6, lr}
>    4:   e2505000        subs    r5, r0, #0      ; 0x0
>    8:   08bd8070        popeq   {r4, r5, r6, pc}
>    c:   e3a04000        mov     r4, #0  ; 0x0
>   10:   e2844001        add     r4, r4, #1      ; 0x1
>   14:   ebfffffe        bl      0 <q>
>   18:   e1540005        cmp     r4, r5
>   1c:   1afffffb        bne     10 <f2+0x10>
>   20:   e8bd8070        pop     {r4, r5, r6, pc}
>
> 00000024 <f1>:
>   24:   e3500001        cmp     r0, #1  ; 0x1
>   28:   e92d4070        push    {r4, r5, r6, lr}
>   2c:   e1a05000        mov     r5, r0
>   30:   48bd8070        popmi   {r4, r5, r6, pc}
>   34:   e3a04000        mov     r4, #0  ; 0x0
>   38:   e2844001        add     r4, r4, #1      ; 0x1
>   3c:   ebfffffe        bl      0 <q>
>   40:   e1540005        cmp     r4, r5
>   44:   1afffffb        bne     38 <q+0x38>
>   48:   e8bd8070        pop     {r4, r5, r6, pc}

That's exactly what I got too.  It's curious that it saves r6, even
though it is never used.  Perhaps it does this to keep the stack
8-byte aligned.  Also curious is why r4 and r5 are used, rather than
the caller-saved r1 and r2.  What a waste of 4 bytes of stack space.

-- 
Måns Rullgård
mans at mansr.com