[FFmpeg-devel] [PATCH] h264.c/decode_cabac_residual optimization
Måns Rullgård
Wed Jul 2 12:37:35 CEST 2008
Laurent Desnogues wrote:
> On Wed, Jul 2, 2008 at 11:45 AM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
> [...]
>>>> /**********************/
>>>> int q();
>>>>
>>>> void f1(int n)
>>>> {
>>>> while (--n >= 0) {
>>>> q();
>>>> }
>>>> }
>>>>
>>>> void f2(int n)
>>>> {
>>>> while (n--) {
>>>> q();
>>>> }
>>>> }
>>>> /**********************/
>>>
>>> Any half-decent compiler should generate the same code for those two
>>> functions.
>>
>> That's not true, just because these two functions are not identical.
>> Hint: what happens if you pass -1 or any other negative value to these
>> functions?
>>
>>> GCC for ARM generates a slightly different, but equivalent, setup sequence,
>>> and the loops are exactly the same.
>>
>> In my case, gcc 3.4.4 (using '-march=armv6 -O3 -c' options) generated
>> the following assembly output, which is definitely better for 'f1' (3
>> instructions in the inner loop instead of 4):
>>
>> 00000000 <f1>:
>> 0: e92d4010 stmdb sp!, {r4, lr}
>> 4: e2504001 subs r4, r0, #1 ; 0x1
>> 8: 48bd8010 ldmmiia sp!, {r4, pc}
>> c: ebfffffe bl 0 <q>
>> 10: e2544001 subs r4, r4, #1 ; 0x1
>> 14: 5afffffc bpl c <f1+0xc>
>> 18: e8bd8010 ldmia sp!, {r4, pc}
>>
>> 0000001c <f2>:
>> 1c: e92d4010 stmdb sp!, {r4, lr}
>> 20: e2504001 subs r4, r0, #1 ; 0x1
>> 24: 38bd8010 ldmccia sp!, {r4, pc}
>> 28: e2444001 sub r4, r4, #1 ; 0x1
>> 2c: ebfffffe bl 0 <q>
>> 30: e3740001 cmn r4, #1 ; 0x1
>> 34: 1afffffb bne 28 <q+0x28>
>> 38: e8bd8010 ldmia sp!, {r4, pc}
>>
>> I'm curious, what is the output of your compiler?
>
> CSL 2007q3 and 2008q1 both generate this:
>
> 00000000 <f2>:
> 0: e92d4070 push {r4, r5, r6, lr}
> 4: e2505000 subs r5, r0, #0 ; 0x0
> 8: 08bd8070 popeq {r4, r5, r6, pc}
> c: e3a04000 mov r4, #0 ; 0x0
> 10: e2844001 add r4, r4, #1 ; 0x1
> 14: ebfffffe bl 0 <q>
> 18: e1540005 cmp r4, r5
> 1c: 1afffffb bne 10 <f2+0x10>
> 20: e8bd8070 pop {r4, r5, r6, pc}
>
> 00000024 <f1>:
> 24: e3500001 cmp r0, #1 ; 0x1
> 28: e92d4070 push {r4, r5, r6, lr}
> 2c: e1a05000 mov r5, r0
> 30: 48bd8070 popmi {r4, r5, r6, pc}
> 34: e3a04000 mov r4, #0 ; 0x0
> 38: e2844001 add r4, r4, #1 ; 0x1
> 3c: ebfffffe bl 0 <q>
> 40: e1540005 cmp r4, r5
> 44: 1afffffb bne 38 <q+0x38>
> 48: e8bd8070 pop {r4, r5, r6, pc}
That's exactly what I got too. It's curious that it saves r6, even
though it is never used. Perhaps it does this to keep the stack
8-byte aligned. Also curious is why r4 and r5 are used, rather than
the caller-saved r1 and r2. What a waste of 4 bytes of stack space.
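
The alignment guess can be checked with a little arithmetic. This is
only a sketch, assuming 4-byte registers and the 8-byte stack alignment
that the ARM EABI expects at function calls. The helper names are made
up for illustration and are not from any compiler or from FFmpeg.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes taken on the stack by pushing nregs 32-bit ARM registers. */
size_t push_bytes(size_t nregs)
{
    return nregs * 4;
}

/* Nonzero if pushing `bytes` keeps an 8-byte-aligned sp aligned. */
int keeps_sp_8byte_aligned(size_t bytes)
{
    return bytes % 8 == 0;
}

/* Compare the register list the compiler emitted with the shorter one. */
void check_alignment_guess(void)
{
    /* push {r4, r5, lr}: 12 bytes, sp would end up misaligned */
    assert(!keeps_sp_8byte_aligned(push_bytes(3)));
    /* push {r4, r5, r6, lr}: 16 bytes, sp stays aligned */
    assert(keeps_sp_8byte_aligned(push_bytes(4)));
}
```

So pushing three registers would leave sp only 4-byte aligned at the
call to q(), which is consistent with r6 being saved purely as padding.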
--
Måns Rullgård
mans at mansr.com