[FFmpeg-devel] [PATCH] AAC: unroll parts of decode_spectrum_and_dequant()
Måns Rullgård <mans at mansr.com>
Tue Dec 9 20:36:29 CET 2008
"Robert Swain" <robert.swain at gmail.com> writes:
> 2008/12/9 Måns Rullgård <mans at mansr.com>:
>> Måns Rullgård wrote:
>>> Michael Niedermayer wrote:
>>>> something like:
>>>> if (vq_ptr[2]) ((uint32_t*)coef)[coef_tmp_idx + 2] = (get_bits1(gb)<<31) + 0x3F800000;
>>>>
>>>> might be even faster
>>>> but i agree with robert that this should be a separate patch
>>>
>>> Strict aliasing violation. Depending on CPU it might also be slower.
>>> Most FPUs can generate +-1 constants efficiently.
>>
>> I meant that something like this could be faster:
>>
>> if (vq_ptr[2]) coef[coef_tmp_idx + 2] = get_bits1(gb)? -1.0 : 1.0;
>>
>> IMO it deserves testing.
>
> If it is, would you want the commit reverting and this committing
> instead or would committing this over the top be acceptable?
In cases like this, fewer commits are better IMO. Anyone who wants
to can easily get a diff from rN to rN+2.
> Alex: Feel free to test Måns' suggestion. :) I'm looking at your
> other patches.
I tried a few compilers on a simplified test case:
/* Conditional: any nonzero x selects -1.0. */
void f1(float *f, int x)
{
    *f = x? -1.0: 1.0;
}

/* Table lookup: x must be 0 or 1. */
void f2(float *f, int x)
{
    static const float t[2] = { 1.0, -1.0 };
    *f = t[x];
}
The results are varied.
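Before comparing the generated code, it may be worth confirming that the two variants really store the same values. The following is a minimal sketch, not part of the original test: variants_agree() is a hypothetical helper, and the assert in f2() is an added guard that reflects the assumption that x comes from get_bits1() and is therefore 0 or 1.

```c
#include <assert.h>
#include <string.h>

/* Conditional variant from the test case. */
void f1(float *f, int x)
{
    *f = x ? -1.0f : 1.0f;
}

/* Table variant; indexing is only valid for x == 0 or x == 1. */
void f2(float *f, int x)
{
    static const float t[2] = { 1.0f, -1.0f };
    assert(x == 0 || x == 1);   /* hypothetical guard, not in the original */
    *f = t[x];
}

/* Hypothetical helper: returns 1 if both variants store bit-identical
 * floats for x = 0 and x = 1, and 0 otherwise. */
int variants_agree(void)
{
    for (int x = 0; x < 2; x++) {
        float a, b;
        f1(&a, x);
        f2(&b, x);
        if (memcmp(&a, &b, sizeof a) != 0)
            return 0;
    }
    return 1;
}
```

The two functions only differ for inputs outside {0, 1}: f1() maps every nonzero x to -1.0, while f2() would read past the end of the table.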
x86 gcc-4.3.2 -march=core2 -O3:
0000000000000000 <f1>:
0: b8 00 00 80 3f mov $0x3f800000,%eax
5: 85 f6 test %esi,%esi
7: 74 05 je e <f1+0xe>
9: b8 00 00 80 bf mov $0xbf800000,%eax
e: 89 07 mov %eax,(%rdi)
10: c3 retq
0000000000000020 <f2>:
20: 48 63 f6 movslq %esi,%rsi
23: 8b 04 b5 00 00 00 00 mov 0x0(,%rsi,4),%eax
2a: 89 07 mov %eax,(%rdi)
2c: c3 retq
I reckon the table wins here. I wonder why it didn't use cmov in the
first case.
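The two immediates in the f1 listing, 0x3f800000 and 0xbf800000, are the IEEE 754 single-precision encodings of 1.0 and -1.0, and they differ only in the sign bit. The (uint32_t*) suggestion quoted at the top builds that pattern with integer arithmetic. Below is a minimal sketch of the same idea without the pointer cast, assuming float is a 32-bit IEEE 754 single and x is 0 or 1. The name f3 is hypothetical, and memcpy is swapped in for the cast to sidestep the aliasing problem.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical third variant: assemble the bit pattern of +-1.0f as an
 * integer, then copy the bytes into the float. No object is accessed
 * through an lvalue of the wrong type, so strict aliasing is respected.
 * Assumes a 32-bit IEEE 754 float and x == 0 or x == 1. */
void f3(float *f, unsigned x)
{
    uint32_t bits = 0x3F800000u | ((uint32_t)x << 31);  /* x becomes the sign bit */
    memcpy(f, &bits, sizeof bits);
}
```

Compilers commonly reduce a fixed-size memcpy like this to a plain store, but as with the other variants that is something to check per compiler rather than assume.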
ARM gcc-4.2.1 -mcpu=cortex-a8 -mfpu=neon -O3:
00000000 <f1>:
0: eef77a00 fconsts s15, #112
4: e3510000 cmp r1, #0 ; 0x0
8: eebf7a00 fconsts s14, #240
c: 1ef07a47 fcpysne s15, s14
10: edc07a00 fsts s15, [r0]
14: e12fff1e bx lr
00000018 <f2>:
18: e59f3008 ldr r3, [pc, #8] ; 28 <f2+0x10>
1c: e7932101 ldr r2, [r3, r1, lsl #2]
20: e5802000 str r2, [r0]
24: e12fff1e bx lr
28: 00000000 .word 0x00000000
The table probably wins if it is cached. Otherwise the conditional
looks faster.
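Whether the table is cached depends on the surrounding code, so this is a question for measurement rather than for reading disassembly. A rough timing sketch follows, under loud assumptions: time_variant() is a hypothetical helper, the input bits come from a simple linear congruential generator rather than a real bitstream, and calling through a function pointer prevents inlining, so the numbers will not match what happens inside the decoder loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

void f1(float *f, int x) { *f = x ? -1.0f : 1.0f; }

void f2(float *f, int x)
{
    static const float t[2] = { 1.0f, -1.0f };
    *f = t[x];
}

/* Hypothetical helper: calls fn on n pseudo-random bits, writes the
 * elapsed CPU time in seconds to *secs and returns the sum of the stored
 * values. The sum keeps the calls from being optimised away and doubles
 * as a cross-check between variants. */
double time_variant(void (*fn)(float *, int), size_t n, double *secs)
{
    uint32_t state = 1u;            /* fixed seed: same bits on every run */
    double sum = 0.0;
    clock_t start = clock();

    for (size_t i = 0; i < n; i++) {
        float v;
        state = state * 1664525u + 1013904223u;  /* LCG step */
        fn(&v, (int)(state >> 31));              /* top bit: 0 or 1 */
        sum += v;
    }
    *secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    return sum;
}
```

Calling time_variant(f1, n, &t1) and time_variant(f2, n, &t2) with the same n should return equal sums, since both runs see the same bit sequence; only the reported times are expected to differ.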
ARM gcc-4.3.2 -mcpu=cortex-a8 -mfpu=neon -O3:
00000000 <f1>:
0: eeb77a00 fconsts s14, #112
4: e3510000 cmp r1, #0 ; 0x0
8: eeff7a00 fconsts s15, #240
c: 0ef07a47 fcpyseq s15, s14
10: edc07a00 fsts s15, [r0]
14: e12fff1e bx lr
00000020 <f2>:
20: e3003000 movw r3, #0 ; 0x0
24: e3403000 movt r3, #0 ; 0x0
28: e7932101 ldr r2, [r3, r1, lsl #2]
2c: e5802000 str r2, [r0]
30: e12fff1e bx lr
The table case is slightly improved by the use of a movw/movt pair
instead of loading the table address from a literal pool.
ARM RVCT 4.0 armcc -O3 -Otime --cpu cortex-a8:
00000000 <f1>:
0: e3510000 cmp r1, #0 ; 0x0
4: 0eb70a00 fconstseq s0, #112
8: 1ebf0a00 fconstsne s0, #240
c: ed800a00 fsts s0, [r0]
10: e12fff1e bx lr
00000014 <f2>:
14: e59f200c ldr r2, [pc, #12] ; 28 <f2+0x14>
18: e0821101 add r1, r2, r1, lsl #2
1c: ed910a00 flds s0, [r1]
20: ed800a00 fsts s0, [r0]
24: e12fff1e bx lr
28: 00000000 .word 0x00000000
This version of f1() is what I was imagining when I suggested doing it
this way.
TI TMS470 v4.6.0A08249 cl470 -O3 -me -mf=5 -mv=7a8 --abi=eabi -op=3:
00000000 <f2>:
0: e59fc01c ldr ip, [pc, #28] ; 24 <f1+0x14>
4: e79cc101 ldr ip, [ip, r1, lsl #2]
8: e580c000 str ip, [r0]
c: e12fff1e bx lr
00000010 <f1>:
10: e3510000 cmp r1, #0 ; 0x0
14: 159fc010 ldrne ip, [pc, #16] ; 2c <f1+0x1c>
18: 059fc008 ldreq ip, [pc, #8] ; 28 <f1+0x18>
1c: e580c000 str ip, [r0]
20: e12fff1e bx lr
24: 00000000 .word 0x00000000
28: 3f800000 .word 0x3f800000
2c: bf800000 .word 0xbf800000
Note that the order of the functions is reversed. Even the conditional
version loads its constants from memory, here from the literal pool.
PPC gcc-4.2.4 -mcpu=970 -O3:
0000000000000000 <.f1>:
0: 2f a4 00 00 cmpdi cr7,r4,0
4: 41 9e 00 1c beq- cr7,20 <.f1+0x20>
8: c0 02 00 00 lfs f0,0(r2)
c: d0 03 00 00 stfs f0,0(r3)
10: 4e 80 00 20 blr
14: 60 00 00 00 nop
18: 60 00 00 00 nop
1c: 60 00 00 00 nop
20: c0 02 00 08 lfs f0,8(r2)
24: d0 03 00 00 stfs f0,0(r3)
28: 4e 80 00 20 blr
...
0000000000000040 <.f2>:
40: e9 22 00 10 ld r9,16(r2)
44: 78 84 17 64 rldicr r4,r4,2,61
48: 7c 04 4c 2e lfsx f0,r4,r9
4c: d0 03 00 00 stfs f0,0(r3)
50: 4e 80 00 20 blr
My PPC assembler knowledge isn't the best, but I don't like what f1()
looks like here.
Looking at the above, I think it's safest to leave the table.
Silly compilers...
--
Måns Rullgård
mans at mansr.com