[FFmpeg-devel] [PATCH] AAC: unroll parts of decode_spectrum_and_dequant()

Tue Dec 9 20:36:29 CET 2008

"Robert Swain" <robert.swain at gmail.com> writes:

> 2008/12/9 M?ns Rullg?rd <mans at mansr.com>:
>> M?ns Rullg?rd wrote:
>>> Michael Niedermayer wrote:
>>>> something like:
>>>> if (vq_ptr[2]) ((uint32_t*)coef)[coef_tmp_idx + 2] = (get_bits1(gb)<<31) +
>>>> 0x3F800000;
>>>>
>>>> might be even faster
>>>> but i agree with robert that this should be a seperate patch
>>>
>>> Strict aliasing violation.  Depending on CPU it might also be slower.
>>> Most FPUs can generate +-1 constants efficiently.
>>
>> I meant that something like this could be faster:
>>
>>  if (vq_ptr[2]) coef[coef_tmp_idx + 2] = get_bits1(gb)? -1.0 : 1.0;
>>
>> IMO it deserves testing.
>
> If it is, would you want the commit reverting and this committing
> instead or would committing this over the top be acceptable?

In cases like this, fewer commits are better IMO.  Anyone who wants
can easily get a diff from rN to rN+2.

> Alex: Feel free to test M?ns' suggestion. :) I'm looking at your
> other patches.

I tried a few compilers on a simplified test case:

void f1(float *f, int x)
{
    *f = x? -1.0: 1.0;
}

void f2(float *f, int x)
{
    static const float t[2] = { 1.0, -1.0 };
    *f = t[x];
}

The results are varied.

x86 gcc-4.3.2 -march=core2 -O3:
0000000000000000 <f1>:
   0:   b8 00 00 80 3f          mov    $0x3f800000,%eax
   5:   85 f6                   test   %esi,%esi
   7:   74 05                   je     e <f1+0xe>
   9:   b8 00 00 80 bf          mov    $0xbf800000,%eax
   e:   89 07                   mov    %eax,(%rdi)
  10:   c3                      retq   

0000000000000020 <f2>:
  20:   48 63 f6                movslq %esi,%rsi
  23:   8b 04 b5 00 00 00 00    mov    0x0(,%rsi,4),%eax
  2a:   89 07                   mov    %eax,(%rdi)
  2c:   c3                      retq   

I reckon the table wins here.  I wonder why it didn't use cmov in the
first case.

ARM gcc-4.2.1 -mcpu=cortex-a8 -mfpu=neon -O3:
00000000 <f1>:
   0:   eef77a00        fconsts s15, #112
   4:   e3510000        cmp     r1, #0  ; 0x0
   8:   eebf7a00        fconsts s14, #240
   c:   1ef07a47        fcpysne s15, s14
  10:   edc07a00        fsts    s15, [r0]
  14:   e12fff1e        bx      lr

00000018 <f2>:
  18:   e59f3008        ldr     r3, [pc, #8]    ; 28 <f2+0x10>
  1c:   e7932101        ldr     r2, [r3, r1, lsl #2]
  20:   e5802000        str     r2, [r0]
  24:   e12fff1e        bx      lr
  28:   00000000        .word   0x00000000

The table probably wins if it is cached.  Otherwise the conditional
looks faster.

ARM gcc-4.3.2 -mcpu=cortex-a8 -mfpu=neon -O3:
00000000 <f1>:
   0:   eeb77a00        fconsts s14, #112
   4:   e3510000        cmp     r1, #0  ; 0x0
   8:   eeff7a00        fconsts s15, #240
   c:   0ef07a47        fcpyseq s15, s14
  10:   edc07a00        fsts    s15, [r0]
  14:   e12fff1e        bx      lr

00000020 <f2>:
  20:   e3003000        movw    r3, #0  ; 0x0
  24:   e3403000        movt    r3, #0  ; 0x0
  28:   e7932101        ldr     r2, [r3, r1, lsl #2]
  2c:   e5802000        str     r2, [r0]
  30:   e12fff1e        bx      lr

The table case is slightly improved by the use of a movw/movt pair
instead of loading the table address from a literal pool.

ARM RVCT 4.0 armcc -O3 -Otime --cpu cortex-a8:
00000000 <f1>:
   0:   e3510000        cmp     r1, #0  ; 0x0
   4:   0eb70a00        fconstseq       s0, #112
   8:   1ebf0a00        fconstsne       s0, #240
   c:   ed800a00        fsts    s0, [r0]
  10:   e12fff1e        bx      lr

00000014 <f2>:
  14:   e59f200c        ldr     r2, [pc, #12]   ; 28 <f2+0x14>
  18:   e0821101        add     r1, r2, r1, lsl #2
  1c:   ed910a00        flds    s0, [r1]
  20:   ed800a00        fsts    s0, [r0]
  24:   e12fff1e        bx      lr
  28:   00000000        .word   0x00000000

This version of f1() is what I was imagining when I suggested doing it
this way.

TI TMS470 v4.6.0A08249 cl470 -O3 -me -mf=5 -mv=7a8 --abi=eabi -op=3:
00000000 <f2>:
   0:   e59fc01c        ldr     ip, [pc, #28]   ; 24 <f1+0x14>
   4:   e79cc101        ldr     ip, [ip, r1, lsl #2]
   8:   e580c000        str     ip, [r0]
   c:   e12fff1e        bx      lr

00000010 <f1>:
  10:   e3510000        cmp     r1, #0  ; 0x0
  14:   159fc010        ldrne   ip, [pc, #16]   ; 2c <f1+0x1c>
  18:   059fc008        ldreq   ip, [pc, #8]    ; 28 <f1+0x18>
  1c:   e580c000        str     ip, [r0]
  20:   e12fff1e        bx      lr
  24:   00000000        .word   0x00000000
  28:   3f800000        .word   0x3f800000
  2c:   bf800000        .word   0xbf800000

Note that the order of the functions is reversed.  It uses a table
even for the conditional.

PPC gcc-4.2.4 -mcpu=970 -O3:
0000000000000000 <.f1>:
   0:   2f a4 00 00     cmpdi   cr7,r4,0
   4:   41 9e 00 1c     beq-    cr7,20 <.f1+0x20>
   8:   c0 02 00 00     lfs     f0,0(r2)
   c:   d0 03 00 00     stfs    f0,0(r3)
  10:   4e 80 00 20     blr
  14:   60 00 00 00     nop
  18:   60 00 00 00     nop
  1c:   60 00 00 00     nop
  20:   c0 02 00 08     lfs     f0,8(r2)
  24:   d0 03 00 00     stfs    f0,0(r3)
  28:   4e 80 00 20     blr
        ...

0000000000000040 <.f2>:
  40:   e9 22 00 10     ld      r9,16(r2)
  44:   78 84 17 64     rldicr  r4,r4,2,61
  48:   7c 04 4c 2e     lfsx    f0,r4,r9
  4c:   d0 03 00 00     stfs    f0,0(r3)
  50:   4e 80 00 20     blr

My PPC assembler knowledge isn't the best, but I don't like what f1()
looks like here.

Looking at the above, I think it's safest to leave the table.
Silly compilers...

-- 
M?ns Rullg?rd
mans at mansr.com