[FFmpeg-devel] [PATCH] NEON FFT/IMDCT
David Conrad
lessen42
Thu Sep 3 15:37:41 CEST 2009
On Sep 3, 2009, at 5:41 AM, Naotoshi Nojiri wrote:
> Hi,
>
> I tested the patch on Cortex-A8 @500MHz (BeagleBoard).
> FFT (fft-test -s):
> 440.8 -> 34.2 us/transform (12.9x speed up)
> IMDCT (fft-test -i -m -s):
> 142.4 -> 11.8 us/transform (12.1x speed up)
>
> I had written NEON intrinsics code a bit, but this is my first
> ARM/NEON code in assembly.
> So, any comments and suggestions would be appreciated.
> +__attribute__((noinline)) void ff_imdct_half_neon(MDCTContext *s,
> FFTSample *output, const FFTSample *input)
Use av_noinline here instead of spelling out the attribute.
> +fft4_neon: // r0: FFTComplex *z
> + vld1.32 {d16-d19}, [r0, :128] // q8{r0,i0,r1,i1} q9{r2,i2,r3,i3}
> + vext.32 q9, q9, q9, #1
> + vswp d17, d18 // q8{r0,i0,i2,r3} q9{r1,i1,i3,r2}
> + vadd.f32 q10, q8, q9 // {t1,t2,t5,t6}
> + vsub.f32 q9, q8, q9 // {t3,t4,t7,t8}
> + vrev64.32 d21, d21
> + vswp d21, d18 // q10{t1,t2,t3,t4} q9{t6,t5,t7,t8}
> + vadd.f32 q8, q10, q9 // {r0,i0,r1,i1}
> + vsub.f32 q9, q10, q9 // {r2,i2,r3,i3}
> + vst1.32 {d16-d19}, [r0, :128]
> + bx lr
This sequence is very much latency-bound; doing the vadd/vsub on d
registers and then a vtrn.32 should be faster. On the A8, most NEON
floating-point instructions are no faster on a q register than on its
two d registers individually, so if using 64-bit registers avoids some
permutes, it'll probably be worth it.
> +function ff_fft_calc_neon, export=1
> + ldr r2, [r0]
> + mov r0, r1
> + subs r2, r2, #3
> + blt fft4_neon
> +
> + push {r4-r6, lr}
> + movrel r3, fft_dispatch_neon
> + mov lr, pc
> + ldr pc, [r3, r2, lsl #2]
> + pop {r4-r6, pc}
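This causes a branch misprediction on the return, since the
mov lr, pc / ldr pc pair is not treated as a call by the return
predictor; always call functions with bl or blx.
In this case, though, it looks like it can simply return directly from
pass_neon.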