[FFmpeg-devel] [PATCH 2/7] ARM: NEON optimised simple_idct
Måns Rullgård
mans
Sat Dec 6 03:57:41 CET 2008
"Ian Caulfield" <ian.caulfield at gmail.com> writes:
> 2008/12/5 Mans Rullgard <mans at mansr.com>:
>
>> +function idct_col4_st8_neon
>> + vshr.s32 q2, q3, #COL_SHIFT
>> + vshr.s32 q3, q4, #COL_SHIFT
>> + vmovn.i32 d2, q2
>> + vshr.s32 q4, q7, #COL_SHIFT
>> + vmovn.i32 d3, q3
>> + vshr.s32 q5, q8, #COL_SHIFT
>> + vqmovun.s16 d2, q1
>> + vmovn.i32 d4, q4
>> + vshr.s32 q6, q14, #COL_SHIFT
>> + vst1.32 {d2[0]}, [r0,:32], r1
>> + vmovn.i32 d5, q5
>> + vshr.s32 q7, q13, #COL_SHIFT
>> + vst1.32 {d2[1]}, [r0,:32], r1
>> + vmovn.i32 d6, q6
>> + vqmovun.s16 d3, q2
>
> I'm probably missing something fundamental here, but could the
> sequence of instructions
>
> vadd.i32 q3, q11, q9 (in col4_neon)
> vadd.i32 q4, q12, q10 (in col4_neon)
> vshr.s32 q2, q3, #COL_SHIFT
> vshr.s32 q3, q4, #COL_SHIFT
> vmovn.i32 d2, q2
> vmovn.i32 d3, q3
> vqmovun.s16 d2, q1
>
> be replaced by something like
>
> vaddhn.s32 d6, q11, q9
> vaddhn.s32 d7, q12, q10
> vqshrun.s16 d2, q3, #COL_SHIFT-16

I'm the one missing the obvious. That works perfectly. Thanks.
A quick benchmark suggests it's a little faster too.
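
For anyone wanting to convince themselves that the two sequences agree, here is a rough scalar model of a single lane in C. It is only a sketch, not code from the patch: the helper names (asr32, add_wrap, sat_u8, lane_original, lane_suggested, count_mismatches) are made up for illustration, and COL_SHIFT is assumed to be 20, since the patch only shows the symbol. The argument is that the sum shifted right by COL_SHIFT always fits in 16 bits, so vmovn loses nothing, while vaddhn keeps exactly the high 16 bits of the same wrapped sum, leaving COL_SHIFT-16 bits for vqshrun to shift out before saturating.

```c
#include <assert.h>
#include <stdint.h>

#define COL_SHIFT 20 /* assumed value; the quoted patch only uses the symbol */

/* Arithmetic shift right that avoids C's implementation-defined
 * behaviour for right shifts of negative values. */
static int32_t asr32(int32_t x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Wrapping 32-bit add, as vadd.i32 and vaddhn do per lane.
 * Assumes the usual two's complement conversion back to int32_t. */
static int32_t add_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Signed-to-unsigned saturation to 0..255, as vqmovun/vqshrun do. */
static uint8_t sat_u8(int32_t x)
{
    return x < 0 ? 0 : x > 255 ? 255 : (uint8_t)x;
}

/* Original: vadd.i32, vshr.s32 #COL_SHIFT, vmovn.i32, vqmovun.s16 */
static uint8_t lane_original(int32_t a, int32_t b)
{
    int32_t sum = add_wrap(a, b);
    /* With COL_SHIFT >= 16 the result is in -32768..32767 (here
     * -2048..2047), so the narrowing truncation is lossless. */
    int16_t narrow = (int16_t)asr32(sum, COL_SHIFT);
    return sat_u8(narrow);
}

/* Suggested: vaddhn (high half of the sum), vqshrun.s16 #COL_SHIFT-16 */
static uint8_t lane_suggested(int32_t a, int32_t b)
{
    int16_t high = (int16_t)asr32(add_wrap(a, b), 16);
    return sat_u8(asr32(high, COL_SHIFT - 16));
}

/* Compare both models over pseudo-random inputs (simple LCG). */
static long count_mismatches(void)
{
    long bad = 0;
    uint32_t x = 1u;
    for (long i = 0; i < 1000000; i++) {
        x = x * 1664525u + 1013904223u;
        int32_t a = (int32_t)x;
        x = x * 1664525u + 1013904223u;
        int32_t b = (int32_t)x;
        if (lane_original(a, b) != lane_suggested(a, b))
            bad++;
    }
    return bad;
}
```

Calling count_mismatches() should report no differences under these assumptions. The model says nothing about speed; that still has to be measured on hardware.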
--
Måns Rullgård
mans at mansr.com