[FFmpeg-devel] [PATCH 2/7] ARM: NEON optimised simple_idct

Måns Rullgård mans
Sat Dec 6 03:57:41 CET 2008


"Ian Caulfield" <ian.caulfield at gmail.com> writes:

> 2008/12/5 Mans Rullgard <mans at mansr.com>:
>
>> +function idct_col4_st8_neon
>> +        vshr.s32        q2,  q3,  #COL_SHIFT
>> +        vshr.s32        q3,  q4,  #COL_SHIFT
>> +        vmovn.i32       d2,  q2
>> +        vshr.s32        q4,  q7,  #COL_SHIFT
>> +        vmovn.i32       d3,  q3
>> +        vshr.s32        q5,  q8,  #COL_SHIFT
>> +        vqmovun.s16     d2,  q1
>> +        vmovn.i32       d4,  q4
>> +        vshr.s32        q6,  q14, #COL_SHIFT
>> +        vst1.32         {d2[0]}, [r0,:32], r1
>> +        vmovn.i32       d5,  q5
>> +        vshr.s32        q7,  q13, #COL_SHIFT
>> +        vst1.32         {d2[1]}, [r0,:32], r1
>> +        vmovn.i32       d6,  q6
>> +        vqmovun.s16     d3,  q2
>
> I'm probably missing something fundamental here, but could the
> sequence of instructions
>
> vadd.i32        q3,  q11, q9 (in col4_neon)
> vadd.i32        q4,  q12, q10 (in col4_neon)
> vshr.s32        q2,  q3,  #COL_SHIFT
> vshr.s32        q3,  q4,  #COL_SHIFT
> vmovn.i32       d2,  q2
> vmovn.i32       d3,  q3
> vqmovun.s16     d2,  q1
>
> be replaced by something like
>
> vaddhn.s32    d6,  q11, q9
> vaddhn.s32    d7,  q12,q10
> vqshrun.s16  d2, q3, #COL_SHIFT-16

I'm the one missing the obvious.  That works perfectly.  Thanks.
A quick benchmark suggests it's a little faster too.
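
For anyone checking why the shorter sequence gives the same result, here is a scalar C sketch of one lane of each version. It is not part of the patch. COL_SHIFT is assumed to be 20, and the function names are made up for illustration. The sketch also assumes that right-shifting a negative value is an arithmetic shift and that out-of-range integer conversions wrap, as they do with GCC and Clang. The idea is that vaddhn keeps bits 16..31 of the sum, so shifting that half by COL_SHIFT-16 yields the same bits as shifting the full sum by COL_SHIFT and then narrowing.

```c
#include <assert.h>
#include <stdint.h>

#define COL_SHIFT 20 /* assumption: the value is not stated in this thread */

/* Saturate a signed 16-bit value to unsigned 8 bits (models vqmovun.s16). */
static uint8_t sat_u8(int16_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* 32-bit add that wraps on overflow, like vadd.i32. */
static int32_t wrap_add(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Original: vadd.i32, vshr.s32 #COL_SHIFT, vmovn.i32, vqmovun.s16 */
static uint8_t col_store_orig(int32_t a, int32_t b)
{
    int32_t shifted = wrap_add(a, b) >> COL_SHIFT; /* vshr.s32 */
    int16_t narrow  = (int16_t)shifted;            /* vmovn.i32 */
    return sat_u8(narrow);                         /* vqmovun.s16 */
}

/* Suggested: vaddhn (high half of the sum), vqshrun.s16 #COL_SHIFT-16 */
static uint8_t col_store_new(int32_t a, int32_t b)
{
    int16_t high = (int16_t)(wrap_add(a, b) >> 16);     /* vaddhn */
    return sat_u8((int16_t)(high >> (COL_SHIFT - 16))); /* vqshrun.s16 */
}

/* Compare both versions for every value of the top 12 bits of the sum
 * (the range below matches COL_SHIFT == 20), with a few low-bit patterns. */
static void check_equivalence(void)
{
    static const int32_t low_bits[] = { 0, 1, 0x7FFFF, 0xFFFFF };
    for (int32_t hi = -2048; hi <= 2047; hi++) {
        for (int i = 0; i < 4; i++) {
            int32_t sum = hi * (1 << COL_SHIFT) + low_bits[i];
            assert(col_store_orig(sum, 0) == col_store_new(sum, 0));
        }
    }
}
```

Calling check_equivalence() sweeps the sampled sums and asserts that both versions produce the same byte, including the values that saturate to 0 or 255.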

-- 
Måns Rullgård
mans at mansr.com



