[FFmpeg-devel] [PATCH] NEON code for basic scalar ops
Måns Rullgård
mans
Thu Aug 13 11:51:36 CEST 2009
Kostya <kostya.shishkov at gmail.com> writes:
> On Thu, Aug 13, 2009 at 12:33:07AM +0100, M?ns Rullg?rd wrote:
>> Kostya <kostya.shishkov at gmail.com> writes:
>>
>> > On Tue, Jul 21, 2009 at 03:23:58PM +0100, M?ns Rullg?rd wrote:
>> >> Kostya <kostya.shishkov at gmail.com> writes:
>> >>
>> >> > While waiting for RTMP patch review, here's a bit of NEON code to speed
>> >> > up int16 array addition/subtraction and scalar product calculation.
>> >> >
>> >> > This about halves decoding time for APE compressed at insane level
>> >> > (so it's only 7 times slower than realtime on my BeagleBoard).
>> >>
>> >> These functions are far from optimal.
>> >
>> > Since I won't be able to work at it for some time I post here version
>> > that is few cycles closer to optimal (but still far away).
>> >
>> > +function ff_scalarproduct_int16_neon, export=1
>> > + vmov.i16 q0, #0
>> > + vmov.i16 q1, #0
>> > + vmov.i16 q2, #0
>> > + vmov.i16 q3, #0
>> > +1: vld1.16 {d16-d17}, [r0]!
>> > + vld1.16 {d20-d21}, [r1,:128]!
>> > + vmlal.s16 q0, d16, d20
>> > + vld1.16 {d18-d19}, [r0]!
>> > + vmlal.s16 q1, d17, d21
>> > + vld1.16 {d22-d23}, [r1,:128]!
>> > + vmlal.s16 q2, d18, d22
>> > + vmlal.s16 q3, d19, d23
>> > + subs r2, r2, #16
>> > + bne 1b
>> > + vpadd.s32 d8, d0, d1
>> > + vpadd.s32 d9, d2, d3
>> > + vpadd.s32 d10, d4, d5
>> > + vpadd.s32 d11, d6, d7
>> > + vpadd.s32 d0, d8, d9
>> > + vpadd.s32 d1, d10, d11
>> > + vpadd.s32 d2, d0, d1
>> > + vpaddl.s32 d3, d2
>> > + vmov.32 r0, d3[0]
>> > + asr r0, r3
>> > + bx lr
>> > + .endfunc
>>
>> This doesn't do exactly the same thing as the C version, which shifts
>> immediately after the multiplication, before accumulating. However,
>> all calls to DSPContext.scalarproduct_int16 have a zero shift.
>>
>> Since shifting at the end is both more accurate and faster, maybe we
>> should change it. Someone would have to update the sse and altivec
>> versions of course.
>
> The intent was to have sped-up scalar product calculating for Monkey
> Audio but with CELP filters in mind too. Since those use fixed point
> values, shift right after multiplication is logical there (and will
> prevent overflows).
If you shift after multiplying, you can't use multiply-accumulate
instructions.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list