[FFmpeg-devel] [PATCH] optimize for ARM NEON
Alfred E. Heggestad
aeh
Thu Feb 25 10:14:54 CET 2010
On 2/24/10 4:08 PM, Ian Caulfield wrote:
> On 24 February 2010 13:51, Alfred E. Heggestad <aeh at db.org> wrote:
>> Hi
>>
>> We have optimized various routines in FFmpeg's libavcodec for
>> ARM with NEON instructions. The patch has been tested with the
>> H.263 codec on iPhone 3GS hardware, with major performance improvements.
>>
>>
>> We would like to donate the code to the FFmpeg project. Please let me
>> know if you have any comments.
>>
>
> I've only had a very brief look through, but there seems to be a fair
> amount of scope for further optimisation, e.g. ff_pix_abs16_xy2_neon,
> which I reckon could be simplified as:
>
> push {r4-r7, lr} /* 5 registers = 20 bytes, so h is at [sp, #20]; matches the pop below */
>
> ldr r4, [sp, #20]
> add r5, r2, #1 /* pix2' = pix2 + 1 */
>
> /* Clear result registers */
> vmov.i16 q11, #0
> vmov.i16 q12, #0
>
> /* fetch first pix2 and pix2' */
> vld1.8 {d2, d3}, [r2], r3
> vld1.8 {d4, d5}, [r5], r3
>
> 1:
> /* fetch pix3 and pix3' */
> vld1.8 {d6, d7}, [r2], r3
> vld1.8 {d8, d9}, [r5], r3
>
> subs r4, r4, #1 /* h = h - 1 */
>
> /* fetch pix1 */
> vld1.8 {d0, d1}, [r1], r3 /* 16 bytes in q0 */
>
> /* Average the pixels */
> /* a1 = pix2 + pix2' (widen 8-bit pixels to 16-bit sums) */
> vaddl.u8 q5, d2, d4
> vaddl.u8 q6, d3, d5
>
> /* a2 = pix3 + pix3' */
> vaddl.u8 q7, d6, d8
> vaddl.u8 q8, d7, d9
>
> /* a1 + a2 */
> vadd.u16 q5, q5, q7
> vadd.u16 q6, q6, q8
>
> /* ROUND((a1 + a2)/4) */
> vrshrn.u16 d14, q5, #2
> vrshrn.u16 d15, q6, #2
> /* Now, abs(pix1 - avg4(pix20, pix21, pix30, pix31)) */
> vabal.u8 q11, d0, d14
> vabal.u8 q12, d1, d15
>
> /* Move pix3 from this iteration to be pix2 in next */
> vmov q1, q3
> vmov q2, q4
>
> bgt 1b /* while (h > 0) */
>
> /* We are done!
> * calculate the result
> */
> vpaddl.u16 q4, q11
> vpadal.u16 q4, q12
>
> vpadd.u32 d0, d8, d9
>
> vmov r5, r6, d0
> add r0, r5, r6
>
> pop {r4-r7, pc}
>
> Although I don't know if this can potentially overflow, as q11 and q12
> are kept as 16-bit elements, and s in the C version is 32-bit.
>
> In general, many of the vmovl instructions are redundant, as they then
> feed into vadd instructions - it's fewer instructions (and therefore
> likely quicker) to fold the widen into the add and use vaddw or vaddl
> instead.
>
Ian,
Many thanks for your feedback.
I will have a look at your comments and send an updated patch next week.
/alfred
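
The overflow question raised in the quoted review can be checked with a
scalar sketch. The C below is not the libavcodec source; sad16_xy2_ref and
its lane_max parameter are hypothetical names introduced here for
illustration. It assumes that each of the 16 columns is accumulated in its
own 16-bit lane, as the vabal.u8 accumulation into q11 and q12 suggests,
and reports the largest per-column sum alongside the 32-bit total.

```c
#include <stdint.h>
#include <stdlib.h>

/* Rounded average of four pixels, matching the ROUND((a1 + a2)/4) step. */
static int avg4(int a, int b, int c, int d)
{
    return (a + b + c + d + 2) >> 2;
}

/*
 * Hypothetical scalar reference for a 16-pixel-wide SAD against the
 * average of each pixel with its right, lower and lower-right neighbours.
 * Reads 17 columns and h + 1 rows from pix2.
 * If lane_max is non-NULL it receives the largest per-column sum, which
 * is what a single 16-bit accumulator lane would have to hold (assumed
 * mapping of one column per lane).
 */
static int sad16_xy2_ref(const uint8_t *pix1, const uint8_t *pix2,
                         int line_size, int h, unsigned *lane_max)
{
    unsigned lanes[16] = {0};
    int total = 0;

    for (int y = 0; y < h; y++) {
        const uint8_t *pix3 = pix2 + line_size; /* row below pix2 */
        for (int x = 0; x < 16; x++) {
            int d = abs(pix1[x] - avg4(pix2[x], pix2[x + 1],
                                       pix3[x], pix3[x + 1]));
            lanes[x] += (unsigned)d;
            total += d;
        }
        pix1 += line_size;
        pix2 += line_size;
    }

    if (lane_max) {
        unsigned m = 0;
        for (int x = 0; x < 16; x++)
            if (lanes[x] > m)
                m = lanes[x];
        *lane_max = m;
    }
    return total;
}
```

Under that assumption, each row adds at most 255 to a lane, so a lane holds
at most 255 * h. For h = 16 that is 4080, and a 16-bit lane would only
overflow for h greater than 257 (255 * 257 = 65535).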