[FFmpeg-devel] [PATCH] optimize for ARM NEON

Thu Feb 25 10:14:54 CET 2010

On 2/24/10 4:08 PM, Ian Caulfield wrote:
> On 24 February 2010 13:51, Alfred E. Heggestad <aeh at db.org> wrote:
>> Hi
>>
>> We have optimized various routines in FFmpeg's libavcodec for
>> ARM using NEON instructions. The patch has been tested with the
>> H.263 codec on iPhone 3GS hardware, with major performance improvements.
>>
>>
>> We would like to donate the code to the FFmpeg project. Please let me
>> know if you have any comments.
>>
>
> I've only had a very brief look through, but there seems to be a fair
> amount of scope for further optimisation, e.g. ff_pix_abs16_xy2_neon,
> which I reckon could be simplified as:
>
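> 	push {r4-r7, lr} /* r6 is used below; matches the pop at the end */
>
> 	ldr r4, [sp, #20] /* r4 = h, stack argument above the 5 saved registers */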
> 	add r5, r2, #1 /* pix2' = pix2 + 1 */
>
> 	/* Clear result registers */
> 	vmov.i16 q11, #0
> 	vmov.i16 q12, #0
>
> 	/* fetch first pix2 and pix2' */
> 	vld1.8 {d2, d3}, [r2], r3
> 	vld1.8 {d4, d5}, [r5], r3
>
> 1:
> 	/* fetch pix3 and pix3' */
> 	vld1.8 {d6, d7}, [r2], r3
> 	vld1.8 {d8, d9}, [r5], r3
>
> 	subs r4, r4, #1 /* h = h - 1 */
>
> 	/* fetch pix1 */
> 	vld1.8 {d0, d1}, [r1], r3 /* 16 bytes in q0 */
>
> 	/* Average the pixels */
> 	/* a1 = pix2 + pix2', widened from 8 to 16 bits */
> 	vaddl.u8 q5, d2, d4
> 	vaddl.u8 q6, d3, d5
>
> 	/* a2 = pix3 + pix3', widened from 8 to 16 bits */
> 	vaddl.u8 q7, d6, d8
> 	vaddl.u8 q8, d7, d9
>
> 	/* a1 + a2 */
> 	vadd.u16 q5, q5, q7
> 	vadd.u16 q6, q6, q8
>
> 	/* ROUND((a1 + a2)/4) */
> 	vrshrn.u16 d14, q5, #2
> 	vrshrn.u16 d15, q6, #2
> 	/* Now accumulate abs(pix1 - avg4(pix20, pix21, pix30, pix31)) */
> 	vabal.u8 q11, d0, d14
> 	vabal.u8 q12, d1, d15
>
> 	/* Move pix3 from this iteration to be pix2 in next */
> 	vmov q1, q3
> 	vmov q2, q4
>
> 	bgt 1b /* while (h > 0) */
>
> 	/* We are done!
> 	 * Calculate the result.
> 	 */
> 	vpaddl.u16 q4, q11
> 	vpadal.u16 q4, q12
>
> 	vpadd.u32 d0, d8, d9
>
> 	vmov r5, r6, d0
> 	add r0, r5, r6
>
> 	pop {r4-r7, pc}
>
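
For reference, the computation that the loop above vectorises can be
sketched in scalar C. This is a minimal sketch, not the libavcodec
source: the name pix_abs16_xy2_ref and the omitted context argument are
illustrative assumptions, and avg4 is assumed to be the rounded average
(a + b + c + d + 2) >> 2, matching the ROUND((a1 + a2)/4) comment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounded average of four pixels (assumed definition). */
static inline int avg4(int a, int b, int c, int d)
{
    return (a + b + c + d + 2) >> 2;
}

/* Hypothetical scalar reference: sum over a 16 x h block of
 * |pix1 - avg4(2x2 neighbourhood in pix2)|.
 * Reads 17 bytes per row and h + 1 rows from pix2. */
static int pix_abs16_xy2_ref(const uint8_t *pix1, const uint8_t *pix2,
                             int line_size, int h)
{
    const uint8_t *pix3 = pix2 + line_size; /* row below pix2 */
    int s = 0;                              /* 32-bit accumulator */

    assert(h > 0);
    for (int i = 0; i < h; i++) {
        for (int x = 0; x < 16; x++)
            s += abs(pix1[x] - avg4(pix2[x], pix2[x + 1],
                                    pix3[x], pix3[x + 1]));
        pix1 += line_size;
        pix2 += line_size;
        pix3 += line_size;
    }
    return s;
}
```

A reference like this could be used to compare the NEON routine against
known inputs before benchmarking it.
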
> I don't know whether this can overflow, though, as q11 and q12 are
> kept as 16-bit elements, whereas s in the C version is 32-bit.
>
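
A rough bound on the overflow question: each vabal.u8 adds at most 255
to a 16-bit lane per row, so after h rows a lane holds at most 255 * h.
The sketch below only checks that arithmetic. The helper names are made
up for illustration, and treating h = 16 as the largest block height is
an assumption about how the function is called.

```c
#include <assert.h>
#include <stdint.h>

/* Worst case for one 16-bit lane after h rows:
 * every row contributes |255 - 0| = 255. */
static uint32_t max_lane_sum(uint32_t h)
{
    return 255u * h;
}

/* Returns 1 if a 16-bit lane cannot overflow for block height h. */
static int lane_fits_u16(uint32_t h)
{
    assert(h <= 0x00ffffffu); /* keeps 255 * h inside uint32_t */
    return max_lane_sum(h) <= UINT16_MAX;
}
```

Under that assumption max_lane_sum(16) is 4080, well below 65535, and
the bound only fails once h exceeds 257.
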
> In general, many of the vmovl instructions are redundant, as they then
> feed into vadd instructions - it's fewer instructions (and therefore
> likely quicker) to fold the widen into the add and use vaddw or vaddl
> instead.
>
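
To illustrate the point about folding the widen into the add, here is a
small C sketch using NEON intrinsics, with a scalar fallback so it also
builds on non-ARM hosts. The function names are hypothetical. The
intrinsics vmovl_u8, vaddq_u16 and vaddl_u8 correspond to the vmovl.u8,
vadd.i16 and vaddl.u8 instructions.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Two steps: widen both inputs to 16 bits, then add. */
static void add_widen_two_step(const uint8_t a[8], const uint8_t b[8],
                               uint16_t out[8])
{
#if defined(__ARM_NEON)
    uint16x8_t wa = vmovl_u8(vld1_u8(a));
    uint16x8_t wb = vmovl_u8(vld1_u8(b));
    vst1q_u16(out, vaddq_u16(wa, wb));
#else
    for (int i = 0; i < 8; i++) {
        uint16_t wa = a[i];
        uint16_t wb = b[i];
        out[i] = (uint16_t)(wa + wb);
    }
#endif
}

/* One step: a single widening add produces the same 16-bit sums. */
static void add_widen_fused(const uint8_t a[8], const uint8_t b[8],
                            uint16_t out[8])
{
#if defined(__ARM_NEON)
    vst1q_u16(out, vaddl_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)((uint16_t)a[i] + b[i]);
#endif
}

/* Asserts that both variants agree lane by lane. */
static void check_equivalent(const uint8_t a[8], const uint8_t b[8])
{
    uint16_t x[8], y[8];

    add_widen_two_step(a, b, x);
    add_widen_fused(a, b, y);
    for (int i = 0; i < 8; i++)
        assert(x[i] == y[i] && x[i] == a[i] + b[i]);
}
```

The fused form replaces three instructions with one per pair of inputs.
Whether that translates into a measurable speed-up on a given core
would still need to be benchmarked.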

Ian,

Many thanks for your feedback.

I will have a look at your comments and send an updated patch next week.

/alfred