[FFmpeg-devel] [PATCH] optimize for ARM NEON

Wed Feb 24 16:08:33 CET 2010

On 24 February 2010 13:51, Alfred E. Heggestad <aeh at db.org> wrote:
> Hi
>
> We have optimized various routines in FFmpeg's libavcodec for
> ARM with NEON instructions. The patch has been tested with the
> H.263 codec on iPhone 3GS hardware, with major performance improvements.
>
>
> We would like to donate the code to the FFmpeg project. Please let me
> know if you have any comments.
>

I've only had a very brief look through, but there seems to be a fair
amount of scope for further optimisation, e.g. ff_pix_abs16_xy2_neon,
which I reckon could be simplified to:

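	/* NB: d8-d15 (q4-q7) are callee-saved under the AAPCS, so real code
	 * must also vpush/vpop them, or use other registers */
	push {r4-r7, lr} /* must match the pop at the end; r6 is used there */

	ldr r4, [sp, #20] /* h, the first stack argument, sits above the 5 saved words */
	add r5, r2, #1 /* pix2' = pix2 + 1 */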

	/* Clear result registers */
	vmov.i16 q11, #0
	vmov.i16 q12, #0

	/* fetch first pix2 and pix2' */
	vld1.8 {d2, d3}, [r2], r3
	vld1.8 {d4, d5}, [r5], r3

1:
	/* fetch pix3 and pix3' */
	vld1.8 {d6, d7}, [r2], r3
	vld1.8 {d8, d9}, [r5], r3

	subs r4, r4, #1 /* h = h - 1 */

	/* fetch pix1 */
	vld1.8 {d0, d1}, [r1], r3 /* 16 bytes in q0 */

	/* Average the pixels */
	/* a1 = pix2 + pix2', widened from 8 to 16 bits (low half, high half) */
	vaddl.u8 q5, d2, d4
	vaddl.u8 q6, d3, d5

	/* a2 = pix3 + pix3' */
	vaddl.u8 q7, d6, d8
	vaddl.u8 q8, d7, d9

	/* a1 + a2 */
	vadd.u16 q5, q5, q7
	vadd.u16 q6, q6, q8

	/* ROUND((a1 + a2)/4) */
	vrshrn.u16 d14, q5, #2
	vrshrn.u16 d15, q6, #2
	/* Now, abs(pix1 - avg4(pix20, pix21, pix30, pix31)) */
	vabal.u8 q11, d0, d14
	vabal.u8 q12, d1, d15

	/* Move pix3 from this iteration to be pix2 in next */
	vmov q1, q3
	vmov q2, q4

	bgt 1b /* while (h > 0) */

	/* We are done!
	 * calculate the result
	 */
	vpaddl.u16 q4, q11
	vpadal.u16 q4, q12

	vpadd.u32 d0, d8, d9

	vmov r5, r6, d0
	add r0, r5, r6

	pop {r4-r7, pc}

I haven't checked whether this can overflow, though: q11 and q12
accumulate in 16-bit lanes, whereas s in the C version is 32-bit.

More generally, many of the vmovl instructions in the patch are redundant:
their results feed straight into vadd instructions, so it takes fewer
instructions (and is therefore likely quicker) to fold the widening into
the add and use vaddw or vaddl instead.
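
To illustrate the folding with intrinsics rather than assembly, here is a
small sketch. The function names are made up for this example, and the
scalar fallback is only there so that it builds on non-ARM hosts. A compiler
may already combine the two steps by itself, so the difference matters
mainly for hand-written assembly.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Widen both inputs first (2 x vmovl), then add (vadd): three instructions. */
void widen_add_two_step(uint16_t out[8], const uint8_t a[8], const uint8_t b[8])
{
#if defined(__ARM_NEON)
    uint16x8_t wa = vmovl_u8(vld1_u8(a));
    uint16x8_t wb = vmovl_u8(vld1_u8(b));
    vst1q_u16(out, vaddq_u16(wa, wb));
#else
    for (int i = 0; i < 8; i++) {
        uint16_t wa = a[i], wb = b[i];
        out[i] = (uint16_t)(wa + wb);
    }
#endif
}

/* Same result with the widening folded into the add (vaddl): one instruction. */
void widen_add_folded(uint16_t out[8], const uint8_t a[8], const uint8_t b[8])
{
#if defined(__ARM_NEON)
    vst1q_u16(out, vaddl_u8(vld1_u8(a), vld1_u8(b)));
#else
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
#endif
}

/* Adding 8-bit data to an already wide accumulator maps to vaddw. */
void accumulate_folded(uint16_t acc[8], const uint8_t a[8])
{
#if defined(__ARM_NEON)
    vst1q_u16(acc, vaddw_u8(vld1q_u16(acc), vld1_u8(a)));
#else
    for (int i = 0; i < 8; i++)
        acc[i] = (uint16_t)(acc[i] + a[i]);
#endif
}
```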

Ian