[FFmpeg-devel] [PATCH] optimize for ARM NEON
Ian Caulfield
ian.caulfield
Wed Feb 24 16:08:33 CET 2010
On 24 February 2010 13:51, Alfred E. Heggestad <aeh at db.org> wrote:
> Hi
>
> We have optimized various routines in FFmpeg's libavcodec for
> ARM with NEON instructions. The patch has been tested with the
> H.263 codec on iPhone 3GS hardware, with major performance improvements.
>
>
> we would like to donate the code to the FFmpeg project, please let me
> know if you have any comments.
>
I've only had a very brief look through, but there seems to be a fair
amount of scope for further optimisation. For example, I reckon
ff_pix_abs16_xy2_neon could be simplified as:
push {r4-r7, lr}
ldr r4, [sp, #20]
add r5, r2, #1 /* pix2' = pix2 + 1 */
/* Clear result registers */
vmov.i16 q11, #0
vmov.i16 q12, #0
/* fetch first pix2 and pix2' */
vld1.8 {d2, d3}, [r2], r3
vld1.8 {d4, d5}, [r5], r3
1:
/* fetch pix3 and pix3' */
vld1.8 {d6, d7}, [r2], r3
vld1.8 {d8, d9}, [r5], r3
subs r4, r4, #1 /* h = h - 1 */
/* fetch pix1 */
vld1.8 {d0, d1}, [r1], r3 /* 16 bytes in q0 */
/* Average the pixels */
/* a1 = pix2 + pix2' (widen 8-bit pixels to 16-bit sums) */
vaddl.u8 q5, d2, d4
vaddl.u8 q6, d3, d5
/* a2 = pix3 + pix3' */
vaddl.u8 q7, d6, d8
vaddl.u8 q8, d7, d9
/* a1 + a2 */
vadd.u16 q5, q5, q7
vadd.u16 q6, q6, q8
/* ROUND((a1 + a2)/4) */
vrshrn.u16 d14, q5, #2
vrshrn.u16 d15, q6, #2
/* Now, abs(pix1 - avg4(pix20, pix21, pix30, pix31)) */
vabal.u8 q11, d0, d14
vabal.u8 q12, d1, d15
/* Move pix3 from this iteration to be pix2 in next */
vmov q1, q3
vmov q2, q4
bgt 1b /* while (h > 0) */
/* We are done!
* calculate the result
*/
vpaddl.u16 q4, q11
vpadal.u16 q4, q12
vpadd.u32 d0, d8, d9
vmov r5, r6, d0
add r0, r5, r6
pop {r4-r7, pc}
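For reference, the quantity the loop above is meant to compute can be
written in scalar C roughly as follows. This is only a sketch: the
function name sad16_xy2_ref is made up for illustration, and the
rounding is taken from the ROUND((a1 + a2)/4) comment in the assembly
rather than copied from the FFmpeg source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Rounded average of four pixels: ROUND((a + b + c + d) / 4). */
static inline int avg4(int a, int b, int c, int d)
{
    return (a + b + c + d + 2) >> 2;
}

/*
 * Hypothetical scalar reference (the name is illustrative only):
 * sum of absolute differences between a 16-pixel-wide block pix1 and
 * the average of pix2 with its right, lower and lower-right
 * neighbours, over h rows.
 */
static int sad16_xy2_ref(const uint8_t *pix1, const uint8_t *pix2,
                         int line_size, int h)
{
    const uint8_t *pix3 = pix2 + line_size; /* row below pix2 */
    int s = 0;                              /* 32-bit accumulator */

    for (int i = 0; i < h; i++) {
        for (int x = 0; x < 16; x++) {
            int avg = avg4(pix2[x], pix2[x + 1], pix3[x], pix3[x + 1]);
            s += abs(pix1[x] - avg);
        }
        pix1 += line_size;
        pix2 += line_size;
        pix3 += line_size;
    }
    return s;
}
```

Note that this reads 17 bytes per row of pix2 and h + 1 rows in total,
which matches the two overlapping 16-byte loads (pix2 and pix2 + 1) and
the extra row fetched before the loop in the NEON version.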
I don't know whether this can overflow, though: q11 and q12 accumulate
in 16-bit elements, whereas s in the C version is 32-bit.
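A quick way to bound this: each vabal.u8 adds one absolute difference
(at most 255) to each 16-bit lane per row, so a lane holds at most
255 * h after h rows. The helper names below are hypothetical, and the
check assumes nothing else is accumulated into q11/q12.

```c
#include <stdint.h>

/*
 * Worst-case value held by one 16-bit lane of q11/q12 after h rows,
 * given one absolute difference of at most 255 per lane per row.
 */
static uint32_t max_lane_sum(uint32_t h)
{
    return 255u * h;
}

/* Returns 1 if a 16-bit lane cannot overflow for this block height. */
static int lane_fits_u16(uint32_t h)
{
    return max_lane_sum(h) <= UINT16_MAX;
}
```

For h = 16 the bound is 4080, and a lane only overflows once h exceeds
257, so under these assumptions the 16-bit accumulators are safe for
macroblock-sized heights.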
In general, many of the vmovl instructions in the patch are redundant,
as their results feed straight into vadd instructions. It takes fewer
instructions (and is therefore likely quicker) to fold the widening
into the add and use vaddw or vaddl instead.
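To illustrate the folding with intrinsics from arm_neon.h, here is a
small sketch comparing the two forms on one 8-byte vector. The function
names are made up for this example, and the scalar fallback is only
there so the code also builds on hosts without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Two-step form: widen both operands (vmovl.u8), then add (vadd.i16). */
static void add_widen_two_step(const uint8_t a[8], const uint8_t b[8],
                               uint16_t out[8])
{
#if defined(__ARM_NEON)
    uint16x8_t wa = vmovl_u8(vld1_u8(a));
    uint16x8_t wb = vmovl_u8(vld1_u8(b));
    vst1q_u16(out, vaddq_u16(wa, wb));
#else
    /* Scalar model of the same operation for non-NEON hosts. */
    for (int i = 0; i < 8; i++) {
        uint16_t wa = a[i];
        uint16_t wb = b[i];
        out[i] = (uint16_t)(wa + wb);
    }
#endif
}

/* Folded form: a single widening add (vaddl.u8). */
static void add_widen_folded(const uint8_t a[8], const uint8_t b[8],
                             uint16_t out[8])
{
#if defined(__ARM_NEON)
    vst1q_u16(out, vaddl_u8(vld1_u8(a), vld1_u8(b)));
#else
    /* Scalar model of the same operation for non-NEON hosts. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
#endif
}
```

Both functions produce the same 16-bit sums; the folded form simply
needs one NEON instruction where the two-step form needs three.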
Ian