[FFmpeg-devel] [PATCH] swscale/arm: add yuv2planeX_8_neon
Benoit Fouet
benoit.fouet at free.fr
Mon Apr 11 09:58:12 CEST 2016
Hi,
(again, thanks to both of you for documenting all this assembly/NEON code)
On 09/04/2016 10:22, Matthieu Bouron wrote:
> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>
> ---
>
> Hello,
>
> The following patch adds the yuv2planeX_8_neon function for the arm platform. It is
> currently restricted to 8-bit per component sources until I fix fate issues
> with 10-bit sources (the dnxhd-*-10bit tests fail but I haven't figured out yet
> where it comes from).
>
> Matthieu
>
> ---
> libswscale/arm/Makefile | 1 +
> libswscale/arm/output.S | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
> libswscale/arm/swscale.c | 7 +++++
> libswscale/utils.c | 3 +-
> 4 files changed, 88 insertions(+), 1 deletion(-)
> create mode 100644 libswscale/arm/output.S
>
> [...]
>
> diff --git a/libswscale/arm/output.S b/libswscale/arm/output.S
> new file mode 100644
> index 0000000..4437447
> --- /dev/null
> +++ b/libswscale/arm/output.S
> @@ -0,0 +1,78 @@
[...]
> +function ff_yuv2planeX_8_neon, export=1
> + push {r4-r12, lr}
> + vpush {q4-q7}
> + ldr r4, [sp, #104] @ dstW
> + ldr r5, [sp, #108] @ dither
> + ldr r6, [sp, #112] @ offset
> + vld1.8 {d0}, [r5] @ load 8x8-bit dither values
> + tst r6, #0 @ check offsetting which can be 0 or 3 only
> + beq 1f
> + vext.u8 d0, d0, d0, #3 @ honor offseting which can be 3 only
> +1: vmovl.u8 q0, d0 @ extend dither to 16-bit
> + vshll.u16 q1, d0, #12 @ extend dither to 32-bit with left shift by 12 (part 1)
> + vshll.u16 q2, d1, #12 @ extend dither to 32-bit with left shift by 12 (part 2)
> + mov r7, #0 @ i = 0
> +2: vmov.u8 q3, q1 @ initialize accumulator with dithering values (part 1)
> + vmov.u8 q4, q2 @ initialize accumulator with dithering values (part 2)
> + mov r8, r1 @ tmpFilterSize = filterSize
> + mov r9, r2 @ srcp
> + mov r10, r0 @ filterp
> +3: ldr r11, [r9], #4 @ get pointer @ src[j]
> + ldr r12, [r9], #4 @ get pointer @ src[j+1]
> + add r11, r11, r7, lsl #1 @ &src[j][i]
> + add r12, r12, r7, lsl #1 @ &src[j+1][i]
> + vld1.16 {q5}, [r11] @ read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H
> + vld1.16 {q6}, [r12] @ read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
> + ldr r11, [r10], #4 @ read 2x16-bit coeffs (X, Y) at (filter[j], filter[j+1])
> + vmov.16 q7, q5 @ copy 8x16-bit @ src[j ][i + {0..7}] for following inplace zip instruction
> + vmov.16 q8, q6 @ copy 8x16-bit @ src[j+1][i + {0..7}] for following inplace zip instruction
> + vzip.16 q7, q8 @ A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,L
nit: the comment should end with O,H,P (not O,H,L)
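
For reference, here is a rough scalar C sketch of the interleave that vzip.16 performs on two 8-lane registers, which is why the last lane should read P. The helper names zip16x8 and check_zip_comment are made up for illustration and are not part of the patch; this only models lane order, not the real instruction's timing or encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar model of "vzip.16 qA, qB" (hypothetical helper, not in the patch):
 * the 8 lanes of a and b are interleaved, the first 8 results are written
 * back to a and the last 8 to b, mirroring the in-place NEON operation.
 */
void zip16x8(uint16_t a[8], uint16_t b[8])
{
    uint16_t out[16];

    for (int k = 0; k < 8; k++) {
        out[2 * k]     = a[k];   /* even lanes come from the first register  */
        out[2 * k + 1] = b[k];   /* odd lanes come from the second register */
    }
    memcpy(a, out,     8 * sizeof(uint16_t));
    memcpy(b, out + 8, 8 * sizeof(uint16_t));
}

/* Replays the lettering used in the patch comments (A..H and I..P). */
void check_zip_comment(void)
{
    uint16_t q7[8] = { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
    uint16_t q8[8] = { 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P' };
    static const char lo[] = "AIBJCKDL";   /* expected q7 after the zip */
    static const char hi[] = "EMFNGOHP";   /* expected q8 after the zip */

    zip16x8(q7, q8);
    for (int k = 0; k < 8; k++) {
        assert(q7[k] == (uint16_t)lo[k]);
        assert(q8[k] == (uint16_t)hi[k]);
    }
}
```

Calling check_zip_comment() should pass silently: read across both registers, the result is A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,P.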
--
Ben