[FFmpeg-devel] [PATCH 4/4] swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter
Martin Storsjö
martin at martin.st
Wed Aug 14 15:31:12 EEST 2024
On Fri, 9 Aug 2024, Ramiro Polla wrote:
> checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
> nv24_yuv420p_128_c: 423.0
> nv24_yuv420p_128_neon: 115.7
> nv24_yuv420p_1920_c: 5939.5
> nv24_yuv420p_1920_neon: 1339.7
> nv42_yuv420p_128_c: 423.2
> nv42_yuv420p_128_neon: 115.7
> nv42_yuv420p_1920_c: 5907.5
> nv42_yuv420p_1920_neon: 1342.5
> ---
> libswscale/aarch64/Makefile | 1 +
> libswscale/aarch64/swscale_unscaled.c | 30 +++++++++
> libswscale/aarch64/swscale_unscaled_neon.S | 75 ++++++++++++++++++++++
> 3 files changed, 106 insertions(+)
> create mode 100644 libswscale/aarch64/swscale_unscaled_neon.S
> diff --git a/libswscale/aarch64/swscale_unscaled_neon.S b/libswscale/aarch64/swscale_unscaled_neon.S
> new file mode 100644
> index 0000000000..a206fda41f
> --- /dev/null
> +++ b/libswscale/aarch64/swscale_unscaled_neon.S
> @@ -0,0 +1,75 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_nv24_to_yuv420p_chroma_neon, export=1
> +// x0 uint8_t *dst1
> +// x1 int dstStride1
> +// x2 uint8_t *dst2
> +// x3 int dstStride2
> +// x4 const uint8_t *src
> +// x5 int srcStride
> +// w6 int w
> +// w7 int h
> +
> + uxtw x1, w1
> + uxtw x3, w3
> + uxtw x5, w5
You can often avoid the explicit uxtw instructions by folding a uxtw
extension into the operands where the register is used. (If the register
is used often, it may be slightly faster to extend it upfront like this,
but often the separate instruction can be omitted entirely.) Also,
whenever an operation has a wN register as destination, the upper half
of the register is implicitly cleared, so some of these can be avoided
that way too.
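As an illustration of both points (registers and offsets taken from the
quoted patch; this is a sketch, not tested):

```
        // fold the zero-extension of the stride into the add,
        // instead of a separate "uxtw x5, w5":
        add     x9, x4, w5, uxtw        // x9 = src + srcStride
        lsl     w5, w5, #1              // srcStride *= 2; writing to w5
                                        // also clears the upper half of
                                        // x5, so later "add x4, x4, x5"
                                        // works without any uxtw
```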
> +
> + add x9, x4, x5 // x9 = src + srcStride
> + lsl w5, w5, #1 // srcStride *= 2
> +
> +1:
> + mov w10, w6 // w10 = w
> + mov x11, x4 // x11 = src1 (line 1)
> + mov x12, x9 // x12 = src2 (line 2)
> + mov x13, x0 // x13 = dst1 (dstU)
> + mov x14, x2 // x14 = dst2 (dstV)
> +
> +2:
> + ld2 { v0.16b, v1.16b }, [x11], #32 // v0 = U1, v1 = V1
> + ld2 { v2.16b, v3.16b }, [x12], #32 // v2 = U2, v3 = V2
> +
> + uaddlp v0.8h, v0.16b // pairwise add U1 into v0
> + uaddlp v1.8h, v1.16b // pairwise add V1 into v1
> + uadalp v0.8h, v2.16b // pairwise add U2, accumulate into v0
> + uadalp v1.8h, v3.16b // pairwise add V2, accumulate into v1
> +
> + shrn v0.8b, v0.8h, #2 // divide by 4
> + shrn v1.8b, v1.8h, #2 // divide by 4
> +
> + st1 { v0.8b }, [x13], #8 // store U into dst1
> + st1 { v1.8b }, [x14], #8 // store V into dst2
> +
> + subs w10, w10, #8
> + b.gt 2b
> +
> + // next row
> + add x4, x4, x5 // src1 += srcStride * 2
> + add x9, x9, x5 // src2 += srcStride * 2
> + add x0, x0, x1 // dst1 += dstStride1
> + add x2, x2, x3 // dst2 += dstStride2
It's often possible to avoid the extra step of moving the pointers back
into the x11/x12/x13/x14 registers, if you subtract the width from the
stride at the start of the function. Then you don't need two separate
registers for each pointer, and it shortens the dependency chain when
moving on to the next line.

If the width can be any value, while we in practice write in increments
of 8 pixels, you may need to align the width up to 8 before using it to
decrement the stride that way, though.
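A sketch of that restructuring, assuming w counts output (chroma)
pixels, 32 source bytes are consumed per 8 output pixels as in the
quoted inner loop, and x5 already holds the doubled srcStride
(untested, register choices illustrative):

```
        add     w10, w6, #7
        bic     w10, w10, #7            // w10 = w aligned up to 8
        sub     x5, x5, w10, uxtw #2    // 2*srcStride -= 4 * aligned w
        sub     x1, x1, w10, uxtw       // dstStride1  -= aligned w
        sub     x3, x3, w10, uxtw       // dstStride2  -= aligned w
        // The inner loop can then advance x4/x9/x0/x2 directly, and the
        // per-row step is only "add x4, x4, x5" etc., without the movs
        // back into x11/x12/x13/x14.
```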
// Martin
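For reference, the transform the quoted NEON loop implements can be
sketched in scalar C: each output chroma sample is the average of a 2x2
block of full-resolution interleaved UV samples, truncated like shrn #2
(function name and signature are illustrative, not FFmpeg's):

```c
#include <stdint.h>

/* Scalar sketch of the NV24 -> yuv420p chroma conversion: average each
 * 2x2 block of interleaved UV samples into one U and one V output
 * sample. w and h are in output (chroma plane) pixels. */
static void nv24_to_yuv420p_chroma_ref(uint8_t *dstU, int dstStrideU,
                                       uint8_t *dstV, int dstStrideV,
                                       const uint8_t *src, int srcStride,
                                       int w, int h)
{
    for (int y = 0; y < h; y++) {
        const uint8_t *src1 = src + 2 * y * srcStride; // even source line
        const uint8_t *src2 = src1 + srcStride;        // odd source line
        for (int x = 0; x < w; x++) {
            // Each output sample covers two UV pairs on two lines.
            int u = src1[4 * x]     + src1[4 * x + 2]
                  + src2[4 * x]     + src2[4 * x + 2];
            int v = src1[4 * x + 1] + src1[4 * x + 3]
                  + src2[4 * x + 1] + src2[4 * x + 3];
            dstU[y * dstStrideU + x] = u >> 2; // divide by 4, truncating,
            dstV[y * dstStrideV + x] = v >> 2; // matching shrn #2
        }
    }
}
```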