[FFmpeg-devel] [PATCH 4/4] swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter

Martin Storsjö martin at martin.st
Wed Aug 14 15:31:12 EEST 2024


On Fri, 9 Aug 2024, Ramiro Polla wrote:

> checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
> nv24_yuv420p_128_c: 423.0
> nv24_yuv420p_128_neon: 115.7
> nv24_yuv420p_1920_c: 5939.5
> nv24_yuv420p_1920_neon: 1339.7
> nv42_yuv420p_128_c: 423.2
> nv42_yuv420p_128_neon: 115.7
> nv42_yuv420p_1920_c: 5907.5
> nv42_yuv420p_1920_neon: 1342.5
> ---
> libswscale/aarch64/Makefile                |  1 +
> libswscale/aarch64/swscale_unscaled.c      | 30 +++++++++
> libswscale/aarch64/swscale_unscaled_neon.S | 75 ++++++++++++++++++++++
> 3 files changed, 106 insertions(+)
> create mode 100644 libswscale/aarch64/swscale_unscaled_neon.S

> diff --git a/libswscale/aarch64/swscale_unscaled_neon.S b/libswscale/aarch64/swscale_unscaled_neon.S
> new file mode 100644
> index 0000000000..a206fda41f
> --- /dev/null
> +++ b/libswscale/aarch64/swscale_unscaled_neon.S
> @@ -0,0 +1,75 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_nv24_to_yuv420p_chroma_neon, export=1
> +// x0  uint8_t *dst1
> +// x1  int dstStride1
> +// x2  uint8_t *dst2
> +// x3  int dstStride2
> +// x4  const uint8_t *src
> +// x5  int srcStride
> +// w6  int w
> +// w7  int h
> +
> +        uxtw            x1, w1
> +        uxtw            x3, w3
> +        uxtw            x5, w5

You can often avoid the explicit uxtw instructions if you can fold a 
uxtw extension into the instructions where the register is used. (If the 
register is used often, it may be slightly more performant to do it 
upfront like this, but often it can be omitted entirely.) And whenever an 
operation has a wN register as destination, the upper half of the 
register gets implicitly cleared, so some of these can also be avoided 
that way.
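
For example (untested), the uxtw of x5 could be folded into the add, and 
the lsl (with a wN register as destination) already zeroes the upper half 
of x5 for the later uses:

        add             x9, x4, w5, uxtw            // x9 = src + srcStride
        lsl             w5, w5, #1                  // srcStride *= 2 (also clears
                                                    // the upper half of x5)

The uxtw of x1/x3 could similarly be folded into the row advance, e.g. 
"add x0, x0, w1, uxtw", assuming non-negative strides (which the current 
uxtw already assumes).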

> +
> +        add             x9, x4, x5                  // x9 = src + srcStride
> +        lsl             w5, w5, #1                  // srcStride *= 2
> +
> +1:
> +        mov             w10, w6                     // w10 = w
> +        mov             x11, x4                     // x11 = src1 (line 1)
> +        mov             x12, x9                     // x12 = src2 (line 2)
> +        mov             x13, x0                     // x13 = dst1 (dstU)
> +        mov             x14, x2                     // x14 = dst2 (dstV)
> +
> +2:
> +        ld2             { v0.16b, v1.16b }, [x11], #32 // v0 = U1, v1 = V1
> +        ld2             { v2.16b, v3.16b }, [x12], #32 // v2 = U2, v3 = V2
> +
> +        uaddlp          v0.8h, v0.16b               // pairwise add U1 into v0
> +        uaddlp          v1.8h, v1.16b               // pairwise add V1 into v1
> +        uadalp          v0.8h, v2.16b               // pairwise add U2, accumulate into v0
> +        uadalp          v1.8h, v3.16b               // pairwise add V2, accumulate into v1
> +
> +        shrn            v0.8b, v0.8h, #2            // divide by 4
> +        shrn            v1.8b, v1.8h, #2            // divide by 4
> +
> +        st1             { v0.8b }, [x13], #8        // store U into dst1
> +        st1             { v1.8b }, [x14], #8        // store V into dst2
> +
> +        subs            w10, w10, #8
> +        b.gt            2b
> +
> +        // next row
> +        add             x4, x4, x5                  // src1 += srcStride * 2
> +        add             x9, x9, x5                  // src2 += srcStride * 2
> +        add             x0, x0, x1                  // dst1 += dstStride1
> +        add             x2, x2, x3                  // dst2 += dstStride2

It's often possible to avoid the extra step of moving the pointers back 
into the x11/x12/x13/x14 registers, if you subtract the width from the 
stride at the start of the function. Then you don't need two separate 
registers for each pointer, and it shortens the dependency chain when 
moving on to the next line.

If the width can be any uneven value, but we in practice write in 
increments of 8 pixels, you may need to align the width up to a multiple 
of 8 before using it to decrement the stride that way, though.
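
Roughly like this (untested sketch, just to illustrate the idea; it 
assumes w6 counts output chroma pixels, so the inner loop consumes 32 
source bytes and writes 8 bytes per destination plane for each 8 pixels, 
and that the strides are non-negative and cover at least the aligned 
width):

        // before the outer loop:
        add             w11, w6, #7
        and             w11, w11, #0xfffffff8       // w11 = w aligned up to 8
        sub             w1, w1, w11                 // dstStride1 -= aligned w
        sub             w3, w3, w11                 // dstStride2 -= aligned w
        add             x9, x4, w5, uxtw            // x9 = src + srcStride
        lsl             w5, w5, #1                  // srcStride *= 2
        sub             w5, w5, w11, lsl #2         // srcStride*2 -= 4 * aligned w

        // the inner loop then loads via x4/x9 and stores via x0/x2 directly
        // (no movs into x11-x14), and the end of each row becomes just:
        add             x4, x4, x5
        add             x9, x9, x5
        add             x0, x0, x1
        add             x2, x2, x3

Since the stride adjustments above write to wN registers, the upper 
halves get cleared there too, so no separate uxtw is needed for the 
end-of-row adds.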

// Martin


