[FFmpeg-devel] [PATCH] avcodec/aarch64/aacencdsp: NEON implementation

Sun Jan 26 01:29:38 EET 2025

On Fri, 24 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> This patch supplies handwritten NEON code for AAC.
>
> The benchmarks below were collected by invoking these two commands on
> each of my boards, A78, A72 and Thinkpad x13s:
> 1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12
> 2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is
> speed without the patch, second, with)
>
>
> - A78
> abs_pow34_c:                                          4161.5 ( 1.00x)
> abs_pow34_neon:                                       3586.2 ( 1.16x)
> quant_bands_signed_c:                                 5548.0 ( 1.00x)
> quant_bands_signed_neon:                              1126.8 ( 4.92x)
> quant_bands_unsigned_c:                               3979.2 ( 1.00x)
> quant_bands_unsigned_neon:                             800.2 ( 4.97x)
>
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=71.6x
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=82.3x
>
> - A72
> abs_pow34_c:                                         15362.2 ( 1.00x)
> abs_pow34_neon:                                      15382.5 ( 1.00x)
> quant_bands_signed_c:                                 9926.5 ( 1.00x)
> quant_bands_signed_neon:                              2467.8 ( 4.02x)
> quant_bands_unsigned_c:                               5469.8 ( 1.00x)
> quant_bands_unsigned_neon:                            2089.5 ( 2.62x)
>
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=34.3x
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed=37.8
>
> - x13s
> abs_pow34_c:                                          2413.4 ( 1.00x)
> abs_pow34_neon:                                       1796.2 ( 1.34x)
> quant_bands_signed_c:                                 2968.9 ( 1.00x)
> quant_bands_signed_neon:                               675.6 ( 4.39x)
> quant_bands_unsigned_c:                               2311.9 ( 1.00x)
> quant_bands_unsigned_neon:                             477.1 ( 4.85x)
>
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 135x
> size=    5251KiB time=00:10:00.00 bitrate=  71.7kbits/s speed= 159x

Thanks for the numbers!

It's interesting how the numbers for abs_pow34 end up so identical on the 
A72, while they do differ on the other cores.

In my test, I'm getting the following numbers for that function:

                          Cortex A53      A72      A78
abs_pow34_c:                36889.0  15359.5  14345.0
abs_pow34_neon:              9237.0  15359.0   3592.5

This test is done with the same binary across all three devices. The test 
is made with a fairly old C compiler so that may explain the much larger 
difference in speed for the C version - but surprisingly, it is still 
almost identical in speed compared with the SIMD version on A72. Very 
surprising, but anyway, the code seems to add value.


The code overall looks ok, but there are some minor opportunities to make 
things a little bit better, primarily by looking at the scheduling.

Normally, scheduling mostly helps for the in-order cores, but I do see 
some smaller speedups on out-of-order cores here as well.

> diff --git a/libavcodec/aarch64/aacencdsp_neon.S b/libavcodec/aarch64/aacencdsp_neon.S
> new file mode 100644
> index 0000000000..f0bf013c85
> --- /dev/null
> +++ b/libavcodec/aarch64/aacencdsp_neon.S
> @@ -0,0 +1,62 @@
> +/*
> + * Copyright (c) 2025 Krzysztof Aleksander Pyrkosz
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_abs_pow34_neon, export=1
> +1:      subs            w2, w2, #4
> +        ld1             {v0.4s}, [x1], #16
> +        fabs            v0.4s, v0.4s
> +        fsqrt           v2.4s, v0.4s
> +        fmul            v0.4s, v2.4s, v0.4s
> +        fsqrt           v0.4s, v0.4s
> +        st1             {v0.4s}, [x0], #16
> +        b.ne            1b
> +        ret
> +endfunc

This mostly looks straightforward. As the whole instruction sequence 
mostly depends on the results of the previous instruction, there's not 
really much one can do here. But the main usual thing to do is to place 
the loop decrement instruction where it can help hide latency the best; 
either between a load ands the use of it (if there's nothing else that can 
be placed there), or e.g. between the last calculation and the store of 
it).

Here, I did the following modification:

@@ -21,8 +21,9 @@
  #include "libavutil/aarch64/asm.S"

  function ff_abs_pow34_neon, export=1
-1:      subs            w2, w2, #4
+1:
          ld1             {v0.4s}, [x1], #16
+        subs            w2, w2, #4
          fabs            v0.4s, v0.4s
          fsqrt           v2.4s, v0.4s
          fmul            v0.4s, v2.4s, v0.4s

And I got the following results:

Before:                  Cortex A53      A72      A78
abs_pow34_neon:              9488.0  15359.0   3586.5
Aftere:
abs_pow34_neon:              9232.0  15359.0   3586.5

Thus, no change on the big out-of-order cores, but a small consistent 
improvement on the in-order core. I.e. no reason not to do it here.

> +
> +function ff_aac_quant_bands_neon, export=1
> +        scvtf           s2, w5
> +        dup             v1.4s, v1.s[0]
> +        dup             v2.4s, v2.s[0]
> +        cbz             w4, 0f
> +        movi            v5.4s, 0x80, lsl #24
> +.irp signed,1,0
> +\signed:
> +        subs            w3, w3, #4
> +        ld1             {v3.4s}, [x2], #16
> +        fmul            v3.4s, v3.4s, v0.s[0]
> +.if \signed
> +        ld1             {v4.4s}, [x1], #16
> +.endif

Here, you could do the same - do the "subs" after the first ld1 (while 
we're waiting for the result to be loaded) anyway. I see that you've 
interleaved the extra operations on v4 here, that's good. We could try 
doing the load of v4 before the first fmul, but that doesn't seem to make 
any difference at all.

With the following diff:

@@ -40,8 +41,8 @@ function ff_aac_quant_bands_neon, export=1
          movi            v5.4s, 0x80, lsl #24
  .irp signed,1,0
  \signed:
-        subs            w3, w3, #4
          ld1             {v3.4s}, [x2], #16
+        subs            w3, w3, #4
          fmul            v3.4s, v3.4s, v0.s[0]
  .if \signed
          ld1             {v4.4s}, [x1], #16

I'm getting the following improvement:

Before:                  Cortex A53      A72      A78
quant_bands_signed_neon:     5661.0   2383.2   1113.2
quant_bands_unsigned_neon:   5401.5   2067.8    811.8
After:
quant_bands_signed_neon:     5402.5   2385.5   1090.0
quant_bands_unsigned_neon:   5145.5   2067.8    809.5

No change on the A72 here, but apparently a (very) small improvement on 
the A78, and a bigger improvement on the A53 as expected.

If you don't mind these changes, we could land the change with that 
tweaked. (I guess the numbers in the commit message could be re-measured, 
but I'm not sure if they change enough to make much of a difference there, 
especially on the cores you've measured on.)

// Martin