[FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon

Sun Jan 19 22:41:22 EET 2025

On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
> libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
> endfunc
>
> function ff_vector_fmac_scalar_neon, export=1
> -        mov             x3,  #-32
> 1:      subs            w2,  w2,  #16
> -        ld1             {v16.4s, v17.4s}, [x0], #32
> -        ld1             {v18.4s, v19.4s}, [x0], x3
> -        ld1             {v4.4s,  v5.4s},  [x1], #32
> -        ld1             {v6.4s,  v7.4s},  [x1], #32
> +        ldp             q16, q17, [x0]
> +        ldp             q4, q5, [x1], #32
>         fmla            v16.4s, v4.4s,  v0.s[0]
>         fmla            v17.4s, v5.4s,  v0.s[0]
> +        stp             q16, q17, [x0], #32
> +        ldp             q18, q19, [x0]
> +        ldp             q6, q7, [x1], #32
>         fmla            v18.4s, v6.4s,  v0.s[0]
>         fmla            v19.4s, v7.4s,  v0.s[0]
> -        st1             {v16.4s, v17.4s}, [x0], #32
> -        st1             {v18.4s, v19.4s}, [x0], #32
> +        stp             q18, q19, [x0], #32
>         b.ne            1b
>         ret
> endfunc
> --

The change looks ok to me, but for changes like this, please do quote 
actual checkasm benchmark numbers, with the actual numbers before/after 
(not just the change in speedup factor), and include such numbers in the 
commit message for reference. And include info about what core you have 
benchmarked it on.


The reason for that is that various tunings of assembly like this can have 
a lot of effect on one core, and little/no effect on others, or even 
contradictory effects on others. So in order not to go back and forth (if 
someone else comes along next week and does a different tuning for a 
different core), please include info about the actual numbers and what you 
have benchmarked on.


I benchmarked this change, on a handful of different cores, and got the 
following numbers:

Before:               Cortex A53    A72     A73    A78
vector_fmac_scalar_neon:   606.0  213.8   321.8  115.5
After:
vector_fmac_scalar_neon:   608.0  198.2   303.2   90.8

So it shows no change (but no regression either) on A53, and a 5-27% 
speedup on in-order cores. So it looks like a reasonable tradeoff.


However, to show the contrast, if one were to favour optimizations for the 
A53, one could do this simple change:

--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -43,8 +43,8 @@ function ff_vector_fmac_scalar_neon, export=1
          mov             x3,  #-32
  1:      subs            w2,  w2,  #16
          ld1             {v16.4s, v17.4s}, [x0], #32
-        ld1             {v18.4s, v19.4s}, [x0], x3
          ld1             {v4.4s,  v5.4s},  [x1], #32
+        ld1             {v18.4s, v19.4s}, [x0], x3
          ld1             {v6.4s,  v7.4s},  [x1], #32
          fmla            v16.4s, v4.4s,  v0.s[0]
          fmla            v17.4s, v5.4s,  v0.s[0]

This gives the following numbers:

Before:               Cortex A53    A72     A73    A78
vector_fmac_scalar_neon:   606.0  213.8   321.8  115.5
After:
vector_fmac_scalar_neon:   572.0  213.8   308.0  115.2

I.e. such a change would be almost as good for the A73 and even better for 
A53, while having no effect on A72 and A78.

In this case, the change probably is fine and probably a good compromise. 
But I wanted to show this example to highlight the fact that tunings of 
functions can be very subjective.

// Martin