[FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmul_add_neon
Krzysztof Pyrkosz
ffmpeg at szaka.eu
Sun Jan 19 19:33:00 EET 2025
Remove a branch and unroll the loop so that each iteration processes 16 floats. The speedup rises from 3.95x to 5.60x.
Krzysztof
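
For reference, a minimal C sketch of what the routine computes and of the control flow of the rewritten loop. The function names are illustrative only and are not FFmpeg symbols. The second function assumes len is a positive multiple of 16, which the "sub w4, w4, #16" / "cbnz w4, 1b" pair in the new code relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the operation: dst[i] = src0[i] * src1[i] + src2[i].
 * Hypothetical name, used here for illustration only. */
static void vector_fmul_add_ref(float *dst, const float *src0,
                                const float *src1, const float *src2,
                                int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i] + src2[i];
}

/* Model of the rewritten loop: 16 floats per iteration, with the counter
 * tested once at the bottom, so there is no branch in the middle of the
 * loop body. Assumes len is a positive multiple of 16. */
static void vector_fmul_add_x16(float *dst, const float *src0,
                                const float *src1, const float *src2,
                                int len)
{
    assert(len > 0 && len % 16 == 0);
    do {
        /* Stands in for the two ldp/ldp/ldp/fmla/fmla/stp groups,
         * each of which handles 8 floats. */
        for (int i = 0; i < 16; i++)
            dst[i] = src0[i] * src1[i] + src2[i];
        dst  += 16;
        src0 += 16;
        src1 += 16;
        src2 += 16;
        len  -= 16;
    } while (len != 0);
}
```

With a length that is a multiple of 8 but not of 16, the counter in the new loop would step past zero instead of reaching it, so callers are assumed to respect the multiple-of-16 requirement.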
---
libavutil/aarch64/float_dsp_neon.S | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
index 35e2715b87..0ee5c67b91 100644
--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -138,19 +138,21 @@ function ff_vector_fmul_window_neon, export=1
endfunc
function ff_vector_fmul_add_neon, export=1
- ld1 {v0.4s, v1.4s}, [x1], #32
- ld1 {v2.4s, v3.4s}, [x2], #32
- ld1 {v4.4s, v5.4s}, [x3], #32
-1: subs w4, w4, #8
- fmla v4.4s, v0.4s, v2.4s
- fmla v5.4s, v1.4s, v3.4s
- b.eq 2f
- ld1 {v0.4s, v1.4s}, [x1], #32
- ld1 {v2.4s, v3.4s}, [x2], #32
- st1 {v4.4s, v5.4s}, [x0], #32
- ld1 {v4.4s, v5.4s}, [x3], #32
- b 1b
-2: st1 {v4.4s, v5.4s}, [x0], #32
+1:
+ ldp q0, q1, [x1], #32
+ ldp q2, q3, [x2], #32
+ ldp q4, q5, [x3], #32
+ fmla v4.4s, v0.4s, v2.4s
+ fmla v5.4s, v1.4s, v3.4s
+ stp q4, q5, [x0], #32
+ ldp q0, q1, [x1], #32
+ ldp q2, q3, [x2], #32
+ ldp q4, q5, [x3], #32
+ fmla v4.4s, v0.4s, v2.4s
+ fmla v5.4s, v1.4s, v3.4s
+ stp q4, q5, [x0], #32
+ sub w4, w4, #16
+ cbnz w4, 1b
ret
endfunc
--
2.45.2