[FFmpeg-devel] [PATCH] lavu/tx: implement aarch64 NEON SIMD
Martin Storsjö
martin at martin.st
Thu Aug 25 13:51:57 EEST 2022
On Sun, 14 Aug 2022, Lynne wrote:
> The fastest fast Fourier transform in not just the west, but the world,
> now for the most popular toy ISA.
>
> On a high level, it follows the design of the AVX2 version closely,
> with the exception that the input is slightly less permuted as we don't have
> to do lane switching with the input on double 4pt and 8pt.
>
> On a low level, the lack of subadd/addsub instructions REALLY penalizes
> any attempt at writing an FFT. That single register matters a _lot_,
> and reloading it simply takes unacceptably long.
> In x86 land, vendors would've noticed developers need this.
> In ARM land, you get a badly designed complex multiplication instruction
> we cannot use, that's not present on 95% of devices. Because only
> compilers matter, right?
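
For readers unfamiliar with the addsub operation referred to above, here is a minimal scalar sketch in C. It is not taken from the patch. It assumes the x86-style semantics (even lanes subtract, odd lanes add) and shows one common way to emulate it with a plain add plus a constant sign pattern. The function names are hypothetical. In SIMD code that sign pattern would have to live in a register, which is presumably the extra register the quoted text refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of an x86-style addsub on interleaved floats:
 * even lanes compute a - b, odd lanes compute a + b. */
static void addsub_ref(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 2 == 0); /* lanes come in (even, odd) pairs */
    for (size_t i = 0; i < n; i++)
        dst[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Emulation using only a plain add: b is first multiplied by a constant
 * sign pattern {-1, +1}. In a SIMD version this pattern would occupy a
 * vector register (assumption for illustration, not the patch's code). */
static void addsub_emulated(float *dst, const float *a, const float *b, size_t n)
{
    static const float sign[2] = { -1.0f, 1.0f };
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i] * sign[i & 1];
}
```

Multiplying by -1 or +1 is exact in floating point, so both functions produce identical results; the only cost of the emulation is the extra constant and the extra operation.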
>
> There's still room for improvement. I think using stp
> instead of st1 may help in a few places, some reordering
> may help performance in the recombination macro,
> and there are other TODOs I've left marked in the code.
> There are also a few places where the limited range on
> immediates in adds may be worked around.
>
> All timings below are in cycles:
> A53:
> Length |       C | New (lavu) | Old (lavc) |    FFTW
> ------ | ------- | ---------- | ---------- | -------
>      4 |     842 |        420 |       1210 |    1460
>      8 |    1538 |       1020 |       1850 |    2520
>     16 |    3717 |       1900 |       3700 |    3990
>     32 |    9156 |       4070 |       8289 |    8860
>     64 |   21160 |       9931 |      18600 |   19625
>    128 |   49180 |      23278 |      41922 |   41922
>    256 |  112073 |      53876 |      93202 |  101092
>    512 |  252864 |     122884 |     205897 |  207868
>   1024 |  560512 |     278322 |     458071 |  453053
>   2048 | 1295402 |     775835 |    1038205 | 1020265
>   4096 | 3281263 |    2021221 |    2409718 | 2577554
>   8192 | 8577845 |    4780526 |    5673041 | 6802722
>
> Apple M1
> New - Total for len 512 reps 2097152 = 1.459141 s
> Old - Total for len 512 reps 2097152 = 2.251344 s
> FFTW - Total for len 512 reps 2097152 = 1.868429 s
>
> New - Total for len 1024 reps 4194304 = 6.490080 s
> Old - Total for len 1024 reps 4194304 = 9.604949 s
> FFTW - Total for len 1024 reps 4194304 = 7.889281 s
>
> New - Total for len 16384 reps 262144 = 10.374001 s
> Old - Total for len 16384 reps 262144 = 15.266713 s
> FFTW - Total for len 16384 reps 262144 = 12.341745 s
>
> New - Total for len 65536 reps 8192 = 1.769812 s
> Old - Total for len 65536 reps 8192 = 4.209413 s
> FFTW - Total for len 65536 reps 8192 = 3.012365 s
>
> New - Total for len 131072 reps 4096 = 1.942836 s
> Old - Segfaults
> FFTW - Total for len 131072 reps 4096 = 3.713713 s
>
> Patch attached.
I've had a look at this now.
I don't have much to add/comment about the core implementation itself and
the performance of it (I didn't try to read it and follow it from that
perspective).
With regard to non-functional aspects, the patch needs a couple of fixes to
build with other assemblers (binutils, and MS armasm64.exe). I've also made a
couple of minor fixes: instead of using a series of mov+add+add for loading a
large constant, use the ldr= pseudo-instruction, which is made exactly for
loading odd constants, and avoid unnecessary \() operators after macro
arguments.
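
As background on why a large constant ends up as several instructions in the first place: the AArch64 ADD (immediate) encoding only holds a 12-bit unsigned value, optionally shifted left by 12 bits. The following C sketch, which is not from the patch and uses hypothetical helper names, checks whether a value fits that encoding and splits a 24-bit constant into the two pieces that a pair of adds could apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v fits the AArch64 ADD (immediate) encoding: a 12-bit unsigned
 * value, optionally shifted left by 12 bits. */
static bool fits_add_imm(uint32_t v)
{
    return v <= 0xfff || ((v & 0xfff) == 0 && (v >> 12) <= 0xfff);
}

/* Split a 24-bit constant into a high part (for "add ..., #hi, lsl #12")
 * and a low part (for "add ..., #lo"). Hypothetical helper, shown only to
 * illustrate why one constant can cost two adds. */
static void split_add_imm(uint32_t v, uint32_t *hi, uint32_t *lo)
{
    assert(v < (1u << 24)); /* anything larger needs more than two adds */
    *hi = v & 0xfff000;
    *lo = v & 0x000fff;
}
```

A constant that fails fits_add_imm() needs either such a split or a single literal load, which is what the ldr= form asks the assembler to emit.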
See https://github.com/mstorsjo/ffmpeg/commits/aarch64-fft for my
incremental fixes on top; at least the first three are needed to fix
assembling with the other tools, but all of them up to the WIP commit (for
removing prefetching) are probably worth including; feel free to squash
these into your patch.
Coding style wise, it looks mostly reasonable; some things use a bit
nonstandard style (spaces within {} for loads/stores, and some operand
columns are right-adjusted instead of left-adjusted), but it's probably
acceptable as such.
// Martin