[FFmpeg-devel] [PATCH] lavu/tx: implement aarch64 NEON SIMD

Thu Aug 25 13:51:57 EEST 2022

On Sun, 14 Aug 2022, Lynne wrote:

> The fastest fast Fourier transform in not just the west, but the world,
> now for the most popular toy ISA.
>
> On a high level, it follows the design of the AVX2 version closely,
> with the exception that the input is slightly less permuted as we don't have
> to do lane switching with the input on double 4pt and 8pt.
>
> On a low level, the lack of subadd/addsub instructions REALLY penalizes
> any attempt at writing an FFT. That single register matters a _lot_,
> and reloading it simply takes unacceptably long.
> In x86 land, vendors would've noticed developers need this.
> In ARM land, you get a badly designed complex multiplication instruction
> we cannot use, that's not present on 95% of devices. Because only
> compilers matter, right?
>
> There's still room for improvement. I think using stp
> instead of st1 may help in a few places, some reordering
> may help performance in the recombination macro,
> and there are other TODOs I've left marked in the code.
> There are also a few places where the limited range on
> immediates in adds may be worked around.
>
> All timings below are in cycles:
> A53:
> Length | C           | New (lavu)  | Old (lavc)  | FFTW
> ------ |-------------|-------------|-------------|-----
> 4      |         842 | 420         | 1210        | 1460
> 8      |        1538 | 1020        | 1850        | 2520
> 16     |        3717 | 1900        | 3700        | 3990
> 32     |        9156 | 4070        | 8289        | 8860
> 64     |       21160 | 9931        | 18600       | 19625
> 128    |       49180 | 23278       | 41922       | 41922
> 256    |      112073 | 53876       | 93202       | 101092
> 512    |      252864 | 122884      | 205897      | 207868
> 1024   |      560512 | 278322      | 458071      | 453053
> 2048   |     1295402 | 775835      | 1038205     | 1020265
> 4096   |     3281263 | 2021221     | 2409718     | 2577554
> 8192   |     8577845 | 4780526     | 5673041     | 6802722
>
> Apple M1
> New  - Total for len 512 reps 2097152 = 1.459141 s
> Old  - Total for len 512 reps 2097152 = 2.251344 s
> FFTW - Total for len 512 reps 2097152 = 1.868429 s
>
> New  - Total for len 1024 reps 4194304 = 6.490080 s
> Old  - Total for len 1024 reps 4194304 = 9.604949 s
> FFTW - Total for len 1024 reps 4194304 = 7.889281 s
>
> New  - Total for len 16384 reps 262144 = 10.374001 s
> Old  - Total for len 16384 reps 262144 = 15.266713 s
> FFTW - Total for len 16384 reps 262144 = 12.341745 s
>
> New  - Total for len 65536 reps 8192 = 1.769812 s
> Old  - Total for len 65536 reps 8192 = 4.209413 s
> FFTW - Total for len 65536 reps 8192 = 3.012365 s
>
> New  - Total for len 131072 reps 4096 = 1.942836 s
> Old  - Segfaults
> FFTW - Total for len 131072 reps 4096 = 3.713713 s
>
> Patch attached.

I've had a look at this now.

I don't have much to add/comment about the core implementation itself and 
the performance of it (I didn't try to read it and follow it from that 
perspective).
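
For anyone reading along who wonders about the addsub point in the quoted
description, here is a rough scalar C model of it. This is only my reading,
not code from the patch: the names (cplx, addsub, cmul_addsub,
cmul_signmask) are made up for illustration. An addsub subtracts in the
even lane and adds in the odd lane (what x86 ADDSUBPS does), which is
exactly the sign pattern of a complex multiply. Without it, the sign flip
has to come from a constant that occupies a register.

```c
/* One complex value as an (re, im) pair, i.e. two adjacent SIMD lanes. */
typedef struct { float re, im; } cplx;

/* Scalar model of addsub on one pair: subtract in the even (re) lane,
 * add in the odd (im) lane. */
static cplx addsub(cplx a, cplx b)
{
    return (cplx){ a.re - b.re, a.im + b.im };
}

/* Complex multiply x * w using addsub: no sign constant needed. */
static cplx cmul_addsub(cplx x, cplx w)
{
    cplx t0 = { x.re * w.re, x.im * w.re }; /* x * broadcast(w.re)       */
    cplx t1 = { x.im * w.im, x.re * w.im }; /* swap(x) * broadcast(w.im) */
    return addsub(t0, t1);
}

/* Same multiply without addsub: the per-lane sign comes from an extra
 * operand (assumed to be {-1, +1}) that must be kept in a register. */
static cplx cmul_signmask(cplx x, cplx w, cplx sign)
{
    cplx t0 = { x.re * w.re, x.im * w.re };
    cplx t1 = { x.im * w.im, x.re * w.im };
    return (cplx){ t0.re + sign.re * t1.re, t0.im + sign.im * t1.im };
}
```

Both variants give -5 + 10i for (1 + 2i) * (3 + 4i); the second one just
needs the extra {-1, +1} operand to get there.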

Wrt non-functional aspects, the patch needs a couple of fixes to build with
other assemblers (binutils, and MS armasm64.exe). I've also done a couple of
minor fixes: instead of using a series of mov+add+add for loading a large
constant, use the ldr= pseudo-instruction, which is made exactly for
loading odd constants, and avoid unnecessary \() operators after macro
arguments.
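
As background on why a mov+add+add chain shows up at all: an AArch64
ADD/SUB immediate is a 12-bit unsigned value, optionally shifted left by
12, so anything else has to be split over several instructions (or loaded
with ldr=, which fetches the constant from a literal pool). A minimal C
sketch of that encoding rule follows; the helper names add_imm_encodable
and add_imm_steps are hypothetical and only for illustration, they are not
part of the patch or of any assembler.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v can be used directly as an AArch64 ADD/SUB immediate:
 * a 12-bit unsigned value, optionally shifted left by 12 bits. */
static bool add_imm_encodable(uint64_t v)
{
    return (v & ~UINT64_C(0xfff)) == 0 ||
           (v & ~UINT64_C(0xfff000)) == 0;
}

/* Number of ADD instructions needed to add v using immediates only:
 * one for a nonzero low 12-bit chunk, one for a nonzero high 12-bit
 * chunk. Returns -1 if v does not fit in 24 bits, in which case the
 * constant has to be materialized in a register some other way. */
static int add_imm_steps(uint64_t v)
{
    if (v >> 24)
        return -1;
    return ((v & UINT64_C(0xfff)) != 0) + ((v & UINT64_C(0xfff000)) != 0);
}
```

For example, 0x5000 fits in a single add, while 0x1001 needs two (0x1000
and 0x1), which is the kind of sequence a single literal load can replace.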

See https://github.com/mstorsjo/ffmpeg/commits/aarch64-fft for my
incremental fixes on top; at least the first three are needed to fix
assembling with the other tools, but all of them up to the WIP commit
(which removes prefetching) are probably worth including; feel free to
squash these into your patch.

Coding-style-wise, it looks mostly reasonable; some things use a slightly
nonstandard style (spaces within {} for loads/stores, and some operand
columns are right-adjusted instead of left-adjusted), but it's probably
acceptable as such.

// Martin