[FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon
Martin Storsjö
martin at martin.st
Sun Dec 1 23:01:29 EET 2019
On Sun, 1 Dec 2019, Clément Bœsch wrote:
> On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
> [...]
>> From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
>> From: Sebastian Pop <spop at amazon.com>
>> Date: Sun, 17 Nov 2019 14:13:13 -0600
>> Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4
>>
>> This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
>> and bumps the vectorization factor from 2 to 4.
>> The speedup is of 34% on Graviton A1 instances based on A-72 cpus:
>>
>> $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>> before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
>> after: t:0.030079 avg:0.030102 max:0.030462 min:0.030051
>>
>> Tested with `make check` on aarch64-linux.
FWIW, I'm not certain how well this routine is actually tested by that;
in particular, there's no checkasm test for it as far as I can see.
>> + add x17, x3, x18 // srcp + filterPos[0]
>> + add x18, x3, x0 // srcp + filterPos[1]
>> + add x0, x3, x2 // srcp + filterPos[2]
>> + add x2, x3, x6 // srcp + filterPos[3]
>
>> +2: ldr d4, [x17, x15] // srcp[filterPos[0] + {0..7}]
>> + ldr q5, [x16] // load 8x16-bit filter values, part 1
>> + ldr d6, [x18, x15] // srcp[filterPos[1] + {0..7}]
>> + ldr q7, [x16, x12] // load 8x16-bit at filter+filterSize
>
> Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn is
> very confusing here.
The ldr instruction, unlike ld1, allows you to do a load (or, similarly,
a store with str instead of st1) with a constant/register offset, like
[x17, x15] here, without incrementing the source register in between
each load. That can help with latency between individual load
instructions, or can avoid extra instructions for incrementing the
register in between.
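
As a rough illustration of the difference (a sketch only, not taken from
the patch; x7 is just a scratch register picked for the example):

```
// ld1 only takes a plain base register (optionally with post-increment),
// so the offset has to be added in a separate instruction first:
        add     x7,  x17, x15
        ld1     {v4.8b}, [x7]           // load 8 bytes from x17 + x15

// ldr can fold the register offset into the load itself,
// leaving x17 untouched for the next load:
        ldr     d4,  [x17, x15]         // load 8 bytes from x17 + x15
```

Both variants load the same 8 bytes; the ldr form just needs one
instruction less and no extra register.
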
The ldr form works for loading the first 1/2/4/8/16 bytes of a vector,
but can't be used e.g. for loading a lane other than the first (e.g.
ld1 {v4.s}[1]). It also requires using the b/h/s/d/q names for the
registers instead of v.
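
To make the naming a bit more concrete, here is a small sketch (x0 as
the address register is an arbitrary choice for the example, and the
ld1 comparisons assume little-endian):

```
        ldr     b4, [x0]        // 1 byte into the lowest byte of v4
        ldr     h4, [x0]        // 2 bytes
        ldr     s4, [x0]        // 4 bytes
        ldr     d4, [x0]        // 8 bytes, same data as ld1 {v4.8b}, [x0]
        ldr     q4, [x0]        // 16 bytes, same data as ld1 {v4.16b}, [x0]

        ld1     {v4.s}[1], [x0] // 4 bytes into lane 1 only; ldr has no
                                // equivalent for this
```

So b4/h4/s4/d4/q4 all refer to the low part of the same register as v4,
just with the access size encoded in the name.
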
I didn't check whether the changes here are essential for the
optimization, though.
// Martin