[FFmpeg-devel] [aarch64] improve performance of ff_hscale_8_to_15_neon

Thu Dec 5 01:24:46 EET 2019

Hi Clément,

please find attached the updated patch addressing all your comments.
Let me know if there is anything else that I missed and that I need to address.

Thanks,
Sebastian

On Sun, Dec 1, 2019 at 3:01 PM Martin Storsjö <martin at martin.st> wrote:
>
> On Sun, 1 Dec 2019, Clément Bœsch wrote:
>
> > On Wed, Nov 27, 2019 at 12:30:35PM -0600, Sebastian Pop wrote:
> > [...]
> >> From 9ecaa99fab4b8bedf3884344774162636eaa5389 Mon Sep 17 00:00:00 2001
> >> From: Sebastian Pop <spop at amazon.com>
> >> Date: Sun, 17 Nov 2019 14:13:13 -0600
> >> Subject: [PATCH] [aarch64] use FMA and increase vector factor to 4
> >>
> >> This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
> >> and bumps the vectorization factor from 2 to 4.
> >> The speedup is of 34% on Graviton A1 instances based on A-72 cpus:
> >>
> >> $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
> >> before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
> >> after:  t:0.030079 avg:0.030102 max:0.030462 min:0.030051
> >>
> >> Tested with `make check` on aarch64-linux.
>
> FWIW, I'm not certain how much this routine actually is tested by that -
> in particular, there's no checkasm test for it as far as I can see.
>
> >> +        add                 x17, x3, x18                // srcp + filterPos[0]
> >> +        add                 x18, x3, x0                 // srcp + filterPos[1]
> >> +        add                 x0, x3, x2                  // srcp + filterPos[2]
> >> +        add                 x2, x3, x6                  // srcp + filterPos[3]
> >
> >> +2:      ldr                 d4, [x17, x15]              // srcp[filterPos[0] + {0..7}]
> >> +        ldr                 q5, [x16]                   // load 8x16-bit filter values, part 1
> >> +        ldr                 d6, [x18, x15]              // srcp[filterPos[1] + {0..7}]
> >> +        ldr                 q7, [x16, x12]              // load 8x16-bit at filter+filterSize
> >
> > Why not use ld1 {v4.8B} etc like it was before? The use of Dn/Qn in is
> > very confusing here.
>
> The ldr instruction, instead of ld1, allows you to to do a load (or store,
> similarly, for str instead of st1) with a constant/register offset, like
> [x17, x15] here, without incrementing the source register inbetween for
> each load (which can help with latency between individual load
> instructions, or can avoid extra instructions for incrementing the
> register inbetween).
>
> That works for loading the first 1/2/4/8/16 bytes of a vector, but can't
> be used e.g. for loading a lane other than the first (e.g. ld1 {v4.s}[1]).
> But it does require using the b/h/s/d/q names for the registers instead of
> v.
>
> I didn't check the changes here if they're essential for the optimization
> though.
>
> // Martin
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-aarch64-use-FMA-and-increase-vector-factor-to-4.patch
Type: application/octet-stream
Size: 9592 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20191204/d9bcd137/attachment.obj>