[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Mon Aug 19 12:26:46 EEST 2024

On Sun, 18 Aug 2024, Ramiro Polla wrote:

> I had tested the real world case on the A76, but not on the A53. I
> spent a couple of hours with perf trying to find the source of the
> discrepancy but I couldn't find anything conclusive. I need to learn
> more about how to test cache misses.

Nah, I guess that's a bit overkill...

> I just tested again with the following command:
> $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> -y /dev/null
>
> The entire test was about 1% faster unrolled on A53, but about 1%
> slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> optimizations, so I preferred choosing the version that was faster on
> the A76).

> I wonder if there is any way we could check at runtime.

There are indeed often cases where functions could be tuned differently 
for older/newer or in-order/out-of-order cores. In most cases, trying to 
specialize things is a bit of a waste and overkill though; I'd just 
suggest going with a compromise.

(Sometimes, different kinds of tunings can be applied if you use e.g. the 
flag dotprod to differentiate between older and newer cores. But it's 
seldom worth the extra effort to do that.)
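
To make the runtime selection idea concrete, here is a minimal sketch in C 
of picking a function pointer from CPU flags at init time. The flag names, 
the select_pix_sum() helper and the two "tuned" stand-ins are hypothetical 
and only exist for this illustration; they are not FFmpeg's actual init 
code or flag values. It relies on the A76 reporting dotprod while the A53 
does not.

```c
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not FFmpeg's AV_CPU_FLAG_* values. */
#define SKETCH_FLAG_NEON    (1 << 0)
#define SKETCH_FLAG_DOTPROD (1 << 1)

typedef int (*pix_sum_fn)(const uint8_t *pix, int stride);

/* Plain C reference: sum of all pixels in a 16x16 block. */
static int pix_sum_c(const uint8_t *pix, int stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += pix[y * stride + x];
    return sum;
}

/* Stand-ins for two differently tuned assembly versions (hypothetical). */
static int pix_sum_tuned_old(const uint8_t *pix, int stride)
{
    return pix_sum_c(pix, stride);
}

static int pix_sum_tuned_new(const uint8_t *pix, int stride)
{
    return pix_sum_c(pix, stride);
}

/* Pick an implementation once, based on the detected CPU features. */
static pix_sum_fn select_pix_sum(int cpu_flags)
{
    if (!(cpu_flags & SKETCH_FLAG_NEON))
        return pix_sum_c;
    if (cpu_flags & SKETCH_FLAG_DOTPROD)
        return pix_sum_tuned_new;   /* newer cores, e.g. A76 */
    return pix_sum_tuned_old;       /* older cores, e.g. A53 */
}
```

The cost is a second assembly function to maintain and test for a 
difference of around 1%, which is why a single compromise version is 
usually preferable.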

Right, so looking at your unrolled case, you've done a full unroll. That's 
probably also a bit overkill.

The in-order cores really hate tight loops where almost everything has a 
sequential dependency on the previous instruction - so the general rule of 
thumb is that you'll want to unroll by a factor of 2, unless the algorithm 
itself has enough complexity that there are two separate dependency chains 
interleaved.
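
As a rough illustration of that rule of thumb in C (not the NEON code 
itself), compare a loop with a single dependency chain against one 
unrolled by 2 with two independent chains. The function names are made up 
for this sketch, and a C compiler is free to reschedule either loop, so 
this only shows the structure, not the timing.

```c
#include <stddef.h>
#include <stdint.h>

/* One chain: every add has to wait for the previous add to finish. */
static uint32_t sum_bytes_serial(const uint8_t *p, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* Unrolled by 2: acc0 and acc1 form two independent chains, so an
 * in-order core can start the second add while the first is in flight. */
static uint32_t sum_bytes_unroll2(const uint8_t *p, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += p[i];
        acc1 += p[i + 1];
    }
    if (i < n)              /* odd element count: handle the last byte */
        acc0 += p[i];
    return acc0 + acc1;
}
```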

Also, there's a slight bug in your unrolled version:

> +        add             x2, x0, w1, sxtw
> +        lsl             w1, w1, #1

If the stride is a negative number, the first sxtw does the right thing, 
but the "lsl w1, w1, #1" writes a 32-bit result and zeroes out the upper 
half of x1, so the doubled stride ends up as a large positive number.

So for that, you'd still need to keep the "sxtw x1, w1" instruction, and 
do the lsl on x1 instead. It is actually possible to merge it into one 
instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs 
right. But that's a much more uncommon instruction...
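
The difference can be modelled in plain C. This is only a sketch of the 
register semantics, with made-up function names: the first function 
mimics a write to w1 (upper 32 bits of x1 cleared), the second mimics 
sxtw followed by a 64-bit lsl, which is also what sbfiz computes.

```c
#include <stdint.h>

/* Models "lsl w1, w1, #1": the shift is done in 32 bits and the
 * result is zero-extended into the 64-bit register. */
static int64_t stride2_lsl_w(int32_t stride)
{
    uint32_t w = (uint32_t)stride << 1;
    return (int64_t)(uint64_t)w;
}

/* Models "sxtw x1, w1" + "lsl x1, x1, #1", or equivalently
 * "sbfiz x1, x1, #1, #32": sign-extend first, then double. */
static int64_t stride2_sbfiz(int32_t stride)
{
    return (int64_t)stride * 2;
}
```

For a stride of 16 both return 32. For a stride of -16 the second returns 
-32, while the first returns 4294967264 (0xffffffe0), which would send the 
post-incremented pointers far out of bounds.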

As for optimal performance here - I tried something like this:

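         movi            v0.16b, #0              // clear the accumulator
         add             x2, x0, w1, sxtw        // x2 = second row
         sbfiz           x1, x1, #1, #32         // x1 = 2 * sign-extended stride
         mov             w3, #16                 // 16 rows

1:
         ld1             {v1.16b}, [x0], x1      // even row, advance two rows
         ld1             {v2.16b}, [x2], x1      // odd row, advance two rows
         subs            w3, w3, #2              // two rows per iteration
         uadalp          v0.8h, v1.16b           // add byte pairs into 8 x u16
         uadalp          v0.8h, v2.16b
         b.ne            1b

         uaddlv          s0, v0.8h               // sum the 8 halfwords
         fmov            w0, s0                  // return value

         ret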

With this, I'm down from your 120 cycles on the A53 originally, to 78.7 
cycles now. Your fully unrolled version seemed to run in 72 cycles on the 
A53, so that's obviously even faster, but I think this kind of tradeoff 
might be the sweet spot. What does such a version give you in terms of 
real world speed?
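
For checking any assembly variant against a reference, a scalar C model of 
the same two-rows-per-iteration structure could look like this. It's a 
sketch, not FFmpeg's C implementation; the function name is made up, and 
the int stride parameter is an assumption based on the stride arriving in 
w1.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the loop: sums a 16x16 block, two rows per iteration.
 * Works for negative strides since the offset is kept as a signed value. */
static int pix_sum_2rows(const uint8_t *pix, int stride)
{
    ptrdiff_t off = 0;      /* offset of the even row, like x0 */
    int sum = 0;

    for (int rows = 16; rows != 0; rows -= 2) {
        for (int x = 0; x < 16; x++)
            sum += pix[off + x] + pix[off + stride + x];
        off += 2 * (ptrdiff_t)stride;   /* like the post-increment by x1 */
    }
    return sum;
}
```

Calling it once with a positive stride from the top row, and once with a 
negative stride from the bottom row of the same buffer, should give the 
same sum; that's a cheap way to catch the sign extension issue.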

In this version, the two back-to-back uadalp instructions accumulating 
into v0 may look like a potential bottleneck. I did try using two separate 
accumulator registers, accumulating into v0 and v1 separately and only 
summing them at the end. That didn't make any difference, so the A53 may 
have a special case where two such sequential accumulations into the same 
register don't incur the full latency. (The A53 does have such a case for 
"mla" at least.)

// Martin