[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
Martin Storsjö
martin at martin.st
Mon Aug 19 12:26:46 EEST 2024
On Sun, 18 Aug 2024, Ramiro Polla wrote:
> I had tested the real world case on the A76, but not on the A53. I
> spent a couple of hours with perf trying to find the source of the
> discrepancy but I couldn't find anything conclusive. I need to learn
> more about how to test cache misses.
Nah, I guess that's a bit overkill...
> I just tested again with the following command:
> $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> -y /dev/null
>
> The entire test was about 1% faster unrolled on A53, but about 1%
> slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> optimizations, so I preferred choosing the version that was faster on
> the A76).
> I wonder if there is any way we could check at runtime.
There are indeed often cases where functions could be tuned differently
for older/newer or in-order/out-of-order cores. Trying to specialize
things is usually a bit of a waste and overkill though, so in most cases
I'd just suggest going with a compromise.
(Sometimes, different kinds of tunings can be applied if you use e.g. the
flag dotprod to differentiate between older and newer cores. But it's
seldom worth the extra effort to do that.)
Right, so looking at your unrolled case, you've done a full unroll. That's
probably also a bit overkill.
The in-order cores really hate tight loops where almost everything has a
sequential dependency on the previous instruction - so the general rule of
thumb is that you'll want to unroll by a factor of 2, unless the algorithm
itself has enough complexity that there are two separate dependency chains
interlinked.
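To illustrate the idea in C rather than assembly (only a sketch; the
function names are made up, and a C compiler is free to reorder or
vectorize either loop, so this shows the structure and not actual
timings):

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: every add has to wait for the previous add. */
static uint32_t sum_single_chain(const uint8_t *p, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* Unrolled by 2 with two independent chains; assumes n is even.
 * The add into acc1 does not depend on the add into acc0, so an
 * in-order core has other work to issue while one add is in flight. */
static uint32_t sum_two_chains(const uint8_t *p, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc0 += p[i];
        acc1 += p[i + 1];
    }
    return acc0 + acc1;
}
```

Both functions return the same sum; only the shape of the dependency
chain differs.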
Also, from your unrolled version, there's a slight bug in it:
> + add x2, x0, w1, sxtw
> + lsl w1, w1, #1
If the stride is a negative number, the first sxtw does the right thing,
but the "lsl w1, w1, #1" will zero out the upper half of the register.
So for that, you'd still need to keep the "sxtw x1, w1" instruction, and
do the lsl on x1 instead. It is actually possible to merge it into one
instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs
right. But that's a much more uncommon instruction...
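A small C model of the two variants, in case it helps to see the
difference (a sketch only; the function names are made up, and the
64-bit return value stands in for the contents of x1):

```c
#include <stdint.h>

/* Models "lsl w1, w1, #1": the shift is done in 32 bits, and writing a
 * W register zeroes bits 63:32 of the corresponding X register. */
static uint64_t double_stride_lsl_w(int32_t stride)
{
    uint32_t w = (uint32_t)stride << 1;
    return (uint64_t)w;
}

/* Models "sxtw x1, w1" followed by "lsl x1, x1, #1", which is what
 * "sbfiz x1, x1, #1, #32" does in one instruction: sign extend the
 * low 32 bits, then shift left by one in 64 bits. */
static uint64_t double_stride_sbfiz(int32_t stride)
{
    return (uint64_t)(int64_t)stride << 1;
}
```

For a stride of 16 both give 32. For a stride of -16 the first gives
0x00000000ffffffe0, i.e. a large positive offset, while the second gives
0xffffffffffffffe0, i.e. -32.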
As for optimal performance here - I tried something like this:
        movi            v0.16b, #0
        add             x2, x0, w1, sxtw
        sbfiz           x1, x1, #1, #32
        mov             w3, #16
1:
        ld1             {v1.16b}, [x0], x1
        ld1             {v2.16b}, [x2], x1
        subs            w3, w3, #2
        uadalp          v0.8h, v1.16b
        uadalp          v0.8h, v2.16b
        b.ne            1b
        uaddlv          s0, v0.8h
        fmov            w0, s0
        ret
With this, I'm down from your 120 cycles on the A53 originally, to 78.7
cycles now. Your fully unrolled version seemed to run in 72 cycles on the
A53, so that's obviously even faster, but I think this kind of tradeoff
might be the sweet spot. What does such a version give you in terms of
real world speed?
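For reference, here is a rough scalar C model of what the assembly loop
above computes, lane by lane. It's only a sketch for checking the
arithmetic: pix_sum_model is a made-up name, not FFmpeg's C
implementation, and I'm assuming a 16x16 block with an int line size.

```c
#include <stddef.h>
#include <stdint.h>

static int pix_sum_model(const uint8_t *pix, int line_size)
{
    uint16_t acc[8] = { 0 };                   /* v0.8h */
    ptrdiff_t off0 = 0;                        /* row read through x0 */
    ptrdiff_t off1 = line_size;                /* row read through x2 */
    ptrdiff_t step = (ptrdiff_t)line_size * 2; /* x1 after sbfiz */

    for (int rows = 16; rows != 0; rows -= 2) {
        for (int lane = 0; lane < 8; lane++) {
            /* uadalp: add each pair of adjacent bytes into a 16-bit lane */
            acc[lane] += pix[off0 + 2 * lane] + pix[off0 + 2 * lane + 1];
            acc[lane] += pix[off1 + 2 * lane] + pix[off1 + 2 * lane + 1];
        }
        off0 += step;
        off1 += step;
    }

    /* uaddlv: widen and add all eight lanes into one 32-bit sum */
    uint32_t sum = 0;
    for (int lane = 0; lane < 8; lane++)
        sum += acc[lane];
    return (int)sum;
}
```

Each 16-bit lane receives at most 2 * 255 per row, so 16 * 510 = 8160
over the whole block, which is why accumulating in .8h can't overflow
here. A negative line_size works too, as long as pix points at the row
the walk should start from.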
In the assembly above, the two sequential uadalp instructions may look
potentially problematic, since both accumulate into v0. I did try using
two separate accumulator registers, accumulating into v0 and v1
separately, and only summing them at the end. That didn't make any
difference, so the A53 may have a special case where two such sequential
accumulations into the same register don't incur the extra full latency.
(The A53 does have such a case for "mla" at least.)
// Martin