[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
Martin Storsjö
martin at martin.st
Mon Aug 19 15:08:50 EEST 2024
On Mon, 19 Aug 2024, Ramiro Polla wrote:
>> If the stride is a negative number, the first sxtw does the right thing,
>> but the "lsl w1, w1, #1" will zero out the upper half of the register.
>
> I'll start adding negative stride tests to checkasm to spot these bugs.
That's probably useful. The other alternative is to transition these cases
to use ptrdiff_t for the stride, which should be register sized, so most
of the sign extension issues around strides go away. (We've transitioned
lots of preexisting DSP interfaces already, so doing that here would just
be the next logical step. But at times, this may require marginal
touch-ups to existing assembly, or at least allows getting rid of such
sign extensions later.)
>> With this, I'm down from your 120 cycles on the A53 originally, to 78.7
>> cycles now. Your fully unrolled version seemed to run in 72 cycles on the
>> A53, so that's obviously even faster, but I think this kind of tradeoff
>> might be the sweet spot. What does such a version give you in terms of
>> real world speed?
>
> This version is around 0.5% slower overall on the A76. Very roughly
> these are the total times taken by pix_sum and pix_norm1 with the
> different implementations on A76:
> c: ~5%
> fully unrolled: ~3%
> unroll 2: 2.5%
> tight loop: 2%
Ok. Given the tradeoff between various different cores (including ones not
tested here), do you think this version would be a reasonable compromise
(giving almost ideal results on in-order cores, and not too much slowdown
on out-of-order cores in this benchmark)?
// Martin
More information about the ffmpeg-devel
mailing list