[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Mon Aug 19 15:08:50 EEST 2024

On Mon, 19 Aug 2024, Ramiro Polla wrote:

>> If the stride is a negative number, the first sxtw does the right thing,
>> but the "lsl w1, w1, #1" will zero out the upper half of the register.
>
> I'll start adding negative stride tests to checkasm to spot these bugs.

That's probably useful. The other alternative is to transition these cases 
to use ptrdiff_t for the stride, which should be register sized, so most 
of the sign extension issues around strides go away. (We've transitioned 
lots of preexisting DSP interfaces already, so doing that here would just 
be the next logical step. But at times, this may require marginal 
touch-ups to existing assembly, or at least allows getting rid of such 
sign extensions later.)

>> With this, I'm down from your 120 cycles on the A53 originally, to 78.7
>> cycles now. Your fully unrolled version seemed to run in 72 cycles on the
>> A53, so that's obviously even faster, but I think this kind of tradeoff
>> might be the sweet spot. What does such a version give you in terms of
>> real world speed?
>
> This version is around 0.5% slower overall on the A76. Very roughly
> these are the total times taken by pix_sum and pix_norm1 with the
> different implementations on A76:
> c: ~5%
> fully unrolled: ~3%
> unroll 2: 2.5%
> tight loop: 2%

Ok. Given the tradeoff between various different cores (including ones not 
tested here), do you think this version would be a reasonable compromise 
(giving almost ideal results on in-order cores, and not too much slowdown 
on out-of-order cores in this benchmark)?

// Martin