[FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

Fri Apr 1 00:42:29 EEST 2022

On Thu, 31 Mar 2022, Ben Avison wrote:

> On 30/03/2022 15:14, Martin Storsjö wrote:
>> On Fri, 25 Mar 2022, Ben Avison wrote:
>>> +// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
>>> +// On entry:
>>> +//   x0 -> array of 64x 16-bit coefficients
>>> +//   x1 -> 8-bit results
>>> +//   x2 = row stride for results, bytes
>>> +function ff_put_signed_pixels_clamped_neon, export=1
>>> +        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
>>> +        movi            v4.8b, #128
>>> +        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
>>> +        sqxtn           v0.8b, v0.8h
>>> +        sqxtn           v1.8b, v1.8h
>>> +        sqxtn           v2.8b, v2.8h
>>> +        sqxtn           v3.8b, v3.8h
>>> +        sqxtn           v5.8b, v16.8h
>>> +        add             v0.8b, v0.8b, v4.8b
>> 
>> Here you could save 4 add instructions with sqxtn2 and adding .16b vectors, 
>> but I'm not sure if it's worthwhile. (It reduces the checkasm numbers by 0.7 
>> for Cortex A72, by 0.3 for A73, but increases the runtime by 1.0 on A53.) 
>> Strangely enough, I get much smaller numbers on my A72 than you got.
>
> That's weird. As you say, it should be independent of clock-frequency. FWIW, 
> I'm benchmarking on a Raspberry Pi 4; I'd assume all its board variants' 
> Cortex-A72 cores are of identical revision.
>
> Now I run it again, I'm getting these figures:
>
> idctdsp.add_pixels_clamped_c: 313.3
> idctdsp.add_pixels_clamped_neon: 24.3
> idctdsp.put_pixels_clamped_c: 220.3
> idctdsp.put_pixels_clamped_neon: 15.5
> idctdsp.put_signed_pixels_clamped_c: 210.5
> idctdsp.put_signed_pixels_clamped_neon: 19.5
>
> which is more in line with what you see! I am getting a lot of variability 
> between runs though - from a small sample, I'm seeing add_pixels_clamped_neon 
> coming out as anything from 21 to 30, which is well above the sort of 
> differences you're seeing between alternate implementations.

That's indeed weird. I don't have a Raspberry Pi 4 myself, but for 
functions in this size range on the devboards I test on, I get essentially 
perfectly stable numbers each time - which is great for empirically 
testing different implementation strategies.

> This sort of case is always going to be difficult to schedule optimally for 
> multiple cores - factors like how much dual-issuing is possible, latency 
> before values can be used, load speed and the granularity of scoreboarding 
> parts of vectors, all vary widely.

Yup, indeed. In most cases, an implementation that is good for one core is 
decent for the other ones as well, but sometimes it ends up as a 
compromise, where optimizing for one makes things worse for another one. 
As long as the chosen implementation isn't very suboptimal for some common 
cores, it probably doesn't matter much though.
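
For reference, here is a minimal scalar sketch in C of what the quoted 
routine computes, going by its header comment: each of the 64 coefficients 
is saturated to signed 8-bit (the sqxtn step) and then biased by 128 (the 
add step). The function name is hypothetical and this is not FFmpeg's 
actual C fallback; it assumes an 8x8 block stored row by row and a stride 
given in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: a scalar model of the quoted NEON routine,
 * written only from its header comment. */
void put_signed_pixels_clamped_sketch(const int16_t *block,
                                      uint8_t *pixels,
                                      ptrdiff_t stride)
{
    assert(block && pixels);

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = block[y * 8 + x];

            /* Saturate to the signed 8-bit range, as sqxtn does per lane. */
            if (v < -128)
                v = -128;
            if (v > 127)
                v = 127;

            /* Bias by 128, so the stored byte lies in 0..255. */
            pixels[x] = (uint8_t)(v + 128);
        }
        /* Advance to the next output row; stride is in bytes. */
        pixels += stride;
    }
}
```

The narrowing and the bias are independent per-element steps, which is why 
the adds can be done on either 8-byte or 16-byte vectors without changing 
the result; only the instruction count and scheduling differ.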

// Martin