[FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%

Tue Aug 18 21:11:30 EEST 2020

Hi,

Unrolling the outer loop of yuv2planeX by 4 reduces the number of cache
accesses by about 7.4%.
The filter values are loaded once and used in all 4 unrolled iterations,
which avoids reloading the same values 3 times.
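
To illustrate the idea, here is a minimal scalar C sketch. It is not the
NEON assembly from the attached patch; it only mirrors the shape of the
8-bit C reference loop in libswscale. The names planeX_ref, planeX_unroll4
and clip_u8 are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit range (hypothetical stand-in for av_clip_uint8). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference shape: every output pixel re-reads all filter coefficients. */
static void planeX_ref(const int16_t *filter, int filterSize,
                       const int16_t **src, uint8_t *dest, int dstW,
                       const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_u8(val >> 19);
    }
}

/* Outer loop unrolled by 4: each filter[j] is loaded once per group of
 * four outputs instead of once per output. */
static void planeX_unroll4(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset)
{
    assert(filterSize > 0);
    int i = 0;
    for (; i + 4 <= dstW; i += 4) {
        int v0 = dither[(i + 0 + offset) & 7] << 12;
        int v1 = dither[(i + 1 + offset) & 7] << 12;
        int v2 = dither[(i + 2 + offset) & 7] << 12;
        int v3 = dither[(i + 3 + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++) {
            const int coeff = filter[j];   /* single load, four uses */
            const int16_t *row = src[j];
            v0 += row[i + 0] * coeff;
            v1 += row[i + 1] * coeff;
            v2 += row[i + 2] * coeff;
            v3 += row[i + 3] * coeff;
        }
        dest[i + 0] = clip_u8(v0 >> 19);
        dest[i + 1] = clip_u8(v1 >> 19);
        dest[i + 2] = clip_u8(v2 >> 19);
        dest[i + 3] = clip_u8(v3 >> 19);
    }
    /* Tail: handle the remaining 0..3 outputs one at a time. */
    for (; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_u8(val >> 19);
    }
}
```

Both functions should produce identical output; the unrolled one simply
issues fewer loads of filter[j] and src[j] per output pixel.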
The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
instance with the following command:
$ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i \
    testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -

before: 1551591469      cache-references
after:  1436140431      cache-references

before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158
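
For reference, the percentages can be derived from the raw numbers with
the usual relative-reduction formula. A small C sketch (the helper name
percent_reduction is hypothetical):

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, expressed in percent. */
static double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Applied to the cache-references counts this gives roughly 7.4%, and
applied to the bench avg times it gives roughly 6.3%.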

Ok to commit?

Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-aarch64-yuv2planeX-unroll-outer-loop-by-4-increases-.patch
Type: application/octet-stream
Size: 7398 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20200818/8ee4545b/attachment.obj>