[FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%
Sebastian Pop
sebpop at gmail.com
Tue Aug 18 21:11:30 EEST 2020
Hi,
Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
accesses by about 7.5%.
The filter values are loaded once and reused across the 4 unrolled
iterations, which avoids reloading the same values 3 times.
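
To illustrate the idea, here is a minimal scalar C sketch of the transformation. It is not the code from the attached patch, which is AArch64 NEON assembly. The names planeX_ref, planeX_unroll4 and clip_u8 are hypothetical, and the loop body (dither << 12, accumulate src[j][i] * filter[j], shift right by 19) is assumed to follow the shape of the generic C yuv2planeX. The point is only that each filter[j] is read once and then applied to 4 output pixels.

```c
#include <stdint.h>

/* Clamp to 0..255; hypothetical stand-in for a clip-to-uint8 helper. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference form: one output pixel per outer iteration, so every
 * filter[j] is read again for each pixel. */
static void planeX_ref(const int16_t *filter, int filterSize,
                       const int16_t **src, uint8_t *dest, int dstW,
                       const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_u8(val >> 19);
    }
}

/* Outer loop unrolled by 4: filter[j] is read once per inner iteration
 * and reused for 4 accumulators. */
static void planeX_unroll4(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset)
{
    int i = 0;
    for (; i + 4 <= dstW; i += 4) {
        int v0 = dither[(i + 0 + offset) & 7] << 12;
        int v1 = dither[(i + 1 + offset) & 7] << 12;
        int v2 = dither[(i + 2 + offset) & 7] << 12;
        int v3 = dither[(i + 3 + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++) {
            const int f = filter[j];          /* single load, 4 uses */
            const int16_t *s = src[j] + i;
            v0 += s[0] * f;
            v1 += s[1] * f;
            v2 += s[2] * f;
            v3 += s[3] * f;
        }
        dest[i + 0] = clip_u8(v0 >> 19);
        dest[i + 1] = clip_u8(v1 >> 19);
        dest[i + 2] = clip_u8(v2 >> 19);
        dest[i + 3] = clip_u8(v3 >> 19);
    }
    /* Tail: leftover pixels when dstW is not a multiple of 4. */
    for (; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_u8(val >> 19);
    }
}
```

Both functions should produce identical output for any width; only the number of filter loads differs.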
The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
instance with the following command:
$ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i testsrc2=4k:d=2 \
    -vf bench=start,scale=1024x1024,bench=stop -f null -
before: 1551591469 cache-references
after:  1436140431 cache-references
before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537 min:0.012975
after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164 min:0.012158
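
For reference, the percentages can be rechecked from the figures above with a small helper. The function name reduction_pct is hypothetical and is used here only for illustration.

```c
/* Relative reduction from `before` to `after`, in percent. */
static double reduction_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With the cache-references counters (1551591469 and 1436140431) this gives roughly 7.4%, close to the 7.5% quoted above, and with the bench avg values (0.013219 and 0.012381) it gives roughly 6.3%, which matches the figure in the subject line.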
Ok to commit?
Thanks,
Sebastian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-aarch64-yuv2planeX-unroll-outer-loop-by-4-increases-.patch
Type: application/octet-stream
Size: 7398 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20200818/8ee4545b/attachment.obj>