[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.
Henrik Gramner
henrik at gramner.com
Tue Nov 17 19:50:42 EET 2020
On Mon, Nov 16, 2020 at 11:03 AM Alan Kelly
<alankelly-at-google.com at ffmpeg.org> wrote:
> +cglobal yuv2yuvX, 6, 7, 16, filter, filterSize, dest, dstW, dither, offset, src
Only 8 xmm registers are used, so the count here should be 8 instead
of 16. On 64-bit Windows xmm6-xmm15 are callee-saved, so declaring 16
makes the prologue spill registers that the function never touches.
> +%if ARCH_X86_64
> +%define ptr_size 8
[...]
> +%else
> +%define ptr_size 4
The predefined variable gprsize already exists for this purpose, so
that can be used instead.
> + movq xmm3, [ditherq]
If vpbroadcastq m3, [ditherq] is used for AVX2 here, then the following
> + vperm2i128 m3, m3, m3, 0
instruction can be eliminated.
> + punpcklwd m1, m1
> + punpckldq m1, m1
These two can be replaced with a single pshuflw m1, m1, q0000.
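
For illustration, here is a small C sketch of the two variants using SSE2 intrinsics. It assumes an x86 target with SSE2, and the function names are made up for this example. Note that the two forms only agree on the low 64 bits of the register (pshuflw leaves the upper four words untouched), so the substitution assumes that the code following it in the patch only depends on the low four words.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Two-instruction form: punpcklwd followed by punpckldq.
 * Word lanes are listed from lowest to highest. */
static __m128i splat_unpack(__m128i v)
{
    v = _mm_unpacklo_epi16(v, v);  /* a a b b c c d d */
    v = _mm_unpacklo_epi32(v, v);  /* a a a a b b b b */
    return v;
}

/* Single-instruction form: pshuflw with selector 0 (q0000). */
static __m128i splat_pshuflw(__m128i v)
{
    return _mm_shufflelo_epi16(v, 0);  /* a a a a e f g h */
}

/* Asserts that both forms produce the same low 64 bits for input w. */
static void check_low_halves(const uint16_t w[8])
{
    uint16_t x[8], y[8];
    __m128i v = _mm_loadu_si128((const __m128i *)w);

    _mm_storeu_si128((__m128i *)x, splat_unpack(v));
    _mm_storeu_si128((__m128i *)y, splat_pshuflw(v));
    assert(memcmp(x, y, 8) == 0);  /* compare the low four words only */
}
```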
> + mov srcq, [filterSizeq]
> + test srcd, srcd
test srcq, srcq should be used here, since the lower 32 bits of a
valid pointer could randomly happen to be zero on a 64-bit system.
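
To make the failure mode concrete, here is a minimal C model of the two checks. The helper names and the example address are hypothetical. The cast to uint32_t stands in for what test srcd, srcd looks at.

```c
#include <assert.h>
#include <stdint.h>

/* Models `test srcd, srcd`: only the low 32 bits are examined. */
static int is_null_32(uint64_t ptr_bits)
{
    return (uint32_t)ptr_bits == 0;
}

/* Models `test srcq, srcq`: all 64 bits are examined. */
static int is_null_64(uint64_t ptr_bits)
{
    return ptr_bits == 0;
}

/* Shows a non-null address that the 32-bit check mistakes for null. */
static void demo_truncated_null_check(void)
{
    /* Hypothetical 64-bit address whose low 32 bits happen to be zero. */
    const uint64_t addr = UINT64_C(0x00007f3a00000000);

    assert(!is_null_64(addr));               /* correctly seen as non-null */
    assert(is_null_32(addr));                /* wrongly seen as null */
    assert(is_null_32(0) && is_null_64(0));  /* both agree on a real null */
}
```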
> + REP_RET
Since non-temporal stores are being used, this should be replaced with
    sfence
    RET
to guarantee proper memory ordering semantics in multi-threaded use
cases. Things will usually work fine without it, but may potentially
break in "fun to debug" ways.
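
The same pattern, sketched in C with intrinsics rather than taken from the patch: a copy loop that uses non-temporal stores and issues a store fence before returning. stream_copy is a made-up name, and the sketch assumes an x86 target with SSE2, a 16-byte aligned destination and a length that is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_sfence */

/* Copies n bytes from src to dst using non-temporal stores.
 * dst must be 16-byte aligned and n a multiple of 16. */
static void stream_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);

    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);  /* movntdq */
    }

    /* Non-temporal stores are weakly ordered. The fence makes them
     * globally visible before any store that follows, such as a flag
     * telling another thread that dst is ready. */
    _mm_sfence();
}
```

Without the fence a single-threaded caller would still read back the right data, which is why the missing sfence tends to go unnoticed until another thread consumes the buffer.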