[FFmpeg-devel] [PATCH] avfilter/vf_overlay: add x86 SIMD for yuv444 format when main stream has no alpha
Henrik Gramner
henrik at gramner.com
Mon Apr 30 21:50:21 EEST 2018
On Mon, Apr 30, 2018 at 6:17 PM, Paul B Mahol <onemda at gmail.com> wrote:
> +.loop0:
> +    movu   m1, [dq + xq]
> +    movu   m2, [aq + xq]
> +    movu   m3, [sq + xq]
> +
> +    pshufb m1, [pb_b2dw]
> +    pshufb m2, [pb_b2dw]
> +    pshufb m3, [pb_b2dw]
> +    mova   m4, [pd_255]
> +    psubd  m4, m2
> +    pmulld m1, m4
> +    pmulld m3, m2
> +    paddd  m1, m3
> +    paddd  m1, [pd_128]
> +    pmulld m1, [pd_257]
> +    psrad  m1, 16
> +    pshufb m1, [pb_dw2b]
> +    movd   [dq+xq], m1
> +    add    xq, mmsize / 4
Unpacking to dwords seems inefficient when you could do something like
this (untested):
    mova     m3, [pw_255]
    mova     m4, [pw_128]
    mova     m5, [pw_257]
.loop0:
    pmovzxbw m0, [sq + xq]
    pmovzxbw m2, [aq + xq]
    pmovzxbw m1, [dq + xq]
    pmullw   m0, m2
    pxor     m2, m3
    pmullw   m1, m2
    paddw    m0, m4
    paddw    m0, m1
    pmulhuw  m0, m5
    packuswb m0, m0
    movq     [dq+xq], m0
    add      xq, mmsize / 2
which does twice as much per iteration. Also note that pmulld is slow
on most CPUs.
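In C terms, what both loops compute per pixel is roughly the following
(untested sketch, blend_row() is a made-up name and not part of the patch).
The point is that s*a + d*(255 - a) + 128 never exceeds 65153, so every
intermediate fits in an unsigned word and the division by 255 collapses to
a multiply by 257 plus a shift by 16, i.e. a single pmulhuw against pw_257:

#include <stdint.h>
#include <stddef.h>

/* Untested reference. Per pixel: d = (s*a + d*(255 - a) + 128) * 257 >> 16 */
void blend_row(uint8_t *d, const uint8_t *s, const uint8_t *a, size_t w)
{
    for (size_t x = 0; x < w; x++) {
        /* 255*255 + 128 = 65153, so the sum always fits in 16 unsigned bits. */
        unsigned sum = s[x] * a[x] + d[x] * (255 - a[x]) + 128;
        /* *257 >> 16 is the usual rounded division by 255; in the word-based
         * loop above this is the pmulhuw with pw_257.                        */
        d[x] = (sum * 257) >> 16;
    }
}

Which also shows why the dword unpack buys nothing: none of the
intermediates need more than 16 bits.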
> +.loop1:
> +    xor  tq, tq
> +    xor  uq, uq
> +    xor  vq, vq
> +    mov  rd, 255
> +    mov  tb, [aq + xq]
> +    neg  tb
> +    add  rb, tb
> +    mov  ub, [sq + xq]
> +    neg  tb
> +    imul ud, td
> +    mov  vb, [dq + xq]
> +    imul rd, vd
> +    add  rd, ud
> +    add  rd, 128
> +    imul rd, 257
> +    sar  rd, 16
> +    mov  [dq + xq], rb
> +    add  xq, 1
> +    cmp  xq, wq
> +    jl   .loop1
Is doing the tail in scalar necessary? E.g. can you pad the buffers so
that reading/writing past the end is OK and just run the SIMD loop?
If that's impossible it'd probably be better to do a separate SIMD
loop and pinsr/pextr input/output pixels depending on the number of
elements left.
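To make the padding idea concrete, an untested C sketch with made-up names
(blend_with_tail() and vector_blend() are purely illustrative, and the kernel
width `step` is assumed to be at most 64 pixels): copy the leftover pixels
into small zero-padded scratch buffers, run the same full-width kernel once
on those, then copy the result back:

#include <stdint.h>
#include <string.h>

/* Untested sketch. vector_blend() stands in for the asm kernel; the width
 * passed to it must be a multiple of `step` pixels, and step <= 64.        */
void blend_with_tail(uint8_t *d, const uint8_t *s, const uint8_t *a,
                     size_t w, size_t step,
                     void (*vector_blend)(uint8_t *, const uint8_t *,
                                          const uint8_t *, size_t))
{
    size_t x = w - w % step;

    if (x) /* full vectors */
        vector_blend(d, s, a, x);

    if (x < w) { /* tail: run the same kernel once on padded copies */
        uint8_t ts[64] = {0}, ta[64] = {0}, td[64] = {0};
        size_t  left   = w - x;

        memcpy(ts, s + x, left);
        memcpy(ta, a + x, left);
        memcpy(td, d + x, left);
        vector_blend(td, ts, ta, step);
        memcpy(d + x, td, left);
    }
}

Padding the actual buffers avoids even the memcpy, and the pinsr/pextr
route avoids it too at the cost of a small dispatch on the remaining count.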