[FFmpeg-devel] [PATCH] avfilter/vf_maskedmerge: add SIMD for maskedmerge with 8 bit depth input
Henrik Gramner
henrik at gramner.com
Thu Oct 1 21:10:19 CEST 2015
On Thu, Oct 1, 2015 at 8:42 PM, Paul B Mahol <onemda at gmail.com> wrote:
> diff --git a/libavfilter/vf_maskedmerge.c b/libavfilter/vf_maskedmerge.c
> if (desc->comp[0].depth == 8)
> s->maskedmerge = maskedmerge8;
> else
> s->maskedmerge = maskedmerge16;
>
> + if (ARCH_X86)
> + ff_maskedmerge_init_x86(s);
> +
Create a new function ff_maskedmerge_init() and move the above code
there; that will make it easier to add a unit test.
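A rough sketch of what that split could look like. The context struct, the
function signatures and the stub bodies are placeholders made up for
illustration, not the actual filter code; only the names maskedmerge8,
maskedmerge16, ff_maskedmerge_init_x86 and ARCH_X86 come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the build-system macro so the sketch compiles on its own. */
#ifndef ARCH_X86
#define ARCH_X86 0
#endif

/* Hypothetical, stripped-down context: the real one has more fields. */
typedef struct MaskedMergeSketch {
    void (*maskedmerge)(const uint8_t *bsrc, const uint8_t *osrc,
                        const uint8_t *msrc, uint8_t *dst, ptrdiff_t w);
} MaskedMergeSketch;

/* Placeholder C implementations; bodies intentionally left empty here. */
static void maskedmerge8(const uint8_t *bsrc, const uint8_t *osrc,
                         const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    (void)bsrc; (void)osrc; (void)msrc; (void)dst; (void)w;
}

static void maskedmerge16(const uint8_t *bsrc, const uint8_t *osrc,
                          const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    (void)bsrc; (void)osrc; (void)msrc; (void)dst; (void)w;
}

/* Stub: the real version would override s->maskedmerge based on CPU flags. */
void ff_maskedmerge_init_x86(MaskedMergeSketch *s)
{
    (void)s;
}

/* Single entry point that both the filter and a unit test can call.
 * Taking the bit depth as a parameter is an assumption of this sketch. */
void ff_maskedmerge_init(MaskedMergeSketch *s, int depth)
{
    assert(s != NULL);

    if (depth == 8)
        s->maskedmerge = maskedmerge8;
    else
        s->maskedmerge = maskedmerge16;

    if (ARCH_X86)
        ff_maskedmerge_init_x86(s);
}
```

With the selection logic in one function, a test only has to call
ff_maskedmerge_init() to get the same function pointer the filter would use.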
> diff --git a/libavfilter/x86/vf_maskedmerge.asm b/libavfilter/x86/vf_maskedmerge.asm
> + mova m5, [pw_128]
> + mova m2, [pw_256]
> + pxor m6, m6
Nit: Reorganize your registers so you get those constants in m4, m5,
m6. It will make the code easier to follow IMO.
> + mov r10q, 0
Xor a register with itself instead of using mov to zero a register.
There's also no need to use the q suffix for plain register names, r10
is enough.
> + movh m0, [bsrcq + x]
> + movh m1, [osrcq + x]
> + movh m3, [msrcq + x]
[...]
> + punpcklbw m0, m6
> + punpcklbw m1, m6
> + punpcklbw m3, m6
You could also make an SSE4 version that uses pmovzxbw.
> + paddw m1, m5
> + psrlw m1, 8
I believe you could also make an SSSE3 version that uses pmulhrsw
instead of add + shift.
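To illustrate why that should work, here is a scalar C model of a single
16-bit lane. pmulhrsw computes ((a * b >> 14) + 1) >> 1 on signed words, so
multiplying by 128 gives the same result as (x + 128) >> 8. The helper names
are made up for this sketch. Note that the instruction treats lanes as signed,
so the equivalence shown here only covers values in 0..32767; larger
intermediates would need separate handling.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmulhrsw (signed multiply, round,
 * scale). Relies on arithmetic right shift for negative products, which is
 * what common compilers provide. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(((p >> 14) + 1) >> 1);
}

/* The add + shift sequence from the patch: paddw with 128, then psrlw by 8. */
static uint16_t add_shift_lane(uint16_t x)
{
    return (uint16_t)((x + 128) >> 8);
}

/* Same rounding expressed as one pmulhrsw against a constant of 128. */
static uint16_t mulhrs_lane(uint16_t x)
{
    assert(x <= 32767); /* signed-range restriction of this sketch */
    return (uint16_t)pmulhrsw_lane((int16_t)x, 128);
}
```

Looping x over 0..32767 and comparing add_shift_lane(x) with mulhrs_lane(x)
is a quick way to check the two agree on that range.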
> + add r10q, mmsize / 2
> + cmp r10q, wq
> + jl .loop
There's a trick you could do here that might be faster:
1) Add w to bsrc, osrc, msrc and dst and then negate w at the
beginning of the function.
2) Initialize r10 to w instead of 0 at the beginning of each .nextrow iteration.
3) You can now drop the cmp; the add will be enough to set the right
flags for the branch.
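The three steps above can be sketched in C for a single row. This is a scalar
illustration only: it steps one pixel at a time rather than mmsize / 2, the
function names are made up, and the blend formula is an assumption inferred
from the 128 and 256 constants in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: index counts up from 0 and is compared against w. */
static void merge_row_plain(const uint8_t *bsrc, const uint8_t *osrc,
                            const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    for (ptrdiff_t x = 0; x < w; x++)
        dst[x] = (uint8_t)((bsrc[x] * (256 - msrc[x]) +
                            osrc[x] * msrc[x] + 128) >> 8);
}

/* Same row, using a negative index that counts up towards zero. */
static void merge_row_negidx(const uint8_t *bsrc, const uint8_t *osrc,
                             const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    assert(w > 0);

    /* Step 1: move every pointer to the end of the row. */
    bsrc += w;
    osrc += w;
    msrc += w;
    dst  += w;

    /* Steps 1-2: start the index at the negated width. */
    ptrdiff_t x = -w;

    /* Step 3: the increment alone decides the branch; the loop ends when
     * x reaches zero, so no separate compare against w is needed. */
    do {
        dst[x] = (uint8_t)((bsrc[x] * (256 - msrc[x]) +
                            osrc[x] * msrc[x] + 128) >> 8);
    } while (++x < 0);
}
```

For multiple rows, the index would be reset to the negated width at the top
of each row, which is what step 2 describes.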
I also encourage you to write a checkasm unit test; that will make it
easier to both benchmark and verify the correctness of your code.