[FFmpeg-devel] [PATCH] avfilter/vf_maskedmerge: add SIMD for maskedmerge with 8 bit depth input

Thu Oct 1 21:10:19 CEST 2015

On Thu, Oct 1, 2015 at 8:42 PM, Paul B Mahol <onemda at gmail.com> wrote:
> diff --git a/libavfilter/vf_maskedmerge.c b/libavfilter/vf_maskedmerge.c

>      if (desc->comp[0].depth == 8)
>          s->maskedmerge = maskedmerge8;
>      else
>          s->maskedmerge = maskedmerge16;
>
> +    if (ARCH_X86)
> +        ff_maskedmerge_init_x86(s);
> +

Create a new function ff_maskedmerge_init() and move the above code
there; that will make it easier to add a unit test.
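
Something along these lines is what I have in mind. This is only a rough
sketch with stand-in types: the context struct is cut down to one field, the
function pointer signature is simplified to a single row, the depth parameter
and the *_sketch names are hypothetical, and the 8-bit formula is assumed from
the constants used in the asm (pw_128, pw_256, shift by 8).

```c
#include <stdint.h>
#include <stddef.h>

#ifndef ARCH_X86
#define ARCH_X86 0   /* normally provided by the build configuration */
#endif

/* Stand-in for the filter's private context (assumption: real one is larger). */
typedef struct MaskedMergeContext {
    void (*maskedmerge)(const uint8_t *bsrc, const uint8_t *osrc,
                        const uint8_t *msrc, uint8_t *dst, int w);
} MaskedMergeContext;

/* Assumed 8-bit C reference: blend base and overlay by the mask. */
static void maskedmerge8(const uint8_t *bsrc, const uint8_t *osrc,
                         const uint8_t *msrc, uint8_t *dst, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = (uint8_t)((bsrc[x] * (256 - msrc[x]) +
                            osrc[x] * msrc[x] + 128) >> 8);
}

/* Placeholder for the 16-bit path; body elided, only the dispatch matters. */
static void maskedmerge16(const uint8_t *bsrc, const uint8_t *osrc,
                          const uint8_t *msrc, uint8_t *dst, int w)
{
    (void)bsrc; (void)osrc; (void)msrc; (void)dst; (void)w;
}

/* Placeholder for the x86 init; a real one would check CPU flags and
 * override s->maskedmerge with the SIMD version. */
static void maskedmerge_init_x86_sketch(MaskedMergeContext *s)
{
    (void)s;
}

/* All function pointer setup lives here, so both the filter's config
 * code and a unit test can call it. */
static void maskedmerge_init_sketch(MaskedMergeContext *s, int depth)
{
    if (depth == 8)
        s->maskedmerge = maskedmerge8;
    else
        s->maskedmerge = maskedmerge16;

    if (ARCH_X86)
        maskedmerge_init_x86_sketch(s);
}
```

The filter would then call the init function once with the component depth,
and a test can call the same function without setting up a filter graph.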

> diff --git a/libavfilter/x86/vf_maskedmerge.asm b/libavfilter/x86/vf_maskedmerge.asm

> +    mova m5, [pw_128]
> +    mova m2, [pw_256]
> +    pxor m6, m6

Nit: Reorganize your registers so you get those constants in m4, m5,
m6. It will make the code easier to follow IMO.

> +    mov r10q, 0

Xor a register with itself instead of using mov to zero a register.
There's also no need to use the q suffix for plain register names; r10
is enough.

> +        movh m0, [bsrcq + x]
> +        movh m1, [osrcq + x]
> +        movh m3, [msrcq + x]
[...]
> +        punpcklbw m0, m6
> +        punpcklbw m1, m6
> +        punpcklbw m3, m6

You could also make an SSE4 version that uses pmovzxbw.

> +        paddw m1, m5
> +        psrlw m1, 8

I believe you could also make an SSSE3 version that uses pmulhrsw
instead of add + shift.
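
To illustrate why that can work, here is a scalar C model of one 16-bit lane.
pmulhrsw computes (a * b + 0x4000) >> 15 on signed words, so multiplying by
128 gives the same result as adding 128 and shifting right by 8. The helper
names are made up for this sketch, and it only checks inputs that are
non-negative when read as signed words; whether the values in this loop stay
in that range would still need to be verified.

```c
#include <stdint.h>

/* Scalar model of one lane of pmulhrsw: signed 16x16 multiply, round,
 * then keep the high bits. Assumes arithmetic right shift for negative
 * products, as on common compilers. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)((p + 0x4000) >> 15);
}

/* The add + shift pair from the patch (paddw with 128, psrlw by 8). */
static uint16_t add_shift_lane(uint16_t x)
{
    return (uint16_t)((x + 128) >> 8);
}

/* Count inputs in [lo, hi] where the two forms disagree.
 * Intended for 0 <= lo <= hi <= 32767. */
static int count_mismatches(int32_t lo, int32_t hi)
{
    int bad = 0;
    for (int32_t x = lo; x <= hi; x++)
        if (pmulhrsw_lane((int16_t)x, 128) != (int16_t)add_shift_lane((uint16_t)x))
            bad++;
    return bad;
}
```

Calling count_mismatches(0, 32767) is a quick way to confirm the two forms
agree over the whole non-negative signed range.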

> +        add r10q, mmsize / 2
> +        cmp r10q, wq
> +    jl .loop

There's a trick you could do here that might be faster:
1) Add w to bsrc, osrc, msrc and dst and then negate w in the
beginning of the function.
2) Initialize r10 to w instead of 0 at the beginning of each .nextrow iteration
3) You can now drop the cmp; the add alone sets the flags needed for
the branch, since the counter runs from -w up to 0
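
The three steps above look like this when written out in C for a single row.
It is only a scalar sketch: the function names are hypothetical, the blend
formula is assumed from the constants in the asm, and the real loop would step
by mmsize / 2 rather than by 1.

```c
#include <stdint.h>
#include <stddef.h>

/* Plain forward loop, used as the reference. */
static void merge_row_ref(const uint8_t *bsrc, const uint8_t *osrc,
                          const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    for (ptrdiff_t x = 0; x < w; x++)
        dst[x] = (uint8_t)((bsrc[x] * (256 - msrc[x]) +
                            osrc[x] * msrc[x] + 128) >> 8);
}

/* Same row, indexed with a negative counter that runs up to zero. */
static void merge_row_negidx(const uint8_t *bsrc, const uint8_t *osrc,
                             const uint8_t *msrc, uint8_t *dst, ptrdiff_t w)
{
    if (w <= 0)
        return;

    /* Step 1: point every buffer at the end of the row and negate w. */
    bsrc += w;
    osrc += w;
    msrc += w;
    dst  += w;
    w = -w;

    /* Step 2: start the counter at the (now negative) width. */
    ptrdiff_t x = w;

    /* Step 3: the loop condition is just "still negative", which in asm
     * comes for free from the flags set by the add. */
    do {
        dst[x] = (uint8_t)((bsrc[x] * (256 - msrc[x]) +
                            osrc[x] * msrc[x] + 128) >> 8);
        x++;
    } while (x < 0);
}
```

Both functions should produce identical output for any row; the second form
just saves the explicit compare against w on each iteration.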

I also encourage you to write a checkasm unit test; that will make it
easier to both benchmark your code and verify its correctness.