[FFmpeg-devel] [PATCH v7] libavfilter/x86/vf_convolution: add sobel filter optimization and unit test with intel AVX512 VNNI

Mon Nov 14 14:54:15 EET 2022

On 11/4/2022 5:29 AM, bin.wang-at-intel.com at ffmpeg.org wrote:
> +.loop2:
> +    xor  rd, rd
> +    pxor m4, m4
> +
> +    ;Gx
> +    SOBEL_MUL 0, data_n1
> +    SOBEL_MUL 1, data_n2
> +    SOBEL_MUL 2, data_n1
> +    SOBEL_ADD 6
> +    SOBEL_MUL 7, data_p2
> +    SOBEL_ADD 8
> +
> +    cvtsi2ss xmm4, rd
> +    mulss    xmm4, xmm4
> +
> +    xor rd, rd
> +    ;Gy
> +    SOBEL_MUL 0, data_n1
> +    SOBEL_ADD 2
> +    SOBEL_MUL 3, data_n2
> +    SOBEL_MUL 5, data_p2
> +    SOBEL_MUL 6, data_n1
> +    SOBEL_ADD 8
> +
> +    cvtsi2ss  xmm5, rd
> +    fmaddss xmm4, xmm5, xmm5, xmm4
> +
> +    sqrtps    xmm4, xmm4
> +    fmaddss   xmm4, xmm4, xmm0, xmm1     ;sum = sum * rdiv + bias

By using xmm# you're not taking into account any x86inc SWAPping, so this 
is using xmm0 and xmm1, where the scalar float input arguments 
reside (at least on unix64), instead of xm0 and xm1 (xmm16 and xmm17), 
where the broadcasted scalars were stored.
This, again, only works by chance on unix64 because you're using scalar 
fmadd, and it shouldn't work at all on win64.

Also, all of these are, as is, being encoded as VEX, not EVEX, but it 
should be fine to leave them untouched instead of using xm#, since they 
will be shorter (five bytes instead of six for some) by using the lower, 
non-callee-saved regs.

> +    cvttps2dq xmm4, xmm4     ; trunc to integer
> +    packssdw  xmm4, xmm4
> +    packuswb  xmm4, xmm4
> +    movd      rd, xmm4
> +    mov       [dstq + xq], rb
> +
> +    add xq, 1
> +    cmp xq, widthq
> +    jl .loop2
> +.end:
> +    RET