[FFmpeg-devel] [PATCH] lavfi/removegrain: add x86 and x86_64 SSE2 functions

Tue Jul 14 23:23:59 CEST 2015

On 14/07/15 3:54 PM, James Darnley wrote:
> On 2015-07-11 18:34, James Almer wrote:
>> On 11/07/15 10:40 AM, James Darnley wrote:
>>> @@ -566,7 +554,19 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
>>>          }
>>>  
>>>          *dst++ = *src++;
>>> -        for (x = 1; x < s->planewidth[i] - 1; x++) {
>>> +
>>> +        if (s->fl[i]) {
>>> +            int w_asm = (s->planewidth[i] - 2) & ~15;
>>
>> I don't thinks this belongs here. If you add an AVX2 version as you intended,
>> wouldn't this need to be 31?
>>
>> A wrapper function to have this code in the x86 folder may be the best option.
> 
> I was considering having the init_x86 function set a variable in the
> context to do this math based on how many pixels are actually done per
> iteration, meaning:
> - 8 for the unpacked words and sse2
> - 16 for the packed bytes and see2 or future unpacked words and avx2
> - 32 for the future packed bytes and avx2
> 
> That would let it be simply re-used if someone wanted to try writing a
> line function in C or other platform assembly.
> 
>>> new file mode 100644
>>> index 0000000..5e1feea
>>> --- /dev/null
>>> +++ b/libavfilter/x86/vf_removegrain.asm
>>> @@ -0,0 +1,1215 @@
>>> +;*****************************************************************************
>>> +;* x86-optimized functions for yadif filter
>>> +;*
>>> +;* Copyright (C) 2015 James Darnley
>>> +;*
>>> +;* This file is part of FFmpeg.
>>> +;*
>>> +;* TODO: gpl text goes here.
>>> +;*****************************************************************************
>>> +
>>> +; column: -1  0 +1
>>> +; row -1: a1 a2 a3
>>> +; row  0: a4  c a5
>>> +; row +1: a6 a7 a8
>>> +
>>> +%include "libavutil/x86/x86util.asm"
>>> +
>>> +SECTION_RODATA
>>
>> Declare it SECTION_RODATA 32 now so you don't forget when you add AVX2 versions.
> 
> Okay, I'm confused by this.  It doesn't align the constants you define
> on 32 byte boundaries if you happen to define the data incorrectly so...
>  Why does this matter?  Or what does it do?
> 

SECTION_RODATA 32 translates into "SECTION .rodata align=32", which according to the 
documentation makes sure the constants are aligned on a 32-byte boundary.
I noticed it doesn't do much if the constants are "out of order" if you will, but if
you're defining them right and you're going to use 32-byte movas with them, then i
don't see a reason to not declare the section properly.

>>> +
>>> +; The loop doesn't need to do all the iterations.  It could stop when the right
>>> +; pixels are in the right registers.
>>> +%macro SORT_SQUARE 0
>>> +    %assign k 7
>>> +    %rep 7
>>> +        %assign i 1
>>> +        %assign j 2
>>> +        %rep k
>>> +            SORT_PAIR m %+ i , m %+ j , m9
>>> +            %assign i i+1
>>> +            %assign j j+1
>>> +        %endrep
>>> +        %assign k k-1
>>> +    %endrep
>>> +%endmacro
>>> +
>>> +; %1 dest simd register
>>> +; %2 source (simd register/memory)
>>> +; %3 temp simd register
>>> +%macro ABS_DIFF 3
>>> +    mova %3, %2
>>> +    psubusb %3, %1
>>> +    psubusb %1, %2
>>> +    por %1, %3
>>> +%endmacro
>>> +
>>> +; %1 dest simd register
>>> +; %2 source (simd register/memory)
>>> +; %3 temp simd register
>>> +%macro ABS_DIFF_W 3
>>> +    mova %3, %2
>>> +    psubusw %3, %1
>>> +    psubusw %1, %2
>>> +    por %1, %3
>>> +%endmacro
>>
>> No way to achieve this using the pabs* ssse3 instructions?
> 
> Maybe.  I looked them up and see that I would need a subtraction anyway.
>  I wonder if I could measure a speed change.  I need to check the
> behaviour of them together with the subtract instructions.

For that matter, since in many cases you're calling two or more ABS_DIFF_W one after
the other, you could combine the instructions to minimize dependencies regardless of
using pabs or psubs + por like you're doing right now.
It may end up being faster that way.

> 
>>> +; %1 simd register that hold the mask and will hold the result
>>> +; %2 simd register that holds the "true" values
>>> +; %3 location of the "false" values (simd register/memory)
>>> +%macro BLEND 3 ; mask, true, false
>>> +    pand %2, %1
>>> +    pandn %1, %3
>>> +    por %1, %2
>>> +%endmacro
>>
>> This doesn't follow the sse4 blend instructions' semantic, so it will make adding AVX2
>> versions harder.
>>
>> Try instead
>>
>> %macro BLEND 3 ; false, true, mask
>>     pand      %2, %3
>>     pandn     %3, %1
>>     por       %3, %2
>>     SWAP      %1, %3
>> %endmacro
>>
>> Which will let you use "vpblendvb %1, %1, %2, %3" for the avx2 version.
>> SSE4's pblendvb is kinda crappy, requiring the mask to be in xmm0, so
>> adding an SSE4 version may not be worth it.
> 
> I applied your patches that you sent on IRC (this and the sort macro
> one).  Do you want me to put your name in the copyright header?
> 

Not really. small changes like those are IMO not enough to get credit in the header.
If you want you can add a "With contributions by" line in the commit message.

>>> diff --git a/libavfilter/x86/vf_removegrain_init.c b/libavfilter/x86/vf_removegrain_init.c
>>> new file mode 100644
>>> index 0000000..450185f
>>> --- /dev/null
>>> +++ b/libavfilter/x86/vf_removegrain_init.c
>>> @@ -0,0 +1,128 @@
>>> +#include "libavutil/attributes.h"
>>> +#include "libavutil/cpu.h"
>>> +#include "libavutil/x86/cpu.h"
>>> +#include "libavfilter/removegrain.h"
>>> +
>>> +void ff_rg_fl_mode_1_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_10_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_11_12_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_13_14_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_19_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_20_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_21_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_22_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +#if ARCH_X86_64
>>> +void ff_rg_fl_mode_2_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_3_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_4_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_5_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_6_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_7_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_8_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_9_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_15_16_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_17_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_18_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_23_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +void ff_rg_fl_mode_24_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int pixels);
>>> +#endif
> 
> A general question.  Is it preferred to have the function declarations
> and assignments out of order to reduce the number of #if ARCH_X86_64
> conditions?

IMO yes. The cleaner the better.

> 
>>> +av_cold void ff_removegrain_init_x86(RemoveGrainContext *rg)
>>> +{
>>> +    int cpu_flags = av_get_cpu_flags();
>>> +    int i;
>>> +
>>> +    for (i = 0; i < rg->nb_planes; i++) {
>>> +        switch (rg->mode[i]) {
>>> +            case 1:
>>> +                if (EXTERNAL_SSE2(cpu_flags))
>>
>> I think it will be much cleaner and simpler if you put the switch inside a single
>> EXTERNAL_SSE2 branch, rather than checking it for every mode.
> 
> You're probably right.  I was going to see what it looks like after AVX2
> and clean it up then but I can squash some changes into this patch if
> people prefer it.
> 
> 
> 
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>