[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
    Henrik Gramner 
    henrik at gramner.com
       
    Wed Jan 17 17:01:54 EET 2018
    
    
  
On Tue, Jan 16, 2018 at 11:33 PM, Martin Vignali
<martin.vignali at gmail.com> wrote:
> BLEND_INIT grainextract, 4
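You could also try processing twice as many pixels per iteration, which
might be more efficient, especially with avx2, since it avoids cross-lane
shuffles: punpcklbw/punpckhbw and packuswb all operate within 128-bit
lanes, so unpacking both halves and packing them back restores the
original byte order. The same applies to some of the other functions as well.
E.g. something like: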
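    pxor           m4, m4                ; m4 = 0, used to zero-extend bytes to words
    VBROADCASTI128 m5, [pw_128]          ; m5 = 128 in every word
.loop:
    movu           m1, [topq + xq]       ; load mmsize bytes of top
    movu           m3, [bottomq + xq]    ; load mmsize bytes of bottom
    punpcklbw      m0, m1, m4            ; low half of top -> words (per lane)
    punpckhbw      m1, m4                ; high half of top -> words
    punpcklbw      m2, m3, m4            ; low half of bottom -> words
    punpckhbw      m3, m4                ; high half of bottom -> words
    paddw          m0, m5                ; top + 128
    paddw          m1, m5
    psubw          m0, m2                ; top + 128 - bottom
    psubw          m1, m3
    packuswb       m0, m1                ; saturate to 0..255, repack in-lane
    mova  [dstq + xq], m0
    add            xq, mmsize
    jl .loop                             ; loop while xq is still negative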
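For reference, the per-pixel arithmetic of the loop above can be sketched in plain C. This is only a scalar model for checking the asm against, not code from the patch; clip_u8, grainextract_px and grainextract_row are hypothetical names chosen for illustration. The clamp stands in for the unsigned saturation that packuswb applies to each signed word.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to 0..255, mirroring the unsigned saturation done by packuswb. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* One pixel of the loop: top + 128 - bottom, saturated to a byte. */
static uint8_t grainextract_px(uint8_t top, uint8_t bottom)
{
    return clip_u8((int)top + 128 - (int)bottom);
}

/* Hypothetical helper: process one row the slow way, one byte at a time. */
static void grainextract_row(uint8_t *dst, const uint8_t *top,
                             const uint8_t *bottom, int width)
{
    assert(width >= 0);
    for (int x = 0; x < width; x++)
        dst[x] = grainextract_px(top[x], bottom[x]);
}
```

The intermediate value ranges from -127 to 383, so it fits comfortably in the signed 16-bit words the asm works with.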
> BLEND_INIT average, 3
pavgb should probably be more efficient than unpacking to words. It
rounds up, i.e. it computes (a + b + 1) >> 1, so some bitflipping
shenanigans might be required if you want to round down: inverting
both inputs and then the result gives (a + b) >> 1.
E.g. something like:
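    pcmpeqb        m2, m2                ; m2 = all ones (0xFF in every byte)
.loop:
    movu           m0, [topq + xq]
    movu           m1, [bottomq + xq]
    pxor           m0, m2                ; ~top
    pxor           m1, m2                ; ~bottom
    pavgb          m0, m1                ; (~top + ~bottom + 1) >> 1
    pxor           m0, m2                ; invert back: (top + bottom) >> 1
    mova  [dstq + xq], m0
    add            xq, mmsize
    jl .loop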
(optionally combining movu+pxor into a 3-arg pxor with avx since
memory operands can be unaligned in VEX-encoded instructions).
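The inversion trick can be checked exhaustively with a small scalar model in C. This is a sketch for verification only, not part of the patch; pavgb_px, avg_round_down_px and count_avg_mismatches are hypothetical names. It models pavgb on a single byte and compares the inverted form against (a + b) >> 1 for all 65536 input pairs.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pavgb for one byte: average, rounded up. */
static uint8_t pavgb_px(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* The trick from the loop: invert both inputs, average, invert the result. */
static uint8_t avg_round_down_px(uint8_t a, uint8_t b)
{
    return (uint8_t)~pavgb_px((uint8_t)~a, (uint8_t)~b);
}

/* Compare against (a + b) >> 1 for every pair; returns the mismatch count. */
static int count_avg_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (avg_round_down_px((uint8_t)a, (uint8_t)b) != ((a + b) >> 1))
                mismatches++;
    return mismatches;
}
```

It works because ~x is 255 - x for a byte, so the inner average is (511 - (a + b)) >> 1, and subtracting that from 255 leaves the floor of (a + b) / 2.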
    
    