[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
Henrik Gramner
henrik at gramner.com
Wed Jan 17 17:01:54 EET 2018
On Tue, Jan 16, 2018 at 11:33 PM, Martin Vignali
<martin.vignali at gmail.com> wrote:
> BLEND_INIT grainextract, 4
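You could also try doing twice as much per iteration, i.e. loading a
full register of bytes and unpacking both the low and high halves to
words. That might be more efficient, especially in avx2, since
punpck{l,h}bw and packuswb all operate within each 128-bit lane, so the
pack undoes the unpack ordering and no cross-lane shuffle is needed.
The same applies to some of the other functions as well.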
E.g. something like:
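    pxor           m4, m4               ; m4 = 0, used to zero-extend bytes
    VBROADCASTI128 m5, [pw_128]         ; m5 = 128 in every word
.loop:
    movu           m1, [topq + xq]
    movu           m3, [bottomq + xq]
    punpcklbw      m0, m1, m4           ; low bytes of top -> words
    punpckhbw      m1, m4               ; high bytes of top -> words
    punpcklbw      m2, m3, m4           ; low bytes of bottom -> words
    punpckhbw      m3, m4               ; high bytes of bottom -> words
    paddw          m0, m5               ; top + 128
    paddw          m1, m5
    psubw          m0, m2               ; top + 128 - bottom
    psubw          m1, m3
    packuswb       m0, m1               ; saturate to 0..255, back to bytes
    mova           [dstq + xq], m0
    add            xq, mmsize
    jl .loop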
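As a sanity check on the arithmetic, here is a rough scalar model in C of
what one byte of that loop does. It assumes the intended result is
clip(128 + top - bottom), which is what the asm above works out to per
byte. All function names are made up for illustration and are not from
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference (assumed semantics): clip(128 + top - bottom) to 0..255. */
uint8_t grainextract_ref(uint8_t top, uint8_t bottom)
{
    int v = 128 + top - bottom;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Model of one byte of the SIMD loop: zero-extend to 16 bits, do
 * wrapping word arithmetic, then saturate like packuswb does. */
uint8_t grainextract_simd_model(uint8_t top, uint8_t bottom)
{
    uint16_t t = top;                        /* punpck{l,h}bw with zero */
    uint16_t b = bottom;
    uint16_t w = (uint16_t)(t + 128);        /* paddw */
    w = (uint16_t)(w - b);                   /* psubw, wraps mod 2^16 */
    /* packuswb reads each word as signed before saturating. */
    int s = w >= 0x8000 ? (int)w - 0x10000 : (int)w;
    return s < 0 ? 0 : s > 255 ? 255 : (uint8_t)s;
}

/* Exhaustive comparison over all 256 x 256 input pairs. */
void grainextract_check_all(void)
{
    for (int t = 0; t < 256; t++)
        for (int b = 0; b < 256; b++)
            assert(grainextract_simd_model((uint8_t)t, (uint8_t)b) ==
                   grainextract_ref((uint8_t)t, (uint8_t)b));
}
```

The word intermediate stays within -127..383, so the signed saturation in
packuswb is enough to do the clipping on both ends.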
> BLEND_INIT average, 3
pavgb should probably be more efficient than unpacking to words. It
computes (a + b + 1) >> 1, i.e. it rounds up, so some bitflipping
shenanigans might be required if you want to round down.
E.g. something like:
    pcmpeqb m2, m2                  ; m2 = all ones
.loop:
    movu    m0, [topq + xq]
    movu    m1, [bottomq + xq]
    pxor    m0, m2                  ; ~top
    pxor    m1, m2                  ; ~bottom
    pavgb   m0, m1                  ; (~top + ~bottom + 1) >> 1
    pxor    m0, m2                  ; invert back: (top + bottom) >> 1
    mova    [dstq + xq], m0
    add     xq, mmsize
    jl .loop
(Optionally, with avx, movu+pxor can be combined into a 3-arg pxor,
since memory operands can be unaligned in VEX-encoded instructions.)
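To convince yourself that the complement trick really does round down,
here is a small scalar sketch in C that checks it for every pair of
bytes. pavgb_model is only a stand-in for the per-byte behaviour of the
instruction, and all names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte behaviour of pavgb: unsigned average, rounded up. */
uint8_t pavgb_model(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Complement both inputs, average rounding up, complement the result. */
uint8_t average_round_down(uint8_t top, uint8_t bottom)
{
    uint8_t r = pavgb_model((uint8_t)~top, (uint8_t)~bottom);
    return (uint8_t)~r;
}

/* Exhaustive check against the truncating average (a + b) >> 1. */
void average_check_all(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(average_round_down((uint8_t)a, (uint8_t)b) ==
                   ((a + b) >> 1));
}
```

This works because ~x = 255 - x for bytes, so rounding the average of the
complements up is the same as rounding the average of the originals down.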