[FFmpeg-devel] [PATCH] vp9: 16bpp tm/dc/h/v intra pred simd (mostly sse2) functions.
Ronald S. Bultje
rsbultje at gmail.com
Sat Oct 3 02:12:15 CEST 2015
Hi,
On Fri, Oct 2, 2015 at 5:31 PM, Henrik Gramner <henrik at gramner.com> wrote:
> On Fri, Sep 25, 2015 at 11:24 PM, Ronald S. Bultje <rsbultje at gmail.com>
> wrote:
> > +++ b/libavcodec/x86/vp9intrapred_16bpp.asm
>
> > +cglobal vp9_ipred_v_4x4_16, 2, 4, 1, dst, stride, l, a
> > +cglobal vp9_ipred_v_8x8_16, 2, 4, 1, dst, stride, l, a
> > +cglobal vp9_ipred_v_16x16_16, 2, 4, 2, dst, stride, l, a
> > +cglobal vp9_ipred_v_32x32_16, 2, 4, 4, dst, stride, l, a
>
> Those look pretty generic. Isn't some H.264 pred very similar if not
> identical? I didn't check, but if they are you can just use those
> instead.
Well, the prototype is different. For H/V it's not critical, but for the
directional predictors the edge handling is quite quirky, so I wanted to do
that in C; that's why l/a are arguments instead of part of the source buffer.
(And because we do in-loop filtering, using the h264 V function as-is won't
work: in vp9, a can be post-loopfilter, whereas h264 requires it to be
pre-loopfilter, and we don't swap in vp9.)
> +cglobal vp9_ipred_h_8x8_16, 3, 4, 5, dst, stride, l, a
>
> Seemed a bit inefficient, so I rewrote it. Around 2x as fast and uses fewer registers:
>
> cglobal vp9_ipred_h_8x8_16, 3, 3, 4, dst, stride, l, a
> mova m2, [lq]
> DEFINE_ARGS dst, stride, stride3
> lea stride3q, [strideq*3]
> punpckhwd m3, m2, m2
> pshufd m0, m3, q3333
> pshufd m1, m3, q2222
> mova [dstq+strideq*0], m0
> mova [dstq+strideq*1], m1
> pshufd m0, m3, q1111
> pshufd m1, m3, q0000
> mova [dstq+strideq*2], m0
> mova [dstq+stride3q ], m1
> lea dstq, [dstq+strideq*4]
> punpcklwd m2, m2
> pshufd m0, m2, q3333
> pshufd m1, m2, q2222
> mova [dstq+strideq*0], m0
> mova [dstq+strideq*1], m1
> pshufd m0, m2, q1111
> pshufd m1, m2, q0000
> mova [dstq+strideq*2], m0
> mova [dstq+stride3q ], m1
> RET
>
> > +cglobal vp9_ipred_h_16x16_16, 3, 4, 6, dst, stride, l, a
> > +cglobal vp9_ipred_h_32x32_16, 3, 5, 8, dst, stride, l, a
>
> Should be possible to change those to be more similar to the 8x8 above.
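Right. Assuming the left-edge array is laid out bottom-to-top as in your 8x8
version (so row 0 comes from the last word of l), 16x16 could collapse into a
four-iteration loop along these lines (untested sketch, register allocation
and the loop structure are my own):

cglobal vp9_ipred_h_16x16_16, 3, 5, 4, dst, stride, l, stride3, cnt
    mov           cntd, 3
    lea       stride3q, [strideq*3]
.loop:
    movh            m3, [lq+cntq*8]     ; next four left-column words
    punpcklwd       m3, m3              ; w0 w0 w1 w1 w2 w2 w3 w3
    pshufd          m0, m3, q3333
    pshufd          m1, m3, q2222
    mova [dstq+strideq*0+ 0], m0
    mova [dstq+strideq*0+16], m0
    mova [dstq+strideq*1+ 0], m1
    mova [dstq+strideq*1+16], m1
    pshufd          m0, m3, q1111
    pshufd          m1, m3, q0000
    mova [dstq+strideq*2+ 0], m0
    mova [dstq+strideq*2+16], m0
    mova [dstq+stride3q + 0], m1
    mova [dstq+stride3q +16], m1
    lea           dstq, [dstq+strideq*4]
    dec           cntd
    jge .loop
    RET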
>
> > +cglobal vp9_ipred_dc_4x4_16, 4, 4, 2, dst, stride, l, a
> [...]
> > + pshufw m1, m0, q3232
> > + paddd m0, m1
> > + paddd m0, [pd_4]
>
> Swap the last two instructions to let the shuffle and the pd_4 add
> execute in parallel. The same issue exists in pretty much every other
> dc function as well.
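Done. For reference, the reordered sequence looks like this (sketch):

    pshufw          m1, m0, q3232
    paddd           m0, [pd_4]          ; no longer waits on the pshufw result
    paddd           m0, m1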
>
> > +cglobal vp9_ipred_dc_32x32_16, 4, 4, 2, dst, stride, l, a
> [...]
> > +.loop:
> > + mova [dstq+strideq*0+ 0], m0
> > + mova [dstq+strideq*0+16], m0
> > + mova [dstq+strideq*0+32], m0
> > + mova [dstq+strideq*0+48], m0
> > + mova [dstq+strideq*1+ 0], m0
> > + mova [dstq+strideq*1+16], m0
> > + mova [dstq+strideq*1+32], m0
> > + mova [dstq+strideq*1+48], m0
> > + mova [dstq+strideq*2+ 0], m0
> > + mova [dstq+strideq*2+16], m0
> > + mova [dstq+strideq*2+32], m0
> > + mova [dstq+strideq*2+48], m0
> > + mova [dstq+stride3q + 0], m0
> > + mova [dstq+stride3q +16], m0
> > + mova [dstq+stride3q +32], m0
> > + mova [dstq+stride3q +48], m0
> > + lea dstq, [dstq+strideq*4]
> > + dec cntd
> > + jg .loop
>
> Cut the number of stores per iteration in half and double the number
> of iterations instead.
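Done, i.e. two rows per iteration (sketch; assumes cntd is initialized to
twice its previous value):

.loop:
    mova [dstq+strideq*0+ 0], m0
    mova [dstq+strideq*0+16], m0
    mova [dstq+strideq*0+32], m0
    mova [dstq+strideq*0+48], m0
    mova [dstq+strideq*1+ 0], m0
    mova [dstq+strideq*1+16], m0
    mova [dstq+strideq*1+32], m0
    mova [dstq+strideq*1+48], m0
    lea           dstq, [dstq+strideq*2]
    dec           cntd
    jg .loop

This also frees up stride3q in that loop.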
>
> > +cglobal vp9_ipred_dc_%1_32x32_16, 4, 4, 2, dst, stride, l, a
> [...]
> > +.loop:
> > + mova [dstq+strideq*0+ 0], m0
> > + mova [dstq+strideq*0+16], m0
> > + mova [dstq+strideq*0+32], m0
> > + mova [dstq+strideq*0+48], m0
> > + mova [dstq+strideq*1+ 0], m0
> > + mova [dstq+strideq*1+16], m0
> > + mova [dstq+strideq*1+32], m0
> > + mova [dstq+strideq*1+48], m0
> > + mova [dstq+strideq*2+ 0], m0
> > + mova [dstq+strideq*2+16], m0
> > + mova [dstq+strideq*2+32], m0
> > + mova [dstq+strideq*2+48], m0
> > + mova [dstq+stride3q + 0], m0
> > + mova [dstq+stride3q +16], m0
> > + mova [dstq+stride3q +32], m0
> > + mova [dstq+stride3q +48], m0
> > + lea dstq, [dstq+strideq*4]
> > + dec cntd
> > + jg .loop
>
> Ditto.
>
> > +cglobal vp9_ipred_tm_4x4_10, 4, 4, 6, dst, stride, l, a
> [...]
> > + movd m0, [aq-2]
> > + pshufw m0, m0, q0000
>
> Unaligned load penalty, either movd from [aq-4] or pshufw directly from
> [aq-8].
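In other words, either of these avoids the split load (sketch; assumes aq is
at least 8-byte aligned):

    movd            m0, [aq-4]          ; aligned 4-byte load
    pshufw          m0, m0, q1111       ; replicate word 1 = top-left pixel
    ; or, with an aligned 8-byte source operand:
    pshufw          m0, [aq-8], q3333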
>
> > +cglobal vp9_ipred_tm_8x8_10, 4, 4, 8, dst, stride, l, a
> [...]
> > + movd m0, [aq-2]
> > + pshuflw m0, m0, q0000
>
> Ditto, except you don't want to pshuflw directly from memory in this
> case unlike with MMX. You can use vpbroadcastw instead though if you
> want to write AVX2. This issue exists in multiple other places as
> well.
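So for the SSE2 path that means an aligned movd plus pshuflw from the
register, while AVX2 collapses the whole thing into one instruction (sketch;
same alignment assumption on aq as above):

    movd            m0, [aq-4]
    pshuflw         m0, m0, q1111       ; replicate word 1 = top-left pixel
    punpcklqdq      m0, m0              ; fill the high half too
    ; AVX2:
    vpbroadcastw    m0, [aq-2]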
>
> > + pshufhw m0, m4, q3333
> > + pshufhw m1, m4, q2222
> > + pshufhw m2, m4, q1111
> > + pshufhw m3, m4, q0000
> > + punpckhqdq m0, m0
> > + punpckhqdq m1, m1
> > + punpckhqdq m2, m2
> > + punpckhqdq m3, m3
>
> Use punpckhwd + pshufd instead, same as in vp9_ipred_h_8x8_16 above.
All done.
Ronald