[FFmpeg-devel] [PATCH] avcodec/vp9: add vp9_idct_idct_4x4_add_ssse3
Ronald S. Bultje
rsbultje at gmail.com
Tue Oct 29 11:36:16 CET 2013
Hi,
Nice work overall. Some suggestions for testing:
On Mon, Oct 28, 2013 at 3:56 PM, Clément Bœsch <u at pkh.me> wrote:
>
> +; (a*x + b*y + round) >> shift
> +%macro VP9_MULSUB_2W_2X 6 ; dst1, dst2, src (unchanged), round, coefs1, coefs2
> + movq m%1, [%5]
> + movq m%2, [%6]
> + pmaddwd m%1, m%3
> + pmaddwd m%2, m%3
> + paddd m%1, m%4
> + paddd m%2, m%4
> + psrad m%1, 14
> + psrad m%2, 14
> +%endmacro
> +
> +%macro VP9_IDCT4_1D 0
> + SUMSUB_BA w, 2, 0, 4
> + movq m4, [pw_11585x2]
> + pmulhrsw m0, m4 ; m0=t1
> + pmulhrsw m2, m4 ; m2=t0
> + movq m6, m3
> + punpckhwd m3, m1
> + VP9_MULSUB_2W_2X 4, 5, 3, 7, pw_t2_coef, pw_t3_coef
> + punpcklwd m6, m1
> + VP9_MULSUB_2W_2X 1, 3, 6, 7, pw_t2_coef, pw_t3_coef
> + packssdw m1, m4 ; m1=t2
> + packssdw m3, m5 ; m3=t3
>
So what you're doing here is splitting 8 words over 2 registers so we can
do paired multiplications etc. I wonder whether it'd be faster if (at least
for the full idct) we moved to XMM registers, so this would all be a single
register and the 2 halves could both be done in a single VP9_MULSUB_2W_2X.
You can do INIT_XMM ssse3 and INIT_MMX ssse3 inside functions to switch
between the two. Just make sure you manually back up xmm6-7 for Win64
(there's a utility function for that in x86inc.asm, ask if you need help).
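
To make the comparison concrete, here is a rough scalar C model of the two layouts. It is only a sketch and not code from the patch: the helper names (mulsub_model, interleave, mulsub_mmx_halves, mulsub_xmm) are made up for illustration, and the layout of pw_t2_coef/pw_t3_coef as repeated (a, b) pairs is an assumption, since their definitions are not quoted above. The point is that both paths produce the same four results, but the XMM-style path needs one interleave and one multiply instead of two of each.

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_14 8192 /* 1 << 13, the rounding term for a >> 14 */

/* Scalar model of pmaddwd followed by paddd/psrad: each adjacent pair of
 * signed 16-bit words is multiplied by a coefficient pair, summed into a
 * 32-bit lane, rounded and shifted. n_words is 4 for MMX and 8 for XMM. */
static void mulsub_model(int32_t *dst, const int16_t *src,
                         const int16_t *coef, int n_words)
{
    assert(n_words % 2 == 0);
    for (int i = 0; i < n_words / 2; i++) {
        int32_t sum = (int32_t)src[2 * i]     * coef[2 * i] +
                      (int32_t)src[2 * i + 1] * coef[2 * i + 1];
        /* psrad is an arithmetic shift; this assumes the C compiler also
         * shifts negative values arithmetically (true for GCC and Clang). */
        dst[i] = (sum + ROUND_14) >> 14;
    }
}

/* Model of punpcklwd/punpckhwd: interleave n words of x and y. */
static void interleave(int16_t *dst, const int16_t *x, const int16_t *y, int n)
{
    for (int i = 0; i < n; i++) {
        dst[2 * i]     = x[i];
        dst[2 * i + 1] = y[i];
    }
}

/* MMX-style: low and high halves are interleaved and multiplied separately. */
static void mulsub_mmx_halves(int32_t out[4], const int16_t x[4],
                              const int16_t y[4], int16_t a, int16_t b)
{
    const int16_t coef[4] = { a, b, a, b };
    int16_t lo[4], hi[4];
    interleave(lo, x,     y,     2); /* punpcklwd */
    interleave(hi, x + 2, y + 2, 2); /* punpckhwd */
    mulsub_model(out,     lo, coef, 4);
    mulsub_model(out + 2, hi, coef, 4);
}

/* XMM-style: one interleave and one multiply cover all four columns. */
static void mulsub_xmm(int32_t out[4], const int16_t x[4],
                       const int16_t y[4], int16_t a, int16_t b)
{
    const int16_t coef[8] = { a, b, a, b, a, b, a, b };
    int16_t all[8];
    interleave(all, x, y, 4);
    mulsub_model(out, all, coef, 8);
}
```

Calling both functions with the same inputs and coefficient pair should give identical output arrays; whether the XMM version is actually faster still has to be measured.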
> +%macro VP9_STORE_2X 2
> + movd m6, [dstq]
> + movd m7, [dstq+strideq]
> + punpcklbw m6, m4
> + punpcklbw m7, m4
> + paddw m6, %1
> + paddw m7, %2
> + packuswb m6, m4
> + packuswb m7, m4
> + movd [dstq], m6
> + movd [dstq+strideq], m7
> +%endmacro
>
Here too, using XMM could save you work. Once the bytes are unpacked to
words, an XMM register holds two 4-pixel rows, so pairing rows like this
lets two registers cover all 4 rows at once. And, as Kieran mentioned, the
zeroing itself could be 2 movdqa instructions instead of 4 movq. So perhaps
for the full IDCT, XMM does make sense?
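
For checking an XMM version of the store, a scalar C reference along these lines could be handy. This is a sketch under assumptions, not part of the patch: sat_u8 and add_4x4 are hypothetical names, and the residual block is assumed to be 16 contiguous int16_t values in row-major order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one packuswb lane: signed value to unsigned 8-bit, saturated. */
static uint8_t sat_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Adds a 4x4 block of 16-bit residuals to the destination pixels.
 * Mirrors punpcklbw (zero-extend) + paddw + packuswb for each row.
 * The sum is done in int here; paddw would wrap at 16 bits, which only
 * differs for residuals far outside the normal range. */
static void add_4x4(uint8_t *dst, ptrdiff_t stride, const int16_t res[16])
{
    assert(stride >= 4);
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            dst[row * stride + col] =
                sat_u8(dst[row * stride + col] + res[row * 4 + col]);
}
```

Any SIMD variant, whether it handles 2 or 4 rows per pass, should match this output byte for byte.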
Ronald