[FFmpeg-devel] [rfc] qualification task: SSE2 IDCT

Sun Mar 30 15:05:38 CEST 2008

On Sun, Mar 30, 2008 at 03:37:28AM -0400, Alexander Strange wrote:
[...]
> DECLARE_ASM_CONST(16, vector_int16_t, row_coeffs[6]) = {
>     {C4, 0, C4, 0, C4, 0, C4, 0}, //c0
>     {(1 << (ROW_SHIFT - 1)), 0, (1 << (ROW_SHIFT - 1)), 0, (1 << (ROW_SHIFT - 1)), 0, (1 << (ROW_SHIFT - 1)), 0}, //dc_add_row
>     {C2, 0, C6, 0, -C6, -1, -C2, -1}, //c2
>     {C1, C3, C3, -C7, C5, -C1, C7, -C5}, //c1+c3
>     {C4, C6, -C4, -C2, -C4, C2, C4, -C6}, //c4+c6
>     {C5, C7, -C1, -C5, C7, C3, C3, -C1}}; //c5+c7

Please align these columns vertically.

> 
> /*!
>  * @brief SSE2 simple_idct first pass, on rows
>  * @sa idctRowCondDC()
>  * An SSE2 version of idctRowCondDC() from simple_idct.c.
>  * The row is stored in xmm0, and each set of four multiply-
>  * add operations is converted to one pmaddwd.
>  */
> static inline void simple_idct_sse2_row(int16_t *dct)
> {
>     int16_t *row = dct;
>     asm volatile (
>     "movq "MANGLE(m1000)", %%mm2\n"
>     "pxor %%xmm1, %%xmm1\n"
>     "lea 128(%0), %%"REG_c"\n"

Align it like this:
"movq   "MANGLE(m1000)", %%mm2      \n"
"pxor   %%xmm1,          %%xmm1     \n"
"lea    128(%0),         %%"REG_c"  \n"

That makes the code easier to read ...

>     ".align 4,0x90\n"
>     "0:\n" 
>     "movdqa (%0), %%xmm0\n"
>     "movq (%0), %%mm1\n" // mask out the DC and check if it's zero
>     "movq %%mm2, %%mm0\n" // on core 2 this is actually faster in MMX than SSE
>     "pandn %%mm1, %%mm0\n" 
>     "por 8(%0), %%mm0\n"
>     "packssdw %%mm0, %%mm0\n"
>     "movd %%mm0, %%eax\n"
>     "testl %%eax, %%eax\n"
>     "jnz 1f\n"
>     "pshuflw $0, %%xmm0, %%xmm0\n" // skip the whole IDCT if all AC values are zero
>     "pshufd $0, %%xmm0, %%xmm0\n"
>     "psllw $3, %%xmm0\n"
>     "movdqa %%xmm0, (%0)\n"
> #if 0
>     "jmp 3f\n"
> #else
>     "add $16, %0\n"
>     "cmp %%"REG_c", %0\n"

>     "jb 0b\n"
>     "jmp 4f\n"

Hmm, somehow I do not think the code blocks are reasonably ordered.

> #endif
>     ".align 4,0x90\n"
>     "1:\n"
>     "pshuflw $0, %%xmm0, %%xmm2\n" 
>     "punpcklwd %%xmm1, %%xmm2\n" // (i0, 0,...)
>     "pmaddwd "MANGLE(row_coeffs)", %%xmm2\n"
>     "paddd "MANGLE(row_coeffs+16)", %%xmm2\n" // dc offset
>     "pshuflw $170, %%xmm0, %%xmm3\n" 
>     "punpcklwd %%xmm1, %%xmm3\n" // (i2, 0,...)
>     "pmaddwd "MANGLE(row_coeffs+32)", %%xmm3\n"
>     "paddd %%xmm3, %%xmm2\n"
>     "pshuflw $221, %%xmm0, %%xmm3\n" 
>     "pshufd $0, %%xmm3, %%xmm3\n" // (i1, i3,...)
>     "pmaddwd "MANGLE(row_coeffs+48)", %%xmm3\n"
>     "movq 8(%0), %%mm0\n"
>     "packssdw %%mm0, %%mm0\n"
>     "movd %%mm0, %%eax\n"
>     "testl %%eax, %%eax\n"
>     "jz 2f\n" // check if the last half of the AC is zero

Umm, this code needs work, a lot of work!
First and most important, do not try to implement the C IDCT in SSE2
line by line; that is useless. The C code is not ordered appropriately
for SIMD. Also, the C code is slow, so extra branches that skip work pay
off there; in SIMD code branches are relatively much more expensive
because the IDCT itself is faster.
You have too many branches, and especially too many unneeded ones.
Also read the simple_idct_mmx code! It's a much better reference for an
SSE2 implementation.
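
To make the branch point concrete, here is a rough scalar sketch (not
code from the patch and not from simple_idct_mmx) of a row pass built
from pmaddwd-style steps with no zero tests at all. pmaddwd_model(),
madd_pair() and row_pass_branchless() are hypothetical names, and the
C1..C7 and ROW_SHIFT values are assumed to match simple_idct.c since the
quoted patch does not show them. The coefficient rows follow the quoted
row_coeffs table. When the second half of a row is zero, the two extra
multiply-add steps only add zeros, so the result is the same as with the
"jz 2f" test; whether dropping that test is actually faster is something
only a benchmark can answer.

```c
#include <stdint.h>

/* Assumed constants: values as in simple_idct.c, not shown in the quoted patch. */
#define C1 22725
#define C2 21407
#define C3 19266
#define C4 16383
#define C5 12873
#define C6 8867
#define C7 4520
#define ROW_SHIFT 11

/* Scalar model of SSE2 pmaddwd: four 32-bit lanes, each the sum of two
 * signed 16x16-bit products of adjacent elements. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Broadcast the pair (x, y) to all four lanes, multiply-add it with one
 * coefficient row and accumulate the four results into acc. */
static void madd_pair(int32_t acc[4], int16_t x, int16_t y, const int16_t coeffs[8])
{
    const int16_t v[8] = { x, y, x, y, x, y, x, y };
    int32_t t[4];

    pmaddwd_model(v, coeffs, t);
    for (int i = 0; i < 4; i++)
        acc[i] += t[i];
}

/* Hypothetical helper: one row pass, every step executed unconditionally.
 * Assumes coefficients in the usual DCT range so the 32-bit sums do not
 * overflow. */
static void row_pass_branchless(int16_t row[8])
{
    static const int16_t c0[8]   = { C4,  0,  C4,   0,  C4,   0,  C4,   0 };
    static const int16_t c2[8]   = { C2,  0,  C6,   0, -C6,  -1, -C2,  -1 };
    static const int16_t c1c3[8] = { C1, C3,  C3, -C7,  C5, -C1,  C7, -C5 };
    static const int16_t c4c6[8] = { C4, C6, -C4, -C2, -C4,  C2,  C4, -C6 };
    static const int16_t c5c7[8] = { C5, C7, -C1, -C5,  C7,  C3,  C3, -C1 };
    const int32_t rnd = 1 << (ROW_SHIFT - 1);
    int32_t a[4] = { rnd, rnd, rnd, rnd };   /* even part, starts at the DC rounding offset */
    int32_t b[4] = { 0, 0, 0, 0 };           /* odd part */

    madd_pair(a, row[0], 0,      c0);
    madd_pair(a, row[2], 0,      c2);
    madd_pair(a, row[4], row[6], c4c6);      /* adds zeros if the second half is zero */
    madd_pair(b, row[1], row[3], c1c3);
    madd_pair(b, row[5], row[7], c5c7);      /* likewise */

    /* Butterfly: outputs 0..3 are a+b, outputs 7..4 are a-b. */
    for (int i = 0; i < 4; i++) {
        row[i]     = (int16_t)((a[i] + b[i]) >> ROW_SHIFT);
        row[7 - i] = (int16_t)((a[i] - b[i]) >> ROW_SHIFT);
    }
}
```

Calling row_pass_branchless() on each of the eight rows of a block gives
a first pass with a fixed instruction stream. Note that it does not
reproduce the DC-only shortcut of the quoted code bit for bit, because
that path uses a plain left shift by 3 instead of the C4 multiply.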

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Complexity theory is the science of finding the exact solution to an
approximation. Benchmarking OTOH is finding an approximation of the exact