[MPlayer-dev-eng] MPlayer and gcc ia32 intrinsics

Thu Apr 27 15:05:07 CEST 2006

Hi,

2005/11/23, Michael Niedermayer <michaelni at gmx.at>:
> Hi
>
> On Tue, Nov 22, 2005 at 11:09:42PM +0100, Aurelien Jacobs wrote:
> > On Tue, 22 Nov 2005 16:54:52 -0500
> > Jason Tackaberry <tack at sault.org> wrote:
> >
> > IIRC gcc is somewhat buggy about intrinsics and sometimes produce
> > very slow code.
>
> yes, heres a example:
>  typedef short mmxw  __attribute__ ((mode(V4HI)));
>  typedef int   mmxdw __attribute__ ((mode(V2SI)));
>
> mmxdw dw;
> mmxw w;
>
> void test(){
>    w+=w;
>    dw= (mmxdw)w;
> }
>
> gcc 3.4.0:
>        movq    w, %mm1
>        psllw   $1, %mm1
>        movq    %mm1, w
>        movq    w, %mm0
>        movq    %mm0, dw
>        ret
>
> human:
>        movq w, %mm1
>        paddw %mm1,%mm1
>        movq %mm1, w
>        movq %mm1,dw
>        ret
>
> gcc-4.1.0:
> test:   subl    $20, %esp
>        movl    w, %eax
>        movl    w+4, %edx
>        movl    %ebx, 8(%esp)
>        movl    %esi, 12(%esp)
>        movl    %eax, (%esp)
>        movl    %edx, 4(%esp)
>        movswl  (%esp),%esi
>        movl    %edi, 16(%esp)
>        movswl  4(%esp),%ecx
>        movswl  2(%esp),%edi
>        movswl  6(%esp),%ebx
>        addl    %esi, %esi
>        addl    %ecx, %ecx
>        movzwl  %si, %esi
>        sall    $17, %edi
>        movzwl  %cx, %ecx
>        sall    $17, %ebx
>        movl    %edi, %eax
>        movl    16(%esp), %edi
>        movl    %ebx, %edx
>        orl     %esi, %eax
>        movl    8(%esp), %ebx
>        orl     %ecx, %edx
>        movl    12(%esp), %esi
>        movl    %eax, w
>        movl    %edx, w+4
>        movl    w, %eax
>        movl    w+4, %edx
>        movl    %eax, dw
>        movl    %edx, dw+4
>        addl    $20, %esp
>        ret
>
>
> gcc 4.1.0/20051113 with x87/mmx mode switch patch produces:
> test:   movq    w, %mm0
>        paddw   %mm0, %mm0
>        movq    %mm0, w
>        movl    w, %eax
>        movl    w+4, %edx
>        movl    %eax, dw
>        movl    %edx, dw+4
>        emms
>        ret
> note, in this case there are partial memory stalls which are VERY slow, about
> 10-20 cpu cycles
> i think this demonstrates the problem
>
> examples taken from
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14552
>
> [...]

Maybe this isn't what Jason meant. This is gcc's vector extension.
IA32 intrinsics are like this (taken from a SSE3 optimized fft routine
from ffmpeg):

                __m128 a, b, c, t1, t2;

                a = *(__m128 *)p;
                b = *(__m128 *)q;

                /* complex mul */
                c = *(__m128 *)cptr;
                /* cre*re cre*im */
                t1 = _mm_mul_ps(b, _mm_moveldup_ps(c));
                /* cim*im cim*re */
                t2 = _mm_mul_ps(_mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3,
0, 1)), _mm_movehdup_ps(c));
                /* cre*re-cim*im cre*im+cim*re */
                b = _mm_addsub_ps(t1, t2);
                /* butterfly */
                *(__m128 *)p = _mm_add_ps(a, b);
                *(__m128 *)q = _mm_sub_ps(a, b);

Intrinsics produce as good code as inline asm, and it's (IMHO) easier
to read and maintain. Besides, codes written by intrinsics may get
optimized even more when compiled in x86_64 because of more xmm
registers. I personally encourage the use of intrinsics in future cpu
specific routines.

--
Zuxy
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6