[FFmpeg-devel] Amazing intrinsics improvements in gcc 4

Wed Mar 19 20:01:27 CET 2008

On Mar 19, 2008, at 2:43 PM, Michael Niedermayer wrote:
> On Wed, Mar 19, 2008 at 07:21:14PM +0100, Luca Barbato wrote:
>> Michael Niedermayer wrote:
>>> I thought some people here would be interested, as there were various claims
>>> about gcc's abilities and improvements posted here lately ...
>>
>> ------- Comment #23 From Uros Bizjak 2008-03-19 10:45 -------
>>
>> As said in PR 19161:
>>
>> The LCM infrastructure doesn't support mode switching in the way that would be
>> usable for emms. Additionally, there are MANY problems expected when sharing
>> x87 and MMX registers (i.e. handling of uninitialized x87 registers at the
>> beginning of the function - this is the reason we don't implement x87 register
>> passing ABI).
>>
>> Automatic MMX vectorization is not exactly a much usable feature nowadays (we
>> have SSE that works quite well here). Due to recent changes in MMX register
>> allocation area, excellent code is produced using MMX intrinsics, I'm closing
>> this bug as WONTFIX.
>>
>> Also, auto-vectorization would produce either MMX or SSE code, but not both of
>> them:
>>
>> #define UNITS_PER_SIMD_WORD (TARGET_SSE ? 16 : UNITS_PER_WORD)
>>
>> Seems Uros is fighting your battle and providing some interesting code.
>>
>> Still, the root of the problem is that x86 sucks.
>
> No, the root of the problem is that gcc devels are idiots.
> gcc has no business putting emms anywhere; that's the programmer's job,
> same as with free().
> If I do write SIMD code I do know what I am doing and do know I might have
> to execute emms; I absolutely don't want gcc to guess it behind my back.
>
> Also if i explicitly force gcc to use paddw:
> void test(){
>    w= __builtin_ia32_paddw(w,w);
>    dw= (mmxdw)w;
> }
> -----
> gcc-4.3 -mtune=pentium3 -march=pentium3 -fomit-frame-pointer -S -O3
> generates:
>        subl    $12, %esp
>        movq    w, %mm0
>        movq    %mm0, (%esp)
>        paddw   %mm0, %mm0
>        movq    %mm0, w
>        movl    w, %eax
>        movl    w+4, %edx
>        movl    %eax, dw
>        movl    %edx, dw+4
>        addl    $12, %esp
>        ret
> -----
> compared to
> gcc-3.4 -mtune=pentium3 -march=pentium3 -fomit-frame-pointer -S -O3
>        movq    w, %mm1
>        paddw   %mm1, %mm1
>        movq    %mm1, w
>        movq    w, %mm0
>        movq    %mm0, dw
>        ret
>
> So where is that "excellent code is produced using MMX intrinsics" ???

It's in gcc 4.4:
gcc version 4.4.0 20080318 (experimental) (GCC)
         subl    $12, %esp
         movq    _w, %mm0
         paddw   %mm0, %mm0
         movq    %mm0, _w
         movq    _w, %mm0
         movq    %mm0, _dw
         addl    $12, %esp
         ret

Actually, it was fixed because someone converted dsputil code into
intrinsics and complained on the mailing list that the result was
terrible.

For the version with +, it uses mm0 but not paddw - isn't that just as
unsafe?
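
For reference, a minimal sketch of what "the version with +" could look like, so the two variants can be compared with the same flags. The declarations of w and dw are not shown anywhere in this thread, so the typedefs below are assumptions (GCC vector extensions, 8 bytes wide), and test_plus is a made-up name:

```c
/* Hypothetical declarations: the thread never shows how w and dw are
   declared, so these 8-byte GCC vector types are an assumption. */
typedef short mmxw  __attribute__((vector_size(8)));  /* 4 x 16-bit lanes */
typedef int   mmxdw __attribute__((vector_size(8)));  /* 2 x 32-bit lanes */

mmxw  w;
mmxdw dw;

/* Same shape as test() above, with the generic vector + in place of
   __builtin_ia32_paddw. No emms is issued here; that is left to the caller. */
void test_plus(void)
{
    w  = w + w;      /* lane-wise 16-bit add */
    dw = (mmxdw)w;   /* same-size vector cast: reinterprets the bits */
}
```

Which instructions and registers the compiler picks for the + depends on the target and flags, so the -S output is the thing to look at rather than the source.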