[FFmpeg-devel] Mixed data type in SIMD code?

Wed Mar 5 10:05:27 CET 2008

On Wed, 5 Mar 2008, Zuxy Meng wrote:
>>
>> It looks to me that
>>
>> +        OP2(movhlps,  6,0, 7,1)\
>> +        OP2(addsd,    6,0, 7,1)\
>> +        "movsd   %%xmm0,    %2  \n\t"\
>> +        "movsd   %%xmm1,  8+%2  \n\t"\
>>
>> can be optimized to
>>
>>          haddpd %%xmm7, %%xmm6\n\t
>>          movapd %%xmm6, %2\n\t
>>
>> when SSE3 is available.
>
> Benchmarking only this piece of code (6 inst. SSE vs 2 inst SSE3), on
> a K8 SSE3 is merely about 1% faster but on a Prescott SSE3 is 85%
> faster. Don't have access to any Core 2 though.

The integer versions of hadd* are slower than any multi-instruction SSE2 
code on Core 2, so I didn't think to try it for float.
Sure enough, my code takes 4 cycles while haddpd;movapd takes 5.
movapd;punpckl;punpckh;addpd;movapd saves 1 instruction, but still takes 
4 cycles.
(I'm measuring the throughput of a bunch of copies, which isn't really what 
matters when there is only one copy, but I can't think of a better 
metric.)
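
For illustration only (this is not the code from the patch), the SSE2
unpack-and-add variant can be sketched with C intrinsics. The helper name
hsum_pairs and its array interface are hypothetical; the two inputs stand in
for xmm6 and xmm7, and the result has the layout that haddpd %%xmm7, %%xmm6
would leave in xmm6. A scalar fallback is included on the assumption that
the code may be built without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[0] = a[0] + a[1], out[1] = b[0] + b[1]. */
static void hsum_pairs(const double a[2], const double b[2], double out[2])
{
#if defined(__SSE2__)
    __m128d va = _mm_loadu_pd(a);           /* { a[0], a[1] } */
    __m128d vb = _mm_loadu_pd(b);           /* { b[0], b[1] } */
    __m128d lo = _mm_unpacklo_pd(va, vb);   /* { a[0], b[0] } */
    __m128d hi = _mm_unpackhi_pd(va, vb);   /* { a[1], b[1] } */
    _mm_storeu_pd(out, _mm_add_pd(lo, hi)); /* { a[0]+a[1], b[0]+b[1] } */
#else
    /* Scalar fallback for builds without SSE2. */
    out[0] = a[0] + a[1];
    out[1] = b[0] + b[1];
#endif
}
```

With SSE3 enabled, the two unpacks and the add could be replaced by a single
_mm_hadd_pd(va, vb), which is the haddpd form discussed above.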

Anyway, this code runs once, compared to the loop above with >1000 
iterations. I'd optimize it if there were a strictly better solution, but 
not if it requires multiple versions.

--Loren Merritt