[FFmpeg-devel] Mixed data type in SIMD code?
Loren Merritt
lorenm
Wed Mar 5 10:05:27 CET 2008
On Wed, 5 Mar 2008, Zuxy Meng wrote:
>>
>> It looks to me that
>>
>> + OP2(movhlps, 6,0, 7,1)\
>> + OP2(addsd, 6,0, 7,1)\
>> + "movsd %%xmm0, %2 \n\t"\
>> + "movsd %%xmm1, 8+%2 \n\t"\
>>
>> can be optimized to
>>
>> haddpd %%xmm7, %%xmm6\n\t
>> movapd %%xmm6, %2\n\t
>>
>> when SSE3 is available.
>
> Benchmarking only this piece of code (6 inst. SSE vs 2 inst SSE3), on
> a K8 SSE3 is merely about 1% faster but on a Prescott SSE3 is 85%
> faster. Don't have access to any Core 2 though.
The integer versions of hadd* are slower than any multi-instruction SSE2
code on Core 2, so I didn't think to try it for float.
Sure enough, my code takes 4 cycles while haddpd;movapd takes 5.
movapd;punpckl;punpckh;addpd;movapd saves 1 instruction, but still takes
4 cycles.
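
For readers who prefer intrinsics to the inline asm, the two approaches can be sketched roughly as follows. This is only an illustration, not the code from the patch: hsum_pair is a hypothetical helper name, the unpack variant assumes the double-precision forms (unpcklpd/unpckhpd), and unaligned loads and stores stand in for movapd so the sketch makes no alignment assumption. The SSE3 path is used only when the compiler defines __SSE3__ (e.g. with -msse3), and a scalar fallback keeps it buildable on other targets.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_unpacklo_pd, _mm_unpackhi_pd, _mm_add_pd */
#endif
#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3: _mm_hadd_pd */
#endif

/* Hypothetical helper: out[0] = a[0] + a[1], out[1] = b[0] + b[1]. */
static void hsum_pair(const double a[2], const double b[2], double out[2])
{
#if defined(__SSE3__)
    /* One horizontal add: haddpd yields {a0 + a1, b0 + b1}. */
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_hadd_pd(va, vb));
#elif defined(__SSE2__)
    /* Unpack low and high halves, then one vertical add. */
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d lo = _mm_unpacklo_pd(va, vb);   /* {a0, b0} */
    __m128d hi = _mm_unpackhi_pd(va, vb);   /* {a1, b1} */
    _mm_storeu_pd(out, _mm_add_pd(lo, hi));
#else
    /* Scalar fallback for targets without SSE2. */
    out[0] = a[0] + a[1];
    out[1] = b[0] + b[1];
#endif
}
```

Both SIMD paths produce the same two sums; which one is faster depends on the CPU, as the numbers in this thread show.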
(I'm measuring the throughput of a bunch of copies, which isn't really what
matters when there is only one copy, but I can't think of a better
metric.)
Anyway, this code runs once as compared to the loop above with >1000
iterations. I'd optimize it if there were a purely better solution, but
not if it requires multiple versions.
--Loren Merritt