[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Mon Apr 7 14:21:51 CEST 2008

On Sun, 6 Apr 2008, Pascal Massimino wrote:
> On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>
>>>     "movdqa   %%xmm2, ("dct")         \n\t" \
>>>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>>>     "psubsw   %%xmm6, %%xmm3          \n\t" \
>>>     "paddsw   %%xmm2, %%xmm6          \n\t" \
>>>     "movdqa   %%xmm6, %%xmm2          \n\t" \
>>>     "psubsw   %%xmm7, %%xmm6          \n\t" \
>>>     "paddsw   %%xmm2, %%xmm7          \n\t" \
>>>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>>>     "psubsw   %%xmm5, %%xmm3          \n\t" \
>>>     "paddsw   %%xmm2, %%xmm5          \n\t" \
>>>     "movdqa   %%xmm5, %%xmm2          \n\t" \
>>>     "psubsw   %%xmm0, %%xmm5          \n\t" \
>>>     "paddsw   %%xmm2, %%xmm0          \n\t" \
>>>     "movdqa   %%xmm3, %%xmm2          \n\t" \
>>>     "psubsw   %%xmm4, %%xmm3          \n\t" \
>>>     "paddsw   %%xmm2, %%xmm4          \n\t" \
>>>     "movdqa  ("dct"), %%xmm2          \n\t" \
>>
>> i suspect this can be written without the load/store by using
>> add,add,sub butterflies (of course only if it is faster)
>
>  iirc, i tried that and it's the same tick count using the add,add,sub
> butterfly. Plus, i may be wrong, but i recall that the saturations used
> with the 'regular' mov,add,sub butterfly help with nasty corner cases of
> overflow.

iirc, mov,add,sub is faster on core2 (if you have a spare register at 
least), same speed on k8, and slower on p4 (due to its ridiculous mov).
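
For anyone following along, here is a rough scalar sketch of the two
butterfly forms being compared, one 16-bit lane at a time. This is not code
from the patch: the function names (bfly_mov, bfly_aas_wrap, bfly_aas_sat,
butterfly_self_check) are made up for illustration, and the helpers only
model a single lane of paddsw/psubsw and paddw/psubw. It shows why, as far
as I can tell, the add,add,sub form only works with the wrapping
instructions: the intermediate 2*a may overflow even when a+b and a-b fit,
so saturating it corrupts the difference, while wrapping cancels out.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of paddsw/psubsw: 16-bit signed saturating add/sub. */
static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}
static int16_t adds(int16_t x, int16_t y) { return sat16((int32_t)x + y); }
static int16_t subs(int16_t x, int16_t y) { return sat16((int32_t)x - y); }

/* One lane of paddw/psubw: 16-bit add/sub that wraps modulo 2^16. */
static int16_t wrap16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}
static int16_t addw(int16_t x, int16_t y) { return wrap16((int32_t)x + y); }
static int16_t subw(int16_t x, int16_t y) { return wrap16((int32_t)x - y); }

/* mov,sub,add as in the quoted asm: needs a scratch register t.
 * Result: *a = a - b, *b = a + b, both saturated. */
static void bfly_mov(int16_t *a, int16_t *b)
{
    int16_t t = *a;          /* movdqa a, t */
    *a = subs(*a, *b);       /* psubsw b, a */
    *b = adds(t, *b);        /* paddsw t, b */
}

/* add,add,sub with wrapping ops: no scratch register.
 * Result: *a = a - b, *b = a + b, both modulo 2^16. */
static void bfly_aas_wrap(int16_t *a, int16_t *b)
{
    *b = addw(*b, *a);       /* paddw a, b : b = a + b           */
    *a = addw(*a, *a);       /* paddw a, a : a = 2a (may wrap)   */
    *a = subw(*a, *b);       /* psubw b, a : a = 2a - (a+b)      */
}

/* Same instruction order with saturating ops, to show the problem. */
static void bfly_aas_sat(int16_t *a, int16_t *b)
{
    *b = adds(*b, *a);
    *a = adds(*a, *a);       /* 2a clamps to 32767 here */
    *a = subs(*a, *b);
}

/* a = 20000, b = 1000: a+b and a-b both fit, but 2a does not. */
static void butterfly_self_check(void)
{
    int16_t a, b;

    a = 20000; b = 1000; bfly_mov(&a, &b);
    assert(a == 19000 && b == 21000);

    a = 20000; b = 1000; bfly_aas_wrap(&a, &b);
    assert(a == 19000 && b == 21000);   /* the wrap in 2a cancels */

    a = 20000; b = 1000; bfly_aas_sat(&a, &b);
    assert(a == 11767 && b == 21000);   /* clamped 2a gives a wrong a-b */
}
```

The flip side is that the wrapping form gives no protection when a+b
itself overflows (a = 30000, b = 10000 yields a sum of -25536 instead of
32767), which is presumably the corner case the saturating mov form covers.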

--Loren Merritt