[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Loren Merritt
lorenm
Mon Apr 7 14:21:51 CEST 2008
On Sun, 6 Apr 2008, Pascal Massimino wrote:
> On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>
>>> "movdqa %%xmm2, ("dct") \n\t" \
>>> "movdqa %%xmm3, %%xmm2 \n\t" \
>>> "psubsw %%xmm6, %%xmm3 \n\t" \
>>> "paddsw %%xmm2, %%xmm6 \n\t" \
>>> "movdqa %%xmm6, %%xmm2 \n\t" \
>>> "psubsw %%xmm7, %%xmm6 \n\t" \
>>> "paddsw %%xmm2, %%xmm7 \n\t" \
>>> "movdqa %%xmm3, %%xmm2 \n\t" \
>>> "psubsw %%xmm5, %%xmm3 \n\t" \
>>> "paddsw %%xmm2, %%xmm5 \n\t" \
>>> "movdqa %%xmm5, %%xmm2 \n\t" \
>>> "psubsw %%xmm0, %%xmm5 \n\t" \
>>> "paddsw %%xmm2, %%xmm0 \n\t" \
>>> "movdqa %%xmm3, %%xmm2 \n\t" \
>>> "psubsw %%xmm4, %%xmm3 \n\t" \
>>> "paddsw %%xmm2, %%xmm4 \n\t" \
>>> "movdqa ("dct"), %%xmm2 \n\t" \
>>
>> i suspect this can be written without the load/store by using
>> add,add,sub butterflies (of course only if it is faster)
>
> iirc, i tried that and it's the same tick count using the add,add,sub
> butterfly. Plus, i may be wrong, but i recall that the saturation used
> with the 'regular' mov,add,sub butterfly helps with nasty corner cases of
> overflow.
iirc, mov,add,sub is faster on core2 (if you have a spare register at
least), same speed on k8, and slower on p4 (due to its ridiculous mov).
--Loren Merritt
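
For readers comparing the two butterfly forms discussed above, here is a minimal scalar sketch in C. It models a single 16-bit lane of paddsw/psubsw and is not code from the patch: the helper names (sat16, adds16, subs16, butterfly_t, butterfly_mov_add_sub, butterfly_add_add_sub) are made up for illustration. The add,add,sub form shown is one plausible reading of the suggestion (b += a; a += a; a -= b), which needs no spare register. It gives the same results as mov,add,sub as long as nothing saturates, but the intermediate 2*a can clip even when a-b itself fits in 16 bits.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range, as paddsw/psubsw
 * do for each lane. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One lane of paddsw and psubsw (hypothetical helper names). */
static int16_t adds16(int16_t a, int16_t b) { return sat16((int32_t)a + b); }
static int16_t subs16(int16_t a, int16_t b) { return sat16((int32_t)a - b); }

typedef struct { int16_t sum, diff; } butterfly_t;

/* mov,add,sub: same order as the quoted asm, needs a spare register t. */
static butterfly_t butterfly_mov_add_sub(int16_t a, int16_t b)
{
    int16_t t = a;        /* movdqa a, t  */
    a = subs16(a, b);     /* psubsw b, a  -> a = a - b */
    b = adds16(b, t);     /* paddsw t, b  -> b = b + a_old */
    return (butterfly_t){ b, a };
}

/* add,add,sub: no spare register, but 2*a is formed as an intermediate. */
static butterfly_t butterfly_add_add_sub(int16_t a, int16_t b)
{
    b = adds16(b, a);     /* b = a + b */
    a = adds16(a, a);     /* a = 2a, clips if |a| is large */
    a = subs16(a, b);     /* a = 2a - (a + b) = a - b, unless 2a clipped */
    return (butterfly_t){ b, a };
}
```

With a = 20000 and b = 1000, both forms give a sum of 21000, but only mov,add,sub gives the exact difference of 19000. In add,add,sub the doubled value clips at 32767, so the difference comes out as 11767. With wrapping paddw/psubw instead, the add,add,sub form would be exact modulo 2^16, at the cost of losing saturation on the final results.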