[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Michael Niedermayer
michaelni
Sun Apr 6 23:10:50 CEST 2008
On Sun, Apr 06, 2008 at 09:39:57PM +0200, Pascal Massimino wrote:
> Hi,
>
> On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at>
> wrote:
>
> >
> > > skal agreed it could be under LGPL in the last thread.
> >
> yep
>
>
> >
> > [...]
> > > #define SKIP_ROW_CHECK(src) \
> > > "movq "src", %%mm0 \n\t" \
> > > "por 8+"src", %%mm0 \n\t" \
> > > "packssdw %%mm0, %%mm0 \n\t" \
> > > "movd %%mm0, %%eax \n\t" \
> > > "testl %%eax, %%eax \n\t" \
> > > "jz 1f \n\t"
> >
> > You could try to check pairs of rows, this might be faster for some rows.
> > Also the code should be interleaved, not form such nasty dependency chains;
> > you do have enough mmx registers.
>
>
> just a quick note: you can try doing the same with
> some 'pmovmskb mmreg, eax' instructions.
> However, this is a complex instruction and the speed gain
> is not necessarily obvious.
Great idea, I think it could be faster (with SSE registers) due to the
slowness of packssdw.
psadbw could be tried as well, as an alternative.
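
To make the two candidates concrete, here is a minimal sketch in C with SSE2
intrinsics rather than the inline asm of the patch. The function names are
hypothetical, and the sketch assumes that pmovmskb is used after a byte compare
against zero (pmovmskb alone only collects sign bits), which may not be exactly
what was meant above. Nothing here is benchmarked.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper, variant A: pcmpeqb + pmovmskb.
 * Returns 1 if all eight 16-bit coefficients of the row are zero. */
static int row_is_zero_movmsk(const int16_t *row)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)row);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xFF for each zero byte */
    return _mm_movemask_epi8(eq) == 0xFFFF;              /* pmovmskb: all 16 bytes matched */
}

/* Hypothetical helper, variant B: psadbw against zero.
 * psadbw sums the unsigned bytes of each 64-bit half, so both sums are
 * zero exactly when every byte of the row is zero. */
static int row_is_zero_sad(const int16_t *row)
{
    __m128i v   = _mm_loadu_si128((const __m128i *)row);
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());  /* sums in bits 0..15 and 64..79 */
    sad = _mm_or_si128(sad, _mm_srli_si128(sad, 8));     /* fold the high sum onto the low one */
    return _mm_cvtsi128_si32(sad) == 0;
}
```

Both variants handle a full row of eight coefficients in one register, so
neither needs the por/packssdw step of the MMX macro; which one is actually
faster would have to be measured.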
>
>
> >
> > [...]
> > > "movdqa %%xmm2, ("dct") \n\t" \
> > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > "psubsw %%xmm6, %%xmm3 \n\t" \
> > > "paddsw %%xmm2, %%xmm6 \n\t" \
> > > "movdqa %%xmm6, %%xmm2 \n\t" \
> > > "psubsw %%xmm7, %%xmm6 \n\t" \
> > > "paddsw %%xmm2, %%xmm7 \n\t" \
> > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > "psubsw %%xmm5, %%xmm3 \n\t" \
> > > "paddsw %%xmm2, %%xmm5 \n\t" \
> > > "movdqa %%xmm5, %%xmm2 \n\t" \
> > > "psubsw %%xmm0, %%xmm5 \n\t" \
> > > "paddsw %%xmm2, %%xmm0 \n\t" \
> > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > "psubsw %%xmm4, %%xmm3 \n\t" \
> > > "paddsw %%xmm2, %%xmm4 \n\t" \
> > > "movdqa ("dct"), %%xmm2 \n\t" \
> >
> > I suspect this can be written without the load/store by using
> > add,add,sub butterflies (of course only if it is faster)
>
>
> iirc, I tried that and it's the same tick count using the add,add,sub
> butterfly. Plus, I may be wrong, but I recall that the saturations used
> with the 'regular' mov,add,sub butterfly help for nasty corner cases of
> overflow.
Hmm, I don't see how.
The output of the IDCT is approximately within +-255, and due to rounding and
quantization it can be more; IIRC some standard specified +-384.
Now if we assume -384 .. +384 output, then tracing backward
we would have -24576 ... +24576 (384 << 6) before the >>6 and similarly
before any butterflies.
1. So none of the saturation cases in the current butterflies should ever
trigger (24576 is well inside the 16-bit range of -32768 .. +32767).
2. Due to 1. they are equivalent to paddw/psubw.
3. As two's complement numbers form an abelian group with respect to paddw/psubw,
we can apply the associative, commutative, inverse, and identity laws without
concern.
4. B= a + (-b)
   B= (a + 0) + (-b)              (identity)
   B= (a + (a + (-a))) + (-b)     (inverse)
   B= ((a + a) + (-a)) + (-b)     (associative)
   B= (a + a) + ((-a) + (-b))     (associative)
   B= (a + a) + (-(a + b))        ("product" of inverses)
At that point we just have
   A= a + b
   t= a + a
   B= t - A = a - b
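
The argument above can be checked numerically with a small scalar model. This
is only a sketch: the helper names are made up, one 16-bit lane stands in for
the eight lanes of an xmm register, and the +-24576 bound is the assumption
from the range estimate, not a measured property of the patch.

```c
#include <stdint.h>

#define IDCT_LIMIT (384 << 6)   /* 24576, the assumed bound before the >>6 */

/* One lane of paddw/psubw: wrapping 16-bit arithmetic.
 * The cast back to int16_t assumes the usual two's complement conversion. */
static int16_t add_wrap(int16_t a, int16_t b) { return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b); }
static int16_t sub_wrap(int16_t a, int16_t b) { return (int16_t)(uint16_t)((uint16_t)a - (uint16_t)b); }

/* One lane of paddsw/psubsw: saturating 16-bit arithmetic. */
static int16_t sat16(int32_t v)
{
    return v > INT16_MAX ? INT16_MAX : v < INT16_MIN ? INT16_MIN : (int16_t)v;
}
static int16_t add_sat(int16_t a, int16_t b) { return sat16((int32_t)a + b); }
static int16_t sub_sat(int16_t a, int16_t b) { return sat16((int32_t)a - b); }

/* mov,add,sub butterfly as in the quoted code: needs a copy of a. */
static void butterfly_mov_add_sub(int16_t a, int16_t b, int16_t *A, int16_t *B)
{
    int16_t copy = a;          /* movdqa: the extra temporary */
    *B = sub_sat(a, b);        /* psubsw */
    *A = add_sat(b, copy);     /* paddsw */
}

/* add,add,sub butterfly: A = a + b, t = a + a, B = t - A, no temporary. */
static void butterfly_add_add_sub(int16_t a, int16_t b, int16_t *A, int16_t *B)
{
    int16_t t;
    *A = add_wrap(a, b);       /* paddw */
    t  = add_wrap(a, a);       /* paddw, may wrap around */
    *B = sub_wrap(t, *A);      /* psubw, wraps back to a - b */
}

/* Counts input pairs whose outputs stay within +-limit but where the two
 * butterflies disagree. Step lets the caller thin out the search. */
static long count_mismatches(int limit, int step)
{
    long bad = 0;
    for (int a = -limit; a <= limit; a += step)
        for (int b = -limit; b <= limit; b += step) {
            int16_t A1, B1, A2, B2;
            if (a + b > limit || a + b < -limit || a - b > limit || a - b < -limit)
                continue;      /* outside the assumed output range */
            butterfly_mov_add_sub((int16_t)a, (int16_t)b, &A1, &B1);
            butterfly_add_add_sub((int16_t)a, (int16_t)b, &A2, &B2);
            if (A1 != A2 || B1 != B2)
                bad++;
        }
    return bad;
}
```

Calling count_mismatches(IDCT_LIMIT, 64) should return 0. Note that the
intermediate t = a + a can exceed 16 bits (for example a = 24576), so the
add,add,sub form relies on wrapping paddw/psubw; with saturating
instructions in that form the identity would no longer hold.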
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I have never wished to cater to the crowd; for what I know they do not
approve, and what they approve I do not know. -- Epicurus