[Ffmpeg-devel] [RFC] VC1 Transform in AltiVec

Wed Jul 19 06:23:59 CEST 2006

On Tue, Jul 18, 2006 at 12:05:58PM +0200, Michael Niedermayer wrote:
> Hi
> 
> On Tue, Jul 18, 2006 at 06:46:23AM +0300, Kostya wrote:
> > Here is my first attempt to optimize something with processor-specific instructions.
> > A patch to vc1.c provided.
> > 
> > Please note that:
> > a) It is AltiVec-only, so don't try to compile it on x86 or on a machine without AltiVec support
> > b) It's just a hack to demonstrate that it works; in the future this will go into ppc/vc1_altivec.c
> > 
> > TRANSPOSE8() macro was taken from ppc/mpegvideo_altivec.c
> > 
> > I'd like to hear from people who know this stuff whether I took the right approach (and any
> > further suggestions for optimization).
> > 
> > MMX version will follow.
> 
> > --- vc1_svn.c	2006-07-16 07:47:53.000000000 +0300
> > +++ vc1.c	2006-07-17 19:09:12.000000000 +0300
> > @@ -716,6 +716,192 @@
> >      return 0;
> >  }
> >  
> > +#define TRANSPOSE8(a,b,c,d,e,f,g,h) \
> > +do { \
> > +    __typeof__(a)  _A1, _B1, _C1, _D1, _E1, _F1, _G1, _H1; \
> > +    __typeof__(a)  _A2, _B2, _C2, _D2, _E2, _F2, _G2, _H2; \
> 
> stuff beginning with _ followed by an uppercase letter is reserved in C ...

As I stated, that's not my code. It looks like __typeof__ is used there to declare temporaries
with the same type as the macro arguments.
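
For illustration, the same __typeof__ idiom can be written without reserved names. This is only
a sketch: SWAP_SAME_TYPE and swap_tmp_ are hypothetical names that are not part of the patch, and
__typeof__ is a GCC/Clang extension rather than ISO C99.

```c
#include <assert.h>

/* Hypothetical helper, not from the patch: swaps two lvalues of the same type.
 * The temporary gets its type from the first argument via __typeof__
 * (GCC/Clang extension). Its name ends with an underscore instead of starting
 * with one, so it stays out of the namespace reserved for the implementation. */
#define SWAP_SAME_TYPE(a, b) \
do { \
    __typeof__(a) swap_tmp_ = (a); \
    (a) = (b); \
    (b) = swap_tmp_; \
} while (0)

/* Small self-check: the macro works for any assignable type. */
static void swap_selfcheck(void)
{
    short  x = 3,   y = -7;
    double p = 1.5, q = 2.5;

    SWAP_SAME_TYPE(x, y);
    SWAP_SAME_TYPE(p, q);

    assert(x == -7 && y == 3);
    assert(p == 2.5 && q == 1.5);
}
```

A trailing underscore can still clash with a caller's variable of the same name, so this only
addresses the reserved-identifier point, not macro hygiene in general.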

> 
[...]
> 
> > +    ssrc7 = vec_ld(112, block);
> > +
> > +    TRANSPOSE8(ssrc0, ssrc1, ssrc2, ssrc3, ssrc4, ssrc5, ssrc6, ssrc7);
> 
> the TRANSPOSE is unneeded, the scantables can be transposed to get the same
> effect

I'm not sure about this. The transpose looks like the simplest way to do the horizontal
transform with AltiVec.
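
As a sketch of what the suggestion would mean in scalar terms: if every entry of a scan table has
its row and column swapped, coefficients stored through that table land in an already transposed
block. The permutation below is a hypothetical stand-in, not the real VC1 scan order, and all
function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Swap row and column of a raster index inside an 8x8 block. */
static uint8_t transpose_index(uint8_t idx)
{
    return (uint8_t)(((idx & 7) << 3) | (idx >> 3));
}

/* Build the transposed version of a 64-entry scan table. */
static void transpose_scantable(const uint8_t src[64], uint8_t dst[64])
{
    for (int i = 0; i < 64; i++)
        dst[i] = transpose_index(src[i]);
}

/* Returns 1 if storing through the transposed table yields the transposed block. */
static int scantable_demo(void)
{
    uint8_t scan[64], scan_t[64];
    int16_t a[64] = {0}, b[64] = {0};

    /* Hypothetical scan order: 37 is odd, so i -> (37*i + 11) mod 64
     * visits every position exactly once. */
    for (int i = 0; i < 64; i++)
        scan[i] = (uint8_t)((i * 37 + 11) & 63);
    transpose_scantable(scan, scan_t);

    for (int i = 0; i < 64; i++) {
        int16_t coeff = (int16_t)(i - 20);   /* arbitrary test coefficients */
        a[scan[i]]   = coeff;                /* normal layout */
        b[scan_t[i]] = coeff;                /* transposed layout */
    }

    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            if (b[c * 8 + r] != a[r * 8 + c])
                return 0;
    return 1;
}

static void scantable_selfcheck(void)
{
    assert(scantable_demo() == 1);
}
```

The table would be transposed once at init time, so the per-block TRANSPOSE8 cost would go away.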
> 
> 
[...]
> > +
> > +    STEP8(s0, s1, s2, s3, s4, s5, s6, s7, vec_4);
> > +    SHIFT_HOR(s0, s1, s2, s3, s4, s5, s6, s7);
> > +    STEP8(s8, s9, sA, sB, sC, sD, sE, sF, vec_4);
> > +    SHIFT_HOR(s8, s9, sA, sB, sC, sD, sE, sF);
> 
> the horizontal transform fits in 16 bits as is, so no unpack/pack is needed

Oh, that's nice.

[...]
> > +    sA = vec_unpackh(ssrc2);
> > +    sB = vec_unpackh(ssrc3);
> > +    sC = vec_unpackh(ssrc4);
> > +    sD = vec_unpackh(ssrc5);
> > +    sE = vec_unpackh(ssrc6);
> > +    sF = vec_unpackh(ssrc7);
> > +    STEP8(s0, s1, s2, s3, s4, s5, s6, s7, vec_4);
> > +    SHIFT_VERT(s0, s1, s2, s3, s4, s5, s6, s7);
> > +    STEP8(s8, s9, sA, sB, sC, sD, sE, sF, vec_4);
> > +    SHIFT_VERT(s8, s9, sA, sB, sC, sD, sE, sF);
> 
> the vertical transform can also be done in 16 bits, though it's a little trickier
> 
>             t1 = 6 * (src[ 0] + src[32]);
>             t2 = 6 * (src[ 0] - src[32]);
>             t3 = 8 * src[16] +  3 * src[48];
>             t4 = 3 * src[16] -  8 * src[48];
> 
>             t5 = t1 + t3;
>             t6 = t2 + t4;
>             t7 = t2 - t4;
>             t8 = t1 - t3;
> 
>             t1 = (8 * src[ 8] + 8 * src[24] + 4 * src[40] + 2 * src[56]) + ((- src[24] + src[40])>>1);
>             t2 = (8 * src[ 8] - 2 * src[24] - 8 * src[40] - 4 * src[56]) + ((- src[ 8] - src[56])>>1);
>             t3 = (4 * src[ 8] - 8 * src[24] + 2 * src[40] + 8 * src[56]) + ((  src[ 8] - src[56])>>1);
>             t4 = (2 * src[ 8] - 4 * src[24] + 8 * src[40] - 8 * src[56]) + ((- src[24] - src[40])>>1);
> 
>             dst[ 0] = (t5 + t1 + 32) >> 6;
>             dst[ 8] = (t6 + t2 + 32) >> 6;
>             dst[16] = (t7 + t3 + 32) >> 6;
>             dst[24] = (t8 + t4 + 32) >> 6;
>             dst[32] = (t8 - t4 + 32) >> 6;
>             dst[40] = (t7 - t3 + 32) >> 6;
>             dst[48] = (t6 - t2 + 32) >> 6;
>             dst[56] = (t5 - t1 + 32) >> 6;
> 
> it's also interesting to note that microsoft must be aware of this, given the
> way rounding is done on the second half of the coeffs, but they apparently
> don't mention it in the spec ... i am wondering what other stuff they have
> hidden ...
> 
> and the + 32 can be added to t1/t2 instead of at the end
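
The halved form above can be checked against a full-precision column pass with a small scalar
test. This is a sketch under assumptions: the reference below (coefficients 12/16/6 and
16/15/9/4, rounder 64, shift by 7, an extra +1 on the last four outputs) is not quoted anywhere
in this thread, the column is stored contiguously as s[0..7] instead of with stride 8, and
everything is computed in int, so it checks only the arithmetic, not the 16-bit range.

```c
#include <assert.h>

/* Assumed full-precision reference (not quoted in this thread).
 * Note: >> on a negative int is implementation-defined in ISO C;
 * this sketch assumes the usual arithmetic shift, as GCC provides. */
static void col_ref(const int s[8], int d[8])
{
    int t1 = 12 * (s[0] + s[4]) + 64;
    int t2 = 12 * (s[0] - s[4]) + 64;
    int t3 = 16 * s[2] +  6 * s[6];
    int t4 =  6 * s[2] - 16 * s[6];
    int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;

    t1 = 16 * s[1] + 15 * s[3] +  9 * s[5] +  4 * s[7];
    t2 = 15 * s[1] -  4 * s[3] - 16 * s[5] -  9 * s[7];
    t3 =  9 * s[1] - 16 * s[3] +  4 * s[5] + 15 * s[7];
    t4 =  4 * s[1] -  9 * s[3] + 15 * s[5] - 16 * s[7];

    d[0] = (t5 + t1) >> 7;      d[7] = (t5 - t1 + 1) >> 7;
    d[1] = (t6 + t2) >> 7;      d[6] = (t6 - t2 + 1) >> 7;
    d[2] = (t7 + t3) >> 7;      d[5] = (t7 - t3 + 1) >> 7;
    d[3] = (t8 + t4) >> 7;      d[4] = (t8 - t4 + 1) >> 7;
}

/* The halved form from the mail above, with src[8*k] written as s[k]. */
static void col_half(const int s[8], int d[8])
{
    int t1 = 6 * (s[0] + s[4]);
    int t2 = 6 * (s[0] - s[4]);
    int t3 = 8 * s[2] + 3 * s[6];
    int t4 = 3 * s[2] - 8 * s[6];
    int t5 = t1 + t3, t6 = t2 + t4, t7 = t2 - t4, t8 = t1 - t3;

    t1 = (8 * s[1] + 8 * s[3] + 4 * s[5] + 2 * s[7]) + ((-s[3] + s[5]) >> 1);
    t2 = (8 * s[1] - 2 * s[3] - 8 * s[5] - 4 * s[7]) + ((-s[1] - s[7]) >> 1);
    t3 = (4 * s[1] - 8 * s[3] + 2 * s[5] + 8 * s[7]) + (( s[1] - s[7]) >> 1);
    t4 = (2 * s[1] - 4 * s[3] + 8 * s[5] - 8 * s[7]) + ((-s[3] - s[5]) >> 1);

    d[0] = (t5 + t1 + 32) >> 6;  d[7] = (t5 - t1 + 32) >> 6;
    d[1] = (t6 + t2 + 32) >> 6;  d[6] = (t6 - t2 + 32) >> 6;
    d[2] = (t7 + t3 + 32) >> 6;  d[5] = (t7 - t3 + 32) >> 6;
    d[3] = (t8 + t4 + 32) >> 6;  d[4] = (t8 - t4 + 32) >> 6;
}

/* Compare both forms on pseudo-random columns; returns the number of differing outputs. */
static int count_mismatches(int trials)
{
    unsigned seed = 12345u;
    int bad = 0;

    for (int n = 0; n < trials; n++) {
        int s[8], a[8], b[8];
        for (int i = 0; i < 8; i++) {
            seed = seed * 1103515245u + 12345u;          /* simple LCG */
            s[i] = (int)((seed >> 16) & 0xFFF) - 2048;   /* range [-2048, 2047] */
        }
        col_ref(s, a);
        col_half(s, b);
        for (int i = 0; i < 8; i++)
            bad += (a[i] != b[i]);
    }
    return bad;
}

static void check_halved_form(void)
{
    assert(count_mismatches(100000) == 0);
}
```

If the assumed reference is right, the >>1 terms round the odd part down, so subtracting it in
the last four outputs rounds up by half a unit, which would account for the +1 there.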

Well, here is my version converted back to C:

t1 = ((src[0] + src[4]) << 2) * 3 + 4;
t2 = ((src[0] - src[4]) << 2) * 3 + 4;
t3 = ((src[6] * 3) << 1) + (src[2] << 4);
t4 = ((src[2] * 3) << 1) - (src[6] << 4);

t5 = t1 + t3;
t6 = t2 + t4;
t7 = t2 - t4;
t8 = t1 - t3;

// t1 = 16 * src[1] + 15 * src[3] + 9 * src[5] + 4 * src[7]
t1 = ((((((src[1] + src[3]) << 1) + src[5]) << 1) + src[7]) << 2) + src[5] - src[3];
...etc
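
A quick scalar check that the nested form really gives the coefficients in the comment. This is
a sketch, not part of the patch: the function names are made up, and * 2 / * 4 stand in for
<< 1 / << 2 because left-shifting a negative int is formally undefined in ISO C (a compiler
turns these multiplications into shifts anyway).

```c
#include <assert.h>

/* Plain multiply form: the coefficients from the comment above. */
static int odd_t1_mul(int s1, int s3, int s5, int s7)
{
    return 16 * s1 + 15 * s3 + 9 * s5 + 4 * s7;
}

/* Nested shift/add form, with * 2 and * 4 in place of << 1 and << 2. */
static int odd_t1_nested(int s1, int s3, int s5, int s7)
{
    return ((((s1 + s3) * 2 + s5) * 2 + s7) * 4) + s5 - s3;
}

/* Even part, multiply form: 16 * src[2] + 6 * src[6]. */
static int even_t3_mul(int s2, int s6)
{
    return 16 * s2 + 6 * s6;
}

/* Even part as written above: ((src[6] * 3) << 1) + (src[2] << 4). */
static int even_t3_nested(int s2, int s6)
{
    return (s6 * 3) * 2 + s2 * 16;
}

/* Exhaustive comparison over a small grid of positive and negative inputs. */
static void nested_forms_selfcheck(void)
{
    for (int a = -40; a <= 40; a += 5)
        for (int b = -40; b <= 40; b += 5) {
            assert(even_t3_mul(a, b) == even_t3_nested(a, b));
            for (int c = -40; c <= 40; c += 5)
                for (int d = -40; d <= 40; d += 5)
                    assert(odd_t1_mul(a, b, c, d) == odd_t1_nested(a, b, c, d));
        }
}
```

Expanding the nested form by hand gives 16*s1 + 16*s3 + 8*s5 + 4*s7 before the final
"+ s5 - s3", which is where the 15 and the 9 come from.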

> 
> [...]
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> In the past you could go to a library and read, borrow or copy any book
> Today you'd get arrested for mere telling someone where the library is
>