[Ffmpeg-devel] a little optim for a SSE version of H263_LOOP_FILTER

Sun Nov 12 21:15:13 CET 2006

   Hi Konstantin and all,

> On Fri, Nov 10, 2006 at 11:48:16PM +0100, skal wrote:
> >    btw, while i have the mike:
> > 
> >    seems to me the following replacement functions for 
> >    vc1_v_overlap_c() and vc1_h_overlap_c() in vc1dsp.c:31
> >    are likely to be faster (and bitwise equivalent of course)
> > 
> > static void vc1_v_overlap_c(uint8_t* src, int stride, int rnd)
> > {
> >     int i;
> >     for(i = 0; i < 8; i++) {
> >         const int a = src[-2*stride];
> >         const int b = src[-stride];
> >         const int c = src[0];
> >         const int d = src[stride];
> >         const int d1 = ( a-d       + 3 + rnd ) >> 3;
> >         const int d2 = ( a-d + b-c + 4 - rnd ) >> 3;
> >         src[-2*stride] = clip_uint8(a-d1);
> >         src[-stride]   = clip_uint8(b+d2);
> >         src[0]         = clip_uint8(c-d2);
> >         src[stride]    = clip_uint8(d+d1);
> >         src++;
> >     }
> > }
> > 
> >    but i might of course be wrong...
> 
> They are almost correct (it should be read 'b-d2' and 'c+d2' instead)

   oh! you're right. Typo.

> - except the rounding:
> original:
>  4-rnd
>  3+rnd
>  4-rnd
>  3+rnd
> yours:
>  -3-rnd
>  -4-rnd
>  4+rnd
>  3+rnd

   hmm... i don't think so. The minus sign ("-d1") matters here: subtracting a
   value rounded with 3+rnd gives the same result as adding one rounded with 4-rnd.
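
   To make the sign argument concrete, here is a small sketch that checks the
   rounding identity over every possible difference x = d - a. The function
   name is made up for illustration. It assumes that >> on a negative int is
   an arithmetic shift, which C leaves implementation-defined but which the
   filter code in this thread relies on anyway.

```c
/* Hypothetical helper: counts the (x, rnd) pairs where negating a value
 * rounded with "4 - rnd" differs from rounding -x with "3 + rnd".
 * Assumes >> on negative ints is an arithmetic (floor) shift. */
int count_rounding_mismatches(void)
{
    int rnd, x, mismatches = 0;

    for (rnd = 0; rnd <= 1; rnd++) {
        for (x = -255; x <= 255; x++) {      /* x stands for d - a */
            const int negated = -((x + 4 - rnd) >> 3);
            const int direct  = (-x + 3 + rnd) >> 3;
            if (negated != direct)
                mismatches++;
        }
    }
    return mismatches;
}
```

   With an arithmetic shift this should return 0, which is why "a - d1" with
   3+rnd inside d1 matches the original (7*a + d + 4 - rnd) >> 3.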

   Btw, the new values for 'a' and 'd' clearly don't need [0..255] clipping,
   since their kernels (7*a + d and a + 7*d) only have positive coeffs.
   And no update is needed for a pair when its d1 (resp. d2) is zero.

e.g. =>

static void vc1_v_overlap_c(uint8_t* src, int stride, int rnd)
{
    int i;
    for(i = 0; i < 8; i++) {
        const int a = src[-2*stride];
        const int b = src[-stride];
        const int c = src[0];
        const int d = src[stride];
        const int d1 = ( a-d       + 3 + rnd ) >> 3;
        const int d2 = ( a-d + b-c + 4 - rnd ) >> 3;
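        if (d1) {   /* outer taps: a-d1 and d+d1 always stay within 0..255 */
            src[-2*stride] = a-d1;
            src[stride]    = d+d1;
        }
        if (d2) {   /* inner taps can overshoot, so they keep the clip */
            src[-stride]   = clip_uint8(b-d2);
            src[0]         = clip_uint8(c+d2);
        }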
        src++;
    }
}
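
   The two range claims can be checked directly. The sketch below is only an
   illustration (both function names are invented here): the first function
   counts outer-tap results that would fall outside 0..255, the second tells
   whether the inner taps overshoot for a given input. Same assumption as the
   filter itself: >> on negative ints behaves as an arithmetic shift.

```c
/* Hypothetical helper: counts (a, d, rnd) triples for which the updated
 * outer taps a-d1 / d+d1 would leave the 0..255 range. */
int count_outer_tap_overflows(void)
{
    int rnd, a, d, overflows = 0;

    for (rnd = 0; rnd <= 1; rnd++)
        for (a = 0; a < 256; a++)
            for (d = 0; d < 256; d++) {
                const int d1    = (a - d + 3 + rnd) >> 3;
                const int new_a = a - d1;
                const int new_d = d + d1;
                if (new_a < 0 || new_a > 255 || new_d < 0 || new_d > 255)
                    overflows++;
            }
    return overflows;
}

/* Hypothetical helper: returns 1 when b-d2 or c+d2 leaves 0..255,
 * i.e. when the clip on the inner taps actually does something. */
int inner_tap_needs_clip(int a, int b, int c, int d, int rnd)
{
    const int d2    = (a - d + b - c + 4 - rnd) >> 3;
    const int new_b = b - d2;
    const int new_c = c + d2;
    return new_b < 0 || new_b > 255 || new_c < 0 || new_c > 255;
}
```

   count_outer_tap_overflows() should come back as 0, whereas an input such
   as a=0, b=c=d=255 makes b-d2 overshoot, so clip_uint8() has to stay on the
   two middle rows.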

   bye!

Skal

for the record, let's be pragmatic and check it by brute force:

#include <assert.h>
#include <stdio.h>

void Test_Overlap(void)
{
  int rnd, a,b,c,d;
  for(rnd=0; rnd<=1; ++rnd) {
    for(a=0; a<256; ++a) {
      for(b=0; b<256; ++b) {
        /* note: c starts at b and d starts at a, so only c>=b, d>=a is visited */
        for(c=b; c<256; ++c) {
          for(d=a; d<256; ++d) {
            const int v1 = (7*a + d + 4 - rnd) >> 3;
            const int v2 = (-a + 7*b + c + d + 3 + rnd) >> 3;
            const int v3 = (a + b + 7*c - d + 4 - rnd) >> 3;
            const int v4 = (a + 7*d + 3 + rnd) >> 3;
            const int d1 = ( a-d       + 3 + rnd ) >> 3;
            const int d2 = ( a-d + b-c + 4 - rnd ) >> 3;
            const int w1 = a-d1;
            const int w2 = b-d2;
            const int w3 = c+d2;
            const int w4 = d+d1;
            assert(v1==w1);
            assert(v2==w2);
            assert(v3==w3);
            assert(v4==w4);
          }
        }
      }
      printf(".");
    }
  }
}
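
The loop above starts c at b and d at a, so it does not visit every ordering
of the four samples. As a complementary check, here is a rough sketch that
runs both forms of the filter on pseudo-random 4x8 blocks and compares the
output bytes. The function names, the local clip_uint8() stand-in and the
little LCG are all assumptions made for this example, not code from vc1dsp.c.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for the clip_uint8() helper used in the thread. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference form: the weighted sums that appear as v1..v4 in Test_Overlap(). */
static void overlap_ref(uint8_t *src, int stride, int rnd)
{
    int i;
    for (i = 0; i < 8; i++) {
        const int a = src[-2*stride], b = src[-stride];
        const int c = src[0],         d = src[stride];
        src[-2*stride] = clip_uint8((7*a + d + 4 - rnd) >> 3);
        src[-stride]   = clip_uint8((-a + 7*b + c + d + 3 + rnd) >> 3);
        src[0]         = clip_uint8((a + b + 7*c - d + 4 - rnd) >> 3);
        src[stride]    = clip_uint8((a + 7*d + 3 + rnd) >> 3);
        src++;
    }
}

/* Difference form from the mail, with the b-d2 / c+d2 signs. */
static void overlap_opt(uint8_t *src, int stride, int rnd)
{
    int i;
    for (i = 0; i < 8; i++) {
        const int a = src[-2*stride], b = src[-stride];
        const int c = src[0],         d = src[stride];
        const int d1 = (a - d + 3 + rnd) >> 3;
        const int d2 = (a - d + b - c + 4 - rnd) >> 3;
        if (d1) {                       /* outer rows never need a clip */
            src[-2*stride] = (uint8_t)(a - d1);
            src[stride]    = (uint8_t)(d + d1);
        }
        if (d2) {                       /* inner rows may overshoot */
            src[-stride] = clip_uint8(b - d2);
            src[0]       = clip_uint8(c + d2);
        }
        src++;
    }
}

/* Runs both versions on `trials` pseudo-random blocks (rnd alternates
 * between 0 and 1) and returns how many blocks came out different. */
int count_block_mismatches(int trials)
{
    enum { STRIDE = 8, ROWS = 4 };
    uint8_t ref[ROWS * STRIDE], opt[ROWS * STRIDE];
    uint32_t seed = 12345u;             /* arbitrary LCG seed */
    int t, k, mismatches = 0;

    for (t = 0; t < trials; t++) {
        for (k = 0; k < ROWS * STRIDE; k++) {
            seed = seed * 1664525u + 1013904223u;
            ref[k] = opt[k] = (uint8_t)(seed >> 24);
        }
        /* src points at row 2, so rows -2..1 map onto the 4-row buffer */
        overlap_ref(ref + 2 * STRIDE, STRIDE, t & 1);
        overlap_opt(opt + 2 * STRIDE, STRIDE, t & 1);
        if (memcmp(ref, opt, sizeof ref) != 0)
            mismatches++;
    }
    return mismatches;
}
```

If the two forms really are bitwise equivalent, count_block_mismatches(1000)
should return 0. It is a sampling check only and does not replace the
exhaustive loop.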