[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#8

Thu Aug 30 20:59:25 CEST 2007

Hi

On Thu, Aug 30, 2007 at 08:42:51PM +0200, Balatoni Denes wrote:
> Hi!
> 
> New patch attached.
> 
> On Thursday 30 August 2007 01:25, Michael Niedermayer wrote:
> > > @@ -4045,6 +4049,13 @@
> > >    int accel = vis_level ();
> > >
> > >    if (accel & ACCEL_SPARC_VIS) {
> > > +      if(avctx->idct_algo==FF_IDCT_SIMPLEVIS){
> > > +                c->idct_put = ff_simple_idct_put_vis;
> > > +                c->idct_add = ff_simple_idct_add_vis;
> > > +                c->idct     = ff_simple_idct_vis;
> > > +                c->idct_permutation_type = FF_TRANSPOSE_IDCT_PERM;
> > > +      }
> > > +
> >
> > this should be 4 spaces indented
> 
> Yes, sorry about that.
> 
> 
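
For readers skimming the hunk above, the shape of that runtime selection can be sketched in plain C. This is only a minimal illustration with made-up names: DSPSketch, dsp_init_sketch, ACCEL_VIS, ALGO_SIMPLEVIS and the PERM_* values are hypothetical stand-ins, not the real DSPContext, vis_level() or FF_* definitions, and the two idct functions are empty placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real constants; values are arbitrary. */
enum { ALGO_AUTO = 0, ALGO_SIMPLEVIS = 1 };
enum { PERM_NONE = 0, PERM_TRANSPOSE = 1 };
#define ACCEL_VIS 0x1

typedef void (*idct_fn)(short *block);

/* Cut-down context: one function pointer plus the coefficient permutation. */
typedef struct {
    idct_fn idct;
    int     perm_type;
} DSPSketch;

/* Empty placeholders; the real routines would transform the block. */
static void idct_generic(short *block)    { (void)block; }
static void idct_vis_sketch(short *block) { (void)block; }

/* Start from the generic routine, then override it only when the CPU
 * reports the feature AND the caller asked for this particular idct. */
static void dsp_init_sketch(DSPSketch *c, int accel, int idct_algo)
{
    c->idct      = idct_generic;
    c->perm_type = PERM_NONE;

    if (accel & ACCEL_VIS) {
        if (idct_algo == ALGO_SIMPLEVIS) {
            c->idct      = idct_vis_sketch;
            c->perm_type = PERM_TRANSPOSE; /* input is expected transposed */
        }
    }
}
```

The permutation type is set together with the function pointer because, in this sketch, the two only make sense as a pair: a caller that picks the transposing idct must also feed it transposed coefficients.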
> > > +        "fbe 3f                        \n\t"\
> > > +        "nop                           \n\t"\
> >
> > you can move an instruction into the nop slot; according to the docs it
> > is always executed if the annul bit is not set, so the
> > fpadd16 %%f26, %%f2, %%f26 from above would be a choice
> > this applies to all the other nops as well
> 
> Ok, I did this.
> 
> > > +    /* 2. column */\
> > > +        "for %%f4, %%f6, %%f60         \n\t"\
> > > +        "fcmpd %%fcc0, %%f62, %%f60    \n\t"\
> >
> > the for and fcmpd can be moved up (with some distance from each other
> > to avoid the 10 cycle stall; you said all instructions have a latency
> > of 6 on the US T2). this should work because there's nothing touching
> > any of f4, f6, f60, f62, fcc above
> [...]
> > > +    /* 3. column */\
> > > +        "3:                             \n\t"\
> > > +        "for %%f8, %%f10, %%f60         \n\t"\
> > > +        "fcmpd %%fcc0, %%f62, %%f60     \n\t"\
> >
> > the for and fcmpd can similarly be moved up, you have to switch to fcc1
> > though to avoid a conflict with the above ones
> > this applies to the other for/fcmpd as well
> 
> You were right, all four floating point condition registers can be used - I 
> misunderstood the documentation. Now everything is moved up, and this did 
> lead to a measurable 3% speedup (as it should have) on "my" UltraSPARC IIIi!
> 
> > [...]
> >
> > > +        TRANSPOSE
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        PUTPIXELSCLAMPED("0")
> > > +        LOAD("%2+64")
> > > +        TRANSPOSE
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        PUTPIXELSCLAMPED("4")
> >
> > the SCALEROWS is unneeded, the fpack16 can do the downshift and a single
> > addition to the 0,0 coefficient before the idct or first column after the
> > transpose can compensate for the rounding difference
> >
> >
> > [...]
> >
> > > +        TRANSPOSE
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        ADDPIXELSCLAMPED("0")
> > > +        LOAD("%2+64")
> > > +        TRANSPOSE
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        ADDPIXELSCLAMPED("4")
> >
> > same here, the SCALEROWS can be avoided by changing the shift used in
> > fpack16 and the expansion value for the added pixels as well as adding a
> > bias with a single instruction further above
> 
> Ok, I did this too. I missed this before somehow.
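
The reason a single addition can replace the per-output rounding is that the transform is linear and the 0,0 (DC) coefficient contributes the same amount to every output. A minimal C sketch of that arithmetic follows. It is a toy under stated assumptions: the 4x4 weight table, W0 and SHIFT are invented for illustration and are not the simple_idct constants, and the plain `>>` only stands in for the downshift that fpack16 would perform in the patch.

```c
#include <assert.h>

#define N     4
#define SHIFT 5
#define W0    4   /* toy weight of the DC input; identical in every row */

/* Toy linear transform: column 0 is constant, like a DC basis function. */
static const int W[N][N] = {
    { W0,  3,  2,  1 },
    { W0,  1, -2, -3 },
    { W0, -1, -2,  3 },
    { W0, -3,  2, -1 },
};

/* Variant A: transform, then a separate rounding add and shift per output. */
static void transform_then_scale(const int in[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int k = 0; k < N; k++)
            acc += W[i][k] * in[k];
        out[i] = (acc + (1 << (SHIFT - 1))) >> SHIFT;
    }
}

/* Variant B: add the bias once to the DC input, then only shift.
 * W0 * bias equals the rounding constant (4 * 4 == 16), so every
 * accumulator is the same as in variant A just before the shift. */
static void bias_then_transform(const int in[N], int out[N])
{
    int tmp[N];
    for (int k = 0; k < N; k++)
        tmp[k] = in[k];
    tmp[0] += (1 << (SHIFT - 1)) / W0;

    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int k = 0; k < N; k++)
            acc += W[i][k] * tmp[k];
        out[i] = acc >> SHIFT;
    }
}

/* Returns 1 when both variants produce identical outputs for this input. */
static int variants_agree(const int in[N])
{
    int a[N], b[N];
    transform_then_scale(in, a);
    bias_then_transform(in, b);
    for (int i = 0; i < N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The fold is exact here only because W0 divides the rounding constant. With real coefficients the division may leave a remainder, which would be the "rounding difference" to compensate for.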

patch ok :)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates