[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#8
Michael Niedermayer
michaelni
Thu Aug 30 20:59:25 CEST 2007
Hi
On Thu, Aug 30, 2007 at 08:42:51PM +0200, Balatoni Denes wrote:
> Hi!
>
> New patch attached.
>
> On Thursday 30 August 2007 01:25, Michael Niedermayer wrote:
> > > @@ -4045,6 +4049,13 @@
> > > int accel = vis_level ();
> > >
> > > if (accel & ACCEL_SPARC_VIS) {
> > > + if(avctx->idct_algo==FF_IDCT_SIMPLEVIS){
> > > + c->idct_put = ff_simple_idct_put_vis;
> > > + c->idct_add = ff_simple_idct_add_vis;
> > > + c->idct = ff_simple_idct_vis;
> > > + c->idct_permutation_type = FF_TRANSPOSE_IDCT_PERM;
> > > + }
> > > +
> >
> > this should be 4 spaces indented
>
> Yes, sorry about that.
>
>
> > > + "fbe 3f \n\t"\
> > > + "nop \n\t"\
> >
> > you can move an instruction into the nop slot; according to the docs it is
> > always executed if the annul bit is not set, so the
> > fpadd16 %%f26, %%f2, %%f26 from above would be a choice
> > this applies to all the other nops as well
>
> Ok, I did this.
>
> > > + /* 2. column */\
> > > + "for %%f4, %%f6, %%f60 \n\t"\
> > > + "fcmpd %%fcc0, %%f62, %%f60 \n\t"\
> >
> > the for and fcmpd can be moved up (with some distance from each other,
> > so as to avoid the 10 cycle stall; you said all instructions have a latency
> > of 6 on the US T2). This should work because there's nothing touching any of
> > f4, f6, f60, f62 or fcc above
> [...]
> > > + /* 3. column */\
> > > + "3: \n\t"\
> > > + "for %%f8, %%f10, %%f60 \n\t"\
> > > + "fcmpd %%fcc0, %%f62, %%f60 \n\t"\
> >
> > the for and fcmpd can similarly be moved up; you have to switch to fcc1
> > though, to avoid a conflict with the ones above
> > this applies to the other for/fcmpd as well
>
> You were right, all four floating point condition registers can be used - I
> misunderstood the documentation. Now everything is moved up, and this did
> lead to a measurable 3% speedup (as it should have) on "my" UltraSPARC IIIi!
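
The for/fcmpd pairs discussed above OR two 64-bit registers together and compare the result against another register, so that a later fbe can skip work. A minimal portable C sketch of that idea follows, assuming the comparison is against zero and that each group is 8 consecutive 16-bit coefficients; the names group_is_zero and zero_flags and the 32-coefficient layout are hypothetical and not taken from the patch. Computing all four flags up front mirrors issuing the compares early into separate condition-code registers (fcc0 to fcc3) well before the branches that consume them.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 when all 8 coefficients are zero.
 * ORs the two 64-bit halves (like "for") and tests the result
 * against zero (like "fcmpd" against a register assumed to hold 0). */
static int group_is_zero(const int16_t coeffs[8])
{
    uint64_t lo, hi;
    memcpy(&lo, coeffs, sizeof lo);       /* coefficients 0..3 */
    memcpy(&hi, coeffs + 4, sizeof hi);   /* coefficients 4..7 */
    return (lo | hi) == 0;
}

/* Hypothetical helper: evaluate every test first, one result per flag,
 * so none of the later branches has to wait for its own comparison. */
static void zero_flags(const int16_t block[32], int flags[4])
{
    for (int g = 0; g < 4; g++)
        flags[g] = group_is_zero(block + 8 * g);
}
```

A caller would then branch on flags[g] where the assembly uses fbe on the matching fcc register.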
>
> > [...]
> >
> > > + TRANSPOSE
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + PUTPIXELSCLAMPED("0")
> > > + LOAD("%2+64")
> > > + TRANSPOSE
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + PUTPIXELSCLAMPED("4")
> >
> > the SCALEROWS is unneeded: the fpack16 can do the downshift, and a single
> > addition to the 0,0 coefficient (before the idct, or to the first column
> > after the transpose) can compensate for the rounding difference
> >
> >
> > [...]
> >
> > > + TRANSPOSE
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + ADDPIXELSCLAMPED("0")
> > > + LOAD("%2+64")
> > > + TRANSPOSE
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + ADDPIXELSCLAMPED("4")
> >
> > same here, the SCALEROWS can be avoided by changing the shift used in
> > fpack16 and the expansion value for the added pixels as well as adding a
> > bias with a single instruction further above
>
> Ok, I did this too. I missed this before somehow.
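
The rounding argument above relies on the transform being linear: a constant added to the DC input shows up, scaled by the DC weight, in every output. The sketch below illustrates this with a toy 4-point transform rather than the real IDCT; W0, SHIFT, the weight table and all function names are hypothetical values chosen so that the rounding constant is an exact multiple of the DC weight. When it is not an exact multiple, the compensation is only approximate.

```c
#include <stdint.h>

#define N     4
#define SHIFT 6    /* hypothetical final downshift */
#define W0    16   /* hypothetical DC weight; divides 1 << (SHIFT - 1) exactly */

/* Toy linear transform: every output receives W0 * in[0] plus other terms. */
static void toy_transform(const int32_t in[N], int32_t out[N])
{
    static const int32_t w[N][N] = {
        { W0,  21,  16,   9 },
        { W0,   9, -16, -21 },
        { W0,  -9, -16,  21 },
        { W0, -21,  16,  -9 },
    };
    for (int i = 0; i < N; i++) {
        int32_t acc = 0;
        for (int k = 0; k < N; k++)
            acc += w[i][k] * in[k];
        out[i] = acc;
    }
}

/* Separate scaling pass: add the rounding constant to every output, then shift. */
static void scale_per_output(const int32_t in[N], int32_t out[N])
{
    int32_t t[N];
    toy_transform(in, t);
    for (int i = 0; i < N; i++)
        out[i] = (t[i] + (1 << (SHIFT - 1))) >> SHIFT;
}

/* Folded variant: one addition to the DC input, then a plain truncating shift
 * (the part a packing instruction could do on its own). */
static void scale_dc_bias(const int32_t in[N], int32_t out[N])
{
    int32_t biased[N], t[N];
    for (int k = 0; k < N; k++)
        biased[k] = in[k];
    biased[0] += (1 << (SHIFT - 1)) / W0;   /* 32 / 16 = 2 */
    toy_transform(biased, t);
    for (int i = 0; i < N; i++)
        out[i] = t[i] >> SHIFT;             /* same shift as above, no per-output add */
}
```

Both functions shift identical intermediate values, so they agree for any input; the folded variant just replaces N additions with one.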
patch ok :)
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates