[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#7
Michael Niedermayer
michaelni
Thu Aug 30 01:25:32 CEST 2007
Hi
On Wed, Aug 29, 2007 at 11:37:19PM +0200, Balatoni Denes wrote:
> Hi!
>
> Wednesday 29 August 2007 00:13-kor Michael Niedermayer ezt ?rta:
> > On Tue, Aug 28, 2007 at 10:38:23PM +0200, Balatoni Denes wrote:
> > > > > you are forgetting that theres also 25% between the horizontal and
> > > > > vertical idcts which can be reused with no store/load and no changes
> > > > > to the registers
> > > >
> > > > Indeed, I didn't take that into account. So if I fix that 25% and the
> > > > clamping part, will you accept the patch?
> > >
> > > Better yet: that would be 4 instructions. How about I gain 4 clocks in
> > > some other way instead - how, let it be my secret. Okay?
> >
> > hmm no but you have to do that secret optimization too now at minimum for
> > it to be considered for svn
> >
> > let me remind you, code has to be optimal to be accepted
> >
> > ill investigate the register shortage vs. avoidable load/stores vs. latency
> > after (the unlikely) case that you do correct the undisputed
> > suboptimalities
>
> Here is a new patch. I fixed all "undisputed suboptimalities". I also
> elminitad many adds, as you suggested before, because I found that gcc
> optimized away all unneeded prologue and epilogue code around the asm block.
> I also eliminated the temporary 128 byte storage where it is not needed.
:)
some more hard to dispute ideas below ...
[...]
> @@ -4045,6 +4049,13 @@
> int accel = vis_level ();
>
> if (accel & ACCEL_SPARC_VIS) {
> + if(avctx->idct_algo==FF_IDCT_SIMPLEVIS){
> + c->idct_put = ff_simple_idct_put_vis;
> + c->idct_add = ff_simple_idct_add_vis;
> + c->idct = ff_simple_idct_vis;
> + c->idct_permutation_type = FF_TRANSPOSE_IDCT_PERM;
> + }
> +
this should be 4 spaces indented
[...]
> +#define IDCT4ROWS \
> + /* 1. column */\
> + "fmul8ulx16 %%f0, %%f38, %%f58 \n\t"\
> + "fmul8ulx16 %%f2, %%f32, %%f18 \n\t"\
> + "fmul8ulx16 %%f2, %%f36, %%f22 \n\t"\
> + "fmul8ulx16 %%f2, %%f40, %%f26 \n\t"\
> + "fmul8ulx16 %%f2, %%f44, %%f30 \n\t"\
> +\
> + "fmul8sux16 %%f0, %%f38, %%f48 \n\t"\
> + "fmul8sux16 %%f2, %%f32, %%f50 \n\t"\
> + "fmul8sux16 %%f2, %%f36, %%f52 \n\t"\
> + "fmul8sux16 %%f2, %%f40, %%f54 \n\t"\
> + "fmul8sux16 %%f2, %%f44, %%f56 \n\t"\
> +\
> + "fpadd16 %%f48, %%f58, %%f58 \n\t"\
> + "fpadd16 %%f50, %%f18, %%f18 \n\t"\
> + "fpadd16 %%f52, %%f22, %%f22 \n\t"\
> + "fpadd16 %%f54, %%f26, %%f26 \n\t"\
> + "fpadd16 %%f56, %%f30, %%f30 \n\t"\
> +\
> + "fpadd16 %%f58, %%f0, %%f16 \n\t"\
> + "fpadd16 %%f58, %%f0, %%f20 \n\t"\
> + "fpadd16 %%f58, %%f0, %%f24 \n\t"\
> + "fpadd16 %%f58, %%f0, %%f28 \n\t"\
> + "fpadd16 %%f18, %%f2, %%f18 \n\t"\
> + "fpadd16 %%f22, %%f2, %%f22 \n\t"\
> + "fpadd16 %%f26, %%f2, %%f26 \n\t"\
> + /* 2. column */\
> + "for %%f4, %%f6, %%f60 \n\t"\
> + "fcmpd %%fcc0, %%f62, %%f60 \n\t"\
the for and fcmpd can be moved up (with some distance from each other
so to avoid the 10 cycle stall (you said all instructions have a latency
of 6 on the US T2) this should cause theres nothing touching any of
f4,f6,f60,f62,fcc above so this should work
> + "fbe 3f \n\t"\
> + "nop \n\t"\
you can move a instruction into the nop slot, its always executed if the annul
bit is not set according to docs so the fpadd16 %%f26, %%f2, %%f26 from
above would be a choice
this applies to all the other nop as well
> + "fmul8ulx16 %%f4, %%f34, %%f48 \n\t"\
> + "fmul8ulx16 %%f4, %%f42, %%f50 \n\t"\
> + "fmul8ulx16 %%f6, %%f36, %%f52 \n\t"\
> + "fmul8ulx16 %%f6, %%f44, %%f54 \n\t"\
> + "fmul8ulx16 %%f6, %%f32, %%f56 \n\t"\
> + "fmul8ulx16 %%f6, %%f40, %%f58 \n\t"\
> +\
> + "fpadd16 %%f16, %%f48, %%f16 \n\t"\
> + "fpadd16 %%f20, %%f50, %%f20 \n\t"\
> + "fpsub16 %%f24, %%f50, %%f24 \n\t"\
> + "fpsub16 %%f28, %%f48, %%f28 \n\t"\
> + "fpadd16 %%f18, %%f52, %%f18 \n\t"\
> + "fpsub16 %%f22, %%f54, %%f22 \n\t"\
> + "fpsub16 %%f26, %%f56, %%f26 \n\t"\
> + "fpsub16 %%f30, %%f58, %%f30 \n\t"\
> +\
> + "fmul8sux16 %%f4, %%f34, %%f48 \n\t"\
> + "fmul8sux16 %%f4, %%f42, %%f50 \n\t"\
> + "fmul8sux16 %%f6, %%f36, %%f52 \n\t"\
> + "fmul8sux16 %%f6, %%f44, %%f54 \n\t"\
> + "fmul8sux16 %%f6, %%f32, %%f56 \n\t"\
> + "fmul8sux16 %%f6, %%f40, %%f58 \n\t"\
> +\
> + "fpadd16 %%f16, %%f48, %%f16 \n\t"\
> + "fpadd16 %%f20, %%f50, %%f20 \n\t"\
> + "fpsub16 %%f24, %%f50, %%f24 \n\t"\
> + "fpsub16 %%f28, %%f48, %%f28 \n\t"\
> + "fpadd16 %%f18, %%f52, %%f18 \n\t"\
> + "fpsub16 %%f22, %%f54, %%f22 \n\t"\
> + "fpsub16 %%f26, %%f56, %%f26 \n\t"\
> + "fpsub16 %%f30, %%f58, %%f30 \n\t"\
> +\
> + "fpadd16 %%f16, %%f4, %%f16 \n\t"\
> + "fpsub16 %%f28, %%f4, %%f28 \n\t"\
> + "fpadd16 %%f18, %%f6, %%f18 \n\t"\
> + "fpsub16 %%f26, %%f6, %%f26 \n\t"\
> + "fpsub16 %%f30, %%f6, %%f30 \n\t"\
> + /* 3. column */\
> + "3: \n\t"\
> + "for %%f8, %%f10, %%f60 \n\t"\
> + "fcmpd %%fcc0, %%f62, %%f60 \n\t"\
the for and fcmp can similarely be moved up, you have to switch to fcc1 though
to avoid a conflict with the above ones
this applies to the other for/fcmpd as well
[...]
> + TRANSPOSE
> + IDCT4ROWS
> + SCALEROWS
> + PUTPIXELSCLAMPED("0")
> + LOAD("%2+64")
> + TRANSPOSE
> + IDCT4ROWS
> + SCALEROWS
> + PUTPIXELSCLAMPED("4")
the SCALEROWS is unneeded, the fpack16 can do the downshift and a single
addition to the 0,0 coefficient before the idct or first column after the
transpose can compensate for the rounding difference
[...]
> + TRANSPOSE
> + IDCT4ROWS
> + SCALEROWS
> + ADDPIXELSCLAMPED("0")
> + LOAD("%2+64")
> + TRANSPOSE
> + IDCT4ROWS
> + SCALEROWS
> + ADDPIXELSCLAMPED("4")
same here, the SCALEROWS can be avoided by changing the shift used in fpack16
and the expansion value for the added pixels as well as adding a bias with a
single instruction further above
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070830/a5524abc/attachment.pgp>
More information about the ffmpeg-devel
mailing list