[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #4
Michael Niedermayer
michaelni
Fri Aug 24 01:07:11 CEST 2007
Hi
On Fri, Aug 24, 2007 at 12:32:42AM +0200, Balatoni Denes wrote:
> Hi Michael!
>
> Well, honestly, I don't want to spend more time on this - at least for now. I
> have already spent like one and a half week with this, when I should have
didnt you just yesterday or was it today say that you spend a week on it, how
can you after a few hours have spend another half week on it?
> been doing my job. Today all I did at work was fixing it according to your
> suggestions, and I want to do something else now (if all else fails, my
> job;) ). To clarify, this is not my dayjob, hacking ffmpeg, and I am not a
> university student anymore either (unfortunatelly).
> I would be happy if the idct finally went to svn, and I would be sad if it
> didn't (after spending so much time with it), but this was what I could do
> for now.
well your time was not wasted, someone else can take it and finish the work
if you dont ...
>
> Anyway, I answer to some of your new suggestions, where appropriate.
>
> Thursday 23 August 2007 23:14-kor Michael Niedermayer ezt ?rta:
> > ok ill accept the fpadd16 for now if you move them after the for /fcmpd as
> > they arent needed before and after that they would often not be executed :)
> >
>
> That would be right in the middle of the idct4row macro. So there would be
> insane amounts of code duplication - well unless, I cut up idct4row by
> columns. If this was the last show stopper, I would do it, though ;)
all you have to do is to put a ADDBLAH macro after the branches in the
idct4row
then define it appropriately before each idct4row
[...]
> > anyway, please try to negate the high 8 bit or the coeffs and
> > try
> > fmul8sucks16
> > fmul8ulcks16
> > fpsub16
> >
> > maybe this reduces the rounding errors and its the same amount
> > of code ...
>
> I see what you mean - and a good idea, which didn't occur to me - but I am
> skeptical. The problem is, as I see it, is because of the two instructions,
> the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now
> playing with the signs doesn't change this unfortunatelly.
its an idea, testing takes 5minutes for someone with a sparc, its just a
matter of *-1 the 8bits and running dct-test
a theoretical analysis of the rounding behavior would take more time
>
> > i also reject code which can be changed to achive a 0.1% overall speedup
> > (not rejecting it is like accepting a 0.1% slowdown ...)
>
> No because then it is 0% speedup vs. X-0.1% speedup, where X > 0.1.
no not at all, as long as i reject the code someone will continue to try
to get it into svn, thus inceasing the achieved speedup but as soon as
i accept it noone will bother to improve it
or would you? ;)
>
> > > multiplies. So with the simple_idct algorithm, I don't expect major
> > > speedups.
> >
> > hmm what happens if you just round the coeffs down to 8bit? / throw the
> > half the multiplies out, the code as it is is inaccurate anyway due to
> > sparc misdesign maybe the loss from 8bit coeffs isnt that big ...
>
> The current situtation is not that bad (of course the idct is switched of by
> default, but for just viewing a movie it should be mostly adequate). 8bit
> coeffs imho would make the situation very bad though.
again its a matter of <5min to test this (hey just add +128>>8 to the coeffs)
you needed more time to write this mail
>
> > > + asm volatile(
> > > + "ldd [%2], %%f32 \n\t"
> > > + "ldd [%2+8], %%f34 \n\t"
> > > + "ldd [%2+16], %%f36 \n\t"
> > > + "ldd [%2+24], %%f38 \n\t"
> > > + "ldd [%2+32], %%f40 \n\t"
> > > + "ldd [%2+40], %%f42 \n\t"
> > > + "ldd [%2+48], %%f44 \n\t"
> > > +
> > > + // shift right 16-4=12
> > > + LOADSCALE("%3+0")
> > > + IDCT4ROWS
> > > + STOREROWS("%4+0")
> > > + LOADSCALE("%3+8")
> > > + IDCT4ROWS
> > > + STOREROWS("%4+8")
> > > +
> > > + // shift right 16+4
> > > + LOADTRANS("%4+0")
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + STOREROWS("%3+0")
> > > + LOADTRANS("%4+64")
> > > + IDCT4ROWS
> > > + SCALEROWS
> > > + STOREROWS("%3+8")
> >
> > it seems that the sparc has twice as many registers as what would be needed
> > to keep the coefficients in registers between teh rows and columns
> > transforms that would avoid 32 instructions ...
>
> They are kept in registers, nothing ever touches the coefficients.
STOREROWS writes them, LOADTRANS reads them, maybe i should have called
them something else than coefficients but thats not really changing anything
>
> > is fpmerge (with 0) faster, slower or the same speed as fmul8x16? what
> > about fpexpand, both could be used here alternatively
> > btw do you have some document which lists the speed if all these
> > instructions?
>
> Same speed, troughput 1/clock, latency 4 (on ultrasparc3, 6 on ultrasparc T2).
> These numbers can be found in the CPU user manuals.
> http://www.sun.com/processors/documentation.html
thanks
>
> > > + "ldd [%0], %%f32 \n\t"
> > > + "ldd [%0+8], %%f34 \n\t"
> > > + "ldd [%0+16], %%f36 \n\t"
> > > + "ldd [%0+24], %%f38 \n\t"
> > > + "ldd [%0+32], %%f40 \n\t"
> > > + "ldd [%0+40], %%f42 \n\t"
> > > + "ldd [%0+48], %%f44 \n\t"
> > > + "ldd [%0+56], %%f46 \n\t"
> > > + "ldd [%0+64], %%f48 \n\t"
> > > + "ldd [%0+72], %%f50 \n\t"
> > > + "ldd [%0+80], %%f52 \n\t"
> > > + "ldd [%0+88], %%f54 \n\t"
> > > + "ldd [%0+96], %%f56 \n\t"
> > > + "ldd [%0+104], %%f58 \n\t"
> > > + "ldd [%0+112], %%f60 \n\t"
> > > + "ldd [%0+120], %%f62 \n\t"
> > > + "fpadd16 %%f0, %%f32, %%f0 \n\t"
> > > + "fpadd16 %%f2, %%f34, %%f2 \n\t"
> > > + "fpadd16 %%f4, %%f36, %%f4 \n\t"
> > > + "fpadd16 %%f6, %%f38, %%f6 \n\t"
> > > + "fpadd16 %%f8, %%f40, %%f8 \n\t"
> > > + "fpadd16 %%f10, %%f42, %%f10 \n\t"
> > > + "fpadd16 %%f12, %%f44, %%f12 \n\t"
> > > + "fpadd16 %%f14, %%f46, %%f14 \n\t"
> > > + "fpadd16 %%f16, %%f48, %%f16 \n\t"
> > > + "fpadd16 %%f18, %%f50, %%f18 \n\t"
> > > + "fpadd16 %%f20, %%f52, %%f20 \n\t"
> > > + "fpadd16 %%f22, %%f54, %%f22 \n\t"
> > > + "fpadd16 %%f24, %%f56, %%f24 \n\t"
> > > + "fpadd16 %%f26, %%f58, %%f26 \n\t"
> > > + "fpadd16 %%f28, %%f60, %%f28 \n\t"
> > > + "fpadd16 %%f30, %%f62, %%f30 \n\t"
> > > + "fpack16 %%f0, %%f0 \n\t"
> > > + "fpack16 %%f4, %%f4 \n\t"
> > > + "fpack16 %%f8, %%f8 \n\t"
> > > + "fpack16 %%f12, %%f12 \n\t"
> > > + "fpack16 %%f16, %%f16 \n\t"
> > > + "fpack16 %%f20, %%f20 \n\t"
> > > + "fpack16 %%f24, %%f24 \n\t"
> > > + "fpack16 %%f28, %%f28 \n\t"
> > > + "fpack16 %%f2, %%f1 \n\t"
> > > + "fpack16 %%f6, %%f5 \n\t"
> > > + "fpack16 %%f10, %%f9 \n\t"
> > > + "fpack16 %%f14, %%f13 \n\t"
> > > + "fpack16 %%f18, %%f17 \n\t"
> > > + "fpack16 %%f22, %%f21 \n\t"
> > > + "fpack16 %%f26, %%f25 \n\t"
> > > + "fpack16 %%f30, %%f29 \n\t"
> > > + "sub %1, %4, %1 \n\t"
> > > + "std %%f0, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f4, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f8, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f12, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f16, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f20, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f24, [%1] \n\t"
> > > + "add %1, %2, %1 \n\t"
> > > + "std %%f28, [%1] \n\t"
> >
> > as the sparc has a shitload of registers you could use a new register for
> > each result of the adds used during reading and then reuse these during
> > writing
> > avoids 8 instructions
>
> This could be done. But I am afraid - didn't check - gcc is already sticking
> some crazy prologues and epiloges to the asm section because of the lot of
> input and output constrains(and btw, I give output registers too, because I
> am afraid gcc miscompiles the input, it the output is not given).
hmm, ok i cant comment on this so ill leave that though in principle you could
write it as .s to bypass gcc retardedness
>
> > > + : "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4), "=r"
> > > (out5) + : "0" (block), "1" (pixels), "2" (line_size), "3"
> > > (expand), "4" (line_size*7) + );
> > > +}
> > > +
> > > +void ff_simple_idct_put_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > > { + ff_simple_idct_vis(data);
> > > + ff_put_pixels_clamped_vis(data, dest, line_size);
> > > +}
> > > +
> > > +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > > { + ff_simple_idct_vis(data);
> > > + ff_add_pixels_clamped_vis(data, dest, line_size);
> > > +}
> >
> > the idct should not store the output in memory but leave it in registers
> > the ff_simple_idct_put/add then should call the idct (or have it inlined)
> > and the clamping code should just work with the registers
> > this avoids another 32 instructions
>
> This could be done too. Although there would be some great code duplication
> (unless one uses macros),
yes a lot of things are bad unless done properly whoever finishes the work
will have to do it properly ...
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070824/a22ba4f1/attachment.pgp>
More information about the ffmpeg-devel
mailing list