[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #4

Fri Aug 24 01:07:11 CEST 2007

Hi

On Fri, Aug 24, 2007 at 12:32:42AM +0200, Balatoni Denes wrote:
> Hi Michael!
> 
> Well, honestly, I don't want to spend more time on this - at least for now. I 
> have already spent like one and a half week with this, when I should have 

didnt you just yesterday or was it today say that you spend a week on it, how
can you after a few hours have spend another half week on it?

> been doing my job. Today all I did at work was fixing it according to your 
> suggestions, and I want to do something else now (if all else fails, my 
> job;) ). To clarify, this is not my dayjob, hacking ffmpeg, and I am not a 
> university student anymore either (unfortunatelly).
> I would be happy if the idct finally went to svn, and I would be sad if it 
> didn't (after spending so much time with it), but this was what I could do 
> for now. 

well your time was not wasted, someone else can take it and finish the work
if you dont ...

> 
> Anyway, I answer to some of your new suggestions, where appropriate.
> 
> Thursday 23 August 2007 23:14-kor Michael Niedermayer ezt ?rta:
> > ok ill accept the fpadd16 for now if you move them after the for /fcmpd as
> > they arent needed before and after that they would often not be executed :)
> >
> 
> That would be right in the middle of the idct4row macro. So there would be 
> insane amounts of code duplication - well unless, I cut up idct4row by 
> columns. If this was the last show stopper, I would do it, though ;)

all you have to do is to put a ADDBLAH macro after the branches in the
idct4row
then define it appropriately before each idct4row

[...]
> > anyway, please try to negate the high 8 bit or the coeffs and
> > try
> > fmul8sucks16
> > fmul8ulcks16
> > fpsub16
> >
> > maybe this reduces the rounding errors and its the same amount
> > of code ...
> 
> I see what you mean - and a good idea, which didn't occur to me - but I am 
> skeptical. The problem is, as I see it, is because of the two instructions, 
> the error can be 1 (vs. 0.5 if it was one instruction with rounding). Now 
> playing with the signs doesn't change this unfortunatelly.

its an idea, testing takes 5minutes for someone with a sparc, its just a
matter of *-1 the 8bits and running dct-test
a theoretical analysis of the rounding behavior would take more time

> 
> > i also reject code which can be changed to achive a 0.1% overall speedup
> > (not rejecting it is like accepting a 0.1% slowdown ...)
> 
> No because then it is 0% speedup vs. X-0.1% speedup, where X > 0.1.

no not at all, as long as i reject the code someone will continue to try
to get it into svn, thus inceasing the achieved speedup but as soon as
i accept it noone will bother to improve it

or would you? ;)

> 
> > > multiplies. So with the simple_idct algorithm, I don't expect major
> > > speedups.
> >
> > hmm what happens if you just round the coeffs down to 8bit? / throw the
> > half the multiplies out, the code as it is is inaccurate anyway due to
> > sparc misdesign maybe the loss from 8bit coeffs isnt that big ...
> 
> The current situtation is not that bad (of course the idct is switched of by 
> default, but for just viewing a movie it should be mostly adequate). 8bit 
> coeffs imho would make the situation very bad though.

again its a matter of <5min to test this (hey just add +128>>8 to the coeffs)
you needed more time to write this mail

> 
> > > +    asm volatile(
> > > +        "ldd [%2], %%f32         \n\t"
> > > +        "ldd [%2+8], %%f34       \n\t"
> > > +        "ldd [%2+16], %%f36      \n\t"
> > > +        "ldd [%2+24], %%f38      \n\t"
> > > +        "ldd [%2+32], %%f40      \n\t"
> > > +        "ldd [%2+40], %%f42      \n\t"
> > > +        "ldd [%2+48], %%f44      \n\t"
> > > +
> > > +        // shift right 16-4=12
> > > +        LOADSCALE("%3+0")
> > > +        IDCT4ROWS
> > > +        STOREROWS("%4+0")
> > > +        LOADSCALE("%3+8")
> > > +        IDCT4ROWS
> > > +        STOREROWS("%4+8")
> > > +
> > > +        // shift right 16+4
> > > +        LOADTRANS("%4+0")
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        STOREROWS("%3+0")
> > > +        LOADTRANS("%4+64")
> > > +        IDCT4ROWS
> > > +        SCALEROWS
> > > +        STOREROWS("%3+8")
> >
> > it seems that the sparc has twice as many registers as what would be needed
> > to keep the coefficients in registers between teh rows and columns
> > transforms that would avoid 32 instructions ...
> 
> They are kept in registers, nothing ever touches the coefficients.

STOREROWS writes them, LOADTRANS reads them, maybe i should have called
them something else than coefficients but thats not really changing anything

> 
> > is fpmerge (with 0) faster, slower or the same speed as fmul8x16? what
> > about fpexpand, both could be used here alternatively
> > btw do you have some document which lists the speed if all these
> > instructions?
> 
> Same speed, troughput 1/clock, latency 4 (on ultrasparc3, 6 on ultrasparc T2).
> These numbers can be found in the CPU user manuals.
> http://www.sun.com/processors/documentation.html

thanks

> 
> > > +        "ldd [%0], %%f32      \n\t"
> > > +        "ldd [%0+8], %%f34    \n\t"
> > > +        "ldd [%0+16], %%f36   \n\t"
> > > +        "ldd [%0+24], %%f38   \n\t"
> > > +        "ldd [%0+32], %%f40   \n\t"
> > > +        "ldd [%0+40], %%f42   \n\t"
> > > +        "ldd [%0+48], %%f44   \n\t"
> > > +        "ldd [%0+56], %%f46   \n\t"
> > > +        "ldd [%0+64], %%f48   \n\t"
> > > +        "ldd [%0+72], %%f50   \n\t"
> > > +        "ldd [%0+80], %%f52   \n\t"
> > > +        "ldd [%0+88], %%f54   \n\t"
> > > +        "ldd [%0+96], %%f56   \n\t"
> > > +        "ldd [%0+104], %%f58  \n\t"
> > > +        "ldd [%0+112], %%f60  \n\t"
> > > +        "ldd [%0+120], %%f62  \n\t"
> > > +        "fpadd16 %%f0, %%f32, %%f0   \n\t"
> > > +        "fpadd16 %%f2, %%f34, %%f2   \n\t"
> > > +        "fpadd16 %%f4, %%f36, %%f4   \n\t"
> > > +        "fpadd16 %%f6, %%f38, %%f6   \n\t"
> > > +        "fpadd16 %%f8, %%f40, %%f8   \n\t"
> > > +        "fpadd16 %%f10, %%f42, %%f10 \n\t"
> > > +        "fpadd16 %%f12, %%f44, %%f12 \n\t"
> > > +        "fpadd16 %%f14, %%f46, %%f14 \n\t"
> > > +        "fpadd16 %%f16, %%f48, %%f16 \n\t"
> > > +        "fpadd16 %%f18, %%f50, %%f18 \n\t"
> > > +        "fpadd16 %%f20, %%f52, %%f20 \n\t"
> > > +        "fpadd16 %%f22, %%f54, %%f22 \n\t"
> > > +        "fpadd16 %%f24, %%f56, %%f24 \n\t"
> > > +        "fpadd16 %%f26, %%f58, %%f26 \n\t"
> > > +        "fpadd16 %%f28, %%f60, %%f28 \n\t"
> > > +        "fpadd16 %%f30, %%f62, %%f30 \n\t"
> > > +        "fpack16 %%f0, %%f0   \n\t"
> > > +        "fpack16 %%f4, %%f4   \n\t"
> > > +        "fpack16 %%f8, %%f8   \n\t"
> > > +        "fpack16 %%f12, %%f12 \n\t"
> > > +        "fpack16 %%f16, %%f16 \n\t"
> > > +        "fpack16 %%f20, %%f20 \n\t"
> > > +        "fpack16 %%f24, %%f24 \n\t"
> > > +        "fpack16 %%f28, %%f28 \n\t"
> > > +        "fpack16 %%f2, %%f1   \n\t"
> > > +        "fpack16 %%f6, %%f5   \n\t"
> > > +        "fpack16 %%f10, %%f9  \n\t"
> > > +        "fpack16 %%f14, %%f13 \n\t"
> > > +        "fpack16 %%f18, %%f17 \n\t"
> > > +        "fpack16 %%f22, %%f21 \n\t"
> > > +        "fpack16 %%f26, %%f25 \n\t"
> > > +        "fpack16 %%f30, %%f29 \n\t"
> > > +        "sub %1, %4, %1       \n\t"
> > > +        "std %%f0, [%1]       \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f4, [%1]       \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f8, [%1]       \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f12, [%1]      \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f16, [%1]      \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f20, [%1]      \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f24, [%1]      \n\t"
> > > +        "add %1, %2, %1       \n\t"
> > > +        "std %%f28, [%1]      \n\t"
> >
> > as the sparc has a shitload of registers you could use a new register for
> > each result of the adds used during reading and then reuse these during
> > writing
> > avoids 8 instructions
> 
> This could be done. But I am afraid - didn't check - gcc is already sticking 
> some crazy prologues and epiloges to the asm section because of the lot of 
> input and output constrains(and btw, I give output registers too, because I 
> am afraid gcc miscompiles the input, it the output is not given).

hmm, ok i cant comment on this so ill leave that though in principle you could
write it as .s to bypass gcc retardedness

> 
> > > +        : "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4), "=r"
> > > (out5) +        : "0" (block), "1" (pixels), "2" (line_size), "3"
> > > (expand), "4" (line_size*7) +    );
> > > +}
> > > +
> > > +void ff_simple_idct_put_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > > { +    ff_simple_idct_vis(data);
> > > +    ff_put_pixels_clamped_vis(data, dest, line_size);
> > > +}
> > > +
> > > +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > > { +    ff_simple_idct_vis(data);
> > > +    ff_add_pixels_clamped_vis(data, dest, line_size);
> > > +}
> >
> > the idct should not store the output in memory but leave it in registers
> > the ff_simple_idct_put/add then should call the idct (or have it inlined)
> > and the clamping code should just work with the registers
> > this avoids another 32 instructions
> 
> This could be done too. Although there would be some great code duplication 
> (unless one uses macros), 

yes a lot of things are bad unless done properly whoever finishes the work
will have to do it properly ...

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070824/a22ba4f1/attachment.pgp>