[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#6
Michael Niedermayer
michaelni
Tue Aug 28 21:33:46 CEST 2007
Hi
On Tue, Aug 28, 2007 at 07:44:56PM +0200, Balatoni Denes wrote:
> Hi Michael!
>
> On Tuesday 28 August 2007 at 06:04, Michael Niedermayer wrote:
> > > You forgot to give a good reason, because your argument seems flawed.
> >
> > the code is suboptimal speedwise and you try to convince me that it can't
> > be improved instead of trying to improve the code
> > your code does a lot of stores which are followed by loads, many of them
> > can be avoided with no changes to the available registers, yet you don't
> > you rather concentrate on arguing what in your opinion can't be done
>
> Are you saying that if I add/put_clamped the result in the second half of the
> transform directly, instead of just storing it for a later add/put_clamped
> operation, then you would accept my patch? If your answer is yes, then it is
> feasible, and I think I will do that.
you are forgetting that there's also 25% between the horizontal and vertical
IDCTs which can be reused with no store/load and no changes to the registers
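
A minimal C sketch of the store/load point, not code from the patch. The names
put_unfused, put_fused and column_pass are hypothetical, and column_pass is a
one-stage butterfly standing in for the real column IDCT. It only shows the
difference between storing results for a later put_clamped and clamping them
while they are still in registers:

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 pixel range. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical stand-in for the 1-D column pass: one butterfly stage only.
 * A real simple_idct column pass would go here. */
static void column_pass(const int16_t *col, int stride, int out[8])
{
    for (int i = 0; i < 4; i++) {
        int a = col[i * stride], b = col[(7 - i) * stride];
        out[i]     = a + b;
        out[7 - i] = a - b;
    }
}

/* Unfused: results go back to memory, then a second loop reloads them. */
static void put_unfused(uint8_t *dst, int dst_stride, int16_t block[64])
{
    int tmp[8];
    for (int x = 0; x < 8; x++) {
        column_pass(block + x, 8, tmp);
        for (int y = 0; y < 8; y++)
            block[y * 8 + x] = (int16_t)tmp[y];                   /* store  */
    }
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * dst_stride + x] = clamp_u8(block[y * 8 + x]); /* reload */
}

/* Fused: clamp and write the pixel while the value is still in a register. */
static void put_fused(uint8_t *dst, int dst_stride, const int16_t block[64])
{
    int tmp[8];
    for (int x = 0; x < 8; x++) {
        column_pass(block + x, 8, tmp);
        for (int y = 0; y < 8; y++)
            dst[y * dst_stride + x] = clamp_u8(tmp[y]);
    }
}
```

Both variants produce the same pixels; the fused one skips one store and one
load per coefficient.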
>
> > > Ok, I understand what you mean. I did some calculations. On the UltraSPARC
> > > III (4 clock latency) about 14 clocks would be spent waiting - that's not
> > > too bad, that's still an 18 clock speed improvement. However on the
> > > UltraSPARC T2 (Niagara 2, 6 clock latency) about 36 clocks would be spent
> > > waiting - that would be slower than before the rewrite. So it's a bad idea.
> >
> > well and what if you combine the code for 2 columns? that is 2 even ones
> > or 2 odd ones not even odd mix ...
>
> Ok, I underestimated the speed loss on UltraSPARC T2 (US III was fine IMO),
> because I forgot to count the odd columns. Second, I think there are not
> enough registers to properly calculate two odd columns at once (8 registers
> would be needed for that).
how common is the US T2? if it's a rare and old CPU i don't see a reason to
care about it ...
also you seem to ignore that most of the 8x8 blocks will have nearly all
of their elements 0, so a slowdown in code which is only executed for non
zero parts does not weigh the same as a slowdown in code which is
always executed
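
A minimal C sketch of why sparse blocks matter, under stated assumptions:
rows_sparse and row_full are hypothetical names, and row_full is an 8-point
Walsh-Hadamard butterfly standing in for the real row IDCT (chosen only
because it, too, maps a DC-only row to a constant row). Rows with no non-zero
AC coefficients take the cheap path, so the full pass only runs for the rest:

```c
#include <stdint.h>

/* Hypothetical stand-in for the full 1-D row pass: an 8-point
 * Walsh-Hadamard butterfly, not the real IDCT. */
static void row_full(int16_t r[8])
{
    for (int len = 1; len < 8; len <<= 1)
        for (int i = 0; i < 8; i += 2 * len)
            for (int j = i; j < i + len; j++) {
                int a = r[j], b = r[j + len];
                r[j]       = (int16_t)(a + b);
                r[j + len] = (int16_t)(a - b);
            }
}

/* Transform all 8 rows in place; returns how many needed the full pass. */
static int rows_sparse(int16_t block[64])
{
    int full = 0;
    for (int y = 0; y < 8; y++) {
        int16_t *r = block + 8 * y;
        if (!(r[1] | r[2] | r[3] | r[4] | r[5] | r[6] | r[7])) {
            /* DC-only (or all-zero) row: the output is the DC value repeated. */
            for (int x = 1; x < 8; x++)
                r[x] = r[0];
        } else {
            row_full(r);
            full++;
        }
    }
    return full;
}
```

The returned count is the number of rows that paid for the expensive path; for
a typical mostly-zero block it is small.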
> And I also realized, that the transpose operation
> needs 32 32bit registers (that is, all 32 bit registers are needed), which
> means that sometime half the data has to be stored in 64 bit registers, and
> then moved to 32 bit registers before transpose, which is an additional 8
> instructions. So with these taken into account, I believe (and I tried to
> make an accurate estimate) the rewrite would still be a few clocks slower on
> the US T2.
1. load left half of the 8x8 block in 8 64 bit registers
2. do idct of that into 8 2x32bit registers
3. transpose these
   a. 8 2x32 -> 8 2x32
   b. 8 2x32 -> 8 2x32
   c. 8 2x32 -> 8 64bit
4. load right half of the 8x8 block in 8 64 bit registers
5. do idct of that into 8 2x32bit registers
   (here all 32bit registers are available for the transpose)
6. transpose these
   a. 8 2x32 -> 8 2x32
   b. 8 2x32 -> 8 2x32
   c. 8 2x32 -> 8 64bit
...
so there are no additional 8 instructions, at least i can't see where ...
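
A scalar C model of the data movement in the steps above, as a sketch only. It
assumes each 64 bit register holds four 16 bit coefficients, leaves out the
IDCT itself (steps 2 and 5), and does not model the three merge sub-steps
a/b/c individually. transpose_by_halves is a hypothetical name:

```c
#include <stdint.h>

/* Handle the block one 8x4 half at a time and write each half out
 * transposed, as 4 rows of 8 values. The IDCT between load and
 * transpose is omitted; only the data movement is shown. */
static void transpose_by_halves(const int16_t in[64], int16_t out[64])
{
    for (int half = 0; half < 2; half++) {
        int16_t reg[8][4];   /* models 8 registers of 4 x 16 bit (assumption) */

        /* steps 1/4: load columns 4*half .. 4*half+3 of every row */
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 4; x++)
                reg[y][x] = in[y * 8 + 4 * half + x];

        /* steps 3/6: the 8x4 half becomes rows 4*half .. 4*half+3 of out */
        for (int x = 0; x < 4; x++)
            for (int y = 0; y < 8; y++)
                out[(4 * half + x) * 8 + y] = reg[y][x];
    }
}
```

After both halves the output is the full transpose of the input, and each half
only ever needs 8 registers' worth of working data at a time.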
>
> Also, in the rewritten code, source and destination registers would always be
> changing between the 1/4 transformations, so it would be a convoluted mess.
> Also writing the code would not be very easy, (because each register has to
> be handpicked from what is available at any given time, I think there
> wouldn't really be a simple pattern - like it is now - to what register is
> used when).
>
> > It is dangerous to be right in matters on which the established
> > authorities are wrong. -- Voltaire
>
> So I should start to be afraid now ? :)
no, you aren't right :)
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Breaking DRM is a little like attempting to break through a door even
though the window is wide open and the only thing in the house is a bunch
of things you don't want and which you would get tomorrow for free anyway