[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#6

Tue Aug 28 19:44:56 CEST 2007

Hi Michael!

On Tuesday 28 August 2007 at 06:04, Michael Niedermayer wrote:
> > You forgot to give a good reason, because your argument seems flawed.
> 
> the code is suboptimal speedwise and you try to convince me that it can't
> be improved instead of trying to improve the code
> your code does a lot of stores which are followed by loads, many of them
> can be avoided with no changes to the available registers, yet you don't
> you rather concentrate on arguing what in your opinion can't be done

Are you saying that if I add/put_clamped the result in the second half of the
transform directly, instead of just storing it for a later add/put_clamped
operation, then you would accept my patch? If your answer is yes, then it is
feasible, and I think I will do that.
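
A rough C sketch of the difference being discussed, not the VIS code from the
patch. The function names are hypothetical, and second_half() is only a
stand-in for the real column transform; it is assumed to depend on a single
coefficient so that the two variants can be compared directly.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 pixel range. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical stand-in for the second half of the transform; the real
 * simple_idct butterfly is not reproduced here. */
static int second_half(const int16_t *block, int i)
{
    return (block[i] + 4) >> 3;
}

/* Two passes: store every result to the block, then load it again to clamp. */
static void put_two_pass(uint8_t *dest, int stride, int16_t *block)
{
    for (int i = 0; i < 64; i++)
        block[i] = (int16_t)second_half(block, i);              /* store */
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * stride + x] = clamp_u8(block[y * 8 + x]);  /* load again */
}

/* Fused: clamp and write each result while it is still in a register. */
static void put_fused(uint8_t *dest, int stride, const int16_t *block)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * stride + x] = clamp_u8(second_half(block, y * 8 + x));
}
```

Both variants write the same pixels; the fused one simply skips the
intermediate store and the reload.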

> > Ok, I understand what you mean. I did some calculations. On the UltraSPARC
> > III (4 clock latency) about 14 clocks would be spent waiting - that's not
> > too bad, that's still an 18 clock speed improvement. However on the
> > UltraSPARC T2 (Niagara 2, 6 clock latency) about 36 clocks would be spent
> > waiting - that would be slower than before the rewrite. So it's a bad idea.
> 
> well and what if you combine the code for 2 columns? that is 2 even ones
> or 2 odd ones not even odd mix ...

Ok, I underestimated the speed loss on UltraSPARC T2 (US III was fine IMO),
because I forgot to count the odd columns. Second, I think there are not
enough registers to properly calculate two odd columns at once (8 registers
would be needed for that). I also realized that the transpose operation
needs 32 32-bit registers (that is, all of the 32-bit registers), which
means that at some point half the data has to be held in 64-bit registers
and then moved to 32-bit registers before the transpose, which costs an
additional 8 instructions. With these taken into account, I believe (and I
tried to make an accurate estimate) the rewrite would still be a few clocks
slower on the US T2.
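
The register arithmetic can be checked with a small C sketch. It assumes
16-bit coefficients in an 8x8 block and the SPARC V9 floating-point register
file, where only %f0-%f31 are addressable as 32-bit registers. That the
"8 instructions" correspond to one 64-bit move per register is an assumed
reading, not something stated in the thread. The names are hypothetical.

```c
/* Register budget for one 8x8 block of 16-bit coefficients.
 * Assumption: SPARC V9 FP registers; 32-bit access exists only for
 * %f0..%f31, while the 64-bit registers extend up to %f62. */
enum {
    BLOCK_BITS  = 8 * 8 * 16,  /* 64 coefficients, 16 bits each */
    SINGLE_REGS = 32           /* 32-bit registers %f0..%f31 */
};

/* 32-bit registers needed to hold the whole block for the transpose. */
static int regs32_for_block(void)
{
    return BLOCK_BITS / 32;
}

/* 64-bit registers needed to park half of the block elsewhere. */
static int regs64_for_half_block(void)
{
    return (BLOCK_BITS / 2) / 64;
}

/* 32-bit registers left over while the block is being transposed. */
static int free_regs32_during_transpose(void)
{
    return SINGLE_REGS - regs32_for_block();
}
```

Under these assumptions the block fills all 32 of the 32-bit registers, and
moving the parked half back takes one move per 64-bit register, that is 8.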

Also, in the rewritten code, the source and destination registers would keep
changing between the 1/4 transformations, so it would be a convoluted mess.
Writing the code would not be very easy either, because each register has to
be handpicked from what is available at any given time. I don't think there
would be a simple pattern (like there is now) to which register is used when.

> It is dangerous to be right in matters on which the established
> authorities are wrong. -- Voltaire

So should I start to be afraid now? :)

bye
Denes