[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try#6

Mon Aug 27 22:21:40 CEST 2007

Hi!

I hope I am not filtered to your trash folder yet, Michael ;)

Here is a new patch, try#6. It is more accurate now (albeit a tiny bit 
slower), almost - but unfortunately only almost - passing ieee-1180.
Here is dct-test output:

  144   101   394   286   421   176    11   126
   88   109   152    79    99    93   166    94
  233   140    94   134    83   131    57    74
  185   100   121    74    85   117    84    70
  207   134    74   129   129    89    66    88
  112    74    74    94    75    80   124   103
  126   144   130   111   120    82    87   104
   96   119    72    83    66   133   146    97
IDCT SIMPLE-VIS: err_inf=1 err2=0.02248828 syserr=0.02105000 maxout=260 blockSumErr=8

Btw, I tested Walken's idct, here is the dct-test output:

   92   156  -255   132  -189   129    -9   110
 -242    90   -97   100  -105    84  -104    80
  -89   157  -141    66  -165   107  -108    89
 -142   116   -92    81   -66   107  -102    67
 -163   128  -157    87   -92    59  -104    50
  -93   142   -78   123   -60   111  -107   109
 -149   148   -49    98   -43    64   -87    82
 -165   115  -115    66  -115    86  -112    96
IDCT WALKEN-VIS: err_inf=1 err2=0.02248906 syserr=0.01275000 maxout=260 blockSumErr=8

So Walken's idct is a bit more accurate (lower syserr), and indeed it kind of 
passes ieee-1180.
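
For reading the two outputs above, here is a minimal C sketch of how error 
figures of this kind can be accumulated. It is not the dct-test source; the 
struct and function names are made up for illustration. The printed syserr 
values are consistent with the largest absolute matrix entry divided by 20000 
(421/20000 = 0.02105 and 255/20000 = 0.01275), so the sketch assumes the 8x8 
matrix is the signed error summed per position over all test blocks. The 
20000 is inferred from those numbers, not stated anywhere in this mail.

```c
#include <stdlib.h>

/* Hypothetical accumulator; field names mirror the labels dct-test prints. */
typedef struct {
    int  err_inf;            /* largest |test - ref| seen in any sample     */
    long err2_sum;           /* sum of squared differences over all samples */
    long sys_err[64];        /* signed difference summed per block position */
    int  block_sum_err_max;  /* largest per-block sum of |test - ref|       */
    int  blocks;             /* number of blocks accumulated so far         */
} IdctStats;

/* Add one 8x8 block; the caller must zero-initialize *s before first use. */
void stats_add_block(IdctStats *s, const short *ref, const short *test)
{
    int block_sum = 0;
    for (int i = 0; i < 64; i++) {
        int d = test[i] - ref[i];
        int v = abs(d);
        if (v > s->err_inf)
            s->err_inf = v;
        s->err2_sum   += (long)v * v;
        s->sys_err[i] += d;
        block_sum     += v;
    }
    if (block_sum > s->block_sum_err_max)
        s->block_sum_err_max = block_sum;
    s->blocks++;
}

/* Mean squared error per sample (assumed meaning of "err2"). */
double stats_err2(const IdctStats *s)
{
    return (double)s->err2_sum / s->blocks / 64;
}

/* Largest per-position mean error (assumed meaning of "syserr"). */
double stats_syserr(const IdctStats *s)
{
    long worst = 0;
    for (int i = 0; i < 64; i++) {
        long a = labs(s->sys_err[i]);
        if (a > worst)
            worst = a;
    }
    return (double)worst / s->blocks;
}
```

Under that reading, a position whose error always has the same sign shows up 
as a large matrix entry even when err_inf stays at 1, which is what the 394 
and 421 entries in the SIMPLE-VIS matrix would indicate.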

There were three outstanding issues:

1.)
> > ok, but then you should move the for up so it's not immediately before
> > a fcmpd using its result
>
> Ok, done.

Well, I moved them back, because it broke sparse matrices.

2.)
> > there are 32 64bit registers these should be enough to do the idct
> > without an intermediate store-load
> > the whole 8x8 block needs 16registers, 7 for the constant coefficients
> > that leaves 9 available
>
> It would be slower. In its current form, the idct has 8 independent VIS
> instructions after each other, so the instruction latency is not a
> problem. If you only use 9 registers, then good luck with latency.

Indeed. Calculating the first column part would take at least 30 clocks more 
because of latency, since there would be only one register for intermediate 
results. Calculating the second column would take at least 10 clocks more, and 
by then we are already slower than before (30 + 10 = 40 extra clocks), as the 
gain from all this work would have been about 32 clocks.
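
To illustrate the latency argument, here is a toy model in C of an in-order, 
single-issue pipeline. It only shows why interleaving independent 
instructions hides latency and why a single dependency chain does not. The 
function name, the latency of 4 cycles and the count of 32 operations used 
in the note below are assumptions for illustration, not measured UltraSPARC 
figures and not numbers from the patch.

```c
/* Cycles until the last result is usable when `chains` independent
 * dependency chains of `length` operations each are issued round-robin,
 * one instruction per cycle, and each result becomes usable `latency`
 * cycles after its instruction issues. Returns -1 on bad arguments. */
int model_cycles(int chains, int length, int latency)
{
    int ready[64];  /* cycle at which each chain's latest result is usable */
    int cycle = 0;  /* next free issue slot */
    int done = 0;

    if (chains < 1 || chains > 64 || length < 1 || latency < 1)
        return -1;

    for (int c = 0; c < chains; c++)
        ready[c] = 0;

    for (int step = 0; step < length; step++) {
        for (int c = 0; c < chains; c++) {
            if (cycle < ready[c])
                cycle = ready[c];      /* stall until the input is ready */
            ready[c] = cycle + latency;
            cycle++;
        }
    }

    for (int c = 0; c < chains; c++)
        if (ready[c] > done)
            done = ready[c];
    return done;
}
```

With these made-up numbers, 32 operations arranged as 8 interleaved chains 
of 4 finish in model_cycles(8, 4, 4) = 35 cycles, while the same 32 
operations forced through one intermediate register take 
model_cycles(1, 32, 4) = 128 cycles.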

3.)
> > the idct should not store the output in memory but leave it in registers
> > the ff_simple_idct_put/add then should call the idct (or have it inlined)
> > and the clamping code should just work with the registers
> > this avoids another 32 instructions

Although it could be done, it is quite some work (and as always, relatively 
little benefit), and more complexity in the code. It's really not worth it, 
though we might not agree on this point.
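
For context, a rough C sketch of what the put/add step does after the idct: 
clamp each 16-bit result to 0..255 and either store it or add it to the 
existing pixel. The function names here are hypothetical and this is plain C, 
not the VIS code from the patch. An 8x8 block of 16-bit values is 128 bytes, 
i.e. 16 64-bit registers, so the "32 instructions" in the quoted comment 
presumably means the 16 stores at the end of the idct plus the 16 loads at 
the start of the clamping code.

```c
#include <stdint.h>

/* Saturate a signed value to the 0..255 pixel range. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* "put": overwrite an 8x8 area of dest with the clamped idct output. */
void put_clamped(uint8_t *dest, int line_size, const int16_t *block)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * line_size + x] = clamp_u8(block[y * 8 + x]);
}

/* "add": add the idct output to the existing pixels, then clamp. */
void add_clamped(uint8_t *dest, int line_size, const int16_t *block)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dest[y * line_size + x] =
                clamp_u8(dest[y * line_size + x] + block[y * 8 + x]);
}
```

In this form the block array sits between the idct and the clamping; the 
quoted suggestion amounts to keeping those values in registers instead.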

So this is my last attempt at getting this code into SVN. If it doesn't 
get in now, then let's be realistic, Michael, it never will, because there 
really are very few people interested in developing SPARC VIS assembly. 
There was no original VIS code in ffmpeg before I came, only parts 
copy-pasted from libmpeg2, and most things are just not optimized for 
SPARC.

I do believe this contribution would be beneficial to ffmpeg, because the C 
idct is much slower, and the mlib idct sometimes makes the picture turn pink 
(or causes other artifacts).

Anyhow, do as you wish, I am off to have dinner
bye
Denes

-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_vis_try6.diff
Type: text/x-diff
Size: 22586 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070827/2a0c15b1/attachment.diff>