[FFmpeg-devel] [PATCH] SPARC VIS simple_idct try #2
Balatoni Denes
dbalatoni
Wed Aug 22 20:53:21 CEST 2007
Hi!
New patch attached, which hopefully addresses your concerns. Replies to the
issues below.
Wednesday 22 August 2007 03:47-kor Michael Niedermayer ezt ?rta:
> On Tue, Aug 21, 2007 at 11:37:38PM +0200, Balatoni Denes wrote:
> > side. Although the mlib version is faster, this is more accurate - and I
> > honestly don't know why it is not faster on the sparc than it is, it
> > should
> > be according to my estimates, but it is not.
>
> it would be very interresting to find out why its not faster ...
I benchmarked it now with gethrtime(), and apparently it is as fast as it
could be (cca. 1 clock/instruction). So probably I underestimated the speed
of the C version.
> > Also, although it's the same algorithm as simple_idct, it might be less
> > accurate on rare occasions, because of the primitive slow split 8+8 bit
> > multply operation in VIS. I am hoping the idct won't overflow, too.
> what happens with the regression tests if its forced to be used where the
> normal C simple idct normally is?
>
> and what does dct-test.c output for the idct?
I won't try the regression test, as it would be very complicated - ffmpeg
doesn't support solaris 8 (I don't have acces to anything else), so I am
using an older hacked version of mplayer.
But I did try dct-test, here is the output:
------------
ednebal mwux119> ./dct-test -i 0
ffmpeg DCT/IDCT test
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
IDCT REF-DBL: err_inf=0 err2=0.00000000 syserr=0.00000000 maxout=260
blockSumErr=0
IDCT REF-DBL: 50.2 kdct/s
-4 -24 -1 -15 4 24 -29 -47
-35 -13 21 -2 24 8 -9 19
30 -17 -38 7 -18 43 -28 -5
31 -31 -13 -9 23 5 16 11
-16 20 -43 18 4 -11 -9 0
-3 -6 -9 -3 -5 4 -25 11
-43 10 -6 -7 24 22 -12 23
4 -23 -11 0 10 11 13 24
IDCT INT: err_inf=1 err2=0.01606250 syserr=0.00235000 maxout=260 blockSumErr=6
IDCT INT: 629.0 kdct/s
-1 13 15 -2 30 9 10 -11
-22 -27 9 11 15 -1 31 8
2 -2 4 -21 0 9 -21 -13
21 -17 -10 12 4 1 9 -1
-8 4 -5 -6 20 -3 12 0
21 -15 9 18 -9 19 -14 3
1 22 -10 -6 11 -6 6 1
21 18 -7 -11 2 17 16 -8
IDCT SIMPLE-C: err_inf=1 err2=0.00840078 syserr=0.00155000 maxout=260
blockSumErr=5
IDCT SIMPLE-C: 463.2 kdct/s
651 363 821 1416 1116 686 64 492
247 380 736 237 441 323 534 420
1094 716 359 676 442 484 371 426
726 296 437 302 399 380 413 411
762 622 357 548 476 429 350 463
451 386 407 382 437 386 415 391
384 579 421 523 482 492 403 454
304 455 360 399 344 407 396 375
IDCT SIMPLE-VIS: err_inf=1 err2=0.06097266 syserr=0.07080000 maxout=259
blockSumErr=12
IDCT SIMPLE-VIS: 1024.6 kdct/s
2810 1725 3883 3476 1445 2977 1362 1943
1263 1244 3402 2061 3266 2497 2936 2741
3763 3555 1640 2765 1864 2615 2039 2391
3648 1897 2893 2304 2447 2391 2482 2613
1796 3132 2055 2615 2115 2426 2246 2373
2690 2483 2701 2527 2544 2556 2554 2520
1361 2738 2237 2556 2252 2520 2360 2349
1882 2594 2395 2605 2279 2759 2274 2578
IDCT MLIBidct: err_inf=3 err2=0.38151094 syserr=0.19415000 maxout=260
blockSumErr=40
IDCT MLIBidct: 2380.8 kdct/s
------------------
So it is less accurate than simple_idct, but definietly more accurate than the
mlib version. I think I also can confirm that the inaccuracy is from the
split two instruction multiply+rounding - because I have an earlier version
of simple_idct_vis that uses 32 bit for the calculations (and hence does only
two rows parallel, so it's half as fast, only a bit faster than the C
version), which has the same good accuracy as simple_idct. As the main
algorithm is the same, I think the conclusion, that the two instruction
multiply+rounding is guilty, is correct.
> > + 6393, 6393, 6393, 6393
> > +};
>
> const static
Fixed.
> also you permute the input explicitly instead of setting
> idct_permutation_type
> properly
>
> please dont sumbit trash
I don't think it is trash. Again, there is no left shift. It could be
substituted with 4 additions and no fpackfix mess, but than it would only be
marginally faster, and there would be more code in this file (as this
permutating is still needed for the other half of the idct), in other words it
would be more complex for little gain. So I prefer this version.
> > +void ff_simple_idct_add_vis(uint8_t *dest, int line_size, DCTELEM *data)
> > {
> > + ff_simple_idct_vis(data);
> > + ff_add_pixels_clamped_vis(data, dest, line_size);
> > +}
>
> check that gcc inlines these 4 calls, if not do something so it does, they
> should be inlined
I checked it, it didn't, now - thanks to the "inline" keyword - it does, I
checked it too.
Thanks for reviewing
bye
Denes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_vis_try3.diff
Type: text/x-diff
Size: 21821 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070822/1724ee15/attachment.diff>
More information about the ffmpeg-devel
mailing list