[MPlayer-dev-eng] [PATCH] (new version) AltiVec: dct64 for mp3lib, IMDCT for liba52, detection code
Romain Dolbeau
dolbeau at irisa.fr
Sun Jan 19 15:23:45 CET 2003
Daniel Egger wrote:
> Sure, but the top functions are certainly not what I would have
> expected. For instance this is the profile on a G4 with linux
> playing an MPEG4 with mp3 audio:
>
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  21.50     10.00    10.00 45664560     0.00     0.00  idctSparseColAdd
>   9.16     14.26     4.26 47547936     0.00     0.00  idctRowCondDC
>   6.73     17.39     3.13  2759489     0.00     0.00  put_pixels8_xy2_c
>   6.64     20.48     3.09  2716259     0.00     0.00  put_no_rnd_pixels8_xy2_c
>   5.40     22.99     2.51 19075470     0.00     0.00  mpeg4_decode_block
>   5.29     25.45     2.46   406296     0.01     0.01  synth_1to1
>   3.46     27.06     1.61  3397968     0.00     0.00  ff_h263_decode_mb
>   2.86     28.39     1.33  3397968     0.00     0.01  MPV_decode_mb
>   2.69     29.64     1.25      729     1.71     1.71  play
>   2.39     30.75     1.11  2798586     0.00     0.00  mpeg_motion
>   2.13     31.74     0.99   406296     0.00     0.00  dct64_1
>   2.04     32.69     0.95  3320142     0.00     0.00  MPV_motion
>   1.93     33.59     0.90  5708070     0.00     0.00  simple_idct_add
>   1.85     34.45     0.86  9409352     0.00     0.00  h263_decode_motion
>   1.72     35.25     0.80   261652     0.00     0.00  put_pixels16_x2_c
>   1.66     36.02     0.77   485686     0.00     0.00  dct36
>   1.66     36.79     0.77   256612     0.00     0.00  put_no_rnd_pixels16_x2_c
>   1.57     37.52     0.73   338964     0.00     0.00  ff_emulated_edge_mc
>
> As you can see, the top offender is the iDCT, which is exactly what one
> would expect because it's really compute-intensive. The mp3 iDCT has
> far less data to compute than the video one (ratio of bandwidth 1/7)
> and thus is quite a bit below. The put_* functions perform the already
> mentioned MC and are quite intensive because of their memory-touching
> nature.
The guys who did the IDCT did an incredible job: the AltiVec
version is _much_ faster than simple_idct. Thanks, folks :-)
idct_add_altivec takes ~1102 CPU cycles when it falls back
to the reference C code (i.e. a call to simple_idct_add),
and is down to 139 CPU cycles with the AltiVec code,
a nearly 8x improvement (for in-L1 execution).
idct_put_altivec goes from ~936 CPU cycles down to 117.
The IDCT alone takes about 32% of the time in your profile,
so with a roughly 8x faster AltiVec IDCT the whole run would
take about 28% less time, and the IDCT would then account
for about 5.55% of the total execution time. At that point,
synth_1to1 and dct64_1 seem to take a lot of time :-)
(All numbers were taken on my 667 MHz L3-less TiBook,
using PMC1 set to CPU cycles and PMC2 set to L1
cache misses, from an 8 MB sample:
Stream #0.0: Video: msmpeg4, 320x240, 29.97 fps, 800 kb/s
Stream #0.1: Audio: mp3, 44100 Hz, mono, 47 kb/s)
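
For anyone who wants to check the arithmetic, here is a rough sketch of
the estimate in Python. It assumes the 32% figure is the sum of the
idctSparseColAdd, idctRowCondDC and simple_idct_add lines of the quoted
profile (32.59%, rounded down), and that the ~8x in-L1 speedup carries
over to the whole run; the function names are made up for illustration.

```python
# Cycle counts quoted above (in-L1 execution).
ADD_C, ADD_ALTIVEC = 1102, 139   # idct_add: C reference vs. AltiVec
PUT_C, PUT_ALTIVEC = 936, 117    # idct_put: C reference vs. AltiVec


def speedup(before, after):
    """Ratio of cycle counts, i.e. how many times faster the new code is."""
    return before / after


def after_speedup(fraction, factor):
    """Amdahl-style estimate.

    fraction: share of total run time spent in the accelerated code (0..1)
    factor:   how many times faster that code becomes
    Returns (share of total time saved, new share of the accelerated code).
    """
    new_part = fraction / factor
    new_total = (1.0 - fraction) + new_part
    return 1.0 - new_total, new_part / new_total


# Assumption: IDCT share taken as 32% of the profile, speedup taken as 8x.
idct_share = 0.32
saved, new_share = after_speedup(idct_share, 8.0)

print(f"idct_add speedup: {speedup(ADD_C, ADD_ALTIVEC):.2f}x")
print(f"idct_put speedup: {speedup(PUT_C, PUT_ALTIVEC):.2f}x")
print(f"time saved: {saved:.1%}, new IDCT share: {new_share:.2%}")
```

The new share is 4/72 of the shortened run, which is where the figure of
about 5.55% comes from. Real playback will not stay in L1 the whole
time, so treat the result as an upper bound on the gain.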
--
Romain Dolbeau