[MPlayer-dev-eng] [PATCH] (new version) AltiVec: dct64 for mp3lib, IMDCT for liba52, detection code

Romain Dolbeau dolbeau at irisa.fr
Sun Jan 19 15:23:45 CET 2003


Daniel Egger wrote:

> Sure, but the top functions are certainly not what I would have
> expected. For instance this is the profile on a G4 with linux
> playing a MPEG4 with mp3 audio:
> 
> Flat profile:
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  21.50     10.00    10.00 45664560     0.00     0.00  idctSparseColAdd
>   9.16     14.26     4.26 47547936     0.00     0.00  idctRowCondDC
>   6.73     17.39     3.13  2759489     0.00     0.00  put_pixels8_xy2_c
>   6.64     20.48     3.09  2716259     0.00     0.00  put_no_rnd_pixels8_xy2_c
>   5.40     22.99     2.51 19075470     0.00     0.00  mpeg4_decode_block
>   5.29     25.45     2.46   406296     0.01     0.01  synth_1to1
>   3.46     27.06     1.61  3397968     0.00     0.00  ff_h263_decode_mb
>   2.86     28.39     1.33  3397968     0.00     0.01  MPV_decode_mb
>   2.69     29.64     1.25      729     1.71     1.71  play
>   2.39     30.75     1.11  2798586     0.00     0.00  mpeg_motion
>   2.13     31.74     0.99   406296     0.00     0.00  dct64_1
>   2.04     32.69     0.95  3320142     0.00     0.00  MPV_motion
>   1.93     33.59     0.90  5708070     0.00     0.00  simple_idct_add
>   1.85     34.45     0.86  9409352     0.00     0.00  h263_decode_motion
>   1.72     35.25     0.80   261652     0.00     0.00  put_pixels16_x2_c
>   1.66     36.02     0.77   485686     0.00     0.00  dct36
>   1.66     36.79     0.77   256612     0.00     0.00  put_no_rnd_pixels16_x2_c
>   1.57     37.52     0.73   338964     0.00     0.00  ff_emulated_edge_mc
> 
> As you can see, the top offender is the iDCT which is exactly what one
> would expect because it's really computing intensive, the mp3 iDCT has
> far less data to compute than the video one (ratio of bandwidth 1/7)
> and thus is quite a bit below. The put_* functions perform the already
> mentioned MC and are quite intensive because of their memory touching
> nature. 

The guys who did the IDCT did an incredible job, the AltiVec
version is _much_ faster than simple_idct. Thanks, folks :-)

idct_add_altivec takes ~ 1102 CPU cycles when
using the reference C code (i.e. a call to simple_idct_add),
and is down to 139 CPU cycles with the AltiVec code,
a nearly 8x improvement (for in-L1 execution).

idct_put_altivec goes from ~ 936 CPU cycles down to 117.
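
As a quick sanity check on those ratios, here is a minimal Python sketch that just divides out the cycle counts quoted above. The `CYCLES` table and the `speedup` helper are illustrative names only, not anything from the MPlayer tree:

```python
# Cycle counts quoted above (in-L1 execution, PMC1 counting CPU cycles).
# Each entry is (reference C code, AltiVec code).
CYCLES = {
    "idct_add_altivec": (1102, 139),
    "idct_put_altivec": (936, 117),
}

def speedup(c_cycles, altivec_cycles):
    """Hypothetical helper: ratio of reference C cycles to AltiVec cycles."""
    return c_cycles / altivec_cycles

for name, (c_cycles, altivec_cycles) in CYCLES.items():
    print(f"{name}: {speedup(c_cycles, altivec_cycles):.2f}x")
```

Both functions come out at roughly 8x (about 7.93x for the add variant, 8x for the put variant).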

The IDCT alone takes about 32% of your run time, so
with the AltiVec IDCT (at ~8x) the whole thing would
take about 28% less time to run, and the IDCT would
then be about 5.55% of the total execution time. At that
point, synth_1to1 and dct64_1 seem to take a lot of time :-)
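
The arithmetic behind that estimate, as a rough Python sketch. It assumes the ~32% IDCT share from the quoted profile and a flat 8x speedup carrying over from my in-L1 measurements to the whole run, which is optimistic. The function name `estimate_after_speedup` is made up for illustration:

```python
def estimate_after_speedup(idct_share, speedup):
    """Return (fraction of total run time saved, IDCT share of the new run time).

    idct_share: fraction of the original run time spent in the IDCT (0..1)
    speedup:    assumed speedup factor applied to the IDCT only
    """
    new_idct = idct_share / speedup              # IDCT cost after the speedup
    new_total = (1.0 - idct_share) + new_idct    # everything else is unchanged
    saved = 1.0 - new_total
    return saved, new_idct / new_total

# Assumed inputs: 32% IDCT share, 8x faster IDCT.
saved, new_share = estimate_after_speedup(0.32, 8.0)
print(f"time saved: {saved:.1%}, new IDCT share: {new_share:.2%}")
```

With these inputs the IDCT drops from 32% to 4% of the original run time, so 28% is saved, and 4/72 of the shorter run (a bit under 5.6%) is still spent in the IDCT.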

(all numbers were taken on my 667 MHz L3-less tibook,
using PMC1 set to CPU cycles and PMC2 set to L1
cache misses, from an 8Mb sample:
   Stream #0.0: Video: msmpeg4, 320x240, 29.97 fps, 800 kb/s
   Stream #0.1: Audio: mp3, 44100 Hz, mono, 47 kb/s)


-- 
Romain Dolbeau


