[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Mon Sep 1 01:05:08 CEST 2008

On Sun, Aug 31, 2008 at 07:57:37PM -0300, Ramiro Polla wrote:
> On Sun, Aug 31, 2008 at 7:51 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Sun, Aug 31, 2008 at 04:42:35PM +0200, Michael Niedermayer wrote:
> >> On Sun, Aug 31, 2008 at 01:18:14PM +0300, Siarhei Siamashka wrote:
> >> > On Sunday 31 August 2008, Michael Niedermayer wrote:
> >> > > On Sat, Aug 30, 2008 at 11:42:31PM +0300, Siarhei Siamashka wrote:
> >> > > > On Saturday 30 August 2008, Loren Merritt wrote:
> >> > > > > On Sat, 30 Aug 2008, Siarhei Siamashka wrote:
> >> > > > > > This trivial patch improves overall vorbis decoding performance by
> >> > > > > > ~3% on Pentium-M with gcc 4.2.3
> >> > > > >
> >> > > > > vorbis_residue_decode_type# are superfluous. Just inline
> >> > > > > vorbis_residue_decode_internal into vorbis_residue_decode.
> >> > > >
> >> > > > Theoretically they are superfluous (inlining
> >> > > > vorbis_residue_decode_internal into vorbis_residue_decode was the first
> >> > > > thing that I tried). But in practice code is consistently faster this
> >> > > > way. Probably it is easier for gcc to optimize 3 independent functions
> >> > > > than everything bundled into a huge one. Let me know if you get different
> >> > > > results.
> >> > >
> >> > > well, I do
> >> > >
> >> > > [...]
> >> > >
> >> > > > --------------------
> >> > > > callgrind simulation for './ffmpeg_g.1huge' (L1 data cache is 32K):
> >> > > > I   refs:      85,817,091
> >> > > > D   refs:      43,457,905  (28,888,575 rd + 14,569,330 wr)
> >> > > > D1  misses:       785,564  (   583,645 rd +    201,919 wr)
> >> > > > D1  miss rate:        1.8% (       2.0%   +        1.3%  )
> >> > > > callgrind simulation for './ffmpeg_g.3func' (L1 data cache is 32K):
> >> > > > I   refs:      85,085,997
> >> > > > D   refs:      42,653,212  (28,454,961 rd + 14,198,251 wr)
> >> > > > D1  misses:       782,978  (   581,685 rd +    201,293 wr)
> >> > > > D1  miss rate:        1.8% (       2.0%   +        1.4%  )
> >> > > >
> >> > > > The difference is visible both for the total number of instructions and
> >> > > > for the number of memory accesses.
> >> > >
> >> > > loren:
> >> > > I   refs:      5,663,789,738
> >> > > I1  misses:        3,515,218
> >> > > I1  miss rate:          0.06%
> >> > > D   refs:      1,889,318,408  (1,365,757,445 rd   + 523,560,963 wr)
> >> > > D1  misses:       32,073,499  (   22,443,938 rd   +   9,629,561 wr)
> >> > > D1  miss rate:           1.6% (          1.6%     +         1.8%  )
> >> > >
> >> > > siar:
> >> > > I   refs:      5,670,795,747
> >> > > I1  misses:        3,488,120
> >> > > I1  miss rate:          0.06%
> >> > > D   refs:      1,896,279,210  (1,372,731,243 rd   + 523,547,967 wr)
> >> > > D1  misses:       32,096,476  (   22,464,805 rd   +   9,631,671 wr)
> >> > > D1  miss rate:           1.6% (          1.6%     +         1.8%  )
> >> >
> >> > Took time to compile/install gcc 4.3.2 and also got similar results. What's
> >> > more important, the fastest build generated by gcc 4.3.2 (all inlined) was
> >> > better than the fastest build generated by 4.2.3 (dummy functions). This
> >> > really makes the choice quite obvious :)
> >> >
> >> > > I'll commit the clean version without the dummy functions in a day or 2
> >> > > unless someone objects / has some idea of how to improve it.
> >> >
> >> > I also tried to benchmark the variants where 'vlen' is also inlined as
> >> > constants 128 and 1024, which are quite typical (in the hope that it could
> >> > save 1 extra register for gcc in the inner loop), but the effect on
> >> > performance was minimal.
> >> >
> >> > Regarding 'vorbis_residue_decode' function, it probably makes sense to
> >> > optimize these loops:
> >> >
> >> > if(dim==2) {
> >> >     for(k=0;k<step;++k) {
> >> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
> >> >         vec[voffs+k     ]+=codebook.codevectors[coffs  ];  // FPMATH
> >> >         vec[voffs+k+vlen]+=codebook.codevectors[coffs+1];  // FPMATH
> >> >     }
> >> > } else if(dim==4) {
> >> >     for(k=0;k<step;++k, voffs+=2) {
> >> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
> >> >         vec[voffs       ]+=codebook.codevectors[coffs  ];  // FPMATH
> >> >         vec[voffs+1     ]+=codebook.codevectors[coffs+2];  // FPMATH
> >> >         vec[voffs+vlen  ]+=codebook.codevectors[coffs+1];  // FPMATH
> >> >         vec[voffs+vlen+1]+=codebook.codevectors[coffs+3];  // FPMATH
> >> >     }
> >> > } ...
> >> >
> >> > The 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
> >> > so that the number of redundant intermediate UPDATE_CACHE operations is
> >> > minimized.
> >>
> >> These are all nice ideas, but they aren't really related to the change here,
> >> so patches welcome
> >
> > applied
> 
> Sorry for noticing after it was applied, but isn't this the kind of
> code that should have a special #ifdef CONFIG_SMALL case?

CONFIG_SMALL disables always_inline, so it's just an if/else if/else and an
extra function call. I am not sure the ifdefery we would need is worth it
to avoid that.
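
For illustration only, here is a minimal sketch of the pattern being discussed,
not the committed code. It assumes a toy stand-in for the residue loop, and the
macro SKETCH_ALWAYS_INLINE and both function names are hypothetical. With
forced inlining, each call passes 'type' as a constant, so the compiler can
emit three specialized copies of the loop. With CONFIG_SMALL defined, the
macro falls back to plain 'inline', and what remains is the if/else if/else
chain plus, at most, one call to a shared body.

```c
#include <assert.h>

/* Hypothetical macro for this sketch; it only mimics the behaviour described
 * in the mail (CONFIG_SMALL turns forced inlining off). */
#if defined(CONFIG_SMALL) || !defined(__GNUC__)
#define SKETCH_ALWAYS_INLINE inline
#else
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#endif

/* Toy stand-in for the shared decode body: 'type' selects what is added to
 * each element. When inlined with a constant 'type', the branches inside the
 * loop can be resolved at compile time. */
static SKETCH_ALWAYS_INLINE int residue_decode_internal(float *vec, int n, int type)
{
    int i;
    for (i = 0; i < n; i++) {
        if (type == 0)
            vec[i] += 1.0f;
        else if (type == 1)
            vec[i] += 2.0f;
        else
            vec[i] += 3.0f;
    }
    return 0;
}

/* Dispatcher: each branch calls the shared body with a literal 'type'. */
static int residue_decode(float *vec, int n, int type)
{
    assert(type >= 0 && type <= 2);
    if (type == 2)
        return residue_decode_internal(vec, n, 2);
    else if (type == 1)
        return residue_decode_internal(vec, n, 1);
    else
        return residue_decode_internal(vec, n, 0);
}
```

Building the sketch with and without -DCONFIG_SMALL and comparing the
generated code is one way to see how much the forced inlining costs in size.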

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Frequently ignored answer#1 FFmpeg bugs should be sent to our bugtracker. User
questions about the command line tools should be sent to the ffmpeg-user ML.
And questions about how to use libav* should be sent to the libav-user ML.