[FFmpeg-devel] [PATCH] Further optimization of base64 decode using AV_WB32.

Sat Jan 21 18:39:29 CET 2012

On Sat, Jan 21, 2012 at 06:30:48PM +0100, Reimar Döffinger wrote:
> On Sat, Jan 21, 2012 at 06:13:19PM +0100, Reimar Döffinger wrote:
> > On Sat, Jan 21, 2012 at 05:56:32PM +0100, Michael Niedermayer wrote:
> > > On Sat, Jan 21, 2012 at 05:52:27PM +0100, Reimar Döffinger wrote:
> > > > This is somewhat questionable.
> > > > The biggest issue is that av_bswap32 is not replaced
> > > > with our asm version on gcc 4.5 or newer.
> > > > This causes gcc to generate horrible code that is slower
> > > > than the unoptimized variant.
> > > > Old:                                  248852 decicycles
> > > > New with gcc's attempt at av_bswap32: 256576 decicycles
> > > > New with our bswap32:                 200260 decicycles
> > > [...]
> > > > diff --git a/libavutil/x86/bswap.h b/libavutil/x86/bswap.h
> > > > index 52ffb4d..aa39d97 100644
> > > > --- a/libavutil/x86/bswap.h
> > > > +++ b/libavutil/x86/bswap.h
> > > > @@ -37,7 +37,7 @@ static av_always_inline av_const unsigned av_bswap16(unsigned x)
> > > >  }
> > > >  #endif /* !AV_GCC_VERSION_AT_LEAST(4,1) */
> > > >  
> > > > -#if !AV_GCC_VERSION_AT_LEAST(4,5)
> > > > +#if 1 || !AV_GCC_VERSION_AT_LEAST(4,5)
> > > >  #define av_bswap32 av_bswap32
> > > >  static av_always_inline av_const uint32_t av_bswap32(uint32_t x)
> > > >  {
> > > 
> > > also make sure -cpu/arch/tune is set so gcc is allowed to use bswap
> > > (it's 486+), so gcc cannot use it on strict x86
> > 
> > It is an x86_64 build, so I'd hope that gcc will not try to "optimize"
> > for 486 on that...
> 
> The gcc version is actually 4.6.2, and it fails to use the bswap
> instruction regardless of whether I use no extra options, -march=native,
> -m32, or -m32 -march=native.
> In all cases the code without our inline bswap is significantly slower
> (ca. 20%).
> I have no idea where the claim that gcc would recognize the bswap comes
> from (hm, I haven't tested if the << 8 confuses it though, will now).

Yes, it does: only completely removing the shift fixes it.
One option would be to make the table 16 bit to avoid that shift.
However, my tests show that even though this saves the shift instruction,
the code does not become any faster in 64-bit mode and only maybe 2%
faster in 32-bit mode (except of course for unbreaking the compiler),
so it seems quite wasteful.
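
For reference, the pattern under discussion can be sketched roughly as
below. This is only an illustration under stated assumptions, not the code
from the patch: all sketch_* names are hypothetical stand-ins (for
av_bswap32, AV_WB32 and the decode table), and ASCII input is assumed. The
point is the "v << 8" that feeds the big-endian store, which on a
little-endian target amounts to a byte swap of a shifted value.

```c
#include <stdint.h>

/* Hypothetical stand-in for av_bswap32: a byte swap written as plain
 * shifts and masks, i.e. the form the compiler has to recognize. */
static inline uint32_t sketch_bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical stand-in for AV_WB32: store v in big-endian byte order.
 * On little-endian hardware this is equivalent to storing
 * sketch_bswap32(v) with a native 32-bit write. */
static inline void sketch_wb32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Map one base64 character to its 6-bit value, or -1 if it is not part
 * of the standard alphabet. Assumes ASCII; a real decoder would use a
 * lookup table instead. */
static int sketch_b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode four base64 characters into three bytes with a single 32-bit
 * big-endian store. dst must have room for 4 bytes: the fourth byte
 * written is padding (zero), not decoded data. Returns 0 on success,
 * -1 on an invalid character. */
static int sketch_decode_quad(uint8_t *dst, const char *in)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        int bits = sketch_b64_value(in[i]);
        if (bits < 0)
            return -1;
        v = (v << 6) | (uint32_t)bits;   /* accumulate 24 bits */
    }
    /* The extra shift that moves the 24 bits to the top of the word
     * before the swap/store is the "<< 8" mentioned above. */
    sketch_wb32(dst, v << 8);
    return 0;
}
```

Calling sketch_decode_quad(out, "TWFu") with a 4-byte buffer leaves 'M',
'a', 'n' in out[0..2] and a zero in out[3], so a caller decoding a longer
stream would advance dst by 3 per group and let the next store overwrite
the padding byte.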