[FFmpeg-devel] [PATCH v3] SH4 mpegaudio decoder optimizations

Mon Dec 27 01:08:08 CET 2010

On Sun, Dec 26, 2010 at 07:49:29PM +0100, Guennadi Liakhovetski wrote:
> On Sun, 26 Dec 2010, Michael Niedermayer wrote:
> 
> > On Fri, Dec 17, 2010 at 05:54:56PM +0100, Guennadi Liakhovetski wrote:
> > > Hi all
> > > 
> > > This is version 3 of the following patch:
> > > 
> > > This patch implements several mpegaudio optimizations for SH4 FPU-enabled 
> > > SoCs. Verified to provide more than 40% acceleration, when decoding a 
> > > 128kbps stereo mp3 audio file to /dev/null.
> > > 
> > > Changes listed in the patch header, thanks to all who commented!
> > > 
> > > Thanks
> > > Guennadi
> > > ---
> > > Guennadi Liakhovetski, Ph.D.
> > > Freelance Open-Source Software Developer
> > > http://www.open-technology.de/
> > >  mathops.h      |    2 +
> > >  mpegaudiodec.c |   26 ++++++++++++---
> > >  sh4/mathops.h  |   97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 120 insertions(+), 5 deletions(-)
> > > ad802c254a9f25cf487eba7d3cb584112c6c59b9  ff_mpegaudio_sh4-v3.patch
> > > From: Guennadi Liakhovetski <g.liakhovetski at gmx.de>
> > > Subject: [PATCH v3] SH4 mpegaudio decoder optimizations
> > > 
> > > This patch implements several mpegaudio optimizations for SH4 FPU-enabled SoCs.
> > > Verified to provide more than 40% acceleration, when decoding a 128kbps stereo
> > > mp3 audio file to /dev/null.
> > > ---
> > > 
> > > v3:
> > > 1. added a "&" constraint to input-output assembly parameters
> > > 
> > > v2:
> > > 1. fixed copyright year to 2010
> > > 2. consistent lower-case register names
> > > 3. joined inline assembly macros into one
> > > 
> > > diff --git a/libavcodec/mathops.h b/libavcodec/mathops.h
> > > index 4d88ed1..b234ae4 100644
> > > --- a/libavcodec/mathops.h
> > > +++ b/libavcodec/mathops.h
> > > @@ -34,6 +34,8 @@
> > >  #   include "mips/mathops.h"
> > >  #elif ARCH_PPC
> > >  #   include "ppc/mathops.h"
> > > +#elif ARCH_SH4
> > > +#   include "sh4/mathops.h"
> > >  #elif ARCH_X86
> > >  #   include "x86/mathops.h"
> > >  #endif
> > > diff --git a/libavcodec/mpegaudiodec.c b/libavcodec/mpegaudiodec.c
> > > index 769be89..513fa1b 100644
> > > --- a/libavcodec/mpegaudiodec.c
> > > +++ b/libavcodec/mpegaudiodec.c
> > > @@ -584,6 +584,22 @@ static inline int round_sample(int64_t *sum)
> > >      op(sum, (w)[7 * 64], (p)[7 * 64]);    \
> > >  }
> > >  
> > > +#ifndef SUM8_MACS
> > > +#define SUM8_MACS(sum, w, p) SUM8(MACS, sum, w, p)
> > > +#endif
> > > +
> > > +#ifndef SUM8_MLSS
> > > +#define SUM8_MLSS(sum, w, p) SUM8(MLSS, sum, w, p)
> > > +#endif
> > > +
> > > +#ifndef SUM8P2_MACS_MLSS
> > > +#define SUM8P2_MACS_MLSS(sum, sum2, w, w2, p) SUM8P2(sum, MACS, sum2, MLSS, w, w2, p)
> > > +#endif
> > > +
> > > +#ifndef SUM8P2_MLSS_MLSS
> > > +#define SUM8P2_MLSS_MLSS(sum, sum2, w, w2, p) SUM8P2(sum, MLSS, sum2, MLSS, w, w2, p)
> > > +#endif
> > > +
> > >  #define SUM8P2(sum1, op1, sum2, op2, w1, w2, p) \
> > >  {                                               \
> > >      INTFLOAT tmp;\
> > > @@ -666,9 +682,9 @@ static void apply_window_mp3_c(MPA_INT *synth_buf, MPA_INT *window,
> > >  
> > >      sum = *dither_state;
> > >      p = synth_buf + 16;
> > > -    SUM8(MACS, sum, w, p);
> > > +    SUM8_MACS(sum, w, p);
> > >      p = synth_buf + 48;
> > > -    SUM8(MLSS, sum, w + 32, p);
> > > +    SUM8_MLSS(sum, w + 32, p);
> > >      *samples = round_sample(&sum);
> > >      samples += incr;
> > >      w++;
> > 
> > This is ugly
> > You should implement the API not change it in one file and implement the
> > modified API
> 
> This doesn't change any API. This just adds a couple of macros completely 
> internal to this file, no existing users are affected.

We have SUM8() (existing API)
You add SUM8_MACS() / SUM8_MLSS() which are a differnt API to do the same
You optimnize your API and switch mpegaudiodec to the new API

Thus we have 2 different APIs and only one of them has your optimizations,
this case is quite bad and not something i want to have to maintain

> 
> > [...]
> > > +
> > > +#define SH_SUM8_MACS(sum, w, p)                      \
> > > +do {                                                 \
> > > +    register const int32_t *wp = (w), *pp = (p);     \
> > > +    register int32_t tmp = 63 * 4;                   \
> > > +    union {int64_t x; int32_t u32[2];} u =           \
> > > +                    {.x = (sum),};                   \
> > > +    __asm__ volatile(                                \
> > > +    "    lds %[hi], mach\n"                          \
> > > +    "    lds %[lo], macl\n"                          \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 0 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 1 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 2 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 3 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 4 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 5 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 6 */           \
> > > +    "    add %[tmp], %[wp]\n"                        \
> > > +    "    add %[tmp], %[pp]\n"                        \
> > > +    "    mac.l @%[wp]+, @%[pp]+\n" /* 7 */           \
> > > +    "    sts mach, %[hi]\n"                          \
> > > +    "    sts macl, %[lo]"                            \
> > > +    : [hi] "+&r" (u.u32[1]),                         \
> > > +      [lo] "+&r" (u.u32[0]),                         \
> > > +      [wp] "+&r" (wp),                               \
> > > +      [pp] "+&r" (pp)                                \
> > > +    : [tmp] "r" (tmp)                                \
> > > +    : "mach", "macl");                               \
> > > +    sum = u.x;                                       \
> > > +} while (0)
> > 
> > Write the whole loop in asm, leaving it to gcc is not a good idea it can just
> > mess up.
> 
> Sorry, what do you mean? how can it "just mess up?"

Compilers turn C code into asm,they arent very good at it, which is why you can
make things 40% faster. They arent better at doing loops thus speed critical
ones should better be written inside asm instead of C loop around asm

> Actually, what loop do
> you mean? You don't mean the "do {...} while (0)" construct, do you?

this: (iam pretty sure you can do better than gcc)
    for(j=1;j<16;j++) {
        sum2 = 0;
        p = synth_buf + 16 + j;
        SUM8P2(sum, MACS, sum2, MLSS, w, w2, p);
        p = synth_buf + 48 - j;
        SUM8P2(sum, MLSS, sum2, MLSS, w + 32, w2 + 32, p);

        *samples = round_sample(&sum);
        samples += incr;
        sum += sum2;
        *samples2 = round_sample(&sum);
        samples2 -= incr;
        w++;
        w2--;
    }

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If you really think that XML is the answer, then you definitly missunderstood
the question -- Attila Kinali
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20101227/bb522702/attachment.pgp>