[FFmpeg-devel] MMX accelerated DSP functions for VC1/WMV3 decoders

Sun Jul 1 07:04:59 CEST 2007

Hi,

2007/7/1, Christophe GISQUET <christophe.gisquet at free.fr>:
> Hi,
>
> Michael Niedermayer wrote:
> >> +#if defined(CONFIG_VC1_DECODER) || defined(CONFIG_WMV3_DECODER)
> >> +extern void ff_vc1dsp_init_mmx(DSPContext* dsp, AVCodecContext *avctx);
> >> +#endif
> >> +
> >
> > the #if is unneeded
>
> Indeed, even if defined, the symbol won't be used when those conditions
> are not met.
>
> > [...]
> >> +     "psllw     $1, %%mm1               \n\t"                   \
> >> +     "psllw     $1, %%mm2               \n\t"                   \
> >
> > paddw
>
> Is that always faster?

According to Intel's and AMD's manuals, paddw has the same latency as
psllw on P6/Pentium 4/Core 2/K7/K8/K10, and higher throughput on Core 2.
So paddw is good.
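
For reference, the substitution is safe because adding a 16-bit lane to
itself with wraparound gives the same bits as shifting it left by one.
Here is a minimal scalar sketch of that per-lane behaviour; the function
names are made up for illustration and are not part of the patch:

```c
#include <stdint.h>

/* Per-lane effect of "psllw $1": shift a 16-bit value left by one bit. */
static uint16_t double_by_shift(uint16_t x)
{
    return (uint16_t)(x << 1);
}

/* Per-lane effect of "paddw" with both operands equal: wraparound add. */
static uint16_t double_by_add(uint16_t x)
{
    return (uint16_t)(x + x);
}

/* Counts the 16-bit inputs on which the two forms disagree. */
static unsigned count_mismatches(void)
{
    unsigned mismatches = 0;
    for (uint32_t v = 0; v <= 0xFFFF; v++)
        if (double_by_shift((uint16_t)v) != double_by_add((uint16_t)v))
            mismatches++;
    return mismatches;
}
```

Both forms drop the top bit on overflow, so the exhaustive check should
find no input where they differ.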

> > duplicating each filter 4 times with macros is unacceptable
> > the overhead for 2 calls is not that big
>
> OK. If I understand the plan correctly, you instead want 4 functions to be
> created, one per shift position. If we say that {1,2,3}/4 shift code
> sizes are N (and neglect the no-shift code size), we currently have a
> total code size of:
> 3*3*(N+N) + 2*3*(N+epsilon) + epsilon? ~ 24N
> Your plan is to get it to 3*N + epsilon ~ 3N
>
> However, with this, the same function is used for vertical and
> horizontal filtering. The tap offsets are no longer known at compile
> time, hence we have a more complex addressing pattern (of the
> [eax+ecx+N] kind) and one register fewer. And I would probably have to
> rewrite part of the macros.
>
> What would you say about having 1 vertical function and 1 horizontal for
> {1,2,3}/4 shift positions? This should double the code size compared to
> your plan, but looks much simpler to me.
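
To make the trade-off in the quoted paragraph concrete, here is a rough
scalar C sketch of the shared-function variant, where the distance between
taps is a runtime argument (1 for horizontal, the line stride for
vertical). The name filter4, the coefficient table and the rounding are
assumptions chosen for illustration, not code from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 4-tap coefficients summing to 16; illustrative only. */
static const int half_coef[4] = { -1, 9, 9, -1 };

/*
 * Hypothetical shared filter. src points at the second tap; `step` is the
 * distance between taps and is only known at run time, which is what
 * forces the base+index addressing mentioned above.
 */
static int filter4(const uint8_t *src, ptrdiff_t step,
                   const int coef[4], int shift)
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += coef[i] * src[(i - 1) * step];

    sum += 1 << (shift - 1);      /* round to nearest */
    if (sum < 0)
        return 0;                 /* clip before shifting a negative value */
    sum >>= shift;
    return sum > 255 ? 255 : sum; /* clip to the 8-bit sample range */
}
```

With separate vertical and horizontal functions, `step` would become a
compile-time constant in the horizontal case, which is the simplification
proposed in the quoted text.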

A bit off topic: currently not all MMX-accelerated code in ffmpeg has an
SSE2 equivalent. Of course SSE2 isn't always faster, especially when
alignment isn't guaranteed, but it's the way of the future, with
wider execution units and faster unaligned access in Core 2 and K10. I
can always help with MMX->SSE2 translation, but little beyond that (I'm
not a codec expert). Of course, as the author who really understands what
the code is doing, you can do better than that. So would you mind
providing an SSE2 optimization from the very beginning?
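
As a small illustration of what such a translation involves, here is a
sketch in C with SSE2 intrinsics (not the inline-asm style used in the
patch). It processes eight 16-bit lanes per iteration instead of MMX's
four and uses unaligned loads and stores, since alignment is not assumed.
The function name add_u16 is hypothetical, and the scalar loop is a
fallback for leftover samples and for builds without SSE2:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Adds two arrays of 16-bit samples with wraparound; n may be any length. */
static void add_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b,
                    size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Eight lanes per 128-bit register; loadu/storeu accept any address. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(va, vb));
    }
#endif
    /* Scalar tail, and the whole job when SSE2 is not available. */
    for (; i < n; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);
}
```

Whether the unaligned forms are fast enough depends on the CPU, which is
the caveat above about SSE2 not always being faster.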

-- 
Zuxy
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6