[Ffmpeg-devel] [patch] Snow - add_yblock x86 asm
Yartrebo
yartrebo
Sun Apr 10 06:58:46 CEST 2005
On Sat, 2005-04-09 at 16:40 +0200, Michael Niedermayer wrote:
> Hi
>
> On Saturday 09 April 2005 16:05, Yartrebo wrote:
> >
> > okay, if you like it that way. Could you tell me how to do the run-time
> > code selection and how to place it in dsputil?
>
> just look at libavcodec/{,i386/}dsputil*.{c,h}, its very simple
> theres a DSPContext with function pointers, which get set to the specific
> implementation like plain c or mmx
> the actual implementations should also be in their own files as dsputil*.c is
> already quite big (snow_dsp.c and snow_dsp_mmx.c might be possible names)
>
I would really like to not put it in dsputil, and here are my reasons:
- They are snow specific and would require a large and awkward
interface with zero chance of reuse.
- Performance is substantially compromised. On my slow P4, the 8x8 SSE2
function takes nary 420 clocks to finish. A large indirect function call
will do a serious number on the execution time.
- I have them cleanly tucked away. In all, doing
it my way adds 7 lines to snow.c (versus 4 or so for dsputil).
- Run-time checking and compile checking are both supported.
- This implementation is much more efficient. Overhead is a handful of
clocks instead of possibly over 100 for a large (9 or so parameters)
indirect function call. My implementation is 100% pure preprocessor code
if you ignore comments (7 preprocessor lines in snow.c, about 640 in
snow_mmx_sse2.h, and about a dozen in mmx.h). The code is directly inserted
into add_yblock_buffered() and results in a much smaller executable size AND
a faster executable.
- It's clean. dsputil really doesn't need more garbage.
On a similar note, I propose keeping the name snow_mmx_sse2.h for the
implementation, as I don't want it to be confused with dsputil.
>
> >
> >
> > LOG2_OBMC_MAX is kept to 6 for the C because you wanted the minimum bits
> > possible for the general implementation.
>
> i just wanted to keep the constants fit into 6bits not necessary keep the
> actual table unchanged, i have no problem with multiplying them with 4 or
> 1024 or whatever as long as it doesnt change the bitstream / we can reverse
> it if its slower on some cpu, without breaking compatibility
>
fixed. constants are now all multiplied by four, and MAX_OBMC_LOG is now 8.
>
> [...]
> > > > +#define add_yblock_bw_8_obmc_16_mmx \
> > > > + pxor_r2r(mm7, mm7);\
> > > > + pcmpeqd_r2r(mm4, mm4);\
> > > > + pslld_i2r(31, mm4);\
> > >
> > > please dont mix mmx.h with asm() style
> > >
> > > > + "movq (%%esi), %%xmm0; \n\t"\
> > > > + "movl 12(%%eax), %%edx; \n\t"\
> > >
> > > i dont think this will work on x86-64
> >
> > Unfortunately, none of my pointer code will work on x86-64. To be
> > honest, I thought that mplayer assumed that int == * == int32_t == 4 in
> > a lot of places and that mplayer would be forever 32 bit.
> >
> > Would you have any suggestions as to how to make code run in both 32 bit
> > and 64 bit modes without much or any loss of performance?
>
> please see the x86-64 patch, which should explain the problem better
> http://www1.mplayerhq.hu/cgi-bin/cvsweb.cgi/ffmpeg/libavcodec/i386/dsputil_mmx.c.diff?r1=1.87&r2=1.88&cvsroot=FFMpeg
>
fixed, hopefully. I need someone with an AMD-64 machine to test it.
Also, I've taken the liberty to improve my slice_buffer by using a single
malloc. I didn't realize that it would have a measurable impact.
I've also filled out the set of #define REG_x [e,r]ax in mmx.h because
my code needs them for AMD-64 interoperability.
Sincerely,
Robert Edele
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snow_add_yblock_mmx_sse2.patch
Type: text/x-patch
Size: 39632 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20050410/d46754a9/attachment.bin>
More information about the ffmpeg-devel
mailing list