[FFmpeg-devel] [PATCH] Remove mmx.h from dsputil_mmx.c

Wed Mar 25 21:45:16 CET 2009

On Mon, Mar 23, 2009 at 8:48 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Mon, Mar 23, 2009 at 07:38:49PM -0400, Alex Converse wrote:
>> 2009/3/14 Michael Niedermayer <michaelni at gmx.at>:
>> > On Sat, Mar 14, 2009 at 12:37:54PM -0400, Alex Converse wrote:
>> >> On Sat, Mar 14, 2009 at 1:05 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> > On Fri, Mar 13, 2009 at 08:38:09PM -0400, Alex Converse wrote:
>> >> >> On Wed, Mar 11, 2009 at 3:44 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> > On Wed, Mar 11, 2009 at 03:19:03AM -0400, Alex Converse wrote:
>> >> >> >> On Tue, Mar 10, 2009 at 10:04 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> >> >> >> > On Tue, Mar 10, 2009 at 09:19:56PM -0400, Alex Converse wrote:
>> >> >> >> >> This is my first mmx patch so please bear with me.
>> >> >> >> >
>> >> >> >> >> diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> index 2f47f5f..3771c5f 100644
>> >> >> >> >> --- a/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> +++ b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> >> @@ -28,7 +28,6 @@
>> >> >> >> >> ?#include "libavcodec/mpegvideo.h"
>> >> >> >> >> ?#include "libavcodec/simple_idct.h"
>> >> >> >> >> ?#include "dsputil_mmx.h"
>> >> >> >> >> -#include "mmx.h"
>> >> >> >> >> ?#include "vp3dsp_mmx.h"
>> >> >> >> >> ?#include "vp3dsp_sse2.h"
>> >> >> >> >> ?#include "vp6dsp_mmx.h"
>> >> >> >> >> @@ -278,16 +277,33 @@ static DECLARE_ALIGNED_8(const unsigned char, vector128[8]) =
>> >> >> >> >>
>> >> >> >> >> ?void put_signed_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size)
>> >> >> >> >> ?{
>> >> >> >> >> + ? ?const DCTELEM *p = block;
>> >> >> >> >> + ? ?uint8_t *pix = pixels;
>> >> >> >> >> ? ? ?int i;
>> >> >> >> >>
>> >> >> >> >> - ? ?movq_m2r(*vector128, mm1);
>> >> >> >> >> - ? ?for (i = 0; i < 8; i++) {
>> >> >> >> >> - ? ? ? ?movq_m2r(*(block), mm0);
>> >> >> >> >> - ? ? ? ?packsswb_m2r(*(block + 4), mm0);
>> >> >> >> >> - ? ? ? ?block += 8;
>> >> >> >> >> - ? ? ? ?paddb_r2r(mm1, mm0);
>> >> >> >> >> - ? ? ? ?movq_r2m(mm0, *pixels);
>> >> >> >> >> - ? ? ? ?pixels += line_size;
>> >> >> >> >> + ? ?__asm__ volatile ("movq %0, %%mm0" : : "m" (*vector128));
>> >> >> >> >> + ? ?for (i = 0; i < 8; i+=4) {
>> >> >> >> >> + ? ? ? ?__asm__ volatile (
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq ? (%1), %%mm1 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 16(%1), %%mm2 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 32(%1), %%mm3 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq 48(%1), %%mm4 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb ?8(%1), %%mm1 ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 24(%1), %%mm2 ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 40(%1), %%mm3 ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"packsswb 56(%1), %%mm4 ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm1 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm2 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm3 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"paddb %%mm0, %%mm4 ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm1, (%0) ? ? ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm2, (%0, %2) ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm3, (%0, %2, 2) ? ? ? ?\n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?"movq %%mm4, (%0, %3) ? ? ? ? ? \n\t"
>> >> >> >> >> + ? ? ? ? ? ? ? ?::"r" (pix), "r" (p), "r"((x86_reg)line_size), "r"((x86_reg)3*line_size)
>> >> >> >> >> + ? ? ? ? ? ? ? ?:"memory");
>> >> >> >> >> + ? ? ? ?p += 32;
>> >> >> >> >> + ? ? ? ?pix += 4*line_size;
>> >> >> >> >
>> >> >> >> > 1. the for() should not be there, it should be in the asm()
>> >> >> >>
>> >> >> >> The for loop has been eliminated
>> >> >> >>
>> >> >> >> > 2. benchmark is always needed when messing with optimized code.
>> >> >> >> > ? ?START/STOP_TIMER around the point that calls the changed
>> >> >> >> > ? ?function is likely the easiest way to do it
>> >> >> >>
>> >> >> >> The benchmarks seem inconclusive. The unpatched version seems to start
>> >> >> >> out faster and end slower. I don't know what of make of this.
>> >> >> >
>> >> >> > [...]
>> >> >> >
>> >> >> >> diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> index 2f47f5f..b9679ad 100644
>> >> >> >> --- a/libavcodec/x86/dsputil_mmx.c
>> >> >> >> +++ b/libavcodec/x86/dsputil_mmx.c
>> >> >> >> @@ -28,7 +28,6 @@
>> >> >> >> ?#include "libavcodec/mpegvideo.h"
>> >> >> >> ?#include "libavcodec/simple_idct.h"
>> >> >> >> ?#include "dsputil_mmx.h"
>> >> >> >> -#include "mmx.h"
>> >> >> >> ?#include "vp3dsp_mmx.h"
>> >> >> >> ?#include "vp3dsp_sse2.h"
>> >> >> >> ?#include "vp6dsp_mmx.h"
>> >> >> >> @@ -276,19 +275,39 @@ void put_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size
>> >> >> >> ?static DECLARE_ALIGNED_8(const unsigned char, vector128[8]) =
>> >> >> >> ? ?{ 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
>> >> >> >>
>> >> >> >> +#define put_signed_pixels_clamped_mmx_half \
>> >> >> >> + ? ? ? ? ? ?"movq ? (%1), %%mm1 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq 16(%1), %%mm2 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq 32(%1), %%mm3 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq 48(%1), %%mm4 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"packsswb ?8(%1), %%mm1 ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"packsswb 24(%1), %%mm2 ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"packsswb 40(%1), %%mm3 ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"packsswb 56(%1), %%mm4 ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm1 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm2 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm3 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"paddb %%mm0, %%mm4 ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq %%mm1, (%0) ? ? ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq %%mm2, (%0, %2) ? ? ? ? ? ? ? \n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq %%mm3, (%0, %2, 2) ? ? ? ? ? ?\n\t"\
>> >> >> >> + ? ? ? ? ? ?"movq %%mm4, (%0, %3) ? ? ? ? ? ? ? \n\t"
>> >> >> >> +
>> >> >> >> ?void put_signed_pixels_clamped_mmx(const DCTELEM *block, uint8_t *pixels, int line_size)
>> >> >> >> ?{
>> >> >> >> - ? ?int i;
>> >> >> >> -
>> >> >> >> - ? ?movq_m2r(*vector128, mm1);
>> >> >> >> - ? ?for (i = 0; i < 8; i++) {
>> >> >> >> - ? ? ? ?movq_m2r(*(block), mm0);
>> >> >> >> - ? ? ? ?packsswb_m2r(*(block + 4), mm0);
>> >> >> >> - ? ? ? ?block += 8;
>> >> >> >> - ? ? ? ?paddb_r2r(mm1, mm0);
>> >> >> >> - ? ? ? ?movq_r2m(mm0, *pixels);
>> >> >> >> - ? ? ? ?pixels += line_size;
>> >> >> >> - ? ?}
>> >> >> >> + ? ?const DCTELEM *p = block;
>> >> >> >> + ? ?uint8_t *pix = pixels;
>> >> >> >> + ? ?x86_reg line_skip = line_size;
>> >> >> >> +
>> >> >> >> + ? ?__asm__ volatile (
>> >> >> >> + ? ? ? ? ? ?"movq (%3), %%mm0 ? ? ? ? ? ? ? ? ? \n\t"
>> >> >> >> + ? ? ? ? ? ?"lea (%2, %2, 2), %3 ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> + ? ? ? ? ? ?put_signed_pixels_clamped_mmx_half
>> >> >> >> + ? ? ? ? ? ?"add $64, %1 ? ? ? ? ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> + ? ? ? ? ? ?"lea (%0, %2, 4), %0 ? ? ? ? ? ? ? ?\n\t"
>> >> >> >> + ? ? ? ? ? ?put_signed_pixels_clamped_mmx_half
>> >> >> >> + ? ? ? ? ? ?::"r" (pix), "r" (p), "r"(line_skip), "r" (vector128)
>> >> >> >> + ? ? ? ? ? ?:"memory");
>> >> >> >
>> >> >> > you dont need the add $64, this can be done by changing the offsets
>> >> >> > also %0,%1 and %3 are marked as input, while you change them
>> >> >> >
>> >> >>
>> >> >> Fixed
>> >> >>
>> >> >> Do the x87 stack or the mm* registers need to be marked as clobbered?
>> >> >
>> >> > strictly yes but we dont do it because it works and it might be faster
>> >> >
>> >> >>
>> >> >> Here is cycle estimation from Valgrind. I find it easier to read:
>> >> >> ? ? ? ? Ir ? ? ? ? Dr ? ? ? ? Dw ? ? I1mr ? ?D1mr ? ? D1mw ?I2mr ?D2mr ? ?D2mw
>> >> >> function put_signed_pixels_clamped_mmx:
>> >> >> before:
>> >> >> 16,088,784 ?6,033,294 ?2,681,464 ?122,988 ?34,505 ?259,448 ? 991 ? 292 ?27,624
>> >> >> ? Cycle Estimation = Ir + 10*L1m + 100*L2m = 23,184,894
>> >> >> after:
>> >> >> 12,736,954 ?6,033,294 ?2,681,464 ? 64,595 ?33,784 ?258,686 ? 617 ? 294 ?27,622
>> >> >> ? Cycle Estimation = Ir + 10*L1m + 100*L2m = 19,160,904
>> >> >
>> >> > nothing against valgrind but i prefer benchmarks done on real hardware
>> >> > where its possible because we have more users that run ffmpeg on real
>> >> > hw than who run it in emulators
>> >> >
>> >>
>> >> Well is there a guide to interpreting those benchmarks? I can't make
>> >> heads or tails of them.
>> >
>> > You can just write a tiny test program that runs a loop with START/STOP_TIMER
>> > 10000 times and have the inner loop do as many iterations as
>> > a command line arg says
>> > then run it with 100 and 101 and look at the benchmark scores.
>> >
>> > [...]
>> >
>>
>> I wrote a little timer test program, and when going from 100 to 101
>> all the numbers increased like I expected. I still am confused about
>> how to interpret the numbers I posted.
>
> hmm
> the value with the largest run number should be a better indicator of what
> is faster
>

Thanks, though this time I got my patched version doing better across
all runs for some reason.

881 dezicycles in 1st call, 8388543 runs, 65 skips
vs.
862 dezicycles in 1st call, 8388524 runs, 84 skips

or if you want the biggest with zero skips:
1315 dezicycles in 1st call, 8192 runs, 0 skips
vs
862 dezicycles in 1st call, 8192 runs, 0 skips

full logs attached.

--Alex
-------------- next part --------------
FFmpeg version git-531def6, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --enable-gpl --disable-ffserver
  libavutil     50. 2. 0 / 50. 2. 0
  libavcodec    52.22. 3 / 52.22. 3
  libavformat   52.32. 0 / 52.32. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  libswscale     0. 7. 1 /  0. 7. 1
  built on Mar 25 2009 16:28:07, gcc: 4.3.2
Output #0, null, to 'pipe:':
    Stream #0.0: Video: rawvideo, yuv420p, 1280x720, q=2-31, 200 kb/s, 90k tbn, 59.92 tbc
9450 dezicycles in 1st call, 1 runs, 0 skips
5265 dezicycles in 1st call, 2 runs, 0 skips
3127 dezicycles in 1st call, 4 runs, 0 skips
2171 dezicycles in 1st call, 8 runs, 0 skips
1569 dezicycles in 1st call, 16 runs, 0 skips
1260 dezicycles in 1st call, 32 runs, 0 skips
1070 dezicycles in 1st call, 64 runs, 0 skips
965 dezicycles in 1st call, 128 runs, 0 skips
910 dezicycles in 1st call, 256 runs, 0 skips
888 dezicycles in 1st call, 512 runs, 0 skips
871 dezicycles in 1st call, 1024 runs, 0 skips
862 dezicycles in 1st call, 2048 runs, 0 skips
864 dezicycles in 1st call, 4096 runs, 0 skips
862 dezicycles in 1st call, 8192 runs, 0 skips
860 dezicycles in 1st call, 16384 runs, 0 skips
860 dezicycles in 1st call, 32768 runs, 0 skips
860 dezicycles in 1st call, 65536 runs, 0 skips
859 dezicycles in 1st call, 131070 runs, 2 skips
860 dezicycles in 1st call, 262141 runs, 3 skips
860 dezicycles in 1st call, 524285 runs, 3 skips
861 dezicycles in 1st call, 1048566 runs, 10 skips
861 dezicycles in 1st call, 2097131 runs, 21 skips
1710 dezicycles in 2nd call, 1 runs, 0 skips
1710 dezicycles in 2nd call, 2 runs, 0 skips
1395 dezicycles in 2nd call, 4 runs, 0 skips
1282 dezicycles in 2nd call, 8 runs, 0 skips
1231 dezicycles in 2nd call, 16 runs, 0 skips
1130 dezicycles in 2nd call, 32 runs, 0 skips
1035 dezicycles in 2nd call, 64 runs, 0 skips
976 dezicycles in 2nd call, 128 runs, 0 skips
947 dezicycles in 2nd call, 256 runs, 0 skips
862 dezicycles in 1st call, 4194268 runs, 36 skips
932 dezicycles in 2nd call, 512 runs, 0 skips
918 dezicycles in 2nd call, 1024 runs, 0 skips
916 dezicycles in 2nd call, 2048 runs, 0 skips
919 dezicycles in 2nd call, 4096 runs, 0 skips
862 dezicycles in 1st call, 8388524 runs, 84 skips
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dsputil_mmx-no-mmx.h.diff
Type: text/x-diff
Size: 2540 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20090325/6c2bbe06/attachment.diff>
-------------- next part --------------
FFmpeg version git-531def6, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --enable-gpl --disable-ffserver
  libavutil     50. 2. 0 / 50. 2. 0
  libavcodec    52.22. 3 / 52.22. 3
  libavformat   52.32. 0 / 52.32. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  libswscale     0. 7. 1 /  0. 7. 1
  built on Mar 25 2009 16:28:07, gcc: 4.3.2
Output #0, null, to 'pipe:':
    Stream #0.0: Video: rawvideo, yuv420p, 1280x720, q=2-31, 200 kb/s, 90k tbn, 59.92 tbc
9810 dezicycles in 1st call, 1 runs, 0 skips
6120 dezicycles in 1st call, 2 runs, 0 skips
3892 dezicycles in 1st call, 4 runs, 0 skips
2823 dezicycles in 1st call, 8 runs, 0 skips
2137 dezicycles in 1st call, 16 runs, 0 skips
1740 dezicycles in 1st call, 32 runs, 0 skips
1537 dezicycles in 1st call, 64 runs, 0 skips
1428 dezicycles in 1st call, 128 runs, 0 skips
1375 dezicycles in 1st call, 256 runs, 0 skips
1353 dezicycles in 1st call, 512 runs, 0 skips
1329 dezicycles in 1st call, 1024 runs, 0 skips
1320 dezicycles in 1st call, 2048 runs, 0 skips
1322 dezicycles in 1st call, 4096 runs, 0 skips
1315 dezicycles in 1st call, 8192 runs, 0 skips
1181 dezicycles in 1st call, 16383 runs, 1 skips
1029 dezicycles in 1st call, 32767 runs, 1 skips
954 dezicycles in 1st call, 65535 runs, 1 skips
915 dezicycles in 1st call, 131070 runs, 2 skips
898 dezicycles in 1st call, 262141 runs, 3 skips
888 dezicycles in 1st call, 524284 runs, 4 skips
884 dezicycles in 1st call, 1048570 runs, 6 skips
882 dezicycles in 1st call, 2097134 runs, 18 skips
1620 dezicycles in 2nd call, 1 runs, 0 skips
1665 dezicycles in 2nd call, 2 runs, 0 skips
1485 dezicycles in 2nd call, 4 runs, 0 skips
1395 dezicycles in 2nd call, 8 runs, 0 skips
1310 dezicycles in 2nd call, 16 runs, 0 skips
1186 dezicycles in 2nd call, 32 runs, 0 skips
1147 dezicycles in 2nd call, 64 runs, 0 skips
1080 dezicycles in 2nd call, 128 runs, 0 skips
1038 dezicycles in 2nd call, 256 runs, 0 skips
882 dezicycles in 1st call, 4194272 runs, 32 skips
1008 dezicycles in 2nd call, 512 runs, 0 skips
989 dezicycles in 2nd call, 1024 runs, 0 skips
985 dezicycles in 2nd call, 2048 runs, 0 skips
982 dezicycles in 2nd call, 4096 runs, 0 skips
881 dezicycles in 1st call, 8388543 runs, 65 skips