[FFmpeg-devel] [PATCH] fix add_bytes_mmx and add_bytes_l2_mmx for w <= 15

Sun Jun 22 09:34:08 CEST 2008

On Sun, Jun 22, 2008 at 03:16:14AM +0200, Michael Niedermayer wrote:
> On Sat, Jun 21, 2008 at 08:40:02PM +0200, Reimar Döffinger wrote:
> > as noticeable when decoding small png images, these two functions do not
> > work correctly and cause a segfault.
> > Attached is one possible solution, I think another would be to change
> > the jb to js and jmp to the comparison before the first loop.
> 
> I am ok with the solution that is faster, and if they are the same speed, the
> one that is smaller

In my quick tests (I am not going to do extensive benchmarks on code that will
be changed later anyway) the version using jmp is smaller and usually faster,
so I applied that.

> Besides the cmp is unneeded and can be removed

Like in the attached patch? Unfortunately the benchmark numbers seem
completely unrealistic to me; going by them there would be a 4x speedup in
some cases...
Though I tested with PNG images, and maybe they are a horrible test case.
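
For readers following the control flow, the count-down loop in the attached
patch can be modelled in plain C roughly as below. This is only a scalar
sketch under stated assumptions, not the MMX code itself: the name
add_bytes_countdown is hypothetical, the inner for loop stands in for the two
movq/paddb pairs, and ptrdiff_t stands in for x86_reg (assumed here to be a
signed, pointer-sized integer). The separate cmp can go because sub already
sets the sign flag that jns tests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the patched add_bytes_mmx control flow. */
static void add_bytes_countdown(uint8_t *dst, const uint8_t *src, int w)
{
    ptrdiff_t i = w;

    /* Mirrors "sub $16, i ; jns 1b": step back one 16-byte block and
     * process it only while the index stays non-negative. */
    while ((i -= 16) >= 0) {
        for (int k = 0; k < 16; k++)
            dst[i + k] += src[i + k];
    }

    /* Here i is in [-16, -1]; adding 16 yields the number of bytes
     * (0..15) left over at the start of the buffers. */
    i += 16;
    while (--i >= 0)
        dst[i] += src[i];
}
```

With w <= 15 the block loop body never runs and only the byte-wise tail loop
touches memory, which is the case the subject line is about.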

Example numbers:
previous code:
39350 dezicycles in blub, 1 runs, 0 skips
24925 dezicycles in blub, 2 runs, 0 skips
16242 dezicycles in blub, 4 runs, 0 skips
11603 dezicycles in blub, 8 runs, 0 skips
9407 dezicycles in blub, 16 runs, 0 skips

new code:
7450 dezicycles in blub, 1 runs, 0 skips
8040 dezicycles in blub, 2 runs, 0 skips
7265 dezicycles in blub, 4 runs, 0 skips
6841 dezicycles in blub, 8 runs, 0 skips
6843 dezicycles in blub, 16 runs, 0 skips
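
The ratios implied by the two lists above can be worked out with a few lines
of C. This is just arithmetic on the numbers as posted; the array and function
names are made up for illustration, and it says nothing about whether the
measurements themselves are realistic.

```c
#include <stdio.h>

/* Timings copied from the lists above, in dezicycles. */
static const double old_dezicycles[] = { 39350, 24925, 16242, 11603, 9407 };
static const double new_dezicycles[] = {  7450,  8040,  7265,  6841, 6843 };
static const int    run_counts[]     = { 1, 2, 4, 8, 16 };

/* Ratio of the old timing to the new timing. */
static double speedup(double before, double after)
{
    return before / after;
}

/* Print one ratio per reported run count. */
static void print_speedups(void)
{
    for (int k = 0; k < 5; k++)
        printf("%2d runs: %.2fx\n", run_counts[k],
               speedup(old_dezicycles[k], new_dezicycles[k]));
}
```

On these numbers the ratio falls from roughly 5.3x at 1 run to roughly 1.4x
at 16 runs, so the largest apparent gain comes from the earliest, least
averaged measurements.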

Greetings,
Reimar Döffinger
-------------- next part --------------
Index: libavcodec/i386/dsputil_mmx.c
===================================================================
--- libavcodec/i386/dsputil_mmx.c	(revision 13877)
+++ libavcodec/i386/dsputil_mmx.c	(working copy)
@@ -480,7 +480,7 @@
 }
 
 static void add_bytes_mmx(uint8_t *dst, uint8_t *src, int w){
-    x86_reg i=0;
+    x86_reg i=w;
     asm volatile(
         "jmp 2f                         \n\t"
         "1:                             \n\t"
@@ -492,19 +492,19 @@
         "movq 8(%2, %0), %%mm1          \n\t"
         "paddb %%mm0, %%mm1             \n\t"
         "movq %%mm1, 8(%2, %0)          \n\t"
-        "add $16, %0                    \n\t"
         "2:                             \n\t"
-        "cmp %3, %0                     \n\t"
-        " js 1b                         \n\t"
+        "sub $16, %0                    \n\t"
+        " jns 1b                        \n\t"
         : "+r" (i)
-        : "r"(src), "r"(dst), "r"((x86_reg)w-15)
+        : "r"(src), "r"(dst)
     );
-    for(; i<w; i++)
+    i += 16;
+    while(--i >= 0)
         dst[i+0] += src[i+0];
 }
 
 static void add_bytes_l2_mmx(uint8_t *dst, uint8_t *src1, uint8_t *src2, int w){
-    x86_reg i=0;
+    x86_reg i=w;
     asm volatile(
         "jmp 2f                         \n\t"
         "1:                             \n\t"
@@ -514,14 +514,14 @@
         "paddb 8(%3, %0), %%mm1         \n\t"
         "movq %%mm0,  (%1, %0)          \n\t"
         "movq %%mm1, 8(%1, %0)          \n\t"
-        "add $16, %0                    \n\t"
         "2:                             \n\t"
-        "cmp %4, %0                     \n\t"
-        " js 1b                         \n\t"
+        "sub $16, %0                    \n\t"
+        " jns 1b                        \n\t"
         : "+r" (i)
-        : "r"(dst), "r"(src1), "r"(src2), "r"((x86_reg)w-15)
+        : "r"(dst), "r"(src1), "r"(src2)
     );
-    for(; i<w; i++)
+    i += 16;
+    while(--i >= 0)
         dst[i] = src1[i] + src2[i];
 }