[FFmpeg-devel] [FFmpeg-cvslog] r12171 - trunk/doc/optimization.txt

Thu Feb 21 20:16:39 CET 2008

Hi,

On Thu, Feb 21, 2008 at 9:11 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Thu, Feb 21, 2008 at 08:52:17PM +0200, İsmail Dönmez wrote:
>  > Hi,
>  >
>  > >Author: melanson
>  > >Date: Thu Feb 21 19:46:49 2008
>  > >New Revision: 12171
>  > >
>  > >Log:
>  > >minor English corrections
>  > >
>  > >
>  > >Modified:
>  > >  trunk/doc/optimization.txt
>  > [...]
>  > >  -Use asm() instead of intrinsics. Later requires a good optimizing compiler
>  > >  +Use asm() instead of intrinsics. The latter requires a good optimizing compiler
>  > >   which gcc is not.
>  >
>  > We all know this is FUD now. I know Michael still uses gcc 2.95, but
>  > the world has moved on. GCC 4.3 is about to be released.
>  > So please either back up these claims or note that this is not true for
>  > recent GCCs.
>
>  I use gcc r132072 ATM. I admit it's a few days old; do you claim that gcc
>  was rewritten yesterday?
>
>  Also, to back up the claim, the following was suggested to me a few days ago:
>  -static inline void diff_pixels_mmx(DCTELEM *block, const uint8_t *s1, const uint8_t *s2, int stride)
>  +static void diff_pixels_mmx(DCTELEM *block, const uint8_t *s1, const uint8_t *s2, long stride)
>   {
>  -    asm volatile(
>  -        "pxor %%mm7, %%mm7              \n\t"
>  -        "mov $-128, %%"REG_a"           \n\t"
>  -        ASMALIGN(4)
>  -        "1:                             \n\t"
>  -        "movq (%0), %%mm0               \n\t"
>  -        "movq (%1), %%mm2               \n\t"
>  -        "movq %%mm0, %%mm1              \n\t"
>  -        "movq %%mm2, %%mm3              \n\t"
>  -        "punpcklbw %%mm7, %%mm0         \n\t"
>  -        "punpckhbw %%mm7, %%mm1         \n\t"
>  -        "punpcklbw %%mm7, %%mm2         \n\t"
>  -        "punpckhbw %%mm7, %%mm3         \n\t"
>  -        "psubw %%mm2, %%mm0             \n\t"
>  -        "psubw %%mm3, %%mm1             \n\t"
>  -        "movq %%mm0, (%2, %%"REG_a")    \n\t"
>  -        "movq %%mm1, 8(%2, %%"REG_a")   \n\t"
>  -        "add %3, %0                     \n\t"
>  -        "add %3, %1                     \n\t"
>  -        "add $16, %%"REG_a"             \n\t"
>  -        "jnz 1b                         \n\t"
>  -        : "+r" (s1), "+r" (s2)
>  -        : "r" (block+64), "r" ((long)stride)
>  -        : "%"REG_a
>  -    );
>  +    long offset = -128;
>  +    MOVQ_ZERO(mm7);
>  +    do {
>  +        asm volatile(
>  +            "movq (%0), %%mm0         \n\t"
>  +            "movq (%1), %%mm2         \n\t"
>  +            "movq %%mm0, %%mm1        \n\t"
>  +            "movq %%mm2, %%mm3        \n\t"
>  +            "punpcklbw %%mm7, %%mm0   \n\t"
>  +            "punpckhbw %%mm7, %%mm1   \n\t"
>  +            "punpcklbw %%mm7, %%mm2   \n\t"
>  +            "punpckhbw %%mm7, %%mm3   \n\t"
>  +            "psubw %%mm2, %%mm0       \n\t"
>  +            "psubw %%mm3, %%mm1       \n\t"
>  +            "movq %%mm0, (%2, %4)     \n\t"
>  +            "movq %%mm1, 8(%2, %4)    \n\t"
>  +            : : "r" (s1), "r" (s2), "r" (block+64), "r" (stride), "r" (offset)
>  +            : "memory");
>  +        s1 += stride;
>  +        s2 += stride;
>  +        offset += 16;
>  +    } while (offset < 0);
>   }
>
>  the effect that has on the generated asm is:
>  .L143:
>         .loc 3 241 0
>         leaq    (%rsi,%r8), %rdx
>         leaq    (%r10,%r8), %rax
>  #APP
>  # 241 "dsputil_mmx.c" 1
>         movq (%rdx), %mm0
>         movq (%rax), %mm2
>         movq %mm0, %mm1
>         movq %mm2, %mm3
>         punpcklbw %mm7, %mm0
>         punpckhbw %mm7, %mm1
>         punpcklbw %mm7, %mm2
>         punpckhbw %mm7, %mm3
>         psubw %mm2, %mm0
>         psubw %mm3, %mm1
>         movq %mm0, (%rdi, %r9)
>         movq %mm1, 8(%rdi, %r9)
>
>  # 0 "" 2
>         .loc 3 258 0
>  #NO_APP
>         addq    %rcx, %r8
>         .loc 3 259 0
>         addq    $16, %r9
>         jne     .L143
>  -------------
>
>  As you can see, gcc injects 2 unneeded lea instructions in the innermost loop.
>  And this is very simple asm; if you want, you can try it with some
>  complex code, but I recommend that you have a few bags for vomit ready ...
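
For reference, both versions of the MMX loop in the patch compute the same thing: for each of the 8 rows, 8 bytes from s1 and 8 bytes from s2 are widened to 16 bits and subtracted, and the 8 results go to the next 16 bytes of block. Below is a minimal scalar C sketch of that, assuming DCTELEM is a signed 16-bit type. `dctelem_t` and `diff_pixels_c` are hypothetical names used only for illustration, not FFmpeg's code.

```c
#include <stdint.h>

/* Hypothetical stand-in for FFmpeg's DCTELEM, assumed to be int16_t. */
typedef int16_t dctelem_t;

/* Scalar reference for one 8x8 block:
 *   block[8*y + x] = s1[y*stride + x] - s2[y*stride + x]
 * The MMX loop does the same per row. It zero-extends 8 bytes from each
 * source to 16-bit words (punpcklbw/punpckhbw against a zeroed mm7),
 * subtracts them with psubw, and stores 16 bytes of output. It steps a
 * byte offset from -128 up to 0 relative to block + 64. */
static void diff_pixels_c(dctelem_t *block, const uint8_t *s1,
                          const uint8_t *s2, long stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[x] = (dctelem_t)(s1[x] - s2[x]); /* range -255..255 fits */
        s1 += stride;
        s2 += stride;
        block += 8;
    }
}
```

Comparing its output against the asm version on the same buffers is a quick way to check a rewritten loop. In the gcc listing, %r8 appears to be one shared byte index advanced by the stride. That is why the two leaq instructions rebuild the s1 and s2 pointers on every iteration to feed the "r" operands. The original version avoided this by advancing s1 and s2 inside the asm through "+r" operands.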

If you can give an example based on complex asm, we can report a bug to
gcc. Just saying gcc is not a good optimizer does not help anyone. Do we
have another, better open source compiler? No. So if you have a better
example of bad asm being produced, we can ask the gcc developers.
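
A gcc report needs a small, self-contained test case. The sentence under dispute in optimization.txt is about intrinsics, so one starting point is the same loop written with the MMX intrinsics from <mmintrin.h>. This is a hypothetical sketch, not FFmpeg code: `diff_pixels_intrin` and `dctelem_t` are made-up names, and DCTELEM is assumed to be int16_t. It falls back to plain C when `__MMX__` is not defined. It makes no claim about what gcc generates. The check would be to compile it with `gcc -O2 -S` and compare the inner loop against the inline-asm listing.

```c
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical stand-in for FFmpeg's DCTELEM, assumed to be int16_t. */
typedef int16_t dctelem_t;

static void diff_pixels_intrin(dctelem_t *block, const uint8_t *s1,
                               const uint8_t *s2, long stride)
{
#if defined(__MMX__)
    const __m64 zero = _mm_setzero_si64();          /* like pxor mm7, mm7 */
    for (int y = 0; y < 8; y++) {
        __m64 a, b;
        memcpy(&a, s1, 8);                          /* 8 bytes of row y */
        memcpy(&b, s2, 8);
        /* Zero-extend the low/high 4 bytes to words, then subtract. */
        __m64 lo = _mm_sub_pi16(_mm_unpacklo_pi8(a, zero),
                                _mm_unpacklo_pi8(b, zero));
        __m64 hi = _mm_sub_pi16(_mm_unpackhi_pi8(a, zero),
                                _mm_unpackhi_pi8(b, zero));
        memcpy(block, &lo, 8);                      /* words 0..3 */
        memcpy(block + 4, &hi, 8);                  /* words 4..7 */
        s1 += stride;
        s2 += stride;
        block += 8;
    }
    _mm_empty();                                    /* emms before any FPU use */
#else
    /* Plain C fallback with the same result on non-MMX targets. */
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[x] = (dctelem_t)(s1[x] - s2[x]);
        s1 += stride;
        s2 += stride;
        block += 8;
    }
#endif
}
```

Keeping the intrinsics version and the inline-asm version in separate files makes it easy to diff the generated loops side by side.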

Thanks!

-- 
Never learn by your mistakes, if you do you may never dare to try again