[FFmpeg-devel] [PATCHES] lpc_mmx: merge some asm blocks and add xmm registers to clobber list

Sun Oct 31 13:39:10 CET 2010

Hi,

On Sat, Oct 30, 2010 at 3:46 PM, Ramiro Polla <ramiro.polla at gmail.com> wrote:
> $subj, split in 2 patches
[..]
> --- a/libavcodec/x86/lpc_mmx.c
> +++ b/libavcodec/x86/lpc_mmx.c
> @@ -29,16 +29,15 @@ static void apply_welch_window_sse2(const int32_t *data, int len, double *w_data
>      x86_reg i = -n2*sizeof(int32_t);
>      x86_reg j =  n2*sizeof(int32_t);
>      __asm__ volatile(
> -        "movsd   %0,     %%xmm7                \n\t"
> +        "movsd   %4,     %%xmm7                \n\t"
>          "movapd  "MANGLE(ff_pd_1)", %%xmm6     \n\t"
>          "movapd  "MANGLE(ff_pd_2)", %%xmm5     \n\t"
>          "movlhps %%xmm7, %%xmm7                \n\t"
>          "subpd   %%xmm5, %%xmm7                \n\t"
>          "addsd   %%xmm6, %%xmm7                \n\t"
> -        ::"m"(c)
> -    );
> +        "bt      $1,     %5                    \n\t"
> +        "jne     2f                            \n\t"
>  #define WELCH(MOVPD, offset)\
> -    __asm__ volatile(\
>          "1:                                    \n\t"\
>          "movapd   %%xmm7,  %%xmm1              \n\t"\
>          "mulpd    %%xmm1,  %%xmm1              \n\t"\
> @@ -55,13 +54,15 @@ static void apply_welch_window_sse2(const int32_t *data, int len, double *w_data
>          "sub      $8,      %1                  \n\t"\
>          "add      $8,      %0                  \n\t"\
>          "jl 1b                                 \n\t"\
> -        :"+&r"(i), "+&r"(j)\
> -        :"r"(w_data+n2), "r"(data+n2)\
> -    );
> -    if(len&1)
> +
>          WELCH("movupd", -1)
> -    else
> +        "jmp 3f                                \n\t"
> +        "2:                                    \n\t"
>          WELCH("movapd", -2)
> +        "3:                                    \n\t"
> +        :"+&r"(i), "+&r"(j)
> +        :"r"(w_data+n2), "r"(data+n2), "m"(c), "r"(len)
> +    );
>  #undef WELCH
>  }

<nag> What I don't like here is that you're changing it in a hardcoded
(conditional) jump, like this:

test bla
jne else
; stuff code
jmp endif
.else
; stuff2 code
.endif
ret

Which means that both branches will always have a jmp. In the C case
(at least with a good gcc version), if you check the disassembly, gcc
will generate code that does this:

test bla
jne else
; stuff code
ret
.else
; stuff2 code
ret

And as you can see, the first half is actually jmp-less. I've tested
this and it's quite a bit faster, just the lack of the jmp in one half
of the runtime. Combined with stuff like predicting which of the two
is most likely to happen in the statistical sense, this can have a
significant impact on performance. In yasm, you'd simply code a RET at
the end of each branch. Can we do something similar in inline asm?

Ronald