[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Fri May 16 16:56:37 CEST 2008

Michael Niedermayer wrote:

> [...]
>> +static int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride, int h)
>> +{
>> +    int s;
>> +
>> +    asm volatile("mov r1, %3            \n\t"
>> +                 "wzero wr0             \n\t"
>> +                 "1: wldrd wr1, [%1]    \n\t"
>> +                 "wldrd wr2, [%1, #8]   \n\t"
>> +                 "add %1, %1, %2        \n\t"
>> +                 "wldrd wr3, [%1]       \n\t"
>> +                 "wldrd wr4, [%1, #8]   \n\t"
>> +                 "wsadbz wr1, wr1, wr3  \n\t"
>> +                 "wsadbz wr2, wr2, wr4  \n\t"
>> +                 "waddw wr0, wr0, wr1   \n\t"
>> +                 "waddw wr0, wr0, wr2   \n\t"
>> +                 "subs r1, r1, #1       \n\t"
>> +                 "bne 1b                \n\t"
> 
> half of the loads in there are redundant, this also applies to a few
> other functions

Why? Unlike on x86, WMMX SIMD instructions cannot take a memory operand: all data
must be loaded into registers first. This means that WMMX code will always issue more
load instructions than the equivalent MMX/SSE code.

For example, x = x + y on 8x8-bit vectors needs only 1 load instruction and 1 store on
x86 with MMX, because paddb can read its second operand straight from memory:

asm volatile("movq (%1), %%mm0\n\t"   /* load y */
             "paddb (%0), %%mm0\n\t"  /* add x straight from memory */
             "movq %%mm0, (%0)\n\t"   /* store the result to x */
             : : "r"(x), "r"(y) : "memory");

For WMMX, you can't do it without at least 2 loads and 1 store:

asm volatile("wldrd wr0, [%0]\n\t"      /* load x */
             "wldrd wr1, [%1]\n\t"      /* load y */
             "waddb wr0, wr0, wr1\n\t"  /* x + y, bytewise */
             "wstrd wr0, [%0]\n\t"      /* store the result to x */
             : : "r"(x), "r"(y) : "memory");
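
For reference, this is what both snippets are meant to compute, written as a plain C
sketch. The name add_bytes8 is made up for illustration and is not part of the patch;
it assumes x and y each point to at least 8 readable bytes, and that the bytewise add
wraps modulo 256, as paddb and waddb do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: x[i] += y[i] for 8 bytes.
 * Each sum is truncated to 8 bits, i.e. it wraps modulo 256 (no saturation). */
static void add_bytes8(uint8_t *x, const uint8_t *y)
{
    for (size_t i = 0; i < 8; i++)
        x[i] = (uint8_t)(x[i] + y[i]);
}
```

Either asm version should give the same result as add_bytes8(x, y) on the same input,
so a helper like this could serve as a check when testing the SIMD code.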

Am I missing something?

Dmitry