[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Dmitry Antipov
dmantipov
Fri May 16 16:56:37 CEST 2008
Michael Niedermayer wrote:
> [...]
>> +static int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride, int h)
>> +{
>> + int s;
>> +
>> + asm volatile("mov r1, %3 \n\t"
>> + "wzero wr0 \n\t"
>> + "1: wldrd wr1, [%1] \n\t"
>> + "wldrd wr2, [%1, #8] \n\t"
>> + "add %1, %1, %2 \n\t"
>> + "wldrd wr3, [%1] \n\t"
>> + "wldrd wr4, [%1, #8] \n\t"
>> + "wsadbz wr1, wr1, wr3 \n\t"
>> + "wsadbz wr2, wr2, wr4 \n\t"
>> + "waddw wr0, wr0, wr1 \n\t"
>> + "waddw wr0, wr0, wr2 \n\t"
>> + "subs r1, r1, #1 \n\t"
>> + "bne 1b \n\t"
>
> half of the loads in there are redundant, this also applies to a few
> other functions
Why? Unlike x86, WMMX can't do SIMD operations between a register and memory: all
operands must be loaded into registers first. This means that WMMX code will always
issue more load instructions than equivalent MMX/SSE code.
For example, x = x + y on vectors of 8 bytes takes 1 load and 1 store on x86 with MMX,
because paddb can take its second operand directly from memory:
asm volatile("movq (%1), %%mm0\n\t"
"paddb (%0), %%mm0\n\t"
"movq %%mm0, (%0)\n\t"
: : "r"(x), "r"(y));
For WMMX, you can't do it without at least 2 loads and 1 store:
asm volatile ("wldrd wr0, [%0]\n\t"
"wldrd wr1, [%1]\n\t"
"waddb wr0, wr0, wr1\n\t"
"wstrd wr0, [%0]\n\t"
: : "r"(x), "r"(y));
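The load/store pattern of the WMMX version can also be written out in portable C. This is only a sketch for illustration: the type vec8x8 and the helpers load8, store8, add8 and add_bytes_load_store are hypothetical names that are not part of the patch or of libavcodec, and a struct of 8 bytes stands in for a 64-bit wr register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 64-bit SIMD register holding 8 bytes. */
typedef struct { uint8_t b[8]; } vec8x8;

/* Models wldrd: copy 8 bytes from memory into a "register". */
static vec8x8 load8(const uint8_t *p)
{
    vec8x8 v;
    memcpy(v.b, p, sizeof v.b);
    return v;
}

/* Models wstrd: copy a "register" back to memory. */
static void store8(uint8_t *p, vec8x8 v)
{
    memcpy(p, v.b, sizeof v.b);
}

/* Models waddb: lane-wise 8-bit add with wraparound, registers only. */
static vec8x8 add8(vec8x8 a, vec8x8 b)
{
    for (int i = 0; i < 8; i++)
        a.b[i] = (uint8_t)(a.b[i] + b.b[i]);
    return a;
}

/* x = x + y: both operands must be in registers, so 2 loads and 1 store. */
static void add_bytes_load_store(uint8_t *x, const uint8_t *y)
{
    vec8x8 r0 = load8(x);   /* load 1 */
    vec8x8 r1 = load8(y);   /* load 2 */
    r0 = add8(r0, r1);      /* arithmetic never touches memory */
    store8(x, r0);          /* store */
}
```

Calling add_bytes_load_store(x, y) on two 8-byte buffers updates x in place, with each byte wrapping modulo 256 as waddb and paddb do.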
Am I missing something?
Dmitry