[FFmpeg-devel] Some IWMMXT functions for libavcodec
Siarhei Siamashka
siarhei.siamashka
Mon May 12 19:50:21 CEST 2008
On Monday 12 May 2008, Dmitry Antipov wrote:
> here are some libavcodec DSP stuff I've developed for XScale CPU with Intel
> WMMX support.
>
> (At http://78.153.153.8/tmp/dspwmmx.c, there is also a small standalone
> validation & benchmark program for these functions).
Hi, just some comments regarding your code. Let's take 'vsad_intra16_iwmmxt'
function as an example:
int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride,
                        int h)
{
    int s, i;

    for (s = 0, i = 1; i < h; i++) {
        asm volatile("wldrd wr0, [%1] \n\t"
                     "wldrd wr1, [%2] \n\t"
                     "wsadbz wr1, wr0, wr1 \n\t"
                     "wldrd wr0, [%1, #8] \n\t"
                     "wldrd wr2, [%2, #8] \n\t"
                     "wsadbz wr2, wr0, wr2 \n\t"
                     "waddw wr1, wr1, wr2 \n\t"
                     "textrmsw r1, wr1, #0 \n\t"
                     "add %0, %0, r1 \n\t"
                     : "+r"(s)
                     : "r"(pix), "r"(pix + stride)
                     : "r1");
        pix += stride;
    }
    return s;
}
This source translates into the following code using gcc 4.2.1 and
optimization options '-march=iwmmxt -mcpu=iwmmxt -O3 -fomit-frame-pointer':
0000879c <vsad_intra16_iwmmxt>:
    879c:  e92d4010   push     {r4, lr}
    87a0:  e59d4008   ldr      r4, [sp, #8]
    87a4:  e1a02001   mov      r2, r1
    87a8:  e1a0e003   mov      lr, r3
    87ac:  e3540001   cmp      r4, #1          ; 0x1
    87b0:  d3a00000   movle    r0, #0          ; 0x0
    87b4:  da00000f   ble      87f8 <vsad_intra16_iwmmxt+0x5c>
    87b8:  e3a00000   mov      r0, #0          ; 0x0
    87bc:  e3a0c001   mov      ip, #1          ; 0x1
    87c0:  e08e3002   add      r3, lr, r2
    87c4:  edd20100   wldrd    wr0, [r2]
    87c8:  edd31100   wldrd    wr1, [r3]
    87cc:  ee101121   wsadbz   wr1, wr0, wr1
    87d0:  edd20102   wldrd    wr0, [r2, #8]
    87d4:  edd32102   wldrd    wr2, [r3, #8]
    87d8:  ee102122   wsadbz   wr2, wr0, wr2
    87dc:  ee811182   waddw    wr1, wr1, wr2
    87e0:  ee911078   textrmsw r1, wr1, #0
    87e4:  e0800001   add      r0, r0, r1
    87e8:  e28cc001   add      ip, ip, #1      ; 0x1
    87ec:  e15c0004   cmp      ip, r4
    87f0:  e1a02003   mov      r2, r3
    87f4:  1afffff1   bne      87c0 <vsad_intra16_iwmmxt+0x24>
    87f8:  e8bd8010   pop      {r4, pc}
The loop overhead added by the compiler seems a bit excessive. You could
decrement the loop counter down to zero, saving one instruction in the
inner loop. Also, 'textrmsw' seems to be unneeded on each loop iteration:
you can accumulate the result in an IWMMXT register and move it to an ARM
register only at the very end of the function, saving one more instruction
in the inner loop. And the Intel manual recommends inserting non-memory
related instructions after 'wldrd'; you could quite conveniently insert
the pointer increment by stride there, or anything else. Back-to-back
'wldrd' instructions introduce a pipeline stall.
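To illustrate the first two suggestions, here is a minimal scalar C sketch
of the same computation with a loop counter that runs down to zero and a
single running total that is only returned at the end. It is a portable
model, not IWMMXT code, and it could also serve as a reference when
validating an assembly version. The name 'vsad_intra16_ref' and the
reduced argument list are my own assumptions, not part of the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical scalar reference (not from the original patch): sums the
 * absolute differences between each 16-pixel row and the row below it.
 */
static int vsad_intra16_ref(const uint8_t *pix, int stride, int h)
{
    int sum = 0;   /* running total, read out only once at the end */
    int rows;
    int x;

    /* Count the h - 1 adjacent row pairs down to zero. */
    for (rows = h - 1; rows > 0; rows--) {
        const uint8_t *next = pix + stride;

        for (x = 0; x < 16; x++)
            sum += abs(pix[x] - next[x]);

        pix = next;   /* advance to the next row */
    }
    return sum;
}
```

In an assembly version the running total would stay in an IWMMXT register
for the whole loop, and the pointer increment would be a natural candidate
for the slot after each 'wldrd'.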
The other functions may have similar performance problems too. It would
probably be better to implement the loops inside the inline assembly code
and not rely on the compiler too much.
With all that said, I don't have IWMMXT hardware myself and can't do any
benchmarks or even verification of the code.
Do these optimizations improve performance much? Are there any benchmark
numbers? If I'm not mistaken, these functions are used for video encoding.
I'm just curious: is video encoding a practical task for XScale processors?
--
Best regards,
Siarhei Siamashka