[FFmpeg-devel] Some IWMMXT functions for libavcodec

Mon May 12 19:50:21 CEST 2008

On Monday 12 May 2008, Dmitry Antipov wrote:
> here are some libavcodec DSP stuff I've developed for XScale CPU with Intel
> WMMX support.
>
> (At http://78.153.153.8/tmp/dspwmmx.c, there is also a small standalone
> validation & benchmark program for these functions).

Hi, just some comments regarding your code. Let's take the
'vsad_intra16_iwmmxt' function as an example:

int vsad_intra16_iwmmxt(void *c, uint8_t *pix, uint8_t *dummy, int stride, 
	int h)
{
    int s, i;

    for (s = 0, i = 1; i < h; i++) {
	asm volatile("wldrd wr0, [%1]       \n\t"
		     "wldrd wr1, [%2]       \n\t"
		     "wsadbz wr1, wr0, wr1  \n\t"
		     "wldrd wr0, [%1, #8]   \n\t"
		     "wldrd wr2, [%2, #8]   \n\t"
		     "wsadbz wr2, wr0, wr2  \n\t"
		     "waddw wr1, wr1, wr2   \n\t"
		     "textrmsw r1, wr1, #0  \n\t"
		     "add %0, %0, r1        \n\t"
		     : "+r"(s)
		     : "r"(pix), "r"(pix + stride)
		     : "r1");
	pix += stride;
    }
    return s;
}

This source compiles into the following code with gcc 4.2.1 and the
optimization options '-march=iwmmxt -mcpu=iwmmxt -O3 -fomit-frame-pointer':

0000879c <vsad_intra16_iwmmxt>:
    879c:	e92d4010 	push	{r4, lr}
    87a0:	e59d4008 	ldr	r4, [sp, #8]
    87a4:	e1a02001 	mov	r2, r1
    87a8:	e1a0e003 	mov	lr, r3
    87ac:	e3540001 	cmp	r4, #1	; 0x1
    87b0:	d3a00000 	movle	r0, #0	; 0x0
    87b4:	da00000f 	ble	87f8 <vsad_intra16_iwmmxt+0x5c>
    87b8:	e3a00000 	mov	r0, #0	; 0x0
    87bc:	e3a0c001 	mov	ip, #1	; 0x1

    87c0:	e08e3002 	add	r3, lr, r2
    87c4:	edd20100 	wldrd	wr0, [r2]
    87c8:	edd31100 	wldrd	wr1, [r3]
    87cc:	ee101121 	wsadbz	wr1, wr0, wr1
    87d0:	edd20102 	wldrd	wr0, [r2, #8]
    87d4:	edd32102 	wldrd	wr2, [r3, #8]
    87d8:	ee102122 	wsadbz	wr2, wr0, wr2
    87dc:	ee811182 	waddw	wr1, wr1, wr2
    87e0:	ee911078 	textrmsw	r1, wr1, #0
    87e4:	e0800001 	add	r0, r0, r1
    87e8:	e28cc001 	add	ip, ip, #1	; 0x1
    87ec:	e15c0004 	cmp	ip, r4
    87f0:	e1a02003 	mov	r2, r3
    87f4:	1afffff1 	bne	87c0 <vsad_intra16_iwmmxt+0x24>

    87f8:	e8bd8010 	pop	{r4, pc}

The loop overhead added by the compiler seems a bit excessive. You could
count the loop counter down to zero, which saves one instruction in the
inner loop (the separate 'cmp'). The 'textrmsw' also does not seem to be
needed on every loop iteration: you can accumulate the result in an IWMMXT
register and move it to an ARM register only at the very end of the
function, saving one more instruction in the inner loop. Finally, the Intel
manual recommends placing instructions that do not access memory after
'wldrd'; the pointer increment by stride would fit there quite conveniently,
or anything else. Back-to-back 'wldrd' instructions introduce a pipeline
stall.
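
To make the loop shape concrete, here is a rough model in portable C. It is
not IWMMXT code and I have not run it on XScale; it only mirrors the
structure suggested above (counter running down to zero, pointer arithmetic
placed between the loads, one accumulator that is read out once at the end).
The names 'sad8' and 'vsad_intra16_model' are made up for illustration, and
the unused 'c' and 'dummy' arguments are dropped for brevity:

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for one 'wsadbz': sum of absolute differences of 8 bytes. */
static unsigned sad8(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    int x;

    for (x = 0; x < 8; x++)
        sum += (unsigned)abs(a[x] - b[x]);
    return sum;
}

/* Hypothetical model of the restructured loop; it should return the same
 * value as the original function for the same pix, stride and h. */
int vsad_intra16_model(const uint8_t *pix, int stride, int h)
{
    unsigned acc = 0;   /* plays the role of an IWMMXT accumulator register */
    int n;

    /* Count the h - 1 row pairs down to zero, so the loop test is a
     * comparison against zero rather than against h. */
    for (n = h - 1; n > 0; n--) {
        /* Computing the next row pointer is the kind of non-memory work
         * that could be placed between the two loads. */
        const uint8_t *next = pix + stride;

        acc += sad8(pix, next);          /* left 8 pixels  */
        acc += sad8(pix + 8, next + 8);  /* right 8 pixels */
        pix = next;
    }

    /* Single move to the integer result, analogous to doing 'textrmsw'
     * once after the loop instead of on every iteration. */
    return (int)acc;
}
```

A function like this could also serve as a plain C reference to compare the
assembly version against, for example on a 16x3 block whose rows are all 0,
all 10 and all 7, where the expected sum is 16*10 + 16*3.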

The other functions may have similar performance problems too. It would
probably be better to implement the loops inside the inline assembly code
and not rely on the compiler too much.

With all that said, I don't have IWMMXT hardware myself and can't run any
benchmarks or even verify the code.

Do these optimizations improve performance much? Are there any benchmark
numbers? If I'm not mistaken, these functions are used for video encoding.
I'm just curious, is video encoding a practical task for XScale processors?

-- 
Best regards,
Siarhei Siamashka