[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Wed May 21 13:46:34 CEST 2008

Siarhei Siamashka wrote:

> Please add the following implementation of "pix_sum" function to your
> benchmark set and post the results. I strongly suspect that it is a lot 
> faster than any of your variants.

I've updated http://78.153.153.8/tmp/pix_sum.c and http://78.153.153.8/tmp/pix_sum.txt
(BTW, it might be offline for now due to some issues with my internet connection).

This is an extract from pix_sum.txt (PMUs are performance monitoring unit clock
cycles; [16], [32], etc. are the pix_sum line sizes):

...
pix_sum_iwmmxt2_last[16]: 4458 PMUs [32407]
pix_sum_iwmmxt2_last[32]: 8864 PMUs [32216]
pix_sum_iwmmxt2_last[64]: 13302 PMUs [32001]
pix_sum_iwmmxt2_last[128]: 17727 PMUs [34186]
pix_sum_iwmmxt2_last[256]: 22169 PMUs [34349]
pix_sum_iwmmxt2_last[512]: 26583 PMUs [35318]
pix_sum_iwmmxt2_last[1024]: 31030 PMUs [34941]
--
pix_sum_iwmmxt2_pipelined[16]: 4458 PMUs [32407]
pix_sum_iwmmxt2_pipelined[32]: 8899 PMUs [32216]
pix_sum_iwmmxt2_pipelined[64]: 13341 PMUs [32001]
pix_sum_iwmmxt2_pipelined[128]: 17780 PMUs [34186]
pix_sum_iwmmxt2_pipelined[256]: 22215 PMUs [34349]
pix_sum_iwmmxt2_pipelined[512]: 26652 PMUs [35318]
pix_sum_iwmmxt2_pipelined[1024]: 31090 PMUs [34941]
...

So, here is a table:

Mine   Your   My speedup
------------------------
4458   4458   0.0%
8899   8864   0.39%
13341  13302  0.29%
17780  17727  0.29%
22215  22169  0.2%
26652  26583  0.25%
31090  31030  0.19%

These 0.1-0.4% differences are marginal but stable: a few tens of runs give
approximately the same percentages, and your version was never faster.
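
The percentages in the table can be reproduced from the two cycle counts. The
sketch below assumes the figure is the cycle difference relative to the larger
(slower) count; dividing by the smaller count gives the same values at this
precision. The function name rel_diff_percent is hypothetical and is not part
of pix_sum.c.

```c
#include <stdio.h>

/* Relative difference, in percent, between a slower and a faster cycle count.
 * Assumption: the difference is normalised by the slower count. */
static double rel_diff_percent(unsigned slower, unsigned faster)
{
    if (slower == 0)
        return 0.0; /* avoid dividing by zero on an empty measurement */
    return 100.0 * (double)(slower - faster) / (double)slower;
}

/* Print one table row for a pair of measurements. */
static void print_row(unsigned slower, unsigned faster)
{
    printf("%-6u %-6u %.2f%%\n", slower, faster,
           rel_diff_percent(slower, faster));
}
```

Calling print_row(8899, 8864) prints the two counts followed by the relative
difference for the [32] line size.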

As for code size, both versions contain 68 instructions.

pix_sum_iwmmxt2_last() was:

#define LOAD(x,y) \
     "wldrd wr" #x ", [%1, %2]!\n\t" \
     "wldrd wr" #y ", [%1, #8] \n\t"

#define SUM4(x,y,z,t)               \
     LOAD(x,y) LOAD(z,t)             \
     "wsadb wr0, wr" #x ", wr5 \n\t" \
     "wsadb wr0, wr" #y ", wr5 \n\t" \
     "wsadb wr0, wr" #z ", wr5 \n\t" \
     "wsadb wr0, wr" #t ", wr5 \n\t"

int pix_sum_iwmmxt2_last(uint8_t *pix, int line_size)
{
     int s;

     asm volatile("wldrd wr1, [%1]           \n\t"
                  "wzero wr5                 \n\t"
                  "wldrd wr2, [%1, #8]       \n\t"
                  LOAD(3,4)
                  "wsadbz wr0, wr1, wr5      \n\t"
                  "wsadb wr0, wr2, wr5       \n\t"
                  "wsadb wr0, wr3, wr5       \n\t"
                  "wsadb wr0, wr4, wr5       \n\t"
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  SUM4(1, 2, 3, 4)
                  "textrmsw %0, wr0, #0      \n\t"
                  : "=r"(s), "+r"(pix)
                  : "r"(line_size));
     return s;
}
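
For readers without an iWMMXt toolchain, here is a plain C sketch of what the
routine above appears to compute: the sum of a 16x16 block of bytes whose rows
are line_size bytes apart. Each wsadb against the zeroed wr5 adds eight bytes
to the accumulator, two loads cover one 16-byte row, and the explicit loads
plus the seven SUM4 expansions cover 16 rows. The name pix_sum_ref is
hypothetical and is not part of pix_sum.c; it is meant only as a reference for
checking results, not as a benchmark candidate.

```c
#include <stdint.h>

/* Reference sum of a 16x16 block of 8-bit pixels.
 * pix       - pointer to the top-left pixel of the block
 * line_size - distance in bytes between the starts of consecutive rows
 * The maximum result is 16 * 16 * 255 = 65280, so an int is wide enough. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int sum = 0;

    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sum += pix[col];
        pix += line_size; /* step to the next row of the block */
    }
    return sum;
}
```

Comparing its return value with that of the assembly versions on the same
buffer is a quick way to confirm that an unrolled variant still covers all 16
rows.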

Dmitry