[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2

Wed May 21 17:36:51 CEST 2008

On Wednesday 21 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > Please add the following implementation of "pix_sum" function to your
> > benchmark set and post the results. I strongly suspect that it is a lot
> > faster than any of your variants.
>
> I've updated http://78.153.153.8/tmp/pix_sum.c and
> http://78.153.153.8/tmp/pix_sum.txt (BTW, it might be offline for now due
> to some issues with my internet connection).
>
> This is an extract from pix_sum.txt (PMUs are performance monitoring unit
> clock cycles; [16], [32], etc. are the pix_sum line sizes):
>
> ...
> pix_sum_iwmmxt2_last[16]: 4458 PMUs [32407]
> pix_sum_iwmmxt2_last[32]: 8864 PMUs [32216]
> pix_sum_iwmmxt2_last[64]: 13302 PMUs [32001]
> pix_sum_iwmmxt2_last[128]: 17727 PMUs [34186]
> pix_sum_iwmmxt2_last[256]: 22169 PMUs [34349]
> pix_sum_iwmmxt2_last[512]: 26583 PMUs [35318]
> pix_sum_iwmmxt2_last[1024]: 31030 PMUs [34941]
> --
> pix_sum_iwmmxt2_pipelined[16]: 4458 PMUs [32407]
> pix_sum_iwmmxt2_pipelined[32]: 8899 PMUs [32216]
> pix_sum_iwmmxt2_pipelined[64]: 13341 PMUs [32001]
> pix_sum_iwmmxt2_pipelined[128]: 17780 PMUs [34186]
> pix_sum_iwmmxt2_pipelined[256]: 22215 PMUs [34349]
> pix_sum_iwmmxt2_pipelined[512]: 26652 PMUs [35318]
> pix_sum_iwmmxt2_pipelined[1024]: 31090 PMUs [34941]
> ...
>
> So, here is a table:
>
> Your   Mine   My speedup
> ------------------------
> 4458   4458   0.0%
> 8899   8864   0.39%
> 13341  13302  0.29%
> 17780  17727  0.29%
> 22215  22169  0.2%
> 26652  26583  0.25%
> 31090  31030  0.19%
>
> These 0.1-0.4% differences are marginal but stable: a few tens of runs
> give approximately the same percentages, and your version was never faster.
>
> As for code size, both versions contain 68 instructions.
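
For reference, the percentages in the table above can be reproduced from the raw cycle counts. This is only a minimal sketch; the helper name rel_diff_percent is hypothetical and is not part of pix_sum.c:

```c
/* Hypothetical helper, not taken from pix_sum.c: relative difference
 * between two cycle counts, expressed in percent of the faster one. */
static double rel_diff_percent(unsigned slower, unsigned faster)
{
    return 100.0 * ((double)slower - (double)faster) / (double)faster;
}
```

For example, rel_diff_percent(8899, 8864) is 100 * 35 / 8864, which is a little under 0.4%.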

Please also try the following variant. It should stay fast even with a
WLDRD latency of up to 5 cycles (good for WMMX1). I wonder how it would
compare against the previous version on your CPU.

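/* One SUM2() step: load the next two rows (%1 is advanced by %2 with
 * writeback) while accumulating the two rows loaded earlier into wr9.
 * In the steady state each wsadb uses a register that was loaded five
 * instructions earlier. */
#define SUM2()                  \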
    "wldrd wr1, [%1, %2]! \n\t" \
    "wsadb wr9, wr3, wr0  \n\t" \
    "wldrd wr2, [%1, #8]  \n\t" \
    "wsadb wr9, wr4, wr0  \n\t" \
    "wldrd wr3, [%1, %2]! \n\t" \
    "wsadb wr9, wr1, wr0  \n\t" \
    "wldrd wr4, [%1, #8]  \n\t" \
    "wsadb wr9, wr2, wr0  \n\t"

/* Sum of all pixels in a 16x16 block: 16 rows, two 8-byte loads per row. */
int pix_sum_iwmmxt2_deeper_pipelined(uint8_t *pix, int line_size)
{
    int s;
    asm volatile(
        "wldrd    wr1, [%1]           \n\t"
        "wldrd    wr2, [%1, #8]       \n\t"
        "wzero    wr0                 \n\t"
        "wldrd    wr3, [%1, %2]!      \n\t"
        "wsadbz   wr9, wr1, wr0       \n\t"
        "wldrd    wr4, [%1, #8]       \n\t"
        "wsadb    wr9, wr2, wr0       \n\t"
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        "wsadb    wr9, wr3, wr0       \n\t"
        "wsadb    wr9, wr4, wr0       \n\t"
        "textrmsw %0, wr9, #0         \n\t"
        : "=r"(s), "+r"(pix)
        : "r"(line_size));
    return s;
}
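
To check a variant like this one for correctness on the benchmark machine, it may help to compare it against plain C. The sketch below is not part of the patch: pix_sum_ref, sum8 and pix_sum_pipelined_model are hypothetical names. The second function only mirrors the order in which the asm above picks up and accumulates rows (prologue, seven SUM2() steps, epilogue); it says nothing about timing.

```c
#include <stdint.h>

/* Plain reference: sum of all pixels in a 16x16 block. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int s = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            s += pix[x];
        pix += line_size;
    }
    return s;
}

/* Sum of 8 consecutive bytes, i.e. what one wsadb against a zeroed
 * register adds to the accumulator. */
static int sum8(const uint8_t *p)
{
    int s = 0;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}

/* Model of the row ordering only: the next row is selected before the
 * previously selected row is accumulated, as in the asm. */
static int pix_sum_pipelined_model(const uint8_t *pix, int line_size)
{
    const uint8_t *a = pix;              /* row 0, stands for wr1/wr2 */
    const uint8_t *b = pix + line_size;  /* row 1, stands for wr3/wr4 */
    int s = sum8(a) + sum8(a + 8);       /* prologue: accumulate row 0 */

    for (int i = 0; i < 7; i++) {        /* seven SUM2() steps */
        a = b + line_size;               /* select the next row for wr1/wr2 */
        s += sum8(b) + sum8(b + 8);      /* accumulate the wr3/wr4 row */
        b = a + line_size;               /* select the next row for wr3/wr4 */
        s += sum8(a) + sum8(a + 8);      /* accumulate the wr1/wr2 row */
    }
    s += sum8(b) + sum8(b + 8);          /* epilogue: accumulate row 15 */
    return s;
}
```

Both functions should return the same value for any input, and the asm version can be compared against either of them on random data.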

-- 
Best regards,
Siarhei Siamashka