[FFmpeg-devel] [PATCH] Some IWMMXT functions for libavcodec #2
Siarhei Siamashka
siarhei.siamashka
Wed May 21 17:36:51 CEST 2008
On Wednesday 21 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > Please add the following implementation of "pix_sum" function to your
> > benchmark set and post the results. I strongly suspect that it is a lot
> > faster than any of your variants.
>
> I've updated http://78.153.153.8/tmp/pix_sum.c and
> http://78.153.153.8/tmp/pix_sum.txt (BTW, it might be offline for now due
> to some issues with my internet connection).
>
> This is an extract from pix_sum.txt (PMUs - performance monitoring unit
> clock cycles, [16], [32], etc. is the pix_sum line size):
>
> ...
> pix_sum_iwmmxt2_last[16]: 4458 PMUs [32407]
> pix_sum_iwmmxt2_last[32]: 8864 PMUs [32216]
> pix_sum_iwmmxt2_last[64]: 13302 PMUs [32001]
> pix_sum_iwmmxt2_last[128]: 17727 PMUs [34186]
> pix_sum_iwmmxt2_last[256]: 22169 PMUs [34349]
> pix_sum_iwmmxt2_last[512]: 26583 PMUs [35318]
> pix_sum_iwmmxt2_last[1024]: 31030 PMUs [34941]
> --
> pix_sum_iwmmxt2_pipelined[16]: 4458 PMUs [32407]
> pix_sum_iwmmxt2_pipelined[32]: 8899 PMUs [32216]
> pix_sum_iwmmxt2_pipelined[64]: 13341 PMUs [32001]
> pix_sum_iwmmxt2_pipelined[128]: 17780 PMUs [34186]
> pix_sum_iwmmxt2_pipelined[256]: 22215 PMUs [34349]
> pix_sum_iwmmxt2_pipelined[512]: 26652 PMUs [35318]
> pix_sum_iwmmxt2_pipelined[1024]: 31090 PMUs [34941]
> ...
>
> So, here is a table:
>
>   Mine    Your    My speedup
>   --------------------------
>    4458    4458   0.0%
>    8899    8864   0.39%
>   13341   13302   0.29%
>   17780   17727   0.29%
>   22215   22169   0.2%
>   26652   26583   0.25%
>   31090   31030   0.19%
>
> These 0.1-0.4% differences are marginal but stable: a few tens of runs
> give approximately the same percentages, and your version was never faster.
>
> As for code size, both versions contain 68 instructions.
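
For reference, the percentages in the quoted table appear to be the relative
difference between the "pipelined" and "last" cycle counts from pix_sum.txt.
A minimal sketch of that arithmetic follows; the helper name rel_diff_percent
is hypothetical, and dividing by the larger count is an assumption about how
the figures were derived:

```c
/* Relative cycle-count difference in percent.
 * Assumption: the difference is taken relative to the slower (larger) count. */
static double rel_diff_percent(unsigned slow_cycles, unsigned fast_cycles)
{
    return 100.0 * ((double)slow_cycles - (double)fast_cycles)
                 / (double)slow_cycles;
}
```

With the [32] row, rel_diff_percent(8899, 8864) comes out at roughly 0.39,
in line with the table.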
Please also try the following variant, it should be fast even for
WLDRD latency up to 5 (good for WMMX1). I wonder how it would compare
against the previous version on your CPU.

#define SUM2() \
    "wldrd wr1, [%1, %2]! \n\t" \
    "wsadb wr9, wr3, wr0 \n\t" \
    "wldrd wr2, [%1, #8] \n\t" \
    "wsadb wr9, wr4, wr0 \n\t" \
    "wldrd wr3, [%1, %2]! \n\t" \
    "wsadb wr9, wr1, wr0 \n\t" \
    "wldrd wr4, [%1, #8] \n\t" \
    "wsadb wr9, wr2, wr0 \n\t"

int pix_sum_iwmmxt2_deeper_pipelined(uint8_t *pix, int line_size)
{
    int s;
    asm volatile(
        "wldrd wr1, [%1] \n\t"
        "wldrd wr2, [%1, #8] \n\t"
        "wzero wr0 \n\t"
        "wldrd wr3, [%1, %2]! \n\t"
        "wsadbz wr9, wr1, wr0 \n\t"
        "wldrd wr4, [%1, #8] \n\t"
        "wsadb wr9, wr2, wr0 \n\t"
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        SUM2()
        "wsadb wr9, wr3, wr0 \n\t"
        "wsadb wr9, wr4, wr0 \n\t"
        "textrmsw %0, wr9, #0 \n\t"
        : "=r"(s), "+r"(pix)
        : "r"(line_size));
    return s;
}
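
When benchmarking variants like this, it helps to check the result against a
plain C version. The following is only a minimal sketch, assuming pix_sum
sums a 16x16 block of 8-bit pixels with line_size as the byte stride between
rows (which matches the assembly above: 16 rows, two 8-byte loads per row).
The name pix_sum_ref is made up for illustration.

```c
#include <stdint.h>

/* Scalar reference: sum of all pixels in a 16x16 block.
 * Assumption: line_size is the distance in bytes between row starts. */
static int pix_sum_ref(const uint8_t *pix, int line_size)
{
    int sum = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sum += pix[col];   /* add each of the 16 bytes in this row */
        pix += line_size;      /* step to the next row */
    }
    return sum;
}
```

Comparing its return value with the assembly version on the same buffer for
each tested line_size should catch any mistakes in the unrolled loads.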
--
Best regards,
Siarhei Siamashka