[PATCH] New rgb32tobgr32 (was: Re: [Ffmpeg-devel] [PATCH] have cs_test check for sigsegv at smaller widths and sigill)
Michael Niedermayer
michaelni
Sat Apr 14 02:14:46 CEST 2007
On Fri, Apr 13, 2007 at 10:40:12PM +0200, Ivo wrote:
> Hi,
>
> On Friday 13 April 2007 19:19, Ivo wrote:
> > It's even worse. The change for rgb32tobgr32 is not ok as it doesn't have
> > any fallback C code if the HAVE_MMX block is compiled. So, with my
> > change, there will be no conversion done if the incoming image is too
> > small (1 pixel wide :) ). I'll brew up another patch that rewrites the
> > whole function from scratch and also fix two other scalers (rgb32to16 and
> > rgb32to15) that do not segfault, but do run the MMX code even if src_size
> > is smaller than the size of the units it processes.
>
> Okay, let's do one at a time. Here's a new rgb32tobgr32.
>
> Old C code:
> 71005170 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68674000 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68812770 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 70973970 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 80106370 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 72288570 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 70083870 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 69491770 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 71102770 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68530510 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> Avg: 71106977
>
> New C code:
> 67884120 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67676920 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67062970 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66117070 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 69446870 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67218120 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67959520 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68862130 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67234370 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66610970 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> Avg: 67607306
>
> Old MMX code:
> 65658150 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66108690 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68708460 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 65514390 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 65134590 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 76013820 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 69046210 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68236350 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 65870740 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 70115250 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> Avg: 68040665
>
> New MMX code:
> 67142180 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66278990 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 65370130 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 68851770 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66163960 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66671710 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 67618260 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 66444230 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 75597570 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> 64721560 dezicycles in rgb32tobgr32, 1 runs, 0 skips
> Avg: 67486036
>
> My CPU is an AMD Sempron 2400+.
[...]
> + __asm __volatile(
> + " "PREFETCH" (%1) \n"
> + " movq %3, %%mm7 \n"
> + " pxor %4, %%mm7 \n"
> + " pxor %5, %%mm7 \n"
> + " movq %%mm7, %%mm6 \n"
this is senseless, rather use the register for something useful
like avoiding reading %3 twice in the loop from memory
> + " jmp 2f \n"
> + ASMALIGN(4)
> + "1: \n"
> + " "PREFETCH" 32(%1) \n"
> + " movq (%1), %%mm0 \n"
> + " movq 8(%1), %%mm1 \n"
> + " movq %%mm0, %%mm2 \n"
> + " movq %%mm1, %%mm4 \n"
> + " movq %%mm0, %%mm3 \n"
> + " movq %%mm1, %%mm5 \n"
> + " pand %3, %%mm2 \n"
> + " pand %4, %%mm3 \n"
> + " pand %3, %%mm4 \n"
> + " pand %4, %%mm5 \n"
> + " pslld $16, %%mm2 \n"
> + " psrld $16, %%mm3 \n"
> + " pslld $16, %%mm4 \n"
> + " psrld $16, %%mm5 \n"
> + " pand %%mm6, %%mm0 \n"
> + " pand %%mm7, %%mm1 \n"
> + " por %%mm2, %%mm0 \n"
> + " por %%mm4, %%mm1 \n"
> + " por %%mm3, %%mm0 \n"
> + " por %%mm5, %%mm1 \n"
> + " "MOVNTQ" %%mm0, (%0) \n"
> + " "MOVNTQ" %%mm1, 8(%0) \n"
> + " add $16, %0 \n"
> + " add $16, %1 \n"
> + "2: \n"
> + " cmp %1, %2 \n"
> + " ja 1b \n"
> + " "SFENCE" \n"
> + " "EMMS" \n"
> + : "+r"(d), "+r"(s)
> + : "r" (end-15), "m" (mask32b), "m" (mask32r), "m" (mmx_one)
> + : "memory");
> #endif
> - }
> -#endif
> + for (; s<end; s+=4, d+=4) {
> + int v = *(uint32_t *)s;
> + int r = v & 0xff, g = (v>>8) & 0xff, b = (v>>16) & 0xff;
> + *(uint32_t *)d = b + (g<<8) + (r<<16);
int v = *(uint32_t *)s;
int g = v&0xFF00;
v &= 0xFF00FF;
*(uint32_t *)d = (v>>16) + (v<<16) + g;
2 shifts less
1 and less
the same trick can be done with the mmx code to avoid one pand
also all the shifts and register-register movq can be replaced
by a pshufw on mmx2
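
To make the comparison concrete, here is a minimal C sketch of both variants side by side. The function names (swap_rb_simple, swap_rb_masked, swap_rb_buffer) are made up for illustration and are not from the patch, and unsigned types are used instead of int so that the left shift stays well defined. Like both versions above, it leaves the top (alpha) byte zero.

```c
#include <stdint.h>
#include <stddef.h>

/* Variant from the patch: extract each channel, then reassemble.
 * Uses 4 shifts and 3 ands per pixel. */
static uint32_t swap_rb_simple(uint32_t v)
{
    uint32_t r = v & 0xff, g = (v >> 8) & 0xff, b = (v >> 16) & 0xff;
    return b + (g << 8) + (r << 16);
}

/* Masked variant: green stays in place, bytes 0 and 2 are moved
 * together as one masked value. Uses 2 shifts and 2 ands per pixel. */
static uint32_t swap_rb_masked(uint32_t v)
{
    uint32_t g = v & 0xFF00;   /* green, already in its final position */
    v &= 0xFF00FF;             /* keep only bytes 0 and 2 */
    /* (v >> 16) brings byte 2 down to byte 0; (v << 16) brings byte 0
     * up to byte 2, and the old byte 2 is shifted out of the 32 bits. */
    return (v >> 16) + (v << 16) + g;
}

/* Hypothetical helper: convert n_pixels packed 32-bit pixels. */
static void swap_rb_buffer(const uint32_t *src, uint32_t *dst, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++)
        dst[i] = swap_rb_masked(src[i]);
}
```

After the 0xFF00FF mask the two remaining channels sit in separate 16-bit halves of the pixel, so exchanging them amounts to a 16-bit word swap. That is why a word shuffle such as pshufw could stand in for the shift pair in the MMX2 path.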
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
The misfortune of the wise is better than the prosperity of the fool.
-- Epicurus