[FFmpeg-devel] Idea about speedup of startcode search
Michael Niedermayer
michaelni
Fri Feb 8 15:36:12 CET 2008
On Fri, Feb 08, 2008 at 02:35:11PM +0100, Thorsten Jordan wrote:
> Hello,
>
> here is an idea about tiny optimization i stumbled over when scanning
> for mpeg startcodes, that i want to share.
>
> I was working on some stream correction like ProjectX, and profiling on
> a weak architecture (VIA C3) showed that the start code scan in the
> GOP-analyse was the weak spot, eating over 50% of CPU time.
>
> This is typically a simple for() loop iterating over a buffer and
> checking for zero bytes. H.264 has the same with the ill "emulation
> prevention three byte" stuff for NALs, see libavcodec/h264.c line 1393
> (SVN of today).
>
> Here one searches for 00 00 03 xx patterns, for mpeg 1/2 or h264 you
> often look for 00 00 01 xx patterns or 00 00 00 01.
>
> This can be done in a simple C loop, but gcc does a bad job and uses up
> to 7 instructions per scanned byte. On Core 2 Duo measurement shows 2.8
> instructions per byte.
>
> This can be brought down to 0.8 cycles per byte (less than 2
> instructions per byte) with the following idea:
If gcc compiles it to 7 instrucions per scanned byte that is a bug in gcc
which should be reported!
As it can easily do it with 3 instructions (or less if unrolled further),
that is:
xor %%eax, %%eax
1:
cmpb %%al, (%%ebx, %%ecx)
jz blah
cmpb %%al, 2(%%ebx, %%ecx)
jz blah2
add $2, %%ebx
jnc 1
Also dont forget that per scanned byte really is every 2nd byte as not
every is scanned.
>
> all startcodes mentioned above have two consecutive zero bytes. To
> filter them out, load 8 bytes to a mmx register and check 4x2 bytes for
> equality with zero, by using packed compare, packing to 4x1 bytes,
> or-ing and testing. Do this for 8 bytes at address x and x+1, until
> there are any two consecutive zero bytes found, then fine-check with c-code.
>
> It is worth only for large data chunks with rather rare startcodes, but
> this is mostly the case. Every byte of a h.264 stream must be piped
> through the "emulation prevention 3 byte" checker.
>
> Gain is however maybe too small to do it, at 20mbit with h264 that would
> be 2,38mb/sek to parse, so saving ca. 5 million cpu cycles - only a tiny
> fragment of a 2ghz cpu. But everything counts...
>
> anyway it may not be an important idea, but if anyone wants to try it
> out, here is some test code that i have written and declare as free to
> use (PD).
[...]
> uint32_t flag = 0; //dummy
> for ( ; buf < bufend_mmx; buf += 8) {
> asm volatile("1: \n\t"
> "movq (%0), %%mm0 \n\t"
> "movq 1(%0), %%mm1 \n\t"
> "pcmpeqw %%mm2, %%mm0 \n\t"
> "pcmpeqw %%mm2, %%mm1 \n\t"
> "packsswb %%mm0, %%mm0 \n\t"
> "packsswb %%mm1, %%mm1 \n\t"
> "por %%mm1, %%mm0 \n\t"
movq (%0), %%mm0
por 1(%0), %%mm0
pcmpeqb %%mm2, %%mm0
packsswb %%mm0, %%mm0
and this is not a patch ...
> "movd %%mm0, %1 \n\t"
> "testl %1, %1 \n\t"
> "jne 2f \n\t"
> "addl $8, %0 \n\t"
> "cmpl %2, %0 \n\t"
> "jl 1b \n\t"
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
The misfortune of the wise is better than the prosperity of the fool.
-- Epicurus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080208/f3ccfcd7/attachment.pgp>
More information about the ffmpeg-devel
mailing list