[FFmpeg-devel] [PATCH] More H.264 decoding speed tweaks
Loren Merritt
lorenm
Mon Jun 23 21:48:05 CEST 2008
On Sun, 22 Jun 2008, Jason Garrett-Glaser wrote:
> Odd that my benchmarks were otherwise; perhaps it's more
> source-dependent than I thought?
Here are 16 more movies or clips thereof, again 10 runs each, on a Core 2 E6600.
  704x400   706kbit  svn:  78.61 +/- 0.12  patched:  82.03 +/- 0.11  (+4.35% +/- 0.21%)
  704x400   771kbit  svn:  61.56 +/- 0.12  patched:  64.64 +/- 0.11  (+5.01% +/- 0.27%)
 1280x720  1170kbit  svn: 196.65 +/- 0.27  patched: 208.57 +/- 0.40  (+6.07% +/- 0.25%)
 1280x720  1604kbit  svn: 207.14 +/- 0.18  patched: 219.12 +/- 0.25  (+5.79% +/- 0.15%)
  704x480  1730kbit  svn: 124.98 +/- 0.17  patched: 127.23 +/- 0.20  (+1.80% +/- 0.21%)
 1280x720  3650kbit  svn:  75.65 +/- 0.07  patched:  78.17 +/- 0.09  (+3.33% +/- 0.15%)
1920x1080  3658kbit  svn:  20.48 +/- 0.04  patched:  21.52 +/- 0.04  (+5.03% +/- 0.30%)
 1280x528  4434kbit  svn:  39.92 +/- 0.05  patched:  40.70 +/- 0.06  (+1.96% +/- 0.20%)
 1280x544  6399kbit  svn:  25.65 +/- 0.04  patched:  25.96 +/- 0.05  (+1.21% +/- 0.26%)
 1280x534  6868kbit  svn: 230.24 +/- 0.36  patched: 234.44 +/- 0.26  (+1.82% +/- 0.19%)
 1280x536  6964kbit  svn:  29.60 +/- 0.05  patched:  29.94 +/- 0.05  (+1.15% +/- 0.22%)
 1920x784  7052kbit  svn:  29.46 +/- 0.05  patched:  30.13 +/- 0.04  (+2.30% +/- 0.21%)
1920x1040  7352kbit  svn: 197.70 +/- 0.24  patched: 202.06 +/- 0.24  (+2.20% +/- 0.17%)
1920x1040  7457kbit  svn:  44.82 +/- 0.09  patched:  46.53 +/- 0.05  (+3.81% +/- 0.23%)
 1280x536  7534kbit  svn:  16.26 +/- 0.03  patched:  16.40 +/- 0.02  (+0.85% +/- 0.22%)
 1280x536  7670kbit  svn:  62.17 +/- 0.04  patched:  63.59 +/- 0.06  (+2.29% +/- 0.12%)
... so yes, the amount is source-dependent, but I failed to find any clip where
dc_add wasn't a gain.
> One question, Loren, while we're on this topic; how should your SSE2
> iDCT4x4 be implemented when we start bringing x264's nasm asm to
> ffmpeg? Should we check the DC coefficient for each 8x4 block and
> make an 8x4 dc_add, or what?
Yes. The question is how to modify the C so that it can use 8x4 without
losing speed on non-sse2 cpus.
(a) Duplicate the loops and put a cpu check in the C.
(b) Define an 8x4 mmx idct which just calls two 4x4 mmx idcts. Check 2
blocks at a time, and call either 8x4 dc, 8x4 idct, or one 4x4 of each. dc
uses packed bytes, so it can handle widths up to 8 in mmx at no speed loss,
which might make up for the extra branch.
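
Option (b) could look roughly like the scalar sketch below. Every name here is hypothetical and none of it is FFmpeg's actual API. It classifies blocks by scanning their coefficients, which is a simplification for illustration (a real decoder would more likely use non-zero counts it already tracks). Sending an empty block together with a DC-only block down the 8x4 dc path is also an assumption: adding a zero DC is harmless, but whether it is the faster choice would need measuring.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: which routine to call for a horizontal pair of 4x4 blocks. */
enum pair_path { PAIR_SKIP, PAIR_DC_8X4, PAIR_IDCT_8X4, PAIR_MIXED_4X4 };

/* Classify 16 coefficients: 0 = all zero, 1 = DC only, 2 = has AC. */
static int block_kind(const int16_t *coef)
{
    int ac = 0;
    for (int i = 1; i < 16; i++)
        ac |= coef[i];
    if (ac)
        return 2;
    return coef[0] ? 1 : 0;
}

/* Check two blocks at a time and pick one of the paths from option (b). */
static enum pair_path choose_pair_path(const int16_t *left, const int16_t *right)
{
    int l = block_kind(left), r = block_kind(right);

    if (l == 0 && r == 0)
        return PAIR_SKIP;        /* nothing to add */
    if (l == 2 && r == 2)
        return PAIR_IDCT_8X4;    /* both need the full transform */
    if (l <= 1 && r <= 1)
        return PAIR_DC_8X4;      /* DC only (or empty) on both sides */
    return PAIR_MIXED_4X4;       /* fall back to per-block 4x4 calls */
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Scalar reference for an 8x4 dc add: the left four columns get the left
 * block's rounded DC, the right four get the right block's. Relies on
 * arithmetic right shift for negative values, as most compilers provide.
 */
static void dc_add_8x4(uint8_t *dst, int stride, int dc_left, int dc_right)
{
    int dl = (dc_left + 32) >> 6;
    int dr = (dc_right + 32) >> 6;

    for (int y = 0; y < 4; y++, dst += stride) {
        for (int x = 0; x < 4; x++) {
            dst[x]     = clip_u8(dst[x] + dl);
            dst[x + 4] = clip_u8(dst[x + 4] + dr);
        }
    }
}
```

The extra branch mentioned above is the classification in `choose_pair_path`. The bet is that covering eight pixels per row in one dc call pays for it.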
--Loren Merritt