[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Thu Aug 19 18:55:26 CEST 2010

Hi again,

On Thu, Aug 19, 2010 at 9:56 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Wed, Aug 18, 2010 at 6:44 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> On Wed, Aug 18, 2010 at 6:28 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>> On Wed, Aug 18, 2010 at 12:42:11PM -0400, Ronald S. Bultje wrote:
>>>> On Tue, Aug 17, 2010 at 1:35 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>> > On Tue, Aug 17, 2010 at 11:01:03AM -0400, Ronald S. Bultje wrote:
>>>> >> On Mon, Aug 16, 2010 at 6:40 PM, Jason Garrett-Glaser
>>>> >> <darkshikari at gmail.com> wrote:
>>>> >> > On Mon, Aug 16, 2010 at 3:35 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> On Wed, Aug 11, 2010 at 5:32 PM, Jason Garrett-Glaser
>>>> >> >> <darkshikari at gmail.com> wrote:
>>>> >> >>> 13. Use MPEG-2 MC for chroma MC, since we know that MVs are
>>>> >> >>> fullpel-only. Simplify edge emulation stuff accordingly too.
>>>> >> >>
>>>> >> >> Does h264 chroma subpel actually use a memcpy shortcut if it's
>>>> >> >> fullpel? I don't remember exactly, but I don't think it has such a
>>>> >> >> shortcut for chroma, only for luma.
>>>> >> >
>>>> >> > It doesn't. It should at least have a shortcut for the 0,0 motion
>>>> >> > vector because it's very high probability (relative to other fullpel
>>>> >> > motion vectors that result in no chroma interpolation). For other
>>>> >> > cases, it might or might not be worthwhile to add a branch in the asm
>>>> >> > to the 1D-only case.
>>>> >>
>>>> >> Attached sets up framework for that. The [0] functions can be copied
>>>> >> straight from VP8 (they are pixel_copy functions, with very fast
>>>> >> aligned implementations for all relevant archs) and others, and should
>>>> >> make VC-1, RV3/4, h264, H264/MPEG etc. significantly faster for the
>>>> >> MVxy==0 case. The [1]/[2] functions are probably going to be faster as
>>>> >> well but that would need some testing to see how big the effect is.
>>>> >> [3] is the function as-is now, which should obviously stay the way it
>>>> >> is.
>>>> >>
>>>> >> Michael, OK to apply this? It's mostly just changing all kind of files
>>>> >
>>>> > if it's not slower ...
>>>>
>>>> Same speed. Attached is an updated version that fixes a bug in one of
>>>> the fate samples where mx gets changed and thus we called the wrong
>>>> version.
>>>>
>>>> I've tested this version with a semi-finished patch that splits up the
>>>> h264 chroma MC functions (particularly the mc8 ones) into smaller
>>>> ones, thus having cleaner (and unbranched) handling of mx==0/my==0.
>>>> This will remove most (if not all) of the branching, which might give
>>>> a minor speedup, and also removes a little duplicate code (in the
>>>> binary, not source), e.g. the fullpel handling between
>>>> mmx/3dnow/mmx2/ssse3 rv40/h264/vc1 mc8 is identical (it's all
>>>> put_pixels8_mmx) and only needs a single function. I'm only doing this
>>>> for the C and x86 ones because I can't test any of the others.
>>>>
>>>> After that's done, I plan to do a third patch which will add fullpel
>>>> or 1D-filter versions for mc4/mc2 as well, which should actually
>>>> provide a speedup for code on our desktops, as we saw for Jason's
>>>> hackpatch.
>>>>
>>>> Ronald
>>>
>>>>  arm/dsputil_init_neon.c |   32 ++++++++++---
>>>>  cavs.c                  |   13 ++---
>>>>  dsputil.c               |   40 +++++++++++++---
>>>>  dsputil.h               |   12 ++--
>>>>  h264.c                  |   24 +++++----
>>>>  mpegvideo.c             |   28 ++++++-----
>>>>  ppc/h264_altivec.c      |   20 ++++++--
>>>>  rv34.c                  |    9 ++-
>>>>  rv40dsp.c               |   20 ++++++--
>>>>  sh4/dsputil_align.c     |   30 +++++++++---
>>>>  vc1dec.c                |   33 +++++++------
>>>>  vp6.c                   |    6 +-
>>>>  x86/dsputil_mmx.c       |  118 +++++++++++++++++++++++++++++++++++++-----------
>>>>  13 files changed, 272 insertions(+), 113 deletions(-)
>>>> 183027123a1213b2e037504a01d87c9c0678c1db  h264-chroma-mvzero-shortcut.patch
>>>
>>> no objections
>>
>> Attached are the follow-up patches, C-only for now (still working on the asm).
>>
>> Patch #1 splits the H264 function-creation macros into two, and
>> makes vc1_no_rnd use this macro instead of re-doing its own version of
>> it. Patch somehow thinks I changed mc2 into mc8, mc4 into mc2 and mc8
>> into mc4, rather than seeing I moved mc8 up from below, but the patch
>> should be readable nevertheless.
>>
>> Patch #2 then splits the C functions into 3: one each for x=0 or y=0,
>> and the remaining one for 2D bilinear filtering. It also adds one for
>> the case where x=0 AND y=0 (direct copy). Make fate has no objections.
>> There is no speed change for 1D/2D. The direct copy would be expected
>> to be faster but I didn't test because the C code isn't that relevant.
>> I can test if you prefer, but I'd rather focus on the asm functions
>> and make sure every change there is speed-tested. If you want, I can
>> move the adding of the direct copy functions to a separate patch, but
>> I didn't think that was necessary.
>>
>> I will do similar splits to the asm code
> [..]
>
> And these can be found in the attached patch. I've checked make fate for MMX,
> MMX2 and SSSE3 and all is identical. I will do some basic performance
> checks to make sure I didn't screw up anything, but speed should be
> identical except maybe for MMX avg_mc8 for x=0&&y=0, which is added by
> this patch (it was pretty much a one-liner). This is generally not
> used since MMX2/3DNOW versions are available also. If wanted, I can
> separate this or remove it.
>
> Next step is to actually implement new functions for 1D/no-filter
> mc4/mc2 which leads to the actually wanted speedup.

Example of such an optimization attached, so we can start applying
this whole thing (now that I'm showing an actual improvement in
performance :-) ).
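
For anyone reading this without the earlier patches at hand, the
four-way split discussed in this thread can be sketched roughly as
below. This is NOT the attached patch: the function names, the table
name put_mc4_tab and the index formula are made up for illustration
(the thread only says [0] is the copy, [1]/[2] are the 1D cases and
[3] is the existing 2D filter, so which of [1]/[2] is horizontal is an
assumption here). The filter itself is the standard H.264 chroma
bilinear weighting with 1/8-pel fractions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature, loosely modelled on a chroma MC function. */
typedef void (*chroma_mc_fn)(uint8_t *dst, const uint8_t *src,
                             int stride, int h, int mx, int my);

/* mx == 0 && my == 0: plain copy of a 4-pixel-wide block. */
static void put_mc4_copy(uint8_t *dst, const uint8_t *src,
                         int stride, int h, int mx, int my)
{
    (void)mx; (void)my;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        memcpy(dst, src, 4);
}

/* my == 0: horizontal-only (1D) filter. */
static void put_mc4_h(uint8_t *dst, const uint8_t *src,
                      int stride, int h, int mx, int my)
{
    (void)my;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 4; j++)
            dst[j] = ((8 - mx) * src[j] + mx * src[j + 1] + 4) >> 3;
}

/* mx == 0: vertical-only (1D) filter. */
static void put_mc4_v(uint8_t *dst, const uint8_t *src,
                      int stride, int h, int mx, int my)
{
    (void)mx;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 4; j++)
            dst[j] = ((8 - my) * src[j] + my * src[j + stride] + 4) >> 3;
}

/* General case: 2D bilinear filter, used for every (mx, my) before the split. */
static void put_mc4_2d(uint8_t *dst, const uint8_t *src,
                       int stride, int h, int mx, int my)
{
    const int A = (8 - mx) * (8 - my), B = mx * (8 - my);
    const int C = (8 - mx) * my,       D = mx * my;
    for (int i = 0; i < h; i++, dst += stride, src += stride)
        for (int j = 0; j < 4; j++)
            dst[j] = (A * src[j] + B * src[j + 1] +
                      C * src[j + stride] + D * src[j + stride + 1] + 32) >> 6;
}

/* Assumed layout: bit 0 set if mx != 0, bit 1 set if my != 0. */
static const chroma_mc_fn put_mc4_tab[4] = {
    put_mc4_copy, put_mc4_h, put_mc4_v, put_mc4_2d
};

static void put_mc4(uint8_t *dst, const uint8_t *src,
                    int stride, int h, int mx, int my)
{
    assert(mx >= 0 && mx < 8 && my >= 0 && my < 8);
    int idx = (mx != 0) | ((my != 0) << 1);
    put_mc4_tab[idx](dst, src, stride, h, mx, my);
}

/* Returns 1 if the dispatched result equals the 2D filter for all 64 (mx, my). */
static int mc4_dispatch_matches_2d(void)
{
    uint8_t src[9 * 8], a[8 * 8], b[8 * 8];
    for (int i = 0; i < (int)sizeof(src); i++)
        src[i] = (uint8_t)(i * 37 + 11);
    for (int my = 0; my < 8; my++)
        for (int mx = 0; mx < 8; mx++) {
            memset(a, 0, sizeof(a));
            memset(b, 0, sizeof(b));
            put_mc4(a, src, 8, 8, mx, my);
            put_mc4_2d(b, src, 8, 8, mx, my);
            if (memcmp(a, b, sizeof(a)))
                return 0;
        }
    return 1;
}
```

The 1D and copy variants give bit-identical output to the 2D filter
because the unused weights are zero and the remaining ones share a
factor of 8, so the only thing that changes is how much work is done
per pixel.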

START/STOP_TIMER around chroma_op[]() in h264.c, measuring only the
case where mx=0, my=0 and chroma_function_index=1 (local hack). CPU is
Intel Core i7 (Macbook Pro, OSX 10.6.4). GCC:
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664).
Sample: /Users/ronaldbultje/Movies/fate-suite/h264-conformance/MR3_TANDBERG_B.264

after:
1925 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
2075 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
2445 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
1903 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
1792 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
1609 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips

before (here it would use the 2D filter ssse3 code):
2990 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
2850 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
2917 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
2623 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
2505 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
2518 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips

C-only (the version after my patches are applied, so the 32-bit direct
read/write loop):
5230 dezicycles in w=4,mx=0,my=0, 2 runs, 0 skips
5215 dezicycles in w=4,mx=0,my=0, 4 runs, 0 skips
5755 dezicycles in w=4,mx=0,my=0, 8 runs, 0 skips
4255 dezicycles in w=4,mx=0,my=0, 16 runs, 0 skips
3819 dezicycles in w=4,mx=0,my=0, 32 runs, 0 skips
3772 dezicycles in w=4,mx=0,my=0, 64 runs, 0 skips
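
To put those numbers in perspective: the timer output is in tenths of
a cycle, so taking the 64-run lines, 1609 vs. 2518 dezicycles is
roughly 1.56x, or about 36% fewer cycles than the 2D ssse3 path, and
about 2.3x versus the C loop. A small helper for redoing that
arithmetic on other runs might look like this (the helper names are
made up, this is not part of any patch):

```c
#include <stdio.h>

/* START/STOP_TIMER prints tenths of a cycle ("dezicycles"). */
static double dezicycles_to_cycles(double dezi)
{
    return dezi / 10.0;
}

/* How many times faster "after" is than "before". */
static double speedup(double before, double after)
{
    return before / after;
}

/* Percentage of cycles saved going from "before" to "after". */
static double percent_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Print one comparison line; inputs are dezicycle counts from the timer output. */
static void print_comparison(const char *label, double before, double after)
{
    printf("%s: %.1f -> %.1f cycles, %.2fx, %.1f%% saved\n", label,
           dezicycles_to_cycles(before), dezicycles_to_cycles(after),
           speedup(before, after), percent_saved(before, after));
}
```

For example, print_comparison("mc4 mx=0,my=0", 2518, 1609) compares
the 64-run "before" and "after" lines above.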

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-mc4_x0_y0_simd.patch
Type: application/octet-stream
Size: 3183 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100819/67019b5b/attachment.obj>