[FFmpeg-devel] [PATCH] some SIMD write-combining for h264

Sun Jan 17 13:26:52 CET 2010

Alexander Strange <astrange at ithinksw.com> writes:

> On Jan 16, 2010, at 9:59 PM, M?ns Rullg?rd wrote:
>
>> Alexander Strange <astrange at ithinksw.com> writes:
>> 
>>> On Jan 16, 2010, at 12:35 AM, Michael Niedermayer wrote:
>>> 
>>>> On Fri, Jan 15, 2010 at 11:11:23PM -0500, Alexander Strange wrote:
>>>>> This adds intreadwrite macros for 64/128-bit memory operations and uses them in h264.
>>>>> 
>>>>> Unlike the other macros, these assume correct alignment, and the patch only defines the ones there was an immediate use for.
>>>>> This only has x86 versions, but others should be easy. The 64-bit operations can be done with double copies on most systems, I guess.
>>>>> 
>>>>> Decoding a 30s file on Core 2 Merom with --cpu=core2 (minimum of 5 runs):
>>>>> x86-32: 12.72s before, 12.51s after (1.7%)
>>>>> x86-64: 10.24s before, 10.20s after (.4%)
>>>>> 
>>>>> Tested on x86-32, x86-64, x86-32 with --arch=c.
>>>> 
>>>> as your code uses MMX you need to at least mention EMMS/float issue in the
>>>> dox and probably a  emms_c(); call before draw_horiz_band()
>>>> dunno if these are all
>>> 
>>> Added in the comment.
>>> 
>>>> also what sets __MMX__ ? we have our own defines for that
>>> 
>>> It's a gcc builtin define, set based on ./configure --cpu=x adding
>>> -march.  HAVE_MMX is for the build and not the host cpu family, and
>>> this is inlined asm, so it can't use it.
>> 
>> Huh?  Host... build???
>
> Oh, that was supposed to be "target"...

What was?

> Anyway, this is MMX being used like the cmov/clz inlines, so it
> depends on the given --cpu and not on the build system's
> capabilities.

*ALL* settings in config.h are for the target.  Anything else would be
pointless.  Normally HAVE_MMX and __MMX__ should be the same.
However, with --disable-mmx, they are not.  In this case, the user
explicitly requested to disable all MMX code, which you are not doing
if you test for __MMX__.

>>>> and last
>>>> 1.7% makes 50->48.3 % left to CoreAVC if we assume we are 50% behind
>>> 
>>> Well, it turned out to work unusually well on that file (I had the
>>> idea when profiling it in the first place), closer to 1% on
>>> avchd-test-1.ts posted here a while ago.
>>> 
>>> 
>>> From c1d931e9846ca862935254726010e2d21737f5c5 Mon Sep 17 00:00:00 2001
>>> From: Alexander Strange <astrange at ithinksw.com>
>>> Date: Fri, 15 Jan 2010 01:47:47 -0500
>>> Subject: [PATCH 1/2] Add macros for write-combining optimization.
>>> 
>>> ---
>>> libavutil/intreadwrite.h     |   36 ++++++++++++++++-
>>> libavutil/x86/intreadwrite.h |   92 ++++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 127 insertions(+), 1 deletions(-)
>>> create mode 100644 libavutil/x86/intreadwrite.h
>>> 
>>> diff --git a/libavutil/intreadwrite.h b/libavutil/intreadwrite.h
>>> index 933732c..73d6000 100644
>>> --- a/libavutil/intreadwrite.h
>>> +++ b/libavutil/intreadwrite.h
>>> @@ -25,7 +25,7 @@
>>> 
>>> /*
>>>  * Arch-specific headers can provide any combination of
>>> - * AV_[RW][BLN](16|24|32|64) macros.  Preprocessor symbols must be
>>> + * AV_([RW]|COPY|SWAP)[BLN](16|24|32|64) macros.  Preprocessor symbols must be
>> 
>> That's both ugly and wrong.  Write  "AV_[RW][BLN](16|24|32|64) or
>> AV_(COPY|SWAP)(32|64|128)" instead, or whatever sizes make sense for
>> the second one.
>
> Quite true, done.
>
> Sizes smaller than 64 don't need optimizing like this, so I've left that out.

Oh, you're an optimist.

> They need special macros in the future, since all of these type-puns
> are strict aliasing violations.  Somehow it's not being miscompiled
> just yet though.

Yeah, we'll have to deal with that at some point.

> All comments below not replied to are done (tested with --cpu=586 and 686).
> By the way, AV_WN is missing parens around the definition.

Will fix.

>>> +#define AV_COPY128 AV_COPY128
>>> +static inline void AV_COPY128(void *d, const void *s)
>>> +{
>>> +    typedef struct {uint64_t i[2];} v;
>>> +
>>> +    __asm__("movaps   %1, %%xmm0  \n\t"
>>> +            "movaps   %%xmm0, %0  \n\t"
>>> +            : "=m"(*(v*)d)
>>> +            : "m" (*(const v*)s)
>>> +            : "xmm0");
>>> +}
>> 
>> I would have done something like this instead of the typedef:
>> 
>>    struct { uint64_t i[2]; } *vd = d, *vs = s;
>> 
>> Or maybe like this:
>> 
>>    uint64_t (*vd)[2] = d;
>>    uint64_t (*vs)[2] = s;
>> 
>> Examining asm is probably the best way of choosing.
>
> It doesn't affect it as long as it's the right size (in fact, even that doesn't seem to matter, char* gives 1 insn worse asm).
>
> I'd be OK with the first one, but s is const and d isn't, so it needs separate declarations, which look worse to me.
>
> As for the second, clang gives a strange warning and then breaks:
>     uint64_t (*vd)[2] = d;
>     const uint64_t (*vs)[2] = s;
>
> ./libavutil/x86/intreadwrite.h:67:31: warning: initializing 'void const *' discards qualifiers, expected 'uint64_t const (*)[2]'
>     const uint64_t (*vs)[2] = s;
>                               ^
> ./libavutil/x86/intreadwrite.h:72:20: error: cannot compile this unexpected cast lvalue yet
>             : "m" (*vs)

Weird.  I've no idea what that might mean.  I certainly don't see any
casts uses as lvalues.  Drop that idea then.

> Looks a bit nicer to me without the typedef, but I'd rather not
> spend time changing this around.

Patch looks OK to me.

-- 
M?ns Rullg?rd
mans at mansr.com