[FFmpeg-devel] [PATCH] libavutil: add x86 optimized av_popcount

James Almer jamrial at gmail.com
Wed Feb 25 18:52:25 CET 2015


On 25/02/15 2:36 PM, Ronald S. Bultje wrote:
> Hi James,
> 
> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamrial at gmail.com> wrote:
> 
>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote:
>>> Hi,
>>>
>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamrial at gmail.com> wrote:
>>>>
>>>> +#if HAVE_FAST_POPCNT
>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5)
>>>> +#ifndef av_popcount
>>>> +    #define av_popcount   __builtin_popcount
>>>> +#endif /* av_popcount */
>>>> +#if HAVE_FAST_64BIT
>>>> +#ifndef av_popcount64
>>>> +    #define av_popcount64 __builtin_popcountll
>>>> +#endif /* av_popcount64 */
>>>> +#endif /* HAVE_FAST_64BIT */
>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */
>>>> +#endif /* HAVE_FAST_POPCNT */
>>>>
>>>
>>> Is this just to get the sse4 popcnt instruction if we compile with
>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet
>>> configure still does an arch/cpu check. I'd expect the built-in/compiler
>> to
>>> do that for us based on -mcpu, and we could always unconditionally use
>> this
>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and then
>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a
>> little
>>> odd. I have no objection to it, patch is still fine, just odd.
>>>
>>> Ronald
>>
>> I purposely made the checks for gcc 4.5 and in configure for cpus that
>> support popcnt
>> because otherwise __builtin_popcount (at least gcc's) is slower than our
>> generic
>> av_popcount_c function from lavu/common.h.
>> When the CPU supports popcnt the builtin becomes a single inlined
>> instruction.
>>
>> I tried the __asm__ approach, but the code generated by the builtin seemed
>> better.
> 
> 
> That's interesting, can you show the code you tried?
> 
> Ronald

If instead of builtins i use

+#if HAVE_INLINE_ASM
+
+#ifdef __POPCNT__
+
+#define av_popcount av_popcount_abm
+static av_always_inline av_const int av_popcount_abm(uint32_t a)
+{
+    int x;
+
+    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
+    return x;
+}
+
+#if ARCH_X86_64
+#define av_popcount64 av_popcount64_abm
+static av_always_inline av_const int av_popcount64_abm(uint64_t a)
+{
+    uint64_t x;
+
+    __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a));
+    return x;
+}
+#endif
+
+#endif /* __POPCNT__ */
+
+#endif /* HAVE_INLINE_ASM */

in the x86/ header from the second version i sent, on x86_32 av_cpu_count from 
libavutil/cpu.o on Windows (Will not work with other OSes because of 
HAVE_GETPROCESSAFFINITYMASK) generates

 1af:	f3 0f b8 5c 24 18    	popcnt ebx,DWORD PTR [esp+0x18]
 1b5:	31 c0                	xor    eax,eax
 1b7:	f3 0f b8 c0          	popcnt eax,eax
 1bb:	01 c3                	add    ebx,eax

Whereas with the builtins i instead get

 1af:	f3 0f b8 5c 24 18    	popcnt ebx,DWORD PTR [esp+0x18]

This is probably because of av_popcount64_c (Used unconditionally on x86_32) calling 
av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in x86_32.
The builtin seems to know proc_aff can't be right shifted 32 bits so it generates 
a single popcnt.

This is probably a rare case since av_popcount64 is not normally called with a 32bits 
argument, but the builtin handled it better.
If the argument used with av_popcount64 is effectively 64bits then both the builtin 
and the above asm seem to generate the same code.


More information about the ffmpeg-devel mailing list