[FFmpeg-devel] FASTDIV macro

Sun Nov 9 15:24:19 CET 2008

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> On Sunday 09 November 2008, Måns Rullgård wrote:
>> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>> > On Saturday 08 November 2008, Måns Rullgård wrote:
>> >> libavutil/internal.h defines a macro, FASTDIV(), for fast 32/16-bit
>> >> division by means of multiplying by a table value.  If the
>> >> architecture is not ARM or x86, which have asm versions, this macro is
>> >> defined as a normal division if CONFIG_FASTDIV is not set.  The odd
>> >> thing is, nothing ever sets CONFIG_FASTDIV.  Something is clearly not
>> >> right here.
>> >
>> > A right thing here would be a patch with a description like
>> > "Enabling FASTDIV macro for architecture X improves performance of
>> > FFmpeg on this use case by Y percent..."
>> >
>> >> I see these alternatives to fix it:
>> >
>> > I think you first need to provide some kind of convincing proof that
>> > it is broken. This macro is definitely useful for ARM processors
>> > without a hardware division instruction. In other cases I suspect
>> > that something like what is done by FASTDIV macro could be somehow
>> > implemented in silicon itself (some cases of division could be
>> > performed faster than the others). Even a benchmark of FASTDIV
>> > vs. native division for modern x86 cores would be interesting to
>> > see.
>>
>> What are you talking about?  I am not suggesting to change anything
>> for ARM or x86.
>
> I'm talking about the FASTDIV macro. Its primary use is to improve
> performance. Because of that any decisions about what to do must be
> primarily done based on the benchmarks, and not based on the
> theoretical discussion. You proposed a number of options, but the
> critical information is missing: performance impact of any of these
> options. You do have some x86 box, several ARM devices and PS3
> unless I'm missing something. So what's the problem with providing
> some benchmark numbers as well?

Before going through the trouble of running benchmarks, I wanted to
make sure there hadn't just been some silly mistake.

>> I'm talking about what to do with the impossible-to-enable C
>> version using the table.
>
> Of course it is possible just by patching a few lines of code ;)

The question was whether or not to do that patching.

> Attached is a very crude synthetic benchmarking program. Of
> course it does not take into account possible cache misses on the
> table access and also the fact that sometimes we may need to use
> expressions like "b==1 ? a : FASTDIV(a, b)".
>
> The results are the following:
>
> --- Pentium-M, gcc 4.3.2 (-O2) ---
> normaldiv(-1896828497) : time=2.195s
> fastdiv_c(-1896828497) : time=0.564s
> fastdiv_asm_x86(-1896828497) : time=0.416s
>
> --- Core2 (64-bit), gcc 4.1.2 (-O2) ---
> normaldiv(-1896828497) : time=0.681s
> fastdiv_c(-1896828497) : time=0.183s
> fastdiv_asm_x86(-1896828497) : time=0.222s

So plain C is faster than asm on Core2?  Did you look at the generated
code?

> --- ARM11, gcc 4.3.1 (-O2) ---
> normaldiv(-1896828497) : time=43.910s
> fastdiv_c(-1896828497) : time=5.480s
> fastdiv_asm_armv4(-1896828497) : time=5.049s
> fastdiv_asm_armv6(-1896828497) : time=4.629s

I ran a very similar test on Cortex-A8, and although I don't remember
the exact figures, the order came out the same.

I suspect that anything with a half-decent D-cache will benefit from
the table trick.  Cache-starved machines might suffer from the extra
cache pollution the table causes, at least if they have a reasonably
fast divide instruction.  Some MIPS incarnations fall in the second
category.
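
For anyone not familiar with the table trick, here is a minimal sketch
of the idea in plain C.  It is not the code from libavutil/internal.h;
the names (sketch_inverse, sketch_fastdiv, SKETCH_TABLE_SIZE) and the
table size are made up for illustration, and I only claim it is exact
for a < 65536 and 2 <= b < 256.  A divisor of 1 is left out because
2^32 does not fit in a 32-bit table entry, which is one reason for
expressions like "b==1 ? a : FASTDIV(a, b)".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size chosen for this sketch only. */
#define SKETCH_TABLE_SIZE 256

/* sketch_inverse[b] holds ceil(2^32 / b); entries 0 and 1 are unused. */
static uint32_t sketch_inverse[SKETCH_TABLE_SIZE];

/* Fill the table once, using ordinary 64-bit division. */
static void sketch_init_inverse(void)
{
    for (uint32_t b = 2; b < SKETCH_TABLE_SIZE; b++)
        sketch_inverse[b] = (uint32_t)(((UINT64_C(1) << 32) + b - 1) / b);
}

/* a / b as one table load, one 32x32->64 multiply and one shift.
 * Assumes a < 65536 and 2 <= b < SKETCH_TABLE_SIZE. */
static uint32_t sketch_fastdiv(uint32_t a, uint32_t b)
{
    assert(b >= 2 && b < SKETCH_TABLE_SIZE);
    return (uint32_t)(((uint64_t)a * sketch_inverse[b]) >> 32);
}

/* Count disagreements with the native divide over the assumed range. */
static unsigned sketch_count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t b = 2; b < SKETCH_TABLE_SIZE; b++)
        for (uint32_t a = 0; a < 65536; a++)
            bad += sketch_fastdiv(a, b) != a / b;
    return bad;
}
```

Call sketch_init_inverse() once before using sketch_fastdiv().  The
table is where the D-cache cost comes from: every division becomes a
load, so a machine with a small cache and a fast divider may lose.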

Could someone please run the test on a PPC G4 and/or G5?

-- 
Måns Rullgård
mans at mansr.com