[FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm
Ganesh Ajjanagadde
gajjanag at mit.edu
Sun Oct 11 02:56:52 CEST 2015
On Sat, Oct 10, 2015 at 8:37 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> On Sat, Oct 10, 2015 at 8:34 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
>> On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer
>> <michael at niedermayer.cc> wrote:
>>> On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote:
>>>> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
>>>> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
>>>> > <michael at niedermayer.cc> wrote:
>>>> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
>>>> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
>>>> > >> <gajjanagadde at gmail.com> wrote:
>>>> > >> > This uses Stein's binary GCD algorithm:
>>>> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
>>>> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard architectures.
>>>> > >> > Have not benchmarked, so I can't comment
>>>> > >>
>>>> > >> Before you submit a patch that's supposed to make something faster,
>>>> > >> you should benchmark it to verify that it is in fact faster. Do this
>>>> > >> with inputs of various sizes on both 32- and 64-bit architectures and
>>>> > >> both with and without compilers that support __builtin_ctzll(v).
>>>> > >
>>>> > > without __builtin_ctzll() the old code seems faster in a simple test
>>>> > > like:
>>>> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
>>>> >
>>>> > >
>>>> > > also it seems a simple
>>>> > > while (u != v) {
>>>> > > if (u > v)
>>>> > > FFSWAP(int64_t, v, u);
>>>> > > v -= u;
>>>> > > do {
>>>> > > v >>= 1;
>>>> > > } while(!(v & 1));
>>>> > > }
>>>> > > is faster than the non builtin ctzll in its place but still not as
>>>> > > fast as the current code
>>>> >
>>>> > So I checked out the various implementation of (non built in) ctzll at
>>>> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
>>>> > The funny thing is, among the looping ones, the fastest is the simplest :):
>>>> > static inline int ff_ctzll_c(long long v)
>>>> > {
>>>> > int c = 0;
>>>> > while (!(v & 1)) {
>>>> > c++;
>>>> > v >>= 1;
>>>> > }
>>>> > return c;
>>>> > }
>>>
>>> tested this with the matrixbench_mpeg2 ->mov test and it seems
>>> 20% or so slower than the LUT based version
>>
>> If the LUT seems to be a good idea, I think the De Bruijn one should
>> be even better - no branching, only a multiply. and some shifts etc
>> Cache effects should be similar for the LUT and the De Bruijn, perhaps
>> slightly in favor of De Bruijn due to a smaller table (64 vs 256
>> entries).
>>
>> Most places I see only the 32 bit version, so I can't quickly check this.
>
> For reference: http://supertech.csail.mit.edu/papers/debruijn.pdf.
De-Bruijn works well, using my random gcd bench: 22 seconds old, 11
seconds (De Bruijn), 7 seconds (built-in).
I will do some FATE testing and then post updated patch.
>
>>
>>>
>>>
>>>>
>>>> its possible gcc/clang recognizes this and replaces it by the
>>>> instruction for the builtin
>>>> this is perfectly fine when such an instruction is available
>>>> but when not this might be slower than what we would expect from
>>>> just testing on architectures with such instructions
>>>
>>> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>>> fails to replace the loop by anything smart
>>>
>>> you can use
>>> make libavutil/mathematics.s
>>>
>>> to compile the C code to asm and see how the functions get compiled
>>>
>>> [...]
>>>
>>> --
>>> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>>>
>>> What does censorship reveal? It reveals fear. -- Julian Assange
>>>
>>> _______________________________________________
>>> ffmpeg-devel mailing list
>>> ffmpeg-devel at ffmpeg.org
>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>>>
More information about the ffmpeg-devel
mailing list