[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen
Daniel Serpell
dserpell at gmail.com
Wed Jan 6 16:38:27 CET 2016
Hi!,
El Tue, Jan 05, 2016 at 09:01:40PM -0800, Ganesh Ajjanagadde escribio:
> On Tue, Jan 5, 2016 at 10:46 AM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> > On Tue, Jan 5, 2016 at 10:10 AM, Daniel Serpell <dserpell at gmail.com> wrote:
> >> Hi!,
> >>
> >> El Tue, Jan 05, 2016 at 08:08:35AM -0800, Ganesh Ajjanagadde escribio:
> >>> On Tue, Jan 5, 2016 at 7:44 AM, Daniel Serpell <dserpell at gmail.com> wrote:
> >>> > Hi!,
> >>> >
> >>> > El Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde escribio:
> >>> >> This exploits an approach based on the sieve of Eratosthenes, a popular
> >>> >> method for generating prime numbers.
> >>> >>
> >>> >> Tables are identical to previous ones.
> >>> >>
> >>> >> Tested with FATE with/without --enable-hardcoded-tables.
> >>> >>
> >>> >> Sample benchmark (Haswell, GNU/Linux+gcc):
> >>> >> prev:
> >>> >> 7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
> >>> >> 7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
> >>> >> [...]
> >>> >> 7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
> >>> >> 7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
> >>> >>
> >>> >> new:
> >>> >> 2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
> >>> >> 2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
> >>> >> [...]
> >>> >> 1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
> >>> >> 1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
> >>> >>
> >>> >
> >>> > See attached code, function "test1", based on an approximation of:
> >>> >
> >>> > (i+1)^(1/3) ~= i^(1/3) * ( 1 + 1/(3i) - 1/(9i) + 5/(81i) - .... )
> [...]
>
> So I looked up Mr. Fog's manuals, and it seems like divides are
> considerably slower than multiplies and adds. This made me somewhat
> skeptical of your idea, and so I benched it.
> Seems like your patch is actually significantly slower on my setup
> (gcc 5.3.0, GNU libm):
> 3349080 decicycles in cbrt_tableinit, 1 runs, 0 skips
> 3466605 decicycles in cbrt_tableinit, 2 runs, 0 skips
> [...]
> 3425939 decicycles in cbrt_tableinit, 256 runs, 0 skips
> 3414528 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> (clang 3.7.0):
> 3337590 decicycles in cbrt_tableinit, 1 runs, 0 skips
> 3276225 decicycles in cbrt_tableinit, 2 runs, 0 skips
> [...]
> 2871065 decicycles in cbrt_tableinit, 256 runs, 0 skips
> 2832137 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> The divides seem to be what is killing your approach.
> Times (just for the divisions, clang):
> 1085430 decicycles in cbrt_tableinit, 1 runs, 0 skips
> 1005165 decicycles in cbrt_tableinit, 2 runs, 0 skips
> [...]
> 863732 decicycles in cbrt_tableinit, 256 runs, 0 skips
> 864907 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> Another illustration, even if I change the 1.0/(3*i) to simply i to
> get rid of the multiplication as well, obviously not possible but
> really wishful thinking, you still only get:
> (clang)
> 2013390 decicycles in cbrt_tableinit, 1 runs, 0 skips
> 1950135 decicycles in cbrt_tableinit, 2 runs, 0 skips
> [...]
> 1668963 decicycles in cbrt_tableinit, 256 runs, 0 skips
> 1653179 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> To complete my curiousity,
> (gcc)
> 1485240 decicycles in cbrt_tableinit, 1 runs, 0 skips
> 1522115 decicycles in cbrt_tableinit, 2 runs, 0 skips
> [...]
> 1324325 decicycles in cbrt_tableinit, 256 runs, 0 skips
> 1322110 decicycles in cbrt_tableinit, 512 runs, 0 skips
>
> i.e not a whole lot faster than the benchmark I posted.
>
> If you feel this can't be right, I can do give a higher level
> justification, which shows that for a reasonable libm cbrt, and
> standard architectural assumptions, the approach I have is faster.
>
Thanks for your tests.
In my PC I measured a higher speedup, perhaps you can replace the main
loop with:
cb = 7;
for(i=343; i<8192; i++)
{
double s, r, t;
out[i] = cb * i;
// For big values, we use 4 terms:
r = (1.0/3.0)/i; // 0 A , 1 M , 1 D
t = r; //
s = 1.0 + r; // 1 A , 1 M , 1 D
r = r * r; // 1 A , 2 M , 1 D
s = s - r; // 2 A , 2 M , 1 D
r = r * t; // 2 A , 3 M , 1 D
s = s + 1.6644 *r; // 3 A , 4 M , 1 D
cb = cb * s; // 3 A , 5 M , 1 D
}
I know that divisions are considerably slower than multiplications, but
at least in my PC, the following is slower still (based on newton's
method):
// "cb" is the cube-root approximation to "1/i²"
cb = 1.0/49.0;
for(i=343; i<8192; i++)
{
double s = cb*cb*i;
cb = (1.0/3.0) * (4*cb - s*s);
s = cb*cb*i;
cb = (1.0/3.0) * (4*cb - s*s);
out[i] = cb * i * i;
}
Perhaps in your PC is faster?
Thanks,
Daniel.
More information about the ffmpeg-devel
mailing list