[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen

Fri Jan 8 03:46:39 CET 2016

On Thu, Jan 7, 2016 at 5:20 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> On Thu, Jan 7, 2016 at 4:48 PM, Michael Niedermayer
> <michael at niedermayer.cc> wrote:
>> On Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde wrote:
>>> This exploits an approach based on the sieve of Eratosthenes, a popular
>>> method for generating prime numbers.
>>>
>>> Tables are identical to previous ones.
>>>
>>> Tested with FATE with/without --enable-hardcoded-tables.
>>>
>>> Sample benchmark (Haswell, GNU/Linux+gcc):
>>> prev:
>>> 7860100 decicycles in cbrt_tableinit,       1 runs,      0 skips
>>> 7777490 decicycles in cbrt_tableinit,       2 runs,      0 skips
>>> [...]
>>> 7582339 decicycles in cbrt_tableinit,     256 runs,      0 skips
>>> 7563556 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>>
>>> new:
>>> 2099480 decicycles in cbrt_tableinit,       1 runs,      0 skips
>>> 2044470 decicycles in cbrt_tableinit,       2 runs,      0 skips
>>> [...]
>>> 1796544 decicycles in cbrt_tableinit,     256 runs,      0 skips
>>> 1791631 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>>
>>> Both small and large run counts are given: this function is called once,
>>> so the small run counts may give a better picture. The small-run numbers
>>> are fairly consistent, and there is a steady downward trend from small
>>> to large runs, at which point the timing stabilizes to a new value.
>>>
>>> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>>> ---
>>>  libavcodec/aacdec_fixed.c           |  4 +--
>>>  libavcodec/aacdec_template.c        |  2 +-
>>>  libavcodec/cbrt_tablegen.h          | 53 ++++++++++++++++++++++++++-----------
>>>  libavcodec/cbrt_tablegen_template.c | 12 ++++++++-
>>>  4 files changed, 51 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/libavcodec/aacdec_fixed.c b/libavcodec/aacdec_fixed.c
>>> index 396a874..f7b882b 100644
>>> --- a/libavcodec/aacdec_fixed.c
>>> +++ b/libavcodec/aacdec_fixed.c
>>> @@ -155,9 +155,9 @@ static void vector_pow43(int *coefs, int len)
>>>      for (i=0; i<len; i++) {
>>>          coef = coefs[i];
>>>          if (coef < 0)
>>> -            coef = -(int)cbrt_tab[-coef];
>>> +            coef = -(int)cbrt_tab[-coef].i;
>>>          else
>>> -            coef = (int)cbrt_tab[coef];
>>> +            coef = (int)cbrt_tab[coef].i;
>>>          coefs[i] = coef;
>>>      }
>>>  }
>>> diff --git a/libavcodec/aacdec_template.c b/libavcodec/aacdec_template.c
>>> index d819958..1380510 100644
>>> --- a/libavcodec/aacdec_template.c
>>> +++ b/libavcodec/aacdec_template.c
>>> @@ -1791,7 +1791,7 @@ static int decode_spectrum_and_dequant(AACContext *ac, INTFLOAT coef[1024],
>>>                                          v = -v;
>>>                                      *icf++ = v;
>>>  #else
>>> -                                    *icf++ = cbrt_tab[n] | (bits & 1U<<31);
>>> +                                    *icf++ = cbrt_tab[n].i | (bits & 1U<<31);
>>>  #endif /* USE_FIXED */
>>>                                      bits <<= 1;
>>>                                  } else {
>>> diff --git a/libavcodec/cbrt_tablegen.h b/libavcodec/cbrt_tablegen.h
>>> index 59b5a1d..e3d6634 100644
>>> --- a/libavcodec/cbrt_tablegen.h
>>> +++ b/libavcodec/cbrt_tablegen.h
>>> @@ -26,14 +26,13 @@
>>>  #include <stdint.h>
>>>  #include <math.h>
>>>  #include "libavutil/attributes.h"
>>> +#include "libavutil/intfloat.h"
>>>  #include "libavcodec/aac_defines.h"
>>>
>>> -#if USE_FIXED
>>> -#define CBRT(x) lrint((x).f * 8192)
>>> -#else
>>> -#define CBRT(x) x.i
>>> -#endif
>>> -
>>
>>> +union ff_int32float64 {
>>> +    uint32_t i;
>>> +    double   f;
>>> +};
>>>  #if CONFIG_HARDCODED_TABLES
>>>  #if USE_FIXED
>>>  #define cbrt_tableinit_fixed()
>>> @@ -43,20 +42,42 @@
>>>  #include "libavcodec/cbrt_tables.h"
>>>  #endif
>>>  #else
>>> -static uint32_t cbrt_tab[1 << 13];
>>> +static union ff_int32float64 cbrt_tab[1 << 13];
>>
>> this doubles the size of the cpu cache needed at runtime to store
>> the same number of elements
>
> Yes, it does, and it was a tradeoff I made that I forgot to list. One
> can of course use floats, but this loses a significant amount of
> accuracy.
>
> So one could malloc and free a double precision array (for temporary
> storage) at costs of some code complexity, possible heap
> fragmentation, and the problem of possible failure (may be ok since
> anyway aac_decode_init is not guaranteed to succeed; it allocates
> memory for the dsp context). Malloc/free is AFAIK ~ 100's of cycles,
> dwarfed by the table generation cost.

Or use a local static array: once init'ed, this will be handled in a
natural way. That is superior to the malloc solution, and IMHO is fine.