[FFmpeg-devel] [PATCH] Split reading and decoding of blocks in ALS
Thilo Borgmann
Thu Nov 26 10:40:56 CET 2009
Michael Niedermayer schrieb:
> On Wed, Nov 25, 2009 at 01:26:57PM +0100, Thilo Borgmann wrote:
>> Michael Niedermayer schrieb:
>>> On Mon, Nov 23, 2009 at 06:24:00PM -0500, Justin Ruggles wrote:
>>>> Thilo Borgmann wrote:
>>>>
>>>>> Michael Niedermayer schrieb:
>>>>>> On Mon, Nov 23, 2009 at 11:28:35AM +0100, Thilo Borgmann wrote:
>>>>>>> Michael Niedermayer schrieb:
>>>>>>>> On Sat, Nov 14, 2009 at 02:15:30PM +0100, Thilo Borgmann wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> in order to support multi-channel correlation, reading and decoding of a
>>>>>>>>> block have to be separated. This patch introduces an ALSBlockData struct
>>>>>>>>> for that purpose.
>>>>>>>> [...]
>>>>>>>>> +/** Decodes the block data for a constant block
>>>>>>>>> + */
>>>>>>>>> +static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
>>>>>>>>> +{
>>>>>>>>> + int smp;
>>>>>>>>> +
>>>>>>>>> // write raw samples into buffer
>>>>>>>>> - for (k = 0; k < block_length; k++)
>>>>>>>>> - raw_samples[k] = const_val;
>>>>>>>>> + for (smp = 0; smp < bd->block_length; smp++)
>>>>>>>>> + bd->raw_samples[smp] = bd->const_val;
>>>>>>>>> }
>>>>>>>> memset32(dst, val, len){
>>>>>>>> }
>>>>>>>> would possibly be faster due to not doing bd-> in the loop
>>>>>>> Ok will do...
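For the record, what I understand the suggestion to be is roughly the
following (untested sketch; memset32() would be a new local helper in
alsdec.c, not an existing libavutil function):

static void memset32(int32_t *dst, int32_t val, int len)
{
    while (len--)
        *dst++ = val;
}

/** Decodes the block data for a constant block
 */
static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
{
    // write raw samples into the buffer without dereferencing bd in the loop
    memset32(bd->raw_samples, bd->const_val, bd->block_length);
}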
>>>>>>>
>>>>>>>
>>>>>>>> [...]
>>>>>>>>> @@ -553,115 +579,132 @@
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> if (get_bits1(gb))
>>>>>>>>> - *shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>>>> + bd->shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>>>>
>>>>>>>>> - store_prev_samples = (*js_blocks && raw_other) || *shift_lsbs;
>>>>>>>>> + bd->store_prev_samples = (bd->js_blocks && bd->raw_other) || bd->shift_lsbs;
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> if (!sconf->rlslms) {
>>>>>>>>> if (sconf->adapt_order) {
>>>>>>>>> - int opt_order_length = av_ceil_log2(av_clip((block_length >> 3) - 1,
>>>>>>>>> + int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
>>>>>>>>> 2, sconf->max_order + 1));
>>>>>>>>> - opt_order = get_bits(gb, opt_order_length);
>>>>>>>>> + bd->opt_order = get_bits(gb, opt_order_length);
>>>>>>>>> } else {
>>>>>>>>> - opt_order = sconf->max_order;
>>>>>>>>> + bd->opt_order = sconf->max_order;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> - if (opt_order) {
>>>>>>>>> + if (bd->opt_order) {
>>>>>>>>> int add_base;
>>>>>>>>>
>>>>>>>>> if (sconf->coef_table == 3) {
>>>>>>>>> add_base = 0x7F;
>>>>>>>>>
>>>>>>>>> // read coefficient 0
>>>>>>>>> - quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>> + bd->quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>
>>>>>>>>> // read coefficient 1
>>>>>>>>> - if (opt_order > 1)
>>>>>>>>> - quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>> + if (bd->opt_order > 1)
>>>>>>>>> + bd->quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>
>>>>>>>>> // read coefficients 2 to opt_order
>>>>>>>>> - for (k = 2; k < opt_order; k++)
>>>>>>>>> - quant_cof[k] = get_bits(gb, 7);
>>>>>>>>> + for (k = 2; k < bd->opt_order; k++)
>>>>>>>>> + bd->quant_cof[k] = get_bits(gb, 7);
>>>>>>>>> } else {
>>>>>>>>> int k_max;
>>>>>>>>> add_base = 1;
>>>>>>>>>
>>>>>>>>> // read coefficient 0 to 19
>>>>>>>>> - k_max = FFMIN(opt_order, 20);
>>>>>>>>> + k_max = FFMIN(bd->opt_order, 20);
>>>>>>>>> for (k = 0; k < k_max; k++) {
>>>>>>>>> int rice_param = parcor_rice_table[sconf->coef_table][k][1];
>>>>>>>>> int offset = parcor_rice_table[sconf->coef_table][k][0];
>>>>>>>>> - quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>>>> + bd->quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> // read coefficients 20 to 126
>>>>>>>>> - k_max = FFMIN(opt_order, 127);
>>>>>>>>> + k_max = FFMIN(bd->opt_order, 127);
>>>>>>>>> for (; k < k_max; k++)
>>>>>>>>> - quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>>>> + bd->quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>>>>
>>>>>>>>> // read coefficients 127 to opt_order
>>>>>>>>> - for (; k < opt_order; k++)
>>>>>>>>> - quant_cof[k] = decode_rice(gb, 1);
>>>>>>>>> + for (; k < bd->opt_order; k++)
>>>>>>>>> + bd->quant_cof[k] = decode_rice(gb, 1);
>>>>>>>>>
>>>>>>>>> - quant_cof[0] = 32 * parcor_scaled_values[quant_cof[0] + 64];
>>>>>>>>> + bd->quant_cof[0] = 32 * parcor_scaled_values[bd->quant_cof[0] + 64];
>>>>>>>>>
>>>>>>>>> - if (opt_order > 1)
>>>>>>>>> - quant_cof[1] = -32 * parcor_scaled_values[quant_cof[1] + 64];
>>>>>>>>> + if (bd->opt_order > 1)
>>>>>>>>> + bd->quant_cof[1] = -32 * parcor_scaled_values[bd->quant_cof[1] + 64];
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> - for (k = 2; k < opt_order; k++)
>>>>>>>>> - quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);
>>>>>>>>> + for (k = 2; k < bd->opt_order; k++)
>>>>>>>>> + bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);
>>>>>>>> it might make sense to have commonly used variables like opt_order on the
>>>>>>>> stack instead of just accessible through a pointer
>>>>>>> Yes.
>>>>>>>
>>>>>>>> besides does this patch lead to any slowdown?
>>>>>>> The benchmark results showed a little slowdown, but at least it is hardly
>>>>>>> noticeable on the command line...
>>>>>>> We had this issue on the soc list. It ended up with me providing some
>>>>>>> assembler file grep results to check the inlining done by the compiler.
>>>>>> was the slowdown fixed or not?
>>>>>> "not" is bad
>>>>> It was not. You blamed the many bd-> dereferences as a reason for the
>>>>> slowdown and requested a check for wrong inlining before considering other
>>>>> possibilities to make this faster. I'm not able to tell anything about
>>>>> the inlining from the assembler files, so I provided the grep result you
>>>>> asked for.
>>>>>
>>>>> See:
>>>>> https://lists.mplayerhq.hu/pipermail/ffmpeg-soc/2009-October/008364.html
>>>>>
>>>>> I'm ready to fix that slowdown issue, all I need is a little advice
>>>>> on how to... the obvious approach of using local variables to store
>>>>> bd->something, removing many dereferences that way, did not yield a
>>>>> significant reduction of the slowdown.
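To be concrete about what I tried there: taking the coefficient loop from
the patch above as an example, the local-variable version basically hoists
the bd-> fields before the loop (sketch only):

    // cache frequently used fields in locals so gcc does not have to
    // reload them through bd-> on every iteration
    int      opt_order = bd->opt_order;
    int32_t *quant_cof = bd->quant_cof;

    for (k = 2; k < opt_order; k++)
        quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);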
>>>> grepping for "call" and removing duplicates gives:
>>>>
>>>> combined:
>>>> call _parse_bs_info
>>>> call _decode_end
>>>> call _read_block_data
>>>> call _zero_remaining
>>>>
>>>> separate:
>>>> call _parse_bs_info
>>>> call _decode_end
>>>> call _decode_var_block_data
>>>> call _zero_remaining
>>>> call _read_var_block_data
>>>> call _decode_blocks_ind
>>>>
>>>> It seems that decode_blocks_ind is inlined in the combined version but
>>>> not in the separate version.
>> Thanks for that!
>>
>>> so, does adding always_inline help?
>>> (note, I expect that adding this to one function will make gcc stop
>>> inlining other unrelated functions, so you might need many always_inline
>>> to force gcc not to be silly)
>> I've redone all the benchmarking and attached the results in a text
>> file. The benchmark times the call to read_frame_data() in
>> decode_frame(). I've listed the functions marked av_always_inline, the
>> benchmark results and the grep results for calls in the assembler code.
>> All done using gcc 4.0.
>>
>> Unfortunately, using av_always_inline for many functions (so that the
>> same calls remain within alsdec.c as in the combined version) reveals
>> many other functions that are not inlined anymore and therefore does not
>> seem to reduce the slowdown. And as you said, just av_always_inline'ing
>> decode_blocks_ind leaves many other functions not inlined anymore...
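To make explicit what "using av_always_inline" means here: the helpers
simply get the libavutil macro added to their declarations, roughly like
this (signature abbreviated, body unchanged):

static av_always_inline void decode_var_block_data(ALSDecContext *ctx,
                                                   ALSBlockData *bd)
{
    /* ... body as in the patch ... */
}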
>>
>> Anything else I can test?
>
> There are various command line switches for gcc that also control
> inlining.
> You could report this to the gcc devels; not sure what that would lead
> to, but I guess it might be worth a try.
>
> What is certain, though, is that a code split / factorization that reduces
> speed by 10% is rejected.
Compilation of alsdec.c is done using -O3, which enables the common
inlining optimizations. One option left untouched by this is -finline-limit.
I could not figure out what its default value is, but I tested the
separate version (pure patch) without using av_always_inline anywhere,
setting -finline-limit more or less by chance to 4096, and it ran as fast
as the combined version:
8028640 dezicycles in sep, 1 runs, 0 skips
9019255 dezicycles in sep, 2 runs, 0 skips
12334895 dezicycles in sep, 4 runs, 0 skips
15384680 dezicycles in sep, 8 runs, 0 skips
15415286 dezicycles in sep, 16 runs, 0 skips
15224882 dezicycles in sep, 32 runs, 0 skips
15370329 dezicycles in sep, 64 runs, 0 skips
2 call _decode_blocks_ind
4 call _decode_end
36 call _decode_rice
10 call _get_bits_long
11 call _parse_bs_info
2 call _zero_remaining
The assembler output shows that decode_blocks_ind() is still not inlined
without av_always_inline, but {read,decode}_var_block_data() are
(compare with the quoted output above).
Using -finline-limit with the combined version does not yield any speed
gain.
So, is using -finline-limit an option? If so, someone with more
experience with gcc might suggest a better value than 4096 for it.
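For reference, the flag just goes into the compiler flags, e.g. by passing
--extra-cflags='-finline-limit=4096' to configure or by adding
-finline-limit=4096 directly to the gcc command line for alsdec.c.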
Otherwise, there are two options, I think: either use a split version of
the functions for multi-channel correlation only, which means dealing with
a lot of code duplication somehow, or not support multi-channel correlation
at all, which is not really an option...
-Thilo