[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Mon Nov 30 16:09:23 CET 2009

Thilo Borgmann wrote:
> Måns Rullgård wrote:
>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>
>>> Måns Rullgård wrote:
>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>
>>>>> Måns Rullgård wrote:
>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>
>>>>>>> Måns Rullgård wrote:
>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>
>>>>>>> Although I've managed to have the functions from alsdec.c inlined
>>>>>>> manually, according to the grep'ed output of the assembler code, it seems
>>>>>>> that manually inlining only the functions from within that .c file
>>>>>>> using this technique is not enough.
>>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>>> issue should really not be solved in the makefile.
>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>> function into two, which are then called successively.
>>>>>
>>>>> To overcome the slowdown issue, I inspected the functions being inlined
>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>> for many functions within alsdec.c to have the same functions inlined
>>>>> as -finline-limit does.
>>>>>
>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>> the patch while using av_always_inline does not.
>>>> So it's not doing the same thing.  What is it doing differently?
>>>> Where did you get the limit number from?
>>>>
>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>    1 	call	L1102
>>>    1 	call	L138
>>>    1 	call	L456
>>>    2 	call	L___udivdi3$stub
>>>   10 	call	L_av_freep$stub
>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>   12 	call	L_av_log$stub
>>>    5 	call	L_av_log_missing_feature$stub
>>>    8 	call	L_av_malloc$stub
>>>    2 	call	L_av_mallocz$stub
>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>    6 	call	L_memcpy$stub
>>>    2 	call	L_memmove$stub
>>>    1 	call	L_memset$stub
>>>    2 	call	_decode_blocks_ind
>>>    4 	call	_decode_end
>>>   36 	call	_decode_rice
>>>   10 	call	_get_bits_long
>>>   11 	call	_parse_bs_info
>>>    2 	call	_zero_remaining
>>>
>>> All function calls within alsdec.s when using many av_always_inline's.
>>> This is designed to inline the same functions from alsdec.c as the
>>> unpatched alsdec.c would yield without any extra build option:
>>>    1 	call	L1561
>>>    1 	call	L176
>>>    1 	call	L21
>>>    2 	call	L___udivdi3$stub
>>>   10 	call	L_av_freep$stub
>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>   13 	call	L_av_log$stub
>>>    5 	call	L_av_log_missing_feature$stub
>>>    8 	call	L_av_malloc$stub
>>>    2 	call	L_av_mallocz$stub
>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>    1 	call	L_memcpy$stub
>>>    1 	call	L_memmove$stub
>>>    2 	call	L_memset$stub
>>>    8 	call	___inline_memcpy_chk
>>>    2 	call	___inline_memmove_chk
>>>    6 	call	_align_get_bits
>>>    5 	call	_av_ceil_log2
>>>    4 	call	_av_clip
>>>    4 	call	_decode_end
>>>   47 	call	_get_bits
>>>   90 	call	_get_bits1
>>>    3 	call	_get_bits_count
>>>   61 	call	_get_bits_left
>>>   39 	call	_get_bits_long
>>>    4 	call	_get_sbits_long
>>>   60 	call	_get_unary
>>>    2 	call	_init_get_bits
>>>    3 	call	_parse_bs_info
>>>    3 	call	_read_time
>>>    7 	call	_skip_bits
>>>    2 	call	_skip_bits1
>>>    5 	call	_skip_bits_long
>> Not inlining those get_bits etc will certainly slow things down,
>> that's for sure.
>>
>>> So -finline-limit can inline many functions in the object file which are
>>> not part of alsdec.c, which might be the reason for the performance
>>> difference.
>>>
>>> But using -finline-limit does not yield a speed gain for the unpatched
>>> file! So there might be something else that I don't see.
>>>
>>> The value of 4096 has been chosen randomly. As long as I don't know
>>> exactly why -finline-limit removes the slowdown and whether it can be
>>> replaced by another approach, there is no need to figure out a more
>>> optimal value...
>> We should do some benchmarks using that flag globally and see what
>> happens.  Maybe we'd gain from using it everywhere.
> 
> As Michael said, this would be a big test across different platforms and
> compilers, which I cannot do alone, so several people would have to take
> part, provided a benchmark indicates that it is worth testing at all.
> 
> Also, I'm lacking a good idea of how to test this efficiently without
> other factors such as hard drives playing a predominant role, which is
> what happens when testing the execution time of ffmpeg as a whole.

I played around a little with the regression tests and audio decoders.
For most of my tests -finline-limit=4096 makes it a little faster, e.g.

g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
flac:   842020 dezicycles ->   786226 dezicycles ( 7%)
wma:   3663166 dezicycles ->  3197273 dezicycles (14%)

which is not surprising. Inlining comes at a price: the ffmpeg executable
grew from 5.4 MB to 6.1 MB.
The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
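
For reference, the percentages above can be reproduced with a few lines of C. This is only a sketch: the helper names speedup_percent and reduction_percent are made up for illustration and are not part of FFmpeg. The quoted figures appear consistent with the first convention (old count divided by new count, minus one, truncated to a whole percent); the second convention, measured against the old count, gives slightly smaller numbers.

```c
#include <stdint.h>

/* Hypothetical helper: relative speedup in percent, measured against
 * the new (smaller) cycle count, i.e. (before / after - 1) * 100. */
static double speedup_percent(uint64_t before, uint64_t after)
{
    return ((double)before / (double)after - 1.0) * 100.0;
}

/* Hypothetical helper: relative reduction in percent, measured against
 * the old (larger) cycle count, i.e. (1 - after / before) * 100. */
static double reduction_percent(uint64_t before, uint64_t after)
{
    return (1.0 - (double)after / (double)before) * 100.0;
}
```

With the g726 counts, speedup_percent(47001535, 41628457) truncates to 12, while reduction_percent on the same pair truncates to 11, so it matters which convention is stated when comparing results from different testers.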

> 
> But does a general gain from this option make it a good candidate to be
> added globally? If yes, could we add this specifically to als for the
> time being instead of holding back als decoder development completely?
> Benchmarking and testing will surely take a lot of time...

Still unanswered. Unfortunately, I cannot test compilers other than
gcc either.

So what do you think about adding this locally/globally?

If there is no consensus on adding it, or at least on testing it on
various platforms first, I will have to start thinking about alternatives
for ALS.
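
For context, the attribute-based route raised earlier in the thread (always_inline, noinline, flatten) looks roughly like the sketch below. The macro and function names are invented for illustration and are not taken from alsdec.c. It assumes a GCC-compatible compiler, and the flatten attribute may not be available in older GCC releases. Per the measurements quoted above, always_inline alone did not remove the slowdown; whether flatten would do so is not established here.

```c
#include <stdint.h>

/* Hypothetical macros for this sketch; they fall back to plain C when
 * the compiler does not provide GCC-style attributes. */
#if defined(__GNUC__)
#    define sketch_always_inline static inline __attribute__((always_inline))
#    define sketch_flatten       __attribute__((flatten))
#else
#    define sketch_always_inline static inline
#    define sketch_flatten
#endif

/* Small bit reader that the compiler is forced to inline at each call. */
sketch_always_inline unsigned read_bit(const uint8_t *buf, unsigned pos)
{
    return (buf[pos >> 3] >> (7 - (pos & 7))) & 1;
}

/* flatten asks the compiler to inline every call made inside this
 * function where it can, without annotating each callee separately. */
sketch_flatten unsigned count_set_bits(const uint8_t *buf, unsigned nbits)
{
    unsigned i, n = 0;

    for (i = 0; i < nbits; i++)
        n += read_bit(buf, i);
    return n;
}
```

Unlike -finline-limit, which changes the inlining heuristics for a whole compilation unit from the build system, these attributes keep the decision in the C source, which is the direction asked for earlier in the thread.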

-Thilo