[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Thu Dec 10 20:24:22 CET 2009

On 02.12.09 12:52, Thilo Borgmann wrote:
> Thilo Borgmann wrote:
>> Michael Niedermayer wrote:
>>> On Mon, Nov 30, 2009 at 04:09:23PM +0100, Thilo Borgmann wrote:
>>>> Thilo Borgmann wrote:
>>>>> Måns Rullgård wrote:
>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>
>>>>>>> Måns Rullgård wrote:
>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>
>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>
>>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>>>>>
>>>>>>>>>>> Although I've managed to get the functions from alsdec.c inlined
>>>>>>>>>>> manually, according to the grep'ed output of the assembler code, it
>>>>>>>>>>> seems that manually inlining only the functions within that .c file
>>>>>>>>>>> is not enough with this technique.
>>>>>>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>>>>>>> issue should really not be solved in the makefile.
>>>>>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>>>>>> function into two, which are then called successively.
>>>>>>>>>
>>>>>>>>> To overcome the slowdown, I inspected which functions are inlined
>>>>>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>>>>>> on many functions within alsdec.c to get the same functions inlined
>>>>>>>>> as -finline-limit does.
>>>>>>>>>
>>>>>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>>>>>> the patch while using av_always_inline does not.
>>>>>>>> So it's not doing the same thing.  What is it doing differently?
>>>>>>>> Where did you get the limit number from?
>>>>>>>>
>>>>>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>>>>>    1 	call	L1102
>>>>>>>    1 	call	L138
>>>>>>>    1 	call	L456
>>>>>>>    2 	call	L___udivdi3$stub
>>>>>>>   10 	call	L_av_freep$stub
>>>>>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>>>>>   12 	call	L_av_log$stub
>>>>>>>    5 	call	L_av_log_missing_feature$stub
>>>>>>>    8 	call	L_av_malloc$stub
>>>>>>>    2 	call	L_av_mallocz$stub
>>>>>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>>>>>    6 	call	L_memcpy$stub
>>>>>>>    2 	call	L_memmove$stub
>>>>>>>    1 	call	L_memset$stub
>>>>>>>    2 	call	_decode_blocks_ind
>>>>>>>    4 	call	_decode_end
>>>>>>>   36 	call	_decode_rice
>>>>>>>   10 	call	_get_bits_long
>>>>>>>   11 	call	_parse_bs_info
>>>>>>>    2 	call	_zero_remaining
>>>>>>>
>>>>>>> All function calls within alsdec.s when using many av_always_inline's.
>>>>>>> This is designed to inline the same functions from alsdec.c as the
>>>>>>> unpatched alsdec.c would yield without any extra build option:
>>>>>>>    1 	call	L1561
>>>>>>>    1 	call	L176
>>>>>>>    1 	call	L21
>>>>>>>    2 	call	L___udivdi3$stub
>>>>>>>   10 	call	L_av_freep$stub
>>>>>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>>>>>   13 	call	L_av_log$stub
>>>>>>>    5 	call	L_av_log_missing_feature$stub
>>>>>>>    8 	call	L_av_malloc$stub
>>>>>>>    2 	call	L_av_mallocz$stub
>>>>>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>>>>>    1 	call	L_memcpy$stub
>>>>>>>    1 	call	L_memmove$stub
>>>>>>>    2 	call	L_memset$stub
>>>>>>>    8 	call	___inline_memcpy_chk
>>>>>>>    2 	call	___inline_memmove_chk
>>>>>>>    6 	call	_align_get_bits
>>>>>>>    5 	call	_av_ceil_log2
>>>>>>>    4 	call	_av_clip
>>>>>>>    4 	call	_decode_end
>>>>>>>   47 	call	_get_bits
>>>>>>>   90 	call	_get_bits1
>>>>>>>    3 	call	_get_bits_count
>>>>>>>   61 	call	_get_bits_left
>>>>>>>   39 	call	_get_bits_long
>>>>>>>    4 	call	_get_sbits_long
>>>>>>>   60 	call	_get_unary
>>>>>>>    2 	call	_init_get_bits
>>>>>>>    3 	call	_parse_bs_info
>>>>>>>    3 	call	_read_time
>>>>>>>    7 	call	_skip_bits
>>>>>>>    2 	call	_skip_bits1
>>>>>>>    5 	call	_skip_bits_long
>>>>>> Not inlining those get_bits etc will certainly slow things down,
>>>>>> that's for sure.
>>>>>>
>>>>>>> So -finline-limit can inline many functions in the object file which are
>>>>>>> not part of alsdec.c. That might be the reason for the performance
>>>>>>> difference.
>>>>>>>
>>>>>>> But using -finline-limit does not yield a speed gain for the unpatched
>>>>>>> file! So there might be something else that I don't see.
>>>>>>>
>>>>>>> The value of 4096 was chosen arbitrarily. As long as I don't know
>>>>>>> exactly why -finline-limit removes the slowdown and whether it can be
>>>>>>> replaced by another approach, there is no need to figure out a more
>>>>>>> optimal value...
>>>>>> We should do some benchmarks using that flag globally and see what
>>>>>> happens.  Maybe we'd gain from using it everywhere.
>>>>> Like Michael said, this would be a big test across different platforms
>>>>> and compilers, which I cannot do alone, so several people would have to
>>>>> help - if a benchmark indicates that it might be worth testing.
>>>>>
>>>>> Also, I'm lacking a good idea of how to test this efficiently by
>>>>> measuring the execution time of ffmpeg without other factors like hard
>>>>> drives playing a predominant role.
>>>> I played around a little with the regression tests and audio decoders.
>>>> For most of my tests -finline-limit=4096 makes it a little faster, e.g.
>>>>
>>>> g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
>>>> alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
>>>> flac:   842020 dezicycles ->   786226 dezicycles ( 7%)
>>>> wma:   3663166 dezicycles ->  3197273 dezicycles (14%)
>>>>
>>>> which is not surprising. Inlining comes at a price: the ffmpeg executable
>>>> grew from 5.4 MB to 6.1 MB.
>>>> The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
>>> What about video codecs? h264, mpeg4, mpeg2, h263?
>>
>> Can do tomorrow.
> 
> h.261: 34067354 dezicycles -> 33048969 dezicycles ( 3%)
> h.263: 32138793 dezicycles -> 30895187 dezicycles ( 4%)
> 
> For h.264 we are using external libraries and there seems to be no
> regression test for these? (I set a timer in libx264.c and h264.c and got
> no measurements.)
> 
> Anyway, the video regression tests yield dezicycle measurements for only
> around 32 runs, which are not really stable...
> I tested h263 with a longer video and ended up at 512 runs with 1% more
> dezicycles needed - so slightly worse in fact.
> 
> So I got the impression that the video decoders do not profit from that
> compiler option in the way the audio decoders do. Why that is the case is
> another question though.
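
For anyone rechecking the figures quoted above: the percentages appear to be consistent with the speedup ratio old/new - 1, truncated to a whole percent. That is an inference from the numbers, not something stated in the thread. A minimal sketch, where speedup_percent is a hypothetical helper and not part of FFmpeg:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of FFmpeg.
 * Returns the whole-percent speedup between two cycle counts,
 * computed as (before / after - 1) * 100 and truncated toward zero.
 * Assumption: this is how the percentages in the thread were derived. */
static int speedup_percent(int64_t before, int64_t after)
{
    assert(after > 0);
    return (int)((before - after) * 100 / after);
}
```

Called with the g726 numbers, speedup_percent(47001535, 41628457), this gives 12, which matches the quoted value. A negative result would indicate a slowdown.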

If no one is looking at this anymore, I assume this patch is
rejected?

-Thilo
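
For reference, the per-function approach discussed in the thread can be sketched as follows. In the second listing the av_always_inline build still calls get_bits1, get_unary and similar helpers, which are not part of alsdec.c; an attribute only takes effect on the function definitions it is attached to, whereas -finline-limit changes the compiler's size heuristic for every inline candidate in the translation unit. Everything named sketch_* below is hypothetical and only stands in for the FFmpeg macro and bit reader; it is not the FFmpeg API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for av_always_inline (assumption: GCC-style
 * attribute, falling back to plain inline on other compilers). */
#if defined(__GNUC__)
#    define sketch_always_inline __attribute__((always_inline)) inline
#else
#    define sketch_always_inline inline
#endif

/* Hypothetical MSB-first bit reader, much simpler than FFmpeg's. */
typedef struct SketchBitReader {
    const uint8_t *buf;
    unsigned       pos;           /* current position in bits */
    unsigned       size_in_bits;  /* total number of readable bits */
} SketchBitReader;

/* Forced inline: the body is expanded at every call site, so no
 * "call" to this helper should remain in the assembler output. */
static sketch_always_inline unsigned sketch_get_bits1(SketchBitReader *br)
{
    assert(br->pos < br->size_in_bits);
    unsigned byte = br->buf[br->pos >> 3];
    unsigned bit  = (byte >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* No attribute: whether this is inlined into its callers is left to
 * the compiler's heuristics, which -finline-limit influences. It counts
 * leading 1 bits and consumes the terminating 0 bit if there is one. */
static unsigned sketch_get_unary(SketchBitReader *br, unsigned max_len)
{
    unsigned n = 0;
    while (n < max_len && br->pos < br->size_in_bits && sketch_get_bits1(br))
        n++;
    return n;
}
```

Grepping the generated assembler for call instructions, as done earlier in the thread, is one way to check which of the two helpers actually ended up inlined for a given compiler and flag combination.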