[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)
Thilo Borgmann
thilo.borgmann
Wed Dec 2 12:52:47 CET 2009
Thilo Borgmann wrote:
> Michael Niedermayer wrote:
>> On Mon, Nov 30, 2009 at 04:09:23PM +0100, Thilo Borgmann wrote:
>>> Thilo Borgmann wrote:
>>>> Måns Rullgård wrote:
>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>
>>>>>> Måns Rullgård wrote:
>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>
>>>>>>>> Måns Rullgård wrote:
>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>
>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>>>>
>>>>>>>>>> Although I've managed to get the functions from alsdec.c inlined
>>>>>>>>>> manually, according to the grep'ed output of the assembler code, it
>>>>>>>>>> seems that manually inlining only the functions from within that .c
>>>>>>>>>> file with this technique is not enough.
>>>>>>>>> I'm confused. Can it be done in the C code only or not? This kind of
>>>>>>>>> issue should really not be solved in the makefile.
>>>>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>>>>> function into two, which are then called successively.
>>>>>>>>
>>>>>>>> To overcome the slowdown issue, I inspected the functions being inlined
>>>>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>>>>> for many functions within alsdec.c to get the same functions inlined
>>>>>>>> as -finline-limit does.
>>>>>>>>
>>>>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>>>>> the patch while using av_always_inline does not.
>>>>>>> So it's not doing the same thing. What is it doing differently?
>>>>>>> Where did you get the limit number from?
>>>>>>>
>>>>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>>>> 1 call L1102
>>>>>> 1 call L138
>>>>>> 1 call L456
>>>>>> 2 call L___udivdi3$stub
>>>>>> 10 call L_av_freep$stub
>>>>>> 1 call L_av_get_bits_per_sample_format$stub
>>>>>> 12 call L_av_log$stub
>>>>>> 5 call L_av_log_missing_feature$stub
>>>>>> 8 call L_av_malloc$stub
>>>>>> 2 call L_av_mallocz$stub
>>>>>> 1 call L_ff_mpeg4audio_get_config$stub
>>>>>> 6 call L_memcpy$stub
>>>>>> 2 call L_memmove$stub
>>>>>> 1 call L_memset$stub
>>>>>> 2 call _decode_blocks_ind
>>>>>> 4 call _decode_end
>>>>>> 36 call _decode_rice
>>>>>> 10 call _get_bits_long
>>>>>> 11 call _parse_bs_info
>>>>>> 2 call _zero_remaining
>>>>>>
>>>>>> All function calls within alsdec.s when using many av_always_inline's.
>>>>>> This is designed to inline the same functions from alsdec.c as the
>>>>>> unpatched alsdec.c would get inlined without any extra build option:
>>>>>> 1 call L1561
>>>>>> 1 call L176
>>>>>> 1 call L21
>>>>>> 2 call L___udivdi3$stub
>>>>>> 10 call L_av_freep$stub
>>>>>> 1 call L_av_get_bits_per_sample_format$stub
>>>>>> 13 call L_av_log$stub
>>>>>> 5 call L_av_log_missing_feature$stub
>>>>>> 8 call L_av_malloc$stub
>>>>>> 2 call L_av_mallocz$stub
>>>>>> 1 call L_ff_mpeg4audio_get_config$stub
>>>>>> 1 call L_memcpy$stub
>>>>>> 1 call L_memmove$stub
>>>>>> 2 call L_memset$stub
>>>>>> 8 call ___inline_memcpy_chk
>>>>>> 2 call ___inline_memmove_chk
>>>>>> 6 call _align_get_bits
>>>>>> 5 call _av_ceil_log2
>>>>>> 4 call _av_clip
>>>>>> 4 call _decode_end
>>>>>> 47 call _get_bits
>>>>>> 90 call _get_bits1
>>>>>> 3 call _get_bits_count
>>>>>> 61 call _get_bits_left
>>>>>> 39 call _get_bits_long
>>>>>> 4 call _get_sbits_long
>>>>>> 60 call _get_unary
>>>>>> 2 call _init_get_bits
>>>>>> 3 call _parse_bs_info
>>>>>> 3 call _read_time
>>>>>> 7 call _skip_bits
>>>>>> 2 call _skip_bits1
>>>>>> 5 call _skip_bits_long
>>>>> Not inlining those get_bits etc will certainly slow things down,
>>>>> that's for sure.
>>>>>
>>>>>> So -finline-limit can inline many functions in the object file which are
>>>>>> not part of alsdec.c, which might be the reason for the performance
>>>>>> difference.
>>>>>>
>>>>>> But using -finline-limit does not yield a speed gain for the unpatched
>>>>>> file! So there might be something else that I don't see.
>>>>>>
>>>>>> The value of 4096 has been chosen arbitrarily. As long as I don't know
>>>>>> exactly why -finline-limit removes the slowdown, or whether another
>>>>>> approach could replace it, there is no need to figure out a better
>>>>>> value...
>>>>> We should do some benchmarks using that flag globally and see what
>>>>> happens. Maybe we'd gain from using it everywhere.
>>>> As Michael said, this would be a big test across different platforms and
>>>> compilers, which I cannot do alone, so several people would have to do
>>>> it - if a benchmark indicates that it might be worth testing.
>>>>
>>>> Also, I lack a good idea of how to test this efficiently, i.e. how to
>>>> measure the execution time of ffmpeg without other factors like hard
>>>> drives playing a predominant role.
>>> I played around a little with the regression tests and audio decoders.
>>> For most of my tests -finline-limit=4096 makes it a little faster, e.g.
>>>
>>> g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
>>> alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
>>> flac: 842020 dezicycles -> 786226 dezicycles ( 7%)
>>> wma: 3663166 dezicycles -> 3197273 dezicycles (14%)
>>>
>>> which is not surprising. Inlining comes at a price: the ffmpeg executable
>>> grew from 5.4 MB to 6.1 MB.
>>> The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
>> What about video codecs? h264, mpeg4, mpeg2, h263?
>
> Can do tomorrow.
h.261: 34067354 dezicycles -> 33048969 dezicycles ( 3%)
h.263: 32138793 dezicycles -> 30895187 dezicycles ( 4%)
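
For reference, the percentages above can be recomputed from the raw counts. This is only a minimal sketch; percent_saved is a hypothetical helper name and is not part of FFmpeg:

```c
#include <stdint.h>

/* Hypothetical helper (not FFmpeg code): percentage of dezicycles saved
 * when going from `before` to `after`. The result is negative when the
 * second measurement is slower than the first. */
static double percent_saved(uint64_t before, uint64_t after)
{
    if (before == 0)
        return 0.0; /* avoid dividing by zero on an empty measurement */
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

With the numbers above, percent_saved(34067354, 33048969) is roughly 3.0 and percent_saved(32138793, 30895187) is roughly 3.9, which matches the rounded 3% and 4%.
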
For h.264 we are using external libraries and there seems to be no
regression test for these? (I set timers in libx264.c and h264.c and got
no measurements.)
Anyway, the video regression tests yield dezicycle measurements over only
around 32 runs, which are not really stable...
I tested h263 with a longer video and ended up at 512 runs with 1% more
dezicycles needed - so slightly worse, in fact.
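
To show why the number of runs matters, here is a rough sketch of averaging a timing over many calls inside the process, so that disk I/O and process startup stay out of the measurement. It assumes a POSIX system with clock_gettime; work_under_test is a made-up stand-in for a decoder function, and none of this is FFmpeg's own timer code:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the code being measured (not FFmpeg code). */
static uint32_t work_under_test(uint32_t seed)
{
    uint32_t acc = seed;
    for (int i = 0; i < 100000; i++)
        acc = acc * 1664525u + 1013904223u;
    return acc;
}

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Mean time per call over `runs` calls. Each result is folded into *sink
 * so the compiler cannot discard the measured work. */
static uint64_t mean_ns_per_run(int runs, uint32_t *sink)
{
    uint64_t total = 0;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        *sink ^= work_under_test((uint32_t)i);
        total += now_ns() - t0;
    }
    return runs > 0 ? total / (uint64_t)runs : 0;
}
```

Calling mean_ns_per_run(32, &sink) and mean_ns_per_run(512, &sink) several times each should give an idea of how much the mean still moves at a given run count; a difference of a few percent between two builds only means something if it is larger than that spread.
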
So I got the impression that the video decoders do not profit from that
compiler option in the way the audio decoders do. Why that is the case is
another question, though.
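
For readers who want the two approaches from earlier in this thread side by side, here is a minimal sketch of forcing a small bit-reading helper inline per function, assuming GCC or Clang. sketch_always_inline is a local macro similar in spirit to av_always_inline, and BitReader, read_bit and read_unary are made-up names, not the get_bits API:

```c
#include <stdint.h>

/* Assumption: GCC or Clang. Local stand-in macro, not FFmpeg's definition. */
#if defined(__GNUC__)
#    define sketch_always_inline __attribute__((always_inline)) inline
#else
#    define sketch_always_inline inline
#endif

/* Hypothetical MSB-first bit reader; pos counts bits from the buffer start. */
typedef struct BitReader {
    const uint8_t *buf;
    unsigned       pos;
} BitReader;

/* Forced inline at every call site, regardless of the inliner's size limit. */
static sketch_always_inline unsigned read_bit(BitReader *br)
{
    unsigned bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u;
    br->pos++;
    return bit;
}

/* Count one bits up to the first zero bit, reading at most `max` ones.
 * The caller must make sure the buffer is long enough. */
static unsigned read_unary(BitReader *br, unsigned max)
{
    unsigned n = 0;
    while (n < max && read_bit(br))
        n++;
    return n;
}
```

The attribute only affects the functions it is written on, whereas -finline-limit=4096 raises the size threshold for every inlining candidate in the translation unit, including helpers pulled in from headers. That would be consistent with the differing call counts listed above.
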
-Thilo