[FFmpeg-devel] [PATCH] force dnxhd encoder to be independent of qsort internals
Baptiste Coudurier
baptiste.coudurier
Sun Sep 20 22:23:49 CEST 2009
Hi Reimar,
On 09/20/2009 05:13 AM, Reimar D?ffinger wrote:
> Hello,
> On Sun, Sep 20, 2009 at 12:10:24PM +0200, Michael Niedermayer wrote:
>> On Sun, Sep 20, 2009 at 11:39:26AM +0200, Reimar D?ffinger wrote:
>>> On Sun, Sep 20, 2009 at 11:08:45AM +0200, Reimar D?ffinger wrote:
>>>> On Sat, Sep 19, 2009 at 01:16:51PM +0200, Reimar D?ffinger wrote:
>>>>> this change makes sure there are no equal elements in the data passed to
>>>>> qsort, thus making sure the result is independent of implementation
>>>>> internals.
>>>>
>>>> Very slightly different, swapped the mb comparison so the original order
>>>> is kept by default. Seems nicer and might decreases needed memory
>>>> bandwidth if a lot of values are identical (very unlikely I think).
>>>
>>> qsort itself is nearly 5% slower for the Linux implementation:
>>> 31596660 dezicycles in encode_fast, 1 runs, 0 skips
>>> 30286580 dezicycles in encode_fast, 2 runs, 0 skips
>>> 29324712 dezicycles in encode_fast, 4 runs, 0 skips
>>> 28957121 dezicycles in encode_fast, 8 runs, 0 skips
>>> 28722058 dezicycles in encode_fast, 16 runs, 0 skips
>>> 28665462 dezicycles in encode_fast, 32 runs, 0 skips
>>> 28597622 dezicycles in encode_fast, 64 runs, 0 skips
>>>
>>> 32579940 dezicycles in encode_fast, 1 runs, 0 skips
>>> 31456890 dezicycles in encode_fast, 2 runs, 0 skips
>>> 30567462 dezicycles in encode_fast, 4 runs, 0 skips
>>> 30156817 dezicycles in encode_fast, 8 runs, 0 skips
>>> 30013356 dezicycles in encode_fast, 16 runs, 0 skips
>>> 29935421 dezicycles in encode_fast, 32 runs, 0 skips
>>> 29924871 dezicycles in encode_fast, 64 runs, 0 skips
>>>
>>> But difference for the whole encode_fast function is within the
>>> variance, at most 0.2 % slower I'd say as in this case (again,
>>
>> why dont you just do a radix sort? Should only be a few lines
>> of C, faster and portable
>
> Well, I thought I have the time and here is an attempt using malloc in
> the sort function (I know, bad...).
> Speed is about 20% faster than Linux qsort.
> 21715740 dezicycles in sort, 1 runs, 0 skips
> 21793630 dezicycles in sort, 2 runs, 0 skips
> 21638567 dezicycles in sort, 4 runs, 0 skips
> 21513510 dezicycles in sort, 8 runs, 0 skips
> 21445663 dezicycles in sort, 16 runs, 0 skips
> 21495331 dezicycles in sort, 32 runs, 0 skips
> 21510185 dezicycles in sort, 64 runs, 0 skips
>
> And obviously I am somehow messing up the whole test since the overall
> speed seems to have doubled according to START_TIME/STOP_TIMER which is
> just impossible.
> Sorry, but I'll better leave benchmarking to someone who actually
> managed to at least once get a consistent result.
> In addition I suspect that the "value" range is actually quite limited,
> so it might be possible to reduce the number of passes to 3 or even 2
> for the radix sort (of course many more general optimizations are
> possible, like checking if all values fall in one bucket and then do a
> simple memcpy, check if the array is already sorted during radix_count
> etc.).
> On the plus side, regression tests seem completely unchanged so far, but
> it seems the 1080i case does not use the sort anyway...
Yes, if qscale 1 fit in the frame size, values won't be sorted.
You can check the if (!ret) branch.
And thanks a lot for the radix sort :)
[...]
--
Baptiste COUDURIER
Key fingerprint 8D77134D20CC9220201FC5DB0AC9325C5C1ABAAA
FFmpeg maintainer http://www.ffmpeg.org
More information about the ffmpeg-devel
mailing list