[FFmpeg-devel] [PATCH] force dnxhd encoder to be independent of qsort internals
Reimar Döffinger
Reimar.Doeffinger
Sun Sep 20 14:13:03 CEST 2009
Hello,
On Sun, Sep 20, 2009 at 12:10:24PM +0200, Michael Niedermayer wrote:
> On Sun, Sep 20, 2009 at 11:39:26AM +0200, Reimar D?ffinger wrote:
> > On Sun, Sep 20, 2009 at 11:08:45AM +0200, Reimar D?ffinger wrote:
> > > On Sat, Sep 19, 2009 at 01:16:51PM +0200, Reimar D?ffinger wrote:
> > > > this change makes sure there are no equal elements in the data passed to
> > > > qsort, thus making sure the result is independent of implementation
> > > > internals.
> > >
> > > Very slightly different, swapped the mb comparison so the original order
> > > is kept by default. Seems nicer and might decreases needed memory
> > > bandwidth if a lot of values are identical (very unlikely I think).
> >
> > qsort itself is nearly 5% slower for the Linux implementation:
> > 31596660 dezicycles in encode_fast, 1 runs, 0 skips
> > 30286580 dezicycles in encode_fast, 2 runs, 0 skips
> > 29324712 dezicycles in encode_fast, 4 runs, 0 skips
> > 28957121 dezicycles in encode_fast, 8 runs, 0 skips
> > 28722058 dezicycles in encode_fast, 16 runs, 0 skips
> > 28665462 dezicycles in encode_fast, 32 runs, 0 skips
> > 28597622 dezicycles in encode_fast, 64 runs, 0 skips
> >
> > 32579940 dezicycles in encode_fast, 1 runs, 0 skips
> > 31456890 dezicycles in encode_fast, 2 runs, 0 skips
> > 30567462 dezicycles in encode_fast, 4 runs, 0 skips
> > 30156817 dezicycles in encode_fast, 8 runs, 0 skips
> > 30013356 dezicycles in encode_fast, 16 runs, 0 skips
> > 29935421 dezicycles in encode_fast, 32 runs, 0 skips
> > 29924871 dezicycles in encode_fast, 64 runs, 0 skips
> >
> > But difference for the whole encode_fast function is within the
> > variance, at most 0.2 % slower I'd say as in this case (again,
>
> why dont you just do a radix sort? Should only be a few lines
> of C, faster and portable
Well, I thought I have the time and here is an attempt using malloc in
the sort function (I know, bad...).
Speed is about 20% faster than Linux qsort.
21715740 dezicycles in sort, 1 runs, 0 skips
21793630 dezicycles in sort, 2 runs, 0 skips
21638567 dezicycles in sort, 4 runs, 0 skips
21513510 dezicycles in sort, 8 runs, 0 skips
21445663 dezicycles in sort, 16 runs, 0 skips
21495331 dezicycles in sort, 32 runs, 0 skips
21510185 dezicycles in sort, 64 runs, 0 skips
And obviously I am somehow messing up the whole test since the overall
speed seems to have doubled according to START_TIME/STOP_TIMER which is
just impossible.
Sorry, but I'll better leave benchmarking to someone who actually
managed to at least once get a consistent result.
In addition I suspect that the "value" range is actually quite limited,
so it might be possible to reduce the number of passes to 3 or even 2
for the radix sort (of course many more general optimizations are
possible, like checking if all values fall in one bucket and then do a
simple memcpy, check if the array is already sorted during radix_count
etc.).
On the plus side, regression tests seem completely unchanged so far, but
it seems the 1080i case does not use the sort anyway...
-------------- next part --------------
Index: libavcodec/dnxhdenc.c
===================================================================
--- libavcodec/dnxhdenc.c (revision 19926)
+++ libavcodec/dnxhdenc.c (working copy)
@@ -651,11 +651,51 @@
return 0;
}
-static int dnxhd_rc_cmp(const void *a, const void *b)
+#define NBUCKETS 256
+
+static inline int get_bucket(int value, int shift)
{
- return ((const RCCMPEntry *)b)->value - ((const RCCMPEntry *)a)->value;
+ value >>= shift;
+ value &= NBUCKETS - 1;
+ return NBUCKETS - 1 - value;
}
+static void radix_count(const RCCMPEntry *data, int size, int *buckets, int shift)
+{
+ int i;
+ int offset;
+ memset(buckets, 0, sizeof(*buckets) * NBUCKETS);
+ for (i = 0; i < size; i++)
+ buckets[get_bucket(data[i].value, shift)]++;
+ offset = size;
+ for (i = NBUCKETS - 1; i >= 0; i--)
+ buckets[i] = offset -= buckets[i];
+ assert(!buckets[0]);
+}
+
+static void radix_sort_pass(RCCMPEntry *dst, const RCCMPEntry *data, int size, int pass)
+{
+ int i;
+ int shift = pass * av_log2(NBUCKETS);
+ int buckets[NBUCKETS];
+ radix_count(data, size, buckets, shift);
+ for (i = 0; i < size; i++) {
+ int v = get_bucket(data[i].value, shift);
+ int pos = buckets[v]++;
+ dst[pos] = data[i];
+ }
+}
+
+static void radix_sort(RCCMPEntry *data, int size)
+{
+ RCCMPEntry *tmp = av_malloc(sizeof(*tmp) * size);
+ radix_sort_pass(tmp, data, size, 0);
+ radix_sort_pass(data, tmp, size, 1);
+ radix_sort_pass(tmp, data, size, 2);
+ radix_sort_pass(data, tmp, size, 3);
+ av_free(tmp);
+}
+
static int dnxhd_encode_fast(AVCodecContext *avctx, DNXHDEncContext *ctx)
{
int max_bits = 0;
@@ -682,7 +722,7 @@
if (!ret) {
if (RC_VARIANCE)
avctx->execute(avctx, dnxhd_mb_var_thread, (void**)&ctx->thread[0], NULL, avctx->thread_count, sizeof(void*));
- qsort(ctx->mb_cmp, ctx->m.mb_num, sizeof(RCEntry), dnxhd_rc_cmp);
+ radix_sort(ctx->mb_cmp, ctx->m.mb_num);
for (x = 0; x < ctx->m.mb_num && max_bits > ctx->frame_bits; x++) {
int mb = ctx->mb_cmp[x].mb;
max_bits -= ctx->mb_rc[ctx->qscale][mb].bits - ctx->mb_rc[ctx->qscale+1][mb].bits;
More information about the ffmpeg-devel
mailing list