[FFmpeg-devel] [PATCH] force dnxhd encoder to be independent of qsort internals

Sun Sep 20 14:13:03 CEST 2009

Hello,
On Sun, Sep 20, 2009 at 12:10:24PM +0200, Michael Niedermayer wrote:
> On Sun, Sep 20, 2009 at 11:39:26AM +0200, Reimar D?ffinger wrote:
> > On Sun, Sep 20, 2009 at 11:08:45AM +0200, Reimar D?ffinger wrote:
> > > On Sat, Sep 19, 2009 at 01:16:51PM +0200, Reimar D?ffinger wrote:
> > > > this change makes sure there are no equal elements in the data passed to
> > > > qsort, thus making sure the result is independent of implementation
> > > > internals.
> > > 
> > > Very slightly different, swapped the mb comparison so the original order
> > > is kept by default. Seems nicer and might decreases needed memory
> > > bandwidth if a lot of values are identical (very unlikely I think).
> > 
> > qsort itself is nearly 5% slower for the Linux implementation:
> > 31596660 dezicycles in encode_fast, 1 runs, 0 skips
> > 30286580 dezicycles in encode_fast, 2 runs, 0 skips
> > 29324712 dezicycles in encode_fast, 4 runs, 0 skips
> > 28957121 dezicycles in encode_fast, 8 runs, 0 skips
> > 28722058 dezicycles in encode_fast, 16 runs, 0 skips
> > 28665462 dezicycles in encode_fast, 32 runs, 0 skips
> > 28597622 dezicycles in encode_fast, 64 runs, 0 skips
> > 
> > 32579940 dezicycles in encode_fast, 1 runs, 0 skips
> > 31456890 dezicycles in encode_fast, 2 runs, 0 skips
> > 30567462 dezicycles in encode_fast, 4 runs, 0 skips
> > 30156817 dezicycles in encode_fast, 8 runs, 0 skips
> > 30013356 dezicycles in encode_fast, 16 runs, 0 skips
> > 29935421 dezicycles in encode_fast, 32 runs, 0 skips
> > 29924871 dezicycles in encode_fast, 64 runs, 0 skips
> > 
> > But difference for the whole encode_fast function is within the
> > variance, at most 0.2 % slower I'd say as in this case (again,
> 
> why dont you just do a radix sort? Should only be a few lines
> of C, faster and portable

Well, I thought I have the time and here is an attempt using malloc in
the sort function (I know, bad...).
Speed is about 20% faster than Linux qsort.
21715740 dezicycles in sort, 1 runs, 0 skips
21793630 dezicycles in sort, 2 runs, 0 skips
21638567 dezicycles in sort, 4 runs, 0 skips
21513510 dezicycles in sort, 8 runs, 0 skips
21445663 dezicycles in sort, 16 runs, 0 skips
21495331 dezicycles in sort, 32 runs, 0 skips
21510185 dezicycles in sort, 64 runs, 0 skips

And obviously I am somehow messing up the whole test since the overall
speed seems to have doubled according to START_TIME/STOP_TIMER which is
just impossible.
Sorry, but I'll better leave benchmarking to someone who actually
managed to at least once get a consistent result.
In addition I suspect that the "value" range is actually quite limited,
so it might be possible to reduce the number of passes to 3 or even 2
for the radix sort (of course many more general optimizations are
possible, like checking if all values fall in one bucket and then do a
simple memcpy, check if the array is already sorted during radix_count
etc.).
On the plus side, regression tests seem completely unchanged so far, but
it seems the 1080i case does not use the sort anyway...
-------------- next part --------------
Index: libavcodec/dnxhdenc.c
===================================================================

--- libavcodec/dnxhdenc.c	(revision 19926)
+++ libavcodec/dnxhdenc.c	(working copy)
@@ -651,11 +651,51 @@
     return 0;
 }
 
-static int dnxhd_rc_cmp(const void *a, const void *b)
+#define NBUCKETS 256
+
+static inline int get_bucket(int value, int shift)
 {
-    return ((const RCCMPEntry *)b)->value - ((const RCCMPEntry *)a)->value;
+    value >>= shift;
+    value &= NBUCKETS - 1;
+    return NBUCKETS - 1 - value;
 }
 
+static void radix_count(const RCCMPEntry *data, int size, int *buckets, int shift)
+{
+    int i;
+    int offset;
+    memset(buckets, 0, sizeof(*buckets) * NBUCKETS);
+    for (i = 0; i < size; i++)
+        buckets[get_bucket(data[i].value, shift)]++;
+    offset = size;
+    for (i = NBUCKETS - 1; i >= 0; i--)
+        buckets[i] = offset -= buckets[i];
+    assert(!buckets[0]);
+}
+
+static void radix_sort_pass(RCCMPEntry *dst, const RCCMPEntry *data, int size, int pass)
+{
+    int i;
+    int shift = pass * av_log2(NBUCKETS);
+    int buckets[NBUCKETS];
+    radix_count(data, size, buckets, shift);
+    for (i = 0; i < size; i++) {
+        int v = get_bucket(data[i].value, shift);
+        int pos = buckets[v]++;
+        dst[pos] = data[i];
+    }
+}
+
+static void radix_sort(RCCMPEntry *data, int size)
+{
+    RCCMPEntry *tmp = av_malloc(sizeof(*tmp) * size);
+    radix_sort_pass(tmp, data, size, 0);
+    radix_sort_pass(data, tmp, size, 1);
+    radix_sort_pass(tmp, data, size, 2);
+    radix_sort_pass(data, tmp, size, 3);
+    av_free(tmp);
+}
+
 static int dnxhd_encode_fast(AVCodecContext *avctx, DNXHDEncContext *ctx)
 {
     int max_bits = 0;
@@ -682,7 +722,7 @@
     if (!ret) {
         if (RC_VARIANCE)
             avctx->execute(avctx, dnxhd_mb_var_thread, (void**)&ctx->thread[0], NULL, avctx->thread_count, sizeof(void*));
-        qsort(ctx->mb_cmp, ctx->m.mb_num, sizeof(RCEntry), dnxhd_rc_cmp);
+        radix_sort(ctx->mb_cmp, ctx->m.mb_num);
         for (x = 0; x < ctx->m.mb_num && max_bits > ctx->frame_bits; x++) {
             int mb = ctx->mb_cmp[x].mb;
             max_bits -= ctx->mb_rc[ctx->qscale][mb].bits - ctx->mb_rc[ctx->qscale+1][mb].bits;