[FFmpeg-devel] Parallelized h264 proof-of-concept

Fri May 18 23:00:57 CEST 2007

(This is a resend of a previously sent mail which was filtered
because the attachment was too big. If that mail pops through as
well for some reason, I apologize.)

Hi Everyone.

I decided to take a shot at parallelizing the h264 decoder (at slice
level).

The attached diff is far from complete yet, but most stuff works.

I am sending it merely to discuss how to proceed.

o Is this (slice-based parallelism) acceptable? Apart from the
   obvious issue (content encoded in only one slice) I don't see any
   real show stopper (other than the yet-to-solve issues described
   below). I know there have been previous discussions of
   implementing it as a two-threaded solution with one thread
   doing the entropy decoding and the other the mb-decode.
   Still, I believe the changes I've made are beneficial for
   such a solution as well.
   The content I've tried so far has all been encoded as multiple
   slices: Apple movie trailers (CAVLC) and HDTV broadcasts here
   in Sweden (CABAC).
   Perhaps someone knows more about how common or uncommon
   this is. I'm going to download all the samples from mphq
   and examine this further.

The issues left to fix are:

o The error resilience data structures are not protected (but
   still shared). This usually manifests itself as:

[h264 @ 0xb7c64208]concealing 0 DC, 0 AC, 0 MV errors

   because the s->error_count decrement races between
   CPUs. This would be pretty easy to fix if the avcodec thread
   implementations exposed a locking primitive.

o Deblocking doesn't work correctly. When deblocking is enabled
   the md5 sum output from my test program changes on every run.
   I am quite sure this is caused by the fact that deblocking is done
   over the entire frame, not locally per slice, and thus, if
   slices complete out of order, there will be errors.
   I don't see any visual artifacts, but something is fishy for
   sure. I'll need to nail down the exact reason before I can be
   more specific about problems / solutions here.

o The SVQ3 decoding has not yet been adapted. (One needs to configure
   with --disable-decoder=svq3 to compile at all for now.)

o If the decoder receives NAL units directly (CODEC_FLAG2_CHUNKS)
   there won't be any speedup at all right now. This is not very
   high on my priority list.

o Probably more stuff which I haven't thought of yet.
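
To illustrate the first item in the list above (the s->error_count
race), here is a minimal sketch of what a locked decrement could look
like. It assumes POSIX threads; ErrorState, error_state_init(),
report_decoded(), slice_worker() and run_two_slices() are hypothetical
names of mine, not lavc API, and the pthread mutex only stands in for
whatever locking primitive the avcodec thread implementations might
expose.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the shared error resilience state. */
typedef struct ErrorState {
    pthread_mutex_t lock;
    int error_count;            /* macroblocks not yet decoded */
} ErrorState;

static void error_state_init(ErrorState *es, int mb_count)
{
    pthread_mutex_init(&es->lock, NULL);
    es->error_count = mb_count;
}

/* Called by a slice worker after it has decoded n macroblocks.
 * Without the lock, two CPUs can read the same old value and one
 * of the decrements is lost. */
static void report_decoded(ErrorState *es, int n)
{
    pthread_mutex_lock(&es->lock);
    es->error_count -= n;
    pthread_mutex_unlock(&es->lock);
}

typedef struct WorkerArg {
    ErrorState *es;
    int mbs;                    /* macroblocks in this slice */
} WorkerArg;

static void *slice_worker(void *p)
{
    WorkerArg *a = p;
    for (int i = 0; i < a->mbs; i++)
        report_decoded(a->es, 1);
    return NULL;
}

/* Runs two workers that together cover every macroblock and
 * returns the count that is left over afterwards. */
static int run_two_slices(int mbs_per_slice)
{
    ErrorState es;
    pthread_t t[2];
    WorkerArg a = { &es, mbs_per_slice };

    error_state_init(&es, 2 * mbs_per_slice);
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, slice_worker, &a);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    pthread_mutex_destroy(&es.lock);
    return es.error_count;
}
```

With the lock in place the leftover count is deterministic; removing
the lock/unlock pair reintroduces the lost-update race.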

Okay, a few words about the changes.

A new structure H264Thread (name suggestions very welcome) is
passed around to almost all functions. This structure is
local to every slice (perhaps H264Slice would be a better
name) and contains all members from H264Context that
change during slice decode. I also moved a few things
(most notably mb_[xy]) from MpegEncContext here.

H264Context has been const'ified, mostly to aid me in finding
all members that needed to be moved.
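
As a rough illustration of the split (not the actual code from the
diff), the sketch below uses two hypothetical structures, FrameContext
and SliceContext, in place of H264Context and H264Thread. It assumes
plain raster-scan macroblock order. Because the shared context is
passed as a pointer to const, any write to it from slice code fails
to compile, which is how the const'ification helps locate members
that have to move.

```c
#include <assert.h>

/* Shared between all slices; read-only while slices are decoded. */
typedef struct FrameContext {
    int mb_width;
    int mb_height;
} FrameContext;

/* Private to one slice; everything that changes during decode. */
typedef struct SliceContext {
    int mb_x;
    int mb_y;
} SliceContext;

/* Position the slice at its first macroblock (raster order). */
static void slice_start(const FrameContext *h, SliceContext *t,
                        int first_mb)
{
    assert(first_mb >= 0 && first_mb < h->mb_width * h->mb_height);
    t->mb_x = first_mb % h->mb_width;
    t->mb_y = first_mb / h->mb_width;
    /* h->mb_width = 0;  <- would not compile: h points to const */
}

/* Advance to the next macroblock; returns 0 past the last row. */
static int slice_next_mb(const FrameContext *h, SliceContext *t)
{
    if (++t->mb_x == h->mb_width) {
        t->mb_x = 0;
        t->mb_y++;
    }
    return t->mb_y < h->mb_height;
}
```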

decode_nal_units() decodes the slice-header, but instead
of calling decode_slice() directly it "enqueues" them by
incrementing a counter. The work is then executed
by decode_slices(), by using the lavc thread workers.
Pretty simple.
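
The enqueue-then-execute flow can be sketched roughly as follows.
All names here (SliceJob, SliceQueue, enqueue_slice(), flush_slices(),
execute_serial(), MAX_SLICES) are made up for the example, and the
runner is a serial stand-in; a threaded runner with the same shape
would hand the jobs to worker threads instead. It is not the lavc
execute callback itself.

```c
#include <assert.h>

#define MAX_SLICES 8            /* hypothetical queue size */

typedef struct SliceJob {
    int first_mb;
    int mb_count;
    int decoded;                /* filled in by the job */
} SliceJob;

typedef struct SliceQueue {
    SliceJob jobs[MAX_SLICES];
    int count;                  /* the "enqueue" counter */
} SliceQueue;

typedef void (*ExecuteFn)(int (*func)(void *arg), void **args,
                          int *ret, int count);

/* Serial runner: calls the jobs one after another in the calling
 * thread and stores each return value. */
static void execute_serial(int (*func)(void *arg), void **args,
                           int *ret, int count)
{
    for (int i = 0; i < count; i++)
        ret[i] = func(args[i]);
}

/* Placeholder for the real per-slice decode. */
static int decode_slice_job(void *arg)
{
    SliceJob *j = arg;
    j->decoded = j->mb_count;
    return 0;
}

/* Record a parsed slice header; returns 1 when the queue is full
 * and must be flushed before the next slice is enqueued. */
static int enqueue_slice(SliceQueue *q, int first_mb, int mb_count)
{
    assert(q->count < MAX_SLICES);
    q->jobs[q->count] = (SliceJob){ first_mb, mb_count, 0 };
    return ++q->count == MAX_SLICES;
}

/* Run all queued slices through the runner and reset the counter.
 * Returns 0 on success, -1 if any job reported an error. */
static int flush_slices(SliceQueue *q, ExecuteFn execute)
{
    void *args[MAX_SLICES];
    int ret[MAX_SLICES];
    int failed = 0;

    for (int i = 0; i < q->count; i++)
        args[i] = &q->jobs[i];
    execute(decode_slice_job, args, ret, q->count);
    for (int i = 0; i < q->count; i++)
        failed |= ret[i] != 0;
    q->count = 0;
    return failed ? -1 : 0;
}
```

Each job only touches its own SliceJob, so the runner is free to
execute them in any order or in parallel.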

The diff is still in a nasty shape (yes I will deliver
separate incremental cleaned up diffs later on if this
gets a go-ahead). h264.h got especially messed up.
(So, Michael, there is no need to comment the diff inline
quite yet, unless you really want to :)

Some performance figures (on a dual-core Core2 6400 @ 2.13GHz).
The figures show the real time elapsed during avcodec_decode_video().

Original ffmpeg, fresh checkout.

zodiac-tlr1_h1080p.mov: d4edd5b58d37a83bf2fb2254110e3b34
      min: 3169µs  max: 41877µs  avg: 13026µs (0 threads)

Parallelized decoder, no threading:

zodiac-tlr1_h1080p.mov: d4edd5b58d37a83bf2fb2254110e3b34
      min: 3239µs  max: 42338µs  avg: 13438µs (0 threads)

Two threads:

zodiac-tlr1_h1080p.mov: 3cbb4b818ef388503580db5e3b20c9ff
      min: 1873µs  max: 23610µs  avg: 7511µs (2 threads)

As can be seen, there is (not unexpectedly) a significant
speedup when running with two threads. The drawback is
that the single-threaded version gets a bit slower.
I haven't looked very much into this yet, but I would
guess it is a combination of the following things:

- thread_encode() still enqueues up to MAX_THREAD slices, and uses
   avctx->execute(). This results in a slight cache miss penalty.
   Easy to fix (if it actually is a problem).

- Passing around two pointers instead of one.
   One can certainly use some macro tricks to obtain the old
   function prototypes at compile time, at the expense of
   nastier source code.

- Perhaps something else...
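
One possible shape of the macro trick mentioned in the second item
above, purely as a sketch: SLICE_THREADS, SLICE_PARAMS, SLICE_ARGS,
Decoder, SliceState and advance_mb() are all hypothetical names. In
the non-threaded build the per-slice state lives inside the decoder
and the identifier sl is a macro that reaches it through d, so every
function takes a single pointer again. Hiding a parameter behind a
macro is exactly the kind of nastiness meant above.

```c
#include <assert.h>

typedef struct SliceState {
    int mb_x;
} SliceState;

typedef struct Decoder {
    int mb_width;
    SliceState single;          /* used by the one-pointer build */
} Decoder;

#ifdef SLICE_THREADS            /* hypothetical build switch */
/* Threaded build: shared context plus per-slice state. */
#define SLICE_PARAMS const Decoder *d, SliceState *sl
#define SLICE_ARGS(ctx, st) (ctx), (st)
#else
/* Single-threaded build: old one-pointer prototype; "sl" expands
 * to the state embedded in the decoder. */
#define SLICE_PARAMS Decoder *d
#define SLICE_ARGS(ctx, st) (ctx)
#define sl (&d->single)
#endif

/* The function body is written once and compiles in both builds. */
static int advance_mb(SLICE_PARAMS)
{
    assert(sl->mb_x < d->mb_width);
    return ++sl->mb_x;
}
```

A caller writes advance_mb(SLICE_ARGS(&dec, &dec.single)) in either
build; whether this actually recovers the lost speed would have to
be measured.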

Anyway,
If this is something that ffmpeg is willing to integrate
I'd like to get a few pointers, hints and answers on the
topics above before I continue with the stuff that's left.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-multi-threaded-poc.diff.gz
Type: application/x-gzip
Size: 43878 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070518/cdc3a454/attachment.bin>