[FFmpeg-devel] Parallelized h264 proof-of-concept
Andreas Öman
andreas
Fri May 18 23:00:57 CEST 2007
(This is a resend of a previously sent mail which was filtered
due to a too-large attachment. If that mail comes through as well for
some reason, I apologize.)
Hi Everyone.
I decided to take a shot at parallelizing the h264 decoder (at slice
level).
The attached diff is far from complete, but most stuff works.
I am sending it merely to start a discussion on how to proceed.
o Is this (slice-based parallelism) acceptable? Apart from the
obvious issue (content encoded in only one slice) I don't see any
real show stopper (other than the yet-to-be-solved issues described
below). I know there have been previous discussions of
implementing it as a two-threaded solution with one thread
doing the entropy decoding and the other the mb-decode.
Still, I believe the changes I've made are beneficial for
such a solution as well.
The content I've tried so far has all been multi-slice
stuff: Apple movie trailers (CAVLC) and HDTV broadcasts here
in Sweden (CABAC) are all encoded as multiple slices.
Perhaps someone knows more about how common or uncommon
this is. I'm going to download all the samples from mphq
and examine further.
The issues left to fix are:
o The error resilience data structures are not protected (but
still shared). This usually manifests itself as:
[h264 @ 0xb7c64208]concealing 0 DC, 0 AC, 0 MV errors
because the s->error_count decrement races between
CPUs. This is pretty easy to fix if the avcodec thread
implementations would expose a locking primitive.
o Deblocking doesn't work correctly. When deblocking is enabled
the md5 sum output from my test program changes on every run.
I'm quite sure this is caused by the fact that deblocking is done
over the entire frame, not locally per slice, and thus, if
slices complete out of order, there will be errors.
I don't see any visual artifacts, but something is fishy for
sure. I'll need to nail down the exact reason before I can be
more specific about problems / solutions here.
o The SVQ3 decoder has not yet been adapted. (One needs to configure
with --disable-decoder=svq3 to compile at all now.)
o If the decoder receives NAL units directly (CODEC_FLAG2_CHUNKS)
there won't be any speedup at all right now. This is not very
high on my priority list.
o Probably more stuff which I haven't thought of yet.
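To make the first item concrete, here is a minimal sketch of a shared counter that several slice workers decrement under a lock. It assumes POSIX threads; ErrorState, WorkerArg, mark_mb_decoded(), slice_worker() and run_two_slices() are hypothetical names for illustration and are not part of lavc or of the attached diff.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the shared error-resilience state. */
typedef struct ErrorState {
    pthread_mutex_t lock;  /* the kind of locking primitive asked for above */
    int error_count;       /* work items not yet reported as decoded */
} ErrorState;

typedef struct WorkerArg {
    ErrorState *es;
    int mbs;               /* macroblocks this slice worker will report */
} WorkerArg;

/* Report one decoded macroblock; the lock serializes the decrement. */
static void mark_mb_decoded(ErrorState *es)
{
    pthread_mutex_lock(&es->lock);
    es->error_count--;
    pthread_mutex_unlock(&es->lock);
}

static void *slice_worker(void *opaque)
{
    WorkerArg *arg = opaque;
    for (int i = 0; i < arg->mbs; i++)
        mark_mb_decoded(arg->es);
    return NULL;
}

/* Split mb_count macroblocks over two workers and return the final
 * error_count, which is 0 when no decrement was lost. */
int run_two_slices(int mb_count)
{
    ErrorState es;
    WorkerArg args[2];
    pthread_t tid[2];
    int started[2];

    assert(mb_count >= 0);
    pthread_mutex_init(&es.lock, NULL);
    es.error_count = mb_count;

    args[0].es = &es; args[0].mbs = mb_count / 2;
    args[1].es = &es; args[1].mbs = mb_count - mb_count / 2;

    for (int i = 0; i < 2; i++) {
        started[i] = pthread_create(&tid[i], NULL, slice_worker, &args[i]) == 0;
        if (!started[i])
            slice_worker(&args[i]);  /* fall back to running inline */
    }
    for (int i = 0; i < 2; i++)
        if (started[i])
            pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&es.lock);
    return es.error_count;
}
```

Without the lock the two workers can each read the same value and write back the same result, so the counter ends up above zero even though nothing failed. That would be consistent with the "concealing 0 DC, 0 AC, 0 MV errors" message.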
Okay, a few words about the changes.
A new structure H264Thread (name suggestions very welcome) is
passed around to almost all functions. This structure is
local to every slice (perhaps H264Slice would be a better
name) and contains all members from H264Context that
change during slice decode. I also moved a few things
(most notably mb_[xy]) from MpegEncContext here.
H264Context has been const'ified, mostly to aid me in finding
all members that needed to be moved.
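The idea behind the split can be sketched with two heavily reduced structs. SharedCtx and SliceCtx are hypothetical stand-ins for H264Context and H264Thread, not their real layouts, and next_mb() and decode_slice_sketch() are illustration-only helpers.

```c
#include <assert.h>

/* Hypothetical stand-in for H264Context: read-only while slices decode. */
typedef struct SharedCtx {
    int mb_width;   /* picture width in macroblocks */
    int mb_height;  /* picture height in macroblocks */
} SharedCtx;

/* Hypothetical stand-in for H264Thread: everything a slice mutates. */
typedef struct SliceCtx {
    int mb_x, mb_y;   /* current macroblock position */
    int mbs_decoded;  /* macroblocks handled by this slice so far */
} SliceCtx;

/* Step to the next macroblock in raster order; returns 0 once the
 * position has moved past the last macroblock of the picture. */
static int next_mb(const SharedCtx *h, SliceCtx *t)
{
    if (++t->mb_x == h->mb_width) {
        t->mb_x = 0;
        if (++t->mb_y == h->mb_height)
            return 0;
    }
    return 1;
}

/* Walk up to count macroblocks starting at first_mb. Because h is const,
 * the compiler rejects any write to shared state from slice code. */
int decode_slice_sketch(const SharedCtx *h, SliceCtx *t, int first_mb, int count)
{
    assert(first_mb >= 0 && first_mb < h->mb_width * h->mb_height);
    t->mb_x = first_mb % h->mb_width;
    t->mb_y = first_mb / h->mb_width;
    t->mbs_decoded = 0;
    while (t->mbs_decoded < count) {
        /* the real macroblock decode would go here */
        t->mbs_decoded++;
        if (!next_mb(h, t))
            break;
    }
    return t->mbs_decoded;
}
```

Two slices can then run with the same SharedCtx pointer and separate SliceCtx instances. Any member that still needs writing through the const pointer shows up as a compile error, which is the aid in finding such members mentioned above.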
decode_nal_units() decodes the slice header, but instead
of calling decode_slice() directly it "enqueues" the slice by
incrementing a counter. The queued work is then executed
by decode_slices(), using the lavc thread workers.
Pretty simple.
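A rough sketch of the enqueue-then-execute pattern described above, assuming POSIX threads. All names here (SliceJob, SliceQueue, enqueue_slice(), run_queued_slices(), MAX_QUEUED) are hypothetical. The sketch starts one thread per queued slice for brevity, which is an assumption of this example and not how the lavc thread workers are organized.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_QUEUED 8  /* hypothetical queue depth */

typedef struct SliceJob {
    int first_mb;   /* first macroblock of the slice, from its header */
    int mb_count;   /* number of macroblocks in the slice */
    int result;     /* macroblocks decoded, filled in by the worker */
} SliceJob;

typedef struct SliceQueue {
    SliceJob jobs[MAX_QUEUED];
    int count;      /* the counter that "enqueues" a slice */
} SliceQueue;

/* Called once the slice header is parsed: record the slice, decode later. */
int enqueue_slice(SliceQueue *q, int first_mb, int mb_count)
{
    if (q->count == MAX_QUEUED)
        return -1;  /* queue full, caller must run the queue first */
    q->jobs[q->count].first_mb = first_mb;
    q->jobs[q->count].mb_count = mb_count;
    q->jobs[q->count].result   = 0;
    q->count++;
    return 0;
}

static void *slice_job_worker(void *opaque)
{
    SliceJob *job = opaque;
    job->result = job->mb_count;  /* placeholder for the real slice decode */
    return NULL;
}

/* Run every queued slice on its own thread, wait for all of them, reset
 * the queue and return the total number of macroblocks decoded. */
int run_queued_slices(SliceQueue *q)
{
    pthread_t tid[MAX_QUEUED];
    int started[MAX_QUEUED];
    int total = 0;

    assert(q->count >= 0 && q->count <= MAX_QUEUED);
    for (int i = 0; i < q->count; i++) {
        started[i] = pthread_create(&tid[i], NULL, slice_job_worker, &q->jobs[i]) == 0;
        if (!started[i])
            slice_job_worker(&q->jobs[i]);  /* fall back to running inline */
    }
    for (int i = 0; i < q->count; i++) {
        if (started[i])
            pthread_join(tid[i], NULL);
        total += q->jobs[i].result;
    }
    q->count = 0;
    return total;
}
```

Each job owns its result field, so the workers never write to the same memory and the queue itself needs no lock while they run.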
The diff is still in nasty shape (yes, I will deliver
separate, incremental, cleaned-up diffs later on if this
gets a go-ahead). h264.h got especially messed up.
(So, Michael, there is no need to comment on the diff inline
quite yet, unless you really want to :)
Some performance figures (on a Dual Core2 6400 @ 2.13GHz).
The figures show real time elapsed during avcodec_decode_video().
Original ffmpeg, fresh checkout:
zodiac-tlr1_h1080p.mov: d4edd5b58d37a83bf2fb2254110e3b34
min: 3169µs max: 41877µs avg: 13026µs (0 threads)
Parallelized decoder, no threading:
zodiac-tlr1_h1080p.mov: d4edd5b58d37a83bf2fb2254110e3b34
min: 3239µs max: 42338µs avg: 13438µs (0 threads)
Two threads:
zodiac-tlr1_h1080p.mov: 3cbb4b818ef388503580db5e3b20c9ff
min: 1873µs max: 23610µs avg: 7511µs (2 threads)
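For reference, the ratios implied by the average figures can be worked out with a few lines of C. This assumes the garbled time unit in the figures is microseconds. The constants are copied from the three runs above, and speedup() and overhead_percent() are hypothetical helper names.

```c
#include <assert.h>

/* Average time per avcodec_decode_video() call, copied from the runs above
 * (unit assumed to be microseconds). */
static const double AVG_ORIGINAL_US   = 13026.0;  /* original, 0 threads */
static const double AVG_PATCHED_1T_US = 13438.0;  /* parallelized, 0 threads */
static const double AVG_PATCHED_2T_US =  7511.0;  /* parallelized, 2 threads */

/* How many times faster measured_us is than baseline_us. */
double speedup(double baseline_us, double measured_us)
{
    assert(measured_us > 0.0);
    return baseline_us / measured_us;
}

/* Extra time of measured_us relative to baseline_us, in percent. */
double overhead_percent(double baseline_us, double measured_us)
{
    assert(baseline_us > 0.0);
    return (measured_us - baseline_us) / baseline_us * 100.0;
}
```

With these numbers, speedup(AVG_ORIGINAL_US, AVG_PATCHED_2T_US) comes out at about 1.73, and overhead_percent(AVG_ORIGINAL_US, AVG_PATCHED_1T_US) at about 3.2.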
As can be seen, there is (not unexpectedly) a significant
speedup when running with two threads (roughly 1.7x on average).
The drawback is that the single-threaded version gets a bit
slower (about 3% on average).
I haven't looked into this very much yet, but I would
guess it is a combination of the following things:
- thread_encode() still enqueues up to MAX_THREAD slices, and uses
avctx->execute(). This results in a slight cache miss penalty.
Easy to fix (if it actually is a problem).
- Passing around two pointers instead of one.
One can certainly use some macro tricks to obtain the old
function prototypes at compile time, at the expense of
nastier source code.
- Perhaps something else...
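One possible reading of the macro trick mentioned in the second item, as a sketch only: a compile-time switch decides whether functions take two pointers or a single one. SPLIT_CONTEXT, CTX_PARAMS, CTX_ARGS, SL(), Shared, Slice, advance_mb() and skip_mbs() are all hypothetical names and do not appear in the diff.

```c
#include <assert.h>

#ifndef SPLIT_CONTEXT
#define SPLIT_CONTEXT 1  /* set to 0 for the single-pointer prototypes */
#endif

#if SPLIT_CONTEXT
/* Threaded build: per-slice state lives in its own struct. */
typedef struct Shared { int mb_width; } Shared;
typedef struct Slice  { int mb_x, mb_y; } Slice;
#define CTX_PARAMS  const Shared *h, Slice *t
#define CTX_ARGS    h, t
#define SL(field)   (t->field)
#else
/* Single-threaded build: per-slice fields stay in the one context. */
typedef struct Shared { int mb_width; int mb_x, mb_y; } Shared;
#define CTX_PARAMS  Shared *h
#define CTX_ARGS    h
#define SL(field)   (h->field)
#endif

/* Step to the next macroblock in raster order. */
static void advance_mb(CTX_PARAMS)
{
    if (++SL(mb_x) == h->mb_width) {
        SL(mb_x) = 0;
        SL(mb_y)++;
    }
}

/* Skip n macroblocks; the body is identical in both configurations. */
void skip_mbs(CTX_PARAMS, int n)
{
    assert(n >= 0);
    while (n-- > 0)
        advance_mb(CTX_ARGS);
}
```

The function bodies stay the same in both builds, but every access to per-slice state has to go through SL(), which is the nastier source code referred to above.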
Anyway,
if this is something that ffmpeg is willing to integrate,
I'd like to get a few pointers, hints and answers on the
topics above before I continue with the stuff that's left.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-multi-threaded-poc.diff.gz
Type: application/x-gzip
Size: 43878 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070518/cdc3a454/attachment.bin>