[FFmpeg-devel] [HACK] 50% faster H.264 decoding

Wed Aug 11 23:32:15 CEST 2010

For a very particular use-case, we needed an H.264 decoder that was
extremely fast (for a certain ARM device with no accessible DSP), but
where we fully controlled the incoming H.264 stream and didn't care
much about compression.  And yes, we had to use H.264 -- I won't go
into the details.  So I spent a few days hacking apart ffh264; the
result is here for anyone batshit insane enough to look at it, use it,
or learn from it.

Encoding requirements for bit-exactness:

Baseline profile
--ref 1 (only one ref is supported during actual decoding, but the DPB
can be larger, so large-DPB-size error concealment should still work)
--nf (no deblocking)
--subme 0
--partitions must not contain p8x8
--interlaced must not be set
No custom quantization matrices
No PCM blocks
Various other restrictions

Incomplete summary of optimizations:

1.  Rip out everything that isn't used with the above restrictions.
2.  Reorder the H264Context to be a bit more cache-friendly and
offset-friendly for ARM.
3.  Allow 16-bit multiplications in dequant by storing quantization
factors with no extra precision (breaking CQMs).  Saves an add+shift
in dequant in the main decode_residual function as well.
4.  Store MVs as one MV per MB, instead of 16 or 32.
5.  Eliminate all reference frame handling, under the assumption of
only one ref.  This means we don't have to store refs at all.  ref=-1
(not used) becomes a USES_LIST check, and ref=-2 (not available)
becomes a neighbor type check.
6.  Eliminate decode cache handling of MVs entirely, handling that in
MV prediction.  This leaves only nnz/i4x4 in cache-filling code.
7.  Eliminate all dual-list handling of B-frames; saves a
prefetch_motion call, for example.
8.  Add prefetching of current-frame macroblock data, not just past frame data.
9.  Rip out SVQ3, it was causing me too much annoyance.
10. Rip out all error detection and handling.  (For our application, we
could guarantee that any slice that arrived on the client was correct.)
11. Make Golomb decoding use count-leading-zeroes (CLZ) instead of a
lookup table and branch -- it's faster on ARM that way.  Saves cache, too.
12. Inline basically everything.
13. Use MPEG-2 MC for chroma MC, since we know that MVs are
fullpel-only.  Simplify edge emulation stuff accordingly too.
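
To make item 3 concrete, here is a minimal sketch of why dropping the extra precision is safe once CQMs are off the table. It is not code from the patch: the function names, the `cls` position-class index, and the "+32 >> 6" shape of the general path are assumptions for illustration. Only the base scaling values come from the H.264 standard. With a flat weight of 16 the general product is always a multiple of 64, so the rounding add and the shift do nothing, and the factor itself fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Base dequant scaling values from the H.264 standard, indexed by
 * [qp % 6][position class].  The class ordering here is arbitrary. */
static const uint8_t norm_adjust[6][3] = {
    {10, 13, 16}, {11, 14, 18}, {13, 16, 20},
    {14, 18, 23}, {16, 20, 25}, {18, 23, 29},
};

/* General path (assumed shape): the stored factor includes the scaling
 * matrix weight plus 6 bits of extra precision, removed after the
 * multiply.  Relies on arithmetic right shift of negative values, which
 * GCC and Clang provide. */
static int32_t dequant_general(int level, int qp, int cls, int weight)
{
    int32_t qmul = norm_adjust[qp % 6][cls] * weight * (1 << (qp / 6 + 2));
    return (level * qmul + 32) >> 6;
}

/* Flat-matrix path: the factor is stored with no extra precision. */
static int16_t flat_factor(int qp, int cls)
{
    assert(qp >= 0 && qp <= 51 && cls >= 0 && cls < 3);
    return (int16_t)(norm_adjust[qp % 6][cls] << (qp / 6));
}

/* One 16x16 -> 32 bit multiply, no add and no shift. */
static int32_t dequant_flat(int16_t level, int qp, int cls)
{
    return level * flat_factor(qp, cls);
}
```

For any qp in 0..51, dequant_flat(level, qp, cls) matches dequant_general(level, qp, cls, 16), which is the flat-matrix case.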
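
Items 4 and 6 fit together: with one MV per MB, the neighbours needed for MV prediction can be read straight out of the per-MB array, with no decode cache in between. The sketch below is a simplified illustration, not the patch's code. The `MV` type and function names are hypothetical, and it assumes the left, top and top-right neighbours all exist and are inter-coded, which a real decoder has to check.

```c
#include <stdint.h>

/* Hypothetical per-macroblock motion vector. */
typedef struct { int16_t x, y; } MV;

/* Median of three values: max(min(a, b), min(max(a, b), c)). */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) b = c;                        /* b = min(b, c) */
    return a > b ? a : b;
}

/* Predict the MV of macroblock (mb_x, mb_y) from its left (A), top (B)
 * and top-right (C) neighbours in a one-MV-per-MB array.
 * Requires mb_x >= 1, mb_y >= 1 and mb_x + 1 < mb_stride. */
static MV predict_mv(const MV *mv, int mb_stride, int mb_x, int mb_y)
{
    const MV *cur = mv + mb_y * mb_stride + mb_x;
    MV a = cur[-1];
    MV b = cur[-mb_stride];
    MV c = cur[-mb_stride + 1];
    MV p;
    p.x = (int16_t)median3(a.x, b.x, c.x);
    p.y = (int16_t)median3(a.y, b.y, c.y);
    return p;
}
```

The storage for a frame is then a single array of mb_stride * mb_height entries rather than 16 or 32 vectors per macroblock.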
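
For item 11, a rough sketch of branch-free unsigned Exp-Golomb decoding. With an MSB-first bit reader the zero prefix of the code shows up as leading zeros of the peeked word, so a single count-leading-zeros (the CLZ instruction on ARM) gives the code length. The `BitReader` struct and helper names are hypothetical and much simpler than FFmpeg's real bit reader; `__builtin_clz` is the GCC/Clang builtin. Codes longer than 31 bits are asserted away here, in the spirit of item 10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal MSB-first bit reader. */
typedef struct {
    const uint8_t *buf;
    size_t size;  /* buffer size in bytes */
    size_t pos;   /* current position in bits */
} BitReader;

/* Return the next 32 bits without consuming them, zero-padded past the
 * end of the buffer. */
static uint32_t peek32(const BitReader *br)
{
    uint64_t acc = 0;
    size_t byte = br->pos >> 3;
    for (size_t i = 0; i < 5; i++) {
        acc <<= 8;
        if (byte + i < br->size)
            acc |= br->buf[byte + i];
    }
    /* acc holds 40 bits; drop the bits of the first byte already used */
    return (uint32_t)(acc >> (8 - (br->pos & 7)));
}

/* Decode one ue(v) value: n zeros, a one, then n info bits. */
static unsigned get_ue_golomb_clz(BitReader *br)
{
    uint32_t word = peek32(br);
    assert(word != 0);             /* __builtin_clz(0) is undefined */
    int zeros = __builtin_clz(word);
    assert(zeros <= 15);           /* whole code must fit in the peek */
    int len = 2 * zeros + 1;
    br->pos += (size_t)len;
    /* top len bits are 2^zeros + info; subtract 1 to get the code number */
    return (word >> (32 - len)) - 1;
}
```

For example, the bit string 1 010 011 00100 (bytes 0xA6 0x40) decodes to 0, 1, 2, 3.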
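
Item 13 works because a fullpel luma MV (a multiple of 4 in quarter-pel units) leaves the 4:2:0 chroma MV with an eighth-pel fraction of either 0 or 4, i.e. fullpel or halfpel. At those positions the H.264 bilinear chroma filter collapses to the rounded halfpel averages used by MPEG-2-style MC. The sketch below checks that per sample; the function names are made up for illustration and this is not the patch's code.

```c
#include <stdint.h>

/* H.264 chroma interpolation for one sample; dx, dy are the fractional
 * MV parts in eighth-pel units (0..7).  Reads a 2x2 neighbourhood. */
static uint8_t chroma_h264(const uint8_t *src, int stride, int dx, int dy)
{
    int a = (8 - dx) * (8 - dy);
    int b = dx * (8 - dy);
    int c = (8 - dx) * dy;
    int d = dx * dy;
    return (uint8_t)((a * src[0] + b * src[1] +
                      c * src[stride] + d * src[stride + 1] + 32) >> 6);
}

/* MPEG-2-style halfpel MC for one sample; hx, hy are 0 or 1. */
static uint8_t chroma_halfpel(const uint8_t *src, int stride, int hx, int hy)
{
    if (hx && hy)
        return (uint8_t)((src[0] + src[1] +
                          src[stride] + src[stride + 1] + 2) >> 2);
    if (hx)
        return (uint8_t)((src[0] + src[1] + 1) >> 1);
    if (hy)
        return (uint8_t)((src[0] + src[stride] + 1) >> 1);
    return src[0];
}
```

With dx and dy restricted to 0 or 4, chroma_h264(src, stride, dx, dy) and chroma_halfpel(src, stride, dx / 4, dy / 4) return the same value, so the cheaper averaging routines can stand in for the general filter.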

Most (but probably not all) of these ideas would be
pants-on-head-retarded for a real, working H.264 decoder.

Dark Shikari
-------------- next part --------------
A non-text attachment was scrubbed...
Name: faster_h264_decode7.diff
Type: application/octet-stream
Size: 201683 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100811/28b733a0/attachment.obj>