[FFmpeg-devel] low and not so low hanging h264 fruits

Sun Feb 14 04:20:03 CET 2010

Hi

I'd like you, dear reader, to help optimize h264, so please pick an
idea from below and work on it.

1. The direct temporal MV generation code (stuff like "scale * my_col + 128")
   works with 2 values at once and looks like a good candidate for MMX.
   This one is easy.
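
   For reference, a rough scalar sketch of the operation in question. This
   is not the decoder's actual code: DemoMV and scale_col_mv are made-up
   names, and the ">> 8" after the "+ 128" rounding is the usual fixed-point
   form of this scaling. A SIMD version could be checked against something
   like this.

```c
#include <stdint.h>

/* Hypothetical stand-in for a motion vector; not the decoder's own type. */
typedef struct DemoMV {
    int16_t x, y;
} DemoMV;

/*
 * Scale both components of a co-located MV by the same fixed-point factor
 * (8 fractional bits, round to nearest). The two components are independent,
 * which is what makes this a candidate for doing both in one SIMD operation.
 * Assumes >> on a negative int is an arithmetic shift, as on the usual
 * compilers.
 */
static DemoMV scale_col_mv(int scale, DemoMV col)
{
    DemoMV out;
    out.x = (int16_t)((scale * col.x + 128) >> 8);
    out.y = (int16_t)((scale * col.y + 128) >> 8);
    return out;
}
```

   With scale = 256 the vector comes back unchanged; with scale = 128 it is
   halved with rounding.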

2. Our direct temporal & spatial MV generation code works with either 1 16x16
   or 4 8x8 blocks. Try adding code for the 2 block case (16x8/8x16).
   This is easy too, but there is no guarantee that it's a win speedwise; it
   could be slower due to being more code.

3. mb_stride, b4_stride, b8_stride, whatever: our decoder is full of them.
   Change them to a named macro and make it a constant. This would reduce
   the amount of reading these from the context, lower register pressure, and
   change the addressing of values above and below from [2+2*b4_stride] to
   [constant].
   May or may not be easy.
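
   A minimal sketch of what the macro approach could look like. All names
   here (B4_STRIDE, FIXED_B4_STRIDE, DemoContext, mv_below) are hypothetical
   and only illustrate the idea; how a fixed stride would be chosen for
   real pictures is left open.

```c
#include <stdint.h>

/* Hypothetical context holding a run-time stride, as the decoder does today. */
typedef struct DemoContext {
    int b4_stride;
} DemoContext;

/*
 * All stride uses go through one named macro. If FIXED_B4_STRIDE is defined
 * at build time, the stride is a compile-time constant and expressions like
 * [2 + 2*stride] fold to a constant offset; otherwise it is read from the
 * context as before.
 */
#ifdef FIXED_B4_STRIDE
#define B4_STRIDE(ctx) ((void)(ctx), FIXED_B4_STRIDE)
#else
#define B4_STRIDE(ctx) ((ctx)->b4_stride)
#endif

/* Fetch the value two columns right and two rows below the given position. */
static int16_t mv_below(const DemoContext *ctx, const int16_t *mv)
{
    return mv[2 + 2 * B4_STRIDE(ctx)];
}
```

   Routing every access through the macro first keeps the change mechanical;
   switching to a constant afterwards is then a one-line build option that
   can be benchmarked.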

4. Interleave code from fill_decode_caches and the MB decode functions calling
   it, so that branches are reduced and code is executed less often. An
   example would be Dark Shikari's suggestion of not setting
   non_zero_count_cache if cbp is 0.
   This will likely be an "argh, why doesn't that work" task requiring some
   analysis of where things are set and used and what can and cannot be moved
   where.
   Also no guarantee that this is faster at all: it changes register pressure,
   the more complex code accesses more different things, and gcc could kill
   the gains.
   Something similar could also be tried with fill_filter_caches and the
   loop filter.
   Also, I might be working on parts of this, more specifically the fill+cabac
   related stuff; cabac is doing some seriously redundant looking things that
   I plan to work on soon.

5. Simply go over all the if()s, find the ones that are poorly predictable,
   and try to replace them with branchless code where that is faster and
   cleanly doable.
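
   To illustrate the kind of rewrite meant here, a small sketch of two
   common mask-based patterns. The function names are made up for this
   example and are not taken from the decoder; whether such a rewrite is
   actually faster has to be measured, since the compiler may already emit
   a conditional move for the plain if().

```c
#include <stdint.h>

/*
 * Return a if cond is nonzero, else b, using a mask instead of a branch.
 * mask is all ones when cond is true and all zeros otherwise.
 * Assumes two's complement conversion back to int32_t.
 */
static int32_t select_branchless(int cond, int32_t a, int32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0);
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/*
 * Return -v if negate is nonzero, else v.
 * (v ^ mask) - mask is v when mask is 0 and ~v + 1 when mask is all ones.
 */
static int32_t negate_if(int negate, int32_t v)
{
    uint32_t mask = -(uint32_t)(negate != 0);
    return (int32_t)(((uint32_t)v ^ mask) - mask);
}
```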

6. Our motion compensation code works directly from the pictures. It is
   possible that in some cases it would be faster to use an intermediate
   halfpel interpolated image. This should help especially for small blocks
   and bidirectionally predicted blocks.
   This likely is not easy, and ideally should be adaptively selected
   depending on picture content (use the last picture's motion vectors to
   predict which way is better for the current picture). Or maybe have some
   kind of cache that calculates and reuses halfpel values between blocks but
   doesn't cause any to be calculated if no block needs them ...
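
   A toy sketch of the cache variant, only to make the idea concrete. It
   handles horizontal halfpel samples of one tiny plane, computes a sample
   the first time a block asks for it and reuses it afterwards. The names
   (HalfpelCache, cache_get, DEMO_W, DEMO_H) and the edge clamping are
   assumptions made for this example; the filter is the standard 6-tap
   (1, -5, 20, 20, -5, 1) with (sum + 16) >> 5 rounding.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_W 16   /* plane width, also used as stride (demo value) */
#define DEMO_H 4    /* plane height (demo value) */

typedef struct HalfpelCache {
    const uint8_t *src;              /* full-pel source plane */
    uint8_t value[DEMO_H][DEMO_W];   /* cached halfpel samples */
    uint8_t valid[DEMO_H][DEMO_W];   /* 1 once the sample was computed */
    int computed;                    /* filter invocations, for inspection */
} HalfpelCache;

static int clamp_x(int x)
{
    return x < 0 ? 0 : x > DEMO_W - 1 ? DEMO_W - 1 : x;
}

/* Horizontal halfpel sample between (x, y) and (x + 1, y). */
static uint8_t hpel_h(const uint8_t *src, int x, int y)
{
    static const int tap[6] = { 1, -5, 20, 20, -5, 1 };
    int sum = 16;                    /* rounding offset */
    for (int i = 0; i < 6; i++)
        sum += tap[i] * src[y * DEMO_W + clamp_x(x - 2 + i)];
    if (sum < 0)
        return 0;                    /* clip low before shifting */
    sum >>= 5;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

static void cache_init(HalfpelCache *c, const uint8_t *src)
{
    memset(c, 0, sizeof(*c));
    c->src = src;
}

/* Nothing is filtered until a block actually requests the sample. */
static uint8_t cache_get(HalfpelCache *c, int x, int y)
{
    if (!c->valid[y][x]) {
        c->value[y][x] = hpel_h(c->src, x, y);
        c->valid[y][x] = 1;
        c->computed++;
    }
    return c->value[y][x];
}
```

   The per-sample valid check is itself a cost, so a real version would
   more likely track validity per row or per block; that trade-off is
   exactly what would need benchmarking.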

I have many more ideas; I'll post them once some of these are done.

PS: A SoC h264 optimization project would be a good idea too, with a
qualification task of making our decoder at least 1% faster.

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator