[FFmpeg-devel] [PATCH] h264 parallelized

Sun Sep 2 09:50:01 CEST 2007

Michael,

Michael Niedermayer wrote:
> Hi
> 
>> ive tried another file (Aladin.mpg 995 frames 352x240, the other file
>> was 538 frames 160x128)
>> svn  : 0m10.828s, 0m10.777s, 0m10.848s, 0m10.799s, 0m10.742s avg:10.799
>> patch: 0m10.770s, 0m10.777s, 0m10.831s, 0m10.918s, 0m10.778s avg:10.815
>>
>> ill do more tests
> 
> ive tried the first file concatenated 5 times:
> 0m3.669s, 0m3.696s, 0m3.674s, 0m3.700s, 0m3.724s avg:3.693
> 0m3.781s, 0m3.782s, 0m3.770s, 0m3.797s, 0m3.776s avg:3.781
> 
> this should exclude any once run init code as a possible cause
> 

I'm stumbling around the problem a bit here, and I can't reproduce
the slowdown on any of my systems. If anything, the patched build is
slightly faster in three of the four cases below.

10 rounds of decoding (without audio), user time in seconds:

Aladin.mpg:
Intel(R) Pentium(R) M processor 1.73GHz
unmodified: avg: 2.658  stddev: 0.053  med: 2.672
patched:    avg: 2.673  stddev: 0.014  med: 2.676

AMD Sempron(tm) Processor 2800+
unmodified: avg: 3.670  stddev: 0.033  med: 3.670
patched:    avg: 3.511  stddev: 0.055  med: 3.500


apple zodiac trailer:
Intel(R) Pentium(R) M processor 1.73GHz
unmodified: avg: 67.354  stddev: 0.132  med: 67.370
patched:    avg: 66.801  stddev: 0.371  med: 66.642

AMD Sempron(tm) Processor 2800+
unmodified: avg: 78.481  stddev: 0.543  med: 78.485
patched:    avg: 76.089  stddev: 0.293  med: 76.090

All tests were run with a vanilla ./configure build.
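
The avg/stddev/med figures above can be recomputed from raw timings
with something like the following Python sketch. It is only an
illustration: the names `summarize` and `relative_change` are made up
here, and using the sample standard deviation (n - 1) is an assumption,
since the mail does not say which form was used. The averages in
`results` are copied from the tables above.

```python
import statistics


def summarize(times):
    """Return (avg, stddev, median) for a list of user times in seconds."""
    avg = statistics.mean(times)
    # Assumption: sample standard deviation (n - 1); the mail does not specify.
    stddev = statistics.stdev(times)
    med = statistics.median(times)
    return avg, stddev, med


def relative_change(unmodified_avg, patched_avg):
    """Percentage change of the patched average relative to the unmodified one."""
    return 100.0 * (patched_avg - unmodified_avg) / unmodified_avg


# (clip, cpu) -> (unmodified avg, patched avg), taken from the results above.
results = {
    ("Aladin.mpg", "Pentium M"): (2.658, 2.673),
    ("Aladin.mpg", "Sempron 2800+"): (3.670, 3.511),
    ("apple zodiac trailer", "Pentium M"): (67.354, 66.801),
    ("apple zodiac trailer", "Sempron 2800+"): (78.481, 76.089),
}

for (clip, cpu), (unmod, patched) in results.items():
    # Negative values mean the patched build used less user time.
    print(f"{clip:22s} {cpu:14s} {relative_change(unmod, patched):+.2f}%")
```

Comparing the relative change against the stddev of each run set gives
a rough idea of whether a difference is larger than the run-to-run
noise.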

I've run tests with valgrind's cachegrind and can't see any difference.

A gprof build won't really compile with the optimized cabac support
(7regs conflicts with function instrumentation). But then again, I'm
not even able to reproduce the slowdown, so I doubt it would
give me any usable feedback.

Looking at symbol sizes with nm there is not much difference
either, see below.
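
For reference, a symbol-size comparison like this can also be
summarized per symbol instead of read from a raw diff. The Python
sketch below assumes input lines of the form "<hex size> <type> <name>",
as in the listing in this mail; the function names `parse_symbols` and
`size_deltas` are hypothetical, and the sample data is a small excerpt
of the sizes shown in the diff. If a name occurs more than once, the
last occurrence wins.

```python
def parse_symbols(text):
    """Map symbol name -> size from lines of the form '<hex size> <type> <name>'."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        size_hex, _kind, name = fields
        try:
            sizes[name] = int(size_hex, 16)
        except ValueError:
            continue  # first column was not a hex number
    return sizes


def size_deltas(old, new):
    """Return {name: new_size - old_size} for symbols whose size changed.

    Symbols present on only one side count as size 0 on the other.
    """
    deltas = {}
    for name in old.keys() | new.keys():
        delta = new.get(name, 0) - old.get(name, 0)
        if delta:
            deltas[name] = delta
    return deltas


# Excerpt of the listings compared in this mail.
unmodified = """\
00000315 t alloc_tables
00001667 t decode_nal_units
0000384a t decode_slice_header
"""
patched = """\
000002b5 t alloc_tables
0000012f t clone_slice
00001a7e t decode_nal_units
00002b84 t decode_slice_header
"""

deltas = size_deltas(parse_symbols(unmodified), parse_symbols(patched))
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{delta:+6d}  {name}")
```

Sorting by delta puts the functions that shrank the most first and the
ones that grew the most last.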

I've tried rearranging the added functions to see if there
are any inlining issues, but there is not much speed change.

If you (or anyone else) have any ideas I'd be happy to hear them :-)

Otherwise, I'll just have to drop the patch on the floor.
(Or let it linger until I come up with some idea, or stumble across
a machine where it slows down.)

--- /tmp/unmodified.symbols	2007-09-02 09:37:07.000000000 +0200
+++ /tmp/patched.symbols	2007-09-02 09:37:01.000000000 +0200
@@ -1,4 +1,4 @@
-00000315 t alloc_tables
+000002b5 t alloc_tables
  0000009c r alpha_table
  0000005c r b_mb_type_info
  00000034 r b_sub_mb_type_info
@@ -14,26 +14,27 @@
  0000000c r chroma_dc_total_zeros_len
  00000030 b chroma_dc_total_zeros_vlc
  00000034 r chroma_qp
+0000012f t clone_slice
  00000110 r coeff_token_bits
  00000110 r coeff_token_len
  00000040 b coeff_token_vlc
  000001d1 t decode_cabac_intra_mb_type
  00000704 t decode_cabac_mb_mvd
  00001233 t decode_cabac_residual
  00000040 t decode_end
  000015f2 t decode_frame
-00000f3c t decode_init
-000075f7 t decode_mb_cabac
-00006103 t decode_mb_cavlc
+00000f49 t decode_init
+000075b7 t decode_mb_cabac
+00006137 t decode_mb_cavlc
  00000acf t decode_mb_skip
-00001667 t decode_nal_units
+00001a7e t decode_nal_units
  00000e8a t decode_ref_pic_list_reordering
  0000088b t decode_residual
  00000d8b t decode_scaling_matrices
  00001a97 t decode_seq_parameter_set
  00000706 t decode_slice
-0000384a t decode_slice_header
+00002b84 t decode_slice_header
  00000020 r default_scaling4
  00000080 r default_scaling8
  00000012 r dequant4_coeff_init
@@ -60,15 +61,17 @@
  00000010 r field_scan
  00000040 r field_scan8x8
  00000040 r field_scan8x8_cavlc
-00001fee t fill_caches
+00001ffe t fill_caches
+00000f75 t fill_default_ref_list
+00000240 t fill_mbaff_ref_list
  00001e1c t filter_mb
  000003a0 t filter_mb_edgeh
  000002bf t filter_mb_edgev
  00002651 t filter_mb_fast
  00000202 t filter_mb_mbaff_edgecv
  000002da t flush_dpb
-000002e1 t frame_start
-0000010f t free_tables
+000002fe t frame_start
+00000145 t free_tables
  000000ae t get_cabac_noinline
  00000030 r golomb_to_inter_cbp
  00000030 r golomb_to_intra4x4_cbp
@@ -76,10 +79,11 @@
  00000034 D h264_decoder
  000002f6 t h264_luma_dc_dequant_idct_c
  0000270f t hl_decode_mb_complex
-0000124b t hl_decode_mb_simple
+00001244 t hl_decode_mb_simple
  00000c87 t hl_motion
  00000068 r i_mb_type_info
  00000613 t init_dequant_tables
+000004cb t init_scan_tables
  0000003f r last_coeff_flag_offset_8x8
  00000010 r luma_dc_zigzag_scan
  00001e0f t mc_part
@@ -116,7 +120,7 @@
  0000072c t svq3_add_idct_c
  00000040 r svq3_dct_tables
  000008c8 t svq3_decode_frame
-00002392 t svq3_decode_mb
+00002372 t svq3_decode_mb
  000004cd t svq3_decode_slice_header
  00000034 D svq3_decoder
  00000080 r svq3_dequant_coeff