[FFmpeg-devel] MMX accelerated DSP functions for VC1/WMV3 decoders

Sat Jun 30 19:08:24 CEST 2007

Hi

On Sat, Jun 30, 2007 at 02:37:53PM +0200, Christophe GISQUET wrote:
> Hello,
> 
> the attached patch provides some mmx functions (pshufw from mmx2 would
> be only marginally faster) for those decoders. They could also be used
> in the encoder, but I didn't bother with this, as there are probably
> people more fit than me to accommodate this with the build system.
> 
> Tests and benchmarks were performed on
> http://samples.mplayerhq.hu/V-codecs/WMV9/highdef/Robotica_720.wmv
> 
> I have checked decoding accuracy with cmp (as I neither know how nor plan
> to add this to the regression tests), and used the following command
> to measure speed/profile:
> ./ffmpeg -benchmark -i Robotica_720.wmv -an -f rawvideo -y /dev/null
> 
> And now for the raw figures...
> without patch, utime: 7.44 7.35 7.16 7.37 7.27
> with:                 5.32 5.37 5.33 5.31 5.41
> 
> And the profiling (oprofile results)...
> without patch:
>     samples  %        symbol name
>     129666   40.5939  vc1_mspel_mc
>     45812    14.3422  vc1_inv_trans_8x8_c
>     26404     8.2662  vc1_decode_p_blocks
>     21967     6.8771  put_no_rnd_h264_chroma_mc8_c
>     21336     6.6796  vc1_decode_ac_coeff
>     8582      2.6867  vc1_decode_intra_block
>     8273      2.5900  vc1_decode_p_block
>     8157      2.5537  clear_blocks_mmx
>     6896      2.1589  put_h264_chroma_mc8_mmx
>     6748      2.1126  vc1_inv_trans_8x4_c
>     6254      1.9579  vc1_inv_trans_4x8_c
> 
> with:
>     samples  %        symbol name
>     6095     17.8169  vc1_inv_trans_8x8_c
>     3769     11.0176  vc1_decode_p_blocks
>     3565     10.4212  put_no_rnd_h264_chroma_mc8_c
>     3380      9.8804  vc1_decode_ac_coeff
>     1365      3.9902  vc1_inv_trans_8x4_c
>     1348      3.9405  vc1_decode_p_block
>     1260      3.6832  clear_blocks_mmx
>     1146      3.3500  put_h264_chroma_mc8_mmx
>     1046      3.0577  vc1_inv_trans_4x8_c
>     938       2.7420  ff_emulated_edge_mc
>     849       2.4818  ff_put_vc1_mspel_mc22_mmx
>     791       2.3123  vc1_mc_1mv
>     774       2.2626  vc1_decode_intra_block
>     746       2.1807  ff_put_vc1_mspel_mc00_mmx
>     698       2.0404  ff_put_vc1_mspel_mc20_mmx
>     576       1.6838  ff_put_vc1_mspel_mc21_mmx
>     500       1.4616  ff_put_vc1_mspel_mc23_mmx
>     481       1.4061  ff_put_vc1_mspel_mc12_mmx
>     476       1.3914  ff_put_vc1_mspel_mc32_mmx
>     339       0.9910  ff_put_vc1_mspel_mc31_mmx
>     334       0.9764  ff_put_vc1_mspel_mc11_mmx
>     334       0.9764  ff_put_vc1_mspel_mc13_mmx
>     333       0.9734  ff_put_vc1_mspel_mc02_mmx
>     318       0.9296  ff_init_block_index
>     305       0.8916  ff_put_vc1_mspel_mc10_mmx
>     304       0.8887  add_pixels_clamped_mmx
>     274       0.8010  ff_put_vc1_mspel_mc33_mmx
>     267       0.7805  ff_put_vc1_mspel_mc30_mmx
>     233       0.6811  vc1_decode_i_blocks
>     180       0.5262  ff_put_vc1_mspel_mc01_mmx
>     165       0.4823  ff_put_vc1_mspel_mc03_mmx
>     154       0.4502  vc1_inv_trans_4x4_c
> 
> The new total for the ff_put_vc1_mspel_mc* functions is now just above
> 20%. There is some suboptimal stuff left of course, like filter 0 being
> just a source/destination modification, put_pixels8_mmx being
> duplicated, or some useless register loads, but code complexity would
> increase beyond what I'm willing to put in.
> 
> vc1_inv_trans_8x8_c would be a next follow-up candidate but the code
> looks bothersome. On the other hand, put_no_rnd_h264_chroma_mc8_c would
> benefit other codecs. I do have an mmx1/2 implementation for it, but I'm
> holding it until this patch gets in svn, if it ever does.
> 
> Best regards,
> Christophe GISQUET

> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c	(revision 9451)
> +++ libavcodec/i386/dsputil_mmx.c	(working copy)
> @@ -3163,6 +3163,10 @@
>      asm volatile("emms");
>  }
>  
> +#if defined(CONFIG_VC1_DECODER) || defined(CONFIG_WMV3_DECODER)
> +extern void ff_vc1dsp_init_mmx(DSPContext* dsp, AVCodecContext *avctx);
> +#endif
> +

the #if is unneeded; an extern declaration that is never used costs nothing


[...]
> +     "psllw     $1, %%mm1               \n\t"                   \
> +     "psllw     $1, %%mm2               \n\t"                   \

paddw
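
Presumably the point is that a left shift by one is the same as adding a
register to itself, so "paddw %%mm1, %%mm1" could stand in for
"psllw $1, %%mm1". A minimal sketch of that equivalence in plain C follows;
the mm_reg type and the helper names are hypothetical and only emulate the
four 16-bit lanes of an MMX register, they are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for an MMX register: four 16-bit lanes. */
typedef struct { uint16_t w[4]; } mm_reg;

/* Emulates "psllw $1, reg": each lane shifted left by one, wrapping at 16 bits. */
static mm_reg psllw1(mm_reg a)
{
    for (int i = 0; i < 4; i++)
        a.w[i] = (uint16_t)(a.w[i] << 1);
    return a;
}

/* Emulates "paddw b, a": lane-wise addition, wrapping at 16 bits. */
static mm_reg paddw(mm_reg a, mm_reg b)
{
    for (int i = 0; i < 4; i++)
        a.w[i] = (uint16_t)(a.w[i] + b.w[i]);
    return a;
}

/* Returns 1 when all four lanes match. */
static int mm_equal(mm_reg a, mm_reg b)
{
    for (int i = 0; i < 4; i++)
        if (a.w[i] != b.w[i])
            return 0;
    return 1;
}
```

For any lane contents, mm_equal(psllw1(a), paddw(a, a)) holds, including
the lanes that overflow.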


[...]
> +FF_PUT_VC1_MSPEL_MC(_mc10, MSPEL_FILTER_1, "32", MSPEL_FILTER_0, " 0")
> +FF_PUT_VC1_MSPEL_MC(_mc20, MSPEL_FILTER_2, " 8", MSPEL_FILTER_0, " 0")
> +FF_PUT_VC1_MSPEL_MC(_mc30, MSPEL_FILTER_3, "32", MSPEL_FILTER_0, " 0")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc01, MSPEL_FILTER_0, " 0", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc11, MSPEL_FILTER_1, "32", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc21, MSPEL_FILTER_2, " 8", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc31, MSPEL_FILTER_3, "32", MSPEL_FILTER_1, "32")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc02, MSPEL_FILTER_0, " 0", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc12, MSPEL_FILTER_1, "32", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc22, MSPEL_FILTER_2, " 8", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc32, MSPEL_FILTER_3, "32", MSPEL_FILTER_2, " 8")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc03, MSPEL_FILTER_0, " 0", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc13, MSPEL_FILTER_1, "32", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc23, MSPEL_FILTER_2, " 8", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc33, MSPEL_FILTER_3, "32", MSPEL_FILTER_3, "32")

duplicating each filter 4 times with macros is unacceptable
the overhead for 2 calls is not that big
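
One way to read the two-call idea is a single entry point that runs one
vertical pass and then one horizontal pass, each selected by its mode,
instead of 16 macro expansions. The following is only a sketch under
assumptions, in plain C rather than MMX: put_mspel_sketch and filter_pass
are hypothetical names, the taps are the VC-1 bicubic coefficients as I
recall them from the C reference code (verify before reuse), and the
intermediate is clamped to 8 bits, so the result is not bit-exact VC-1.

```c
#include <stdint.h>

/* 4-tap coefficients per quarter-pel mode; mode 0 degenerates to a copy. */
static const int taps[4][4] = {
    {  0, 64,  0,  0 },
    { -4, 53, 18, -3 },
    { -1,  9,  9, -1 },
    { -3, 18, 53, -4 },
};
static const int rnd[4]   = { 32, 32, 8, 32 };
static const int shift[4] = {  6,  6, 4,  6 };

/* One 1-D pass over a w x h area. step is 1 for a horizontal filter and
 * src_stride for a vertical one. The caller must guarantee that one
 * sample before and two samples after each position are readable. */
static void filter_pass(uint8_t *dst, int dst_stride,
                        const uint8_t *src, int src_stride,
                        int step, int mode, int w, int h)
{
    const int *t = taps[mode];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *p = src + y * src_stride + x;
            int v = t[0] * p[-step] + t[1] * p[0] +
                    t[2] * p[step]  + t[3] * p[2 * step] + rnd[mode];
            v = v < 0 ? 0 : v >> shift[mode];   /* clamp low, then scale */
            dst[y * dst_stride + x] = v > 255 ? 255 : (uint8_t)v;
        }
    }
}

/* 8x8 block: vertical pass into an 11-column temporary (columns -1..9),
 * then horizontal pass from that temporary into dst. */
static void put_mspel_sketch(uint8_t *dst, const uint8_t *src, int stride,
                             int hmode, int vmode)
{
    uint8_t tmp[8 * 11];
    filter_pass(tmp, 11, src - 1, stride, stride, vmode, 11, 8);
    filter_pass(dst, stride, tmp + 1, 11, 1, hmode, 8, 8);
}
```

With this shape, an accelerated version would need one routine per filter
and direction rather than one per (hmode, vmode) pair.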


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

It is dangerous to be right in matters on which the established authorities
are wrong. -- Voltaire