[FFmpeg-devel] MMX accelerated DSP functions for VC1/WMV3 decoders
Michael Niedermayer
michaelni
Sat Jun 30 19:08:24 CEST 2007
Hi
On Sat, Jun 30, 2007 at 02:37:53PM +0200, Christophe GISQUET wrote:
> Hello,
>
> the attached patch provides some mmx functions (pshufw from mmx2 would
> only marginally be faster) for those decoders. They could also be used
> in the encoder, but I didn't bother with this, as there are probably
> people more fit than me to accommodate this with the build system.
>
> Tests and benchmarks were performed on
> http://samples.mplayerhq.hu/V-codecs/WMV9/highdef/Robotica_720.wmv
>
> I have tested decoding accuracy with a cmp (as I don't know how to, nor
> plan to, introduce this in the regression tests), and used the following
> command to measure speed/profile:
> ./ffmpeg -benchmark -i Robotica_720.wmv -an -f rawvideo -y /dev/null
>
> And now for the raw figures...
> without patch, utime: 7.44 7.35 7.16 7.37 7.27
> with: 5.32 5.37 5.33 5.31 5.41
>
> And the profiling (oprofile results)...
> without patch:
> samples  %        symbol name
> 129666   40.5939  vc1_mspel_mc
> 45812    14.3422  vc1_inv_trans_8x8_c
> 26404    8.2662   vc1_decode_p_blocks
> 21967    6.8771   put_no_rnd_h264_chroma_mc8_c
> 21336    6.6796   vc1_decode_ac_coeff
> 8582     2.6867   vc1_decode_intra_block
> 8273     2.5900   vc1_decode_p_block
> 8157     2.5537   clear_blocks_mmx
> 6896     2.1589   put_h264_chroma_mc8_mmx
> 6748     2.1126   vc1_inv_trans_8x4_c
> 6254     1.9579   vc1_inv_trans_4x8_c
>
> with:
> samples  %        symbol name
> 6095     17.8169  vc1_inv_trans_8x8_c
> 3769     11.0176  vc1_decode_p_blocks
> 3565     10.4212  put_no_rnd_h264_chroma_mc8_c
> 3380     9.8804   vc1_decode_ac_coeff
> 1365     3.9902   vc1_inv_trans_8x4_c
> 1348     3.9405   vc1_decode_p_block
> 1260     3.6832   clear_blocks_mmx
> 1146     3.3500   put_h264_chroma_mc8_mmx
> 1046     3.0577   vc1_inv_trans_4x8_c
> 938      2.7420   ff_emulated_edge_mc
> 849      2.4818   ff_put_vc1_mspel_mc22_mmx
> 791      2.3123   vc1_mc_1mv
> 774      2.2626   vc1_decode_intra_block
> 746      2.1807   ff_put_vc1_mspel_mc00_mmx
> 698      2.0404   ff_put_vc1_mspel_mc20_mmx
> 576      1.6838   ff_put_vc1_mspel_mc21_mmx
> 500      1.4616   ff_put_vc1_mspel_mc23_mmx
> 481      1.4061   ff_put_vc1_mspel_mc12_mmx
> 476      1.3914   ff_put_vc1_mspel_mc32_mmx
> 339      0.9910   ff_put_vc1_mspel_mc31_mmx
> 334      0.9764   ff_put_vc1_mspel_mc11_mmx
> 334      0.9764   ff_put_vc1_mspel_mc13_mmx
> 333      0.9734   ff_put_vc1_mspel_mc02_mmx
> 318      0.9296   ff_init_block_index
> 305      0.8916   ff_put_vc1_mspel_mc10_mmx
> 304      0.8887   add_pixels_clamped_mmx
> 274      0.8010   ff_put_vc1_mspel_mc33_mmx
> 267      0.7805   ff_put_vc1_mspel_mc30_mmx
> 233      0.6811   vc1_decode_i_blocks
> 180      0.5262   ff_put_vc1_mspel_mc01_mmx
> 165      0.4823   ff_put_vc1_mspel_mc03_mmx
> 154      0.4502   vc1_inv_trans_4x4_c
>
> The new total for the ff_put_vc1_mspel_mc* functions is now just above
> 20%. There is some suboptimal stuff left of course, like filter 0 being
> just a source/destination modification, put_pixels8_mmx being
> duplicated, or some useless register loads, but code complexity would
> increase beyond what I'm willing to put in.
>
> vc1_inv_trans_8x8_c would be a next follow-up candidate but the code
> looks bothersome. On the other hand, put_no_rnd_h264_chroma_mc8_c would
> benefit other codecs. I do have an mmx1/2 implementation for it, but I'm
> holding it until this patch gets in svn, if it ever does.
>
> Best regards,
> Christophe GISQUET
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c (revision 9451)
> +++ libavcodec/i386/dsputil_mmx.c (working copy)
> @@ -3163,6 +3163,10 @@
> asm volatile("emms");
> }
>
> +#if defined(CONFIG_VC1_DECODER) || defined(CONFIG_WMV3_DECODER)
> +extern void ff_vc1dsp_init_mmx(DSPContext* dsp, AVCodecContext *avctx);
> +#endif
> +
the #if is unneeded
[...]
> + "psllw $1, %%mm1 \n\t" \
> + "psllw $1, %%mm2 \n\t" \
paddw
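
To spell out the one-word remark: a left shift by one on a 16-bit lane gives the same result as adding the lane to itself, so each "psllw $1, reg" could presumably be written as "paddw reg, reg". The plain C sketch below only models a single lane with wrapping 16-bit arithmetic; the function names are made up for illustration and nothing here is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of "psllw $1": shift left by one, wrapping modulo 2^16.
 * (Hypothetical helper, models the instruction in scalar C.) */
static uint16_t lane_shl1(uint16_t x)
{
    return (uint16_t)((uint32_t)x << 1);
}

/* One 16-bit lane of "paddw reg, reg": wrapping add of the lane to itself. */
static uint16_t lane_add_self(uint16_t x)
{
    return (uint16_t)((uint32_t)x + (uint32_t)x);
}

/* Exhaustively check that both forms agree for all 65536 lane values. */
static void check_shl1_equals_add_self(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++)
        assert(lane_shl1((uint16_t)v) == lane_add_self((uint16_t)v));
}
```

Calling check_shl1_equals_add_self() passes silently, including for values whose top bit is shifted out. Whether the add is actually faster than the shift depends on the CPU, which is something only a benchmark on the target would settle.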
[...]
> +FF_PUT_VC1_MSPEL_MC(_mc10, MSPEL_FILTER_1, "32", MSPEL_FILTER_0, " 0")
> +FF_PUT_VC1_MSPEL_MC(_mc20, MSPEL_FILTER_2, " 8", MSPEL_FILTER_0, " 0")
> +FF_PUT_VC1_MSPEL_MC(_mc30, MSPEL_FILTER_3, "32", MSPEL_FILTER_0, " 0")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc01, MSPEL_FILTER_0, " 0", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc11, MSPEL_FILTER_1, "32", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc21, MSPEL_FILTER_2, " 8", MSPEL_FILTER_1, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc31, MSPEL_FILTER_3, "32", MSPEL_FILTER_1, "32")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc02, MSPEL_FILTER_0, " 0", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc12, MSPEL_FILTER_1, "32", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc22, MSPEL_FILTER_2, " 8", MSPEL_FILTER_2, " 8")
> +FF_PUT_VC1_MSPEL_MC(_mc32, MSPEL_FILTER_3, "32", MSPEL_FILTER_2, " 8")
> +
> +FF_PUT_VC1_MSPEL_MC(_mc03, MSPEL_FILTER_0, " 0", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc13, MSPEL_FILTER_1, "32", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc23, MSPEL_FILTER_2, " 8", MSPEL_FILTER_3, "32")
> +FF_PUT_VC1_MSPEL_MC(_mc33, MSPEL_FILTER_3, "32", MSPEL_FILTER_3, "32")
duplicating each filter 4 times with macros is unacceptable
the overhead for 2 calls is not that big
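
As a rough illustration of the 2-call structure, here is a scalar C sketch with one pass function that is called once vertically and once horizontally, instead of one expanded body per (h, v) pair. The taps are the VC-1 bicubic coefficients as I understand them; the rounders they imply (0, 32, 8, 32) match the constants in the quoted macro invocations. The names mspel_pass and put_mspel_8x8 are hypothetical, and clipping to 8 bits between the passes is a simplification, so this is not bit-exact with the decoder.

```c
#include <stdint.h>

/* Taps applied to p[-step], p[0], p[step], p[2*step]; mode 0 is a plain copy. */
static const int mspel_taps[4][4] = {
    {  0,  0,  0,  0 },   /* mode 0: unused */
    { -4, 53, 18, -3 },   /* quarter position */
    { -1,  9,  9, -1 },   /* half position */
    { -3, 18, 53, -4 },   /* three-quarter position */
};
static const int mspel_shift[4] = { 0, 6, 4, 6 };

/* One 1-D pass over a w x h area; step is 1 for horizontal filtering
 * and the source stride for vertical filtering. */
static void mspel_pass(uint8_t *dst, int dst_stride,
                       const uint8_t *src, int src_stride,
                       int step, int mode, int w, int h)
{
    const int *t  = mspel_taps[mode];
    const int sh  = mspel_shift[mode];
    const int rnd = (1 << sh) >> 1;          /* 0, 32, 8 or 32 */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *p = src + y * src_stride + x;
            int v;
            if (mode == 0) {                 /* no neighbours are read */
                dst[y * dst_stride + x] = p[0];
                continue;
            }
            v = t[0] * p[-step] + t[1] * p[0]
              + t[2] * p[step]  + t[3] * p[2 * step] + rnd;
            v = v < 0 ? 0 : v >> sh;         /* clamp low before shifting */
            dst[y * dst_stride + x] = (uint8_t)(v > 255 ? 255 : v);
        }
    }
}

/* Two calls per block: vertical pass into a temporary that is 11 columns
 * wide (columns -1..9), then horizontal pass into dst.
 * The caller must guarantee 1 readable pixel before and 2 after the block
 * in both directions; dst and src must not overlap. */
static void put_mspel_8x8(uint8_t *dst, const uint8_t *src, int stride,
                          int hmode, int vmode)
{
    uint8_t tmp[8 * 11];

    mspel_pass(tmp, 11, src - 1, stride, stride, vmode, 11, 8);
    mspel_pass(dst, stride, tmp + 1, 11, 1, hmode, 8, 8);
}
```

With this shape the 16 ff_put_vc1_mspel_mcXY entry points reduce to thin wrappers such as put_mspel_8x8(dst, src, stride, 2, 1), and an MMX version only needs one optimized body per filter and direction.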
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
It is dangerous to be right in matters on which the established authorities
are wrong. -- Voltaire