[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations
Michael Niedermayer
michaelni
Wed Jul 30 20:57:29 CEST 2008
On Tue, Jul 29, 2008 at 10:59:38PM +0200, Victor Pollex wrote:
> Kostya schrieb:
>> On Mon, Jul 28, 2008 at 09:27:24PM +0200, Victor Pollex wrote:
>> [...]
>>
>>>>
>>> changed the other scantables aswell, though i haven't tested the 8x4 and
>>> 4x8 for the advanced profile, as I haven't found a progressive video
>>> encoded with the advanced profile.
>>>
>>
>> try http://samples.mplayerhq.hu/evob/sample.evo or
>> http://samples.mplayerhq.hu/evob/sample-long.evo
>>
>> also please use -f crc to make sure output is the same
>>
>
> I used the first 300 frames of sample.evo, as at some point later on there
> is a bits overconsumption error and i end up with different CRCs. The
> result I got for both normal and transponsed scantables is CRC=0x158632f5
>
>> [...]
>>
>>> 00_constants.patch:
>>> changes the ff_pw_4 and ff_pw_64 constants to be able to use them with
>>> sse2
>>>
>>> 01_dsputil_mmx.patch:
>>> adds an parameter to the LOAD4 / STORE4 macros to specify a mov mnemonic
>>> suffix
>>>
>>> 02_vc1dsp.patch:
>>> the new "transposed" scantables and altered code to use them
>>>
>>> 03_vc1dsp_mmx.patch:
>>> adds a new header file with macros for the transformations and adds an
>>> mmx version for the transformations
>>>
>>> 04_vc1dsp_sse2.path:
>>> adds an sse2/mmx version for the transformations and uses a bit of sse
>>> instructions. I hope this is acceptable.
>>> someone should benchmark the 4x8 and 8x4 versions, as for me the former
>>> was slower and the latter a bit faster compared to the mmx versions
>>>
>> Hmm, I'm ok with transposed scantables if that does not change decoded
>> frames.
>>
>
> It only changes the data layout in "block", everything in "dest" should be
> the same as before.
>
>> As for the rest - I'm not a maintainer but your files did not compile on
>> Core2.
>> Here's a snippet (same for vc1dsp_mmx.c):
>>
>> libavcodec/i386/vc1dsp_sse2.c: Assembler messages:
>> libavcodec/i386/vc1dsp_sse2.c:299: Error: suffix or operands invalid for
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:305: Error: suffix or operands invalid for
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:311: Error: suffix or operands invalid for
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:232: Error: `(%rdi,%eax)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:236: Error: `(%rdi,%eax)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:237: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:241: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:242: Error: suffix or operands invalid for
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:243: Error: `(%rdi,%eax,2)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:247: Error: `(%rdi,%eax,2)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:331: Error: `(%rdi,%esi)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:335: Error: `(%rdi,%esi)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:336: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:340: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:341: Error: suffix or operands invalid for
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:342: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:346: Error: `(%rdi,%esi,4)' is not a valid
>> base/index expression
>> make: *** [libavcodec/i386/vc1dsp_sse2.o] Error 1
>
> I guess I forgot to add portability for the x86_64 architecture. Hopefully
> it is fixed now.
>
[...]
> Index: libavcodec/i386/dsputil_mmx.h
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.h (Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.h (Arbeitskopie)
> @@ -33,7 +33,7 @@
> extern const uint64_t ff_pdw_80000000[2];
>
> extern const uint64_t ff_pw_3;
> -extern const uint64_t ff_pw_4;
> +extern const xmm_t ff_pw_4;
> extern const xmm_t ff_pw_5;
> extern const uint64_t ff_pw_8;
> extern const uint64_t ff_pw_15;
> @@ -42,7 +42,7 @@
> extern const xmm_t ff_pw_28;
> extern const xmm_t ff_pw_32;
> extern const uint64_t ff_pw_42;
> -extern const uint64_t ff_pw_64;
> +extern const xmm_t ff_pw_64;
> extern const uint64_t ff_pw_96;
> extern const uint64_t ff_pw_128;
> extern const uint64_t ff_pw_255;
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c (Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.c (Arbeitskopie)
> @@ -46,7 +46,7 @@
> {0x8000000080000000ULL, 0x8000000080000000ULL};
>
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_3 ) = 0x0003000300030003ULL;
> -DECLARE_ALIGNED_8 (const uint64_t, ff_pw_4 ) = 0x0004000400040004ULL;
> +DECLARE_ALIGNED_16(const xmm_t, ff_pw_4 ) = {0x0004000400040004ULL, 0x0004000400040004ULL};
> DECLARE_ALIGNED_16(const xmm_t, ff_pw_5 ) = {0x0005000500050005ULL, 0x0005000500050005ULL};
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_8 ) = 0x0008000800080008ULL;
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_15 ) = 0x000F000F000F000FULL;
> @@ -55,7 +55,7 @@
> DECLARE_ALIGNED_16(const xmm_t, ff_pw_28 ) = {0x001C001C001C001CULL, 0x001C001C001C001CULL};
> DECLARE_ALIGNED_16(const xmm_t, ff_pw_32 ) = {0x0020002000200020ULL, 0x0020002000200020ULL};
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_42 ) = 0x002A002A002A002AULL;
> -DECLARE_ALIGNED_8 (const uint64_t, ff_pw_64 ) = 0x0040004000400040ULL;
> +DECLARE_ALIGNED_16(const xmm_t, ff_pw_64 ) = {0x0040004000400040ULL, 0x0040004000400040ULL};
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_96 ) = 0x0060006000600060ULL;
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_128) = 0x0080008000800080ULL;
> DECLARE_ALIGNED_8 (const uint64_t, ff_pw_255) = 0x00ff00ff00ff00ffULL;
> Index: libavcodec/i386/dsputil_h264_template_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_h264_template_mmx.c (Revision 14469)
> +++ libavcodec/i386/dsputil_h264_template_mmx.c (Arbeitskopie)
> @@ -45,7 +45,7 @@
> /* 1 dimensional filter only */
> const int dxy = x ? 1 : stride;
>
> - rnd_reg = rnd ? &ff_pw_4 : &ff_pw_3;
> + rnd_reg = rnd ? &ff_pw_4.a : &ff_pw_3;
>
> asm volatile(
> "movd %0, %%mm5\n\t"
> Index: libavcodec/i386/cavsdsp_mmx.c
> ===================================================================
> --- libavcodec/i386/cavsdsp_mmx.c (Revision 14469)
> +++ libavcodec/i386/cavsdsp_mmx.c (Arbeitskopie)
> @@ -118,7 +118,7 @@
> for(i=0; i<2; i++){
> DECLARE_ALIGNED_8(uint64_t, tmp);
>
> - cavs_idct8_1d(block+4*i, ff_pw_4);
> + cavs_idct8_1d(block+4*i, ff_pw_4.a);
>
> asm volatile(
> "psraw $3, %%mm7 \n\t"
> @@ -148,7 +148,7 @@
> }
>
> for(i=0; i<2; i++){
> - cavs_idct8_1d(b2+4*i, ff_pw_64);
> + cavs_idct8_1d(b2+4*i, ff_pw_64.a);
>
> asm volatile(
> "psraw $7, %%mm7 \n\t"
ok
> Index: libavcodec/i386/dsputil_mmx.h
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.h (Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.h (Arbeitskopie)
> @@ -57,17 +57,17 @@
> extern const double ff_pd_1[2];
> extern const double ff_pd_2[2];
>
> -#define LOAD4(stride,in,a,b,c,d)\
> - "movq 0*"#stride"+"#in", "#a"\n\t"\
> - "movq 1*"#stride"+"#in", "#b"\n\t"\
> - "movq 2*"#stride"+"#in", "#c"\n\t"\
> - "movq 3*"#stride"+"#in", "#d"\n\t"
> +#define LOAD4(m,stride,in,a,b,c,d)\
> + "mov"#m" 0*"#stride"+"#in", "#a"\n\t"\
> + "mov"#m" 1*"#stride"+"#in", "#b"\n\t"\
> + "mov"#m" 2*"#stride"+"#in", "#c"\n\t"\
> + "mov"#m" 3*"#stride"+"#in", "#d"\n\t"
>
> -#define STORE4(stride,out,a,b,c,d)\
> - "movq "#a", 0*"#stride"+"#out"\n\t"\
> - "movq "#b", 1*"#stride"+"#out"\n\t"\
> - "movq "#c", 2*"#stride"+"#out"\n\t"\
> - "movq "#d", 3*"#stride"+"#out"\n\t"
> +#define STORE4(m,stride,out,a,b,c,d)\
> + "mov"#m" "#a", 0*"#stride"+"#out"\n\t"\
> + "mov"#m" "#b", 1*"#stride"+"#out"\n\t"\
> + "mov"#m" "#c", 2*"#stride"+"#out"\n\t"\
> + "mov"#m" "#d", 3*"#stride"+"#out"\n\t"
>
> /* in/out: mma=mma+mmb, mmb=mmb-mma */
> #define SUMSUB_BA( a, b ) \
> Index: libavcodec/i386/dsputilenc_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputilenc_mmx.c (Revision 14469)
> +++ libavcodec/i386/dsputilenc_mmx.c (Arbeitskopie)
> @@ -1041,11 +1041,11 @@
> "movq %%mm7, 96(%1) \n\t"\
> \
> TRANSPOSE4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm7)\
> - STORE4(8, 0(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> + STORE4(q, 8, 0(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> \
> "movq 96(%1), %%mm7 \n\t"\
> TRANSPOSE4(%%mm4, %%mm5, %%mm6, %%mm7, %%mm0)\
> - STORE4(8, 64(%1), %%mm4, %%mm7, %%mm0, %%mm6)\
> + STORE4(q, 8, 64(%1), %%mm4, %%mm7, %%mm0, %%mm6)\
> \
> : "=r" (sum)\
> : "r"(temp)\
> @@ -1059,7 +1059,7 @@
> "movq %%mm7, 96(%1) \n\t"\
> \
> TRANSPOSE4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm7)\
> - STORE4(8, 32(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> + STORE4(q, 8, 32(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> \
> "movq 96(%1), %%mm7 \n\t"\
> TRANSPOSE4(%%mm4, %%mm5, %%mm6, %%mm7, %%mm0)\
> @@ -1067,7 +1067,7 @@
> "movq %%mm6, %%mm7 \n\t"\
> "movq %%mm0, %%mm6 \n\t"\
> \
> - LOAD4(8, 64(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> + LOAD4(q, 8, 64(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> \
> HADAMARD48\
> "movq %%mm7, 64(%1) \n\t"\
> @@ -1083,8 +1083,8 @@
> "paddusw %%mm1, %%mm0 \n\t"\
> "movq %%mm0, 64(%1) \n\t"\
> \
> - LOAD4(8, 0(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> - LOAD4(8, 32(%1), %%mm4, %%mm5, %%mm6, %%mm7)\
> + LOAD4(q, 8, 0(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> + LOAD4(q, 8, 32(%1), %%mm4, %%mm5, %%mm6, %%mm7)\
> \
> HADAMARD48\
> "movq %%mm7, (%1) \n\t"\
ok
[...]
> @@ -467,7 +469,256 @@
> DECLARE_FUNCTION(3, 2)
> DECLARE_FUNCTION(3, 3)
>
> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
> +{
> + DECLARE_ALIGNED_16(int16_t, temp[64]);
> + asm volatile(
> + LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
> + TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
> + STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
> +
> + LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
> + TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
> + STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)
it is still transposing the data at the begin of functions.
I thought you transposed the scantables ...
[...]
> + :
> + : "r"(block), "m"(temp[0])
> + : "memory"
> + );
> +
> + asm volatile(
why is this asm () block splited?
[...]
> + STORE4(q,0x10,0x40%1,%%mm4,%%mm7,%%mm0,%%mm6)
> + :
> + : "r"(block), "m"(temp[0]), "m"(ff_pw_4)
> + : "memory"
> + );
> +
> + asm volatile(
> + "movq 0x30%3, %%mm1\n\t" /* b[3] */
> + TRANSFORM_4X8_COL_H1
> + (
> + q,q,
> + 0x00%3,0x10%3,0x20%3,0x40%3,0x70%3,
this store and later load seems redundant
and the asm should not be split
[...]
> + STORE4(dqa,0x10,0x00(%0),%%xmm0,%%xmm5,%%xmm7,%%xmm3)
> + STORE4(dqa,0x10,0x40(%0),%%xmm6,%%xmm4,%%xmm2,%%xmm1)
> + TRANSFORM_8X4_ROW_H1
> + (
> + dqa,dqa,
> + 0x00(%0),0x20(%0),0x40(%0),0x70(%0),
some of these stores and loads seem redundant
[...]
> +void ff_vc1dsp_init_sse2(DSPContext* dsp, AVCodecContext *avctx) {
> + if(!(mm_flags & MM_SSE2))
> + return;
> +
> + dsp->vc1_inv_trans_8x8 = vc1_inv_trans_8x8_sse2;
> + dsp->vc1_inv_trans_4x8 = vc1_inv_trans_4x8_sse2;
> + dsp->vc1_inv_trans_8x4 = vc1_inv_trans_8x4_sse2;
> +}
are all of the SSE2 variants faste than mmx?
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at mplayerhq.hu
> https://lists.mplayerhq.hu/mailman/listinfo/ffmpeg-devel
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080730/d852ece7/attachment.pgp>
More information about the ffmpeg-devel
mailing list