[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations

Wed Jul 30 20:57:29 CEST 2008

On Tue, Jul 29, 2008 at 10:59:38PM +0200, Victor Pollex wrote:
> Kostya schrieb:
>> On Mon, Jul 28, 2008 at 09:27:24PM +0200, Victor Pollex wrote:
>> [...]
>>   
>>>>        
>>> changed the other scantables aswell, though i haven't tested the 8x4 and 
>>> 4x8 for the advanced profile, as I haven't found a progressive video 
>>> encoded with the advanced profile.
>>>     
>>
>> try http://samples.mplayerhq.hu/evob/sample.evo or
>> http://samples.mplayerhq.hu/evob/sample-long.evo
>>
>> also please use -f crc to make sure output is the same
>>   
>
> I used the first 300 frames of sample.evo, as at some point later on there 
> is a bits overconsumption error and i end up with different CRCs. The 
> result I got for both normal and transponsed scantables is CRC=0x158632f5
>
>> [...]
>>   
>>> 00_constants.patch:
>>> changes the ff_pw_4 and ff_pw_64 constants to be able to use them with 
>>> sse2
>>>
>>> 01_dsputil_mmx.patch:
>>> adds an parameter to the LOAD4 / STORE4 macros to specify a mov mnemonic 
>>> suffix
>>>
>>> 02_vc1dsp.patch:
>>> the new "transposed" scantables and altered code to use them
>>>
>>> 03_vc1dsp_mmx.patch:
>>> adds a new header file with macros for the transformations and adds an 
>>> mmx version for the transformations
>>>
>>> 04_vc1dsp_sse2.path:
>>> adds an sse2/mmx version for the transformations and uses a bit of sse 
>>> instructions. I hope this is acceptable.
>>> someone should benchmark the 4x8 and 8x4 versions, as for me the former 
>>> was slower and the latter a bit faster compared to the mmx versions
>>>     
>>  Hmm, I'm ok with transposed scantables if that does not change decoded 
>> frames.
>>   
>
> It only changes the data layout in "block", everything in "dest" should be 
> the same as before.
>
>> As for the rest - I'm not a maintainer but your files did not compile on 
>> Core2.
>> Here's a snippet (same for vc1dsp_mmx.c):
>>
>> libavcodec/i386/vc1dsp_sse2.c: Assembler messages:
>> libavcodec/i386/vc1dsp_sse2.c:299: Error: suffix or operands invalid for 
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:305: Error: suffix or operands invalid for 
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:311: Error: suffix or operands invalid for 
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:232: Error: `(%rdi,%eax)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:236: Error: `(%rdi,%eax)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:237: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:241: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:242: Error: suffix or operands invalid for 
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:243: Error: `(%rdi,%eax,2)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:247: Error: `(%rdi,%eax,2)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:331: Error: `(%rdi,%esi)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:335: Error: `(%rdi,%esi)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:336: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:340: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:341: Error: suffix or operands invalid for 
>> `add'
>> libavcodec/i386/vc1dsp_sse2.c:342: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> libavcodec/i386/vc1dsp_sse2.c:346: Error: `(%rdi,%esi,4)' is not a valid 
>> base/index expression
>> make: *** [libavcodec/i386/vc1dsp_sse2.o] Error 1   
>
> I guess I forgot to add portability for the x86_64 architecture. Hopefully 
> it is fixed now.
>

[...]
> Index: libavcodec/i386/dsputil_mmx.h
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.h	(Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.h	(Arbeitskopie)
> @@ -33,7 +33,7 @@
>  extern const uint64_t ff_pdw_80000000[2];
> 
>  extern const uint64_t ff_pw_3;
> -extern const uint64_t ff_pw_4;
> +extern const xmm_t    ff_pw_4;
>  extern const xmm_t    ff_pw_5;
>  extern const uint64_t ff_pw_8;
>  extern const uint64_t ff_pw_15;
> @@ -42,7 +42,7 @@
>  extern const xmm_t    ff_pw_28;
>  extern const xmm_t    ff_pw_32;
>  extern const uint64_t ff_pw_42;
> -extern const uint64_t ff_pw_64;
> +extern const xmm_t    ff_pw_64;
>  extern const uint64_t ff_pw_96;
>  extern const uint64_t ff_pw_128;
>  extern const uint64_t ff_pw_255;
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c	(Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.c	(Arbeitskopie)
> @@ -46,7 +46,7 @@
>  {0x8000000080000000ULL, 0x8000000080000000ULL};
> 
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_3  ) = 0x0003000300030003ULL;
> -DECLARE_ALIGNED_8 (const uint64_t, ff_pw_4  ) = 0x0004000400040004ULL;
> +DECLARE_ALIGNED_16(const xmm_t,    ff_pw_4  ) = {0x0004000400040004ULL, 0x0004000400040004ULL};
>  DECLARE_ALIGNED_16(const xmm_t,    ff_pw_5  ) = {0x0005000500050005ULL, 0x0005000500050005ULL};
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_8  ) = 0x0008000800080008ULL;
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_15 ) = 0x000F000F000F000FULL;
> @@ -55,7 +55,7 @@
>  DECLARE_ALIGNED_16(const xmm_t,    ff_pw_28 ) = {0x001C001C001C001CULL, 0x001C001C001C001CULL};
>  DECLARE_ALIGNED_16(const xmm_t,    ff_pw_32 ) = {0x0020002000200020ULL, 0x0020002000200020ULL};
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_42 ) = 0x002A002A002A002AULL;
> -DECLARE_ALIGNED_8 (const uint64_t, ff_pw_64 ) = 0x0040004000400040ULL;
> +DECLARE_ALIGNED_16(const xmm_t,    ff_pw_64 ) = {0x0040004000400040ULL, 0x0040004000400040ULL};
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_96 ) = 0x0060006000600060ULL;
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_128) = 0x0080008000800080ULL;
>  DECLARE_ALIGNED_8 (const uint64_t, ff_pw_255) = 0x00ff00ff00ff00ffULL;
> Index: libavcodec/i386/dsputil_h264_template_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_h264_template_mmx.c	(Revision 14469)
> +++ libavcodec/i386/dsputil_h264_template_mmx.c	(Arbeitskopie)
> @@ -45,7 +45,7 @@
>          /* 1 dimensional filter only */
>          const int dxy = x ? 1 : stride;
> 
> -        rnd_reg = rnd ? &ff_pw_4 : &ff_pw_3;
> +        rnd_reg = rnd ? &ff_pw_4.a : &ff_pw_3;
> 
>          asm volatile(
>              "movd %0, %%mm5\n\t"
> Index: libavcodec/i386/cavsdsp_mmx.c
> ===================================================================
> --- libavcodec/i386/cavsdsp_mmx.c	(Revision 14469)
> +++ libavcodec/i386/cavsdsp_mmx.c	(Arbeitskopie)
> @@ -118,7 +118,7 @@
>      for(i=0; i<2; i++){
>          DECLARE_ALIGNED_8(uint64_t, tmp);
> 
> -        cavs_idct8_1d(block+4*i, ff_pw_4);
> +        cavs_idct8_1d(block+4*i, ff_pw_4.a);
> 
>          asm volatile(
>              "psraw     $3, %%mm7  \n\t"
> @@ -148,7 +148,7 @@
>      }
> 
>      for(i=0; i<2; i++){
> -        cavs_idct8_1d(b2+4*i, ff_pw_64);
> +        cavs_idct8_1d(b2+4*i, ff_pw_64.a);
> 
>          asm volatile(
>              "psraw     $7, %%mm7  \n\t"

ok

> Index: libavcodec/i386/dsputil_mmx.h
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.h	(Revision 14469)
> +++ libavcodec/i386/dsputil_mmx.h	(Arbeitskopie)
> @@ -57,17 +57,17 @@
>  extern const double ff_pd_1[2];
>  extern const double ff_pd_2[2];
> 
> -#define LOAD4(stride,in,a,b,c,d)\
> -    "movq 0*"#stride"+"#in", "#a"\n\t"\
> -    "movq 1*"#stride"+"#in", "#b"\n\t"\
> -    "movq 2*"#stride"+"#in", "#c"\n\t"\
> -    "movq 3*"#stride"+"#in", "#d"\n\t"
> +#define LOAD4(m,stride,in,a,b,c,d)\
> +    "mov"#m" 0*"#stride"+"#in", "#a"\n\t"\
> +    "mov"#m" 1*"#stride"+"#in", "#b"\n\t"\
> +    "mov"#m" 2*"#stride"+"#in", "#c"\n\t"\
> +    "mov"#m" 3*"#stride"+"#in", "#d"\n\t"
> 
> -#define STORE4(stride,out,a,b,c,d)\
> -    "movq "#a", 0*"#stride"+"#out"\n\t"\
> -    "movq "#b", 1*"#stride"+"#out"\n\t"\
> -    "movq "#c", 2*"#stride"+"#out"\n\t"\
> -    "movq "#d", 3*"#stride"+"#out"\n\t"
> +#define STORE4(m,stride,out,a,b,c,d)\
> +    "mov"#m" "#a", 0*"#stride"+"#out"\n\t"\
> +    "mov"#m" "#b", 1*"#stride"+"#out"\n\t"\
> +    "mov"#m" "#c", 2*"#stride"+"#out"\n\t"\
> +    "mov"#m" "#d", 3*"#stride"+"#out"\n\t"
> 
>  /* in/out: mma=mma+mmb, mmb=mmb-mma */
>  #define SUMSUB_BA( a, b ) \
> Index: libavcodec/i386/dsputilenc_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputilenc_mmx.c	(Revision 14469)
> +++ libavcodec/i386/dsputilenc_mmx.c	(Arbeitskopie)
> @@ -1041,11 +1041,11 @@
>          "movq %%mm7, 96(%1)             \n\t"\
>  \
>          TRANSPOSE4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm7)\
> -        STORE4(8,  0(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> +        STORE4(q, 8,  0(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
>  \
>          "movq 96(%1), %%mm7             \n\t"\
>          TRANSPOSE4(%%mm4, %%mm5, %%mm6, %%mm7, %%mm0)\
> -        STORE4(8, 64(%1), %%mm4, %%mm7, %%mm0, %%mm6)\
> +        STORE4(q, 8, 64(%1), %%mm4, %%mm7, %%mm0, %%mm6)\
>  \
>          : "=r" (sum)\
>          : "r"(temp)\
> @@ -1059,7 +1059,7 @@
>          "movq %%mm7, 96(%1)             \n\t"\
>  \
>          TRANSPOSE4(%%mm0, %%mm1, %%mm2, %%mm3, %%mm7)\
> -        STORE4(8, 32(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
> +        STORE4(q, 8, 32(%1), %%mm0, %%mm3, %%mm7, %%mm2)\
>  \
>          "movq 96(%1), %%mm7             \n\t"\
>          TRANSPOSE4(%%mm4, %%mm5, %%mm6, %%mm7, %%mm0)\
> @@ -1067,7 +1067,7 @@
>          "movq %%mm6, %%mm7              \n\t"\
>          "movq %%mm0, %%mm6              \n\t"\
>  \
> -        LOAD4(8, 64(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> +        LOAD4(q, 8, 64(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
>  \
>          HADAMARD48\
>          "movq %%mm7, 64(%1)             \n\t"\
> @@ -1083,8 +1083,8 @@
>          "paddusw %%mm1, %%mm0           \n\t"\
>          "movq %%mm0, 64(%1)             \n\t"\
>  \
> -        LOAD4(8,  0(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> -        LOAD4(8, 32(%1), %%mm4, %%mm5, %%mm6, %%mm7)\
> +        LOAD4(q, 8,  0(%1), %%mm0, %%mm1, %%mm2, %%mm3)\
> +        LOAD4(q, 8, 32(%1), %%mm4, %%mm5, %%mm6, %%mm7)\
>  \
>          HADAMARD48\
>          "movq %%mm7, (%1)               \n\t"\

ok

[...]

> @@ -467,7 +469,256 @@
>  DECLARE_FUNCTION(3, 2)
>  DECLARE_FUNCTION(3, 3)
>  
> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
> +{
> +    DECLARE_ALIGNED_16(int16_t, temp[64]);
> +    asm volatile(
> +    LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
> +    TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
> +    STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
> +
> +    LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
> +    TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
> +    STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)

it is still transposing the data at the begin of functions.
I thought you transposed the scantables ...

[...]
> +    :
> +    : "r"(block), "m"(temp[0])
> +    : "memory"
> +    );
> +
> +    asm volatile(

why is this asm () block splited?

[...]
> +    STORE4(q,0x10,0x40%1,%%mm4,%%mm7,%%mm0,%%mm6)
> +    :
> +    : "r"(block), "m"(temp[0]), "m"(ff_pw_4)
> +    : "memory"
> +    );
> +
> +    asm volatile(
> +    "movq 0x30%3,  %%mm1\n\t" /* b[3] */
> +    TRANSFORM_4X8_COL_H1
> +    (
> +        q,q,
> +        0x00%3,0x10%3,0x20%3,0x40%3,0x70%3,

this store and later load seems redundant
and the asm should not be split

[...]
> +    STORE4(dqa,0x10,0x00(%0),%%xmm0,%%xmm5,%%xmm7,%%xmm3)
> +    STORE4(dqa,0x10,0x40(%0),%%xmm6,%%xmm4,%%xmm2,%%xmm1)
> +    TRANSFORM_8X4_ROW_H1
> +    (
> +        dqa,dqa,
> +        0x00(%0),0x20(%0),0x40(%0),0x70(%0),

some of these stores and loads seem redundant

[...]
> +void ff_vc1dsp_init_sse2(DSPContext* dsp, AVCodecContext *avctx) {
> +    if(!(mm_flags & MM_SSE2))
> +        return;
> +
> +    dsp->vc1_inv_trans_8x8 = vc1_inv_trans_8x8_sse2;
> +    dsp->vc1_inv_trans_4x8 = vc1_inv_trans_4x8_sse2;
> +    dsp->vc1_inv_trans_8x4 = vc1_inv_trans_8x4_sse2;
> +}

are all of the SSE2 variants faste than mmx?

> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at mplayerhq.hu
> https://lists.mplayerhq.hu/mailman/listinfo/ffmpeg-devel

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080730/d852ece7/attachment.pgp>