[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Sat Apr 12 10:29:28 CEST 2008

On Apr 10, 2008, at 7:53 PM, Michael Niedermayer wrote:
> On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>>
>> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>>> This adds skal's sse2 idct and uses it as the xvid idct when  
>>>> available.
>>>>
>>>> I merged two shuffles into the permutation and changed the zero- 
>>>> skipping
>>>> some - it's fastest in MMX and not really worth doing for the  
>>>> first three
>>>> rows. Their right halfs are still usually all zero, but adding  
>>>> the branch
>>>> to check for it is a net loss. The best thing for speed would be
>>>> switching
>>>> IDCTs by counting the last nonzero coefficient position, but that's
>>>> something for later.
>>>>
>>>> xvididctheader - makes a new header so I don't add any more extern
>>>> declarations in .c files.
>>>> sse2-permute - the new permutation; it might not have a specific  
>>>> enough
>>>> name, but it should work as well for simpleidct as this if I can  
>>>> get back
>>>> to that.
>>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>>
>>>> The URLs in the header (copied from idct_mmx_xvid and the  
>>>> original nasm
>>>> source) are broken at the moment, but archive.org URLs are longer  
>>>> than 80
>>>> characters, so I left them like they are.
>>>>
>>>> skal agreed it could be under LGPL in the last thread.
>>> [...]
>>>> #define SKIP_ROW_CHECK(src)                 \
>>>>   "movq     "src", %%mm0            \n\t" \
>>>>   "por    8+"src", %%mm0            \n\t" \
>>>>   "packssdw %%mm0, %%mm0            \n\t" \
>>>>   "movd     %%mm0, %%eax            \n\t" \
>>>>   "testl    %%eax, %%eax            \n\t" \
>>>>   "jz 1f                            \n\t"
>>>
>>> You could try to check pairs of rows, this might be faster for  
>>> some rows.
>>> Also the code should be interleaved not form such nasty dependancy  
>>> chains
>>> you do have enogh mmx registers.
>>
>> I think the movq breaks the dependence chain, at least on my CPU. But
>> moving stuff above the branch is good - changed to check two rows  
>> at once
>> for 3-6 and use MMX pmovmskb.
>
> Good though not exactly what i meant.
> What i meant was
>
> "movq     "row1", %%mm1           \n\t" \
> "por    8+"row1", %%mm1           \n\t" \
> "movq     "row2", %%mm2           \n\t" \
> "por    8+"row2", %%mm2           \n\t" \
> "por       %%mm1, %%mm2           \n\t" \
> "paddusb   %%mm0, %%mm2           \n\t" \
> "pmovmskb  %%mm2, %%eax           \n\t" \
> "testl     %%eax, %%eax           \n\t" \
> "jz 123f                            \n\t"
>
> for example maybe for rows 5 and 6
>
> of course this is just a random idea and must be tested if its any  
> good
> or not.
>
> Also maybe some speed could be gained by writing a few custom  
> iLLM_PASS()
> for some patterns of zero rows. Note, this does not need any checks  
> anymore
> as the rows have already been checked. But care must be taken here not
> to "use" to much code cache. (and like everything else ignore it if  
> its
> slower)

I had looked into that, but it didn't seem to help - no change for  
xvid and it got a bit worse for mpeg2 (which I'm trying not to  
pessimize since the idct is already quite good for it). Rows 0-2 are  
always nonzero due to the rounder, and 3 and 7 seem to be set often,  
so I added one for 4-6 being zero, which seems like it worked. The  
structure is somewhat strange now, but it gets rid of one branch and  
it's faster than the other ways I tried. (for instance, replacing two  
of the three test/jnz with an or/jnz, or removing the last zero skip  
for row 6)

The added align helps some rule about branches not crossing 16-byte  
blocks - it seemed a little bit positive and only adds one nop  
instruction.

I guess there might be some small advantage from adding ones for 5+6  
being zero and 3-6 being zero, but probably not enough to be worth it,  
since the LLM pass is already quite fast compared to the dot products  
in the rows.

> [...]
>> Index: libavcodec/dsputil.c
>> ===================================================================
>> --- libavcodec/dsputil.c	(revision 12786)
>> +++ libavcodec/dsputil.c	(working copy)
>> @@ -151,6 +151,8 @@
>>         0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>> };
>>
>> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};
>
> const

Fixed in both places.

> [...]
>> @@ -50,8 +51,6 @@
>> /* reference fdct/idct */
>> extern void fdct(DCTELEM *block);
>> extern void idct(DCTELEM *block);
>> -extern void ff_idct_xvid_mmx(DCTELEM *block);
>> -extern void ff_idct_xvid_mmx2(DCTELEM *block);
>> extern void init_fdct();
>>
>> extern void ff_mmx_idct(DCTELEM *data);
>
> unrelated?

Merged into the right diff. (I need to learn git for this...)

> [...]
>> @@ -158,6 +158,8 @@
>>         0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>> };
>>
>> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};
>> +
>> void idct_mmx_init(void)
>> {
>>     int i;
>
> Duplicate, couldnt we use dsputil somehow? Not overly important ...

There's already the simpleidct mmx permutation there. I think it would  
complicate dsputil.c, and neither of them should be changing anytime  
soon.

> [...]
>>
>> DECLARE_ASM_CONST(16, int32_t, walkenIdctRounders[]) = {
>> 65536, 65536, 65536, 65536,
>>  3597,  3597,  3597,  3597,
>>  2260,  2260,  2260,  2260,
>>  1203,  1203,  1203,  1203,
>>     0,     0,     0,     0,
>>   120,   120,   120,   120,
>>   512,   512,   512,   512,
>>   512,   512,   512,   512
>> };
>
> The bottom 2 look redundant and the 0 looks unneeded

Fixed.

> [...]
>> #define iMTX_MULT(src, table, rounder, put) \
>>    "movdqa        "src", %%xmm3      \n\t" \
>>    "movdqa       %%xmm3, %%xmm0      \n\t" \
>
> try
> "src", %%xmm0
> %%xmm0, %%xmm3
> that way the next punpcklqdq does not depend on the movdqa
> just an idea ...
>
> or try to move the punpcklqdq after the pshufd

Core 2 wants three normal instructions between two multi-uop ones like  
pshufd, so rescheduled for it. I guess it should be a little better  
for other CPUs too.

>>    "punpcklqdq   %%xmm0, %%xmm0      \n\t" /* 0246 */\
>>    "pshufd   $0x11, %%xmm3, %%xmm1   \n\t" /* 4602 */\
>>    "pshufd   $0xBB, %%xmm3, %%xmm2   \n\t" /* 5713 */\
>>    "punpckhqdq   %%xmm3, %%xmm3      \n\t" /* 1357 */\
>>    "pmaddwd     "table", %%xmm0      \n\t" \
>>    "pmaddwd  16+"table", %%xmm1      \n\t" \
>>    "pmaddwd  32+"table", %%xmm2      \n\t" \
>>    "pmaddwd  48+"table", %%xmm3      \n\t" \
>>    "paddd        %%xmm3, %%xmm2      \n\t" \
>
>>    "paddd        %%xmm1, %%xmm0      \n\t" \
>>    rounder",     %%xmm0              \n\t" \
>
> maybe it is faster to move rounder before the paddd, maybe not ...

I didn't see any change from it.

Updated:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xvididctheader.diff
Type: application/octet-stream
Size: 2514 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-permute.diff
Type: application/octet-stream
Size: 1341 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-xvid-idct.diff
Type: application/octet-stream
Size: 1827 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: idct_sse2_xvid.c
Type: application/octet-stream
Size: 15020 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0003.obj>