[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Alexander Strange
astrange
Sat Apr 12 10:29:28 CEST 2008
On Apr 10, 2008, at 7:53 PM, Michael Niedermayer wrote:
> On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>>
>> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>>> This adds skal's sse2 idct and uses it as the xvid idct when
>>>> available.
>>>>
>>>> I merged two shuffles into the permutation and changed the zero-
>>>> skipping
>>>> some - it's fastest in MMX and not really worth doing for the
>>>> first three
>>>> rows. Their right halfs are still usually all zero, but adding
>>>> the branch
>>>> to check for it is a net loss. The best thing for speed would be
>>>> switching
>>>> IDCTs by counting the last nonzero coefficient position, but that's
>>>> something for later.
>>>>
>>>> xvididctheader - makes a new header so I don't add any more extern
>>>> declarations in .c files.
>>>> sse2-permute - the new permutation; it might not have a specific
>>>> enough
>>>> name, but it should work as well for simpleidct as this if I can
>>>> get back
>>>> to that.
>>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>>
>>>> The URLs in the header (copied from idct_mmx_xvid and the
>>>> original nasm
>>>> source) are broken at the moment, but archive.org URLs are longer
>>>> than 80
>>>> characters, so I left them like they are.
>>>>
>>>> skal agreed it could be under LGPL in the last thread.
>>> [...]
>>>> #define SKIP_ROW_CHECK(src) \
>>>> "movq "src", %%mm0 \n\t" \
>>>> "por 8+"src", %%mm0 \n\t" \
>>>> "packssdw %%mm0, %%mm0 \n\t" \
>>>> "movd %%mm0, %%eax \n\t" \
>>>> "testl %%eax, %%eax \n\t" \
>>>> "jz 1f \n\t"
>>>
>>> You could try to check pairs of rows, this might be faster for
>>> some rows.
>>> Also the code should be interleaved not form such nasty dependancy
>>> chains
>>> you do have enogh mmx registers.
>>
>> I think the movq breaks the dependence chain, at least on my CPU. But
>> moving stuff above the branch is good - changed to check two rows
>> at once
>> for 3-6 and use MMX pmovmskb.
>
> Good though not exactly what i meant.
> What i meant was
>
> "movq "row1", %%mm1 \n\t" \
> "por 8+"row1", %%mm1 \n\t" \
> "movq "row2", %%mm2 \n\t" \
> "por 8+"row2", %%mm2 \n\t" \
> "por %%mm1, %%mm2 \n\t" \
> "paddusb %%mm0, %%mm2 \n\t" \
> "pmovmskb %%mm2, %%eax \n\t" \
> "testl %%eax, %%eax \n\t" \
> "jz 123f \n\t"
>
> for example maybe for rows 5 and 6
>
> of course this is just a random idea and must be tested if its any
> good
> or not.
>
> Also maybe some speed could be gained by writing a few custom
> iLLM_PASS()
> for some patterns of zero rows. Note, this does not need any checks
> anymore
> as the rows have already been checked. But care must be taken here not
> to "use" to much code cache. (and like everything else ignore it if
> its
> slower)
I had looked into that, but it didn't seem to help - no change for
xvid and it got a bit worse for mpeg2 (which I'm trying not to
pessimize since the idct is already quite good for it). Rows 0-2 are
always nonzero due to the rounder, and 3 and 7 seem to be set often,
so I added one for 4-6 being zero, which seems like it worked. The
structure is somewhat strange now, but it gets rid of one branch and
it's faster than the other ways I tried. (for instance, replacing two
of the three test/jnz with an or/jnz, or removing the last zero skip
for row 6)
The added align helps some rule about branches not crossing 16-byte
blocks - it seemed a little bit positive and only adds one nop
instruction.
I guess there might be some small advantage from adding ones for 5+6
being zero and 3-6 being zero, but probably not enough to be worth it,
since the LLM pass is already quite fast compared to the dot products
in the rows.
> [...]
>> Index: libavcodec/dsputil.c
>> ===================================================================
>> --- libavcodec/dsputil.c (revision 12786)
>> +++ libavcodec/dsputil.c (working copy)
>> @@ -151,6 +151,8 @@
>> 0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>> };
>>
>> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};
>
> const
Fixed in both places.
> [...]
>> @@ -50,8 +51,6 @@
>> /* reference fdct/idct */
>> extern void fdct(DCTELEM *block);
>> extern void idct(DCTELEM *block);
>> -extern void ff_idct_xvid_mmx(DCTELEM *block);
>> -extern void ff_idct_xvid_mmx2(DCTELEM *block);
>> extern void init_fdct();
>>
>> extern void ff_mmx_idct(DCTELEM *data);
>
> unrelated?
Merged into the right diff. (I need to learn git for this...)
> [...]
>> @@ -158,6 +158,8 @@
>> 0x32, 0x3A, 0x36, 0x3B, 0x33, 0x3E, 0x37, 0x3F,
>> };
>>
>> +static uint8_t idct_sse2_row_perm[8] = {0, 4, 1, 5, 2, 6, 3, 7};
>> +
>> void idct_mmx_init(void)
>> {
>> int i;
>
> Duplicate, couldnt we use dsputil somehow? Not overly important ...
There's already the simpleidct mmx permutation there. I think it would
complicate dsputil.c, and neither of them should be changing anytime
soon.
> [...]
>>
>> DECLARE_ASM_CONST(16, int32_t, walkenIdctRounders[]) = {
>> 65536, 65536, 65536, 65536,
>> 3597, 3597, 3597, 3597,
>> 2260, 2260, 2260, 2260,
>> 1203, 1203, 1203, 1203,
>> 0, 0, 0, 0,
>> 120, 120, 120, 120,
>> 512, 512, 512, 512,
>> 512, 512, 512, 512
>> };
>
> The bottom 2 look redundant and the 0 looks unneeded
Fixed.
> [...]
>> #define iMTX_MULT(src, table, rounder, put) \
>> "movdqa "src", %%xmm3 \n\t" \
>> "movdqa %%xmm3, %%xmm0 \n\t" \
>
> try
> "src", %%xmm0
> %%xmm0, %%xmm3
> that way the next punpcklqdq does not depend on the movdqa
> just an idea ...
>
> or try to move the punpcklqdq after the pshufd
Core 2 wants three normal instructions between two multi-uop ones like
pshufd, so rescheduled for it. I guess it should be a little better
for other CPUs too.
>> "punpcklqdq %%xmm0, %%xmm0 \n\t" /* 0246 */\
>> "pshufd $0x11, %%xmm3, %%xmm1 \n\t" /* 4602 */\
>> "pshufd $0xBB, %%xmm3, %%xmm2 \n\t" /* 5713 */\
>> "punpckhqdq %%xmm3, %%xmm3 \n\t" /* 1357 */\
>> "pmaddwd "table", %%xmm0 \n\t" \
>> "pmaddwd 16+"table", %%xmm1 \n\t" \
>> "pmaddwd 32+"table", %%xmm2 \n\t" \
>> "pmaddwd 48+"table", %%xmm3 \n\t" \
>> "paddd %%xmm3, %%xmm2 \n\t" \
>
>> "paddd %%xmm1, %%xmm0 \n\t" \
>> rounder", %%xmm0 \n\t" \
>
> maybe it is faster to move rounder before the paddd, maybe not ...
I didn't see any change from it.
Updated:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xvididctheader.diff
Type: application/octet-stream
Size: 2514 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-permute.diff
Type: application/octet-stream
Size: 1341 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sse2-xvid-idct.diff
Type: application/octet-stream
Size: 1827 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: idct_sse2_xvid.c
Type: application/octet-stream
Size: 15020 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a8b78657/attachment-0003.obj>
More information about the ffmpeg-devel
mailing list