[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Sat Apr 12 14:15:19 CEST 2008

On Sat, Apr 12, 2008 at 04:29:28AM -0400, Alexander Strange wrote:
>
> On Apr 10, 2008, at 7:53 PM, Michael Niedermayer wrote:
>> On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>>>
>>> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>>>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>>>> This adds skal's sse2 idct and uses it as the xvid idct when available.
>>>>>
>>>>> I merged two shuffles into the permutation and changed the 
>>>>> zero-skipping
>>>>> some - it's fastest in MMX and not really worth doing for the first 
>>>>> three
>>>>> rows. Their right halfs are still usually all zero, but adding the 
>>>>> branch
>>>>> to check for it is a net loss. The best thing for speed would be
>>>>> switching
>>>>> IDCTs by counting the last nonzero coefficient position, but that's
>>>>> something for later.
>>>>>
>>>>> xvididctheader - makes a new header so I don't add any more extern
>>>>> declarations in .c files.
>>>>> sse2-permute - the new permutation; it might not have a specific enough
>>>>> name, but it should work as well for simpleidct as this if I can get 
>>>>> back
>>>>> to that.
>>>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>>>
>>>>> The URLs in the header (copied from idct_mmx_xvid and the original nasm
>>>>> source) are broken at the moment, but archive.org URLs are longer than 
>>>>> 80
>>>>> characters, so I left them like they are.
>>>>>
>>>>> skal agreed it could be under LGPL in the last thread.
>>>> [...]
>>>>> #define SKIP_ROW_CHECK(src)                 \
>>>>>   "movq     "src", %%mm0            \n\t" \
>>>>>   "por    8+"src", %%mm0            \n\t" \
>>>>>   "packssdw %%mm0, %%mm0            \n\t" \
>>>>>   "movd     %%mm0, %%eax            \n\t" \
>>>>>   "testl    %%eax, %%eax            \n\t" \
>>>>>   "jz 1f                            \n\t"
>>>>
>>>> You could try to check pairs of rows, this might be faster for some 
>>>> rows.
>>>> Also the code should be interleaved not form such nasty dependancy 
>>>> chains
>>>> you do have enogh mmx registers.
>>>
>>> I think the movq breaks the dependence chain, at least on my CPU. But
>>> moving stuff above the branch is good - changed to check two rows at once
>>> for 3-6 and use MMX pmovmskb.
>>
>> Good though not exactly what i meant.
>> What i meant was
>>
>> "movq     "row1", %%mm1           \n\t" \
>> "por    8+"row1", %%mm1           \n\t" \
>> "movq     "row2", %%mm2           \n\t" \
>> "por    8+"row2", %%mm2           \n\t" \
>> "por       %%mm1, %%mm2           \n\t" \
>> "paddusb   %%mm0, %%mm2           \n\t" \
>> "pmovmskb  %%mm2, %%eax           \n\t" \
>> "testl     %%eax, %%eax           \n\t" \
>> "jz 123f                            \n\t"
>>
>> for example maybe for rows 5 and 6
>>
>> of course this is just a random idea and must be tested if its any good
>> or not.
>>
>> Also maybe some speed could be gained by writing a few custom iLLM_PASS()
>> for some patterns of zero rows. Note, this does not need any checks 
>> anymore
>> as the rows have already been checked. But care must be taken here not
>> to "use" to much code cache. (and like everything else ignore it if its
>> slower)
>
> I had looked into that, but it didn't seem to help - no change for xvid and 
> it got a bit worse for mpeg2 (which I'm trying not to pessimize since the 
> idct is already quite good for it). Rows 0-2 are always nonzero due to the 
> rounder, and 3 and 7 seem to be set often, so I added one for 4-6 being 
> zero, which seems like it worked. The structure is somewhat strange now, 

7 is a special case for mpeg2 which should disapear with CODEC_FLAG2_FAST.
So iam tempted to suggest to also add 7 into the mix and just tell users
to use CODEC_FLAG2_FAST. (of course only if the non zero 7 disapears with
it)

> but it gets rid of one branch and it's faster than the other ways I tried. 
> (for instance, replacing two of the three test/jnz with an or/jnz, or 
> removing the last zero skip for row 6)
>
> The added align helps some rule about branches not crossing 16-byte blocks 
> - it seemed a little bit positive and only adds one nop instruction.
>
> I guess there might be some small advantage from adding ones for 5+6 being 
> zero and 3-6 being zero, but probably not enough to be worth it, since the 
> LLM pass is already quite fast compared to the dot products in the rows.

[...]

> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c	(revision 12670)
> +++ libavcodec/i386/dsputil_mmx.c	(working copy)
> @@ -2126,7 +2127,12 @@
>              }else if(idct_algo==FF_IDCT_CAVS){
>                      c->idct_permutation_type= FF_TRANSPOSE_IDCT_PERM;
>              }else if(idct_algo==FF_IDCT_XVIDMMX){
> -                if(mm_flags & MM_MMXEXT){
> +                if (mm_flags & MM_SSE2){
> +                    c->idct_put= ff_idct_xvid_sse2_put;
> +                    c->idct_add= ff_idct_xvid_sse2_add;
> +                    c->idct    = ff_idct_xvid_sse2;
> +                    c->idct_permutation_type= FF_SSE2_IDCT_PERM;
> +                }else if(mm_flags & MM_MMXEXT){

if( vs. if (

[...]
>     "psubsw   %%xmm6, %%xmm5          \n\t" \
>     "movdqa   "ROW0", %%xmm4          \n\t" \
>     "movdqa   "ROW4", %%xmm6          \n\t" \
>     "movdqa   %%xmm2, "spill"         \n\t" \
>     "movdqa   %%xmm4, %%xmm2          \n\t" \
>     "psubsw   %%xmm6, %%xmm4          \n\t" \
>     "paddsw   %%xmm2, %%xmm6          \n\t" \
>     "movdqa   %%xmm6, %%xmm2          \n\t" \
>     "psubsw   %%xmm7, %%xmm6          \n\t" \
>     "paddsw   %%xmm2, %%xmm7          \n\t" \
>     "movdqa   %%xmm4, %%xmm2          \n\t" \
>     "psubsw   %%xmm5, %%xmm4          \n\t" \
>     "paddsw   %%xmm2, %%xmm5          \n\t" \
>     "movdqa   %%xmm5, %%xmm2          \n\t" \
>     "psubsw   %%xmm0, %%xmm5          \n\t" \
>     "paddsw   %%xmm2, %%xmm0          \n\t" \
>     "movdqa   %%xmm4, %%xmm2          \n\t" \
>     "psubsw   %%xmm3, %%xmm4          \n\t" \
>     "paddsw   %%xmm2, %%xmm3          \n\t" \
>     "movdqa  "spill", %%xmm2          \n\t" \

#ifdef ARCH_X86_64
# define XMMS   "%%xmm12"
#else
# define XMMS   "%%xmm2"
#endif
s/%%xmm2/XMMS/

#ifndef ARCH_X86_64
"movdqa   %%xmm2, "spill"         \n\t" \
#endif
...
#ifndef ARCH_X86_64
"movdqa  "spill", %%xmm2          \n\t" \
#endif

or a 
MOV_ONLY_ON32" %%xmm2, ...

And i think something similar can be don with ROW*

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Thouse who are best at talking, realize last or never when they are wrong.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080412/a4dceb08/attachment.pgp>