[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Michael Niedermayer
michaelni
Sat Apr 12 14:15:19 CEST 2008
On Sat, Apr 12, 2008 at 04:29:28AM -0400, Alexander Strange wrote:
>
> On Apr 10, 2008, at 7:53 PM, Michael Niedermayer wrote:
>> On Thu, Apr 10, 2008 at 06:42:40PM -0400, Alexander Strange wrote:
>>>
>>> On Apr 6, 2008, at 12:14 PM, Michael Niedermayer wrote:
>>>> On Sun, Apr 06, 2008 at 12:19:58AM -0400, Alexander Strange wrote:
>>>>> This adds skal's sse2 idct and uses it as the xvid idct when available.
>>>>>
>>>>> I merged two shuffles into the permutation and changed the zero-skipping
>>>>> some - it's fastest in MMX and not really worth doing for the first three
>>>>> rows. Their right halves are still usually all zero, but adding the branch
>>>>> to check for it is a net loss. The best thing for speed would be switching
>>>>> IDCTs by counting the last nonzero coefficient position, but that's
>>>>> something for later.
>>>>>
>>>>> xvididctheader - makes a new header so I don't add any more extern
>>>>> declarations in .c files.
>>>>> sse2-permute - the new permutation; it might not have a specific enough
>>>>> name, but it should work as well for simpleidct as this if I can get
>>>>> back to that.
>>>>> sse2-xvid-idct.diff + idct_sse2_xvid.c - the IDCT
>>>>>
>>>>> The URLs in the header (copied from idct_mmx_xvid and the original nasm
>>>>> source) are broken at the moment, but archive.org URLs are longer than
>>>>> 80 characters, so I left them like they are.
>>>>>
>>>>> skal agreed it could be under LGPL in the last thread.
>>>> [...]
>>>>> #define SKIP_ROW_CHECK(src) \
>>>>> "movq "src", %%mm0 \n\t" \
>>>>> "por 8+"src", %%mm0 \n\t" \
>>>>> "packssdw %%mm0, %%mm0 \n\t" \
>>>>> "movd %%mm0, %%eax \n\t" \
>>>>> "testl %%eax, %%eax \n\t" \
>>>>> "jz 1f \n\t"
>>>>
>>>> You could try to check pairs of rows; this might be faster for some rows.
>>>> Also, the code should be interleaved, not form such nasty dependency
>>>> chains; you do have enough MMX registers.
>>>
>>> I think the movq breaks the dependence chain, at least on my CPU. But
>>> moving stuff above the branch is good - changed to check two rows at once
>>> for 3-6 and use MMX pmovmskb.
>>
>> Good, though not exactly what I meant.
>> What I meant was
>>
>> "movq "row1", %%mm1 \n\t" \
>> "por 8+"row1", %%mm1 \n\t" \
>> "movq "row2", %%mm2 \n\t" \
>> "por 8+"row2", %%mm2 \n\t" \
>> "por %%mm1, %%mm2 \n\t" \
>> "paddusb %%mm0, %%mm2 \n\t" \
>> "pmovmskb %%mm2, %%eax \n\t" \
>> "testl %%eax, %%eax \n\t" \
>> "jz 123f \n\t"
>>
>> for example maybe for rows 5 and 6
>>
>> Of course this is just a random idea and must be tested to see if it's any
>> good or not.
>>
>> Also maybe some speed could be gained by writing a few custom iLLM_PASS()
>> for some patterns of zero rows. Note, this does not need any checks anymore
>> as the rows have already been checked. But care must be taken here not
>> to "use" too much code cache. (and like everything else ignore it if it's
>> slower)
>
> I had looked into that, but it didn't seem to help - no change for xvid and
> it got a bit worse for mpeg2 (which I'm trying not to pessimize since the
> idct is already quite good for it). Rows 0-2 are always nonzero due to the
> rounder, and 3 and 7 seem to be set often, so I added one for 4-6 being
> zero, which seems like it worked. The structure is somewhat strange now,

Row 7 is a special case for MPEG-2 which should disappear with
CODEC_FLAG2_FAST. So I am tempted to suggest also adding row 7 into the mix
and just telling users to use CODEC_FLAG2_FAST (of course, only if the
nonzero row 7 disappears with it).

> but it gets rid of one branch and it's faster than the other ways I tried.
> (for instance, replacing two of the three test/jnz with an or/jnz, or
> removing the last zero skip for row 6)
>
> The added align helps some rule about branches not crossing 16-byte blocks
> - it seemed a little bit positive and only adds one nop instruction.
>
> I guess there might be some small advantage from adding ones for 5+6 being
> zero and 3-6 being zero, but probably not enough to be worth it, since the
> LLM pass is already quite fast compared to the dot products in the rows.
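
For reference, the logic that the row checks discussed above implement can be
written in plain C. This is only a sketch, not the MMX sequence from the patch:
rows_are_zero() is a hypothetical helper, and it assumes the usual layout of
64 int16_t coefficients with 8 per row (16 bytes per row, read as two 64-bit
halves like the movq / por 8+row pair).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 when rows first..last (inclusive) of an
 * 8x8 block of 16-bit coefficients contain only zeros, 0 otherwise.
 * All rows are ORed together first so that a single test decides the
 * branch, which is the point of checking several rows at once. */
static int rows_are_zero(const int16_t block[64], int first, int last)
{
    uint64_t acc = 0;

    assert(0 <= first && first <= last && last < 8);

    for (int row = first; row <= last; row++) {
        uint64_t half[2];                 /* left and right half of the row */
        memcpy(half, block + 8 * row, sizeof(half));
        acc |= half[0] | half[1];
    }
    return acc == 0;
}
```

With this, the "rows 4-6 are zero" case would be selected by
rows_are_zero(block, 4, 6), and a cheaper column pass could be taken when it
returns 1. Whether that is a win still has to be benchmarked.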
[...]
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c (revision 12670)
> +++ libavcodec/i386/dsputil_mmx.c (working copy)
> @@ -2126,7 +2127,12 @@
> }else if(idct_algo==FF_IDCT_CAVS){
> c->idct_permutation_type= FF_TRANSPOSE_IDCT_PERM;
> }else if(idct_algo==FF_IDCT_XVIDMMX){
> - if(mm_flags & MM_MMXEXT){
> + if (mm_flags & MM_SSE2){
> + c->idct_put= ff_idct_xvid_sse2_put;
> + c->idct_add= ff_idct_xvid_sse2_add;
> + c->idct = ff_idct_xvid_sse2;
> + c->idct_permutation_type= FF_SSE2_IDCT_PERM;
> + }else if(mm_flags & MM_MMXEXT){

Inconsistent style: if( vs. if (

[...]
> "psubsw %%xmm6, %%xmm5 \n\t" \
> "movdqa "ROW0", %%xmm4 \n\t" \
> "movdqa "ROW4", %%xmm6 \n\t" \
> "movdqa %%xmm2, "spill" \n\t" \
> "movdqa %%xmm4, %%xmm2 \n\t" \
> "psubsw %%xmm6, %%xmm4 \n\t" \
> "paddsw %%xmm2, %%xmm6 \n\t" \
> "movdqa %%xmm6, %%xmm2 \n\t" \
> "psubsw %%xmm7, %%xmm6 \n\t" \
> "paddsw %%xmm2, %%xmm7 \n\t" \
> "movdqa %%xmm4, %%xmm2 \n\t" \
> "psubsw %%xmm5, %%xmm4 \n\t" \
> "paddsw %%xmm2, %%xmm5 \n\t" \
> "movdqa %%xmm5, %%xmm2 \n\t" \
> "psubsw %%xmm0, %%xmm5 \n\t" \
> "paddsw %%xmm2, %%xmm0 \n\t" \
> "movdqa %%xmm4, %%xmm2 \n\t" \
> "psubsw %%xmm3, %%xmm4 \n\t" \
> "paddsw %%xmm2, %%xmm3 \n\t" \
> "movdqa "spill", %%xmm2 \n\t" \

#ifdef ARCH_X86_64
# define XMMS "%%xmm12"
#else
# define XMMS "%%xmm2"
#endif

s/%%xmm2/XMMS/

#ifndef ARCH_X86_64
"movdqa %%xmm2, "spill" \n\t" \
#endif
...
#ifndef ARCH_X86_64
"movdqa "spill", %%xmm2 \n\t" \
#endif

or a
MOV_ONLY_ON32" %%xmm2, ...

And I think something similar can be done with ROW*
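
A minimal sketch of the MOV_ONLY_ON32 idea, assuming the asm is assembled from
adjacent string literals as in the patch. Only XMMS and ARCH_X86_64 come from
the suggestion above; the two-argument form of MOV_ONLY_ON32, the SPILL
operand and butterfly_step are made up for illustration, and the instruction
sequence is shortened to one butterfly.

```c
#include <string.h>

#ifdef ARCH_X86_64
#  define XMMS "%%xmm12"                /* spare register, nothing to save */
#  define MOV_ONLY_ON32(src, dst) ""    /* expands to no instruction */
#else
#  define XMMS "%%xmm2"                 /* xmm2 has to be borrowed as scratch */
#  define MOV_ONLY_ON32(src, dst) "movdqa " src ", " dst " \n\t"
#endif

#define SPILL "(%0)"                    /* hypothetical spill slot operand */

/* The doubled percent signs are the inline-asm escape for a register name;
 * here the text is only built as a string, not assembled. */
static const char butterfly_step[] =
    MOV_ONLY_ON32("%%xmm2", SPILL)      /* save xmm2 on 32-bit only */
    "movdqa %%xmm4, " XMMS " \n\t"
    "psubsw %%xmm6, %%xmm4 \n\t"
    "paddsw " XMMS ", %%xmm6 \n\t"
    MOV_ONLY_ON32(SPILL, "%%xmm2");     /* restore xmm2 on 32-bit only */

/* Returns 1 if the generated text touches the spill slot, 0 otherwise. */
static int step_uses_spill(void)
{
    return strstr(butterfly_step, SPILL) != NULL;
}
```

Because the macro expands to an empty string on x86-64, the spill and reload
vanish there without any #ifndef inside the asm block itself.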
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are best at talking, realize last or never when they are wrong.