[MPlayer-cvslog] r35669 - trunk/libmpcodecs/vf_ass.c

Thu Dec 13 04:23:41 CET 2012

On Thu, Dec 13, 2012 at 2:43 AM, Reimar Döffinger
<Reimar.Doeffinger at gmx.de> wrote:
>
> On Wed, Dec 12, 2012 at 06:13:44PM +0100, upsuper wrote:
> > +DECLARE_ASM_CONST(16, uint32_t, SSE_32BIT_80H[4]) = { [0 ... 3] = 0x80 };
> > +DECLARE_ASM_CONST(16, uint32_t, SSE_32BIT_MAP[4]) = { [0 ... 3] = 0x102 };
>
> All-uppercase names generally and in MPlayer code are considered
> reserved for macros/defines.

I'll change them. I just found that in vf_fspp.c, asm constants were
defined in all-uppercase.

> Also I am quite certain that the [0 ... 3] syntax is a GNU extension,
> it really shouldn't be used in MPlayer ever (also not in switch/cases).
> I don't really think it improves readability in this case anyway.

I will fix it. I thought designated initializer is a GNU extension,
so there is no problem to use an extension here. However, I know now
that designated initializer has been included in C99, but this kind
of initializer is not.

>
> > +    CLEAN_XMM(7);
>
> A macro for one single use seems like overkill.
> A more serious (though mostly theoretical) issue is
> that there's no guarantee that the xmm7 register will
> still be clear by the time you get into the next asm block.
> Which is among the reasons why implementing all loops in asm
> tends to be a good idea.

I'll move it into every asm block.

>
> > +                : : [dst]   "r" (dst + i * stride),
> > +                    [alpha] "r" (alpha + i * outw),
> > +                    [src_y] "r" (src_y + i * outw),
> > +                    [src_u] "r" (src_u + i * outw),
> > +                    [src_v] "r" (src_v + i * outw),
> > +                    [j]     "r" (xmin),
> > +                    [xmax]  "g" (xmax),
> > +                    [f]     "r" (is_uyvy)
>
> This requires 7-8 registers.
> You'll have to add a check that that many are actually available.
> See HAVE_EBX_AVAILABLE and HAVE_EBP_AVAILABLE, on 32 bit your code
> will need both available.

Someone has reported that this broke the compilation on 32bit after
your reply. I've committed a fix to reduce the register usage here,
which should work.

>
> > +#if HAVE_SSE4
> > +        if (gCpuCaps.hasSSE4 && outw % 8 == 0)
> > +            vf->priv->render_frame = render_frame_yuv422_sse4;
>
> The width check is overkill, first having a C part to
> do the remainder is a good idea.
> However in this specific case, it is possible to simply
> make sure all buffers have enough extra padding that it
> does not matter if you run over the end of a line, then
> you can round the with up in the asm code.

Most normal videos have a width which is a multiple of 8, so this
check should not filter out most videos. However, I'd like to improve
this by the second method you mentioned, but it seems to require many
modifications, which may need a period of time to make.

> Also it would be nice to have a more baseline requirement
> than SSE4, I would assume SSE4 instructions or even SSE3
> are not really necessary to still get very close to the
> same speedup.

I have tried to avoid uses of SSE4 instructions as much as possible,
and it is a fact that most instructions used here are introduced by
SSE2. But I still found that there are some necessary instructions
that I have no idea to replace by older instructions which are listed
as follow:
* pmulld, introduced by SSE4, multiples 4 dword integers in parallel,
  which is required because 0x102 * 0xFF + 0x80 > 0xFFFF
* phaddd, introduced by SSSE3, adds packed integers horizontally,
  which is used to compute the average opacity of 2 adjacent pixels.
If you have any idea to avoid using these instructions here, please
tell me and I will modify the code.

--
Xidorn Quan
GnuPG fingerprint: 6F1E DF9A D250 7505 63E2  345E 7570 8D3F 7C9A 1209