[MPlayer-dev-eng] [PATCH] VF Overlay

Sat Aug 8 00:25:11 CEST 2009

Reimar,

Thx for this nice review.
I've updated many parts of the patch with your comments and'll post an 
updated patch asap.

Thoug, I'm not at all original patch author, hence some parts
of it are black magic to me. All I did was removing some controversial
parts and fixing some stuff.

Also maybe you can enlight me on some points :

> I guess this causes all kind of issues when someone adds multiple -vf
> overlay etc.?

After a few thoughts, I've removed this array of overlays feature.
This was pretty useless.

> Actually you might be able to just call free_overlay_data in the fail
> case, but you'd have to carefully verify that.
> 
>> +        shmdt((uint8_t *)priv->lockbyte);
>> +        priv->lockbyte = 0;
> 
> Why the cast? And I am in favour of using NULL for pointers instead of
> 0.

It fixes a GCC warning AFAICT.

>> +#define check_opaque(type) \
>> +    if (*(type*)(p + x)) { \
>> +        if (slice_y1 == -2) \
>> +            slice_y1 = y; \
>> +        else \
>> +            slice_y2 = y; \
>> +        x = row_stride; \
>> +        break; \
>> +    }
>> +
>> +    for (y = 0; y < h; y++) {
>> +        for (x = 0; x < row_stride - 7; x += 8)
>> +            check_opaque(uint64_t);
>> +        for (; x < row_stride - 3; x += 4)
>> +            check_opaque(uint32_t);
>> +        for (; x < row_stride - 1; x += 2)
>> +            check_opaque(uint16_t);
> 
> I find it very hard to imagine that the extra uint32_t case gains even the
> slightest bit of speed.

What does that mean ?

> Also the macro is again used to hide code duplication, not sure if there
> is a nice way to avoid it, but it should at least be tried.

Isn't that the goal of macros (avoiding code duplication) ?

>> +    i = priv->h != priv->mpi_h ? 4 : 0;
>> +    do {
>> +        dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
>> +        dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);
> 
> Is FFALIGN appropriate for any of these? A comment what exactly it
> calculates would be good IMO.

Can't see where ...
Also, I unfortunately have no idea what it tries to calculate.

>> +    for (i = priv->w * rh; i>0; i--, p_bgr32 += 4, p_alpha++, p_bgr24 += 3) {
>> +        /* Moving 32 bits is faster than 3 separate assignments (or one 16
>> +         * bit and and one 8 bit move). The BGR24 buffer has one extra byte
>> +         * allocated to prevent an overrun.
>> +         */
>> +        *(uint32_t *)p_bgr24 = *(uint32_t *)p_bgr32;
> 
> You can't do that, p_bgr24 is not aligned.

So, am I suppose to declare p_bgr24 as:

DECLARE_ALIGNED(32, uint8_t, *p_bgr24);

instead of:

uint8_t *p_bgr24

Is that so ?

>> +    /* Dest is YV12 buffers offset to top of slice. */
>> +    dst[0] = priv->y + (dst_y * priv->mpi_stride);
>> +    dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
>> +    dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);
> 
> I think this should be (dst_y >> 1) * dst_strides[1]
> because
> 1) dst_strides[1] and [2] are calculated anyway, so why not use it,
> it gives the thing a proper name
> 2) While it is a given here, it removes the assumption of dst_y and
> mpi_stride being divisible by two (which is necessary to merge the two
>>> 1 into one >> 2) and fewer assumptions are always good.

Can you explain/detail this a bit more.
I'm afraid being lost with this hunk.

>> +    asm volatile(
>> +        "pxor %%mm7, %%mm7\n\t"                // zero out %mm7
>> +        "pcmpeqb %%mm4, %%mm4\n\t"             // %mm4 = 255's
>> +        "movq (%3), %%mm5\n\t"        // %mm5 = alpha
>> +        "cmp $255, %4\n\t"           // don't apply layer alpha if it's 100% opaque
>> +        "je 42f\n\t"
> 
> Just use 0, not 42, it's needlessly confusing.

Note that I have absolutely no understanding of ASM, being MMX or not.
I just tried to replace 42 by 0, as suggested and MPlayer just crashes.

> I also think it would be far more readable if the operands were nicely
> aligned.
> 
>> +        :  "+r" (dst_byte),             // %0
>> +        "+r" (dst_alpha)             // %1
> 
> I think most likely these must be early-clobber (&)

What does that mean ?

>> +        for (x = 0; x < (rw & ~7); x += 8)
> 
> Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
> should be the same everywhere.

Where have you see - 7 ?

>> +            *(dst + x) = blend_byte(*(src+x), *(overlay+x), *(alpha+x));
> 
> I'd say it looks much nicer like this:
> 
> dst[x] = blend_byte(src[x], overlay[x], alpha[x]);
> 
>> +    int i, y, q = w / 8, r = w % 8;
> 
> Ugh, why use division?

What'd you prefer ? w >> 3 ?

>> +                : "+r" (dst),         // %0
>> +                "+r" (src),           // %1
>> +                "+r" (overlay),       // %2
>> +                "+r" (alpha)          // %3
> 
> Probably early-clobber?

Same question.

Anyway, thx for the review.

Ben