[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions

Thu Nov 15 21:58:39 CET 2007

Michael Niedermayer a ?crit :
>> +static void vc1_put_hor_16b_shift2_mmx(uint8_t *dst, long int stride,
>> +                                       const int16_t *src, int rnd)
>> +{
>> +    int h = 8;
>> +    src -= 1;
>> +
>> +    asm volatile(
>> +        LOAD_ROUNDER_MMX("%4")
>> +        "1:                                \n\t"
>> +        "movq      2*0+0(%1), %%mm1        \n\t"
>> +        "movq      2*0+8(%1), %%mm2        \n\t"
>> +        "movq      2*1+0(%1), %%mm3        \n\t"
>> +        "movq      2*1+8(%1), %%mm4        \n\t"
>> +        "paddsw    2*3+0(%1), %%mm1        \n\t"
>> +        "paddsw    2*3+8(%1), %%mm2        \n\t"
>> +        "paddsw    2*2+0(%1), %%mm3        \n\t"
>> +        "paddsw    2*2+8(%1), %%mm4        \n\t"
>> +        "psubsw    %%mm3, %%mm1            \n\t"
>> +        "psubsw    %%mm4, %%mm2            \n\t"
>> +        /* Multiplying by 9 here overflows */
>> +        "psllw     $3, %%mm3               \n\t"
>> +        "psllw     $3, %%mm4               \n\t"
>> +        "psubsw    %%mm1, %%mm3            \n\t"
>> +        "psubsw    %%mm2, %%mm4            \n\t"
> 
> what overflows here?
> also please replace all p*sw by p*w if saturation happens then your code
> is buggy

Allow me a rather long explanation (which should prove I'm wrong anyway)...

This is the horizontal pass of the bicubic filter when there is an
initial vertical pass beforehand. From the code, you probably have
noticed the filters coefficients are [-1 9 9 1] or [-3 18 53 -4] with a
scaling depending on the pair of filters used.

Let's start from the first pass then, the vertical one. Imagine the
input values:
255 255 255 255
255 255 255 255
255 255 255 255
255 255 255 255
The filters applied can be [-3 18 53 -4]/2? or [-1 9 9 -1]/2?, so
ignoring rounding, output would be: 2040 2040 2040 2040

Now if we apply the filter in the above function, considering input:
p1 p2 p3 p4 which are int16_t values
If we were to use a multiplication, that would give:
-(p1+p4)+(p2+p3)*9
The intermediate result (p2+p3)*9 with the above test case is
(2040+2040)*9=36720, which doesn't fit in an int16_t value. Other values
in the expression are also signed, so we can't use uint16_t arithmetics
either. My code was a feeble attempt at mitigating this.  I made an
identical and equally wrong comment for the other functions

Discovering this, I tried to prescale the input values, which obviously
fail. To sum it up, it seems to me 17bits arithmetics are needed here
(contrary to what the document I have was affirming). I don't have the
SMPTE 421M official documentation, but from what I saw, it now makes me
think I can scrap my whole code for the horizontal pass. I don't see how
to overcome this except by using pmaddwd.

>> +            return;
>> +        }
>> +        else { /* No horizontal filter, output 8 lines to dst */
>> +            vc1_put_shift_8bits[vmode](dst, src, stride, 1-rnd, stride);
>> +            return;
>> +        }
> 
> the return can be factored out of teh if/else

Ok.

Best regards,
-- 
Christophe GISQUET