[FFmpeg-devel] [PATCH] VP8 V simple loopfilter in MMX/MMX2/SSE2

Thu Jul 1 19:13:23 CEST 2010

Hi,

On Thu, Jul 1, 2010 at 12:48 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Thu, Jul 1, 2010 at 12:41 PM, Pascal Massimino
> <pascal.massimino at gmail.com> wrote:
>> On Thu, Jul 1, 2010 at 8:10 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> +    mova      m0, [pb_80]
>> +    pxor      m2, m0
>> +    pxor      m4, m0
>> +    psubsb    m2, m4        ; m2=p1-q1 (signed) backup for below
>> +    pand      m3, [pb_FE]
>> +    psrlq     m3, 1         ; m3=FFABS(p1-q1)/2, this can be used signed
>>
>> I think you can avoid loading pb_FE by re-using pb_80 as:
>>
>> +    mova      m0, [pb_80]
>> +    pxor      m2, m0
>> +    pxor      m4, m0
>> +    psubsb    m2, m4        ; m2=p1-q1 (signed) backup for below
>> +    psrlq     m3, 1         ; m3=FFABS(p1-q1)/2, this can be used signed
>> +    pandn     m0
>
> That wouldn't work. The key thing here is that we're doing a
> byte-based right-shift, but there's no such psrlb instruction. So
> we're using psrlq (or any psrlX) after clearing the least significant
> bit, to prevent overflows in the next (lower) byte if it was set.

Whoops, I misunderstood twice.

So this is a great idea, which unfortunately is slightly slower
because we re-use mm0 (i.e. pb_80) later, so for pandn (which does a ~
on the dst reg), we'd need to mova pb_80 and then do pandn new_reg,
mm3. I've tested this and it's slightly slower because of the extra
move, most likely...
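
For anyone following along, here is a rough C model of the two orderings on
a single 64-bit word. It is only a sketch: the function names are made up
for illustration and are not in the patch, and the constants stand in for
pb_FE and ~pb_80. Both variants should match a plain per-byte shift, which
is why the suggestion is valid apart from the extra register copy.

```c
#include <stdint.h>

/* Reference: shift each of the 8 packed bytes right by one, independently. */
static uint64_t shr1_bytes_ref(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t b = (v >> (8 * i)) & 0xFF;   /* extract byte i */
        r |= (b >> 1) << (8 * i);             /* halve it and put it back */
    }
    return r;
}

/* As in the patch: clear each byte's LSB (pand pb_FE), then psrlq by 1. */
static uint64_t shr1_mask_then_shift(uint64_t v)
{
    return (v & UINT64_C(0xFEFEFEFEFEFEFEFE)) >> 1;
}

/* As suggested: psrlq by 1, then clear each byte's MSB (pandn with pb_80),
 * which drops the bit that leaked in from the next higher byte. */
static uint64_t shr1_shift_then_mask(uint64_t v)
{
    return (v >> 1) & ~UINT64_C(0x8080808080808080);
}
```

Without either mask, 0x0101 >> 1 gives 0x0080, i.e. the LSB of the upper
byte ends up as the MSB of the lower one.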

>> also: have you considered using something along the lines of :
>>
>> pxor m0,m0
>> pavgb(m3, m0)
>>
>> for computing fabs(p1-q1)/2 ?

Actually there's no sign to worry about, but the LSB is still a problem
(we calculate x>>1, not (x+1)>>1), so I'll give this some more thought to
see if we can use it...
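
To make the rounding difference concrete, a small per-byte C model (again
just a sketch, with hypothetical helper names): pavgb computes the unsigned
average rounded up, so averaging with zero gives (x+1)>>1 rather than the
truncating x>>1 the filter uses. The two disagree whenever the LSB of x is
set.

```c
#include <stdint.h>

/* Per-byte model of pavgb: unsigned average, rounded up: (a + b + 1) >> 1. */
static uint8_t pavgb_byte(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* What the current code computes per byte: a truncating halve, x >> 1. */
static uint8_t halve_trunc(uint8_t x)
{
    return (uint8_t)(x >> 1);
}
```

For even x both give the same value; for odd x, pavgb against zero is one
larger (e.g. x = 5 gives 3 instead of 2).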

Ronald