[FFmpeg-devel] [PATCH] SSE-optimized vector_clipf()

Sat Aug 8 23:55:53 CEST 2009

Michael Niedermayer wrote:
> On Sat, Aug 08, 2009 at 09:04:14AM +0200, Vitor Sessak wrote:
>> Michael Niedermayer wrote:
>>> On Thu, Aug 06, 2009 at 02:55:30AM +0200, Vitor Sessak wrote:
>>>> Vitor Sessak wrote:
>>>>> $subj, 10% speedup for twinvq decoding (but should be useful also for 
>>>>> AMR and wmapro).
>>>> err, I mean, attached.
>>>>
>>>> -Vitor
>>>>  dsputil.c         |   15 +++++++++++++++
>>>>  dsputil.h         |    3 ++-
>>>>  x86/dsputil_mmx.c |   34 ++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>> 8a95f5f2f3d267089056d6a571b2e6cc37d1569e  dsp_vector_clipf.diff
>>>> Index: libavcodec/dsputil.c
>>>> ===================================================================
>>>> --- libavcodec/dsputil.c	(revision 19598)
>>>> +++ libavcodec/dsputil.c	(working copy)
>>>> @@ -4093,6 +4093,20 @@
>>>>          dst[i] = src[i] * mul;
>>>>  }
>>>>  
>>>> +void vector_clipf_c(float *dst, float min, float max, int len) {
>>>> +    int i;
>>>> +    for (i=0; i < len; i+=8) {
>>>> +        dst[i    ] = av_clipf(dst[i    ], min, max);
>>>> +        dst[i + 1] = av_clipf(dst[i + 1], min, max);
>>>> +        dst[i + 2] = av_clipf(dst[i + 2], min, max);
>>>> +        dst[i + 3] = av_clipf(dst[i + 3], min, max);
>>>> +        dst[i + 4] = av_clipf(dst[i + 4], min, max);
>>>> +        dst[i + 5] = av_clipf(dst[i + 5], min, max);
>>>> +        dst[i + 6] = av_clipf(dst[i + 6], min, max);
>>>> +        dst[i + 7] = av_clipf(dst[i + 7], min, max);
>>>> +    }
>>>> +}
>>> this one could be tried by using integer math instead of floats
>>> (assuming IEEE floats of course)
>> How could this possibly be faster? It would just clip the sign, then the 
>> exponent, then the mantissa. It seems like much more work to me, unless 
>> I'm missing something.
> 
> we aren't comparing integers by first checking the first bit, then separately
> the next 8, and then again separately the last 23. Why should we here?

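Ok, the exponent is fine, but the sign needs special treatment: IEEE 
floats are sign-magnitude, so negative values compare in reverse order 
when read as integers. I benchmarked the following and it is slower than 
the float version: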

static inline float clipf_c_one(float a0,
                                 uint32_t amin, uint32_t amax,
                                 float aminf, float amaxf)
{
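     uint32_t ai   = *(uint32_t *)&a0;
     uint32_t sign = ai >> 31;
     /* negative values: flip the 31 magnitude bits so that the signed
      * integer order matches the float order */
     uint32_t a = ai ^ ((sign << 31) - sign);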

     if      ((signed)a < (signed)amin) return aminf;
     else if ((signed)a > (signed)amax) return amaxf;
     else                               return a0;
}

static void vector_clipf_c(float *dst, float min, float max, int len) {
     int i;

     uint32_t mini = *(uint32_t *)&min;
     uint32_t maxi = *(uint32_t *)&max;

     mini ^= ((mini >> 31) << 31) - (mini >> 31);
     maxi ^= ((maxi >> 31) << 31) - (maxi >> 31);

     for (i=0; i < len; i+=8) {
         dst[i    ] = clipf_c_one(dst[i    ], mini, maxi, min, max);
         dst[i + 1] = clipf_c_one(dst[i + 1], mini, maxi, min, max);
         dst[i + 2] = clipf_c_one(dst[i + 2], mini, maxi, min, max);
         dst[i + 3] = clipf_c_one(dst[i + 3], mini, maxi, min, max);
         dst[i + 4] = clipf_c_one(dst[i + 4], mini, maxi, min, max);
         dst[i + 5] = clipf_c_one(dst[i + 5], mini, maxi, min, max);
         dst[i + 6] = clipf_c_one(dst[i + 6], mini, maxi, min, max);
         dst[i + 7] = clipf_c_one(dst[i + 7], mini, maxi, min, max);
     }
}
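
For reference, the sign handling can also be checked in isolation. The 
following is only a minimal sketch, assuming IEEE-754 single precision 
and no NaNs in the input. float_order_key(), clipf_via_int() and 
vector_clipf_int() are hypothetical names used for illustration, not 
part of the patch, and memcpy() stands in for the pointer casts only to 
stay clear of strict-aliasing problems. It is not the code I benchmarked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map the bits of an IEEE-754 single to a signed integer whose ordering
 * matches the ordering of the floats (NaNs are not handled). */
static inline int32_t float_order_key(float f)
{
    uint32_t u, sign;
    int32_t key;

    memcpy(&u, &f, sizeof u);       /* read the raw bit pattern */
    sign = u >> 31;
    /* negative: flip the 31 magnitude bits, positive: leave unchanged */
    u ^= (sign << 31) - sign;
    memcpy(&key, &u, sizeof key);   /* reinterpret as signed */
    return key;
}

/* Clip one value using integer comparisons only. */
static inline float clipf_via_int(float a, float min, float max)
{
    int32_t k = float_order_key(a);

    if (k < float_order_key(min)) return min;
    if (k > float_order_key(max)) return max;
    return a;
}

/* Plain (not unrolled) loop, so len need not be a multiple of 8. */
static void vector_clipf_int(float *dst, float min, float max, int len)
{
    int i;

    assert(min <= max);
    for (i = 0; i < len; i++)
        dst[i] = clipf_via_int(dst[i], min, max);
}
```

With this mapping -0.0 sorts just below +0.0, which does not matter for 
clipping as long as min and max are not themselves zero with the 
opposite sign.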

-Vitor