[FFmpeg-devel] [PATCH] Optimization of AMR NB and WB decoders for MIPS

Mon May 28 20:21:33 CEST 2012

On 05/28/2012 05:40 PM, Babic, Nedeljko wrote:
> Hello,
>
> On 05/24/2012 10:47 PM, Vitor Sessak wrote:
>> Hello again,
>>
>> On 05/18/2012 03:47 PM, Nedeljko Babic wrote:
>>> AMR NB and WB decoders are optimized for MIPS architecture.
>>> Appropriate Makefiles are changed accordingly.
>>
>> I've taken a second look and I have a few more comments about the ASM
>> code (I will look tomorrow at the files you just sent).
>>
>>> +av_always_inline void ff_acelp_apply_order_2_transfer_function(float *out, const float *in,
>>> +                                              const float zero_coeffs[2],
>>> +                                              const float pole_coeffs[2],
>>> +                                              float gain, float mem[2], int n)
>>> +{
>>> +    /**
>>> +    * loop is unrolled eight times
>>> +    */
>>> +
>>> +    __asm__ __volatile__ (
>>> +        "lwc1   $f0,    0(%[mem])                                              \n\t"
>>> +        "blez   %[n],   ff_acelp_apply_order_2_transfer_function_end%=         \n\t"
>>> +        "lwc1   $f1,    4(%[mem])                                              \n\t"
>>> +        "lwc1   $f2,    0(%[pole_coeffs])                                      \n\t"
>>> +        "lwc1   $f3,    4(%[pole_coeffs])                                      \n\t"
>>> +        "lwc1   $f4,    0(%[zero_coeffs])                                      \n\t"
>>> +        "lwc1   $f5,    4(%[zero_coeffs])                                      \n\t"
>>> +
>>> +        "ff_acelp_apply_order_2_transfer_function_madd%=:                      \n\t"
>>> +
>>> +        "lwc1   $f6,    0(%[in])                                               \n\t"
>>
>>> +        "mul.s  $f9,    $f3,      $f1                                          \n\t"
>>> +        "mul.s  $f7,    $f2,      $f0                                          \n\t"
>>> +        "msub.s $f7,    $f7,      %[gain], $f6                                 \n\t"
>>> +        "sub.s  $f7,    $f7,      $f9                                          \n\t"
>>
>> Why not use "msub.s $f7,    $f7,      $f3, $f1"? Looking at the C
>> source, it looks like it could be done with just muls/msub/madd, with no
>> adds or subs.
>
> "msub.s $f7, $f7, $f3, $f1" in fact computes $f7 = $f1 * $f3 - $f7.
> Taking this into consideration, I don't see a way to implement an expression like
> gain * in[i] - pole_coeffs[0] * mem[0] - pole_coeffs[1] * mem[1] without changing the
> order in which the subtractions are executed, and if the order is changed, there is no
> guarantee that the result will be bit-exact with the result of the C code (and
> bit-exactness was the target here).

Agreed, even if we don't really care much about bit-exactness of float code.

>> If you unroll the inner loop once, you will need to do all these movs
>> only once per iteration (thus half as often).
> The inner loop is already unrolled once (this is not commented; I'll add a comment).
> Since the filter length can be 10 (or possibly 16) everywhere this function
> is called, unrolling it once more would make the assembly code more complex and longer.
> On the other hand, because of the way the loop is written, those moves are unnecessary and I will remove them.
> Thanks for pointing that out.

Nice!

-Vitor