[FFmpeg-devel] [PATCH] Optimization of AMR NB and WB decoders for MIPS
Vitor Sessak
vitor1001 at gmail.com
Mon May 28 20:21:33 CEST 2012
On 05/28/2012 05:40 PM, Babic, Nedeljko wrote:
> Hello,
>
> On 05/24/2012 10:47 PM, Vitor Sessak wrote:
>> Hello again,
>>
>> On 05/18/2012 03:47 PM, Nedeljko Babic wrote:
>>> AMR NB and WB decoders are optimized for MIPS architecture.
>>> Appropriate Makefiles are changed accordingly.
>>
>> I've taken a second look and I have a few more comments about the ASM
>> code (I will look tomorrow at the files you just sent).
>>
>>> +av_always_inline void ff_acelp_apply_order_2_transfer_function(float *out, const float *in,
>>> + const float zero_coeffs[2],
>>> + const float pole_coeffs[2],
>>> + float gain, float mem[2], int n)
>>> +{
>>> + /**
>>> + * loop is unrolled eight times
>>> + */
>>> +
>>> + __asm__ __volatile__ (
>>> + "lwc1 $f0, 0(%[mem]) \n\t"
>>> + "blez %[n], ff_acelp_apply_order_2_transfer_function_end%= \n\t"
>>> + "lwc1 $f1, 4(%[mem]) \n\t"
>>> + "lwc1 $f2, 0(%[pole_coeffs]) \n\t"
>>> + "lwc1 $f3, 4(%[pole_coeffs]) \n\t"
>>> + "lwc1 $f4, 0(%[zero_coeffs]) \n\t"
>>> + "lwc1 $f5, 4(%[zero_coeffs]) \n\t"
>>> +
>>> + "ff_acelp_apply_order_2_transfer_function_madd%=: \n\t"
>>> +
>>> + "lwc1 $f6, 0(%[in]) \n\t"
>>
>>> + "mul.s $f9, $f3, $f1 \n\t"
>>> + "mul.s $f7, $f2, $f0 \n\t"
>>> + "msub.s $f7, $f7, %[gain], $f6 \n\t"
>>> + "sub.s $f7, $f7, $f9 \n\t"
>>
>> Why not use "msub.s $f7, $f7, $f3, $f1"? Looking at the C
>> source, it looks like it could be done with just muls/msub/madd, with no
>> adds or subs.
>
> "msub.s $f7, $f7, $f3, $f1" in fact computes: $f7 = $f1 x $f3 - $f7
> Taking this into consideration, I don't see a way to implement an expression like
> gain * in[i] - pole_coeffs[0] * mem[0] - pole_coeffs[1] * mem[1] without changing the
> order in which the subtractions are executed, and if the order is changed, there is no guarantee
> that the result will be bit-exact with the result of the C code (and bit-exactness was
> the target here).
Agreed, even if we don't really care much about bit-exactness of float code.
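
To illustrate the ordering concern, here is a minimal C sketch (not from the
patch; the helper names and the input values are hypothetical, and it assumes
IEEE 754 single precision with the default round-to-nearest mode). Both helpers
subtract the same two products from gain * in, only in a different order, and
the rounded results differ in the last bit:

```c
/* Hypothetical helpers for illustration only; they are not part of the patch. */

/* Evaluates (gain*in - p0*m0) - p1*m1, the left-to-right order of the C code. */
static float filter_term_c_order(float gain, float in,
                                 float p0, float m0, float p1, float m1)
{
    /* volatile forces every intermediate to be rounded to float */
    volatile float acc = gain * in;
    volatile float t0  = p0 * m0;
    volatile float t1  = p1 * m1;
    acc = acc - t0;
    acc = acc - t1;
    return acc;
}

/* Same terms, but p1*m1 is subtracted first: (gain*in - p1*m1) - p0*m0. */
static float filter_term_swapped(float gain, float in,
                                 float p0, float m0, float p1, float m1)
{
    volatile float acc = gain * in;
    volatile float t0  = p0 * m0;
    volatile float t1  = p1 * m1;
    acc = acc - t1;
    acc = acc - t0;
    return acc;
}
```

With gain = 1, in = 2^24, p0 * m0 = -1 and p1 * m1 = 1, the first helper
computes 2^24 + 1, which rounds back to 2^24, and then returns 16777215; the
second reaches 16777215 exactly and then returns 16777216.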
>> If you unroll the inner loop once, you will need to do all these movs
>> only once per iteration (thus half as often).
> The inner loop is already unrolled once (this is not commented; I'll add a comment).
> Since the filter length can be 10 (or possibly 16) in all the places where this function
> is called, unrolling it once more would make the assembly code longer and more complex.
> On the other hand, because of the way the loop is written, those moves are unnecessary and I will remove them.
> Thanks for pointing that out.
Nice!
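
As a rough C sketch of why unrolling can make such moves unnecessary (this is
not the function discussed above, which is not shown in this message; the names
and the simple order-2 recursion are assumptions chosen for illustration):
when the loop body is unrolled by two, the two state variables simply swap
roles between the two halves, so no value has to be copied from one to the
other. The sketch assumes n is even, as with the lengths 10 and 16 mentioned.

```c
#include <stddef.h>

/* Hypothetical reference version: shifts the history with explicit moves. */
static void iir2_plain(float *out, const float *in,
                       float a0, float a1, float mem[2], size_t n)
{
    float m0 = mem[0], m1 = mem[1];
    for (size_t i = 0; i < n; i++) {
        float y = in[i] - a0 * m0 - a1 * m1;
        m1 = m0;            /* move: older sample takes the previous newest */
        m0 = y;             /* move: newest sample is the fresh output      */
        out[i] = y;
    }
    mem[0] = m0;
    mem[1] = m1;
}

/* Hypothetical unrolled version: n must be even. The result of the first
 * half overwrites m1 (no longer needed), the second half overwrites m0,
 * so the roles are back in place after each pair and no moves are needed. */
static void iir2_unrolled2(float *out, const float *in,
                           float a0, float a1, float mem[2], size_t n)
{
    float m0 = mem[0], m1 = mem[1];
    for (size_t i = 0; i + 1 < n; i += 2) {
        m1 = in[i]     - a0 * m0 - a1 * m1;   /* m1 now holds the newest */
        out[i] = m1;
        m0 = in[i + 1] - a0 * m1 - a1 * m0;   /* m0 is the newest again  */
        out[i + 1] = m0;
    }
    mem[0] = m0;
    mem[1] = m1;
}
```

Each output is computed with the same operations in the same order as in the
plain loop, so this particular transformation does not reorder the
subtractions.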
-Vitor