[FFmpeg-devel] [PATCH 1/2] Optimization of AC3 floating point decoder for MIPS

Vitor Sessak vitor1001 at gmail.com
Wed Jul 4 15:55:56 CEST 2012


On 06/26/2012 01:13 PM, Nedeljko Babic wrote:
> FFT in MIPS implementation is working iteratively instead
>   of "recursively" calling functions for smaller FFT sizes.
> Some of DSP and format convert utils functions are also optimized.

> +#else
> +static void ff_imdct_half_mips(FFTContext *s, FFTSample *output, const FFTSample *input)
> +{
> +    int k, n8, n4, n2, n, j,j2;
> +    const uint16_t *revtab = s->revtab;
> +    const FFTSample *tcos = s->tcos;
> +    const FFTSample *tsin = s->tsin;
> +    const FFTSample *in1, *in2;
> +    const FFTSample *in3, *in4;
> +    FFTSample temp1, temp2, temp3, temp4;
> +    FFTSample temp5, temp6, temp7, temp8;
> +
> +    FFTSample temp11, temp12, temp13, temp14;
> +    FFTSample temp15, temp16, temp17, temp18;
> +
> +    FFTComplex *z = (FFTComplex *)output;
> +
> +    n = 1 << s->mdct_bits;
> +    n2 = n >> 1;
> +    n4 = n >> 2;
> +    n8 = n >> 3;
> +
> +    /* pre rotation */
> +    in1 = input;
> +    in2 = input + n2 - 1;
> +    in3 = input + 2;
> +    in4 = input + n2 - 3;
> +
> +    for(k = 0; k < n4; k+=2) {
> +        j=revtab[k];
> +        j2=revtab[k+1];
> +
> +        temp1=*in2 * tcos[k];
> +        temp2=*in1 * tsin[k];
> +        temp3=*in2 * tsin[k];
> +        temp4=*in1 * tcos[k];
> +
> +        temp5=*in4 * tcos[k+1];
> +        temp6=*in3 * tsin[k+1];
> +        temp7=*in4 * tsin[k+1];
> +        temp8=*in3 * tcos[k+1];
> +
> +        z[j].re=temp1-temp2;
> +        z[j].im=temp3+temp4;
> +
> +        z[j2].re=temp5-temp6;
> +        z[j2].im=temp7+temp8;
> +
> +        in1 += 4;
> +        in3 += 4;
> +        in2 -= 4;
> +        in4 -= 4;
> +    }
> +    s->fft_calc(s, z);
> +
> +    /* post rotation + reordering */
> +    for(k = 0; k < n8; k+=2) {
> +        temp1 = z[n8 - k - 1].im * tsin[n8 - k - 1];
> +        temp2 = z[n8 - k - 1].re * tcos[n8 - k - 1];
> +        temp3 = z[n8 - k - 1].im * tcos[n8 - k - 1];
> +        temp4 = z[n8 - k - 1].re * tsin[n8 - k - 1];
> +
> +        temp5 = z[n8 + k].im * tsin[n8 + k];
> +        temp6 = z[n8 + k].re * tcos[n8 + k];
> +        temp7 = z[n8 + k].im * tcos[n8 + k];
> +        temp8 = z[n8 + k].re * tsin[n8 + k];
> +
> +        temp11 = z[n8 - k - 2].im * tsin[n8 - k - 2];
> +        temp12 = z[n8 - k - 2].re * tcos[n8 - k - 2];
> +        temp13 = z[n8 - k - 2].im * tcos[n8 - k - 2];
> +        temp14 = z[n8 - k - 2].re * tsin[n8 - k - 2];
> +        temp15 = z[n8 + k + 1].im * tsin[n8 + k + 1];
> +        temp16 = z[n8 + k + 1].re * tcos[n8 + k + 1];
> +        temp17 = z[n8 + k + 1].im * tcos[n8 + k + 1];
> +        temp18 = z[n8 + k + 1].re * tsin[n8 + k + 1];
> +
> +        z[n8 - k - 1].re = temp1 - temp2;
> +        z[n8 - k - 1].im = temp7 + temp8;
> +        z[n8 + k].re = temp5 - temp6;
> +        z[n8 + k].im = temp3 + temp4;
> +
> +        z[n8 - k - 2].re = temp11 - temp12;
> +        z[n8 - k - 2].im = temp17 + temp18;
> +        z[n8 + k + 1].re = temp15 - temp16;
> +        z[n8 + k + 1].im = temp13 + temp14;
> +    }
> +}
> +#endif /* HAVE_INLINE_ASM */

Hmm, what is the point of this chunk? If you benchmarked that the C code
is faster when hand-unrolled, that should also hold for other archs, and
then you should unroll the common code instead...
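
To make the point concrete, here is a minimal sketch of the post-rotation
loop in its rolled form next to a 2x hand-unrolled form like the one in the
patch. The names Cplx, rotate_pair, post_rotate_rolled,
post_rotate_unrolled2 and unrolled_matches_rolled are made up for
illustration and are not FFmpeg API, and the table values are arbitrary
rather than real twiddle factors. The only thing it shows is that both
loops compute the same result, so any speed difference comes from the
unrolling itself and is not MIPS-specific.

```c
#include <assert.h>

typedef struct { float re, im; } Cplx;   /* hypothetical stand-in for FFTComplex */

/* Rotate one mirrored pair (lo, hi) in place, swapping the imaginary parts
 * between the two slots as the post-rotation + reordering step does. */
static inline void rotate_pair(Cplx *z, const float *tcos, const float *tsin,
                               int lo, int hi)
{
    float r0 = z[lo].im * tsin[lo] - z[lo].re * tcos[lo];
    float i1 = z[lo].im * tcos[lo] + z[lo].re * tsin[lo];
    float r1 = z[hi].im * tsin[hi] - z[hi].re * tcos[hi];
    float i0 = z[hi].im * tcos[hi] + z[hi].re * tsin[hi];
    z[lo].re = r0; z[lo].im = i0;
    z[hi].re = r1; z[hi].im = i1;
}

/* Rolled form: one pair per iteration. */
static void post_rotate_rolled(Cplx *z, const float *tcos, const float *tsin,
                               int n8)
{
    for (int k = 0; k < n8; k++)
        rotate_pair(z, tcos, tsin, n8 - k - 1, n8 + k);
}

/* 2x hand-unrolled form: two pairs per iteration, so n8 must be even. */
static void post_rotate_unrolled2(Cplx *z, const float *tcos, const float *tsin,
                                  int n8)
{
    assert(n8 % 2 == 0);
    for (int k = 0; k < n8; k += 2) {
        rotate_pair(z, tcos, tsin, n8 - k - 1, n8 + k);
        rotate_pair(z, tcos, tsin, n8 - k - 2, n8 + k + 1);
    }
}

/* Returns 1 if both forms give identical output on a small test vector.
 * All inputs are exactly representable, so the float math is exact. */
static int unrolled_matches_rolled(void)
{
    enum { N8 = 8, N4 = 16 };
    Cplx a[N4], b[N4];
    float tcos[N4], tsin[N4];

    for (int i = 0; i < N4; i++) {
        tcos[i] = 0.25f * (float)i;        /* arbitrary, not real twiddles */
        tsin[i] = 0.5f * (float)(i + 1);
        a[i].re = b[i].re = (float)i;
        a[i].im = b[i].im = (float)(N4 - i);
    }
    post_rotate_rolled(a, tcos, tsin, N8);
    post_rotate_unrolled2(b, tcos, tsin, N8);

    for (int i = 0; i < N4; i++)
        if (a[i].re != b[i].re || a[i].im != b[i].im)
            return 0;
    return 1;
}
```

If the unrolled variant really is faster, the same change could go into the
generic C function, or be left to the compiler's loop unrolling, so that
every arch benefits.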

> +#if HAVE_INLINE_ASM
> +static void float_to_int16_mips(int16_t *dst, const float *src, long len) {
> +    const float *src_end = src + len;
> +    int ret0, ret1, ret2, ret3, ret4, ret5, ret6, ret7;
> +    float src0, src1, src2, src3, src4, src5, src6, src7;

I'll leave reviewing this chunk to the people who know the format
conversion code best.

-Vitor

