[FFmpeg-devel] [PATCH 1/2] Optimization of AC3 floating point decoder for MIPS
Vitor Sessak
vitor1001 at gmail.com
Wed Jul 4 15:55:56 CEST 2012
On 06/26/2012 01:13 PM, Nedeljko Babic wrote:
> FFT in MIPS implementation is working iteratively instead
> of "recursively" calling functions for smaller FFT sizes.
> Some of DSP and format convert utils functions are also optimized.
> +#else
> +static void ff_imdct_half_mips(FFTContext *s, FFTSample *output, const FFTSample *input)
> +{
> + int k, n8, n4, n2, n, j,j2;
> + const uint16_t *revtab = s->revtab;
> + const FFTSample *tcos = s->tcos;
> + const FFTSample *tsin = s->tsin;
> + const FFTSample *in1, *in2;
> + const FFTSample *in3, *in4;
> + FFTSample temp1, temp2, temp3, temp4;
> + FFTSample temp5, temp6, temp7, temp8;
> +
> + FFTSample temp11, temp12, temp13, temp14;
> + FFTSample temp15, temp16, temp17, temp18;
> +
> + FFTComplex *z = (FFTComplex *)output;
> +
> + n = 1<< s->mdct_bits;
> + n2 = n>> 1;
> + n4 = n>> 2;
> + n8 = n>> 3;
> +
> + /* pre rotation */
> + in1 = input;
> + in2 = input + n2 - 1;
> + in3 = input + 2;
> + in4 = input + n2 - 3;
> +
> + for(k = 0; k< n4; k+=2) {
> + j=revtab[k];
> + j2=revtab[k+1];
> +
> + temp1=*in2 * tcos[k];
> + temp2=*in1 * tsin[k];
> + temp3=*in2 * tsin[k];
> + temp4=*in1 * tcos[k];
> +
> + temp5=*in4 * tcos[k+1];
> + temp6=*in3 * tsin[k+1];
> + temp7=*in4 * tsin[k+1];
> + temp8=*in3 * tcos[k+1];
> +
> + z[j].re=temp1-temp2;
> + z[j].im=temp3+temp4;
> +
> + z[j2].re=temp5-temp6;
> + z[j2].im=temp7+temp8;
> +
> + in1 += 4;
> + in3 += 4;
> + in2 -= 4;
> + in4 -= 4;
> + }
> + s->fft_calc(s, z);
> +
> + /* post rotation + reordering */
> + for(k = 0; k< n8; k+=2) {
> + temp1 = z[n8 - k - 1].im * tsin[n8 - k - 1];
> + temp2 = z[n8 - k - 1].re * tcos[n8 - k - 1];
> + temp3 = z[n8 - k - 1].im * tcos[n8 - k - 1];
> + temp4 = z[n8 - k - 1].re * tsin[n8 - k - 1];
> +
> + temp5 = z[n8 + k].im * tsin[n8 + k];
> + temp6 = z[n8 + k].re * tcos[n8 + k];
> + temp7 = z[n8 + k].im * tcos[n8 + k];
> + temp8 = z[n8 + k].re * tsin[n8 + k];
> +
> + temp11 = z[n8 - k - 2].im * tsin[n8 - k - 2];
> + temp12 = z[n8 - k - 2].re * tcos[n8 - k - 2];
> + temp13 = z[n8 - k - 2].im * tcos[n8 - k - 2];
> + temp14 = z[n8 - k - 2].re * tsin[n8 - k - 2];
> + temp15 = z[n8 + k + 1].im * tsin[n8 + k + 1];
> + temp16 = z[n8 + k + 1].re * tcos[n8 + k + 1];
> + temp17 = z[n8 + k + 1].im * tcos[n8 + k + 1];
> + temp18 = z[n8 + k + 1].re * tsin[n8 + k + 1];
> +
> + z[n8 - k - 1].re = temp1 - temp2;
> + z[n8 - k - 1].im = temp7 + temp8;
> + z[n8 + k].re = temp5 - temp6;
> + z[n8 + k].im = temp3 + temp4;
> +
> + z[n8 - k - 2].re = temp11 - temp12;
> + z[n8 - k - 2].im = temp17 + temp18;
> + z[n8 + k + 1].re = temp15 - temp16;
> + z[n8 + k + 1].im = temp13 + temp14;
> + }
> +}
> +#endif /* HAVE_INLINE_ASM */
Hmm, what is the point of this chunk? If you benchmarked that the C code
is faster when hand-unrolled, this should also be true for other archs,
and then you should unroll the common code instead...
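
To make the question concrete, here is a minimal sketch of how one could
check that an unrolled-by-two rotation loop is equivalent to the plain one
before benchmarking both on several archs. It is not the FFmpeg code: the
Complex type, the function names and the table contents are all made up for
illustration, and the loop is a simplified stand-in for the pre-rotation in
the patch.

```c
#include <stddef.h>

#define N 64  /* hypothetical number of complex points, must be even */

/* Stand-in for FFTComplex; an assumption for this sketch only. */
typedef struct { float re, im; } Complex;

/* Plain loop: one complex rotation per iteration. */
static void rotate_rolled(Complex *z, const float *in,
                          const float *tcos, const float *tsin, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        float re = in[2 * k], im = in[2 * k + 1];
        z[k].re = re * tcos[k] - im * tsin[k];
        z[k].im = re * tsin[k] + im * tcos[k];
    }
}

/* Hand-unrolled by two: same arithmetic, two rotations per iteration.
 * n must be even. */
static void rotate_unrolled2(Complex *z, const float *in,
                             const float *tcos, const float *tsin, size_t n)
{
    for (size_t k = 0; k < n; k += 2) {
        float re0 = in[2 * k],     im0 = in[2 * k + 1];
        float re1 = in[2 * k + 2], im1 = in[2 * k + 3];

        z[k].re     = re0 * tcos[k]     - im0 * tsin[k];
        z[k].im     = re0 * tsin[k]     + im0 * tcos[k];
        z[k + 1].re = re1 * tcos[k + 1] - im1 * tsin[k + 1];
        z[k + 1].im = re1 * tsin[k + 1] + im1 * tcos[k + 1];
    }
}

/* Runs both variants on the same synthetic input and returns the largest
 * absolute difference between their outputs. The table values are
 * arbitrary, not real twiddle factors. */
static float max_abs_diff(void)
{
    float in[2 * N], tcos[N], tsin[N];
    Complex a[N], b[N];
    float worst = 0.0f;

    for (size_t k = 0; k < N; k++) {
        in[2 * k]     = (float)k * 0.25f;
        in[2 * k + 1] = 1.0f - (float)k * 0.125f;
        tcos[k]       = 1.0f / (float)(k + 1);
        tsin[k]       = (float)k / (float)N;
    }

    rotate_rolled(a, in, tcos, tsin, N);
    rotate_unrolled2(b, in, tcos, tsin, N);

    for (size_t k = 0; k < N; k++) {
        float dre = a[k].re - b[k].re;
        float dim = a[k].im - b[k].im;
        if (dre < 0.0f) dre = -dre;
        if (dim < 0.0f) dim = -dim;
        if (dre > worst) worst = dre;
        if (dim > worst) worst = dim;
    }
    return worst;
}
```

If max_abs_diff() stays at or near zero, the two variants can be timed
against each other on each target. A speedup that shows up everywhere would
argue for unrolling the generic C version rather than keeping a MIPS-only
copy.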
> +#if HAVE_INLINE_ASM
> +static void float_to_int16_mips(int16_t *dst, const float *src, long len) {
> + const float *src_end = src + len;
> + int ret0, ret1, ret2, ret3, ret4, ret5, ret6, ret7;
> + float src0, src1, src2, src3, src4, src5, src6, src7;
I'll leave reviewing this chunk to people who know the format
conversion code best.
-Vitor