[FFmpeg-devel] [PATCH] split-radix FFT
Loren Merritt
lorenm
Fri Aug 8 04:22:35 CEST 2008
On Thu, 7 Aug 2008, Michael Niedermayer wrote:
>
> I am not sure if it's worth it to simplify this, but I think if we don't attempt
> to mask off the high bits inside the function then the following might work:
>
> if(!(i & m)) return split_radix_permutation(i, m, inverse)<<1;
> m >>= 1;
> if(inverse == !(i&m)) return (split_radix_permutation(i, m, inverse)<<2) + 1;
> else return (split_radix_permutation(i, m, inverse)<<2) - 1;
done
> s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;
done
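For reference, a minimal self-contained sketch of the simplified recursion and the revtab fill quoted above. The n <= 2 base case, deriving m from n, and the build_revtab helper are assumptions for illustration, not necessarily what the patch contains. Multiplication stands in for the shifts because left-shifting a negative int is undefined in C.

```c
#include <assert.h>

/* Sketch of the simplified recursion. Assumed here: the n <= 2 base case
 * and deriving the mask m from n. The high bits of i are never masked off,
 * and the result may be negative or larger than n-1; the caller masks it.
 * inverse must be 0 or 1. */
static int split_radix_permutation(int i, int n, int inverse)
{
    int m;
    if (n <= 2)
        return i & 1;
    m = n >> 1;
    if (!(i & m))
        return split_radix_permutation(i, m, inverse) * 2;
    m >>= 1;
    if (inverse == !(i & m))
        return split_radix_permutation(i, m, inverse) * 4 + 1;
    else
        return split_radix_permutation(i, m, inverse) * 4 - 1;
}

/* Hypothetical helper: fills the table the same way as the quoted
 * s->revtab line. n must be a power of two, at least 4. */
static void build_revtab(unsigned short *revtab, int n, int inverse)
{
    int i;
    assert(n >= 4 && (n & (n - 1)) == 0);
    for (i = 0; i < n; i++)
        revtab[-split_radix_permutation(i, n, inverse) & (n - 1)] = i;
}
```

Negating the result and masking with n-1 is what folds the "- 1" branch back into the range 0..n-1, so the function itself never needs to mask.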
> It would be nice if the forced duplication could be limited to
> #ifndef CONFIG_SMALL unless it's significantly slower that way
I tried several combinations of recursive fft##n and/or re-rolling
pass{,_big} and/or re-rolling fft16 and/or removing pass or pass_big.
I can make it smaller and retain speed on core2 or on prescott, but not on
both CPUs at once.
k8 is equally happy with any version.
2^4   2^5   2^6   2^7    2^8    2^9   2^10    2^11    2^12  version  code_size
penryn:
142   417  1120  2837   6589  14935  33433   74609  164273  fft.00        4070
142   418  1132  2863   6662  15108  33844   74712  165418  fft.11        3189
142   417  1120  2838   6590  14938  46809  114069  282947  fft.10        3133
142   462  1231  3011   6982  15769  35297   78270  170920  fft.05        2572
142   462  1194  2997   6947  15780  48557  117461  289381  fft.01        2516
175   516  1396  3338   7673  17166  51432  123494  301169  fft.03        1652
180   542  1411  3414   7853  17452  51895  124489  304666  fft.04        1175
prescott:
423  1122  2854  7044  16366  37274  84451  187963  418948  fft.10        2414
423  1120  2855  7056  16390  37437  87674  196322  442723  fft.00        3176
420  1162  2972  7082  16693  38034  85973  189885  421885  fft.01        1745
466  1235  3149  7451  17410  39395  89301  202842  447159  fft.03        1162
472  1209  3130  7543  17438  40310  91024  206670  456248  fft.04         830
425  1227  3217  8032  18968  43605  98880  219511  487624  fft.11        2532
421  1286  3316  8082  19250  44563  99940  223647  495350  fft.05        1872
fft.00 is the previous patch; all versions were compiled with -Os.
fft.10 (that's removing pass_big) might be a decent compromise if you
don't care about a huge speed regression in cases that aren't currently
used by any audio codec.
>> + int n = 1<<s->nbits;
>> + int i;
>> + ff_fft_dispatch_3dn2(z, s->nbits);
>> asm volatile("femms");
>> + for(i=0; i<n; i+=2)
>> + FFSWAP(FFTSample, z[i].im, z[i+1].re);
>> }
>
> could you elaborate on why this FFSWAP pass is needed?
Intermediate results are not arrays of complex numbers; instead, reals and
imaginaries are grouped into blocks according to the SIMD register size. I
suppose I could merge the swap pass into the last fft pass, like I did for
sse.
This is only needed in plain fft. My next commit after split-radix will be
to update imdct to take unswapped output from fft.
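To illustrate the layout, a minimal sketch assuming a register width of two floats (as the stride-2 loop in the quoted 3DNow! code suggests), so each block of two complex numbers is stored as re0 re1 im0 im1. FFTSample, FFTComplex and FFSWAP are redefined locally to keep the example standalone, and unblock_pairs is a hypothetical name.

```c
#include <assert.h>

typedef float FFTSample;
typedef struct { FFTSample re, im; } FFTComplex;

/* Local stand-in for the swap macro, so the sketch compiles on its own. */
#define FFSWAP(type, a, b) do { type swap_tmp = (b); (b) = (a); (a) = swap_tmp; } while (0)

/* Convert blocks stored as [re0 re1 im0 im1] into interleaved complex
 * numbers [re0 im0 re1 im1]. Viewed through FFTComplex, the two middle
 * floats of each block are z[i].im and z[i+1].re, so one swap per pair
 * of elements is enough. n must be even. */
static void unblock_pairs(FFTComplex *z, int n)
{
    int i;
    assert(n % 2 == 0);
    for (i = 0; i < n; i += 2)
        FFSWAP(FFTSample, z[i].im, z[i + 1].re);
}
```

With a wider register (four floats for sse) the blocks would be re0..re3 im0..im3 and a single swap per pair would no longer be sufficient; this sketch covers only the two-float case.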
> position independent code right after a table that needs relocations ...
> no complaint, I just find it ironic
Blame GNU for allowing 64-bit textrels but not 32-bit textrels in x86_64
shared libs.
--Loren Merritt