[Ffmpeg-devel] Fixed vs. Floating Point AAC
Michael Niedermayer
michaelni
Thu Mar 9 17:13:51 CET 2006
Hi
On Thu, Mar 09, 2006 at 09:30:11AM -0500, Rich Felker wrote:
[...]
> Why not compare Athlon? P4 is known to suck...it takes multiple cycles
> just for a bitshift. Even my K6 has 1-cycle MUL/IMUL.
unlikely
>
> > for the athlon the timings aren't clear from the docs i have, only that
> > 32*32->32 seems 1/4 and 32*32->64 worse than 1/6 if the high value is used
> > and FMUL >=1/4, also note fmul is direct path and imul is vector path, so imul
> > cannot execute together with anything else while fmul can
>
> Could you explain the p/q notation you're using for throughput?
1/q means that code which does nothing but independent (I)MULs would need
q cycles per (I)MUL, so 1/4 is 4 cycles each. But the docs weren't clear: they
don't contain throughput values, just latency, which is meaningless for us as
there's plenty of independent stuff. So i did some guessing and it seems i
misguessed a little, see the benchmarks at the end
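
To make the notation concrete, here is a small sketch of the cost model it implies. The function names are made up for illustration, and the model ignores everything except the multiplier itself: independent operations are limited by throughput (about q cycles each at 1/q), while a chain in which each result feeds the next is limited by latency.

```c
#include <assert.h>

/* Illustrative helpers only; the names are not from any library. */

/* n independent ops at throughput 1/q: roughly q cycles each. */
static double cycles_independent(int n, double q)
{
    assert(n >= 0 && q > 0.0);
    return n * q;
}

/* n ops in a dependency chain: each one waits for the previous result,
   so the cost is governed by latency rather than throughput. */
static double cycles_dependent(int n, double latency)
{
    assert(n >= 0 && latency > 0.0);
    return n * latency;
}
```

Under this model, 100 independent multiplies at 1/4 would cost about 400 cycles, which is why latency alone says little about code with plenty of independent work.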
>
> > so i think i provided enough "proof", your only argument seems that low
> > precision integer tremor is faster than libvorbis, now AFAIK these are
> > 2 different implementations, i don't see how a comparison between them has any
> > meaning, i can also compare libavcodecs mp3 decoder which uses integers
> > mostly against the one in mplayer which is mostly floats, you know
> > which is faster ...
>
> Yes. And no one's ever been able to explain why. But clearly it's
i just explained it; you don't want this explanation, but it's still the
most likely reason
> unrelated to floats, since MPlayer's version is even faster on my K6
> with very slow float.
3dnow ...
argh, why am i wasting my time with this discussion
IIRC we had this silly integer vs. float discussion already at least once,
so reusing some benchmark program, here are the results, nicely written up and
with source attached. feel free to design your own cpu which can do
integer multiplies faster than floating point ones
                           latency   throughput
P3
  int   32*32     ->32       4         1
  int   32*32>>32 ->32       5.5       1/4.5
  float 32*32     ->32       5         1/2
Duron
  int   32*32     ->32       4         1/2
  int   32*32>>32 ->32       6         1/4.5
  float 32*32     ->32       3.5       1
Athlon
  int   32*32     ->32       4         1/2
  int   32*32>>32 ->32       6.5       1/5
  float 32*32     ->32       3.5       1
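
For reference, the three row labels correspond roughly to the C operations below, mirroring the expressions used in the attached benchmark. The function names are illustrative only, and the >>32 form assumes the compiler performs an arithmetic right shift on negative values, which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* int 32*32 -> 32: keep the low 32 bits of the product.
   Computed in unsigned arithmetic to avoid signed-overflow UB. */
static int32_t mul_lo(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* int 32*32>>32 -> 32: widen to 64 bits and keep the high half,
   the usual shape of a fixed-point multiply. */
static int32_t mul_hi(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* float 32*32 -> 32: plain single-precision multiply. */
static float mul_f(float a, float b)
{
    return a * b;
}

/* Small sanity check of the three helpers. */
static void check_mul_examples(void)
{
    assert(mul_lo(3, 5) == 15);
    assert(mul_hi(1 << 30, 1 << 30) == (1 << 28)); /* 2^60 >> 32 */
    assert(mul_f(2.0f, 3.0f) == 6.0f);
}
```

The middle form is the one a fixed-point decoder needs most often, and it is also the slowest row on all three CPUs in the table.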
[...]
--
Michael
-------------- next part --------------
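#include <stdio.h>
#include <asm/timex.h>   /* get_cycles(): reads the CPU cycle counter */
#include <inttypes.h>
/* repeat code 10 times; x10(x10(code)) unrolls it 100 times */
#define x10(code) code code code code code code code code code code
#define VARS 16
/* volatile source and sink so the compiler cannot fold the work away */
volatile int v[VARS], w[2*VARS];
/* time the 100x unrolled code 10 times and print the last measurement */
#define BENCH(code)\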
for(i=0; i<10; i++){\
for(j=0; j<VARS; j++) iv[j]=fv[j]=v[j];\
t= get_cycles();\
x10(x10(code))\
t= get_cycles() - t;\
for(j=0; j<VARS; j++) {\
w[j ]= iv[j]; \
w[j+VARS]= fv[j];\
}\
if(i==9)\
printf("100 " #code " %5lld cycles, %5.2f cycles/op\n", t, (t-overhead)/100.0/count);\
}
int main(){
long long t, overhead=0;
int i, j;
int iv[VARS];
float fv[VARS];
int count=1;
BENCH(;) /* empty statement: measures the timing overhead alone */
overhead= t;
count=2; /* operations per unrolled repetition */
BENCH(iv[0]+=iv[1];iv[1]+=iv[0];)
BENCH(iv[0]*=iv[1];iv[1]*=iv[0];)
BENCH(iv[0]=(iv[0]*(int64_t)iv[1])>>32;iv[1]=(iv[0]*(int64_t)iv[1])>>32;)
BENCH(fv[0]+=fv[1];fv[1]+=fv[0];)
BENCH(fv[0]*=fv[1];fv[1]*=fv[0];)
count=5;
BENCH(iv[0]+=iv[1];iv[1]+=iv[2];iv[2]+=iv[3];iv[3]+=iv[4];iv[4]+=iv[0];)
BENCH(iv[0]*=iv[1];iv[1]*=iv[2];iv[2]*=iv[3];iv[3]*=iv[4];iv[4]*=iv[0];)
BENCH(iv[0]=(iv[0]*(int64_t)iv[1])>>32;iv[1]=(iv[2]*(int64_t)iv[1])>>32;iv[2]=(iv[2]*(int64_t)iv[3])>>32;iv[3]=(iv[3]*(int64_t)iv[4])>>32;iv[4]=(iv[4]*(int64_t)iv[0])>>32;)
BENCH(fv[0]+=fv[1];fv[1]+=fv[2];fv[2]+=fv[3];fv[3]+=fv[4];fv[4]+=fv[0];)
BENCH(fv[0]*=fv[1];fv[1]*=fv[2];fv[2]*=fv[3];fv[3]*=fv[4];fv[4]*=fv[0];)
return 0;
}
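
If <asm/timex.h> and get_cycles() are not usable from user space on your system, one possible substitute is a POSIX monotonic clock. This is only a sketch under that assumption: now_ns() is a made-up name, and it reports nanoseconds rather than cycles, so results can be compared between operations on one machine but are not cycle counts.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for get_cycles(): monotonic time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);   /* CLOCK_MONOTONIC is assumed to be supported */
    (void)rc;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

With this in place, the two `t= get_cycles()` lines in BENCH would call now_ns() instead, and the printed figures would read as nanoseconds per operation.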