[Mplayer-cvslog] CVS: main/DOCS/tech TODO,1.9,1.10
Michael Niedermayer
michaelni at gmx.at
Wed Dec 12 21:41:49 CET 2001
Hi
On Wednesday 12 December 2001 17:02, Nick Kurshev wrote:
> Hello, Michael!
[...]
> > > I know only that manuals always suggested to replace conditional jumps
> > > with direct code ;)
> >
> > yes but that isn't possible here
> > they also suggest to avoid function pointers
>
> Did you read K7 manual?
> What about:
> JMP near mreg16/32 (indirect) DirectPath
> JMP near mem16/32 (indirect) DirectPath
DirectPath only means that it's decoded quickly; it says nothing about how fast it
is executed or about branch prediction afaik
btw for function pointers you need call / ret and they are VectorPath
[...]
> My tests show me that on a Duron:
> direct call takes 4 clocks
> indirect call takes 5 clocks
> (these clocks include the measuring loop)
> So there is only a 20% difference, which is too little to matter against the memcpy itself.
hmm, i looked into TFM and noticed that i wasn't completely correct in my
assumption that indirect calls are that slow (they should both execute in
about 2 cycles on the ppro, p3, ... cpus) ... well, but it's slower in the
benchmark ... looking at the asm output ...
with if / else gcc generates code like
test ...
jz L1
(function1)
jmp L2
L1:
(function2)
L2:
with function pointers
movl ..., %eax
call *%eax
...
L1:
pushl %ebp
movl %esp,%ebp
(function1)
leave
ret
L2:
pushl %ebp
movl %esp,%ebp
(function2)
leave
ret
quite a bit longer and slower indeed
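For reference, a minimal C sketch of the two source forms being compared above. The names (copy_variant1, copy_variant2, copy_branch, copy_indirect, copy_ptr) are made up for illustration and are not MPlayer's; both variants just call memcpy as placeholders. What a given gcc version and optimization level actually emits for this may differ from the listings above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two copy variants; both simply call memcpy. */
static void *copy_variant1(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *copy_variant2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Form 1: if / else. The compiler sees both targets and is free to
 * inline them, leaving only a test and a conditional jump. */
void *copy_branch(int use_variant1, void *dst, const void *src, size_t n)
{
    if (use_variant1)
        return copy_variant1(dst, src, n);
    else
        return copy_variant2(dst, src, n);
}

/* Form 2: function pointer. The target is only known at run time,
 * so every call is an indirect call followed by a ret. */
typedef void *(*copy_fn)(void *, const void *, size_t);

copy_fn copy_ptr = copy_variant2;

void *copy_indirect(void *dst, const void *src, size_t n)
{
    return copy_ptr(dst, src, n);
}
```

Compiling this with gcc -S and reading the output is one way to check which of the two shapes a particular compiler really produces.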
[...]
so if we would try to code it manually it would look like:
movl flags, %eax
testl MMX2|3DNOW, %eax
jz MMXorC
testl MMX2, %eax
jz 3DNOW
(MMX2-memcpy)
jmp end
3DNOW:
(3DNOW-memcpy)
jmp end
MMXorC:
testl MMX, %eax
jz C
(MMX-memcpy)
jmp end
C:
(C-memcpy)
end:
that would execute 1 mov, 2 tests and 3 jmps at max
these would be decoded to 5 micro Ops on intel chips
they are all directpath on amd chips
the latency is 1 on k7 for all except the mov, and the mov is 3-cycle latency
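The same priority order (MMX2 first, then 3DNOW, then MMX, then plain C) can be written in C roughly like this. It is only a sketch: the flag values and the names CPU_MMX, CPU_MMX2, CPU_3DNOW and pick_copy are assumptions made for the example, and in the real thing each branch would contain the copy loop itself rather than return a tag.

```c
/* Hypothetical capability bits; the values used in MPlayer may differ. */
enum { CPU_MMX = 1, CPU_MMX2 = 2, CPU_3DNOW = 4 };

/* Tag naming the copy variant that the cascade selects. */
enum copy_kind { COPY_C, COPY_MMX, COPY_3DNOW, COPY_MMX2 };

/* Mirrors the hand-written cascade: at most two tests are evaluated
 * before a variant is reached. */
enum copy_kind pick_copy(unsigned flags)
{
    if (flags & (CPU_MMX2 | CPU_3DNOW)) {
        if (flags & CPU_MMX2)
            return COPY_MMX2;   /* (MMX2-memcpy) */
        return COPY_3DNOW;      /* (3DNOW-memcpy) */
    }
    if (flags & CPU_MMX)
        return COPY_MMX;        /* (MMX-memcpy) */
    return COPY_C;              /* (C-memcpy) */
}
```

For example, pick_copy(CPU_MMX | CPU_3DNOW) selects the 3DNOW variant, and pick_copy(0) falls through to the C one.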
function pointers:
movl ..., %eax
call *%eax
...
MMX:
blah blah
ret
that would execute 1 mov, 1 indirect call, 1 return
these would be decoded to 10 micro Ops on intel chips
decoding itself is very likely slower too here
call / ret is vectorpath on k7
call / ret have 4-5 cycles latency on k7 and the mov has 3
the numbers are simply from TFM (amd's and intel's), so i might have missed some
important exceptions
the only possible way to really figure out which is faster is to code both
and benchmark them on different cpus, but i doubt that it is worth it because
1. it only affects runtime cpu detection
2. a single memory access that misses the L1 & L2 caches needs 50 cpu cycles or so, so
even if your variant turns out to be faster on some cpu the difference would
be tiny
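If someone does want to measure it, a rough timing harness could look like the sketch below. Everything in it is an assumption for illustration: the names (copy_a, copy_b, time_branch, time_pointer), the 64-byte buffers and the use of clock() are not taken from MPlayer. The volatile qualifiers are only there to keep the compiler from hoisting the test or the pointer load out of the loop; the numbers still depend heavily on compiler flags and on the cpu.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

static char src_buf[64], dst_buf[64];
static volatile char sink;   /* keeps the copied data observably used */

/* Hypothetical stand-ins for two copy variants. */
static void *copy_a(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *copy_b(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* volatile pointer: it is re-read on every call, so the call stays indirect */
static copy_fn volatile copy_ptr = copy_a;

/* Time `iters` copies dispatched with if / else; returns seconds of cpu time. */
double time_branch(int use_a, unsigned long iters)
{
    volatile int flag = use_a;   /* re-read each iteration, like a global flags word */
    clock_t t0 = clock();
    for (unsigned long i = 0; i < iters; i++) {
        src_buf[0] = (char)i;
        if (flag)
            copy_a(dst_buf, src_buf, sizeof src_buf);
        else
            copy_b(dst_buf, src_buf, sizeof src_buf);
        sink = dst_buf[0];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Time `iters` copies dispatched through the function pointer. */
double time_pointer(unsigned long iters)
{
    clock_t t0 = clock();
    for (unsigned long i = 0; i < iters; i++) {
        src_buf[0] = (char)i;
        copy_ptr(dst_buf, src_buf, sizeof src_buf);
        sink = dst_buf[0];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling both with the same large iteration count (tens of millions, say) and comparing the two results on each cpu of interest gives a first impression. clock() is coarse, so short runs are meaningless, and a cycle counter would be the better tool for a serious comparison.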
Michael