[Mplayer-cvslog] CVS: main/DOCS/tech TODO,1.9,1.10
Nick Kurshev
nickols_k at mail.ru
Thu Dec 13 18:05:47 CET 2001
Hello, Michael!
On Wed, 12 Dec 2001 21:41:49 +0100 you wrote:
> Hi
>
> On Wednesday 12 December 2001 17:02, Nick Kurshev wrote:
> > Hello, Michael!
> [...]
> > > > I know only that the manuals always suggested replacing conditional jumps
> > > > with direct code ;)
> > >
> > > yes, but that isn't possible here
> > > they also suggest avoiding function pointers
> >
> > Did you read the K7 manual?
> > What about:
> > JMP near mreg16/32 (indirect) DirectPath
> > JMP near mem16/32 (indirect) DirectPath
> Direct path means that it's decoded quickly; it says nothing about how fast it
> is executed or about branch prediction afaik
> btw for function pointers you need call / ret and they are vectorPath
>
I know that ;)
> [...]
> > My tests show me that on a Duron:
> > a direct call takes 4 clocks
> > an indirect call takes 5 clocks
> > (these clocks include the overhead of the measuring loop)
> > So there is only a 20% difference, which is negligible next to the memcpy itself.
> hmm, i looked into TFM and noticed that i wasn't completely correct about my
> assumption that indirect calls are that slow (they should both execute in
> about 2 cycles on the ppro, p3, ... cpus) ... well, but it's slower in the
> benchmark ... looking at asm output ...
>
> with if / else gcc generates code like
> test ...
> jz L1
> (function1)
> jmp L2
> L1:
> (function2)
> L2:
>
> with function pointers
> movl ..., %eax
> call *%eax
>
> ...
>
> L1:
> pushl %ebp
> movl %esp,%ebp
> (function1)
> leave
ifdef K7
leave
else
movl %ebp, %esp
popl %ebp
endif
;)
> ret
>
> L2:
> pushl %ebp
> movl %esp,%ebp
> (function2)
> leave
> ret
>
> quite a bit longer and slower indeed
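Just so we are looking at the same thing at the source level, the two call
sites would be roughly this in C, as a quick sketch (the CPU_* flags and the
fast_memcpy_* names are made up here, and the stubs simply fall back to libc
memcpy):

#include <stddef.h>
#include <string.h>

#define CPU_MMX   1
#define CPU_MMX2  2
#define CPU_3DNOW 4

static int cpu_flags;   /* filled in once by runtime cpu detection */

/* dummy "optimized" routines standing in for the real asm versions */
static void *fast_memcpy_mmx (void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *fast_memcpy_mmx2(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* variant 1: branch at every call site; gcc turns this into the
   test/jz/jmp chain shown above */
static void *copy_branch(void *d, const void *s, size_t n)
{
    if (cpu_flags & CPU_MMX2)
        return fast_memcpy_mmx2(d, s, n);
    if (cpu_flags & CPU_MMX)
        return fast_memcpy_mmx(d, s, n);
    return memcpy(d, s, n);
}

/* variant 2: pick the routine once at init time and call through a
   pointer; the call site becomes the movl ... / call *%eax pair above */
static void *(*fast_memcpy_ptr)(void *, const void *, size_t) = memcpy;

static void init_fast_memcpy(void)
{
    if (cpu_flags & CPU_MMX2)
        fast_memcpy_ptr = fast_memcpy_mmx2;
    else if (cpu_flags & CPU_MMX)
        fast_memcpy_ptr = fast_memcpy_mmx;
}

int main(void)
{
    char a[16] = "hello", b[16];
    cpu_flags = CPU_MMX | CPU_MMX2;   /* pretend detection found MMX2 */
    init_fast_memcpy();
    copy_branch(b, a, sizeof a);      /* variant 1 */
    fast_memcpy_ptr(b, a, sizeof a);  /* variant 2 */
    return 0;
}

Compiled with gcc -O2 each variant should give approximately the asm above.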
>
> [...]
>
> so if we tried to code it manually it would look like:
> movl flags, %eax
> testl $(MMX2|3DNOW), %eax
> jz MMXorC
> testl $MMX2, %eax
> jz do3DNOW
> (MMX2-memcpy)
> jmp end
> do3DNOW:
> (3DNOW-memcpy)
> jmp end
> MMXorC:
> testl $MMX, %eax
> jz C
> (MMX-memcpy)
> jmp end
> C:
> (C-memcpy)
> end:
>
> that would execute 1 mov, 2 tests and 3 jmps at max
> these would be decoded to 5 micro Ops on intel chips
> they are all directpath on amd chips
> the latency is 1 on k7 for all of them except the mov, which has a 3-cycle latency
>
> function pointers:
> movl ..., %eax
> call *%eax
>
> ...
> MMX:
> blah blah
> ret
>
> that would execute 1 mov, 1 indirect call and 1 return
> these would be decoded to 10 micro Ops on intel chips
> decoding itself is very likely slower here too
> call / ret are vectorPath on k7
> call / ret have 4-5 cycles of latency on k7 and the mov has 3
>
> the numbers are simply from TFM (amd's and intel's), so i might have missed some
> important exceptions
>
> the only possible way to really figure out which is faster is to code both
> and benchmark them on different cpus, but i doubt that it is worth it because
> 1. it only affects runtime cpu detection
> 2. a single memory access that misses the L1&L2 caches needs 50 cpu cycles or
> so, so even if your variant turns out to be faster on some cpu the difference
> would be tiny
>
I didn't understand you.
It seems that you are quibbling over 10-20 micro-ops versus 5-7 ops.
But do you realize that an ordinary memcpy takes 100 000 - 500 000 cpu clocks?
(I don't mean a memcpy of a 64-byte block.)
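Just to put your own latency figures next to that number (ignoring overlap
and branch prediction, so treat it as a rough upper bound on the dispatch
cost):

  branch dispatch:   3 (mov) + 2 x 1 (test) + 3 x 1 (jmp) = ~8 cycles at worst
  pointer dispatch:  3 (mov) + ~5 (call) + ~5 (ret)       = ~13 cycles
  the copy itself:   100 000 - 500 000 cycles

So even the slower variant costs well under 0.02% of the whole copy.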
My issues:
1) cache pollution due to function inlining (significant on any cpu),
or else you get a dramatic performance loss (as you write, 50 cpu cycles)
2) wrong branch prediction in 75% of cases due to uncached code and data
3) it's just not an elegant solution
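As you say, the only way to really settle it is to code both and benchmark
them on different cpus. A minimal harness could look something like this
(just a sketch: rdtsc-based, so P5 or newer only; copy_branch and
fast_memcpy_ptr refer to the made-up names in the C sketch further up; take
the best of many runs, since the first iterations mostly measure cold caches):

#include <stdio.h>
#include <string.h>

/* read the cpu's time stamp counter */
static unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

#define BUF_SIZE (1 << 20)   /* 1 MB block, not a 64-byte one */
static char src[BUF_SIZE], dst[BUF_SIZE];

int main(void)
{
    unsigned long long t0, t1, best = ~0ULL;
    int i;

    for (i = 0; i < 100; i++) {
        t0 = rdtsc();
        memcpy(dst, src, BUF_SIZE);   /* swap in copy_branch() or
                                         fast_memcpy_ptr() to compare */
        t1 = rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best: %llu cycles for %d bytes\n", best, BUF_SIZE);
    return 0;
}

On a block this size the few cycles of dispatch overhead should disappear
into the noise of the copy itself, whichever way it is written.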
> Michael
Best regards! Nick