[Mplayer-cvslog] CVS: main/DOCS/tech TODO,1.9,1.10

Nick Kurshev nickols_k at mail.ru
Thu Dec 13 18:05:47 CET 2001


Hello, Michael!

On Wed, 12 Dec 2001 21:41:49 +0100 you wrote:

> Hi
> 
> On Wednesday 12 December 2001 17:02, Nick Kurshev wrote:
> > Hello, Michael!
> [...]
> > > > I know only that manuals always suggested to replace conditional jumps
> > > > with direct code ;)
> > >
> > > yes but that isn't possible here
> > > they also suggest to avoid function pointers
> >
> > Did you read K7 manual?
> > What about:
> > JMP near mreg16/32 (indirect)    DirectPath
> > JMP near mem16/32 (indirect)     DirectPath
> Direct path means that it's decoded quickly; it says nothing about how fast it 
> is executed or about branch prediction afaik
> btw for function pointers you need call / ret and they are vectorPath
> 
I know that ;)
> [...]
> > My tests show that on Duron:
> > direct call takes 4 clocks
> > indirect call takes 5 clocks
> > (these clocks include measuring of loop)
> > So there is only a 20% difference, which is too little to matter compared to the memcpy itself.
> hmm, i looked into TFM and noticed that i wasn't completely correct about my 
> assumption that indirect calls are that slow (they should both execute in 
> about 2 cycles on the ppro,p3,... cpus) ... well but it's slower in the 
> benchmark ... looking at asm output ...
> 
> with if / else gcc generates code like
> test ...
>  jz L1
> (function1)
>  jmp L2
> L1:
> (function2)
> L2:
> 
> with function pointers
> movl ..., %eax
> call *%eax
> 
> ...
> 
> L1:
> pushl %ebp
> movl %esp,%ebp
> (function1)
> leave
ifdef K7
leave
else
movl %ebp, %esp
popl %ebp
endif
;)
> ret
> 
> L2:
> pushl %ebp
> movl %esp,%ebp
> (function2)
> leave
> ret
> 
> quite a bit longer and slower indeed
> 
> [...]
> 
> so if we would try to code it manually it would look like:
> movl flags, %eax
> testl MMX2|3DNOW, %eax
>  jz MMXorC
> testl MMX2, %eax
>  jz 3DNOW
> (MMX2-memcpy)
>  jmp end
> 3DNOW:
> (3DNOW-memcpy)
>  jmp end
> MMXorC:
> testl MMX, %eax
>  jz C
> (MMX-memcpy)
>  jmp end
> C:
> (C-memcpy)
> end:
> 
> that would execute 1 mov, 2 tests and 3 jmps at max
> these would be decoded to 5 micro Ops on intel chips
> they are all directpath on amd chips
> the latency is 1 on k7 for all except the mov, and the mov is 3-cycle latency
> 
> function pointers:
> movl ..., %eax
> call *%eax
> 
> ...
> MMX:
> blah blah
> ret
> 
> that would execute 1 mov, 1 indirect call, 1 return
> these would be decoded to 10 micro Ops on intel chips
> decoding itself is very likely slower too here
> call / ret is vectorpath on k7
> call / ret have 4-5 cycles latency on k7 and the mov has 3
> 
> numbers are simply taken from TFM (amd's and intel's), so i might have missed some 
> important exceptions
> 
> the only possible way to really figure out which is faster is to code both 
> and benchmark them on different cpus, but i doubt that it is worth it because
> 1. it only affects runtime cpu detection
> 2. a single memory access (missing L1&L2 cache) needs 50 cpu cycles or so, so 
> even if your variant turns out to be faster on some cpu the difference would 
> be tiny
> 
I don't follow you.
It seems you are weighing 10-20 micro-ops against 5-7 micro-ops.
But do you realize that an ordinary memcpy takes 100 000 - 500 000 cpu clocks?
(I don't mean a memcpy of a 64-byte block.)

My issues:
1) cache pollution due to function inlining (significant on any cpu),
   else you'll get a dramatic performance loss (as you wrote - 50 cpu cycles)
2) wrong branch prediction in 75% of cases
   due to uncached code and data
3) it's just not an elegant solution

> Michael
> _______________________________________________
> Mplayer-cvslog mailing list
> Mplayer-cvslog at mplayerhq.hu
> http://mplayerhq.hu/mailman/listinfo/mplayer-cvslog
> 


Best regards! Nick


