[Mplayer-users] Question: K6-2 and fastmemcpy

Fri Apr 20 00:19:48 CEST 2001

Hi, all!

>>>>>On my Duron the MMX2-optimized version has v2-v1=318426,
>>>>>but the version with standard memcpy has v2-v1=517387
>>>
>>>>
>>>>on my K6-2 450MHz, running linux 2.2.16:
>>>>
>>>>v1 = 30557687593183 v2 = 30557688368732 v2-v1=775549
>>>>
>>>
>>>Are those clock cycles?  Cool, on my K6 200 MHz (so MMX, but no MMX2)
>>>the regular memcpy version takes only 330000 on average...
>>>Too bad I can't overclock it to 1GHz. 
>>>
>>It indicates how many CPU cycles were spent per tested block. CPU frequency
>>doesn't matter in this case, and I intentionally didn't state my CPU frequency.
>
>Yeah, I know that, but it tells me that from K6 to Duron, AMD did not increase
>efficiency, say, transferred bytes/cycle. In fact, the ratio got worse, but with
>optimized code (prefetch it is...) you can get it back to about the same level.
>
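
For anyone who wants to reproduce the "v2-v1" numbers, here is a minimal sketch of the measurement,
not the actual test program from this thread. The function names are hypothetical. It assumes GCC
or Clang on x86 for the time-stamp counter; on other machines it falls back to nanoseconds, so the
result is then "bytes per nanosecond" rather than bytes per cycle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Read an increasing tick counter.
 * x86 with GCC/Clang: the CPU time-stamp counter (cycles).
 * Anything else: wall-clock nanoseconds (an assumption of this sketch). */
uint64_t read_ticks(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    return __builtin_ia32_rdtsc();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Ticks spent copying len bytes once: the "v2 - v1" quoted above. */
uint64_t ticks_for_copy(void *dst, const void *src, size_t len)
{
    uint64_t v1 = read_ticks();
    memcpy(dst, src, len);
    uint64_t v2 = read_ticks();
    return v2 - v1;
}

/* Transferred bytes per tick; returns 0 if no ticks were counted. */
double bytes_per_tick(size_t len, uint64_t ticks)
{
    return ticks ? (double)len / (double)ticks : 0.0;
}
```

A single run is noisy (interrupts, cold caches), so in practice one would repeat the copy and
compare averages between the standard and the optimized version on the same machine.
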
I have found many design flaws in this CPU compared with my old K6-200 ;) (for example, mul/div are
relatively slower, and some others), but in general the Duron-750 is much faster than the K6-3-400.

>Which is somehow great anyway, since not CPU frequency but speed of memory 
>transfer is the bottleneck. So I should have written, _with my "old" 66 MHz 
>memory bus_ somewhere. 
As for memcpy: a Duron is not an Athlon! The Duron has only 128 KB of L1 + 64 KB of L2 cache, so it
occupies the memory bus more often than an Athlon, etc. Many people think that PC-100 memory is faster
than PC-66, but IMHO that's wrong. Progress in memory design is nearly zero. The figure marked on a
memory module reflects the refresh frequency, not the access time of a memory cell. If modern memory
really had 10 ns cell access, what would be the reason to enlarge CPU caches (L0+L1+L2+L3), which have
the same (10 ns) access time? In fact, the cell access time is still the old 60-70 ns, and only advanced,
expensive memory chips reach 50-30 ns. The 10 ns is the charge time of the memory elements.
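
To put these access times into CPU terms, here is a small worked conversion from nanoseconds to core
cycles. It is only arithmetic, with a hypothetical function name; the 750 MHz clock is the Duron-750
mentioned above and the latencies are the ones quoted in this mail, not measurements.

```c
/* Number of CPU cycles that pass during one memory access.
 * cycles = latency [ns] * clock [MHz] / 1000 */
double latency_in_cycles(double latency_ns, double cpu_mhz)
{
    return latency_ns * cpu_mhz / 1000.0;
}
```

With these inputs, a 60 ns access costs 45 cycles at 750 MHz, while a 10 ns access would cost only
7.5 cycles. That gap is why a copy loop that misses the caches is bound by memory and not by the core.
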

>
>>Example: 2 NOP insns will execute on a K6 in 1 CPU cycle at any CPU frequency.
>>The numbers are interesting only for relative measurement (MMX and non-MMX
>>optimized versions).
>Yes, but we are not talking nops, are we? We are talking memcpy, which should
>consist of as few "nops" as possible :-)
>
>
Modern manuals (K7, PIII) say that movsb is too complex an instruction (the same goes for push, pop,
loop) and is better avoided, but standard libraries and compilers still use it as base code.
Indeed, when the CPU executes an opcode it executes micro-ops: MOVSB = mov tmp, esi (inc esi) mov edi,
tmp (inc edi), i.e. it contains 2 unnecessary INC insns. The old K6 could execute it better.
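
The same breakdown can be written out in C. This is only an illustrative sketch with hypothetical
function names, not code from fastmemcpy.h or any libc: the first loop spends two pointer updates on
every byte, as in the MOVSB breakdown, while the second spends the same two updates on every 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One byte per iteration: a load, a store and two pointer increments. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--) {
        unsigned char tmp = *src;   /* load through the source pointer (esi) */
        src++;                      /* inc esi */
        *dst = tmp;                 /* store through the destination (edi)   */
        dst++;                      /* inc edi */
    }
}

/* Eight bytes per iteration, then a byte-wise tail for the remainder. */
void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n >= sizeof(uint64_t)) {
        uint64_t tmp;
        memcpy(&tmp, src, sizeof tmp);  /* 8-byte load, safe for any alignment */
        memcpy(dst, &tmp, sizeof tmp);  /* 8-byte store                        */
        src += sizeof tmp;
        dst += sizeof tmp;
        n   -= sizeof tmp;
    }
    copy_bytes(dst, src, n);
}
```

The inner memcpy calls with a constant size are normally compiled to single load/store instructions;
they are used here only to stay within the C aliasing rules.
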

>Of course the code is just for relative measurement... and my results are 
>not interesting to anybody else, because the K6 has no alternative to standard
>memcpy. But still... counting cycles is fun!
The bottleneck is probably memory, not changes in the internal architecture.

>Although I'm not Pontscho, I reply ;)
I'm sorry, but isn't that your signature in "fastmemcpy.h", or am I wrong?

>Consider writing mails to mplayer-devel, you can write in english, as many
>others do.. So the fellow users won't be disturbed by advanced talk :)
>
>BTW, on my K6-2 286mhz (stepping 0, no mtrr) it doesn't really matter if
>I use fastmemcpy..
>552782 with standard
>524774 with -DHAVE_3DNOW , and so.
THANKS!!!
The results are nearly the same. That is caused by misaligned writes (movntq is not present on the K6-2).
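
One common way to avoid misaligned writes is to copy a few leading bytes until the destination is
aligned and only then switch to wide stores. Below is a minimal sketch of that idea in portable C,
assuming 8-byte blocks; the function names are hypothetical and this is not the code from
fastmemcpy.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes to skip so that p becomes a multiple of align (a power of two). */
size_t bytes_to_align(const void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    return (size_t)((align - (addr & (align - 1))) & (align - 1));
}

/* Copy n bytes so that every 8-byte store hits an 8-byte aligned address. */
void copy_dst_aligned(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t head = bytes_to_align(dst, 8);
    if (head > n)
        head = n;

    /* Head: byte-wise until the destination is aligned. */
    for (size_t i = 0; i < head; i++)
        dst[i] = src[i];
    dst += head;
    src += head;
    n   -= head;

    /* Body: 8-byte blocks; dst is aligned here, src may not be. */
    while (n >= 8) {
        uint64_t tmp;
        memcpy(&tmp, src, 8);
        memcpy(dst, &tmp, 8);
        dst += 8;
        src += 8;
        n   -= 8;
    }

    /* Tail: the remaining 0..7 bytes. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Only the destination is aligned here because source and destination rarely share the same offset,
and unaligned stores are usually the more expensive of the two.
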

THANKS TO ALL!!!

Best regards! Nick
