[MPlayer-users] pentium4 bench

Sun Nov 25 21:11:54 CET 2001

Thus spake Arpi (arpi at thot.banki.hu):
> note that p4 2hgz doesn't seem to be 2x (or more) faster than 1ghz celeron-2.

Of course not.
The P4 is a marketing gag that Intel originally wanted as a server CPU.
Just look at the original design: it is much better than what you can buy
now, because Intel had to make it cheaper to be able to sell it in the mass
market at all, so they castrated the chip severely.  It is optimized for
clock rate, not for speed, because Intel knew that "2000 MHz" looks
more impressive than "1534 MHz, but hey, it's pretty fast, trust us!".
At equal clock speed it is inferior to all other current CPUs.

> it's strange, as it has very fast ram (400mhz rambus compared to 133mhz sdram)
> and also the cpu has double clockrate.

P4 sucks.  And people know.  I wonder why you bought such a crap
machine.  Just look at www.divx.com: they just ran a poll on what CPU
people have, and only 10% have a P4; the rest have Athlons or PIIIs.  It
does not make sense to optimize for that broken CPU.

> maybe the code should be optimized differently for p4?

No.  Nobody should optimize for that monster.  It is twice as expensive
to make (what Intel/AMD have to pay, not what they charge in the market)
than the Athlon and performs worse.  Intel is rich enough without you
helping them push their crap into the market.

And to optimize for P4 is a _huge_ pessimization for other CPUs:

Consider this example (courtesy Frank Klemm):

  multiply by 32, P4 at 2 GHz:

    imul    $32, %eax       ; 14 cycles  (7 ns)

    shl     $5, %eax        ;  4 cycles  (2 ns)

    add     %eax, %eax
    add     %eax, %eax
    add     %eax, %eax
    add     %eax, %eax
    add     %eax, %eax      ;  2.5 cycles (1.25 ns) for all five adds

Does that look like a sane architecture to you?  Not to me.  I am unable to
measure the cycle counts for my Athlon because rdtscl always says "0
cycles".  The manual suggests that on the Athlon shl $5, %eax is 1 cycle
and pairable, the adds are 2.5 cycles, and the imul takes "variable
time".  Duh. ;)
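
To make it clear that the three multiply-by-32 sequences above are
interchangeable, here is a minimal C sketch that models each one and
checks that they agree, including on 32-bit wraparound.  The function
names are made up for illustration.  It only shows the arithmetic
equivalence; an optimizing compiler is free to emit the same
instruction for all three, so this says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Models: imul $32, %eax */
uint32_t mul32_imul(uint32_t x)
{
    return x * 32u;
}

/* Models: shl $5, %eax */
uint32_t mul32_shl(uint32_t x)
{
    return x << 5;
}

/* Models: add %eax, %eax, five times in a row */
uint32_t mul32_adds(uint32_t x)
{
    for (int i = 0; i < 5; i++)
        x += x;                 /* each add doubles the register */
    return x;
}

/* Hypothetical self-check: all three variants must agree, also when
 * the result overflows 32 bits (unsigned arithmetic wraps). */
void check_mul32(void)
{
    const uint32_t samples[] = { 0u, 1u, 3u, 0x08000000u, 0xFFFFFFFFu };

    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(mul32_shl(samples[i])  == mul32_imul(samples[i]));
        assert(mul32_adds(samples[i]) == mul32_imul(samples[i]));
    }
}
```

Calling check_mul32() from a main() of your own runs the comparison;
it returns silently if the variants agree.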

Frank also says that addressing modes like (x,y,[2,4,8]) are very
slow on P4, as shown by the following example (multiply by 27):

    leal    (%eax,%eax,2), %edx
    leal    (%edx,%edx,8), %eax

 Pentium III: 2 cycles
 Pentium 4:   8 cycles
 Athlon:      4 cycles

He suggests the following alternative code for P4:

     lea (r,r,1),s       ; s = 2r
     add s,r             ; r = 3r
     add s,s             ; s = 4r
     add s,s             ; s = 8r
     add s,r             ; r = 11r
     add s,s             ; s = 16r
     add r,s             ; s = 27r (result ends up in s)

 Pentium III: 5 cycles
 Pentium 4:   2.5 cycles
 Athlon:      6 cycles
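
Both multiply-by-27 sequences can be traced register by register.  The
following is a minimal C sketch of that trace, with made-up function
names, assuming AT&T operand order ("add src,dst" means dst += src).
In the comments, x stands for the original input value.  Read that way,
the last add of the long sequence leaves the product in s.  Again this
only checks the arithmetic, not the cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Models the two-instruction leal version. */
uint32_t mul27_lea(uint32_t eax)
{
    uint32_t edx = eax + eax * 2u;  /* leal (%eax,%eax,2), %edx : 3x  */
    eax = edx + edx * 8u;           /* leal (%edx,%edx,8), %eax : 27x */
    return eax;
}

/* Models the lea-plus-adds version suggested for the P4. */
uint32_t mul27_adds(uint32_t r)
{
    uint32_t s = r + r;     /* lea (r,r,1),s : s =  2x */
    r += s;                 /* add s,r       : r =  3x */
    s += s;                 /* add s,s       : s =  4x */
    s += s;                 /* add s,s       : s =  8x */
    r += s;                 /* add s,r       : r = 11x */
    s += s;                 /* add s,s       : s = 16x */
    s += r;                 /* add r,s       : s = 27x */
    return s;
}

/* Hypothetical self-check: both sequences must equal 27 * x mod 2^32. */
void check_mul27(void)
{
    const uint32_t samples[] = { 0u, 1u, 5u, 0x10000000u, 0xFFFFFFFFu };

    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(mul27_lea(samples[i])  == samples[i] * 27u);
        assert(mul27_adds(samples[i]) == samples[i] * 27u);
    }
}
```

The long version trades two address-generation instructions for seven
simple ones, which is the trade-off the cycle counts above are about.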

As you can see, the Pentium 4 really sucks.  A lot.  And gcc does not
use any of the above optimizations, not even for calculating addresses
(gcc outputs the leal code I pasted above).

And what is the effort good for?  To support an architecture that is
broken by design?  No, thanks (if you ask me).  I suggest you give your
P4 back and get yourself an Athlon.  SSE2 optimizations work well for
the P4, and the Hammer (Athlon successor) will have them, too, so
investing time in SSE2 might look like a good idea.  On the other hand,
consider SSE support in Athlon XP.  SSE code performs about as well on
the Athlon XP as normal FPU code does.  I predict that SSE2 code will
perform as well on the Hammer as FPU code does.  So I suggest we write
good FPU code instead ;)

Felix