Doubts again: T-bird 750 slower than P][ 400?

Guest
I thought I had it figured out -- blame the problem on the RAM (see one of the messages below). But later I found that the memory tester I used reports similar figures on many Athlon systems. Also, other benchmarks gave more reasonable rates: 200-600MB/sec. So now I a) feel stupid, b) am lost, and c) wonder what the problem might be.

So I'd appreciate any suggestions. See below for problem description.



T-bird:
Asus A7V
T-bird (socket A) 750 (non-overclocked)
Running at pretty much standard BIOS settings (system mode set to Optimal)
PC133 128 MB RAM (CAS3 latency) (possibly low quality)
7200 RPM 20 GB ATA/66 HD
tests done on win95 and linux (gcc with all sane optimization options,
makes no great difference in speed, 5% at most)

P][
PCV-E314DS 400 MHz Intel Pentium ® II processor (with 100 MHz FSB)
82440BX-100 AGP/PCI/ISA chipset
PCI Level 2.1, 33 MHz zero wait state
128 MB PC100 CAS3 ram
tests done on winme

Benchmarks written in ANSI C.
Benchmarks performed (all compiled with MSVC 6 standard release mode):

1) Random access to an 8 MB array -- t-bird about 40-50% faster.
The random numbers were pre-generated (obviously), here and in all the benchmarks below.

2) Tight 'for' loop with pointer dereferencing. 80% faster on the t-bird.
(accessing small amount of ram)

3) 'for' loop with much pointer dereferencing (also accessing a small amount of ram)
also 80% faster.

4) Recursive call with elementary arithmetic (one increment)
to test CALL instruction -- 2x+ as fast on the T-bird

5) Applications run faster (no precise timing). Norton Utilities Benchmark gives 2x the
score for the t-bird as for its concept of the typical 450 P][ (which doesn't say much,
since the test wasn't done on the P][ 400)
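
For anyone who wants to try something like benchmark 1 on their own box, here is a minimal sketch of how such a test might look. This is not the original source: the function name `time_random_access`, the fixed seed, and the byte-sized elements are all assumptions made for illustration. It pre-generates the random indices so that `rand()` stays out of the timed loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not the original benchmark. Reads n_accesses bytes
   from an array of array_bytes bytes in a pre-generated random order.
   Returns elapsed CPU seconds, or -1.0 if allocation fails. The byte sum
   is stored in *sink so the compiler cannot discard the timed loop. */
double time_random_access(size_t array_bytes, size_t n_accesses,
                          unsigned long *sink)
{
    unsigned char *data;
    size_t *index;
    size_t i;
    unsigned long sum = 0;
    clock_t start, stop;

    assert(array_bytes > 0 && n_accesses > 0 && sink != NULL);

    data = malloc(array_bytes);
    index = malloc(n_accesses * sizeof *index);
    if (data == NULL || index == NULL) {
        free(data);
        free(index);
        return -1.0;
    }

    for (i = 0; i < array_bytes; i++)
        data[i] = (unsigned char)(i & 0xFF);

    /* Pre-generate the indices. RAND_MAX may be as small as 32767,
       so two calls are combined to cover a large array. */
    srand(12345);
    for (i = 0; i < n_accesses; i++) {
        unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
        index[i] = (size_t)(r % array_bytes);
    }

    /* Timed section: only the loads, no random number generation. */
    start = clock();
    for (i = 0; i < n_accesses; i++)
        sum += data[index[i]];
    stop = clock();

    *sink = sum;
    free(data);
    free(index);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling it with `array_bytes` set to 8 MB and a few million accesses should roughly mimic the test described above. Note that the index array is itself read sequentially inside the timed loop, so the number measures a mix of random and streaming accesses; `clock()` is also coarse, so short runs are not trustworthy.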

All these results look about as expected. The following situation is messed up, however:
Shellsort (using an alternating-direction bubble h-sort; if you don't know the algorithms,
ignore them and go on) on a singly linked list: sorting 1,200,000 elements takes
36 seconds on the P][ and 50 on the t-bird. Consistently. The difference also shows up
for smaller list sizes.

Can someone point to a possible specific cause (or causes) for this difference?
What can be done to fix it, or to test it? Has anyone else seen something like this?
If someone wants to test this on his/her machine, I could either post a binary (Win32 or ELF),
or, after some thought, e-mail the source (since this was a CS assignment, distributing
source might not be encouraged).
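
One test that doesn't need the assignment source: walk the same linked list twice, once with the nodes laid out in memory order and once with the links shuffled. The machine code is identical in both runs, so any difference in time comes from the memory system rather than from the instructions. A rough sketch follows; `struct node`, `build_list`, and `time_traversal` are hypothetical names for illustration, not taken from the shellsort code.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

struct node {
    struct node *next;
    int key;
};

/* Links the n nodes of pool into one list and returns its head (NULL if
   the scratch allocation fails). With shuffle == 0 the list follows the
   pool order; otherwise it follows a random permutation, so each hop
   lands somewhere unrelated in memory. */
struct node *build_list(struct node *pool, size_t n, int shuffle)
{
    size_t *order;
    size_t i;
    struct node *head;

    assert(pool != NULL && n > 0);
    order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        order[i] = i;
    if (shuffle) {
        /* Fisher-Yates shuffle with a fixed seed for repeatability. */
        srand(12345);
        for (i = n - 1; i > 0; i--) {
            unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
            size_t j = (size_t)(r % (i + 1));
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }
    for (i = 0; i < n; i++) {
        pool[order[i]].key = (int)i;
        pool[order[i]].next = (i + 1 < n) ? &pool[order[i + 1]] : NULL;
    }
    head = &pool[order[0]];
    free(order);
    return head;
}

/* Walks the list `passes` times and returns elapsed CPU seconds. The sum
   of the keys goes to *checksum so the loop cannot be optimized away. */
double time_traversal(const struct node *head, int passes,
                      unsigned long *checksum)
{
    const struct node *p;
    unsigned long sum = 0;
    clock_t start, stop;
    int pass;

    start = clock();
    for (pass = 0; pass < passes; pass++)
        for (p = head; p != NULL; p = p->next)
            sum += (unsigned long)p->key;
    stop = clock();
    *checksum = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

With a pool of 1,200,000 nodes (the list size from the post), a shuffled walk that is much slower on the T-bird than on the P][, while the in-order walk is not, would point at memory latency rather than at the CPU core. That is a guess about what the numbers would show, not a measured result.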


Edited by JoeShmoe on 03/03/01 09:17 PM.
 
Guest
Are you talking about the possibility of the base-level machine code being less optimized for the T-bird in the case of this particular sorting algorithm? It wouldn't necessarily surprise me...

Old adage: "Users never prosper" :eek:) Long live the tweakers
 
Guest
The compiler settings (and compiler used) were *exactly* the same on the two machines, so the machine code should not be different. Moreover, the machine code (as executed on the T-bird) consisted predominantly of MOVs, CMPs, TESTs, and CALLs. The benchmarks we performed tested these instructions (under different conditions, mostly array access, granted, but these instructions nonetheless).

Every benchmark (not related to the shellsort) we can come up with is faster on the T-bird, but that specific code is a lot slower on the T-bird than on the P][.

It's probably not an issue of optimized code generation -- the T-bird system should outperform the P][ on standard x86 instructions.

I'm sorry if I'm not being clear. Need sleep.
 
Guest
You're using ANSI C, fine, and the compiler settings are the same, fine. But which compiler are you actually using? Not VC++ I hope, as it is optimised more for the Pentium. Other compilers are also optimised more for the Pentium core. Check this.

Second issue: the older P][ has 512KB of L2 cache running at 1/2 core speed, while the newer T-Bird has 256KB running at core speed (same as newer PIIIs). Maybe the P][ can cache more of the list from system RAM, and even though it's slower, fewer trips to main RAM may make the difference. Try comparing it with an original Athlon at 750MHz. I bet that would beat the P][.
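
The cache-size idea can be checked without the sort at all. A rough sketch, under my own assumptions (the names `make_cycle` and `time_chase` are made up for illustration): build one big random cycle of indices and chase it. Every load depends on the previous one, so the time per hop is close to the latency of whatever level of the memory hierarchy the working set fits in.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: returns a malloc'd array next[0..n-1] forming a
   single random cycle through all n slots (Sattolo's algorithm), or NULL
   if allocation fails. The caller frees it. */
size_t *make_cycle(size_t n)
{
    size_t *next;
    size_t i;

    assert(n > 0);
    next = malloc(n * sizeof *next);
    if (next == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        next[i] = i;
    srand(12345);
    for (i = n - 1; i > 0; i--) {
        /* j is drawn from [0, i-1]; two rand() calls are combined
           because RAND_MAX may be as small as 32767. */
        unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
        size_t j = (size_t)(r % i);
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
    return next;
}

/* Follows the cycle for `steps` hops starting at slot 0 and returns the
   elapsed CPU seconds. The final slot is stored in *end so the loop has
   a visible result and cannot be removed. */
double time_chase(const size_t *next, unsigned long steps, size_t *end)
{
    size_t pos = 0;
    unsigned long s;
    clock_t start, stop;

    start = clock();
    for (s = 0; s < steps; s++)
        pos = next[pos];
    stop = clock();
    *end = pos;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Run it for working sets of, say, 128KB, 256KB, 512KB, 1MB and 8MB (n = bytes / sizeof(size_t)) with the same number of steps each time. If the theory holds, the time per hop should jump at a smaller working set on the chip with the smaller L2; if both machines look alike until the set spills into main RAM, the cache size is probably not the explanation.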

Nice one, BTW. Finally, someone who can provide hard performance evidence rather than mindless AMD or Intel bashing. Makes a change around these parts (and every other discussion forum on the web!). You've inspired me to write some benchmarking programs for myself :)

~ I'm not AMD biased, I just think their chips are better ~
 
Guest
We did use VC++, but that was definitely *not* the problem (CPU-specific optimizations weren't really the question, anyway). It might have made a slight difference, but not enough to account for a 3-fold disparity in speed.

The problem was that the PC133 (advertised 7ns) RAM was giving transfer rates of around 18.6MB/sec (uncached) at both PC100 and PC133, and failing if latencies were set to two cycles (where's that 7ns latency?). For comparison, medium-grade PC100 RAM (the RAM in the P2) gives 93MB/sec (uncached).
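
For a quick cross-check from inside the OS, something like the sketch below gives a rough read rate. It is only an approximation under stated assumptions: a user program cannot switch the caches off the way a standalone tester can, so the buffer has to be much larger than L2 for the figure to reflect main RAM at all. The names `time_sequential_read` and `mb_per_sec` are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: reads a buffer of `bytes` bytes front to back
   `passes` times, one unsigned long at a time. Returns elapsed CPU
   seconds, or -1.0 if allocation fails. The buffer is filled with 1s,
   so *sink ends up holding the number of words read. */
double time_sequential_read(size_t bytes, int passes, unsigned long *sink)
{
    unsigned long *buf;
    size_t n = bytes / sizeof *buf;
    size_t i;
    unsigned long sum = 0;
    clock_t start, stop;
    int pass;

    assert(n > 0 && passes > 0 && sink != NULL);
    buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    for (i = 0; i < n; i++)
        buf[i] = 1;

    start = clock();
    for (pass = 0; pass < passes; pass++)
        for (i = 0; i < n; i++)
            sum += buf[i];
    stop = clock();

    *sink = sum;
    free(buf);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Converts bytes moved and seconds into MB/sec. Returns 0.0 when the
   interval was too short for clock() to measure. */
double mb_per_sec(double bytes_moved, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return bytes_moved / (1024.0 * 1024.0) / seconds;
}
```

With a 32MB buffer and ten passes, `mb_per_sec(32.0 * 1024 * 1024 * 10, seconds)` gives a ballpark figure. It will not match a cache-disabled tester's numbers, but a large gap between two machines running the same binary is still informative.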

Test your RAM when you buy it!
ftp://www.ibiblio.org/pub/Linux/system/hardware/memtest86-2.5.tar.gz

So: T-birds kick ass. Despite the disgusting RAM, its great cache managed to make most things fast (128k L1 cache transfer rate of around 7.9GB/sec, 256k L2 cache transfer rate of around 2.5GB/sec). I doubt that a P3 can come close to that with its tiny L1 cache and relatively slow L2 (same frequency, of course). We shall see though (I plan to run some tests).

Thanks for your suggestions, though.
Edited by JoeShmoe on 02/24/01 03:46 PM.