Itanium finally passes Alpha at HP

Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

> >>You can tell by the low prices used alpha systems
> >>fetch on e-bay that the alpha is pretty much history.
> >
> >
> > Pity I live in Canada and not the United States. It'll be awkward for me
> > to take advantage of one of those bargains - I'll have to wait till
> > someone in Canada wants to get rid of his Alpha.
> >
> > John Savard
> > http://home.ecn.ab.ca/~jsavard/index.html
>
> 1.) The Alpha servers/workstations available on E-Bay are
> seldom with processors faster than 233 or 266 MHz.
> In other words, only the really ancient stuff is being
> sold on E-Bay - so it's no surprise that the prices are
> low.
>
> Components for upgrading more modern Alpha servers, such
> as 1 and 2 GB Memory upgrades for Alpha servers, by contrast
> are selling for big bucks. People are willing to pay big
> premiums to keep their Alpha servers alive and well -
> hardly a sign that the Alpha is history.
>
> 2.) Check the "for sale" newsgroups for your province or city.
> Even in Saskatchewan (sk.forsale) we occasionally get local
> sales of Alpha systems comparable to what is available on
> E-Bay. However, those too are 233 and 266 MHz systems
> almost all the time.
>
> 3.) What's wrong with using E-Bay but just limiting your search
> to Canadian sellers ?

Still, there WERE some Alphas available some time ago, and I missed
my opportunity to get them... [I remember that in high school I was an
intern at a certain city hall, and there were workstations there with
96 KB of on-die L2 cache that ran over 500 MHz, at a time when I was
using a brand-new PPro... Anyway, they went into the dumpster when CPQ
decided to kill Alpha, and PCs came in as replacements... I would have
loved to get that piece of HW as an upgrade from my 366 Celeron at the
time, since I'd been using Linux for quite a while.]
Nowadays people put things on eBay, but if there is a shop that uses
Alpha and is migrating away, there is a chance of getting a
medium-aged one cheap enough.
 

Paul Repacholi wrote:
> That delayed the 264, and also delayed the EV7, as it was to use the
> EV6 core as a drop-in part, sort of, plus delayed people moving onto
> the EV8. Well done curly!
>
> The EV8 was not far from 1st tape out when the axe fell I was told.

I have to correct myself. I was not accurate when I guessed a 2006 EV8
release to the public. From the horse's mouth: "end of this year or
early 2005." So the Alpha with 4 threads would have been competing
against the POWER with 4 threads, and within 6-9 months against the
Pentium 4 with 4 threads and the Itanium with 4 threads. All is right
in the world again, as every company stays relatively abreast of its
competitors' thread count.

...except SPARC64, which will have 4 threads in 2007, according to
news earlier this year.

Alex
--
My words are my own. They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright. (I do not speak for my employer.)
 

Alex Johnson wrote:
[...]
> Itanium is a lousy performer in Java. That is because Java employs
> self-modifying code and the Itanium spec explicitly states you can't do
> that.

The last few times I looked, HP's Itanium JVM (derived from Sun's
HotSpot) held the top SpecJBB numbers, and not just at one price point
but at every hardware level (4-way, 16-way, 64-way). I must admit I
haven't looked in a couple of months, though.

In the last results I saw, a 64 way Superdome had better numbers than
the top Sun platform (106 way ?).

The Itanium architecture has no problem with self-modifying code.
One just has to be explicit about it (sync.i). It's trivial for a JIT.
Bundle-sized atomic writes (documented in the publicly available
architecture documents) will provide further help once available.
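For concreteness, a sketch of that flush sequence as described in the publicly available architecture documents: after the JIT stores the new bundles, it flushes the affected lines and then serializes before branching to them. The register numbers and the 32-byte stride here are illustrative assumptions, not something from this thread.

```asm
// Sketch: make freshly JIT-written code in [r32, r33) fetchable.
// r32 = start address, r33 = end address (illustrative choices).
flush_loop:
        fc        r32               // flush the cache line holding new code
        add       r32 = 32, r32     // next line (stride is implementation-defined)
        cmp.ltu   p6, p0 = r32, r33
(p6)    br.cond.sptk  flush_loop ;;
        sync.i                      // make the flushes visible to instruction fetch
        ;;
        srlz.i                      // serialize the instruction stream
        ;;
        // now it is safe to branch to the new code
```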

I won't comment on the rest of your post.

Eric
 

In article <412F6083.5050501@hp.com>,
Eric Gouriou <eric.gouriou@hp.com> writes:
|>
|> In the last results I saw, a 64 way Superdome had better numbers than
|> the top Sun platform (106 way ?).

That's the F15K with all possible MaxCats. 106 CPUs in a box is
a joke, and nobody has that many. We are seriously unusual, even
at 100. The UltraSPARC IIIcu never was the fastest CPU around,
and the strength of the F15K is that it maintains performance
under parallel, memory-limited loading. My guess is that benchmark
is heavily CPU-limited.

There is now the dual-core F25K, which can have up to 144 CPUs
in a box. That might be faster. But no current SPARCs are known
for blazing single-CPU performance.


Regards,
Nick Maclaren.
 

Alex Johnson <compuwiz@jhu.edu> wrote in message news:<cgnb35$673$1@news01.intel.com>...
> Bill Todd wrote:
>
> > In 'Beyond Superdome", he first waxes poetic about current Superdome
> > capabilities, such as their internal interconnect fabric. Let's see: this
> > is the server architecture (at least somewhat reminiscent of the old and
> > rather mediocre GS320 server architecture) that using 64 top-of-the-line
> > Itanics barely manages to stay ahead of the new POWER5 box that requires
> > only 16 processors (on a grand total of 8 chips, since they're dual-core) in
> > TPC-C, right?
>
> I believe you have misinterpreted the "16 processor" POWER5. IBM
> actually refers to chips. "16 processor" as reported is 16 POWER5
> chips, comprised of 32 cores, allowing 64 threads of execution. So the
> 64-thread Madison vs the 64-thread POWER5 having similar performance is
> just a sign that things are about equal. I'm stunned by how good POWER5
> is. But I know that next year Montecito will go from 1 thread per
> package to 4 threads per package. Itanium will be down to a 16P system
> to compete with IBM's 16P system.

No, you are incorrect.

IBM always quotes processor counts as core counts. So a maximally
configured p570 is 8 POWER5 dual-core chips: 16 cores, and with SMT 32
threads.

On a per-chip basis, then, POWER5 delivers > 6X the TPC-C performance
of the Itanium in the HP Superdome.

Here's their spec submission for a fully loaded p570:
http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03234.html
>
> > Then he crows, "HP delivers dual core before Intel" as some kind of
> > significant achievement. Well, maybe. Of course, Sun is delivering
> > dual-core SPARC processors today, and IBM started delivering dual-core
> > POWER4s nearly three years ago. So what beating Intel to the punch mostly
> > proves is just how far behind the curve Itanic really is, I'd suggest.
>
> And Itanium being behind the curve is a joint decision between intel and
> HP, pushed by HP. If not for staffing levels on Itanium a few years
> back and HP pushing to be the first to do the interesting dual-core
> project, there would have been a dual-core Itanium 2 on the market last
> year.
>
> > Doubling
> > current system performance by about a year from now actually sounds pretty
> > impressive, until you recognize that Superdome's TPC-C performance today
> > with 64 processors falls slightly behind today's previous-design-generation
> > POWER4+ systems that use only half that number of processors and only
> > slightly manages to beat today's POWER5 boxes that use only 1/4th as many
> > processors.
>
> As explained above, if you compare per thread, these machines are
> equivalent in size (64P Madison, 32P * 2 cores POWER4+, 16P * 2 cores *
> 2 threads POWER5).

Nope, wrong again. POWER4+ is 16 chips * 2 cores, and POWER4+ is 4X
better per chip than Itanium in TPC-C.
>
> > When Montecito comes along late next year
> > it will indeed close much of this gap with POWER5 (Terry's second
> > TPC-C-specific performance graph suggests it should slightly exceed 2
> > million tpmC), but POWER5 (a full process generation behind Montecito but
> > still heading for about 3 million tpmC late *this* year) will no longer be
> > IBM's top-of-the-line product by then, since POWER5+ (in the same process
> > generation as Montecito) should then be shipping and upping the ante
> > significantly.
>
> I have not seen these graphs. Could you tell me what configuration
> those X million tpmC results are for? 4P, 16P, 64P, 64 *thread*. How
> are the estimates being made? I don't have a lot of TPC numbers, but I
> know a 4-socket Madison today is 121K and a 4-socket POWER5 (yes, that's
> 16 threads) is 371K and Montecito is supposed to also be around 370K in
> 4-socket. It will be a tight race. If you could explain the
> configurations, that would help me. If you could quote published
> 4-socket numbers for POWER4 and POWER4+, that would help me (I'm trying
> to make a table).

IBM has not published any smaller-configuration numbers for Power4+.

Here are the best non-clustered results, in terms of per-core
performance, for Itanium and POWER5/POWER4+.

4 cores
IBM eServer p5 570 4P - Oracle 10G - 194,391
HP Integrity rx5670 Linux - Oracle 10G - 136,110

8 Cores
IBM eServer p570 8P - Oracle 10G - 371,044
Bull NovaScale 5080 - C/S SQL Server 2000 - 175,366

16 cores
IBM eServer p5 570 16P - IBM DB2 UDB 8.1 - 809,144
Unisys ES7000 Aries 420 Enterprise Server - SQL Server 2000 - 309,036

32 cores
IBM eServer pSeries 690 - IBM DB2 UDB 8.1 - 1,025,486
NEC Express5800/1320Xd - Oracle 10G - 683,575

64 cores
HP Integrity Superdome - Oracle 10G - 1,008,144

All Itaniums used 1.5 GHz chips,
the POWER5s were 1.9 GHz, and
the p690 (POWER4+) was 1.9 GHz.

The 64 way (32 Chips * 2 cores) Power5 based p5 590 is due to be
announced in the next 2 months!

Another thing to note is that POWER5 performance is greater with DB2
than with Oracle 10G. If IBM had submitted 4-way and 8-way TPC-C
results using DB2, the numbers would likely be ~220K and ~420K
respectively.
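Since the per-chip claim is easy to misread, here is the arithmetic, a quick sketch using only the tpmC figures quoted in this post (the 16-core p5 570 and the 64-core Superdome):

```python
# Back-of-the-envelope check of the ">6X per chip" claim, using the
# tpmC figures quoted above.
p5_tpmc, p5_cores, p5_cores_per_chip = 809_144, 16, 2       # IBM eServer p5 570
sd_tpmc, sd_cores, sd_cores_per_chip = 1_008_144, 64, 1     # HP Integrity Superdome

p5_per_chip = p5_tpmc / (p5_cores // p5_cores_per_chip)     # 8 dual-core chips
sd_per_chip = sd_tpmc / (sd_cores // sd_cores_per_chip)     # 64 single-core chips

print(f"POWER5:              {p5_per_chip:,.0f} tpmC/chip")
print(f"Itanium (Superdome): {sd_per_chip:,.0f} tpmC/chip")
print(f"Per-chip ratio:      {p5_per_chip / sd_per_chip:.1f}x")
```

This works out to roughly 6.4x per chip, consistent with the >6X figure above.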

>
> > And Fujitsu has regular enhancements to SPARC64
> > coming along to keep pace with Itanic (though not POWER), regardless of what
> > one may think of Sun's future efforts for that architecture.
>
> > SPARC is dead, eh? Or 'no longer relevant', as a later slide says.
> > Someone better tell Fujitsu so it will stop stomping all over the
> > latest Itanics in commercial benchmarks like jbb2000: that's really
> > not suitable behavior for a 'dead' processor. And by all means make
> > sure those HP customers who are defecting to Sun know this: what on
> > earth do you suppose they're thinking?!
>
> Fujitsu is far ahead of Sun in performance, but they are far behind even
> the laggard (intel) when it comes to features. They say dual-core at
> end of '05, dual-core with 2 threads each sometime in '07. Compare that
> to Montecito which is mid-'05 with dual-core and 2 threads per core.
> Two years ahead. What Fujitsu *does* have that keeps pace, or even
> stays ahead, is RAS. I don't have much hope for the SPARC family going
> ahead. Niagara and Rock could be either a revolution or a flop, but I
> think that if Sun sticks with SPARC-64, it will drag them to the bottom
> of the ocean.
>
> Itanium is a lousy performer in Java. That is because Java employs
> self-modifying code and the Itanium spec explicitly states you can't do
> that. It was shortsighted to put that in, but they hoped to escape the
> IA32 complexities self-modifying code added to the design. It cost them
> vast amounts of performance. It was analyzed and Montecito should have
> much better jbb results as they've added new instructions and features
> to directly speed up SMC. After all, intel targeted Sun when they
> marketed Itanium. To leave Java performance in the shitter would be
> marketing suicide.

A few things to note:
Montecito's dual thread implementation is not SMT, it is the much
simpler HMT. Performance increase expected from this is much less
than SMT.

Itanium's Java performance is not that lousy. It is roughly on par
with POWER4+ and a bit behind SPARC64 V, though way behind POWER5.

I'll do another comparison using specjbb2000

8 core
IBM eServer p5 570 1.9Ghz - 328996
Fujitsu PRIMEPOWER650 1.89Ghz- 213956
HP Integrity rx7620 Server - 190393

16 core
IBM eServer p5 570 1.9Ghz - 633106
Fujitsu PRIMEPOWER900 1.89Ghz - 402961
HP Integrity rx8620 Server 1.5Ghz - 341098

32 core
Fujitsu PRIMEPOWER1500 1.89Ghz - 663133
NX7700 i9510 1.5Ghz - 580536
IBM eServer pSeries 690 Turbo 1.7Ghz - 553480

48 core
Sun Fire E6900 1.2Ghz - 421773

64 core
HP Integrity Superdome server 1.5Ghz - 1008604
Fujitsu PRIMEPOWER2500 1.3Ghz - 835479

112 core
Fujitsu PRIMEPOWER2500 1.3Ghz - 1420177

Note that the POWER4+ results didn't use the fastest processor (1.9 GHz).
The PRIMEPOWER 2500 is shipping (or about to) with the 1.82 GHz
SPARC64 V; no official results are in for that config yet.

Of course, the 1.7 GHz / 9 MB Itanium is due very soon as well.

As you can see, POWER5's lead over Itanium in SPECjbb2000 is not as
great as its lead in TPC-C.
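To put a number on that, a quick sketch comparing POWER5's per-core lead over the best 16-core Itanium result in each benchmark, using only the figures quoted in this post:

```python
# POWER5's per-core lead over Itanium at the same core count,
# TPC-C vs SPECjbb2000, from the 16-core results quoted in this thread.
tpcc_lead = (809_144 / 16) / (309_036 / 16)   # p5 570 vs Unisys ES7000
jbb_lead = (633_106 / 16) / (341_098 / 16)    # p5 570 vs rx8620

print(f"TPC-C lead:       {tpcc_lead:.2f}x")
print(f"SPECjbb2000 lead: {jbb_lead:.2f}x")
```

Roughly 2.6x in TPC-C versus about 1.9x in SPECjbb2000.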

>
> > Well, given that 'about now' is upon us and I don't see any "Alpha/IA64
> > hybrids" being benchmarked, 2007 seems at least a lot more credible. I
> > guess my prediction of 2006 three years ago was slightly optimistic, but for
> > a 5-year-out guesstimate I don't feel *that* ashamed of it.
>
> Yeah, having "hybrid" designs now was always BS. It was from an
> external guess with no information. Assuming the Alpha folks were
> divided up and sent to each project being worked on in 2001, there might
> be Alpha concepts coming out now, but intel kept the Alphans together
> and gave them a new project of their own that wasn't on any roadmaps in
> 2001. Shannon just couldn't have known that and spread his rosy ideal
> picture of the future. It was unprofessional to report on what he'd
> like to see rather than what he knows, but it's common practice.
>
> Alex
 

Alex Johnson wrote:

> Bill Todd wrote:
>
>> Of course, if Alpha hadn't been killed three years ago, the new Itanics
>> would have been competing (though the performance gap would have made
>> that
>> word somewhat laughable) against EV8,
>
> This is untrue. The EV8 was reported as having vastly better
> performance than Itanium or Pentium 4, and I was eagerly waiting for it
> (almost drooling). But one thing you leave out is the Alpha team's
> tendency to keep projects going long after their planned release dates.

The problem with *your* analysis is that you view the Alpha demise in
isolation.

Alpha development was drained of resources long before the axe fell.

--
Toon Moene - e-mail: toon@moene.indiv.nluug.nl - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
A maintainer of GNU Fortran 95: http://gcc.gnu.org/fortran/
 

"Thu" <thunn@iprimus.com.au> wrote in message
news:8341eb81.0408271834.7b933d42@posting.google.com...
> Alex Johnson <compuwiz@jhu.edu> wrote in message
news:<cgnb35$673$1@news01.intel.com>...
snip.
>
> Here's their spec submission for a fully loaded p570:
>
http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03234.html
> >
> > > Then he crows, "HP delivers dual core before Intel" as some kind of
> > > significant achievement. Well, maybe. Of course, Sun is delivering
> > > dual-core SPARC processors today, and IBM started delivering dual-core
> > > POWER4s nearly three years ago. So what beating Intel to the punch
> > > mostly proves is just how far behind the curve Itanic really is, I'd
> > > suggest.
> >
snip

sorry if I screwed up the attributions. The above may have been written
by Bill Todd. Not sure.

Anyway, finally crawled through the pitch. HP didn't do dual core.
What they did was redesign the package to allow two chips to fit in an
area where Intel packages one chip. Their "special relationship"
apparently lets them buy unpackaged die.

Terry has really slipped since he was writing about DEC.
I guess all his sources got laid off or sent to Intel.

del cecchi
 

On Fri, 27 Aug 2004 16:25:39 GMT, Eric Gouriou <eric.gouriou@hp.com>
wrote:
>Alex Johnson wrote:
>[...]
>> Itanium is a lousy performer in Java. That is because Java employs
>> self-modifying code and the Itanium spec explicitly states you can't do
>> that.
>
> The last few times I looked, HP's Itanium JVM (derived from Sun's Hotspot)
>hold the top SpecJBB numbers, and not just at one price point but
>for every hardware level (4 way, 16 way, 64 way). I must admit I haven't
>looked in a couple of months though.

You've missed some recent results. HP's Itanium performance in
SPECjbb2000 isn't really what I would call "lousy", but it's
definitely NOT where Intel would like it to be I'm sure, particularly
if you look at 4-core systems (the most widely used config in this
benchmark). Here the fastest 5 chips are:

1. IBM Power5 - 170127
2. AMD Opteron - 133427
3. Intel XeonMP - 118031
4. Intel Itanium2 - 116466
5. IBM Power4 - 96377


The results for the Itanium are a bit dated and the most current
version of the hardware and software would probably move it up ahead
of the Xeon at least, but it seems unlikely that it could match the
Opteron's performance in this test, while the Power5 looks well out of
reach.

Also of note is that HP's PA-8800 processor is still turning in some
VERY respectable scores in this test considering how dated the design
is. There are no 4-core designs, but their 4-chip/8-core turns in a
result of 214932, finishing just ahead of Fujitsu's SPARC64V at 213956
and noticeably ahead of the top 8-processor Itanium result of 190393
(again, a slightly dated result but on current hardware).


Anyway, as it stands right now it looks like IBM's Power5 is the chip
to beat in this test (as it is in pretty much all benchmarks) on the
high-end while AMD's Opteron and Intel's XeonMP are almost certainly
the "bang-for-buck" leaders (again, pretty much the norm here).

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
 

"del cecchi" <dcecchi.nojunk@att.net> wrote in message
news:2paa59FiggsbU1@uni-berlin.de...
>
> "Thu" <thunn@iprimus.com.au> wrote in message
> news:8341eb81.0408271834.7b933d42@posting.google.com...
> > Alex Johnson <compuwiz@jhu.edu> wrote in message
> news:<cgnb35$673$1@news01.intel.com>...
> snip.
> >
> > Here's their spec submission for a fully loaded p570:
> >
> http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03234.html
> > >
> > > > Then he crows, "HP delivers dual core before Intel" as some kind of
> > > > significant achievement. Well, maybe. Of course, Sun is delivering
> > > > dual-core SPARC processors today, and IBM started delivering dual-core
> > > > POWER4s nearly three years ago. So what beating Intel to the punch
> > > > mostly proves is just how far behind the curve Itanic really is, I'd
> > > > suggest.
> > >
> snip
>
> sorry if I screwed up the attributions. The above may have been written
> by Bill Todd. Not sure.

The last part you quoted was.

>
> Anyway, finally crawled through the pitch. HP didn't do dual core.
> What they did was redesign the package to allow two chips to fit in an
> area where Intel packages one chip. Their "special relationship"
> apparently lets them buy unpackaged die.

Actually, my impression was that Terry was referring to the true dual-core
PA-RISC 8800 (which I think is currently shipping), not to the dual-chip
kludge HP created as an interim band aid for Itanic.

- bill
 

nmm1@cus.cam.ac.uk (Nick Maclaren) writes:

>That's the F15K with all possible MaxCats. 106 CPUs in a box is
>a joke, and nobody has that many. We are seriously unusual, even
>at 100. The UltraSPARC IIIcu never was the fastest CPU around,
>and the strength of the F15K is that it maintains performance
>under parallel, memory-limited loading. My guess is that benchmark
>is heavily CPU-limited.

The OP could have done a simple SPEC query to find out; the HP
result seems to have lost its top billing to a 112-way Fujitsu
PrimePower system (which seems to deliver about as many "SpecJBBs"
per CPU per MHz as the HP system).

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
 

In article <8341eb81.0408271834.7b933d42@posting.google.com>,
Thu <thunn@iprimus.com.au> wrote:
>
>On a per chip basis then, Power5 is > 6X the performance on TPC-C
>compared to the HP Superdome.

What I would like to know is how the various designs work on benchmarks
that are not embarrassingly parallel (in a memory access sense). The
tradeoff appears to be between single-threaded performance and
scalability (even for non-communicating codes), and you can't have
both. Now, I haven't tested the Altix or SuperDome myself, but I
can witness that most other designs seem to do one or the other well,
but never both.

This is why SGI and Sun systems do much better in practice than that
sort of benchmark would imply - and do NOT trail in the way that they
are said to in the press.

>Montecito's dual thread implementation is not SMT, it is the much
>simpler HMT. Performance increase expected from this is much less
>than SMT.

What on earth is that new TLA? Anyway, a simpler form of threading
is likely to be MORE efficient than Pentium 4 SMT, because the latter
is a technical failure (though a marketing success). Even Eggers'
model (MIPS-based) showed that there was a pretty marginal gain (and
might be none) over a simple CMT design.


Regards,
Nick Maclaren.
 

"Alex Johnson" <compuwiz@jhu.edu> wrote in message
news:cgnb35$673$1@news01.intel.com...
> Bill Todd wrote:
>
>> In 'Beyond Superdome", he first waxes poetic about current Superdome
>> capabilities, such as their internal interconnect fabric. Let's see:
>> this
>> is the server architecture (at least somewhat reminiscent of the old and
>> rather mediocre GS320 server architecture) that using 64 top-of-the-line
>> Itanics barely manages to stay ahead of the new POWER5 box that requires
>> only 16 processors (on a grand total of 8 chips, since they're dual-core)
>> in
>> TPC-C, right?
>
> I believe you have misinterpreted the "16 processor" POWER5. IBM
> actually refers to chips. "16 processor" as reported is 16 POWER5 chips,
> comprised of 32 cores, allowing 64 threads of execution. So the 64-thread
> Madison vs the 64-thread POWER5 having similar performance is just a sign
> that things are about equal. I'm stunned by how good POWER5 is. But I
> know that next year Montecito will go from 1 thread per package to 4
> threads per package. Itanium will be down to a 16P system to compete with
> IBM's 16P system.

Bill is right. The p5 570, as benchmarked for TPC-C [1], has 4 "building
blocks", each of which is a 4-way machine. Reading the relevant
redpaper [2], each of these building blocks has two processor slots, and
each of the processor cards contains a single DCM (dual-chip module).
The DCM comprises a dual-core POWER5 and the off-chip L3 cache. That
means the 16-way box mentioned has 4 building blocks, each with two
chips, each with two cores (and each of the cores is SMT-capable), so
the "16 processor" POWER5 box is really "16 cores on 8 chips".

[1] http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=104071202
[2] http://www.redbooks.ibm.com/redpapers/pdfs/redp9117.pdf


 

Alex Johnson wrote:

[SNIP]

> As explained above, if you compare per thread, these machines are
> equivalent in size (64P Madison, 32P * 2 cores POWER4+, 16P * 2 cores *
> 2 threads POWER5).

Can you explain to me why you think 2-way SMT is equivalent
to 2 processors?

I can't see how it can be in terms of transistor count or
performance characteristics (think about contention). Just
to add to my confusion, the consensus is that SMT gives < 30%
more oomph at best (depending on workload, of course)...

I can see how you could claim that a 64 *package* Madison box
was analogous to a 32 *package* dual-core box though.

Cheers,
Rupert
 

"Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
message news:1093739318.596789@teapot.planet.gong...

....

> Just to add to my confusion, the concensus is that SMT gives < 30%
> more oomph at best (depending on workload of course)...

While this indeed seems to be about the limit (and in fact seems quite a bit
too generous on average) for the throughput boost that existing SMT
implementations can provide (though I may have encountered a claim of 40%
for one outlier application somewhere), the simulations performed for EV8
seemed to indicate that a single core with far more resources (in terms of
execution units, number of in-flight instructions supported, etc.) than the
current SMT cores provide could obtain significantly higher percentage
throughput boosts from SMT.

- bill
 

In article <MM2dnV8ff-gS2azcRVn-ig@metrocastcablevision.com>,
Bill Todd <billtodd@metrocast.net> wrote:
>
>> Just to add to my confusion, the concensus is that SMT gives < 30%
>> more oomph at best (depending on workload of course)...
>
>While this indeed seems to be about the limit (and in fact seems quite a bit
>too generous on average) for the throughput boost that existing SMT
>implementations can provide (though I may have encountered a claim of 40%
>for one outlier application somewhere), the simulations performed for EV8
>seemed to indicate that a single core with far more resources (in terms of
>execution units, number of in-flight instructions supported, etc.) than the
>current SMT cores provide could obtain significantly higher percentage
>throughput boosts from SMT.

As did Eggers' simulation, which was based on MIPS. My guess is that
much of the problem of Hyperthreading was that it had to start from
the x86 architecture, which was already too complex and messy. But
it could have been other factors as well.

However, the flaw in Eggers' work (I have not seen DEC's) is that it
did not compare SMT with CMP using the same number of transistors.
Yes, the latter would have been slower for serial code, and perhaps
even for a small number of threads, but is MUCH more scalable. Even
Eggers' model ran out of steam at 4 cores (just arguably 8). But, as
I have posted, all these TLAs and other acronyms are intended to give
a veneer of importance to minor variants of a general model.

There is a continuum of shared memory designs, according to how close
the sharing is to the core. SMT is nearly as close as it is possible
to get (and certainly as close as it is sane to go), but you can move
that out to L1, L2, L3 or main memory. And, of course, you can share
different resources at different levels - e.g. compare the memory
bandwidth handling between the Opteron, MIPS/SPARC and POWERx, all
of which do it differently, and at a different level from any on-chip
multi-threading.


Regards,
Nick Maclaren.
 

Bill Todd wrote:

> "Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
> message news:1093739318.596789@teapot.planet.gong...
>
> ...
>
>> Just to add to my confusion, the concensus is that SMT gives < 30%
>> more oomph at best (depending on workload of course)...
>
>
> While this indeed seems to be about the limit (and in fact seems quite a bit
> too generous on average) for the throughput boost that existing SMT
> implementations can provide (though I may have encountered a claim of 40%
> for one outlier application somewhere),

http://www-106.ibm.com/developerworks/linux/library/l-htl/

Table 7. 45% geometric mean improvement for handling chat rooms on a
linux kernel tweaked for Hyperthreading.

RM
 

"Robert Myers" <rmyers1400@comcast.net> wrote in message
news:sPlYc.80023$Fg5.55137@attbi_s53...
> Bill Todd wrote:
>
> > "Rupert Pigott" <roo@try-removing-this.darkboong.demon.co.uk> wrote in
> > message news:1093739318.596789@teapot.planet.gong...
> >
> > ...
> >
> >> Just to add to my confusion, the concensus is that SMT gives < 30%
> >> more oomph at best (depending on workload of course)...
> >
> >
> > While this indeed seems to be about the limit (and in fact seems quite a
> > bit too generous on average) for the throughput boost that existing SMT
> > implementations can provide (though I may have encountered a claim of 40%
> > for one outlier application somewhere),
>
> http://www-106.ibm.com/developerworks/linux/library/l-htl/
>
> Table 7. 45% geometric mean improvement for handling chat rooms on a
> linux kernel tweaked for Hyperthreading.

Well, I suppose that could have been what I remembered - and it does appear
to be the outlier in the article. Still, somewhat better than I'd expect:
I wonder exactly what it is about that workload that's so much more
HT-friendly than the others (unless it's something dead-simple like a
ridiculously short time quantum per thread, such that context-switching
overheads dominate the workload and halving them helps a *lot*).

- bill
 

In comp.sys.ibm.pc.hardware.chips Bill Todd <billtodd@metrocast.net> wrote:
>> http://www-106.ibm.com/developerworks/linux/library/l-htl/
>> Table 7. 45% geometric mean improvement for handling chat rooms on a
>> linux kernel tweaked for Hyperthreading.

> I wonder exactly what it is about that workload that's so
> much more HT-friendly than the others (unless it's something
> dead-simple like a ridiculously short time quantum per
> thread, such that context-switching overheads dominate the
> workload and halving them helps a *lot*).

Well, IRC does mean ridiculously short timeslices (little work)
before a blocking syscall that yields the CPU. Just shovelling
data from one port to another.

Also important in this case will be doing useful work during
memory latency fetches. The busmaster ethernet devices will
drop data into RAM, and some code (probably the kernel TCP/IP
stack) will stall loading it into cache.

-- Robert in Houston
 

nmm1@cus.cam.ac.uk (Nick Maclaren) writes:

> However, the flaw in Eggers' work (I have not seen DEC's) is that it
> did not compare SMT with CMP using the same number of transistors.

So what is the performance boost you get from CMP with ~110% of a single
core?

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

In article <877jrink3i.fsf@k9.prep.synonet.com>,
Paul Repacholi <prep@prep.synonet.com> wrote:
>nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
>
>> However, the flaw in Eggers' work (I have not seen DEC's) is that it
>> did not compare SMT with CMP using the same number of transistors.
>
>So what is the performance boost you get from CMP with ~110% of a single
>core?

Sigh. I said that the flaw in her work is that it did not provide
that information. No, I don't know. What I do know is that the
claimed benefits of SMT are dubious without that comparison.

OK?


Regards,
Nick Maclaren.
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

"Robert Redelmeier" <redelm@ev1.net.invalid> wrote in message
news:PyrYc.14774$af6.13168@newssvr22.news.prodigy.com...

> In comp.sys.ibm.pc.hardware.chips Bill Todd <billtodd@metrocast.net>
> wrote:

>>> http://www-106.ibm.com/developerworks/linux/library/l-htl/
>>> Table 7. 45% geometric mean improvement for handling chat rooms on a
>>> linux kernel tweaked for Hyperthreading.

>> I wonder exactly what it is about that workload that's so
>> much more HT-friendly than the others (unless it's something
>> dead-simple like a ridiculously short time quantum per
>> thread, such that context-switching overheads dominate the
>> workload and halving them helps a *lot*).

The workload is bogus, deliberately designed to inflate the numbers.

> Well, IRC does mean ridiculously short timeslices (little work)
> before a blocking syscall that yields the CPU. Just shovelling
> data from one port to another.

> Also important in this case will be doing useful work during
> memory latency fetches. The busmaster ethernet devices will
> drop data into RAM, and some code (probably the kernel TCP/IP
> stack) will stall loading it into cache.

If you look at the way they created the test, the 'chat' test is really
just a measure of how fast you can do context switches. With HT (and this
ridiculously unrealistic type of workload), you need half as many context
switches. Only an idiot would design a chat application such that a context
switch would be needed every time the server wanted to change which client
it was working on behalf of.
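[Editor's note: the thread-per-client pattern David describes can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual code: each simulated client blocks on its own queue, so delivering any message requires the scheduler to switch to that client's thread — the per-message context switch being criticized.]

```python
# Hypothetical thread-per-client chat relay: every message delivery wakes a
# different thread, forcing one scheduler switch per message.
import queue
import threading

N_CLIENTS = 4
N_MESSAGES = 100

inboxes = [queue.Queue() for _ in range(N_CLIENTS)]
delivered = []
lock = threading.Lock()

def client_thread(idx):
    # Each simulated client blocks on its own inbox; handling its traffic
    # requires switching to this dedicated thread.
    while True:
        msg = inboxes[idx].get()
        if msg is None:          # sentinel: shut down
            break
        with lock:
            delivered.append((idx, msg))

threads = [threading.Thread(target=client_thread, args=(i,))
           for i in range(N_CLIENTS)]
for t in threads:
    t.start()

for m in range(N_MESSAGES):
    # One wake-up (and hence one context switch) per message delivered.
    inboxes[m % N_CLIENTS].put(f"msg-{m}")

for q in inboxes:
    q.put(None)
for t in threads:
    t.join()

print(len(delivered))  # 100
```

A multiplexed design would instead service all clients from one thread, touching the scheduler only when there is genuinely nothing to do.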

DS
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

In article <PyrYc.14774$af6.13168@newssvr22.news.prodigy.com>,
Robert Redelmeier <redelm@ev1.net.invalid> wrote:

>Well, IRC does mean ridiculously short timeslices (little work)
>before a blocking syscall that yields the CPU. Just shovelling
>data from one port to another.

I don't know which IRC server you've worked with, but that's not how
it works; when it wakes up, it does as much as it can (non-blocking)
before it sleeps in select(). I believe the chat benchmark in question
is more aimed at a multi-threaded chat server, which does a lot less
work at a time.
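[Editor's note: the single-threaded style Greg describes — wake up, drain everything available without blocking, then go back to sleep in select() — looks roughly like this sketch. The socketpair stands in for real client connections; this is an illustration of the pattern, not any particular IRC server's code.]

```python
# Minimal event-loop sketch: one thread, non-blocking sockets, do as much
# work as possible per wake-up, then sleep in select() until data arrives.
import select
import socket

a, b = socket.socketpair()   # stands in for a client connection
a.setblocking(False)
b.setblocking(False)

a.sendall(b"hello ")
a.sendall(b"world")

received = b""
while len(received) < 11:
    # Sleep until at least one socket is readable (1 s timeout as a guard).
    readable, _, _ = select.select([b], [], [], 1.0)
    for sock in readable:
        # Drain everything available without blocking before sleeping again.
        while True:
            try:
                chunk = sock.recv(4096)
            except BlockingIOError:
                break            # nothing left; return to select()
            if not chunk:
                break
            received += chunk

print(received.decode())  # hello world
```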

-- greg
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

Thu wrote:
> IBM always refers to 16 processors as the number of cores. So a
> maximum p570 is 8 dual-core POWER5 chips: 16 cores, and with SMT 32
> threads.
>
> Here's their spec submission for a fully loaded p570:
> http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03234.html

Of course! How did I miss that?

> Here is the best non clustered results in terms of performance for
> Itanium and power5/power4+ on a per core basis.

That's a good read. Thanks for the research you put in.

> A few things to note:
> Montecito's dual thread implementation is not SMT, it is the much
> simpler HMT. Performance increase expected from this is much less
> than SMT.

Montecito's implementation is SoEMT. Maybe HMT means the same thing,
but I am unfamiliar with the abbreviation. What is "H" MT?

> I'll do another comparison using specjbb2000

That's another interesting read. Unlike the TPC numbers, the ordering
of the competitors is not the same at each number of cores. At one
scale POWER4+ is leading by a mile, at the next SPARC64V by a mile, and
at one point Itanium 2 is ahead. Quite strange results. But then the
larger systems use lower speed parts. Another oddity.

Alex
--
My words are my own. They represent no other; they belong to no other.
Don't read anything into them or you may be required to compensate me
for violation of copyright. (I do not speak for my employer.)
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

In article <cgut0k$dj0$1@nntp.webmaster.com>,
David Schwartz <davids@webmaster.com> wrote:
>
> If you look at the way they created the test, the 'chat' test is really
>just a measure of how fast you can do context switches. With HT (and this
>ridiculously unrealistic type of workload), you need half as many context
>switches. Only an idiot would design a chat application such that a context
>switch would be needed every time the server wanted to change which client
>it was working on behalf of.

Eh? Why? That is precisely what you want to do to get security,
without having to be very clever. I agree that this is an unusual
requirement, but it is not unreasonable.


Regards,
Nick Maclaren.
 
Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:cgv020$bvu$1@pegasus.csx.cam.ac.uk...
> In article <cgut0k$dj0$1@nntp.webmaster.com>,
> David Schwartz <davids@webmaster.com> wrote:
> >
> > If you look at the way they created the test, the 'chat' test is really
> >just a measure of how fast you can do context switches. With HT (and this
> >ridiculously unrealistic type of workload), you need half as many context
> >switches. Only an idiot would design a chat application such that a context
> >switch would be needed every time the server wanted to change which client
> >it was working on behalf of.
>
> Eh? Why? That is precisely what you want to do to get security,
> without having to be very clever.

While there may be a legitimate differentiation between an 'idiot' and
someone who is merely not 'very clever', the underlying sentiments do not
seem all that different.

I'm not in the habit of completely ignoring performance in favor of
dirt-simple coding when I create software, unless performance is truly
unimportant. And that's especially true for production software, where any
effort expended may be repaid by benefits for literally millions of users.

File servers, which have far more stringent security requirements than chat
rooms, often not only eschew per-request context-switching but may run
entirely in the kernel to achieve optimal performance, even given the
resulting need to roll their own security mechanisms. So embedding
relatively simple security mechanisms in a chat-room server to achieve
significantly better performance hardly seems impractical.

- bill