[citation][nom]aldaia[/nom]... Right, but if i have a truly highly parallel application, then, a server with several interconnected nodes offers more bang for the buck. ... [/citation]
A few points to ponder:
a) Many 'truly' parallel apps don't scale well when run across networked nodes, i.e. clusters. It
depends on the granularity of the code; some tasks simply need as much compute power as possible
in a single system. InfiniBand and other low-latency interconnects mitigate this somewhat, but their
latencies are still huge compared to local RAM access. If the code on a particular chip needs little
or no access to the data held by a different node then that's great, and some codes are certainly
like this, but others are definitely not, i.e. a cluster setup doesn't work at all. When 2-socket and
the rarer 4-socket boards can't deliver the required performance, companies use shared-memory
systems instead, which are already available with up to 256 sockets (2560 cores max using the
XEON E7 family), though it's often quite difficult to get codes to scale well beyond 64 CPUs (huge
efforts are underway in the cosmological community at the moment to achieve good scaling up to
512 CPUs, e.g. with the Cosmos machine).
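To put rough numbers on why scaling stalls, here's a minimal sketch (Python, a toy model of my own,
nothing to do with any of the real codes mentioned above): Amdahl's law with a crude per-node
communication penalty bolted on to stand in for network latency. The parallel fraction and the
overhead constant are made-up figures purely to show the trend.
[code]
# Toy scaling model: Amdahl's law plus a crude per-node communication penalty.
# The parallel fraction (0.98) and the overhead constant are illustrative
# assumptions, not measurements of any real system.

def speedup(p, n, comm_cost=0.0):
    """Estimated speedup on n CPUs: serial part + parallel part + comm overhead."""
    return 1.0 / ((1.0 - p) + p / n + comm_cost * n)

for n in (16, 64, 256, 512):
    shared = speedup(0.98, n)                    # shared memory: negligible comm cost
    cluster = speedup(0.98, n, comm_cost=1e-4)   # networked nodes: overhead grows with n
    print(f"{n:4d} CPUs: shared-memory ~{shared:5.1f}x, cluster ~{cluster:5.1f}x")
[/code]
With these numbers the shared-memory estimate keeps climbing, while the clustered estimate peaks
somewhere near 64-100 CPUs and then falls away, which is roughly the behaviour fine-grained codes
show in practice.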
b) Tasks that require massive processing often require a lot of RAM, way more than is supported
on any consumer board. Multi-socket XEON boards offer the required amount of RAM, at the expense
of more costly CPUs, but they also deliver the required performance if that matters too.
ANSYS is probably the most extreme example; one researcher told me his ideal workstation would
be a single-CPU machine with 1TB RAM (various shared-memory systems can have this much RAM
and more, but more CPUs comes as part of the package). X58 suffered from this somewhat: consumer
boards only offered 24GB max RAM (not enough for many pro tasks), and the reliability of such
configs was pretty poor if you remember back to when people first started trying to max out X58
boards with non-ECC DIMMs.
c) Many tasks are mission critical, ie. memory errors cannot be tolerated, so ECC RAM is essential,
something most consumer boards can't use (or consumer chips don't support). Indeed, some apps
are not certified to be used with anything other than ECC-based systems.
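To give a feel for why this matters at server scale, here's a tiny back-of-the-envelope calculation
(Python again, with an entirely assumed error rate; published field studies vary enormously, so
treat the figure as a placeholder rather than data):
[code]
# Expected memory-event count for a big-memory machine over a long run.
# The rate below is an assumption for illustration only, NOT a measured value.
errors_per_gb_per_month = 0.05   # placeholder rate
ram_gb = 512                     # a plausible multi-socket XEON config
months = 6                       # a long simulation or render campaign

expected = errors_per_gb_per_month * ram_gb * months
print(f"Roughly {expected:.0f} memory events expected over {months} months;")
print("without ECC, each one is a potential silent corruption or crash mid-job.")
[/code]
Whatever the real rate turns out to be, the expected count scales with RAM size and run time,
which is exactly the regime these machines live in.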
d) Some tasks also require enormous I/O throughput (usually via huge FC arrays), something again
not possible with consumer boards (1GB/sec is far too low when a GIS dataset is 500GB+). Even
modern render farms now have to cope with data volumes like this for single-frame renders.
It's often about so much more than just the CPU muscle inside the workstation, or as John Mashey
put it years ago, "It's the bandwidth, stupid!", i.e. sometimes it's not how fast one can process,
it's how much one can process (raw core performance matters less than I/O capability). Indeed, even
render farm management systems may deselect cores so that the remaining cores can make better use
of the available memory I/O (it depends on the job), though this was more of an issue for
the older FSB-based XEONs.
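And to put a rough number on the bandwidth point, a similar quick sketch (Python, with my own
illustrative figures; the 8 GB/sec array speed is just an assumed comparison point, not a spec):
[code]
# Time spent just moving data for a large dataset, at two assumed link speeds.
dataset_gb = 500    # e.g. the GIS dataset size mentioned above
passes = 3          # assume the job streams the data a few times end to end

for label, gb_per_sec in (("~1 GB/sec consumer-class I/O", 1.0),
                          ("~8 GB/sec FC array (assumed)", 8.0)):
    minutes = dataset_gb * passes / gb_per_sec / 60.0
    print(f"{label}: about {minutes:.0f} minutes spent purely on I/O")
[/code]
The cores sit idle for most of that time unless the storage side keeps up, which is exactly
Mashey's point.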
And then sometimes raw performance just rules, the cost be damned. Studio frame rendering is
certainly one such example; I've no doubt Ivy Bridge XEONs will be very popular for this.
Thousands of cores is fairly typical for such places.
SB-E is great, but it's at the beginning of the true high-performance ladder, not the end.
Ian.