Existing FSB-based XEON compute blades don't have enough memory
bandwidth to properly utilise all four cores for complex tasks
that hammer the memory system, e.g. animation rendering; the movie
10,000 BC is a particular example I'm familiar with, much of it
done by MPC in London. Frames are rendered one per core with an
average CPU utilisation of 90%. MPC uses a render management
system called Alfred, which monitors how hard the memory system is
being pushed (probably by monitoring CPU usage, etc.); if a particular
frame's complexity is such that all four cores can't be fed fast
enough, then one of the cores is not used, i.e. performance can be
better on an FSB-type quad-core XEON by only using three cores.
This action isn't normally necessary with Opterons of course.
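Just to illustrate the idea (a rough sketch only, in Python, not how
Alfred actually works; the probe workload, frame counts and the 5%
threshold are all assumptions of mine), a scheduler could compare
aggregate throughput with three vs. four workers and pick whichever
wins:

#!/usr/bin/env python
# Hypothetical sketch: decide how many render workers to run on a
# quad-core FSB XEON node by measuring aggregate throughput. A
# memory-bound frame can render faster overall on 3 cores than on 4.
import time
from multiprocessing import Pool

def render_frame(frame_id):
    # Stand-in for a real renderer invocation; it just touches a
    # large buffer so the task is somewhat memory-bound.
    buf = bytearray(64 * 1024 * 1024)
    for i in range(0, len(buf), 4096):
        buf[i] = (buf[i] + 1) & 0xFF
    return frame_id

def throughput(n_workers, n_frames=8):
    # Frames per second achieved with n_workers concurrent renders.
    start = time.time()
    with Pool(n_workers) as pool:
        pool.map(render_frame, range(n_frames))
    return n_frames / (time.time() - start)

if __name__ == "__main__":
    t3, t4 = throughput(3), throughput(4)
    # If four workers aren't meaningfully faster than three, the node
    # is bandwidth-starved and the fourth core is better left idle.
    workers = 4 if t4 > 1.05 * t3 else 3
    print("3 workers: %.2f f/s, 4 workers: %.2f f/s -> use %d"
          % (t3, t4, workers))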
The Nehalem XEON, using QPI, will solve this problem completely,
giving an Intel platform that can run this type of task just as
well as (and probably better than) Opterons can. QPI is scalable,
so Intel can add more links to match the number of cores. With
Intel adding NUMA support as well, highly scalable single-image
Nehalem systems should also run very nicely.
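A quick back-of-envelope sum shows why the IMC matters so much for
feeding four cores (the FSB and DIMM speeds below are round figures
I've assumed for illustration, not the spec of any particular model):

GB = 1e9

# Shared front-side bus: e.g. a 1333 MT/s FSB, 8 bytes wide, shared
# by all four cores on the socket.
fsb_peak = 1333e6 * 8            # ~10.7 GB/s for the whole socket
print("FSB per core : %.1f GB/s" % (fsb_peak / 4 / GB))

# Nehalem-style IMC: e.g. three DDR3-1066 channels local to the
# socket's four cores (QPI carries remote/inter-socket traffic).
imc_peak = 3 * 1066e6 * 8        # ~25.6 GB/s per socket
print("IMC per core : %.1f GB/s" % (imc_peak / 4 / GB))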
For reference, MPC has more than 900 x Dell PowerEdge-1950 servers,
all of them fitted with two quad-core 3.2GHz XEONs, 32GB RAM and
a single 750GB SATA disk; more than 7000 cores total. They all run
64-bit CentOS, a free Linux variant that is binary-compatible
with RHEL.
Movie studios will love the new Nehalem XEON. Apart from it being
a faster chip anyway, they shouldn't have to worry so much (if
at all) about load monitoring for turning off cores. We'll have
to wait and see the proper results, but I expect performance
to be exceptional for rendering.
I'm very familiar with SGI hardware; SGI dealt with these bus
bottlenecks more than 10 years ago at an architectural level.
The early-1990s
Challenge/Onyx platform, with up to 36 CPUs each
with 2MB L2 (no dual-cores back then), had a shared-bus design
(256-bit @ 37MHz, 1.2GB/sec). Complex tasks could severely hamper
the available bandwidth per CPU, often limiting useful scalability
to 20 CPUs or so. The issues were similar to today's FSB bottlenecks.
Here's my Onyx btw, with 24 CPUs and 4GB RAM (not bad for 1994).
The solution was the later Origin/Onyx2 platform (circa 1996;
here's
mine), which uses a scalable crossbar instead of a shared
bus. Memory bandwidth scales with the number of CPUs, so CPUs
are much less likely to be bandwidth-starved for complex tasks.
As a result, 18 x 195MHz CPUs in a Challenge can be outperformed
by just four 400MHz CPUs in an Origin, or only two 600MHz CPUs
in the later Origin3K (NB: Intel made a similar motherboard-level
architectural switch with the 440BX, which also uses a crossbar). Three
architectural generations later, I wouldn't be surprised if SGI
use the Nehalem XEONs for its next Altix design instead of the
upcoming quad-core Itanium, assuming the NUMA support can scale
to at least 512 CPUs (the current limit of the Altix 4700).
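To put the shared-bus vs. crossbar difference into numbers (using the
1.2GB/sec Challenge bus figure from above; the 0.6GB/sec-per-node
figure for the NUMA case is purely an assumed placeholder, not an
Origin spec):

def shared_bus_per_cpu(n_cpus, bus_bw=1.2):
    # Every CPU contends for one fixed-bandwidth bus.
    return bus_bw / n_cpus

def numa_per_cpu(n_cpus, cpus_per_node=2, node_bw=0.6):
    # Each node brings its own memory bandwidth, so the per-CPU share
    # stays roughly flat as the machine grows (ignoring remote traffic).
    n_nodes = max(1, n_cpus // cpus_per_node)
    return (n_nodes * node_bw) / n_cpus

for n in (4, 20, 36):
    print("%2d CPUs: shared bus %.3f GB/s/CPU, NUMA ~%.3f GB/s/CPU"
          % (n, shared_bus_per_cpu(n), numa_per_cpu(n)))

The shared-bus share collapses as CPUs are added while the NUMA share
stays roughly flat, which is essentially why the Challenge topped out
at around 20 useful CPUs and the Origin didn't.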
There are numerous scientific and server tasks that benefit from
lots of cores. Having an IMC (integrated memory controller) will
greatly improve how well these systems run fine-grained codes.
I'll be getting an i7 system in May/June, for video encoding -
another area where it runs very well.
Ian.