>The setup will be very much like any SUN/HP/IBM multi CPU
>systems
Ahem.. sorry, but not many Sun, HP or IBM setups really look a lot like each other from an architectural POV, assuming you mean UltraSPARC, PA-RISC, Itanium and POWER5 machines, not these companies' x86 boxen (and even then there is considerable difference, e.g. IBM's Hurricane).
>where multiple CPU cards are connected via a backplane.
That's not what I care about; the physical layout is not important.
>Obviously efficiencies go down (look at sun servers that
>need hundreds of CPU to keep up with the likes of IBM with
>only 64 CPUs)
Obviously, that has nothing to do with backplanes or anything, other than that UltraSPARC CPUs can't keep up with x86/POWER/IPF.
>Remember though that on this type of big iron, cache
>coherency is not that important
I suggest you rethink that. It is *extremely* important in the small-iron market (big iron is mainframe class in my book, not 'cheap' 32-way servers). Where it plays a much smaller role is HPC. If cache coherency and chip-to-chip communication were not an issue, clustering cheap 2-way x86 boxes would be a far more obvious solution, and it's no wonder that's exactly what people often do for HPC. But not for commercial workloads a la SAP.
>I.e. if two people are entering a Sales Order there is no
>information they need to share that needs to reside in
>memory all the locking type stuff is handles by the
>database.
First, on which CPU (or rather, in which cache) do you think the database engine resides? Hmm? Surely not all of them!
Secondly, it seems you don't quite understand how cache coherency works with MESI (or MOESI on the K8). On a glueless Opteron system, every time you need to access memory, you have to check the caches of all the other CPUs to see whether that data is already cached somewhere else, and potentially modified. This is cache snooping, and it is required for every memory read. That puts a strain on the HT links regardless of the app, and the strain grows roughly with the square of the number of CPUs, since every CPU's every access has to snoop all the others. The type of app plays a role too: if it's a frequent occurrence that several CPUs share the same data, it gets a LOT worse, as with every modification the owner of the data has to invalidate the cache lines in the other CPUs, transmit the modified data, and so on. But even with perfectly separated threads, cache coherency is an issue.
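To make that last point concrete, here's a minimal C sketch (my own, assuming 64-byte cache lines; nothing here comes from a real benchmark). The two threads share no logical data at all, but their counters land in the same cache line, so every increment forces the coherency protocol to bounce the line between the two CPUs. Uncomment the padding and the ping-pong disappears; time both versions and compare.

/* False-sharing demo: two "independent" counters in one cache line. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct counters {
    volatile unsigned long a;   /* written only by thread 1 */
    /* char pad[64]; */         /* uncomment: push b onto its own line */
    volatile unsigned long b;   /* written only by thread 2 */
};

static struct counters c;

static void *bump_a(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c.a++;                  /* each write can invalidate the line in
                                   the other CPU's cache (MESI/MOESI) */
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%lu b=%lu\n", c.a, c.b);
    return 0;
}

Build with gcc -O2 -pthread; the padded version typically runs several times faster, and that's two threads that "share nothing".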
And then I'm not even mentioning the fact that the Opteron 8xx has only 3 coherent HT links, which means a 32S system would need somewhere between 1 and maybe 8 hops to reach RAM located on another CPU. If you have 32 CPUs all "hopping" like that, it will kill latency and wreck bandwidth, as the HT links will simply be saturated.
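For the curious, here's a back-of-the-envelope C sketch (my own; the circular-ladder wiring is purely hypothetical, just one way to connect 32 sockets with exactly 3 links each): a BFS from socket 0 counting how many hops away everyone else's memory controller is.

/* Hop counts in a hypothetical 32-socket, degree-3 topology:
 * two rings of 16 sockets, plus a "rung" link between the rings. */
#include <stdio.h>

#define N 32

int main(void) {
    int dist[N], queue[N], head = 0, tail = 0, i;

    for (i = 0; i < N; i++) dist[i] = -1;
    dist[0] = 0;
    queue[tail++] = 0;

    while (head < tail) {                 /* plain breadth-first search */
        int u = queue[head++];
        int ring = u / 16, pos = u % 16;
        int nb[3] = {                     /* the three HT links */
            ring * 16 + (pos + 1) % 16,   /* next socket in the ring */
            ring * 16 + (pos + 15) % 16,  /* previous socket in the ring */
            (1 - ring) * 16 + pos         /* the rung to the other ring */
        };
        for (i = 0; i < 3; i++)
            if (dist[nb[i]] < 0) {
                dist[nb[i]] = dist[u] + 1;
                queue[tail++] = nb[i];
            }
    }

    int max = 0;
    double sum = 0;
    for (i = 0; i < N; i++) {
        if (dist[i] > max) max = dist[i];
        sum += dist[i];
    }
    printf("max hops: %d, average: %.2f\n", max, sum / N);
    return 0;
}

With this particular wiring the worst case comes out at 9 hops and the average at 4.5, right in the ballpark of the guess above. And every one of those hops burns HT bandwidth on every link it crosses, for the data and for the snoop traffic.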
Now all this works fine with 4 chips, as HT is pretty fast and a cache snoop is much smaller than the actual data. But it already results in severely worse scaling with 8 sockets, and it will render a 32-socket Opteron system completely useless without some better solution. Maybe it's not that bad for HPC, but for any commercial app a glueless 32-socket Opteron will crawl to a halt. I think the Iwill system's scaling from 4 to 8 sockets is already something like 40% at best, running SPECint_rate, which scales about as well as anything gets. A 32S system might well be *slower*.
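A toy model makes the trend obvious (my own numbers; it charges one memory access per socket per "round", broadcasts a snoop to every other socket, and spreads the load over the links the sockets contribute, ignoring multi-hop forwarding entirely, which only makes things worse):

/* Broadcast-snoop load, relative, for growing socket counts. */
#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 32; n *= 2) {
        double snoops = (double)n * (n - 1);   /* every access snoops all others */
        double links  = 1.5 * n;               /* 3*n/2 links: each joins two sockets */
        printf("%2d sockets: %4.0f snoops/round, %5.1f per link\n",
               n, snoops, snoops / links);
    }
    return 0;
}

Even in this optimistic model the per-link snoop load grows linearly with socket count, roughly 10x going from 4S to 32S; multiply by the ~4.5 average hops from the sketch above and a 32S box is pushing something like 45 times the per-link coherency traffic of a 4S box.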
Now I'm sure they are working on a solution for this; it's not like the problem isn't known and widely documented. I'm just curious how they expect to solve it. Implementing a directory protocol? Seems like that would need support from the CPU. Maybe an HT switch? Maybe both: a chipset that acts as a switch and uses a directory-based protocol between nodes of 4 Opterons that work as usual? I don't know, but it will be something, for sure. Opteron as it is just isn't suited for >4/8 sockets without "glue", and I'm curious about that glue.
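Just to illustrate what the directory option buys you (my sketch, not anything AMD or a chipset vendor has announced): a directory next to each memory controller tracks who caches each line, so a write triggers point-to-point invalidates to the actual sharers instead of a broadcast snoop to all 31 other nodes.

/* Sketch of directory-based coherence bookkeeping, one entry per line. */
#include <stdint.h>

enum dir_state { DIR_INVALID, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;    /* coherence state of the cache line */
    uint32_t       sharers;  /* one bit per node: who caches the line */
    uint8_t        owner;    /* valid only when state == DIR_MODIFIED */
};

/* On a write, invalidate only the recorded sharers -- point-to-point
 * messages instead of snooping all 31 other nodes. */
static int nodes_to_invalidate(const struct dir_entry *e, uint8_t writer,
                               uint8_t out[32]) {
    int n = 0;
    for (uint8_t node = 0; node < 32; node++)
        if (((e->sharers >> node) & 1u) && node != writer)
            out[n++] = node;
    return n;                /* usually 0 or 1, never a 31-way broadcast */
}

The catch, and probably why it would need CPU support: on the Opteron the memory controller and the coherence logic sit on the die itself, so the directory either lives there too or an external chip has to intercept the coherent HT traffic, which is exactly the "glue" in question.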
= The views stated herein are my personal views, and not necessarily the views of my wife. =