I hope it is now evident that in AMD's 4x4 implementation, none of the motherboard chips have any say in where hard disk data ends up in RAM, and none of them even see the transfer of data from one memory bank to a local or neighboring CPU. One can draw a similar conclusion for Intel systems. As long as the Northbridge is held constant, changing the other chips around will not resolve memory performance problems.
That's where you're wrong. Legitreviews showed that the chipset driver for QFX WAS putting memory into the wrong sockets. In XP, NUMA is handled by the chipset, not the OS. In Vista, the OS overrides the chipset driver. I haven't seen any Vista Home Premium numbers yet, but I bet they will be better for single-threaded apps.
As a person who worked on components (test and automation) for 5 versions of Windows, I think I know how the kernel works. RC2 (what the last Vista reviews used) is reserved for code freeze and optimization. XP RC2 was a lot slower and less polished than XP RTM.
The same thing will happen. But even if QFX STILL LOSES 10% off of FX62 scores, that's still (including the 8800GTS) at least 70% faster than my 4400+/7800GT in Doom3. That means I can finally buy FEAR.
BM, I suppose you may define chipset to include the on-die memory controller... in that case, yes, that's exactly what AMD needs to work on to minimize the penalties of NUMA. However, your previous writings have insinuated that it was Nvidia's fault for not producing a good enough chipset (motherboard), and that AMD/ATI could do better, when in actuality, the ball rests in AMD's court to release CPUs with efficient memory controllers (and cores, mind you), and not with MS or board makers to release better software workarounds.
The "chipset driver" you refer to is some code, either in the BIOS or shipped with the OS, that programs the integrated memory controllers. As I'm sure you've reviewed the 4x4 block diagrams, you can see that memory data does not pass through the motherboard chips at all; the motherboard simply supplies the physical traces, so it has no opportunity to add latency of its own.
I also have not seen from Legitreviews more than a cursory mention of NUMA and 4x4, much less an in-depth analysis of NUMA thread management under XP.
The reason NUMA works just fine on Opterons is that people have benched server programs, which are very much aware of nodes, local memory, and clustering technologies like HT/InfiniBand and GbE. Where Opterons may have lost in gaming and desktop benchmarks, most people probably wrote that off as due to registered ECC RAM incurring extra latency, which hurts the A64 design quite a bit. I doubt anyone studied the relative performance penalty of games running on 1P versus 2P/8P Opteron systems, as no one buys a 2P/8P Opteron just to game.
NUMA on the desktop, though, should not show its face, as most freeware and consumer-level programs are not built to accommodate server technologies like NUMA (and probably never will be, as multi-socket is already inefficient). I was hoping that AMD had something up its sleeve to make NUMA latency negligible - otherwise, why all the talk about cHT as if it were a substantial improvement on the interconnect technology Opterons have long used? For now, AMD has not made 4x4 much more enticing than two separate computers.
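To put a rough shape on the penalty being argued about, here is a back-of-the-envelope sketch in Python. It is only a toy model: the function name `effective_latency_ns` and both latency figures are made up for illustration and are not measurements of any 4x4 or Opteron system. It just shows how average memory latency degrades when a program that knows nothing about nodes ends up with half its accesses going to the other socket's memory.

```python
def effective_latency_ns(local_ns, remote_ns, local_fraction):
    """Average latency when `local_fraction` of accesses hit the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only for illustration (NOT benchmark data).
LOCAL_NS = 60.0    # access to RAM attached to the CPU running the thread
REMOTE_NS = 100.0  # access that has to hop across the link to the other CPU

# A NUMA-aware program (or scheduler) keeps everything on the local node.
numa_aware = effective_latency_ns(LOCAL_NS, REMOTE_NS, 1.0)

# A NUMA-unaware program with pages spread evenly across two nodes.
numa_unaware = effective_latency_ns(LOCAL_NS, REMOTE_NS, 0.5)

penalty = numa_unaware / numa_aware - 1.0
print(f"{numa_aware:.0f} ns vs {numa_unaware:.0f} ns ({penalty:.0%} slower)")
```

With these invented numbers, the even split costs about a third more latency on average. The real figure depends entirely on the actual local and remote latencies and on how the OS places pages and threads, which is exactly the part nobody in this thread has measured.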