Maybe something to add to your very fine article,
If one really looks at how data is transferred from main memory to the CPU, it might be interesting to compare both platforms, Intel and AMD. For the sake of simplicity we can put everything (PCI-E, AGP, SuperIO, etc.) in the PCI bus category.
With Intel, it is true that the IO (GPU, PCI) is closer to the memory, generally Controller (PCI) -> NB -> Memory, which in theory makes it good at transferring information from outside to inside. The bottleneck between processor and memory comes in when there is a lot of data to be processed and returned to memory, as in scientific calculation. It is even worse when the data has to go from core to core by way of the FSB. For games, where DMA, bus mastering and other external controllers can handle themselves, it is actually a better architecture. The processor's own path is CPU -> NB -> Memory. It is interesting to note that the Intel processor has quite a good operations-to-memory ratio, close to 15 ops / fetch. In this case one needs a lot more cache, as it is used not only to mask long latencies but also to make up for limited main memory BW.
AMD: the memory controller is attached to the switch, and in the case of a 2xx or 8xx Opteron this switch also connects the other processors. The BW of this switch is really high, in fact higher than the FPU can handle in a y=N*scalar case. The balance is about 8 to 12 fetch / operation. The interesting part is that in the 2xx and 8xx case the memory controller scales very well: if data is on some other processor, the data goes CPU1 -> CPU2 switch -> Memory, so basically the 2nd CPU becomes a "North bridge" for memory. In the case of an 8xx CPU, we can have 2 NB with a 128-bit path for both. The bad thing here is that the IO is actually one step further away from the memory. The crossbar itself can already pass a lot of data to the memory without disturbing the CPU. We might want to consider that latency for IO is maybe not as bad as latency to memory, unless the video card is the IO, in which case we can consider the local memory of the GPU as a "cache".
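To make the "who is closer to memory" comparison concrete, here is a rough toy model that just counts hops along the paths described above. The node names and the exact wiring are my own simplification, in particular the separate "HTBridge" node for IO on the AMD side is an assumption, and hop counts say nothing about the real latency of each hop.

```python
from collections import deque

# Toy topologies; node names are illustrative, not real chip names.
# Intel: CPU and IO both reach memory through the north bridge (NB).
INTEL = {
    "CPU": ["NB"],
    "IO": ["NB"],
    "NB": ["CPU", "IO", "Memory"],
    "Memory": ["NB"],
}

# AMD 2-socket: each CPU (with its on-die switch) owns local memory.
# "HTBridge" is an assumed IO bridge hanging off CPU1.
AMD = {
    "CPU1": ["Mem1", "CPU2", "HTBridge"],
    "CPU2": ["Mem2", "CPU1"],
    "Mem1": ["CPU1"],
    "Mem2": ["CPU2"],
    "HTBridge": ["CPU1", "IO"],
    "IO": ["HTBridge"],
}

def hops(graph, src, dst):
    """Breadth-first search: number of links on the shortest path, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print("Intel CPU -> Memory:", hops(INTEL, "CPU", "Memory"))
print("Intel IO  -> Memory:", hops(INTEL, "IO", "Memory"))
print("AMD CPU1  -> Mem1  :", hops(AMD, "CPU1", "Mem1"))
print("AMD CPU1  -> Mem2  :", hops(AMD, "CPU1", "Mem2"))
print("AMD IO    -> Mem1  :", hops(AMD, "IO", "Mem1"))
```

In this simplified picture the AMD CPU gets a shorter path to its local memory, while IO ends up one hop further away than on the Intel layout, which is the trade-off described above.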
In my opinion AMD should take the memory controller off the CPU and just make more HyperTransport interfaces available. For example, let's say the 8xx CPU had 4 HyperTransport links instead of 3 plus a memory controller. At 6.4GB/s per link it would be possible to make a motherboard with 25.6GB/s BW total, where 6.4GB/s would go to IO and 19.2GB/s would go to memory. For a very IO-heavy configuration someone else could design a board with 12.8GB/s memory and 12.8GB/s IO, or again 19.2GB/s IO and a 6.4GB/s memory bus (very nice in some applications I have here)... The memory could then be handled by a dedicated memory controller that connects to HyperTransport.
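The arithmetic behind those configurations can be sketched in a few lines. This assumes 6.4GB/s per link (which is what 25.6GB/s over 4 links implies) and that each side gets at least one link; the 4-link CPU is hypothetical and the function name is just for illustration.

```python
LINK_GBPS = 6.4  # assumed per-link bandwidth: 25.6 GB/s total / 4 links
NUM_LINKS = 4    # hypothetical 8xx CPU with 4 HyperTransport links

def splits(num_links=NUM_LINKS, link_gbps=LINK_GBPS):
    """List (io_gbps, mem_gbps) for every split giving each side >= 1 link."""
    result = []
    for io_links in range(1, num_links):
        mem_links = num_links - io_links
        result.append((round(io_links * link_gbps, 1),
                       round(mem_links * link_gbps, 1)))
    return result

for io, mem in splits():
    print(f"IO {io:5.1f} GB/s | memory {mem:5.1f} GB/s | total {io + mem:5.1f} GB/s")
```

The three rows are the three boards mentioned above: memory-heavy, balanced, and IO-heavy, all sharing the same 25.6GB/s total.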
I would also bet that AMD could fit even more HyperTransport links if they got rid of the memory controller. They could also change memory technology as fast as Intel, while keeping a reasonable number of pins on the processor. Also keep in mind that, from what I have seen, routing HyperTransport signals would be easier than routing DDRx at 128 or 256 bits wide.
As for Intel, they tried it with Rambus, having the memory controller separated from the North bridge, but the Rambus thing went pretty badly politically. Getting rid of the Northbridge memory controller and incorporating this kind of transport in the CPU itself would make sense.
Feel free to use it, add it, edit it, erase it... whatever!