This is really ridiculous. This thread has turned into a debate that began with the OP's complete misunderstanding of his own quote (as I said earlier). In fact, the message the OP quoted does not even appear in either of his links. Here is a link to the actual passage:
http://www.hfadeel.com/Blog/?p=137
The main point of the quote is to tell programmers how to optimize their code to work around an inherent limitation of set-associative cache design (present on both AMD and Intel). The OP failed to address a key concept here: the critical stride. I'm not entirely sure of its formal definition (the author admits he made the term up himself), but the idea is that if pieces of data are placed in memory blocks spaced a certain distance apart and accessed in a particular order, data from an earlier block may be evicted to make room for new data. This is not a symptom unique to Nehalem; it is a property of CPU cache design in general. The PDF file linked by the OP likewise describes how to write code that shifts data around to avoid cache conflicts.
Ironically, using the critical stride calculation provided by the author (critical stride = total cache size / number of ways), it appears that AMD's K10.5 actually has a shorter critical stride than Nehalem, which means more addresses within any given range collide in the same cache set. By that measure, AMD's Shanghai is actually more likely to suffer from cache conflicts. This completely contradicts the OP's original argument, and this is exactly why I said the OP has no idea what he just posted.
But ultimately, the entire point of the passage is to tell programmers to put related code and data together to reduce cache contention, as the author describes later:
Functions that are used together should be stored together
The code cache works most efficiently if functions that are used near each other are also stored near each other in the code memory. The functions are usually stored in the order in which they appear in the source code. It is therefore a good idea to collect the functions that are used in the most critical part of the code together near each other in the same source file. Keep often used functions separate from seldom used functions, and put seldom used branches such as error handling in the end of a function or in a separate function.
and
Variables that are used together should be stored together
Cache misses are very expensive. A variable can be fetched from the cache in just a few clock cycles, but it can take more than a hundred clock cycles to fetch the variable from RAM memory if it is not in the cache. Some examples of fetch times are given in the table above.
The cache works most efficiently if pieces of data that are used together are stored near each other in memory. Variables and objects should preferably be declared in the function in which they are used. Such variables and objects will be stored on the stack, which is very likely to be in the level-1 cache. Avoid global and static variables if possible, and avoid dynamic memory allocation (new and delete).
Lastly, to argue that because games have small bits and pieces of data scattered around in memory, cache contention must therefore be a real problem, is simply absurd. That is precisely why articles like this one instruct programmers to optimize their code to prevent cache misses and cache contention. The gaming performance of the Phenom II and Core i7 doesn't support this argument either: we only see the Phenom II X3 occasionally pulling ahead of the Core i7 (possibly due to raw clock speed, as well as the way a given game is optimized for multiple cores), while the X4s lag behind.
Seriously, I don't even know why this thread is still dragging on.