Core to Core Transfer Analysis

ltcommander_data

I've noticed that X-Bit Labs has been slow to update lately, but it seems that's because they've been cooking up an interesting article.

http://www.xbitlabs.com/articles/cpu/display/dualcore-dtr-analysis.html

This is the first in-depth analysis I know of that covers core to core transfers in dual core, HT-enabled, and multiprocessor systems. I've only had time to briefly glance over the article, but the results are fascinating to say the least.

First, it seems that the much-touted Crossbar mechanism isn't working as intended. It appears to be a limitation of the cache-coherency protocol: the AMD X2 always seems to transfer data from the first core to the second core via RAM rather than through the Crossbar. Luckily, the latencies of the IMC are low.
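For anyone curious how this kind of core to core measurement is done in principle, the basic recipe is to pin one thread to each core, have the first thread dirty a buffer, and then time how long the second thread takes to read those modified lines. Below is a minimal Linux/pthreads sketch of that idea; it's just my own illustration of the technique, not the tool X-Bit Labs used, and the buffer size and handoff are deliberately crude.

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define BUF_BYTES (256 * 1024)        /* arbitrary working set that fits in one core's cache */

static volatile char buf[BUF_BYTES];
static volatile int ready = 0;

static void pin_to_core(int core)     /* bind the calling thread to one core */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

static void *writer(void *arg)        /* core 0 dirties every line of the buffer */
{
    (void)arg;
    pin_to_core(0);
    for (size_t i = 0; i < BUF_BYTES; i++)
        buf[i] = (char)i;
    ready = 1;                        /* crude handoff; a real tool would use proper barriers */
    return NULL;
}

static void *reader(void *arg)        /* core 1 then reads the modified lines and times it */
{
    (void)arg;
    pin_to_core(1);
    while (!ready)
        ;
    struct timespec t0, t1;
    volatile char sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_BYTES; i += 64)   /* one touch per 64-byte cache line */
        sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg per modified line: %.1f ns\n", ns / (BUF_BYTES / 64));
    (void)sink;
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

If the X2 really does bounce modified data through RAM as the article says, the per-line time in a test like this should look more like a memory access than a cache to cache hop.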

Second, they actually have an evaluation of the shared cache mechanism on both Yonah and Conroe. Needless to say, the system works very well compared to K8, since K8 is limited by the Crossbar not functioning. Sadly, I can't help but shake my head at Intel for making big strides but not going all the way. They implemented a nice, large 4MB L2 cache, but they didn't bother to enlarge the TLB, which appears to only be able to cover a 1024KB cache. Anything beyond that and it has to go out to the page translation tables to convert addresses, which increases latency. Yonah has the same 1MB TLB limit.
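For reference, the arithmetic behind that 1MB figure is simple if you assume standard 4KB pages: a TLB can only cover (number of entries) x (page size) of address space before a page walk is needed, so

256 entries x 4KB per page = 1024KB of coverage

The 256-entry count is just my inference from the 1MB number, not something the article spells out.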

A second weird issue is that although L2 core to core transfers work properly, L1 core to core transfers do not. It appears to be similar to the limitation AMD has with its Crossbar: L1 transfers in Conroe must go through the FSB rather than the L2 cache. This seems strange to me, since the inclusive architecture of Intel's caches should always allow the L2 cache to have a copy of whatever is in the L1 caches. The article says this is a consequence of the write back mechanism, so I guess that although there is a copy of each L1 line in the L2, that copy may not be the most recent one, since it isn't updated continuously but only when the L1 cache line is ousted. So when the 2nd core requests data, the L2 cache records a miss, and the most recent data sits in the 1st core's L1, Conroe sends the 1st core's L1 cache contents out to the FSB and back to the 2nd core instead of copying it directly through the L2 cache.

Yonah has the same problem, and it seems really backwards if you ask me. I guess this L1-over-FSB limitation was why Intel was trying to implement direct L1 to L1 cache transfers in Conroe. However, given that this clearly didn't show up in the review and Intel no longer mentions the feature, it has probably been delayed or canceled. It would have been interesting to see how direct L1 to L1 transfers work in a true quad core design. In any case, the L1 core to core transfer issue is improved over Yonah thanks to the faster FSB. Note that this issue only occurs when reading data modified by the 1st core; both cores can read unmodified data at full speed.

I thought I would also point out that Intel appears to have done it again. It's commonly thought that increasing cache size may increase hit rates, but it also increases latency. Intel showed that wasn't necessarily the case when they doubled the L2 cache from 1MB to 2MB going from Banias to Dothan while maintaining a low 10 cycle latency, and now it appears that going from 2MB to 4MB from Yonah to Conroe again yielded no disadvantages, as the latency is held at 14 cycles. It's actually even better than that, with Conroe getting only 12 cycles in sequential reading and 14 cycles in random reading, while Yonah had 14 cycles in both sequential and random. I guess they've continued to tweak the sharing algorithms implemented since Yonah. Again, that TLB issue slows things down after 1MB. Argh.
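To put the latency numbers from the article in one place:

Core     L2 size   L2 load latency
Banias   1MB       10 cycles
Dothan   2MB       10 cycles
Yonah    2MB       14 cycles (sequential and random)
Conroe   4MB       12 cycles sequential, 14 cycles random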
 
I agree, very interesting results, although in the end it doesn't seem to make too much of a difference overall. Shame they didn't do Woodcrest. Clovertown and Kentsfield would have some interesting results too, I bet. Perhaps they could fix some of these problems for Kentsfield and Clovertown?
 
I am wondering why it doesn't just write back to the shared L2 cache rather than go through the FSB. Odd. Could you elaborate on what the reasons might be?
I'm no cache expert, but I'll give you my understanding.

The entire issue seems to be a holdover from multisocket optimizations. Before dual cores, getting two cores meant a 2 socket system that had to communicate via the FSB. Therefore, when the L2 cache of the 2nd single core processor reports a miss and requests data, it sends a request out to the FSB. If the data being requested is currently being modified by the 1st processor and sits in its L1 cache, it has to be read from there, since the write back policy means the copy in the 1st processor's inclusive L2 cache is now outdated. The L2 cache's copy of the L1 line won't be updated until the line is ousted, which may be a while. In the interest of efficiency, when the 2nd processor requests the 1st processor's L1 cache data, the L1 cache sends the requested data directly to the 2nd core through the FSB, bypassing the L2 cache. If the 1st processor had first copied the L1 data into the L2 cache and then transferred it from the L2 cache over the FSB to the 2nd core, that would have introduced unnecessary delay. I suppose they could have tried concurrent transfers from the L1 cache to the L2 cache and the FSB simultaneously, but that would have been pointless in single core multiprocessor situations and would have used up L1-L2 bandwidth.
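To make that reasoning a bit more concrete, here's a toy C sketch of the decision I'm describing: a MESI-style protocol where a line that's Modified in the owner's L1 gets sourced straight onto the FSB, because the inclusive L2's copy is only refreshed on eviction. All the names and helpers are my own invention, not Intel's actual logic.

Code:
#include <stdio.h>

/* Simplified MESI-style line states -- my own illustration */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } state_t;

typedef struct {
    state_t l1_state;   /* state of the line in the owning core's L1 */
    state_t l2_state;   /* state of the (possibly stale) copy in the inclusive L2 */
} line_t;

/* stand-in helpers, just to show which path gets taken */
static void send_over_fsb(line_t *l)  { (void)l; printf("L1 -> FSB -> requesting core\n"); }
static void read_from_l2(line_t *l)   { (void)l; printf("read from the inclusive L2\n"); }

/* The other core misses on a line this core may own.  Per the behaviour
 * described above, a Modified line is sourced from the owner's L1 over the
 * FSB, because the inclusive L2 copy only gets refreshed on eviction. */
static void service_remote_read(line_t *line)
{
    if (line->l1_state == MODIFIED) {
        send_over_fsb(line);
        line->l1_state = SHARED;      /* owner downgrades after sourcing the data */
    } else {
        read_from_l2(line);           /* unmodified data can be read at full speed */
    }
}

int main(void)
{
    line_t dirty = { MODIFIED, INVALID };
    line_t clean = { SHARED, SHARED };
    service_remote_read(&dirty);
    service_remote_read(&clean);
    return 0;
}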

Essentially, it looks to me like when Intel upgraded to the Core 2 architecture, one of the things they left unchanged was the direct L1 to FSB transfer mechanism. I can understand their justification for it, since it offers decent latency benefits by avoiding the L2 cache in multisocket situations such as Woodcrest. At the same time, interprocessor L1 to L1 transfers were supposed to be done over a dedicated link, again avoiding the L2 cache, so Intel could have gotten the best of both worlds (namely single socket and multisocket platforms). However, as I mentioned before, the L1-L1 direct interconnect appears to be disabled for now, so dual core L1 core to core transfers always operate as if they were in a multisocket system.

For Action_Man

Perhaps they could fix some of these problems for Kentsfield and Clovertown?
I doubt it. Kentsfield and Clovertown appear to use Conroe 4MB cores, so I doubt they'll be any different. I'm crossing my fingers that the direct L1 interconnect is just disabled for now like Hyperthreading was initially rather than irreparable.

And finally,
LT_cmd_data, I have one more question. Since the X2 apparently writes back to system memory upon changes in the L2 cache data, would the newer DDR2 IMC show even more degradation in multitasking environments due to the inherent latency of the memory technology, or would bandwidth make up for it in the case of large data transfers?
Since AMD is now pushing DDR2-800 memory, the latency issue is no longer as big as it once was.

http://www.digit-life.com/articles2/mainboard/ddr2-800-am2.html

This was the initial memory evaluation, and the pseudo-random walk latencies are lower on the AM2 while the random walk latencies are higher. It should be noted that this is with DDR2-800 at 5-5-5, so if you really want to push it, 4-4-4 or under would make up the latency difference. The article actually mentions some interesting things, including the bottleneck being the L1-L2 interconnect, which Anandtech mentioned in their K8 and Core 2 architecture comparison.

A nice thing about AMD's IMC is that memory latencies decrease nicely as the core clock speed goes up. While the previous article used a 2GHz AM2 X2 4000+, this latest one uses the 2.8GHz FX-62.

http://www.digit-life.com/articles2/mainboard/ddr2-800-am2-fx62.html

The extra clock speed brings the random walk latencies down nicely and that's the direction it'll probably be going as AMD seems to want to clock higher to compete with Conroe.
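To illustrate why core clock matters at all with an IMC: part of the load-to-use path runs at core clock, so its share of the total shrinks as the CPU clocks up. With completely made-up numbers, say 60 core cycles of on-die overhead plus a fixed 55ns of DRAM time:

at 2.0GHz: 60 cycles / 2.0GHz = 30ns on-die + 55ns DRAM = ~85ns
at 2.8GHz: 60 cycles / 2.8GHz = ~21ns on-die + 55ns DRAM = ~76ns

The real split is different, of course; the point is just that the on-die portion scales with frequency while the DRAM portion doesn't.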

My hope is that Rev F corrects the Crossbar issue in the first place, so we don't need to worry about this IMC business at all.
 
One of the best threads I've been through, so far.

I'm much more ignorant on these matters than most of you, but... I have a question:
Quote:
Essentially, it looks to me like when Intel upgraded to the Core 2 architecture, one of the things they left unchanged was the direct L1 to FSB transfer mechanism. I can understand their justification for it, since it offers decent latency benefits by avoiding the L2 cache in multisocket situations such as Woodcrest. At the same time, interprocessor L1 to L1 transfers were supposed to be done over a dedicated link, again avoiding the L2 cache, so Intel could have gotten the best of both worlds (namely single socket and multisocket platforms). However, as I mentioned before, the L1-L1 direct interconnect appears to be disabled for now, so dual core L1 core to core transfers always operate as if they were in a multisocket system.

For Action_Man

Quote:
Perhaps they could fix some of these problems for Kentsfield and Clovertown?

I doubt it. Kentsfield and Clovertown appear to use Conroe 4MB cores, so I doubt they'll be any different. I'm crossing my fingers that the direct L1 interconnect is just disabled for now like Hyperthreading was initially rather than irreparable.

I believe the whole point of an L1D to L1D link would be to avoid accessing the L2 cache, thus avoiding latency.
Since, in principle, this dedicated link would favour lower latencies (even in multisocket platforms), why do you state later on that «I'm crossing my fingers that the direct L1 interconnect is just disabled for now like Hyperthreading was initially rather than irreparable.»? I assume that Intel left the direct L1 to FSB mechanism unchanged, as you say, because the increase in latency was... well, acceptable (again, avoiding the L2 cache in multisocket systems).
However, you seem to consider the L1-L1 direct link part of the "best of both worlds" (single & multisocket platforms). Why, then, hope that it's merely disabled rather than irreparable?

Thanks.


Cheers!
 
In multisocket systems, regardless of shared FSB or HT, the snoop and trace back through the caches must always query each processor to determine whether a fresh copy of the data resides in the other core. However, in a shared-L2 dual core system, is this not eliminated? And if the cores are sharing the same memory block, do they still need to check each other before reading or writing back to the L2 cache?
Hmm, that's an interesting question. I don't know for sure, but I'd hope that part of Intel's shared cache implementation is an aggressive updating of cache line states. This would mean that when the L1 cache of the 1st core does a read, it goes directly to the shared L2 cache. Assuming the L2 cache has the line, the two states of interest are whether the line is invalid or shared. In a single socket dual core setup, the only reason a cache line in the L2 cache would be invalid is that the 2nd core's L1 cache had modified the data and immediately broadcast that to the L2 cache. The L2 cache would then immediately update itself and pass that data on to the 1st core's L1 cache. If the L2 cache line reads as shared, then the 1st core knows the data is also in the 2nd core's L1 cache but has not been modified, so it can just read directly from the L2 cache. This all assumes the L2 cache is always accurately aware of the state of its cache lines.

Based on the above model, in a write situation all the L1 cache needs to do is oust the line into the L2 cache as per the write back model. Since the L2 cache was already notified that its copy was invalid as soon as the 1st core modified its L1 equivalent, the 1st core doesn't need to worry about coherency; the "smart" L2 cache should be able to figure out what needs to be done, such as updating the contents of the 2nd core.

Hmm, I'm not sure I'm making any sense, and that was quite a lot of speculation on my part actually. I think dissecting Intel's shared cache mechanism would easily be a full article if Intel would be so kind as to disclose how it works.
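To pin down what I'm picturing anyway, here's the lookup written out as a toy C sketch. Every state, name, and message in it is my own guess at the mechanism, not anything Intel has documented.

Code:
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } state_t;

/* The speculative model above: the shared L2 is assumed to always know
 * whether its copy of a line is current.  All names are hypothetical. */
static void shared_l2_read(state_t l2_state, int other_core)
{
    switch (l2_state) {
    case INVALID:
        /* in a single-socket dual core, the only reason it's invalid is that
         * the other core's L1 modified the line and broadcast that to the L2 */
        printf("pull fresh line from core %d's L1, update L2, forward to requester\n",
               other_core);
        break;
    case SHARED:
        /* the other core has an unmodified copy, so L2's copy is good */
        printf("serve the line straight from the shared L2\n");
        break;
    default:
        printf("serve from L2; no other-core copy to worry about\n");
        break;
    }
}

int main(void)
{
    shared_l2_read(INVALID, 1);   /* core 0 reads a line core 1 has dirtied */
    shared_l2_read(SHARED, 1);    /* both cores hold a clean copy */
    return 0;
}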

Also, I have been reading up on the snoop filter in dual FSB chipsets and SMP (particularly how this relates to Intel's Woodcrest platform and AMD's NUMA methods). Would you comment on the benefit of filtering the snoop?
My understanding of snoop filtering is fairly basic, so if you've been researching it I doubt I can really help you. A caveat I have, though, is that I don't think snoop filtering would have much benefit for a Woodcrest platform. I believe the purpose of snoop filtering is to intercept a snoop request from, say, L2 cache 0 at the northbridge. Normally, when cache 0 sends out a snoop request, it goes out to the 2nd bus and to memory as well. The advantage of snoop filtering is that it intercepts the request and, if it determines that the processor(s) on the 2nd bus don't contain the cache line, only forwards the request to RAM, thereby saving the 2nd bus some bandwidth. This is very valuable for IBM in a Xeon MP setup, where 2 Paxville MP dual cores may be sharing a 667MHz FSB, so bandwidth is critical. It'll also be valuable to Dempsey, since if the cache line cache 0 requests is in cache 1 (the 2nd die on the same processor), you'll only tie up 1 FSB and not bother the 2 dies on the 2nd FSB/socket.
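As a rough illustration of what I mean by intercepting the snoop: the filter amounts to a little directory in the northbridge that tracks which lines the caches on the other bus might hold, and the snoop only gets forwarded there when the directory says it could matter. The toy C sketch below is purely my own illustration, with made-up names and a made-up organization, not how Intel's actual filter is built.

Code:
#include <stdbool.h>
#include <stdio.h>

#define FILTER_ENTRIES 4096   /* made-up size; real filters track far more tags */

/* Toy snoop filter: remembers which line addresses the caches on bus 1
 * might hold.  Purely illustrative. */
static unsigned long bus1_tags[FILTER_ENTRIES];

static bool bus1_may_have(unsigned long line_addr)
{
    return bus1_tags[line_addr % FILTER_ENTRIES] == line_addr;
}

/* Cache 0 on bus 0 issues a snoop.  Without a filter it would always hit
 * bus 1 and memory; with one, bus 1 is only bothered when it might matter. */
static void route_snoop(unsigned long line_addr)
{
    if (bus1_may_have(line_addr))
        printf("forward snoop to FSB 1 and to memory\n");
    else
        printf("skip FSB 1, go straight to memory\n");   /* bandwidth saved on FSB 1 */
}

int main(void)
{
    bus1_tags[0x1000 % FILTER_ENTRIES] = 0x1000;
    route_snoop(0x1000);   /* tracked on bus 1: must snoop it */
    route_snoop(0x2040);   /* not tracked: memory only */
    return 0;
}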

However, Woodcrest really won't see the same benefits. The shared L2 cache means it'll never snoop back on itself on the same FSB the way Dempsey does. The most it'll ever snoop is the other shared cache on the other FSB and main memory. I don't see it mattering much to save a bit of snoop traffic on the 2nd FSB, since I doubt a dual core Woodcrest would use up all the bandwidth of a 1333MHz FSB to the point of being bandwidth constrained. I can see snoop filtering benefiting Clovertown though, since that's a bandwidth-constrained, Dempsey-like situation.

I can't really comment on NUMA.
 
Since you're interested in snooping, I thought I'd pass on a tidbit I noticed while glancing over the overview of the 5000X chipset.

http://download.intel.com/design/chipsets/datashts/31307001.pdf

If you go to page 21, in the 2nd paragraph of the overview they talk about the chipset's snoop filter. As I'm sure you know, the workstation 5000X appears to be the only Xeon DP chipset with a snoop filter. In this case, it's used in a very interesting manner. Intel says their snoop filter is used to eliminate snoop traffic to the graphics port, which saves bandwidth in graphics-intensive applications. I knew Intel's policy was to snoop all possible memory locations, but I had no idea they'd bother to snoop graphics card memory as well. Their use of the word "eliminate" seems to indicate that the purpose of the snoop filter is to decouple direct coherency communication between the processor cache and the graphics card, to keep all that snooping from using up graphics bandwidth. I have no idea whether the snoop filter also functions in the traditional sense of preventing needless traffic on the 2nd FSB and the memory subsystem.