I've noticed that X-Bit Labs has been slow to update lately, but it seems that's because they've been cooking up an interesting article.
http://www.xbitlabs.com/articles/cpu/display/dualcore-dtr-analysis.html
This is the first in-depth analysis that I know of covering core-to-core transfers in dual-core, HT-enabled, and multiprocessor systems. I've only had time to briefly glance over the article, but the results are fascinating to say the least.
First, it seems that the much-touted Crossbar mechanism isn't working. The AMD X2 appears to always pass data from the first core to the second core via RAM rather than through the Crossbar, apparently because of a limitation in the cache-coherency protocol. Luckily, the latencies of the IMC are low.
Second, they actually have an evaluation of the shared cache mechanism on both Yonah and Conroe. Needless to say, the system works very well compared to K8, since K8 is limited by the Crossbar not functioning. Sadly, I can't help but shake my head at Intel for making big strides but not going all the way. They implemented a nice, large 4MB L2 cache, but they didn't bother to enlarge the TLB, which appears to cover only 1024KB of it. For anything beyond that, the processor has to walk the page translation tables to convert addresses, which increases latency. Yonah has the same 1MB TLB limit.
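To put rough numbers on that TLB limit, here is a minimal back-of-the-envelope sketch. The 4KB page size is my assumption (the article excerpt above doesn't state it), and the entry count is simply what falls out of the 1024KB figure under that assumption, not a number taken from the review. The function names are made up for illustration.

```python
# Assumption: 4 KB pages. Entry counts below are derived from the
# 1024 KB figure quoted in the post, not read from any datasheet.
PAGE_SIZE_KB = 4

def tlb_reach_kb(entries, page_size_kb=PAGE_SIZE_KB):
    """Amount of memory the TLB can map without a page table walk."""
    return entries * page_size_kb

def entries_needed(cache_kb, page_size_kb=PAGE_SIZE_KB):
    """TLB entries required to cover cache_kb (ceiling division)."""
    return -(-cache_kb // page_size_kb)

def covered_fraction(cache_kb, entries, page_size_kb=PAGE_SIZE_KB):
    """Fraction of a cache of cache_kb that fits inside the TLB reach."""
    return min(1.0, tlb_reach_kb(entries, page_size_kb) / cache_kb)

if __name__ == "__main__":
    entries = entries_needed(1024)
    print(entries)                          # 256
    print(covered_fraction(4096, entries))  # 0.25
    print(entries_needed(4096))             # 1024
```

In other words, if the 1024KB reach and 4KB pages both hold, the TLB covers only a quarter of Conroe's 4MB L2, and it would need four times as many entries to cover all of it.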
A second weird issue is that although L2 core-to-core transfers work properly, L1 core-to-core transfers do not. It appears to be similar to the limitation that AMD has with its Crossbar, in that L1 transfers in Conroe must go through the FSB rather than the L2 cache. This seems weird to me, since the inclusive architecture of Intel's caches should always allow the L2 cache to have a copy of whatever is in the L1 caches. The article mentions this is a consequence of the write-back mechanism, so I guess that although there is a copy of the L1 cache line in the L2, that copy may not be the most recent one, since it is only updated when the L1 cache line is evicted rather than on every write. When the 2nd core requests data, the L2 cache records a cache miss, and the most recent data is in the L1 cache of the 1st core, Conroe sends the 1st core's L1 cache contents out to the FSB and back to the 2nd core instead of copying them directly to the L2 cache. Yonah has the same problem, and it seems pretty clumsy if you ask me.
I guess this L1 FSB limitation was why Intel was trying to implement direct L1-to-L1 cache transfers in Conroe. However, given that that clearly didn't occur in the review and Intel no longer mentions the feature, it has probably been delayed or canceled. It would have been interesting to see how direct L1-to-L1 transfers work in a true quad-core design. In any case, the L1 core-to-core transfer issue is improved over Yonah thanks to the faster FSB. This issue only occurs when reading data modified by the 1st core, since both cores can read unmodified data at full speed.
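The write-back behavior described above is easier to see in a toy model. This is only a sketch of my reading of it, not Intel's actual coherency logic: the class name, the path labels, and the one-value-per-line simplification are all made up for illustration. It models two private write-back L1s over a shared inclusive L2, where a write only touches the writer's L1, so the L2 copy goes stale until the line is evicted or the other core asks for it.

```python
class ToyDualCore:
    """Toy model: two private write-back L1s over one shared, inclusive L2.

    Hypothetical simplification: each "line" is a single value, and the
    path labels only name where a read was served from.
    """

    def __init__(self, memory):
        self.memory = dict(memory)
        self.l2 = {}            # addr -> value
        self.l1 = [{}, {}]      # per core: addr -> [value, dirty]

    def write(self, core, addr, value):
        # Invalidate the other core's copy. Dropping it is safe here
        # because the write replaces the whole one-value line.
        self.l1[1 - core].pop(addr, None)
        if addr not in self.l2:
            self.l2[addr] = self.memory[addr]   # inclusive: L2 keeps a copy
        # Write-back: only the writer's L1 is updated, L2 is now stale.
        self.l1[core][addr] = [value, True]

    def evict(self, core, addr):
        value, dirty = self.l1[core].pop(addr)
        if dirty:
            self.l2[addr] = value               # L2 updated only on eviction

    def read(self, core, addr):
        mine, theirs = self.l1[core], self.l1[1 - core]
        if addr in mine:
            return mine[addr][0], "own L1"
        if addr in theirs and theirs[addr][1]:
            # Slow path: the newest data lives in the other core's L1.
            value = theirs[addr][0]
            theirs[addr][1] = False
            self.l2[addr] = value
            mine[addr] = [value, False]
            return value, "other L1 via bus"
        if addr in self.l2:
            path = "shared L2"
        else:
            self.l2[addr] = self.memory[addr]
            path = "memory"
        mine[addr] = [self.l2[addr], False]
        return self.l2[addr], path


if __name__ == "__main__":
    cpu = ToyDualCore({0x40: 1})
    print(cpu.read(0, 0x40))   # (1, 'memory')
    print(cpu.read(1, 0x40))   # (1, 'shared L2')
    cpu.write(0, 0x40, 2)
    print(cpu.l2[0x40])        # 1
    print(cpu.read(1, 0x40))   # (2, 'other L1 via bus')
```

Unmodified data is served from the shared L2 at full speed, while data freshly modified by the other core takes the slow path, which matches the pattern the review reports.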
I thought I would also point out that Intel appears to have done it again. It's commonly thought that increasing cache size may increase hit rates, but also increases latency. Intel showed that wasn't the case when they doubled the L2 cache from 1MB to 2MB going from Banias to Dothan while maintaining a low 10-cycle latency, and now it appears that going from 2MB to 4MB from Yonah to Conroe again yielded no disadvantages, as the latency is maintained at 14 cycles. It's actually even better, with Conroe getting only 12 cycles in sequential reading and 14 cycles in random reading, while Yonah had 14 cycles in both sequential and random. I guess they've continued to tweak the sharing algorithms implemented since Yonah. Again, that TLB issue slows things down after 1MB. Argh.
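For anyone wondering what "sequential" versus "random" reading means in a latency test, a common approach is pointer chasing, where each load depends on the result of the previous one. I don't know that this is how the review actually measured it, so treat the sketch below as an assumption. It only builds and walks the two access patterns; timing it in Python would not expose cache latency, and a real test would do the walk in C or assembly. All function names are made up.

```python
import random

def sequential_chain(n):
    """next-index table that visits 0, 1, 2, ... n-1 and wraps around."""
    return [(i + 1) % n for i in range(n)]

def random_chain(n, seed=0):
    """next-index table forming one cycle through all n slots in shuffled order."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt

def walk(nxt, start, steps):
    """Follow the chain; every step depends on the previous load."""
    visited = []
    pos = start
    for _ in range(steps):
        visited.append(pos)
        pos = nxt[pos]
    return visited, pos

if __name__ == "__main__":
    print(walk(sequential_chain(4), 0, 4))   # ([0, 1, 2, 3], 0)
    visited, end = walk(random_chain(8), 0, 8)
    print(sorted(visited), end)              # [0, 1, 2, 3, 4, 5, 6, 7] 0
```

The random chain touches every slot exactly once per lap, so prefetchers can't guess the next address, which is why random latency tends to be the more honest number.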