OT here, but this may be why the P2 doesn't OC as high in 64-bit, though I haven't explored the P2's micro/macro-op usage in 32-bit vs 64-bit.
"
Macro-op fusion in Nehalem works with a wider variety of branch conditions, including JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE, so any of those, in addition to the previously handled cases will decode into a single CMP+JMP uop. Best of all, Nehalem’s macro-op fusion operates in both 32 bit and 64 bit mode. This is essential, since the majority of servers and workstations are running 64 bit operating systems. Even modern desktops are getting close to the point where 64 bits makes a lot of sense, given the memory requirements of modern operating systems and current DIMM capacities and DRAM density. In addition to fusing x86 macro-instructions, the decoding logic can also fuse uops, a technique first demonstrated with the Pentium M. "
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=5
"If a loop is less than 28 uops, then Nehalem can cache it in the LSD and issue into the out-of-order engine without using the instruction fetch unit or the decoders. This saves even more power than the Core 2 when using the LSD, by avoiding the decoders and more loops can be cached. Nehalem’s 28 entry uop buffer can hold the equivalent of about 21-23 x86 macro-instructions based on our measurements from several games. The ratio of macro-ops/uops depends heavily on the workload, but in general Nehalem’s buffer is ‘larger’ than that found in the Core 2. "
"One of the most interesting things to note about Nehalem is that the LSD is conceptually very similar to a trace cache. The goal of the trace cache was to store decoded uops in dynamic program order, instead of the static compiler ordered x86 instructions stored in the instruction cache, thereby removing the decoder and branch predictor from the critical path and enabling multiple basic blocks to be fetched at once. The problem with the trace cache in the P4 was that it was extremely fragile; when the trace cache missed, it would decode instructions one by one. The hit rate for a normal instruction cache is well above 90%. The trace cache hit rate was extraordinarily low by those standards, rarely exceeding 80% and easily getting as low as 50-60%. In other words, 40-50% of the time, the P4 was behaving exactly like a single issue microprocessor, rather than taking full advantage of it's execution resources. The LSD buffer achieves almost all the same goals as a trace cache, and when it doesn’t work (i.e. the loop is too big) there are no extremely painful downsides as there were with the P4's trace cache. "
So by moving the LSD further down the pipe, the fetch is done later, opening things up for more to be pulled in right away. But again, if it's being flooded or stalled, this will have to be used to a higher degree, even if Nehalem has faster access at this point.
I need to do a lot of reading on this, as the tradeoffs in gaming are waaaaay different in how they affect each arch and its data hierarchy.
Here's an example of AMD's cache possibly having an advantage over Nehalem, but without real gaming data on how it applies in higher-usage games such as Crysis between the two cores, it's all up for grabs.
Anyways, here's the quote and the link, unfortunately with no data on our subject
🙁
"In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue – the Owner state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory has older or dirty data), without writing back to memory.
Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory – the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI – perhaps the architects decided that the performance gain was too small to justify the additional complexity. "
http://www.realworldtech.com/page.cfm?ArticleID=RWT082807020032&p=5