Phenom II and i7 <=> cache design for gaming



Let's face it - Kassler saw "48-way" set associative cache for P2 vs. "16-way" for Nehalem & it was all downhill from that 😀...

Kassler: "Du-uh, moah is better!! Du-uh!!"

Zooty: "*&%$*@ Spintel!! Spintel bad!! Ooh, lookit de pretty butterfly... "
 


Explain what - that P2's cache design is better than Nehalem's? The problem is - it's not.

Ever hear of the scientific method? Know what engineering proof of design is? What you have provided is one article plus a whole boatload of assumptions and no data. Zero. Sorry, epic fail here, move along folks...

 

Can you explain the difference here?
Phenom II's L3 cache is: 48-way set-associative
i7's L3 cache is: 16-way set-associative

Why is the Phenom 48-way and why is the Intel 16-way?
 


Both scientific calculations and server programs are vastly more demanding on the CPU-memory subsystem than any other programs, yet Intel's "inferior cache technology" consistently outperforms Shanghai by miles.





So are you saying AMD is a tyrant? :pt1cable: :pt1cable:

This ad was originally pulled from Newegg's AMD thread.
 


[image: warhead.gif]


Stomp i7 indeed.

[image: 18185.png]


[image: 18186.png]


 



Because Intel's L3 cache is a purely inclusive design, while AMD's L3 cache is a mostly exclusive design.
 

???
Inclusive vs. exclusive is a separate question of cache design.

Memory is mapped onto the cache; a line can't be placed just anywhere. On the Phenom, a given memory address has 48 "ways" it can be placed in - 48 different locations - while the i7 has 16. This means that the i7 will throw away memory much sooner in order to store newer data or code.

The L2 and L1 caches are also twice as big on the Phenom compared to the i7.
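
To make the "ways" point concrete, here's a rough C++ sketch of how a set-associative lookup maps an address to a set and a limited number of candidate slots. The sizes are illustrative only (they are not the actual Phenom II or i7 L3 parameters), and real hardware differs in many details:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: a 6 MB cache with 64-byte lines, once at 16-way and
// once at 48-way associativity. Real L3 designs differ in many details.
struct CacheGeometry {
    std::size_t total_bytes;
    std::size_t line_bytes;
    std::size_t ways;           // associativity: candidate slots per set

    std::size_t sets() const { return total_bytes / (line_bytes * ways); }

    // Every address maps to exactly one set; within that set the line may
    // occupy any of `ways` slots. More ways = more candidates before
    // something has to be evicted from that set.
    std::size_t set_index(std::uint64_t addr) const {
        return (addr / line_bytes) % sets();
    }
};

int main() {
    CacheGeometry i7_like  {6u * 1024 * 1024, 64, 16};
    CacheGeometry ph2_like {6u * 1024 * 1024, 64, 48};
    std::uint64_t addr = 0x12345678;
    std::printf("16-way: %zu sets, address maps to a set with 16 possible slots\n",
                i7_like.sets());
    std::printf("48-way: %zu sets, address maps to a set with 48 possible slots\n",
                ph2_like.sets());
    std::printf("example address lands in set %zu / set %zu respectively\n",
                i7_like.set_index(addr), ph2_like.set_index(addr));
}
```

The flip side, which comes up below, is that which of those candidate slots gets evicted is decided by the replacement policy - and that's where the "algorithms" come in.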
 
To be honest, the only real purchase I've made where I didn't feel ripped off or disappointed is the Phenom II 940 - I was amazed by how fast it was compared to my old Phenom 9550, which, by the way, was a piece of *** that I sold to a friend for $130 three months later.

I also have many older rigs, but those aren't important. My latest upgrade is the i7 920 with whatever components I needed, and that was also quite a disappointment. I expected much more performance, and I even OC'ed it.

The Phenom II 940 is my winner =] (not that I dislike my i7 system or my original Phenom, of course).
 


Egg-zactly! :) AMD's cache setup has to snoop the other three cores' caches to determine whether or not to fetch the data from main memory. And even Kassler should know that...
 

The i7 is faster at synchronizing data among cores. This isn't about whether data is in the cache or not.
 


So this isn't about cache hits & misses? Didn't you just say three posts up, to Yomamafor1, that the "i7 will throw away memory much sooner in order to store newer data or code"? IOW, you're implying that the i7 suffers more cache misses because it throws data away sooner, having fewer unique slots to store it in than the P2.

Of course, you are conveniently ignoring the fact that Intel generally has better cache algorithms than AMD has historically managed, and thus manages to outperform the P2 despite using a smaller cache with fewer slots.
 


Can you present information about these cache algorithms? I didn't know you needed good cache algorithms, because that seems like a rather simple task...
 


?? Are you serious? Efficiently (read: correctly) guessing what data will be needed in the near future is the 'raison d'etre' (pardon my French) of cache algorithms to begin with. You realize that x86 code, particularly 'branchy' code like in games that depends on lots of variables such as gamer input, is subject to a lot of misses - if not in the L1, then you have to look in L2, L3, main memory, the hard drive, etc., in order of increasing latency, with a couple of orders of magnitude between the last two. Why do you think CPUs have prefetch algorithms, where on a trip out to main memory, current and historically recent data & code are analyzed so as to grab as much relevant, and only relevant, stuff as you are likely to need in the near future? Conversely, once you've loaded the data into cache, you need to predict mostly correctly whether you're gonna need it again real soon, because if you throw it away, then you'll have to go out to the next level to fetch it again, which eats clock cycles.

Modern CPUs operating in the GHz clock range need instructions & data pumped to them in huge quantities at high bandwidth. They don't wanna be sitting around waiting on stuff. And given the hierarchical sizing relationship (say 4GB main memory, 6 or 8MB L3, 256K or 512K L2, 32K or 64K L1), it's fairly obvious that some astute selection schemes (read: algorithms) have to be employed to maximize data availability and minimize latency. Sometimes the wait cannot be helped, such as a thread waiting on some other thread's result to become available. In fact Intel capitalized on this with SMT on Nehalem - might as well execute some other thread's cycles on the hardware while the first thread is stalled. But a good CPU designer will try really hard to ensure the hardware isn't getting in its own way, as with cache misses.

I suggest you do some of your own research, starting with THIS link. Then try googling memory cache algorithms.
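
For what it's worth, here's a minimal sketch of one classic replacement policy, least-recently-used (LRU), for a single cache set. Real CPUs use cheaper approximations (pseudo-LRU and the like), so treat this purely as an illustration of the idea, not anything Intel or AMD actually ships:

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// One cache set with `ways` slots, evicting the least-recently-used tag.
// Purely illustrative; hardware uses approximations, not a std::list.
class LruSet {
public:
    explicit LruSet(std::size_t ways) : ways_(ways) {}

    // Returns true on a hit, false on a miss (after installing the tag).
    bool access(std::uint64_t tag) {
        auto it = where_.find(tag);
        if (it != where_.end()) {                 // hit: mark as most recently used
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == ways_) {             // set full: evict the LRU victim
            where_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(tag);                   // install the new line as most recent
        where_[tag] = order_.begin();
        return false;
    }

private:
    std::size_t ways_;
    std::list<std::uint64_t> order_;              // front = most recent, back = LRU
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> where_;
};

int main() {
    LruSet set(4);                                // a 4-way set
    for (std::uint64_t t : {1, 2, 3, 4, 1, 5, 2}) set.access(t);
    // Tag 2 is evicted when 5 arrives, then 3 is evicted when 2 comes back -
    // the better the policy predicts reuse, the fewer trips to the next level.
}
```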
 
I need to look into this more.
As we start to see actual multithreaded gaming availability, it'll be interesting.
Apps other than games are much narrower in design compared to a game, and it's possible that flooding the CPU could happen, where inclusive vs. exclusive starts to have a tipping point, even factoring in the latencies.
It's interesting, I'll give it that much anyway.
I see assumptions being made here on both sides, with no real data and no to-the-metal breakdowns or explanations.
I'm not saying one is better than the other, and I think it's possible that even server workloads aren't quite up to a game's on-the-fly diversity, though there may be some clues in server data and breakdowns of it, since there are fewer games than server apps and usage patterns.
To me, this is tasty stuff heheh, and I'm not in any camp - keeping a totally open mind, since in the end the data is the proof and the workings are the pudding.
 
"Nehalem's two level BTB is designed to increase performance and power efficiency by improving branch prediction accuracy for workloads with larger instruction footprints (such as databases, ERP and other commercial applications). At this time, Intel is not describing the exact internal arrangement, but it is very possible to make an educated guess or two."
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=4
This design could be a weakness in heavy games, if you will, as it's designed for a higher degree of "more of the same" data input, like servers see.
I can see where kassler's coming from, but there's little or no data, only benches on a game or two.
The argument, if I understand it correctly, is that while Nehalem may be great at larger instruction footprints, that's certainly not what's found in games, where the workloads are many and wildly diverse.
Need to keep reading....
 
"Another improved branch target prediction mechanism in Nehalem is the return stack buffer (RSB). When a function is called, the RSB records the address, so that when the function is returned, it will just pick up where it left off instead of ending up at the wrong address. A RSB can overflow if too many functions are called recursively and it can also get corrupted and produce bad return addresses if the branch predictor speculates down a wrong path. Nehalem actually renames the RSB, which avoids return stack overflows, and ensures that most misspeculation does not corrupt the RSB. There is a dedicated RSB for each thread to avoid any cross-contamination. "

OK, it seems we're getting a little closer to the effect a game may have on the i7 as its design is put to use.
If the return isn't too costly, adding too much latency, then that blows kassler's theory all to wherever.
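
As a trivial illustration of what the RSB is predicting: every call below pushes a return address, and on the way back out the front end has to guess where each return goes before the actual stack read completes. This is generic C++ to show the call/return pattern, not anything Nehalem-specific:

```cpp
// Each recursive call pushes a return address; the return stack buffer lets
// the front end predict where each return will land. Deep or mispredicted
// recursion can overflow or corrupt it on older designs, which is what the
// renamed, per-thread RSB described above addresses.
long depth_sum(long n) {
    if (n == 0) return 0;
    return n + depth_sum(n - 1);   // nested calls -> nested predicted returns
}

int main() { return depth_sum(100) == 5050 ? 0 : 1; }
```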
 
Branch prediction isn't about cache hits; it's about avoiding having to flush the processor's instruction pipeline.
Core 2 pays a higher penalty when this happens compared to AMD.
Also, on the realworldtech site they only have information about K8 (AMD).
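
A quick way to see what a pipeline flush costs in ordinary code: a branch whose outcome depends on unpredictable data gets mispredicted far more often than a plain loop branch. Rough sketch only - actual penalties vary by microarchitecture:

```cpp
#include <cstdlib>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(1 << 20);
    for (int& v : data) v = std::rand() % 256;   // unpredictable values

    long sum = 0;
    for (int v : data) {
        // The loop branch itself is almost perfectly predicted.
        // This data-dependent branch is a coin flip on random data, so the
        // predictor is wrong roughly half the time and the pipeline gets
        // flushed; sorting `data` first makes the same code run noticeably
        // faster on most out-of-order CPUs.
        if (v >= 128) sum += v;
    }
    std::printf("%ld\n", sum);
}
```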
 
OT here, but this may be why the P2 doesn't OC as high in 64-bit, though I haven't explored the P2's micro-op/macro-op usage in 32-bit vs. 64-bit.
"Macro-op fusion in Nehalem works with a wider variety of branch conditions, including JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE, so any of those, in addition to the previously handled cases will decode into a single CMP+JMP uop. Best of all, Nehalem’s macro-op fusion operates in both 32 bit and 64 bit mode. This is essential, since the majority of servers and workstations are running 64 bit operating systems. Even modern desktops are getting close to the point where 64 bits makes a lot of sense, given the memory requirements of modern operating systems and current DIMM capacities and DRAM density. In addition to fusing x86 macro-instructions, the decoding logic can also fuse uops, a technique first demonstrated with the Pentium M. "
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=5
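
For context, the CMP+JMP pairs being fused are just the compare-and-branch patterns compilers emit for everyday loops and conditionals - nothing exotic is needed in the source. A hypothetical example (the exact instructions emitted depend on the compiler and flags):

```cpp
// A loop like this typically compiles to compare + conditional-jump pairs
// (cmp/jcc) at the loop bottom and inside the body; on Nehalem such a pair
// can be fused into a single uop, in 32-bit and 64-bit mode alike.
long count_below(const int* v, long n, int limit) {
    long count = 0;
    for (long i = 0; i < n; ++i) {   // i < n        -> cmp + jl style branch
        if (v[i] < limit)            // v[i] < limit -> another cmp + jcc
            ++count;
    }
    return count;
}

int main() {
    int v[] = {3, 9, 1, 7, 5};
    return count_below(v, 5, 6) == 3 ? 0 : 1;   // 3, 1 and 5 are below 6
}
```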

"If a loop is less than 28 uops, then Nehalem can cache it in the LSD and issue into the out-of-order engine without using the instruction fetch unit or the decoders. This saves even more power than the Core 2 when using the LSD, by avoiding the decoders and more loops can be cached. Nehalem’s 28 entry uop buffer can hold the equivalent of about 21-23 x86 macro-instructions based on our measurements from several games. The ratio of macro-ops/uops depends heavily on the workload, but in general Nehalem’s buffer is ‘larger’ than that found in the Core 2. "

"One of the most interesting things to note about Nehalem is that the LSD is conceptually very similar to a trace cache. The goal of the trace cache was to store decoded uops in dynamic program order, instead of the static compiler ordered x86 instructions stored in the instruction cache, thereby removing the decoder and branch predictor from the critical path and enabling multiple basic blocks to be fetched at once. The problem with the trace cache in the P4 was that it was extremely fragile; when the trace cache missed, it would decode instructions one by one. The hit rate for a normal instruction cache is well above 90%. The trace cache hit rate was extraordinarily low by those standards, rarely exceeding 80% and easily getting as low as 50-60%. In other words, 40-50% of the time, the P4 was behaving exactly like a single issue microprocessor, rather than taking full advantage of it's execution resources. The LSD buffer achieves almost all the same goals as a trace cache, and when it doesn’t work (i.e. the loop is too big) there are no extremely painful downsides as there were with the P4's trace cache. "
So by moving the "LSD" further down the pipe, the fetch is done later, opening things up for more to be pulled in right away - but again, if it's being flooded or stalled, this will have to be used to a higher degree, even if Nehalem has better access at this point.
I need to do a lot of reading on this, as the tradeoffs in gaming are waaaaay different in how they affect each arch and its data hierarchy.
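
To put the 28-uop figure in perspective, the kind of loop the LSD catches is a small, hot inner loop like the sketch below; once it fits in the buffer, fetch and decode can be bypassed while it spins. Illustrative only - whether any given game loop fits is exactly the open question here:

```cpp
#include <cstddef>

// A tight inner loop like this decodes to well under 28 uops, so Nehalem can
// replay it out of the loop stream detector's buffer instead of re-fetching
// and re-decoding it every iteration. A big, sprawling game loop won't fit,
// and then the LSD simply doesn't help (but, unlike the P4 trace cache,
// doesn't hurt either).
void scale(float* x, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= k;
}

int main() {
    float x[4] = {1, 2, 3, 4};
    scale(x, 4, 0.5f);
}
```
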
Here's an example of AMD's cache possibly having an advantage over Nehalem, but without real gaming data on how it applies in heavier-usage games such as Crysis on the two cores, it's all up for grabs.
Anyway, here's the quote and the link, unfortunately with no data on our subject 🙁
"In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue – the Owner state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory has older or dirty data), without writing back to memory.

Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory – the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI – perhaps the architects decided that the performance gain was too small to justify the additional complexity. "

http://www.realworldtech.com/page.cfm?ArticleID=RWT082807020032&p=5
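
Here's a very rough sketch of the difference the quote describes, modeled as what happens when a second core requests a line that is dirty (Modified) in the first core's cache. It's a toy model of the protocol idea, not either vendor's actual implementation:

```cpp
#include <cstdio>

enum class State { Modified, Owned, Exclusive, Shared, Invalid, Forward };

// Toy model: what happens to a dirty (Modified) line in core A's cache when
// core B asks for a copy?
struct ShareResult {
    State owner_new_state;   // new state of the line in core A
    State requester_state;   // state of the copy handed to core B
    bool  writeback_now;     // does memory have to be updated immediately?
};

// MOESI (e.g. Opteron/Phenom): the dirty line moves to Owned, the copy is
// Shared, and the write-back to memory is deferred until eviction.
ShareResult share_moesi() { return {State::Owned, State::Shared, false}; }

// MESIF/MESI (Nehalem, as described in the quote): the data is written back
// to memory as part of servicing the request, then shared/forwarded.
ShareResult share_mesif() { return {State::Shared, State::Forward, true}; }

int main() {
    std::printf("MOESI defers the writeback: %s\n",
                share_moesi().writeback_now ? "no" : "yes");
    std::printf("MESIF writes back immediately: %s\n",
                share_mesif().writeback_now ? "yes" : "no");
}
```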
 


OK, so on page 4 of your reference it states:

The following families of x86 microprocessors are discussed in this manual:
Microprocessor name Abbreviation
Intel Pentium (without name suffix) P1
Intel Pentium MMX PMMX
Intel Pentium Pro PPro
Intel Pentium II P2
Intel Pentium III P3
Intel Pentium 4 (NetBurst) P4
Intel Pentium 4 with EM64T, Pentium D, etc. P4E
Intel Pentium M, Core Solo, Core Duo PM
Intel Core 2 Core2
AMD Athlon AMD K7
AMD Athlon 64, Opteron, etc., 64-bit AMD K8
AMD Family 10h, Phenom, third generation Opteron AMD K10

I fail to spot any Nehalem or P2 reference there. Isn't that what your original post was about - "Phenom II and i7 <=> cache design for gaming"?? If you're gonna start citing stuff, at least make it relevant to what topic you started this thread with - sheesh!

IIRC AMD made some changes to their cache architecture between K10 and K10.5. Obviously with Nehalem Intel went with an L3 cache - Core 2 didn't use an L3.
 
But it's the complexity in games, where branch prediction, and where it sits in the pipeline, can introduce latencies, that makes it pertinent.
If we assume a game is more complex, branch prediction becomes more of a player within the overall latency, doesn't it?
 
Gee, JDJ - why don't you just copy the whole article already :)...

I will have to look for the links, but from what I've read, Nehalem has better OoO execution, branch prediction and cache algorithms, which partly explains why its IPC is so much higher than on P2. This is where Intel's compiler experience kicks in - the more you know about software, the better you can tune the hardware, in rather simplistic terms.
 


Games are not branchier compared to other software. It is probably the opposite, because there are many tricks you can use to avoid branches if you need performance, and it is possible to optimize the code during development if the compiler isn't able to do the best job.
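
For what it's worth, the "tricks" being referred to here are things like replacing a data-dependent branch with arithmetic or a conditional select, which compilers typically turn into branch-free code (cmov and friends). A small hypothetical example:

```cpp
#include <algorithm>

// Branchy version: the ifs are real conditional branches on possibly
// unpredictable data, so they can be mispredicted.
int clamp_branchy(int v, int lo, int hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// Branch-free version: min/max typically compile to conditional moves, so
// the pipeline never has to guess which way the comparison went.
int clamp_branchless(int v, int lo, int hi) {
    return std::min(std::max(v, lo), hi);
}

int main() {
    return clamp_branchy(200, 0, 255) == clamp_branchless(200, 0, 255) ? 0 : 1;
}
```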