Phenom II vs. i7: cache design for gaming



1,600MHz, CL9 DDR3 for Phenom
1,600MHz, CL7 DDR3 for i7
http://www.bit-tech.net/hardware/cpus/2009/06/22/intel-core-i7-975-extreme-edition-review/4

It is difficult to just look at a test and read off latency; whether the test application measures correctly is just one important factor.
I think that Phenom will show better numbers when they set memory to ganged mode, but in real applications it performs better in unganged mode.

I did some coding about a year ago to check this (memory performance), and I can tell you it is hard to measure. CPUs have so many features for avoiding memory reads/writes that you have to work around just to get at the real latency.
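For what it's worth, the standard way around those features is pointer chasing: walk a randomly shuffled chain so every load depends on the previous one and the prefetcher can't guess ahead. A minimal Python sketch of the idea (my own illustration; real latency tools like Everest do this in C or assembly, since in Python the interpreter overhead swamps the actual memory latency):

```python
import random
import time

def build_chain(n):
    """Build a single random cycle over n slots so each load depends on
    the previous one -- the prefetcher cannot guess the next address."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for i in range(n - 1):
        chain[order[i]] = order[i + 1]
    chain[order[-1]] = order[0]
    return chain

def chase(chain, steps):
    """Walk the chain; every step is a dependent 'load'."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]
    elapsed = time.perf_counter() - start
    return elapsed / steps  # average time per dependent access

chain = build_chain(1 << 16)
per_access = chase(chain, 100_000)
```

On real hardware you would size the chain well above the caches to measure DRAM, and below each cache level to isolate L1/L2/L3.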


 
Something's odd about that review...

I have Everest 4.6 (the version used in that review), an i7, and some 1600MHz Cas 7 RAM, and my benchmarks completely disagree with their numbers (52ns for i7):

Everest-trichannel6dimm1600cas7.png
 


That's what I mean. You can't just read these tests and assume they are right. It is difficult to test memory latency; you need to know a lot about the test, and how they test, to understand the results.
 

Your test shows wrong numbers for the L3 cache as well; it isn't that fast.
 
Again, to me, these tests show the i7's aggressiveness and horsepower, which in a wider or more accessible scenario may not benefit the i7 as much, but we need more data.
Since the P2 is slower but has slightly lower latency, having more accesses may benefit the P2, since the lower latency may add to the overall output when there are more accesses.
Time will tell
 


LOL - now look who's talking.

Prefetchers are not that important on the i7 because it has very good memory performance; this was much more important on Core 2 because of the slow FSB. There are assembler instructions in x86 for prefetching data, and there is automatic prefetching: the CPU can guess the next access depending on how the data is structured.
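That "guessing" is roughly a stride detector. A toy model (my own simplification, not Intel's actual logic): if the last few accesses move by a constant stride, predict the next address; otherwise stay quiet, because a wrong prefetch wastes bandwidth:

```python
def stride_prefetch_guess(history):
    """Toy stride prefetcher: predict the next address only when the
    last three accesses show a constant stride."""
    if len(history) < 3:
        return None
    stride = history[-1] - history[-2]
    if history[-2] - history[-3] != stride:
        return None  # pattern not stable -- don't prefetch
    return history[-1] + stride

# Sequential scan over 64-byte cache lines: easy to predict
print(stride_prefetch_guess([0, 64, 128]))  # 192
# Random / pointer-chasing access: no stable stride, no prefetch
print(stride_prefetch_guess([0, 512, 64]))  # None
```

This also illustrates why pointer-heavy server workloads defeat prefetching: there is no stride to lock onto.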

OK, so the OoO prefetch algorithms are "not that important", along with the cache line eviction algorithms, according to you. You do know what OoO means, and why it's important for, say, x86 vs. something like VLIW, right?? And don't try turning the tables and asking me to explain, since I "seem to know what I'm talking about". Let's see your explanation first, eh?

 


I should mention that Kassler also forgot (or doesn't actually know) that Nehalem uses SMT, which means the prefetchers have double the threads to correctly predict branches for. But that's probably "not important" to him 😀.

From the realworldtech article I linked to previously:

The instruction fetch unit also contains the branch predictor, which is responsible for predicting the RIP of the next instructions to be fetched. Nehalem’s branch predictors are not shown in detail, partially because some of the details are unknown – Intel simply states that they use “best in class” branch predictors and that their predictors are tuned to work with SMT. Intel did confirm that Nehalem continues to use all of the special predictors from the previous generations, such as the loop detector, indirect predictor, etc.

Once the branch predictor has determined that a branch is taken, the branch target buffer (BTB) is responsible for predicting the target address. Nehalem augments the previous generation of branch prediction by using a two level BTB scheme. For reference, Barcelona uses a 2K entry BTB for direct branches, and a 512 entry indirect branch target array.

Nehalem's two level BTB is designed to increase performance and power efficiency by improving branch prediction accuracy for workloads with larger instruction footprints (such as databases, ERP and other commercial applications). At this time, Intel is not describing the exact internal arrangement, but it is very possible to make an educated guess or two.
 


P2's L1 & L2 cache sizes have to be bigger because they are less effective in keeping critical data, due to - you guessed it 😀 - less efficient cache algorithms. BTW, smaller cache size generally means faster (less latency). Yet another area where you seem to be completely clueless, not realizing the tradeoffs between size & speed. Ideally, all code would be contained in the registers, or at least the L1 cache, for maximum access speed. In the real world, of course, nobody would be satisfied with a program amounting to a few hundred bytes or 32KB, since it couldn't do very much.

L3 cache on Phenom II is MUCH faster compared to memory on i7 😉
L3 cache on Phenom II and i7 don't differ much in performance

Du-uh, no sh!t on the first one. Bzzt! - wrong on the 2nd one, according to the realworldtech article:

Nehalem’s 8MB and 16 way associative L3 cache is inclusive of all lower levels of the cache hierarchy and shared between all four cores. Although Intel has not discussed the physical design of Nehalem at all, it appears that the L3 cache sits on a separate power plane than the cores and operates at an independent frequency. This makes sense from both a power saving and a reliability perspective, since large caches are more susceptible to soft errors at low voltage. As a result, the load to use latency for Nehalem varies depending on the relative frequency and phase alignment of the cores and the L3 itself and the latency of arbitration for access to the L3. In the best case, i.e. phase aligned operation and frequencies that differ by an integer multiple, Nehalem’s L3 load to use latency is somewhere in the range of 30-40 cycles according to Intel architects. The advantage of an inclusive cache is that it can handle almost all coherency traffic without disturbing the private caches for each individual core. If a cache access misses in the L3, it cannot be present in any of the L2 or L1 caches of the cores. On the other hand, Nehalem’s L3 also acts like a snoop filter for cache hits. Each cache line in the L3 contains four “core valid” bits denoting which cores may have a copy of that line in their private caches. If a “core valid” bit is set to 0, then that core cannot possibly have a copy of the cache line – while a “core valid” bit set to 1 indicates it is possible (but not guaranteed) that the core in question could have a private copy of the line. Since Nehalem uses the MESIF cache coherency protocol, as discussed previously, if two cores have valid bits, then the cache line is guaranteed to be clean (i.e. not modified). The combination of these two techniques lets the L3 cache insulate each of the cores from as much coherency traffic as possible, leaving more bandwidth available for actual data in the caches.
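The "core valid" bits described in that quote are easy to mimic in a few lines. A toy model (names and structure are mine, purely to illustrate the snoop-filter idea): the L3 only forwards a snoop to cores whose bit is set, since a cleared bit guarantees that core has no copy:

```python
class L3Line:
    """Toy model of an inclusive L3 line with Nehalem-style
    'core valid' bits (one per core)."""
    def __init__(self, num_cores=4):
        self.core_valid = [False] * num_cores

    def record_fill(self, core):
        # `core` pulled this line into its private L1/L2
        self.core_valid[core] = True

    def cores_to_snoop(self, requester):
        # A cleared bit guarantees that core has no copy, so only
        # cores with the bit set ever see the snoop.
        return [c for c in range(len(self.core_valid))
                if c != requester and self.core_valid[c]]

line = L3Line()
line.record_fill(2)
print(line.cores_to_snoop(0))  # [2] -- cores 1 and 3 are never disturbed
```

That filtering is the "insulate each of the cores from as much coherency traffic as possible" part.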

P2's L3 latency averages 56 CPU cycles according to Lost Circuits. Sounds to me like Nehalem's is half to two-thirds of the P2's.

Of course now you will claim that too is "not important". :sol:
 


OK, now I'm confused. Are we talking about the L3 cache latency, or overall memory access latency here? It seems Kassler jumped from one topic to another, which is par for the course for him (to avoid getting pinned down and having to show his ignorance, IMO).

The Lost Circuits link I posted above is much more comprehensive, mapping block size vs. latencies on the P2 940, so that once you get above the 512K L2 cache size, you can easily spot the huge jump in latency in the L3 cache.

Now if you wanna talk main memory latency, then you'll need to compare using identical memory timings to start with. And I should point out that Mr. Super Scientist Kassler linked to the Bit-Tech review page showing P2 in the best light - multithreaded performance - and conveniently ignored page 5 showing the Nehalems soundly thrashing the P2 on Sandra's single-threaded bench.

I think we have a sufficiently large posting datapoint pattern by Mr. Kassler so that we can benchmark his performance: Cherrypicking, cherrypicking, nitpicking and butt-picking 😀. Like a true ignoramus, he will focus on one tiny item and dismiss as "not important" everything else. Clearly he is not qualified to hold an intelligent discourse and should be ignored forthwith.
 


haha, just so you know....
prefetching data when you are working with huge amounts of data or you have very optimized code is bad. you will get slower performance 😉
prefetching on servers is often a bad technique because of this.
I think that if Intel says they have good prefetchers on the i7, that is probably a marketing trick.

you have some studding to do.

don't have time to respond to your other guessing games right now
 
All these things contribute to the overall latency, and singling out one here and there doesn't give an example of the bigger picture, unless you have a photographic memory and can connect all the various dots.
And each scenario will react differently, as it also has to work with the gpu.
What's being done faster in one part of the cpu may take more cycles in another part, and a different arch may be reversed, and in a particular part of a game it may favor one cpu over another, while this may change later on in the same game.
Now, if anyone wants to simplify this, be my guest heheh, cause at this point we truly don't have enough data to show if these "better" scenarios for P2 exist, or will pop up in greater numbers in the future.
Making overall claims, when the data we do have is contrary, without also having more data to back these "better" claims up, is pointless.
Like I've said, until new cards, W7, DX11 and newer, more complex games come out, it's too little info
 
I'm saying that the i7's L3 latency is faster through its arch, the way the pipeline has been set up, which is a very aggressive approach, which it can do because of its horsepower, if you will.
How that is affected by wider communication without the use of SMT for games may play against it, but we don't have any data, and what does what we currently have show?
Most games show the contrary, where the i7 is doing fine.
kassler's claim of "better" for P2 in more complicated games doesn't show a lot, and not only that, there's only the one example
 


Didn't know the spin doctor works on Saturdays.

How does the i7 pay a price with its inclusive cache design? Please show me.

Phenom II's L3 cache is actually slower than i7's L3.

cm3d-phenom-ii-x4-940.gif


cm3d-i7-920.gif


As the graphs above show, the L3 latency on Phenom II is about 20ns, while on the i7 it is about 15ns.

Keep spinning.
 
In kassler's example, it shows P2 as having the faster, or lower, latency. I didn't check your example, which says the exact opposite.
Either way, whichever claim is correct, or more correct, I'm going on kassler's assumption of game complexity here.
Now that I think of it, maybe it was system latency; I was tired when I read it.
But again, either way, the assumptions are too broad, not enough data.
Haven't looked at the hierarchy of P2's L3 cache, but I was thinking of Kantor's older K8 example, if that's what it was, as it was dated 07, and if there are the omissions from not using the "F" method, but the "O" method is still intact on P2, the complexity could favor P2 if P2 still carries "O",
as in Kantor's example, but we don't have enough data.
And unless you want to go back and check my examples taken from Kantor, where the "F" is seen as being inferior, I won't go into it, and I need to read your link
 


Exactly so! As I think I mentioned earlier, CPUs are a very fast & very complex collection of circuits, and changing the performance of one will affect that of the others, like a machine with a lot of interactive controls. You have to tune the overall device to maximize performance under a wide variety of code conditions. Keeping those 2, 3, 4 or 6 cores fed with needed data is key to throughput, so correctly guessing what branch the code will take, and then loading the code as quickly as possible, and then not throwing it away prematurely so that you have to go back out to main memory, is sorta hugely important, despite what Mr. database programmer sez to the contrary :).

 
fazers, from your link
"However, if we factor in the 4 MB block size that fits into the Deneb but not into the Agena L3 cache, the up to 88 cycles access latency for this pattern will increase the average to ~42 cycles. Keep in mind that this does not mean that the L3 cache is slower, all it means is that we add another factor that by definition will show higher latencies and that will skew the average numbers. "

It should be obvious that it doesn't necessarily contribute to latency; it depends on usage, as the cycles are pertinent only to a complete cache run-through in their test
 
If it fits the schedule, the latencies are reduced, and that is why I've been saying all along that this particular discussion is about the complexity of games and their usage of cache within each arch, varying between games and even within each game, and how it may favor at any given time the overall design of one cpu over the other, depending on how often these favored things are repeated.
It's too complex to make any assertions at this point, and a much wider view has to be taken, and that can only come with more data IMHLO
 


"Studding"?? Sounds like either building a wall outta 2 x 4's or being put out to father offspring 😀.

Anyway, where did I say SMT is optimal under all circumstances?? As Intel states, multithreading is useful in situations where a thread doesn't take up more than about 70% of the available clock cycles. So instead of spending that 30% or more of available clock cycles sitting idle, the Nehalem core can switch to another thread and do some useful work there. So computationally intense threads are not a good choice, as the core is usually completely loaded with no free cycles. And contrary to your other statement, many server tasks are ideal for SMT. Seriously, do I hafta drag out the Anandtech Nehalem vs. Shanghai server review, where Nehalem absolutely destroyed Shanghai with double the performance??
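Back-of-the-envelope, that 70% figure works out like this (a toy utilization model of my own, not Intel's math): if one thread only keeps the core busy a fraction u of the time, a second SMT thread can soak up the idle slots, capped at 2x since there are only two hardware threads per core:

```python
def smt_gain(u):
    """Toy model: relative throughput of two SMT threads vs. one,
    where a single thread uses fraction `u` of the core's cycles."""
    if u <= 0 or u > 1:
        raise ValueError("utilization must be in (0, 1]")
    return min(2.0, 1.0 / u)  # second thread fills the (1 - u) idle share

print(round(smt_gain(0.7), 2))  # 1.43 -- the ~70% case still leaves headroom
print(smt_gain(1.0))            # 1.0  -- compute-bound thread: SMT gains nothing
print(smt_gain(0.4))            # 2.0  -- capped: only two hardware threads
```

It's crude (real SMT threads also contend for caches and issue ports), but it shows why compute-bound threads are a poor fit and stall-heavy server threads are ideal.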

You are nothing more than a silly troll with too much time on your grubster hands. Run along to AMDZone - maybe your wild-ass guesses masquerading as theories might get appreciated more there. I'm taking my own advice and ignoring you from now on.
 


In the interest of brevity, that's why I went with the average of 56 cycles. According to the link the latency depends not only on the block size but the stride length as well.

BTW I believe this was measured using P2's 1.8GHz or 2.0GHz L3 clock cycles, right? From the Anandtech review:

Nehalem's L1 cache, despite being seemingly unchanged from Penryn, does grow in latency; it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core instead of being 24x the size in Penryn and thus can be accessed in only 11 cycles down from 15 (Penryn added an additional clock cycle over Conroe to access L2).

CPU / CPU-Z Latency                       L1 Cache   L2 Cache    L3 Cache
Nehalem (2.66GHz)                         4 cycles   11 cycles   39 cycles
Core 2 Quad Q9450 - Penryn - (2.66GHz)    3 cycles   15 cycles   N/A



The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at much lower clock speeds (2.0GHz). If we put these numbers into relative terms it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.6 ns with Nehalem's - that's nearly 50% longer in Phenom.

So whatever review latency measurements we use need to be converted into nanoseconds, not clock cycles.
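The conversion itself is just cycles divided by clock frequency in GHz. Plugging in the Anandtech figures quoted above (with the caveat, as the earlier quote notes, that Nehalem's L3 actually runs on its own uncore clock, so these are approximations):

```python
def cycles_to_ns(cycles, clock_ghz):
    """Latency in nanoseconds = cycles / clock frequency in GHz."""
    return cycles / clock_ghz

# Anandtech's figures: Nehalem's L3 measured at 2.66GHz, Phenom's at 2.0GHz
nehalem_l3 = cycles_to_ns(39, 2.66)
phenom_l3 = cycles_to_ns(43, 2.0)
print(round(nehalem_l3, 1), phenom_l3)  # 14.7 21.5
```

This is exactly how the article gets "21.5 ns vs. 14.6 ns" from raw cycle counts at different clocks.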
 


LOL - OK I may have gone a bit overboard here but then somebody's gotta choke off these trolls 😀.

And since I'm just sitting around waiting for the cable repairman to appear within the "guaranteed" ~1 month window :), I don't have anything better to do at the moment. Unfortunately for us all 😀.
 
While it may be accessed faster, and being inclusive, MT from the SW end in gaming scenarios has yet to be fully explored for all its advantages per arch.
My point is, being inclusive can help or hinder when we use MT, depending, while P2's solution may go either way as well, and that is the basis of my wait-n-see attitude
 
Everything still has to fit in the stack. Whether having a slower yet wider approach to MT is going to help or hinder, no one knows, and whether having a narrower yet faster approach, same thing; we will have to see how this all "stacks" up heheh

What's really too bad here is the i7's inability to make use of its SMT in games