Phenom II and i7 <=> cache design for gaming

And that's where this whole thing becomes a possibility.
Maybe I'm wrong for even thinking this, but as games get more complex, and as the OS and (for gaming) DX allow for "wider" availability, this may decrease the i7's advantages. It's set up to be very aggressive/effective with data as it comes in, and with its processing, but as the data flow increases there may be a tipping point where slower, more complex designs start to see gains.
 


Let's say that you execute a database query and get 1,000,000 records as a result. Processing these records, you need to calculate the sum of two fields in each record.
This task is very simple, and it is easy to optimize.
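
To make that concrete, here is a minimal C++ sketch of that kind of workload (the record layout and field names are invented for the example):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record: two fields to be summed per row.
struct Record {
    int32_t field_a;
    int32_t field_b;
};

// One straight pass over a contiguous array of a million records.
// The access pattern is purely sequential, so the hardware prefetcher
// can stream the data in and the compiler can vectorize the loop.
int64_t sum_fields(const std::vector<Record>& records) {
    int64_t total = 0;
    for (const Record& r : records)
        total += r.field_a + r.field_b;
    return total;
}
```

That is the easy case: one predictable stream of data and a handful of instructions per record.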

Now take a very complex game, where each character has good face movements. Just calculating how the skin will change when all the muscles are used, and maybe the hair too, needs huge amounts of code and a lot of data that will be processed in different ways. All this for just one frame.
 
Right, and that's what I'm saying here. And all this has to be compiled and sorted, and above all scheduled.
I'm more of a GPU guy, just learning CPU "things". I know a little bit about games, less about CPUs, but I'm learning.
I know how complex a game is, and having all that info in line is critical, so the CPU/GPU interaction is vital.
I know that W7/Vista has opened the door wider in this interaction, and that's where some advantages previously seen on i7 may not help it as much as other CPUs. That's where my interest lies, as it all comes down to arch, and this interaction between CPU and GPU.
A GPU is much faster than a CPU, and yes, sometimes it waits on the CPU, even the fastest ones, unlike SW, where it's there at the CPU's leisure.
What I'd like to see is real-world scheduler interactions between GPU and CPU latencies.
I had a great link that I've since lost 🙁 that did exactly this, and without totally understanding each game's intricacies, it was a valid way of actually "seeing" the data flow, which could then be used to understand how each game uses what it needs to function, and whether this "more complex" issue is valid or not, seeing as each game is different.
 


Oh yes - a most brilliant, insightful response indeed.

Kassler, I think you are just a poser, and don't know beans about the things you purport to know. For example, your inane dismissal of cache algorithms. Do you realize that there are literally thousands of US patents on just that very subject? Go to www.uspto.gov, and do a search in Class 711, subclasses 118 - 146. If this wasn't a vital IP subject, there wouldn't be much activity, now would there?

 


Prefetchers are far more sophisticated on the i7, so 16-way works perfectly, and real-world performance reflects that pretty well. I would also have to say fewer pathways means lower cost and power draw, but what do I know, I've just read a majority of the white papers available.

Word, Playa.
 


So what? I didn't say that there isn't any algorithm for how data is handled by the cache. What I know is that they are rather simple.
When you start to talk about different techniques, it seems like you have read some new words and don't really know how they work.
 

Can you explain exactly how they work?
 


The prefetchers are logic ICs that monitor, study, abstract, and calculate, within 95+% accuracy (depending on the code structure, mind you), what data should or most likely will be needed for execution by the processor. It's pretty well a must for an OoO execution engine, as x86 code isn't exactly the best for orderly execution, so this logic is essentially the intern, if you will; all it does is try to stay on top of what the rest of the machine needs. It works great on media-like data sets, does fairly well on physics; to be frank, it works really well on anything that can be predicted within reason.

Other than dropping copy-pastes from white papers, me elaborating more is moot, as I am not an IC engineer or a programmer. But this is the internet, so if need be I can pretend to be either one if you like.

Word, Playa.
 

I know what a prefetcher is, but I asked you how they work on the i7 because you seemed to know that.

There are a lot of people just saying things without knowing how they actually work.
Prefetchers are not that important on the i7 because it has very good memory performance; this was much more important on Core 2 because of the slow FSB. There are assembler instructions in x86 for prefetching data, and there is also automatic prefetching: the CPU can guess depending on how the data is structured.
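
The explicit kind is the x86 PREFETCH instruction family (PREFETCHT0/T1/T2/NTA), reachable from C/C++ through the _mm_prefetch intrinsic. A minimal sketch (the loop and the prefetch distance are made up for illustration):

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Illustrative tuning knob: how many elements ahead to request.
// The right distance is workload- and machine-specific.
static const int kPrefetchDistance = 16;

long long sum(const int* data, int n) {
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n)
            // PREFETCHT0: ask for the line in all cache levels, ahead of use.
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kPrefetchDistance),
                         _MM_HINT_T0);
        total += data[i];
    }
    return total;
}
```

The automatic (hardware) prefetchers do the same job without any hint by detecting stride patterns in the access stream, which is why an explicit hint in a plain sequential loop like this usually gains little on Core 2 and even less on i7.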
 


Oh, sorry, I suppose I misread that. My bad.

But any fundamental difference between the Core 2 and the i7 prefetch logic is small at best; I suppose they may have added additional tables to map the L3 space. In essence, the on-die memory controller allowed for a more aggressive prefetch system, with access to a large mirror pool (L3), lower latency, and higher speeds. Essentially they pulled the traffic lights and upped the speed limit. You don't need a lot of lanes if the memory subsystem went from inner-city highway to autobahn.

But it should be noted my actual knowledge of the i7's inner workings is very limited; I tend to research extensively only when I actually own the product. But as I stated, there really isn't much fundamentally different between the uArchs, just as there isn't a fundamental difference across AMD's lineup either.

Word, Playa.

 
This is really ridiculous. This thread has spawned a debate that began with the OP's complete misunderstanding of his quote (like I said earlier). In fact, the message quoted by the OP is not even present in either of his links. Here's the link to the passage:
http://www.hfadeel.com/Blog/?p=137

The main point of the quote is to tell designers how to effectively optimize their code to avoid an inherent limitation of cache design (on both AMD and Intel). The OP failed to address a very important concept here: the critical stride. I'm not entirely sure what the critical stride is (since the author admitted to making the term up himself), but it looks like if pieces of data are placed in different memory blocks that happen to map to the same cache set, and are accessed in a particular order, the data from the previous memory block may be evicted to make room for the new data. This is not just a symptom of Nehalem, but a property of cache design in general. The PDF file linked by the OP also describes how to write code that shifts data around in order to avoid these cache conflicts.

Ironically, using the critical stride calculation provided by the author (critical stride = total cache size / number of ways), it appears that AMD's K10.5 will actually have a shorter critical stride than Nehalem, which means more addresses land on conflicting strides. Therefore, AMD's Shanghai is actually more likely to suffer from cache conflicts. This completely contradicts the OP's original argument, and this is exactly why I said the OP has no idea what he just posted.
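
To put rough numbers on that, using the commonly quoted L3 configurations (illustrative figures, not from the linked article):

critical stride = total cache size / number of ways
Core i7 (Nehalem): 8 MB / 16 ways = 512 KB
Phenom II (Shanghai): 6 MB / 48 ways = 128 KB

So on Shanghai, objects spaced 128 KB apart already contend for the same cache set, while on Nehalem they have to be 512 KB apart: the higher associativity actually shrinks the critical stride rather than helping.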

But ultimately, the entire point of this message is to tell designers to put related code together to prevent cache contention, which he describes later.

Functions that are used together should be stored together
The code cache works most efficiently if functions that are used near each other are also stored near each other in the code memory. The functions are usually stored in the order in which they appear in the source code. It is therefore a good idea to collect the functions that are used in the most critical part of the code together near each other in the same source file. Keep often used functions separate from seldom used functions, and put seldom used branches such as error handling in the end of a function or in a separate function.

and

Variables that are used together should be stored together
Cache misses are very expensive. A variable can be fetched from the cache in just a few clock cycles, but it can take more than a hundred clock cycles to fetch the variable from RAM memory if it is not in the cache. Some examples of fetch times are given in table above.

The cache works most efficiently if pieces of data that are used together are stored near each other in memory. Variables and objects should preferably be declared in the function in which they are used. Such variables and objects will be stored on the stack, which is very likely to be in the level-1 cache. Avoid global and static variables if possible, and avoid dynamic memory allocation (new and delete)
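
Both rules boil down to locality. A minimal C++ sketch of the two ideas together (the function and struct names are invented for illustration, not taken from the manual):

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// "Variables that are used together should be stored together":
// the fields read together every iteration share one struct, and the
// structs sit contiguously in a vector, so each cache line is fully used.
struct Particle {
    float x, y;    // position
    float vx, vy;  // velocity -- always read together with position
};

// "Functions that are used together should be stored together":
// the seldom-used error path is split into its own function so it
// doesn't sit in the middle of the hot code.
[[noreturn]] static void fail(const char* msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    std::exit(EXIT_FAILURE);
}

// The hot helper is defined right next to the loop that calls it,
// so both tend to land near each other in code memory.
static void integrate(Particle& p, float dt) {
    p.x += p.vx * dt;
    p.y += p.vy * dt;
}

void step(std::vector<Particle>& ps, float dt) {
    if (dt <= 0.0f) fail("bad timestep");  // cold branch, out of the way
    // dt and the loop variables are locals on the stack, which the quote
    // notes is very likely to be in the level-1 cache.
    for (Particle& p : ps)
        integrate(p, dt);
}
```

Nothing clever is happening there; the layout is the optimization. Densely packed, sequentially walked data plus compact hot code is what keeps both the data cache and the code cache warm.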

Lastly, to argue that games have small bits and pieces lying around in memory, therefore making cache contention a possibility, is simply absurd. That is exactly why articles like this instruct programmers to optimize their code to prevent cache misses and cache contention. The performance of Phenom II and Core i7 in games also does not support this argument, since we only see the Phenom II X3 occasionally pull ahead of the Core i7 (possibly due to raw clock speed, as well as the way the game is optimized for multicore), while the X4s lag behind.

[HL2 benchmark chart]


Seriously, I don't even know why this thread is still dragging on.
 
You posted HL2 results; I played that on a single core with all the settings maxed out on a GeForce 7300 LE. Why not post FC2, Crysis, or HAWX, a game that was made in the era of the quad cores?
 
Like I said, until we see the newer cards, W7's implementation of its better usage of MT, and whether any of that, or even more, changes anything, we can only guess, because as it stands now we simply don't know.
One thing this and other threads do, tho, is get my interest piqued, and for that I'm grateful, all fanboyism and preferred components aside.
The answers are there, tho I don't think this is anything more than a minor contributor, if anything at all. To me, this is all very compelling, as it's the future: the meshing of CPU/GPU, how it's going to be done, and the differing approaches.
 
EP2 is the one where the robot dog thing jumps on the thing and makes it crash, right?
If that one isn't EP2, then ignore my post. It was played at 800x600, all in-game settings maxed, 4xAA/16xAF, graphics card OC'd to 500 core / 1000 mem.
 


JDJ, to sighQ2, anyone who is not pro-AMD or who likes their Intel chips is biased.

And sighQ2, shove it. I didn't say a damn thing about my Q6600. And if you notice, I said a Q6600 or Phenom II is good for a single GPU. Keep acting like a jerk.

BTW, I love my Q6600 and there isn't a damn thing you can do about it. Don't like that? Too bad. It's my CPU, and thus far it kicks ass. Gotta love freedom of speech.



It's randomness, not much more than that.



I was gonna say. I can see how the Phenom II can beat the Core i7 there, because the 4870s are normally 700MHz stock while a 4890 is normally 900MHz stock. And even when OCing, the stock cooler normally only lets the 4870 hit 800-825MHz, while the 4890 normally peaks at 1050-1100MHz (if you are lucky), and at higher resolutions that matters much more than CPU speed in most games.



48-way is more than 16-way. Yes. Very good. Problem is that Intel's L3 is faster, and for some weird reason in most high-res game benchmarks it's neck and neck with the Phenom II. It could be that L3 is not exactly a performance guarantee in games; if it were, a Core 2 Quad, which has no L3 at all, would not be able to keep up with even the lowest-clocked Phenom II..... and yet in a lot of games it can. Huh....



Double ewwwwwwww.................
 


Games are one or two steps behind the most modern hardware. So this advantage (48-way vs 16-way) may be hard to spot now.
L1 and L2 cache on Phenom II are twice as big as L1 and L2 cache on i7; the i7 pays a price for its inclusive cache design.

L3 cache on Phenom II is MUCH faster compared to memory on i7 😉
L3 cache on Phenom II and i7 don't differ much in performance
 

I'm sure it will be hard to spot when it matters too. I doubt we'll be debating K10 and Nehalem when games "catch up" to current hardware.
 
L3 cache on Phenom II is much faster than memory, but that's an irrelevant comparison unless you know the cache miss rate on both, and whether for some reason the i7 misses far more often than the PhII. I suspect there isn't much of a difference, honestly.

Oh, and while we're comparing: L3 cache on PhII is a bit slower than L3 cache on i7, and memory on i7 is MUCH faster compared to memory on PhII (even with AM3 DDR3).
 

Bandwidth is good on the i7, but that has very little impact on desktops. Latency is more important, and the i7 has a small advantage there, but they are almost even.
 

I don't think so, because they have solved the threading (synchronization) problem with the L3 cache, making it into the new era of multicore processors. This type of design will exist for many years.
 

All the benchmarks I've seen show a significantly lower latency for i7 compared to PhII. One test I found put the Phenom II at about 44ns (as measured by Everest) and the i7 at around 32ns. That's not a small advantage - that's huge.