Phenom II and i7 <=> cache design for gaming

kassler

May 25, 2008
Phenom II's L3 cache is: 48-way set-associative
i7's L3 cache is: 16-way set-associative
C2Q: no L3 cache

http://www.agner.org/optimize/optimizing_cpp.pdf
http://www.agner.org/optimize/#manuals

8.2 Cache organization
It is useful to know how a cache is organized if you are making programs that have big data
structures with non-sequential access and you want to prevent cache contention. You may
skip this section if you are satisfied with more heuristic guidelines.

Most caches are organized into lines and sets. Let me explain this with an example. My
example is a cache of 8 kb size with a line size of 64 bytes. Each line covers 64 consecutive
bytes of memory. One kilobyte is 1024 bytes, so we can calculate that the number of lines is
8*1024/64 = 128. These lines are organized as 32 sets × 4 ways. This means that a
particular memory address cannot be loaded into an arbitrary cache line. Only one of the 32
sets can be used, but any of the 4 lines in the set can be used. We can calculate which set
of cache lines to use for a particular memory address by the formula: (set) = (memory
address) / (line size) % (number of sets). Here, / means integer division with truncation, and %
means modulo. For example, if we want to read from memory address a = 10000, then we
have (set) = (10000 / 64) % 32 = 28. This means that a must be read into one of the four
cache lines in set number 28. The calculation becomes easier if we use hexadecimal
numbers because all the numbers are powers of 2. Using hexadecimal numbers, we have a
= 0x2710 and (set) = (0x2710 / 0x40) % 0x20 = 0x1C. Reading or writing a variable from
address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address
0x2700 to 0x273F into one of the four cache lines from set 0x1C. If the program afterwards
reads or writes to any other address in this range then the value is already in the cache so
we don't have to wait for another memory access.
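
The set calculation in the quoted paragraph can be checked with a few lines of Python. This is only a minimal sketch of the example cache from the text (8 kB, 64-byte lines, 32 sets × 4 ways); the names `LINE_SIZE`, `NUM_SETS`, `cache_set` and `line_range` are made up for illustration and do not come from the manual.

```python
# Example cache from the quoted text: 8 kB, 64-byte lines, 32 sets x 4 ways.
LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 32    # number of sets

def cache_set(address):
    """Set index: (address / line size) % (number of sets), integer division."""
    return (address // LINE_SIZE) % NUM_SETS

def line_range(address):
    """First and last byte address of the cache line that holds `address`."""
    start = address - (address % LINE_SIZE)
    return start, start + LINE_SIZE - 1

a = 0x2710  # 10000 decimal
print(cache_set(a))                     # 28, i.e. 0x1C
print([hex(x) for x in line_range(a)])  # ['0x2700', '0x273f']
```

Real CPUs may hash addresses before picking a set, so treat this as a model of the textbook formula only.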

Assume that a program reads from address 0x2710 and later reads from addresses
0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C.
There are only four cache lines in each set. If the cache always chooses the least recently
used cache line then the line that covered the address range from 0x2700 to 0x273F will be
evicted when we read from 0x4700. Reading again from address 0x2710 will cause a cache
miss. But if the program had read from different addresses with different set values then the
line containing the address range from 0x2700 to 0x273F would still be in the cache. The
problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call
this distance the critical stride. Variables whose distance in memory is a multiple of the
critical stride will contend for the same cache lines. The critical stride can be calculated as
(critical stride) = (number of sets) × (line size) = (total cache size) / (number of ways).
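
The eviction example above can be replayed with a small simulation. This is a sketch under the same assumptions as the quoted text (32 sets × 4 ways, 64-byte lines, strict least-recently-used replacement); `simulate`, `contended` and `spread` are hypothetical names, and real caches often use only an approximation of LRU.

```python
from collections import OrderedDict

LINE_SIZE = 64
NUM_SETS = 32
NUM_WAYS = 4

# Critical stride = (number of sets) x (line size) = (cache size) / (ways).
CRITICAL_STRIDE = NUM_SETS * LINE_SIZE

def simulate(addresses):
    """Return 'hit' or 'miss' for each access, assuming strict LRU per set."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    results = []
    for addr in addresses:
        line = addr // LINE_SIZE
        ways = sets[line % NUM_SETS]
        if line in ways:
            ways.move_to_end(line)        # mark as most recently used
            results.append("hit")
        else:
            if len(ways) == NUM_WAYS:
                ways.popitem(last=False)  # evict the least recently used line
            ways[line] = True
            results.append("miss")
    return results

# Addresses from the text: all map to set 0x1C, spaced 0x800 apart.
contended = [0x2710, 0x2F00, 0x3700, 0x3F00, 0x4700, 0x2710]
# Same number of accesses, but the middle three land in other sets.
spread = [0x2710, 0x2F40, 0x3780, 0x3FC0, 0x4700, 0x2710]

print(hex(CRITICAL_STRIDE))  # 0x800
print(simulate(contended))   # the final read of 0x2710 misses
print(simulate(spread))      # the final read of 0x2710 hits
```

The only difference between the two traces is the spacing of the addresses, which is the point of the critical stride.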

If a program contains many variables and objects that are scattered around in memory then
there is a risk that several variables happen to be spaced by a multiple of the critical stride
and cause contentions in the data cache. The same can happen in the code cache if there
are many functions scattered around in program memory. If several functions that are used
in the same part of the program happen to be spaced by a multiple of the critical stride then
this can cause contentions in the code cache. The subsequent sections describe various
ways to avoid these problems.

When you play a game and there is a lot of action, a lot is going on at once:
physics, AI, many vertices to draw for the picture, and more.
Much data and many functions are used to calculate the picture.

Data and functions are scattered in this scenario.

You need a cache that doesn't evict data or code and then have to go back to memory.
 
I think some people here want to know why the Phenom is sometimes better at high resolutions.

Intel is fast when the data and code aren't that complicated (applications that don't use a lot of functions and data). At low resolutions Intel gains a lot of fps because it is fast where games are performing simpler actions.
If you increase the resolution, the graphics card will hold back the CPU, and Intel can't use its advantage in the simpler areas because it has to wait on the GPU.

When the game is performing more complicated actions, the GPU may need to wait on the CPU. This is where the Phenom is good, because it doesn't evict data from its cache as soon as the i7 does. The i7 needs to go to memory sooner than the Phenom.
 
You think the Phenom is better on high res?

Let's test that theory:

Crysis:
image014.png

Looks pretty much like a tie to me...

Far Cry 2:
image016.png

Well, it's not a tie. Of course, the win isn't going the way you predicted either...

Left 4 Dead:
image018.png

Dead heat at 2560x1600, the i7 has a slight lead at 1920x1200 (classic sign of GPU bottleneck, especially since the i7 has a huge lead at both resolutions when AA is off)

Call of Duty World at War:
image020.png

Once again, the i7 has the lead at high resolutions



Hmm...
Seems you're wrong. The i7 leads at all times except when the game is well and truly GPU bottlenecked, in which case it's pretty much a tie (as expected). Add more GPU power and the i7 should pull out ahead again (which is shown in many reviews of high end CF/SLI setups).
 


Did you understand the text about cache organization?
 


Thanks for your post showing that the i7 needs to run at 3.8 GHz to match the speed of a 3.64 GHz Phenom II.

This contradicts what is commonly believed.
 
Crysis: a complicated game. It doesn't need that much CPU power but probably accesses a lot of memory because the graphics are so detailed. In this game you could probably start to see the Phenom perform better in the heaviest sections. The scenario would be that at very low resolutions the i7 is faster. Increasing the resolution narrows the gap, and when you get close to a 100% GPU bottleneck the Phenom passes the i7. Increase the resolution even more and the scores even out because the game is 100% bottlenecked by the GPU.

Far Cry 2: If I remember right, this game doesn't seem to be that scattered. Both the i7 and the C2Q seem to run it well. I don't know how much CPU power the game needs at high resolutions, how much memory is used in complicated situations, or whether it can handle a lot of enemies with advanced AI and physics.

Left 4 Dead: I don't know this game.

Call of Duty: not that heavy a game for the computer to run. If I am right, it is almost single-threaded?

Today you can run almost all games on a dual-core processor, and they fit within 2 GB of RAM, I think. They all have one main render thread because DX9 and DX10 don't work well otherwise.

This will change with DX11. Games will probably add much more physics and AI, and DX11 is also able to take advantage of more cores for rendering.
Maybe you will see quad-core CPUs running at 80-90% at high resolutions, calculating huge amounts of data to render the picture. Finding data in the cache will become more and more important for speed. This is why I think the Phenom II has a better cache design for advanced games.

For applications that may render or do some other long task the situation is different. Even if the task is rather complicated there is probably not that much code involved. Also data can be optimized for maximum performance (often data should be processed in long trains in order to get maximum speed). The cache doesn't need to be that advanced for this type of scenario.
 



Oh Assler you are soo full of life and... :heink:

Word, Playa.
 

Could it be that you are disappointed when someone explains how the CPU works and you come to understand that the Phenom is a good gaming CPU, if you have spent a lot of money on Intel?
 
The i7 is the better choice for gamers. What upsets most of us AMD guys is how that amount is blown out of proportion; an average 5-10% increase in fps is negligible at best. I also hate when AMD guys go to any means necessary to support their team, including making up benchmarks that favor our team. AMD is a very good company and, well, I can't say Intel is a good company, but their products are good and the i7 is better. Live with it. And I'm not talking about clock for clock; I mean the 920 vs. the 965 at stock.
 



Left 4 Dead is a heavily multithreaded game. It can and will use 80% of all 4 cores on a quad-core machine. Source (the engine it is based on) is very CPU dependent.

Crysis sucks either way. CryEngine is just horribly optimized.

FC2 is a decent game (better than Crysis but still meh) and does benefit from more cores.

CoD 4+ does benefit from more cores as well but not as much as say FC2 or L4D.

As for physics, you must remember that most games use Havok, which I prefer, and not PhysX. Intel owns Havok. Which will mean they can easily optimize it for Intel CPUs.

just something to think about.



Graphics, that's a no-brainer. But the problem is that if you buy a crappy CPU (Pentium DC/Athlon X2 3800+) and pair it with a high-end GPU, then the CPU will severely limit the GPU's performance, since at the low end of the FPS spectrum the game relies on the CPU.

What you need is to sort of balance it. If you do a high end CF/SLI setup then a Core i7 will benefit you. If you do a single GPU, say a HD4800 or G200+ then even a C2Q Q6600 or decent Phenom II (I don't include Phenom I due to horrible clocks and performance) will be fine. But add more cores and a faster CPU that can push more raw data will be beneficial.

And as for having to access memory, it won't really matter, since both the Core i7 and the Phenom II have super fast IMCs that can access memory almost as fast as cache. So it won't kill performance like with an older FSB setup.
 


You really don't get that the benchmarks that are identical are GPU bottlenecked, do you?
1892.gif


They perform equally because they are at a point where CPU performance is completely irrelevant to system performance (as long as it is fast enough to saturate the GPU(s)). Note that in no case does the Phenom noticeably outperform the i7, which is what would be expected if the system were still CPU bottlenecked but the Phenom were somehow better. In all cases where the CPU has any significant impact on performance, the i7 doesn't just win, it absolutely flattens the Phenom in every way.

Now, in most cases, there isn't enough GPU power for the i7 to truly distance itself, and for most GPU setups, the Phenom performs just fine. It's a great choice for the price, and is certainly not a bad CPU. However, if absolute maximum performance is your goal, and you can afford the extra cash, there is no situation in which a Phenom II will overtake an equivalently clocked i7. The i7 has a faster memory controller, faster single threaded computation, faster multithreaded computation, larger cache, and is superior in every way.

Don't take this as an unconditional ad for an i7 - I have recommended Phenom IIs to friends before, and I will continue to praise their value and their ability to bring AMD back into at least some level of competition with Intel. They really are excellent CPUs. However, they cannot claim the absolute performance crown, no matter what the AMD fanboys would like to think.
 

Do you understand that it varies during gameplay?
There are a lot of factors that decide if the burden is mostly on the CPU or the GPU for each frame.
 

Agreed. And all of the benchmarks show that when the burden is on the CPU, the i7 pulls ahead of the phenom II, without fail.
 

Can you explain why phenom wins in some games just before the game is 100% bottlenecked by the gpu?
Did you understand the text about cache design? If you did understand that text, can't you draw any conclusions from it?
 
Conclusion for this thread:

OP posted an article that he didn't fully understand, and simply waved the "article" around as if it supported his biased opinion.
 

Which benchmarks are you talking about? All the ones I see either show a total GPU bottleneck (a dead heat), or the i7 pulling ahead. None seem to show the Phenom II pulling ahead (and don't point to 1 or 2 percent, since that's well within normal variation).