Phenom vs. Athlon Core Scaling Compared



MORE cache is not always the answer. Having more cache will benefit many benchmarks. It is also possible that more cache could be detrimental to multi-tasking.
 


Well, it's like a guy doin a fat girl, or a hot girl doin a fat guy. Every once in a while you gotta take one for the team. But then again, considering I'm upgrading from an X2 4200+ I'm gonna get a performance boost whether I oc or not.
 


That's one way to put it. On another thread ages ago I said that anyone with a 4600 X2 or better shouldn't bother with Phenom yet, which I still stand by. That means you fall into the category below that, where you should see a decent performance gain. I'm on socket 939 with a 4800 X2, so I would need a new mobo, RAM and CPU; if I were going to take the plunge and upgrade, Phenom just wouldn't make sense for me.
 


More cache is NEVER detrimental to anything. It's always a good thing.
 


It depends on the microcode. If the cache is doing read-aheads... and a LOT of task-switching is going on... IF the read-ahead has to complete before task switching can be done... then it COULD BE detrimental to a heavily multi-tasked system to have a larger cache.

(And it would be in a multi-tasking situation where this would show up.)

Now if the cache is coded to allow the task switching without having to wait for any read-ahead to complete... then it would be a moot point.

Of course I'm not a hardware guy... I'm basing this on software caching in databases. I wouldn't be surprised if it was different. However I wouldn't be surprised if it was the same and there is a detrimental effect of a larger cache on task switching.

Unless someone can definitively convince me otherwise... I have to go with what I know to be true: Larger caches are NOT always better. Usually... but NOT always.

BTW: I looked it up. Intel calls the read-ahead cache "prefetchers". And it appears that their "memory disambiguation" would determine how much time might be wasted during heavy task switching.
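
Purely to make that hypothesis concrete, here is a throwaway back-of-the-envelope model in Python (every number is invented, and this is a software analogy, not how any real CPU is documented to behave): if a task switch had to wait for an in-flight read-ahead to drain, and a larger cache meant a deeper read-ahead, the time lost would scale with both the switch rate and the read-ahead depth.

```python
# Throwaway model of the hypothesis above: assume a task switch must wait for
# any in-flight read-ahead to finish, and assume (purely for illustration)
# that a bigger cache means a deeper read-ahead. All numbers are invented.

def time_lost_per_second(switches_per_sec, readahead_lines, ns_per_line, in_flight_fraction=0.5):
    """Fraction of each second spent waiting on read-aheads at task switches."""
    avg_wait_ns = readahead_lines * ns_per_line * in_flight_fraction
    return switches_per_sec * avg_wait_ns / 1e9

for lines in (8, 32, 128):   # pretend: bigger cache -> deeper read-ahead
    lost = time_lost_per_second(switches_per_sec=50_000, readahead_lines=lines, ns_per_line=10)
    print(f"{lines:3d}-line read-ahead: {lost * 100:.2f}% of CPU time lost")
```

Under those made-up numbers the waste grows from a fraction of a percent to a few percent as the read-ahead deepens, which is all the original point needed: not that bigger caches are bad, just that they are not automatically free.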
 
Perhaps there were additional line items added to those "severe" NDA's that AMD had people sign?

Maybe that's why we're not seeing the types of reviews that are necessary to determine what is castrating the Starr.

I'm with most members in that it seems to be relegated to a starved architecture that needs higher bins to realize its potential. Let's up the HT and add some more L3, dammit.
 




Maybe Intel's prefetching is just that much superior to AMD's solution?
 
I suppose it's an incomplete thought for people that can't read my mind 😉


Keith was referring to how cache size can help or hurt, depending on how well a processor's prefetch works.

It seems commonly accepted that Intel has a much superior cache system to AMD's. I could very well be wrong, but most of my education comes from fellow forum members and the articles I frequent.

So my presumption is that the majority of AMD's problems with Barcelona stem from it being designed to scale with multiple cores, as another poster explained.

If that is true, then the majority of its architecture would seem comparatively weak on the desktop scene due to its poor cache design. But I would suppose this is where Intel derives most of its performance lead over AMD... their cache is designed better for random execution than AMD's offering.

Therefore... Intel's prefetching design doesn't hurt them as much when moving to larger cache sizes, while it could actually hurt the Starr. In fact, their prefetching design might even help them when moving to larger cache sizes.
 
If I sound like an idiot I could certainly use some constructive criticism.

I'm trying to learn here, so maybe you can pinpoint where I'm having issues with understanding these architectures.

I'm quite certain I've used many terms out of context.
 




Is it me, or does this just seem like common sense? The K8 was designed for desktop computing, while Barcelona was an "evolution" of sorts designed to scale well with multiple cores. The author alludes to this indirectly but never states it. In the end he just makes AMD's new Barcy sound like horse shiat compared to the K8... but then reminds us that Phenom has more cores, so it probably fares better overall.

That just sounds kind of bland and simple. I expect a bit more critical thinking from these guys. I'm not a bright kid and I'm just now trying to understand core architectures.

Please tell me if I'm wrong, but it just seems as if his deduction is lacking quite a bit. I have a hard time understanding why he compared two different concepts.

I thought the Starr was developed with future programming in mind... not yesterday's single-threaded code. Why doesn't the author point out these differences? The K8 and the Starr obviously have different goals in mind. State the differences/purposes and then give us the translation into today's software environment. Right?
 
Sorry, I just thought you went sideways a bit without showing how you got there. As a note, I would just like to say that more cache might not help in all cases, but I care about the desktop environment (Windows & G A M E S... lol, sorry, that's all I'm really interested in; waiting another second for Word to load on an AMD machine vs. an Intel machine doesn't bother me, just fps). Benchmarks might be helped more by extra cache than real-world performance would be if Phenom were decked out with, say, 4MB of L3, but I believe those same benchmarks have quite a lot of bearing on general Windows and gaming performance.

As a server chip? Well, I can't comment, although it does seem that the Phenom is very scalable and does very well in a multi-multi-multi-chip environment. Fix the TLB and all looks rosy for the Phenom in the server field.

I will say, and I would like someone else's thoughts on this before I sign off, that the Phenom seems far more adept at complicated tasks than at straightforward number crunching. That is, in media encoding and file compression it's left behind by the lowest Intel quad core, but give it more complex, dynamic tasks with lots of threads and it claws back a lot of that lost ground. Hence the good 3DMark06 CPU test and Supreme Commander scores. For those that don't know, I seem to remember reading that the 3DMark06 CPU test doesn't use the CPU to render anything; rather, it runs the AI and physics of the little floating-robot battle going on in that red valley. That makes something like 7 threads in the first test and 5 in the second, hence the second runs faster (the second test is simplified somewhat, I mean).

cheers
 
It seems to me that the Barcy core excels at multi-tasking with simple and routine calculations. That seems to relate to the server environment.

The only place available for an increase in performance would be through better prefetching. But even then, it might only match Intel, and then it would be a core vs. core fight where it would lose.

I guess what I'm trying to conclude, where the author didn't, is that he should specifically state that the architecture of the Starr was designed for a different purpose than that of its K8 counterpart. State the engineering design concepts, and then tell us how they relate to our needs as consumers right now, in today's world. He gave us the latter... but didn't really give us any forethought as to design execution.

Instead he just said that in today's environment the Starr sucks, but just like the 90nm P4 it could be the precursor to something great.

I'm sorry...I'm just tired of poor reviews from THG and this one probably isn't as bad as I'm making it out to be.
 
I really think the issue is simply not the size of the cache ... it's the overall level of sophistication of the cache / prefetch system ... where Core 2 is far superior at present ... or the argument boils down to the IMC / L3 speed issue ... or a combination.

Look at the shocking latencies on Phenom once the L3 cache is accessed ... mainly due to the 1.8 GHz (a drop from the core clock) frequency of the IMC / L3 system.

Even the 2.0 GHz of the ES benchies floating around earlier showed more promise.

If AMD can increase the IMC / L3 Cache speed then single socket performance will improve markedly.
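
To put rough numbers on that (the 48-NB-cycle L3 latency and the 2.3 GHz core clock here are assumed figures for illustration, not measured Phenom numbers), the same L3 latency in northbridge cycles looks much worse from the core's point of view when the NB/IMC clock is low:

```python
# Rough arithmetic only: how a fixed L3 latency (in northbridge cycles) looks
# from the core at different NB/IMC clocks. 48 NB cycles is an assumption.

core_ghz = 2.3                      # assumed core clock for illustration
l3_latency_nb_cycles = 48           # assumed L3 latency in NB cycles

for nb_ghz in (1.8, 2.0, 2.3):
    ns = l3_latency_nb_cycles / nb_ghz
    core_cycles = ns * core_ghz
    print(f"NB @ {nb_ghz} GHz: {ns:.1f} ns ~ {core_cycles:.0f} core cycles")
```

Under those assumptions, going from a 1.8 GHz NB to an NB at core speed cuts the effective L3 latency seen by the core by roughly a quarter, which is the point being made here.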

I think it is set low because they cannot bin fast enough parts.

So lowering this means more parts rescued from the last bin ... the bin that doesn't get sold.

I see the C-stepping parts should address this deficiency ... it's similar to the increases in FSB speed we have seen in the past with other CPUs.

Perhaps AMD's original specs had the L3 cache and IMC running at (or close to) the core speed, and the early samples did blow Clovertown away??

Then the cruel reality of an immature 65nm process resulted in a rushed part of much lower spec.

I guess we can't test much of this unless someone can access some samples and increase the IMC speed without it locking up.

The IMC is in the centre of the die too ...
 
Now that actually seems to apply a bit more critical thinking than the author of the article handed over to the reader.

I can understand the premise of your argument and even go along with it as it makes much more sense. Maybe we're all stating the same thing differently but the bottom line is that the Starr arch doesn't seem to be sync'ed correctly.

Thanks for clearing that up Reynod.
 


I think you're pretty close with that. Like I said, who cares about losing 50 MHz on your RAM speed because of an odd multiplier? I doubt AMD ever originally planned to cripple their own chips.

I also have another theory, which actually might be more realistic. The IMC was always meant to have its own clock generator; it would be odd to have added that afterwards. It probably seemed like a great idea to separate the IMC: as well as changing the core clocks, they could have adjusted the IMC across a huge range of products, the low to high end being roughly 1.8-3.4 GHz. Come manufacturing, they just couldn't get the speeds they wanted, so it got slower and slower. Which means there is a problem with the design of the IMC clock generator. They have probably now discovered that it would have been easier to run the IMC at core speed. We already know the IMC can function at high speeds with the A64s. I think they were trying to be too clever for their own good.
 
"I think they were trying to be too clever for there own good."

Agreed, a little too much too soon?

I thought the R600 smacked of that as well. I liked it in the end, though; I have a 2900 Pro myself.
 
Engineering-wise, R600 was radically different and innovative. They were "trying to be too clever for their own good" and shot themselves in the foot with it, just like with Phenom. I always liked R600. I find it funny how so many people say the 2900 series is crap, yet they rave about the 3800 series being so much better.

Spoonboy, I don't know if you saw it, but I did reply to you earlier in the thread.
 


Assuming the memory dividers work the same way for AM2+ as they do for AM2, and assuming you're stuck with the DDR2 800 dividers (having 1066 dividers would make it oh so easy), you'll need the following settings to get 1066 memory running at full speed without your CPU exploding.

CPU Base Frequency = 266/267 (266 if your board is always slightly on the high side of selected speed, 267 if on the low side)
CPU Multiplier = 12
HT multiplier = 4x should work ... if you're very lucky, 5x

This gives a CPU speed of 3192-3204 MHz, with the RAM at 1064-1068.

You'll need a decent CPU cooler (half decent would probably do) and you'll need to be lucky with the motherboard.
 
Oh, and for clarification, here is how AMD's memory dividers can be worked out (if anyone actually wants to know):

M = Memory Divider
C = CPU Divider

DDR2 533: M=(C/2)+4 (Rounded UP to nearest whole number)

DDR2 667: M=(C/2)+2 (Rounded UP to nearest whole number)

DDR2 800: M=(C/2) (Rounded UP to nearest whole number)

DDR2 1066: M=(C/2)-2 (Rounded UP to nearest whole number)

I worked these out a couple of weeks ago when I was without an interweb connection... It's quite easy to make a table in Excel which will show you the final CPU + memory speeds for different multis & base frequencies.

When entering the formula for working out the memory divider in Excel, make sure you use =ROUNDUP(((C/2)+2),0); the 0 after the comma denotes the number of decimal places to round to.
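
If anyone would rather not build the Excel table, here's a rough Python equivalent of the same formulas (the function names and layout are just mine); it reproduces the 266 x 12 / DDR2-800-divider example from the post above:

```python
import math

# The divider formulas quoted above, with C = CPU multiplier.
# Offsets: DDR2-533 -> +4, DDR2-667 -> +2, DDR2-800 -> +0, DDR2-1066 -> -2.
OFFSETS = {533: 4, 667: 2, 800: 0, 1066: -2}

def mem_divider(cpu_multi, ddr2_rating):
    return math.ceil(cpu_multi / 2 + OFFSETS[ddr2_rating])   # rounded UP

def speeds(base_mhz, cpu_multi, ddr2_rating):
    cpu_mhz = base_mhz * cpu_multi
    mem_clock = cpu_mhz / mem_divider(cpu_multi, ddr2_rating)
    return cpu_mhz, mem_clock * 2        # x2 = the DDR "effective" rating

# The settings from the earlier post: 266-267 MHz base, 12x multi, DDR2-800 divider
print(speeds(266, 12, 800))   # (3192, 1064.0)
print(speeds(267, 12, 800))   # (3204, 1068.0)
```

Same numbers as quoted above: 3192-3204 MHz on the CPU with the RAM at an effective 1064-1068.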
 

This difference in speed is key here. Let's face it - the author claims to have identified the scaling limits of Phenom. He hasn't because he has failed to take this factor into account. All the current benchmarks prove is that some scalability exists even when you hold the IMC speed constant. Given that limitation I think the scalability shown in the benchmarks is pretty d*mn good.

Question for the original author: would it be possible to re-run the benchmarks at the following speeds: 1.2, 1.4, 1.6, 1.8 and 2.0 GHz, adjusting the speed of the IMC to match the processor speed at each step? If so, this would give us a true sense of the *system's* scalability.
 
You guys are absolutely right about the IMC frequency, but remember that it also has limitations - the faster the IMC runs (stably!), the greater the chance that you saturate the memory bandwidth. If you get to that point, the bottleneck becomes the hardware design...
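
Back-of-the-envelope, that ceiling is just the theoretical peak of dual-channel DDR2 (two 64-bit channels, so 8 bytes per transfer per channel):

```python
# Theoretical peak bandwidth for dual-channel DDR2: the DRAM itself caps how
# far a faster IMC/L3 clock can take you.

def peak_bw_gbs(mt_per_s, channels=2, bytes_per_transfer=8):
    """Peak bandwidth in GB/s for a given DDR2 transfer rate."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9

for rate in (667, 800, 1066):
    print(f"DDR2-{rate}: {peak_bw_gbs(rate):.1f} GB/s peak")
```

Dual-channel DDR2-800 tops out around 12.8 GB/s, so past some IMC speed the bottleneck simply moves to the memory.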

For me, the author obviously messed up entirely.

For Phenom, I think that proper software design (aware of the characteristics of NUMA) and the corrected cache latency (B3) will help more than people are saying. Larger caches from a die shrink (45nm?) or a re-design will certainly have less influence - no matter the cache size, it won't help if the core logic gives high-latency cache access or has a faulty prefetch algorithm (which induces elevated cache misses).

I think that with proper software design, Phenom B3 will give some 20% increase in performance (at least for the code I am working on).
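
As one minimal sketch of the kind of "cache/NUMA aware" software design being talked about (Linux-only, Python 3.3+, core numbering assumed): pin each worker process to one core so its working set isn't bounced between cores and, on multi-socket Opteron boxes, stays on the node whose memory it allocated.

```python
# Minimal sketch: pin each worker process to a single core with the OS
# scheduler-affinity call. Linux-only; core IDs 0-3 are assumed.

import os
from multiprocessing import Process

def worker(core_id, data_size=1_000_000):
    os.sched_setaffinity(0, {core_id})     # 0 = the calling process
    data = list(range(data_size))          # stand-in for a real working set
    total = sum(data)                      # work on it from one core only
    print(f"core {core_id}: {total}")

if __name__ == "__main__":
    procs = [Process(target=worker, args=(core,)) for core in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Whether that buys anything like 20% obviously depends on the workload; it's just the shape of the idea, not a claim about any particular code.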