AMD CPU speculation... and expert conjecture

Page 221
Status
Not open for further replies.

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010


I'm sorry, but I just have to say, either something in those Company of Heroes benchmarks is manually forged, or the information isn't given correctly. I see people at max settings, 1080p and 1200p, with 680s and STOCK i5-3570Ks, who are pulling more than that. There's no possible way that a game can have that perfect a frame rate from a boost of 2 GHz. Check your results and try again buddy.

EDIT: On the other hand, I completely agree that dual core gaming is realistically dead. My friend was using an Athlon Dual Core @ 3.0 GHz, and in 18WoS: ALH at max settings he was getting ~30 FPS with a 6750. Then he put it in a computer with a Quad Core i3-530 @ 2.9 GHz, and he was instantly getting ~80-100 frames.

Also with ArmA 2, he went from getting 30 frames on Low (No AA, ASF, or PostProcess), to ~60 Frames on High (No AA, but ASF @ Normal, and PostProcess @ Low)
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860
those aren't my results ... buddy.

aside from that, read the s/a article again.

As you can plainly see the performance gap varies from marginal to massive, but in all cases real-world game-play produces higher frame rates than the built-in benchmark.

semiaccurate.com/2013/07/15/company-of-heroes-2-relic-performance/
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I will try to simplify what I tried to say: with some HSA enabled software Kaveri at stock clocks will be faster than an OC 3930k
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010


Hmmmm. I see. Faster at what?
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780


At tasks accelerated with OpenCL.
Why compare an APU+HSA with a CPU which doesn't even have an iGPU?
 


The true beauty of the 6800K is not in how much or how little more it games in synthetics than the 5800K, but in the technical victory it has over the 5800K. We received test ES 6800s, which have since been taken back. To run a Trinity 5800K at 4.1GHz you needed a hefty bump in vCore to keep it stable, normally around 1.425v; Richland does the same while stably undervolted to 1.368v. The 5800K at 5GHz needs around 1.545v to run stable, and Richland needs around 1.465v to maintain stability. All of that correlates to heat and efficiency. The next big thing is that, while the iGPU may not look like much of a gain, it is still night and day ahead of anything thrown at it; more impressive is its ability to beat an HD6670 in certain tessellated synthetics and gaming environments.

That all said, at its asking price it's hard to buy an APU at the cost of an 8-core FX processor.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


It does lol

* Store to load optimization... I hope it's to overcome their constraints and better allow data speculation

* Dispatching 2 stores and 8 INT issues means there are 2 dispatch engines, one per core; 2 stores is very difficult to do with only one dispatch engine

* Tracking dependencies of stack operations means there is, first, an effort to avoid dependency-checking constraints, so it could be used at "decode" to avoid blocking decode on a complex operation (as I said earlier)... perhaps "alias" and "address prediction for memory operations" are not far behind, meaning memory operations are going above and beyond Out-of-Order: they are going speculative, like branches.

* Last but not least, since Jaguar already uses a fetch loop buffer for instructions, which in a vertical multithreading scheme is even more beneficial, I wonder if they will implement something like a "decoded" stream loop detection buffer (Intel has one)... that is, Jaguar has a loop buffer before decode; it would be nice if SR implemented one after decode... this could really alleviate decode pressure.

So it means for sure there are 2 dispatch engines... the decode implementation remains elusive. To me it's the same 4 decode pipes, only arranged as 2 fully shared SMT-style pipes for simple decode, plus 1 dedicated complex engine per thread.

And other improvements all over the board... nice...
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


There are 3 basic ways of multithreading

2 threads in the same engine

Interleaved Multi-Threading: requires good control and very fast "context switches"... I think it was first used by the Cray vector processors... it means the pipeline starts with one thread, let's say A, and on the next clock cycle it changes to thread B. Usually it means the whole pipeline is synchronous; the context tracking happens only at the beginning, and after that it is pretty much SMT.

This is A->B->A->B->A... but in the Cray example it's only at fetch; after fetch, FIFO logic pretty much dominates... first in, first served.

AMD is different because it separates the pipeline into different "thread control" domains... so it's not exactly A->B->A->B etc... different "domains" down the pipeline can have different forms of control... it's asynchronous... and this is where the next form enters:

Block Multi-Threading: it's the possibility of changing the granularity of, for example, a 2-thread engine. Instead of 1 instruction from one thread and then, next cycle, 1 instruction from the other thread, it can take 2 or 3 or more from one thread, then jump to the other thread for another block of instructions (which can be 2, 3, 4 or 5), then jump back to the first thread for another "BLOCK" of instructions, and so on... understand?
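To make the granularity idea concrete, here is a minimal sketch of a fetch schedule for a two-thread engine. It is only an illustration of fixed-size blocks: the function names `fetch_schedule` and `as_blocks` are made up for this example, and a real front end would switch on events rather than on a fixed count.

```python
def fetch_schedule(threads, block_size, total_slots):
    """Return which thread owns each fetch slot.

    block_size=1 is fine-grained interleaving (A->B->A->B);
    block_size=2 is the AA->BB->AA->BB pattern mentioned in the thread.
    The fixed block size is a simplification made for this sketch.
    """
    if block_size < 1 or not threads:
        raise ValueError("need at least one thread and block_size >= 1")
    return [threads[(slot // block_size) % len(threads)]
            for slot in range(total_slots)]


def as_blocks(schedule, block_size):
    """Format a schedule as arrow-separated blocks, e.g. 'AA->BB'."""
    blocks = ["".join(schedule[i:i + block_size])
              for i in range(0, len(schedule), block_size)]
    return "->".join(blocks)


if __name__ == "__main__":
    # Compare granularities of 1, 2 and 3 instructions per block.
    for size in (1, 2, 3):
        print(size, as_blocks(fetch_schedule(["A", "B"], size, 12), size))
```

Changing `block_size` is the only difference between the interleaved and the block form in this model, which is the point being made about granularity.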

This permits much finer control; usually it permits changing threads on an "event", meaning that if a priority is established it can change threads, and if one thread becomes a resource hog it can be blocked from grabbing more resources... this form of multithreading with "blocks" of instructions is usually also called "switch-on-event multi-threading" (it is used by SPARC).

Again AMD is different because it uses several different domains of control, each of which can have different "events" associated... thus permitting better control over the behavior of several threads down the same pipeline.

Simultaneous Multi-Threading (SMT), aka Hyperthreading:
it's the simpler and more widely used form of multithreading down a single pipeline. It derives from wide-issue RISC uarches and fits better with wider, fatter, resource-rich designs. It has "loose" control, pretty much FIFO logic, which is pretty simple... only Intel uses it in a CISC arch (like x86)... which is not the wisest choice given its complexity: some instructions are usually faster and others slower, and if it is first-in then it must be first-out, so if one instruction stalls, pretty much all pipeline work stalls (Nehalem, and the P4 earlier, were even worse and suffered heavily from this)...

The AA->BB->AA->BB pattern means the "granularity" of the "block" of instructions in AMD's scheme is 2 instructions from the same thread, then a jump to the other thread for 2 more, and so on. But this is only the "basic" granularity, most likely at "fetch", because, as stated and shown in all the diagrams, not all of the pipeline is VMT, and different domains can have different control "events" and so change granularity wisely to avoid bubbles (the A_ or B_).

So AMD's, AFAIK, is the most advanced form of multithreading down a single pipeline; it masters the "art" of the very fast "context switch" so well that it even implements several domains of control down a pipeline... which is called "vertical multi-threading" (VMT).

And the beauty of "fast context switches" is that they can be used in a horizontal way, that is, to share an "execution core" (integer, FPU or even SIMD) as well as or better than to share a fetch or a decode engine, meaning AMD's "module" implementation is meant to grow in number of threads over time; it was designed with that intention (strong suspicion). That "die shot" (image) that was presented here may not be fake; to me it clearly shows a 4-thread module... and the "fast context switch", in the "open way" it is used in VMT, could be used and enlarged for horizontal multithreading (several threads in advanced ways, be it "speculative multithreading" (SpMT) or "dataflow approaches"... both of which are the things that can best complement HTM, Hardware Transactional Memory, and intelligent versioning caches).

*IF* ALL is as good as it seems, don't be surprised to see 8 threads per module in the future, meaning that, compared with BD/PD which has 8 threads/cores (2 per module), the new chip (which could even be smaller on smaller FAB processes) could have 32 "good" threads on a single piece of silicon... and 5 modules is not to be discarded, so 40 threads in a single piece of silicon below 300mm² LOL... (an MCM could have 80)

THIS IS REAL **HIGH PERFORMANCE**... no intent to be polemic or pick on Intel fanboys... among current Intel designs I can't see one that can compete with this. Of course it needs software "compiled" to take advantage of that many threads, but I think that is where the HSA compiler toolchain, which is the principal work of the HSA Foundation so far, can come in!...

Intel simply can't compete with its current designs (they need to invent something much better, fast); this is even better for making a many-core with real CPU cores and "control" (which MIC doesn't have; it isn't a CPU), and HUGE FlexFPUs, meaning large-vector AVX-style crunching much better than Intel's Xeon Phi MIC (perhaps that is why they "stole" Gustafson from Intel lol)...

Of course none of this shows in the "popular new football game" of computer aficionados (rigged to the bone, as if the federation league headquarters were at Old Trafford, Manchester lol) called the benchmark review. NEW SOFTWARE is needed, BETTER SOFTWARE is needed... nothing that scares the HPC (**HIGH Performance Computing**) community... but boy! the single-thread approach of the current x86 client/desktop PC world is so awkwardly, stupidly OBSOLETE that there is no point in picking a decent argument... oh! well!...
 
Nope. It is not well-threaded, because the above benchmarks show how an i5 (4 threads) gives more FPS than a six-core i7 (12 threads). In a well-threaded game the 12-thread i7 would be faster than any other chip there, including the (8-thread) FX-8350.

NO IT WOULDN'T. That's the point you're missing.

Look, I really can't simplify it any more than this (and I'm going to oversimplify, assuming no OS overhead, zero scheduling time, and perfectly parallel threads):

You have two CPUs: one is a dual core chip, and the other is a single core chip that is twice as fast. You have two parallel threads you want to run, each of which takes up exactly 100% load on one core of the dual core chip (50% of the core on the single core chip).

In this situation, both CPUs would perform EXACTLY the same. Why? Because both would finish their work in the same exact amount of time, despite the presence of two totally independent parallel threads being run on the dual core CPU. In this case, the increased IPC+Clock of the single core CPU is good enough to let it finish in the same amount of time.

Take this situation instead: You have a dual core chip, and a single core chip that's 1.5x as fast. You give both chips the same two threads, each of which takes up 100% of a core on the dual core chip, and ~67% of the core on the single core chip.

Guess what? In this case, the dual core chip will be about 33% faster, because the single core chip is bottlenecked (~133% CPU load).

My point is, same two parallel threads, the only difference is the relative strength of the single core chip in relation to the dual core chip. If the chip with fewer cores is fast enough to get its work done and avoid a CPU bottleneck, then it will outperform any other chip with lower IPC+Clock, regardless of how many cores/threads are in use.
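The two scenarios can be checked with a small idealized model. This is a sketch under the same simplifications stated in the post (no OS overhead, zero scheduling cost, perfectly parallel threads); the function names and the unit of "work" are made up for illustration. The second case uses a single core 1.5x as fast, which is the speed at which each thread occupies about 67% of that core.

```python
def finish_time(n_threads, work_per_thread, n_cores, core_speed):
    """Idealized time to finish n_threads equal, independent threads.

    A single thread cannot run faster than one core, and all threads
    together cannot exceed the combined throughput of every core.
    """
    if n_threads < 1 or n_cores < 1 or core_speed <= 0:
        raise ValueError("counts must be >= 1 and core_speed > 0")
    one_thread_bound = work_per_thread / core_speed
    throughput_bound = n_threads * work_per_thread / (n_cores * core_speed)
    return max(one_thread_bound, throughput_bound)


def load_percent(n_threads, work_per_thread, n_cores, core_speed):
    """Demand as a percentage of one time unit of total CPU capacity."""
    return 100.0 * n_threads * work_per_thread / (n_cores * core_speed)


if __name__ == "__main__":
    # Baseline: dual core, each thread needs one unit of work.
    dual = finish_time(2, 1.0, 2, 1.0)
    # Scenario 1: single core that is twice as fast.
    single_2x = finish_time(2, 1.0, 1, 2.0)
    # Scenario 2: single core that is 1.5x as fast (overloaded).
    single_1_5x = finish_time(2, 1.0, 1, 1.5)
    print("dual core:", dual)
    print("single core, 2x:", single_2x)
    print("single core, 1.5x:", single_1_5x,
          "load %:", load_percent(2, 1.0, 1, 1.5))
```

In this model the core count only matters once the faster chip's load goes past 100%; below that point both chips finish together.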

Oh, look! The i3, which is 100 MHz faster than the i5-3470 suffers a loss of 24 FPS in the title you are citing as an example that more cores equals the same performance as less cores. I think you're being a bit obtuse, don't you?

i3 is simply overloaded. Can happen with two heavy threads, four heavy threads, 100+ light threads, etc. The number of threads itself isn't the bottleneck, but the amount of work they do.

Additionally, by your same logic, the P2X2 560 @ 3.3 GHz should be running right alongside the P2X6 1100T @ 3.3 GHz, yet it's clearly 26 FPS lower.

Same situation.

So, now that we have a baseline...and have established that the engine's multithreading capability vastly determines the effect of more cores on greater performance...go ahead and tell me how it is that software cannot be more multithreaded in better ways and take advantage of newer hardware's capability when it's programmed more effectively.

I already did. If the CPU is getting its work done before the GPU finishes the current frame, then performance is dominated by how fast that work gets done. By extension, if a dual core chip gets that work done faster than a 16 core chip, the dual core would have the greater performance.

In Crysis, it looks like most quads are sufficient to get the work done before the GPU renders its current frame, so individual core performance determines overall performance.
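A rough way to picture the CPU-versus-GPU argument is a two-stage pipeline where the slower stage sets the frame rate. This is a sketch only: it assumes the CPU prepares the next frame while the GPU renders the current one, and the millisecond figures are hypothetical, not taken from any benchmark in this thread.

```python
def frame_rate(cpu_ms, gpu_ms):
    """FPS of an idealized pipeline where CPU and GPU overlap fully.

    The slower of the two stages determines the time per frame.
    """
    if cpu_ms <= 0 or gpu_ms <= 0:
        raise ValueError("stage times must be positive")
    return 1000.0 / max(cpu_ms, gpu_ms)


def limiting_stage(cpu_ms, gpu_ms):
    """Name the stage that currently caps the frame rate."""
    return "CPU" if cpu_ms > gpu_ms else "GPU"


if __name__ == "__main__":
    # Hypothetical per-frame costs in milliseconds.
    for cpu_ms, gpu_ms in [(25.0, 20.0), (15.0, 20.0), (10.0, 20.0)]:
        print(cpu_ms, gpu_ms,
              frame_rate(cpu_ms, gpu_ms), limiting_stage(cpu_ms, gpu_ms))
```

Once the CPU time drops below the GPU time, making the CPU faster (by cores or by per-core speed) no longer changes the result in this simple model; real games overlap less cleanly.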

...though the more cored solutions are running nearly twice the FPS.

Before 4 cores, which appears to be the point where the CPU can get enough work done. After that, core performance seems to be the driver of performance. An easy way to confirm this would be to normalize clock speed to 3.0 GHz and benchmark, in which case each generation should perform identically.

2.5 ghz 4770 is faster than a 3.2 ghz 3930k so the theory of clock speed fails.

Uhhh...chart has the 4770k @ 3.5, and the 3960k @ 3.2. Hence the extra 3 FPS.

sb-e has a higher latency than the 1155 cpus. explains its place below IB.

this game warrants some memory speed testing to see how it reacts.

well threaded or just latency dependent?

~3ns of extra latency isn't going to have much of an effect when one frame is generated every ~16.7ms (60 FPS). System memory bottlenecks in games are downright rare, especially on systems with more than 4GB of RAM installed. Now, over the PCI-E bus, you'd get much worse latencies, but for a single GPU, that limitation won't be coming into effect.
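The scale difference between nanoseconds and a frame budget can be worked out directly. The sketch below assumes a 60 FPS target and treats every counted memory access as fully stalled for the extra 3 ns, which is a worst-case simplification; how many accesses per frame really stall on memory is not given anywhere in the thread, so the access counts are placeholders.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def added_ms(extra_latency_ns, stalled_accesses):
    """Total time added if each stalled access waits extra_latency_ns more."""
    return extra_latency_ns * stalled_accesses / 1_000_000.0


if __name__ == "__main__":
    budget = frame_budget_ms(60)
    # Hypothetical counts of fully exposed memory stalls per frame.
    for accesses in (1, 10_000, 1_000_000):
        cost = added_ms(3, accesses)
        print(accesses, "stalls:", cost, "ms,",
              round(100.0 * cost / budget, 3), "% of the frame budget")
```

A single 3 ns penalty is invisible against a ~16.7 ms budget; it would take a very large number of fully exposed stalls per frame before the latency difference shows up in FPS.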


We're essentially back to 2009, back when the faster clocked E8600 outperformed the slower Q9650. Only with the discussion being between Quads and Octo's instead of Duos and Quads. Once the CPU bottleneck is overcome, per-core performance dominates benchmarks.
 

8350rocks

Distinguished

So, now that we've moved the goalposts...

You still haven't explained how it is that the 8350 with 8 cores exceeds the 3470/3570 performance in crysis 3, even though Intel has "dramatically superior" IPC.

[Image: CPU_03.png]

[Image: qSNrpeA.png]


Based on your logic, a quad core with superior IPC should be ahead, and in games that are more "single core IPC" driven the 3570k comes out ahead. But not here...why? Because even a quad core on Crysis 3 is not enough to overcome being bottlenecked. We notice this in the FPS for the FX 4300 vs. FX 6300 vs. FX 8350. While the FX 4300 can run the game decently...it doesn't have the resources to do it as well as the FX 6300 or any of the 8 cores. By your logic, the faster clocked 4300 should be ahead of the 6300 with a lower clock speed and close to the 8350 which is clocked only slightly higher. Which means that your assumption that any quad core is enough to escape a bottleneck is also short of reality. While the HTT Intel CPUs perform better...interestingly enough, the 3930k in Crysis 3 outperforms them by a pretty good margin.

The 3930k seems to run about 85 FPS on Crysis 3 in benchmarks not called Welcome to the Jungle (65 FPS in Welcome to the Jungle is pretty much max for any of the top end Intel CPUs), which means that realistically, 6 cores are enough to escape a CPU bottleneck in Crysis 3, because otherwise, it would perform similarly to the 3770k in the mid 60's across the board. The 3770k doesn't really have a huge advantage over the 3570k because while HTT is an interesting parlor trick, when you actually need more cores, it comes up short...

http://www.overclock.net/t/1364211/pclab-more-crysis-3-cpu-benchmarks/210

In there is a 3930k @ 4.6 GHz running max FPS of 120 in Crysis 3 at over 80% utilization of the CPU.

So it seems to me that your theory is still not holding water. Perhaps for your poorly coded RTS engine it might, but for Crysis 3, it will eat as many cores as you can throw at it...at least 8 for sure, as the FX 8350 has 75% utilization average across 8 cores based on previously posted information. Additionally, the 2600k Intel that was shown before was well over 80% CPU utilization, while the 3930k showed all 6 cores loaded around 50% plus 2 HTT modules loaded to near max as well.

For Crysis 3, more cores = more performance. You can slightly get around this via IPC, but you still do not escape a bottleneck until at least 6 cores, arguably 8.

 


Likely IPC improvements then; same effect as raising the clock. Game is clearly GPU limited @ ~41 FPS though; would be interesting to see how the 4770k @ 3.5 would do with a more powerful GPU (EG: would the lead grow?).

The real question becomes why a 4770k @ 2.5 is GPU limited, and a 3930k @ 3.2 isn't. Don't think the Haswell IPC boost is enough to account for those numbers over IB. Maybe something architecturally the game really likes?
 
((EDIT: Spent 5 minutes editing the post, and lost everything. I hate the new forums more and more the more I'm here...))

You still haven't explained how it is that the 8350 with 8 cores exceeds the 3470/3570 performance in crysis 3, even though Intel has "dramatically superior" IPC.

Clockspeed.

Single Core Performance = clockspeed * IPC. Improving one is as good as improving the other. If one architecture has half the clock but double the IPC, they would perform the same.

Hence why the 3470 beats the 8150, but not the 8350. 15% IPC gains and 400 MHz worth of clock can easily account for the extra 6 FPS the 8350 gains over the 8150. Based on your "more cores" theory, both the 8150 and 8350 should be beating the 3470.
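As a quick check of the clockspeed * IPC relation, here is a minimal sketch. The 15% IPC figure is the one quoted in the post; the 3.6 GHz and 4.0 GHz values are assumed here to be the base clocks of the 8150 and 8350, matching the 400 MHz difference mentioned, and turbo behavior is ignored. The function names are made up for illustration.

```python
def single_core_perf(clock_ghz, ipc):
    """Relative single-core performance as clock times IPC."""
    return clock_ghz * ipc


def relative_gain(old_clock_ghz, old_ipc, new_clock_ghz, new_ipc):
    """Ratio of new to old single-core performance."""
    return (single_core_perf(new_clock_ghz, new_ipc)
            / single_core_perf(old_clock_ghz, old_ipc))


if __name__ == "__main__":
    # Half the clock with double the IPC gives the same result.
    print(single_core_perf(2.0, 2.0) == single_core_perf(4.0, 1.0))
    # Assumed base clocks (3.6 -> 4.0 GHz) with the quoted 15% IPC gain.
    print(round(relative_gain(3.6, 1.0, 4.0, 1.15), 3))
```

Under these assumptions the per-core gain comes out at roughly 28%, which only bounds what clock and IPC could contribute; it says nothing by itself about how the game scales with core count.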

The simple solution would be to normalize clocks for the purposes of looking at scaling. Normalize all CPU clocks to, say, 3.5GHz, so IPC and Cores should be the only two things affecting the benchmark results. IPC favors Intel, Cores favor AMD. If you see all the Intel chips grouped at the top, then my theory holds.

Any takers?
 


And this is again why the FX9590 is even better, irrespective of price, due to the higher clocks. What will be interesting is an official review, to see where these can OC to; should they get close to 6GHz on air/closed loop, they can be a rather impressive enthusiast part for clockers and gamers.

 

8350rocks

Distinguished


Well, the issue I see with the logic is simple:

If IPC were the only culprit and clockspeed a direct correlation, then why does a 6 core SB-E with lower IPC than IB hold pace and exceed the higher IPC and higher clocked IB architecture?

I am not saying IPC plays no role, it clearly will in some way for anything...however, the point I am making is, it's far from the only variable at work here. More cores have a correlation too, you're just not acknowledging that.

EDIT: Let's stop comparing separate manufacturer architectures for a moment...because it's like comparing apples and oranges and saying they both make juice, but one takes more fruit to make the same amount of juice.

The comparison is naturally flawed because we have no baseline to bridge between the 2 separate uarch's.

So, in Intel architecture, clockspeed makes a correlation, but Intel does not exceed max CPU burden comfortably until you get to a 6 core CPU. Even 4 core CPUs with HTT do not break below 80+% usage average per core. The only Intel CPUs that do are 6 core w/HTT, and even then, they typically have 2-3 HTT modules running and average 50% core usage across the board (counting physical core resources not HTT modules)

In AMD architecture, The FX 6300 is about the point where you start to see CPU utilization fall below a 90% aggregate average for core usage under load. This correlates with the Intel data because the i7 Quads with HTT always have 4 cores loaded highly and 2-3 HTT modules running near their max capability. The FX 8350 and possibly 8320 (hard to say definitively without data) are likely the only 2 AMD CPUs that do not function at 80+% load on all cores running this software.

Based on this information, I can definitively say, you need 6 cores or more from either manufacturer to get below a 90-95% average resource consumption per core under load. To be at a point where the CPU actually has headroom (<80% core resources consumed average) then you need to look at an 8 core AMD or a 6 core Intel.

Clockspeed affects frame rates in an expected non-linear scaling manner, though the biggest FPS jump seen comes when going from 2 to 4 cores (24 FPS) and then from 4 to 6 cores (20 FPS) inside Intel architecture. In AMD, the jumps from 4 to 6 to 8 cores scale less considerably, but are still more significant jumps in FPS than could be achieved via clockspeed.

This tells me 2 things: the engine is sensitive to IPC/clockspeed, but more sensitive to having enough resources to run all the threads concurrently, as those jumps were always more significant than jumps in clockspeed (even taking OC'ed CPU info from other forums into consideration, a 20% jump in clockspeed is less significant than the jump from 2 to 4 or 4 to 6 cores for Intel).

 

Krnt

Distinguished
Dec 31, 2009
173
0
18,760
Nice discussion here, keep it up, it's pretty entertaining :p

[strike]Now people you should take into account that most of those CPUs have Turboxxxx capabilities, and since that table had some very unexpected results for me, I think the workloads between threads aren't very balanced or its very affected by latencies.
[/strike]
As 8350rocks said, there are many variables here affecting the result.
Here is a link for those who love some ancient overclocked CPUs:
http://www.dsogaming.com/pc-performance-analyses/company-of-heroes-2-pc-performance-analysis/


 

truegenius

Distinguished
BANNED


it must be because of driver
recently i updated my hd6770's drivers and found that fps in codmw2 and 3 were between 40-45fps regardless of resolution and cpu core count or clock speed
and gpu remained above 80% for all settings (800x600 to 1920x1200)

and after i reinstalled the windows and used the drivers that came in driver cd i found that fps were normal and now i can get 60fps @1080p



Single Core Performance = clockspeed * IPC. Improving one is as good as improving the other. If one architecture has half the clock but double the IPC, they would perform the same.
i will say that everything matters but there are degrees to how much each matters
i think that all these matter in this sequence (decreasing, hardware level only)
ipc
clock speed
core count
threads count (per core/module)

not to forget that ipc is instruction level parallelism (am i right ? )
so while thread level parallelism depends on software efficiency to use more cores, ipc doesn't depend on that

(dunno what i want to say)
 

Krnt

Distinguished
Dec 31, 2009
173
0
18,760

No, its clearly GPU limited, read the original article for more details, but seems that the game likes low cache and memory latencies.

not to forget that ipc is instruction level parallelism (am i right ? )
so while thread level parallelism depends on software efficiency to use more cores, ipc doesn't depend on that

Whaaat?!

Well, cache and memory latencies and bandwidth are some of the things that have kept me away from Bulldozer and Piledriver, and I wish that Steamroller could fix that, but I try not to keep my hopes up.
 

Mitrovah

Honorable
Feb 15, 2013
144
0
10,680
will the new FM2+ socket affect compatibility for cpu coolers? I'm no expert, but could the difference of one pin force all cpu coolers to be installed with new adapters? conceivably an FM2 compatible cooler would work with the new FM2+

If CPU coolers won't be affected, why aren't motherboard makers coming out with the FM2+ socket now? they work with the old AMD APUs,
 

8350rocks

Distinguished


Anything that works on FM2 (i.e. any common cooler for CPUs) will work on FM2+, the only thing changing is 1 or 2 additional pins in the PGA for the APU.
 

Mitrovah

Honorable
Feb 15, 2013
144
0
10,680


If that is the case, why haven't motherboard makers started announcing FM2+ sockets? I have heard of only one manufacturer announcing the new socket, and I'm guessing it won't be out until late 2013 or 2014.
I'm on the verge of building a new system and I've just about picked all the parts.
this seems like the persistent asking of a spoiled 3 year old, but I'm curious why the new socket isn't being expedited if it isn't much of a change
 

8350rocks

Distinguished
No need for the new boards until the new APUs hit...I can assure you by Q4 this year, you will see the new boards and APUs. Asus already has a FM2+ A87 chipset MB prototype they displayed at E3
 

Mitrovah

Honorable
Feb 15, 2013
144
0
10,680


well, i just recently heard kaveri might be delayed. aside from that i would rather see fm2+ motherboards come out ASAP, so i could build a richland system sooner and upgrade later without buying a new motherboard. i'm still on the fence whether i want to wait for kaveri or get a richland. But oh well, i can't move heaven and earth so that's that.
But still it seems a little silly to delay the socket if it won't affect anything else except future upgradability
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010


After reading the first word, I almost wanted to rage, then I read the rest. I would usually comment that this is a bad idea, but now that the difference in IPC and core count is in the limelight, I must say, you are a genius. The only thing that I may add is this. Obviously, with almost every CPU, the Intel chip will dominate the AMD chip on an equivalent basis (i.e. i5-3570k vs FX-8350, where at the same speed the Intel will always win). Noting a percentage and/or numerical difference between, say, those two chips @ 3.0, 3.5 and 4.0 would make sense. Even with other CPUs, it would work.

The only thing left, however, is to state on a scale (maybe even a 0-100 rating) how one performs relative to the other. This would be used as a reference in case any unusual results raise questions.

^_^
 