AMD CPU speculation... and expert conjecture

Status
Not open for further replies.
2 threads per core means 4T per module, dual module capable of executing 8 threads. sounds nice in theory. i think this is the way intel might go with skylake... and/or skymont atom. since amd is seemingly trying to get the jump on intel, it'd be great if they succeed.
i skipped over the seemingly-fake-dieshots-may-not-be-fake argument, totally uninteresting. i am more interested in how the quad threading per module will work. a mix of cluster multithreading with amd's version of hyperthreading, perhaps? i really don't want to go back and research thread execution again. :(
imo, this approach seems more reasonable than the cmt-only approach with bulldozer. if software uses all 8 threads (wider), the perf will be there, but if it loads 4 cores (less wide, but faster (longer/vertical?)), it'll get strong per-core perf (frequently abused by c.a.l.f. as single core perf) and far less unused hardware. seems more possible to scale down to low tdp (for laptops and ultrathins). but... if it gets power gating and power management as bad as bd's, and turbo control as bad as bd's (that took all the way till richland to become barely passable), then jaguar will remain at the top.
 
Jim Keller did say that, although I think the JK effect will only start to manifest in Steamroller. Excavator will likely be a full-on JK design, and if there is anyone that knows how to innovate, it's JK.

On the drivers issue, I am looking forward to that. Should they make the improvements that the prototype drivers made and then some, and this applies to Dual Graphics too, we may finally see Dual Graphics become significant: for around $190 you get 1680x1050 at least, and some games scale well at 1080p with settings similar to entry-level gaming parts.

I decided to buy and read a PCFormat, and considering it's an international magazine, with all of them doing the exact same thing, they finally got to the HD7790 vs GTX650ti Boost debate. While I agree the 650ti Boost was well timed and is about the only thing Nvidia did right in the entire GTX600 family, I do refute the claims made. The glaringly obvious point is that the 650ti Boost and 7790 are not really directly competing despite sitting in close price proximity. The HD7790 was designed not only to replace the Cape Verdes by offering 20-25% more than the 7770, but also to reduce power relative to the Cape Verde part. With performance closing in on the HD7850, it also gave AMD the opportunity to EOL the dense Pitcairn Pro design. This follows the Tahiti LE, which gave AMD the chance to EOL Cape Verde and Pitcairn silicon that had sold out of supply and replace it with better-placed products. The 650ti Boost is stripped-down unsold GTX660 silicon that draws more power and runs hotter than the original GTX660. The 650ti Boost was more a product to compete with the HD7850/7870GE, which dominated the lower price segments for well over a year: a good product, but one that is already late and irrelevant now, with the GTX700 releases and the likes of the 750 range set to be faster within the same price bracket.

While AMD reinvented the wheel with the Bonaire XT, Nvidia's 650ti Boost is a response to AMD's price war and a clear acceptance that Nvidia is sitting on a lot of unsold silicon. The problem Nvidia has now with this new price war is that the aggressive pricing on the 600 cards will eat into the sales of the 700 family. Conversely, AMD has EOL'd all 6000 parts bar the Turks-based cards, and has EOL'd the Pitcairns and Cape Verdes, supplementing them with less dense silicon in the form of the 7870XT and 7790, which occupy strong markets, while Tahiti silicon in the HD7970 mold is still being sold in volume to date. I think AMD is now in a better position to release a new-generation graphics series, having sold off most of their silicon.

Overall, if performance is the priority, the beefier GTX650ti Boost is the way to go, but the HD7790 offers tremendous perf/watt and perf/dollar, and some are selling in the lower $120-$130 window, which is around $25-50 less than the 650ti Boost models. I think Sapphire needs to address pricing on their Vapor-X cards: way too high, I think $130-150 is ideal.
 


if AMD ever does do 4 thread per module, which might come with excavator if that ever comes, it will be CMT with 4 cores per module with better shared resources. There is no way they would be able to effectively add SMT to a module at this point without adding a lot of complexity.
 

i myself ignored the idea of 4T per module at first, but the more i think about it, the more i think that it's a better way than cmt alone. maybe i am ignorant about how much complexity this kind of design approach will add... the biggest issue may be that mixing cmt and smt may get in the way of the modular design. other than affecting modularity... the next hurdle may be the turbo controller.
maybe adding cmt to the modules helps them save transistors reserved for adding extra modules. it could be software-friendly as well. imho, cmt with 4C per module seems awfully large in terms of die size and may not be possible/feasible until a 14nm high performance node, or at least 18-16nm. meanwhile, octothreaded dual modules save space and focus on integer processing, leaving fp processing to the igpu - that's why i think making an sr-fx might be difficult.
for the most part, i agree with you because it really seems more difficult to implement than 4C/module. but i think smt/core seems cheaper on a bigger node.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


The Opteron series with 8 PD cores will be replaced at the end of 2013 by Berlin CPUs and APUs with 4 SR cores. AMD claims that HSA support is only enabled in the APU, not in the CPU. If they can release an SR CPU without HSA, I don't see any reason to wait for an FX SR series with HSA enabled via a GPU. Moreover, I know of no plans for HSA GPUs.



Well, I don't see any move in that direction.

8C PD Opteron CPUs replaced by 4C SR Berlin CPUs and APUs. Why not an 8C SR Berlin CPU?

6C SR Kaveri eliminated from roadmap. Then GDDR5 support eliminated as well.

Then add the crazy numbering of the Centurion line. FX-4xxx, FX-6xxx, and FX-8xxx denote 4/6/8 cores. But now the new FX-9xxx are 8 cores as well. It seems as if Centurion is the end of the FX line. This idea matches AMD's claim that we should expect one more FX series for the socket. Everyone (including myself) interpreted this as a coming SR FX series being socket compatible with PD, but now I start to think that the Centurion FX line is that socket-compatible FX series.

I hope to be completely wrong.
 


Wouldn't work with the current design. HT is just exposing another register stack to the OS so that it may schedule onto it. It's cheap to do in the silicon but doesn't add any additional processor resources. The SB uArch has 3 ALUs per core; rarely will all 3 of those be used, so HT allows the spare ALUs to be used to do "other things". The BD uArch has 2 ALUs per core, and the times when one of those ALUs isn't being used can be rare. We're already having problems feeding 16 ALUs via eight exposed register stacks; trying to use sixteen register stacks won't do much for you, especially with the chunkiness of x86.
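The spare-ALU argument above can be put as a toy calculation. This is only a sketch: the per-cycle demand distribution is invented for illustration (not measured on any real chip), and `idle_alus` is a hypothetical helper, not anything from Intel or AMD documentation.

```python
# Toy model: expected idle ALU slots per cycle for ONE thread on a core.
# ASSUMPTION: the demand distribution below is made up for illustration only.

def idle_alus(width, demand_probs):
    """Expected number of ALUs left idle per cycle.

    width        -- ALUs in the core (3 in the SB example, 2 in the BD example)
    demand_probs -- {ops the thread could issue this cycle: probability}
    """
    return sum(p * max(0, width - ops) for ops, p in demand_probs.items())

# Hypothetical thread that usually has 1-2 independent integer ops ready.
demand = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}

print(idle_alus(3, demand))  # spare slots a second (HT) thread could pick up
print(idle_alus(2, demand))  # fewer spare slots on a 2-ALU core
```

With the same made-up thread, the 3-wide core leaves roughly twice as many idle slots per cycle as the 2-wide one, which is the intuition behind "HT has less to harvest on BD". Real cores are limited by far more than ALU count, so treat this as intuition only.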
 
I would assume that, should AMD FX still be on the face of the planet, it would be an octo-core. While a 2M/4C/8T system can seem like a good idea, it could be just as hard, if not harder, than the current design.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


The FlexFPU and the BIU ( interconnect interface and L2 ) are already SMT (2 threads at the same time).

Actually SMT is quite a bit easier, and performs worse in every aspect, compared with CMT (cluster multithreading) (no, it doesn't have to do with single-thread -> that is *NOT* high performance).

But no matter what is the choice, the real "crux" is the whole cache hierarchy... you can even put 10 cores, but if the cache system is no good, or the DRAM access is terrible, then say goodbye to performance...

4 threads per module would be nice, but even better would be 2 FlexFPUs. That die shot that circulated, and was posted here a few times, is a 4 thread module for sure... almost everything is duplicated, including the size of the cores with 4 ALU + 4 AGU, which could now function like separate clusters inside a cluster M(2xCMT)... that is, as exposed by the developer guides, and planned for PD (didn't happen), the AGUs in versions 20h and up are able to execute simple ALU operations and some MOVs (register to register)..

So having 4 AGUs crunching a thread it can do a lot of the normal execution not only "address generation"...

One way i see this functioning, and not being SMT, is with a form of *eager execution*, or access ahead or memory ops ahead, and 1 thread at the AGUs "cluster" at a time (the same at the EXs mul & div)...

DAE (decoupled access execute) has long been an academic exercise... it can have quite a good advantage for an OoO superscalar, not only in ILP (instruction level parallelism = IPC), but also in clock speed (GHz)... and nothing is more decoupled than the BD (Bulldozer) uarch (quite pertinently ahead of any intel design); it's not only decoupled at control and execute, it's several stages, it's "vertical multithreading". Unlike pure academic exercises, this kind of approach (DAE) can only function well if there is "good data speculation", because the rationale is to break the von Neumann paradigm by having separate PEs (processing elements) for control/access and execute, and this decoupling serves to get the data, with the "access quite ahead of execute", in order to warm up and fill the whole cache hierarchy as well as possible. Without good data speculation the whole thing falls to the ground, exactly because of the high latency affecting the "data", leading to "loss of decoupling events" through bubbles & stalls waiting for data to execute, in the current path from DRAM to caches to exec pipes (it can be hundreds and hundreds of cycles from DRAM to exec pipes).

So the idea is to maintain that "memory ops" distance (stall and change sub-clusters, with 1 or 2 OS threads per core), that is, memory ops will always try to keep a good distance ahead in the OoO (out of order) scheme. 4 ALUs work on one thread and 4 AGUs work on another, but they can be dynamically interchangeable; some form of reservation stations holds the thread states, and 2 threads in what ppl recognize as a core/cluster have 2 groups of function units that can *each* work on one thread, but one at a time and interchangeable, and highly OoO... so this is not SMT, and it can preserve and augment the advantages of CMT. But only with good caches and good "data speculation" (lol)...
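The "access well ahead of execute" idea can be sketched with a toy timing model. This is an illustration under stated assumptions, not a model of Bulldozer or any real pipeline: every op is one load followed by a 1-cycle execute, loads have a fixed latency, and `decoupled_cycles` and its parameters are hypothetical names.

```python
# Toy model of decoupled access/execute (DAE). All numbers are illustrative.

def decoupled_cycles(n_ops, latency, depth):
    """Cycles to finish n_ops (load -> 1-cycle execute) pairs.

    latency -- cycles from issuing a load until its data is usable
    depth   -- how many loads the access side may run ahead of execute
               (depth=1 behaves like a fully coupled in-order pipeline)
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    issue = []      # cycle at which each load is issued
    exec_done = []  # cycle at which each execute op completes
    for i in range(n_ops):
        t = issue[-1] + 1 if issue else 0        # at most one load per cycle
        if i >= depth:
            t = max(t, exec_done[i - depth])     # wait for a free queue slot
        issue.append(t)
        data_ready = t + latency
        prev = exec_done[-1] if exec_done else 0
        exec_done.append(max(data_ready, prev) + 1)  # execute stays in order
    return exec_done[-1] if exec_done else 0

print(decoupled_cycles(100, 200, 1))    # coupled: 20100 cycles
print(decoupled_cycles(100, 200, 100))  # access far ahead: 300 cycles
```

When the access side can run far enough ahead, the DRAM-class latency is paid once instead of once per op. The model also assumes every address is known early, which is exactly where the "good data speculation" requirement comes in: without it the run-ahead distance collapses back toward `depth=1`.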

NO, i'm NOT validating that die shot (could be fake)... but that die shot seems to have indications of what i just exposed above... with eager execution and cluster-in-cluster you can save a lot of space and power, because if you are going to have 4 "cores", transposing from the probable Steamroller single thread single "core", you also have to have 4 decoders and 4 LS engines (Load/Store), and these last ones tend to be quite a bit bigger compared with the others.

So perhaps that exposé about Excavator that circulated on April 1st is not far off base. it's not exactly SpMT (speculative multithreading) but a "decoupled control/access - execute" paradigm with an "eager execution" approach... but since it will be extensively "data speculative", its normal threads will spend quite a lot of time speculating for "data", but it will not involve more threads than the 4 per cluster scheduled by the OS.
[ in the end it could have internal thread contexts for "access" and for "execute" (internal DAE), in which case it can be considered "Speculative Multithreading", and have the context switch of these "internal threads" follow a "dataflow" paradigm, that is, directed by data availability and not the program counter, which will continue directing the execution on a broader view (talking about an internal highly Out-of-Order approach) ] (edt)

So i think SR will debut "data speculation" for AMD, on top of the current STL (store to load) OoO memory ops schemes... intel has had it since Nehalem... Intel's model is akin to a "data renaming" scheme (trace a parallel with "instruction renaming"); it has "data speculation", and that has been intel's advantage (not more exec pipes, decode etc). only the monolithic nature of the design, with small L1 and L2 caches, is showing its age and can't grow much more. that is why hasfail is hasfail: it has come close to the max "dataflow" limits, and throwing more exec pipes at the problem (33% more exec pipes from SNB/IB to HSW gained only 10%) doesn't help... while AMD's approach, extensively decoupled and modular, has more legs to fine tune certain parameters of "data speculation"... and 30% more ops per cycle is not a pipe dream...

Then Excavator can extend this with 4 thread per module and "eager execution" (can be data ahead and can be both sides of some branches at the same time )

(a possible POV of how to extend BD)
 

8350rocks

Distinguished


But some of the Vapor-X cards are 2 GB GDDR5 with 1+ GHz clock speed and offer performance on par with an HD 7850. AMD didn't want anyone to cannibalize the HD 7850 so they could unload the remaining silicon, but Sapphire saw a window and made an HD 7790 "on steroids" with double the normal VRAM and a good 150-200 MHz GPU and memory clock speed bump. That's why the 2 GB Vapor-X cards are priced in HD 7850 territory. Additionally... those Vapor-X cards likely compete with the GTX 650Ti Boost quite well, though they cost similar money too.

I also think JK has his hands on Steamroller, but I don't think it will be a full-on JK design either. It was likely too far along when he got there to be a full-blown design from his desk, though I do think Excavator will be a full-on JK design and we will see some impressive things. If he and John Gustafson collaborate on Excavator APUs... I wonder if Intel will even be able to get into the ballpark at that point. I could picture a 4 core APU at 14nm XM Hybrid FD-SOI with massive parallelism, HSA, and 768 GCN 2.0 cores.

I think such a product would retail around current FX 8350 money, but offer a 5 GHz CPU with HD 8770-8790ish graphics on board, and likely be something on the order of a 20+% IPC gain over Kaveri.

That thought is impressive...1080p High settings from a stock APU anyone? LOL.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


It would only be true, that AMD advantage in graphics and the client market sector (since GPU with compute will be much more important than CPU alone for "client"), *if* they can show an APU with the FlexFPU replaced by an evolved kind of GCN "core" that is also capable of 256bit AVX. Yes, that means very flexible execution: 16x 32bit pipes per SIMD engine is 512bit, so it could at least do a 256bit instruction per cycle per SIMD, alternating with 32bit x16 for graphics per engine.

A *Module* with 4 threads could have 2 of those SIMD engines or FlexFPUs (funny, it would be exactly equivalent to 4 128bit FMACs of the current engines lol.. or equivalent to 2 current FlexFPUs), with an additional very small L1 block for graphics instructions, accessing the L3 directly for "data", augmented with a 128KB scratchpad cache per module (local data store, and much denser than any L1), for graphics FP register state holding, graphics FP register spilling, and with preemptive context and exception possibilities.

Yes it means even "MOAR" decoupling (lol)... the graphics front-end including texture filtering, interpolation and ROP operations (Z, aliasing), the video (UVD, VCE) and its compression will also run on these additional GCN cores... it will have a couple of additional GCN CUs.

64sp of additional exec (1 GCN CU) is more than enough to serve each module's shader capability, and each of these additional CUs could have enhanced scalar co-processor(s) with at least 2x the current power, i.e., 2x the special functions, 2x the branch, 2x the load/store and 2x the texture filtering capabilities, and so be able to serve 1 CPU module *each*, with the module's 2 SIMD engines functioning as the shaders.

And so this "Decoupled Graphics Engine" will do the raster/tessellation and the ROP ops on these "decoupled" GCN cores, passing the exec instructions and data to the *modules* that will function like the "shaders". the most pertinent and obvious advantage is that server APU versions with extensive "compute" capabilities will have that same FlexFPU architecture, but will dispense with the additional graphics engine (no need for video or graphics, but more modules).

2 of those SIMD engines per *module* posing as FlexFPUs will be equivalent to or more than 2 GCN CUs (each has 4 SIMDs), since the module could run at 4 GHz+ vs 1 GHz+ for a possible GPU, or 4x the frequency of a discrete graphics version.

So a 4 thread module, 4 module APU will have 16 CPU threads (lol), and the equivalent of 512sp plus the 4 additional graphics-dedicated GCN CUs, or 256sp more, for an equivalent 768sp+ level discrete GPU. 3 modules will be 12 CPU threads and a 576sp+ level GPU. But this is relative i think; this kind of APU has more advantages than the number of GPU equivalent "sp" indicates. just remember Fermi had 512 sp and battled a contender with three times as many, 1536 sp... so an above-1024sp equivalent GPU is not out of reach for such a 4 module APU.
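The "sp equivalent" arithmetic in the last few paragraphs can be checked with a short sketch. Everything here is taken from the post's own speculation (2 SIMD engines per module, 16 lanes per SIMD, a module clocked about 4x a GPU, one extra 64-sp GCN CU per module); none of it is an AMD specification, and `apu_equivalent` is a hypothetical helper.

```python
# Back-of-envelope "sp equivalent" for the speculative 4-thread-module APU.
# ASSUMPTIONS: all defaults come from the forum post, not from AMD documents.

def apu_equivalent(modules, threads_per_module=4, simd_per_module=2,
                   lanes_per_simd=16, clock_ratio=4, sp_per_cu=64):
    """Return (cpu_threads, module_sp, extra_cu_sp, total_sp)."""
    cpu_threads = modules * threads_per_module
    # Each SIMD lane at ~4x GPU clock is counted as 4 GPU-clock "sp".
    module_sp = modules * simd_per_module * lanes_per_simd * clock_ratio
    # One additional graphics-dedicated GCN CU per module.
    extra_cu_sp = modules * sp_per_cu
    return cpu_threads, module_sp, extra_cu_sp, module_sp + extra_cu_sp

print(apu_equivalent(4))  # (16, 512, 256, 768)
print(apu_equivalent(3))  # (12, 384, 192, 576)
```

Under these assumptions the 4-module total works out to 512 + 256 = 768 and the 3-module total to 576. Counting a lane at 4x the clock as four "sp" is the post's own simplification and ignores bandwidth and scheduling limits.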

That is, i think, one possible path and why the *module* architecture "seems" so right for this... and i think intel will do the same: PhyX cores will be attached to each CPU core as an "internal" co-processor, and the current mainstream iGPs will be augmented accordingly to serve this as additional shaders too.

Puzzling
http://i.imgur.com/HDjJSET.jpg
If this is a fake, that guy did a terrific job of puzzling together a module along the same lines i thought... NO! i didn't do it ... perhaps telepathy LOL ;)
 

jdwii

Splendid
If AMD does 4 threads per module instead of 2 i think i'm jumping ship. it's the stupidest thing i ever heard, when their performance per clock is at least 30% less than intel's. CMT itself lowers performance per core, and it's not possible to run 2 threads at the exact same time on one core, and if it is, can we say latency issues. JF-AMD said it 1000 times. I'm always wondering why anyone with a desktop would want 16 cores or 12. isn't this whole OpenCL (or anything else that mimics it) thing supposed to kick off one day anyways?

It's hard enough that 2 cores fight for resources already in AMD's design, causing latency issues; we don't need 4 of them doing it.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Power is a function of voltage squared times frequency. Since higher clock speeds require a higher voltage for stability, the power ends up being much higher. For example, take the top Kabini (A6-5200, 25W) and double it to ~50W, then compare that to an FX-4300 (95W). That is close to twice as much power. I don't think either company wanted a 200W APU.
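A minimal sketch of that scaling rule, using the usual approximation that dynamic power goes as C * V^2 * f. The 20% voltage bump is a made-up illustrative figure, not a measured value for any of these chips, and `relative_dynamic_power` is a hypothetical helper; the TDP numbers are the ones quoted in the post.

```python
# Dynamic power scales roughly as C * V^2 * f (leakage is ignored here).

def relative_dynamic_power(v_ratio, f_ratio):
    """Power relative to a baseline, given voltage and frequency ratios."""
    return v_ratio ** 2 * f_ratio

# ASSUMPTION: doubling the clock needs ~20% more voltage (illustrative only).
print(relative_dynamic_power(1.2, 2.0))  # about 2.88x the baseline power

# TDP figures quoted in the post.
kabini_doubled_w = 2 * 25   # two A6-5200-class quad-core blocks
fx_4300_w = 95
print(fx_4300_w / kabini_doubled_w)  # 1.9, i.e. "close to twice"
```

The point of the sketch is that frequency alone would only double the power; it is the voltage term, squared, that pushes a high-clocked part well past 2x.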

For PC gaming the 4C chips will run the equivalent 8 threads just fine, as they are running more than twice the clock speed. The disadvantage is they will be using twice the power to do so.

 

jdwii

Splendid


That reminds me of HT. EDIT except HT doesn't use twice the power.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780

(edited for corrections)

Ummm... wrong... CMT lowers performance per 2 threads... while intel's uarch doesn't do that, because the performance gain of 2 threads in one core is restricted due to the very nature of SMT, that is, HT on average cannot gain more than 30%, it's supposed to work on the bubbles and stalls of the "main" thread.... the design is intentionally restricted.... clearly shown by the 3570K without HT: all other things equal, its performance is nearly identical to the 3770 with HT, even on mildly threaded code.

AMD wanted to have the equivalent of 2 cores, and while in server it's not that bad, in client/desktop it clearly falls short.

And 30% comes from where ? ... hearsay ? ... now that is really clever lol ... and for the zillionth time, who cares about single thread ?... the future is multi-threading no matter the mantra, otherwise no one (NOT even Intel) will make chips with more cores than today, and you can forget about upgrading to gain performance. ILP = IPC IS DEAD, because even if you have 16 cores you'd only use 4 on average. So for chips to have more cores, to be wise, to be sensible, a better multithreading effort somehow has to escape a little from the HPC world, or these rants about who is above whom will become the most stupid imaginable, with peremptory statements about incredibly, ridiculously small 10 or 20% differences that in the REAL WORLD nobody can feel the slightest difference from when working with their systems.

It also clearly shows you have few clues lol... the performance per core doesn't have anything to do with the "modular" aspect. if it did, then Piledriver couldn't gain 10% (the same as hasfail) with exactly the same design but tweaked, nor would AMD be able to *officially* announce a *module* with 30% more ops, with the 2 decoders specifically addressing the drawback of shared resources for the 90% of 2 "independent" cores.

what it shows in benchmarks is the effect of software, not hardware, and has little to do with the software that ppl use at home/in offices... the review enterprise is a marketing enterprise disguised as an "entertainment" clicks-for-money business (like going to the circus or to a movie). there is NOTHING peremptory or ABSOLUTE about its measurements, only a *FEW* clues about improvements and stronger and weaker points... just do a blind test and be surprised at choosing an AMD system that supposedly is slower, thinking it's intel lol... software can make a lousy chip and break a good one; software has MORE than 1 order of magnitude more performance to gain by tweaking than the hardware.

30% ? ... it's possible that in some tests Intel gains even more than that, but the average on REAL WORLD current loads is not even half of that... DEPENDS ON THE *SOFTWARE*... i sound like i'm trying to correct a broken routine of "hearsay" lol ... but just for variety, here you have a test where an FX8350 is >600% (compare with 30%) ahead of an i7 3960x .... what do you think of that ?? only a larger cache and better multithreading, or is SOFTWARE the principal culprit ?? ( IF the second, so it is at AT and other sites; if fake, then it's fake for everybody)

JUST APPRECIATE
http://www.phoronix.com/scan.php?page=article&item=llvm_clang33_3way&num=4
 

jdwii

Splendid
^ again you're not going by single core performance, you're going by 2 threads. Intel's design no longer lowers single threaded performance and it takes no extra die area to use HT (lower cost, lower TDP). AMD is also easily, without a doubt, 30% slower per clock: for one, their CPUs are clocked 20+% higher and they're still slower per core. people can be in denial all they want but at the end of the day it's the truth.
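For anyone who wants to sanity-check a "slower per clock" claim, the relation is just per-core performance divided by clock. A small sketch follows; the ratios plugged in are hypothetical examples, not benchmark results, and `per_clock_ratio` is an illustrative helper.

```python
# Per-clock (IPC-like) comparison of chip A against chip B.

def per_clock_ratio(perf_ratio, clock_ratio):
    """perf_ratio  -- A's per-core performance divided by B's
    clock_ratio -- A's clock divided by B's
    Returns A's per-clock performance relative to B's."""
    return perf_ratio / clock_ratio

# Hypothetical: A matches B per core while clocked 20% higher.
print(round(per_clock_ratio(1.0, 1.2), 3))   # 0.833, about 17% lower per clock

# Hypothetical: A is 30% slower per clock AND clocked 20% higher.
print(round(0.7 * 1.2, 2))                   # 0.84, per-core performance vs B
```

So "30% slower per clock at 20% higher clocks" would mean roughly 16% lower per-core performance, if both figures held. Whether they hold depends on which chips and which software are being compared.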

You act like CMT doesn't force the cores to fight for resources. Also, when are people going to learn everything can't be parallel? Never, i believe. It's sad that some people find themselves making excuses, saying things like benchmarks don't matter and calling them fake. i hope AMD and Intel are smarter than that, because if they're not we're not going to see anything great in years.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Agreed. That 1 module 4 thread stuff just isn't going to happen for a consumer CPU. It would be the complete opposite direction of what PD to SR is. Making bigger modules will just make the power issue worse as the cores can't be run at different clock speeds.

Even Intel's newest Atom cores are abandoning hyper-threading. It was deemed too inefficient.
 

jdwii

Splendid



It's an easy 15-30% gain, however, without the cost of die area, but like i said i hope they don't do this. Some people are in denial, however; they're hoping AMD makes a 64 core processor with the performance per clock of an atom.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780



NOT everything... just clearly better. it doesn't even have to be HPC standard (HPC means High Performance Computing) ... so it already started: you didn't see anything, you didn't appreciate anything. actually in Linux AMD is clearly ahead of intel (software counts)(ed)... and you make "peremptory arguments" about ridiculously small differences, as if "performance matters", *yet* have no clue about, or don't want to consider, what the High Performance world does, how, and what it means. :??:

In the end you are free to *believe* (it's a belief) anything you want... but i can't understand you: what kind of performance are you talking about ? (edt)



Oh! yes i can even agree with that... a 4 thread module means 16 or 24 threads for the client/desktop.... even APUs could have more than 12 at the 20nm class fab processes... it will not happen anytime soon because it's a waste, given the typical workload ppl have in the client/desktop windows world (software mandates)(edt)

OTOH i don't understand why ppl are content with discussing ridiculous differences... why ppl don't want REAL High Performance, as if in a state of resignation, instead of demanding better products for the money :??:

at this rate the one who will gain most is ARM... the silent intruder to the throne ... the A57 is already on top of Jaguar in performance and ahead of Atom (its ISA lends itself to light and fast software)(edt).... the polarization of intel vs AMD discussions is starting to get quite abhorrent, a flush of passions with a total lack of vision and reason.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


4.2GHz to 3.9GHz is a 7% difference, not 20%
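The percentage is quick to verify. A minimal sketch using the two clocks quoted above; the post doesn't say which specific chips they belong to, and `pct_higher` is just an illustrative helper.

```python
# Percentage gap between the two clock speeds quoted in the post.

def pct_higher(a, b):
    """How much higher a is than b, in percent."""
    return (a - b) / b * 100

print(round(pct_higher(4.2, 3.9), 1))  # 7.7  (4.2 GHz relative to 3.9 GHz)
print(round(pct_higher(3.9, 4.2), 1))  # -7.1 (3.9 GHz relative to 4.2 GHz)
```

Either way you take the baseline, the gap is in the 7-8% range, nowhere near 20%.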

And CMT doesn't "oblige" the cores to fight for resources. If AMD wanted, it could have everything doubled: double Fetch, double Branch, double Decode, double Cores, double FPU... yet maintain the same EXACT "modular cluster topology" with "vertical multi-threading" and whatnot.... as a matter of fact they already have SMT like intel for some functional resources. It's a question of balance and tradeoffs (you can't design any chip with infinite resources like in simulations); some things do well shared, some don't, but in the end CMT as a concept, and particularly in AMD's implementation, is quite flexible, and better than any SMT implementation (not only intel's).

avoid talking about things you have little/no knowledge of.

 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


HT does cost some die area but it's only like 10%. When Intel was redesigning Atom from in-order execution to out-of-order, they deemed HT too inefficient. They decided to just double the number of cores instead of doing HT. AMD took a similar route with Jaguar.
 
I played around and managed to fix up Dual Graphics performance; in fact it was easier than waiting for official Catalyst drivers. Catalyst CAP 13.5 has a lot of presets. I selected BF3 and got no more microstutter and playable 1080p at low/med settings, and it is very nice at 1680x1050 with med/high settings. I can do a review on this soon, as for me it's one aspect that is worth gold and spanks the bejeezus out of Intel's Iris.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Where have you seen A57 benchmarks?

All I've seen is press releases and theoretical performance numbers. Nothing real world yet.
 


AMD is expecting the A57 to beat Jaguar in efficiency and performance per core, at least in some microserver workloads. It could be they are comparing a 2GHz A57 to their current 1.4GHz Jaguar cores. Either way, A57 should come close to Jaguar and Silvermont.
 