esrever :
de5_Roy :
2 threads per core means 4T per module, and a dual module capable of executing 8 threads. Sounds nice in theory. I think this is the way Intel might go with Skylake... and/or the Skymont Atom. Since AMD is seemingly trying to get the jump on Intel, it'd be great if they succeed.
I skipped over the seemingly-fake-die-shots-may-not-be-fake argument, which is totally uninteresting. I am more interested in how the quad threading per module would work. A mix of clustered multithreading with AMD's version of Hyper-Threading, perhaps? I really don't want to go back and research thread execution again.
:(
IMO, this approach seems more reasonable than the CMT-only approach of Bulldozer. If software uses all 8 threads (wider), the performance will be there; but if it loads only 4 cores (less wide, but faster per core), it gets strong per-core performance (frequently abused by c.a.l.f. as "single-core perf") and far less unused hardware. It also seems easier to scale down to low TDP (for laptops and ultrathins). But... if it gets power gating and power management as bad as BD's, and turbo control as bad as BD's (which took until Richland to become barely passable), then Jaguar will remain at the top.
If AMD ever does do 4 threads per module, which might come with Excavator if that ever arrives, it will be CMT with 4 cores per module and better shared resources. There is no way they could effectively add SMT to a module at this point without adding a lot of complexity.
The FlexFPU and the BIU (the interconnect interface and L2) are already SMT (2 threads at the same time).
Actually, SMT is quite a bit easier to implement, and it performs worse in every respect compared with CMT (clustered multithreading) (and no, this isn't about single-thread performance, which is *NOT* the same thing as high performance).
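As a rough illustration of the structural difference (not a verdict about real silicon), here's a toy issue-throughput model: under SMT the threads compete for one shared pool of pipes, while under CMT each thread owns a dedicated cluster. All pipe counts and per-thread demands below are made-up assumptions.

```python
# Toy issue-throughput model of SMT vs CMT resource sharing.
# All pipe counts and per-thread IPC demands are illustrative assumptions.

def smt_throughput(threads, shared_pipes, demand_per_thread):
    """All threads compete for one shared pool of issue pipes."""
    return min(threads * demand_per_thread, shared_pipes)

def cmt_throughput(threads, pipes_per_cluster, demand_per_thread):
    """Each thread owns a dedicated cluster; no cross-thread contention."""
    return threads * min(demand_per_thread, pipes_per_cluster)

# Two threads, each demanding 3 ops/cycle:
print(smt_throughput(2, 4, 3.0))   # capped at 4 by the shared pool
print(cmt_throughput(2, 2, 3.0))   # 2 per cluster, deterministic, 4 total
# A lone thread:
print(smt_throughput(1, 4, 3.0))   # it can grab the whole shared pool
print(cmt_throughput(1, 2, 3.0))   # capped by its own cluster
```

The model shows why CMT gives each thread a predictable slice of hardware, while SMT trades per-thread predictability for a bigger pool a lone thread can monopolize.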
But no matter which one is chosen, the real "crux" is the whole cache hierarchy... you can put in even 10 cores, but if the cache system is no good, or DRAM access is terrible, then say goodbye to performance...
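A back-of-envelope model makes the point concrete: average memory access time (AMAT) feeds straight into effective CPI, so a leaky cache hierarchy swamps any core-count advantage. All latencies and miss rates below are made-up illustrative values, not figures for any real chip.

```python
# AMAT and effective CPI: why the cache hierarchy dominates performance.
# Every latency (cycles) and miss rate below is an illustrative assumption.

def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, dram):
    # average cost of a memory access across the L1 -> L2 -> DRAM hierarchy
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)

def effective_cpi(base_cpi, mem_refs_per_instr, amat_cycles, l1_hit):
    # stall cycles = memory refs * (AMAT beyond the pipelined L1 hit time)
    return base_cpi + mem_refs_per_instr * (amat_cycles - l1_hit)

good = amat(4, 0.05, 12, 0.20, 200)   # healthy caches: ~6.6 cycles
bad  = amat(4, 0.15, 20, 0.50, 200)   # leaky caches, slow DRAM: 22 cycles
print(good, bad)
print(effective_cpi(1.0, 0.3, good, 4))   # ~1.78 CPI
print(effective_cpi(1.0, 0.3, bad, 4))    # ~6.4 CPI: >3x slower per core
```

With those made-up numbers, ten cores at ~6.4 CPI each would lose to four cores at ~1.78, which is the point being made about cache quality trumping core count.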
4 threads per module would be nice, but even better would be 2 FlexFPUs. That die shot that circulated, and was posted here a few times, is a 4-thread module for sure... almost everything is duplicated, including the size of the cores with 4 ALUs + 4 AGUs, which could now function like separate clusters inside a cluster (2x CMT)... That is, as exposed in the developer guides, and intended for PD (it didn't happen), the AGUs in models 20h and up are capable of executing simple ALU operations and some MOVs (register to register)...
So with 4 AGUs crunching a thread, they can do a lot of the normal execution, not only "address generation"...
One way I see this working, without being SMT, is with a form of *eager execution*, or running the access/memory ops ahead, with one thread at the AGU "cluster" at a time (and the same at the EX pipes for mul & div)...
DAE (decoupled access-execute) has been an academic exercise for a long time... it can give quite a good advantage to an OoO superscalar, not only in ILP (instruction-level parallelism, i.e. IPC) but also in clock speed (GHz)... and nothing is more decoupled than the BD (Bulldozer) uarch (quite pertinently, more so than any Intel design): it's not only decoupled between control and execute, it's decoupled across several stages; it's "vertical multithreading". Unlike the pure academic exercises, this kind of approach (DAE) can only work well if there is good "data speculation". The rationale is to break the von Neumann paradigm by having separate PEs (processing elements) for control/access and for execute, and the decoupling serves to run the access well ahead of the execute, in order to warm up and fill the whole cache hierarchy as well as possible. Without good data speculation the whole thing falls flat, exactly because of the high latency affecting the data, leading to "loss of decoupling" events, with bubbles and stalls waiting for the data needed to execute; in the current paradigm that can be hundreds and hundreds of cycles from DRAM to the exec pipes.
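The access-runs-ahead idea can be sketched in a few lines: an "access" stream fetches operands into a bounded queue while an "execute" stream consumes them, and an empty queue on the execute side is a loss-of-decoupling event. This is purely a concept sketch with simulated latencies, not a model of any real pipeline.

```python
# Minimal sketch of decoupled access-execute (DAE). The queue depth is the
# "decoupling distance"; when the execute stream finds it empty, that is a
# loss-of-decoupling (LOD) event. Memory latency is simulated with sleep.

import threading, queue, time

data = list(range(16))        # pretend this lives in slow memory
q = queue.Queue(maxsize=8)    # bounded decoupling queue between the streams
lod_events = 0
results = []

def access_stream():
    for value in data:
        time.sleep(0.001)     # simulated memory latency per load
        q.put(value)          # hand the operand to the execute stream
    q.put(None)               # sentinel: no more operands

def execute_stream():
    global lod_events
    while True:
        if q.empty():
            lod_events += 1   # execute side stalled waiting for data
        v = q.get()
        if v is None:
            break
        results.append(v * v) # the actual computation

a = threading.Thread(target=access_stream)
e = threading.Thread(target=execute_stream)
a.start(); e.start(); a.join(); e.join()
print(len(results), lod_events)
```

With the access stream ahead, the execute side mostly finds its operands waiting; starve the queue (slower loads, smaller depth) and the LOD count climbs, which is the stall behavior described above.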
So the idea is to maintain that distance for the memory ops (stall and switch sub-clusters, with 1 or 2 OS threads per core); that is, the memory ops will always try to keep a good distance ahead in the OoO (out-of-order) scheme. The 4 ALUs work on one thread while the 4 AGUs work on another, but they are dynamically interchangeable: some form of reservation stations holds the thread states, and the 2 threads in what people recognize as a core/cluster have 2 groups of functional units that can work *each* on one thread, one at a time and interchangeably, and highly OoO... So this is not SMT, and it can preserve and augment the advantages of CMT. But only with good caches and good "data speculation" (lol)...
I'm NOT validating that die shot (it could be fake)... but it does seem to carry indications of what I just exposed above: eager execution and cluster-in-cluster. You can save a lot of space and power that way, because if you are going to have 4 "cores", transposing from the probable Steamroller single-thread single-"core", you would also need 4 decoders and 4 LS (load/store) engines, and that last one tends to be quite a bit bigger than the others.
So perhaps that Excavator exposé that circulated as an April 1st joke is not far off base. It's not exactly SpMT (speculative multithreading) but a "decoupled control/access - execute" paradigm with an "eager execution" approach... but since it will be extensively data-speculative, its normal threads will spend quite a lot of time speculating for "data", though it will not involve more threads than the 4 per cluster scheduled by the OS.
[In the end it could have internal thread contexts for "access" and for "execute" (internal DAE), in which case it could be considered "speculative multithreading", with the context switching of these internal threads done on a "dataflow" basis, i.e. directed by data availability rather than by the program-counter engine, which would continue directing execution at a broader level (talking about a highly out-of-order internal approach).] (edit)
So I think SR will debut "data speculation" for AMD, on top of the current STL (store-to-load) OoO memory-op schemes... Intel has had it since Nehalem. Intel's model is akin to a "data renaming" scheme (a track parallel to "instruction renaming"); it has "data speculation", and that has been Intel's advantage (not more exec pipes, decode, etc.). Only the monolithic characteristics of the design, with its small L1 and L2 caches, are showing their age and can't grow much more; that is why hasfail is hasfail: it has come close to the practical "dataflow" limits, and throwing more exec pipes at the problem doesn't help (33% more exec pipes from SNB/IVB to HSW gained only 10%)... while AMD's approach, extensively decoupled and modular, has more legs for fine-tuning certain parameters of "data speculation"... and 30% more ops per cycle is not a pipe dream...
Then Excavator could extend this with 4 threads per module and "eager execution" (which can mean fetching data ahead, and can mean going down both sides of some branches at the same time).
(One possible POV on how to extend BD.)
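The "both sides of a branch" flavor of eager execution can be sketched as: compute both paths before the predicate resolves, then commit one result and squash the other. This is a hypothetical toy, not a description of any real AMD mechanism, and the function names are my own.

```python
# Toy sketch of eager (dual-path) execution: evaluate both sides of a
# branch before the predicate resolves, commit one, squash the other.
# Hypothetical illustration only; wastes work on the squashed path by design.

def eager_branch(predicate_fn, taken_fn, not_taken_fn, state):
    taken_result = taken_fn(state)          # speculate down the taken path
    not_taken_result = not_taken_fn(state)  # ...and the not-taken path
    # the predicate resolves late (e.g. it was waiting on data from memory)
    if predicate_fn(state):
        return taken_result                 # commit taken, squash the other
    return not_taken_result                 # commit not-taken instead

r = eager_branch(lambda s: s["x"] > 5,
                 lambda s: s["x"] * 2,      # taken path
                 lambda s: s["x"] - 1,      # not-taken path
                 {"x": 7})
print(r)   # 14: the predicate is true, so the taken path commits
```

The trade-off it illustrates is the one in the post: eager execution burns extra hardware on the losing path in exchange for never paying a misprediction restart.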