juanrga :
A crazy idea
I read somewhere that one of the main limitations of BD/PD modules is the single decoder per module. If I understood this correctly, the decoder switches between the cores within a module, which means it cannot feed both cores in one clock. For a ten-clock period, the situation would look like this (A denotes the first core being fed, B the second core, and _ a core that is not being fed):
A_
_B
A_
_B
A_
_B
A_
_B
A_
_B
Now Steamroller has twice the decoders, which means each core can be fed every clock:
AB
AB
AB
AB
AB
AB
AB
AB
AB
AB
Each core in a Steamroller module is then twice as efficient as in a Bulldozer/Piledriver module, receiving twice as many instructions. To get the same number of A and B (10 each), one would need two Bulldozer/Piledriver modules:
A_ A_
_B _B
A_ A_
_B _B
A_ A_
_B _B
A_ A_
_B _B
A_ A_
_B _B
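juanrga's diagrams above can be reduced to a toy counting model. This is purely illustrative; the function name and the strict every-other-clock alternation are my assumptions based on the diagrams, not documented AMD decoder behaviour:

```python
# Toy model of the diagrams: a BD/PD-style module shares one decoder
# between its two cores (alternating each clock), while a
# Steamroller-style module (per the post's assumption) feeds both
# cores every clock. Names and rules are illustrative only.

def decoded_per_core(cycles, shared_decoder):
    """Count decode slots fed to cores A and B over `cycles` clocks."""
    fed = {"A": 0, "B": 0}
    for clk in range(cycles):
        if shared_decoder:
            # single decoder alternates: A on even clocks, B on odd
            fed["A" if clk % 2 == 0 else "B"] += 1
        else:
            # one decoder per core: both cores fed every clock
            fed["A"] += 1
            fed["B"] += 1
    return fed

print(decoded_per_core(10, shared_decoder=True))   # {'A': 5, 'B': 5}
print(decoded_per_core(10, shared_decoder=False))  # {'A': 10, 'B': 10}
```

Over ten clocks the shared decoder feeds each core 5 times, matching the A_/_B diagram, while the doubled decoders feed each core 10 times.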
I know this is an oversimplification, but could this partially explain why 4C/8C Piledriver parts are being replaced by 2C/4C Steamroller parts?
wow!... an actual AMD-uarch post probably related to Steamroller, in an AMD uarch "expert" Steamroller thread ????? ... naa! I think "juanrga" is derailing the thread lol
Alas!.. NO, decode is not the main problem...
It can be, depending on the software (compiler)... the problem with decode is the same as if there were a single core per "decode engine"; that is, even if there were only 1 integer core, it would have the same problem: in x86, the "complex" decode pipe blocks the others upon more complex instructions to decode. That is, upon more than 1 MacroOp (2 microOps, execute + memory), the 4 decode pipes act as if they were only 1... the same thing happens with Intel uarchs. (edt)
*IF* (not easy, due to the "strong dependency model" of x86, which "a priori" sees every instruction as dependent on another) it were possible that, upon decoding "complex" instructions, the "complex decode pipe" didn't block the others, the same 4 decode pipes would act like 2 decode engines (or more)...
I think that is what happens with "Steamroller": it has the same 4 decode pipes of BD/PD, only perhaps instructions from one thread don't block instructions from the other thread. That is, Steamroller must have 2 "complex" decode pipes among those 4, which assume "non-dependent" instructions by checking the "contexts", and so relax the dependency checking.
Yes... Steamroller will most probably have the same 4 decode pipes of BD/PD... only arranged in a different fashion.
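The blocking behaviour described above can be sketched as a toy model. To be clear, this is a hypothetical simplification of the claim: the "group ends on a complex op" rule and the `complex_pipes` parameter are my own modelling choices, not real decoder logic from AMD or Intel:

```python
# Hedged sketch: in a 4-wide decode group, a "complex" instruction
# ('C', more than one macro-op) makes the group act as if only the
# complex pipe(s) existed that cycle, per the post's claim. Doubling
# the complex pipes (the post's Steamroller guess) lets two complex
# ops decode per cycle. Purely illustrative, not real uarch.

def cycles_to_decode(stream, complex_pipes=1, width=4):
    """Cycles to decode a stream of 'S' (simple) / 'C' (complex) ops."""
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1
        if stream[i] == "C":
            # complex group: only the complex pipe(s) work this cycle
            taken = 0
            while i < len(stream) and stream[i] == "C" and taken < complex_pipes:
                i += 1
                taken += 1
        else:
            # simple group: fill up to `width` pipes with simple ops
            taken = 0
            while i < len(stream) and stream[i] == "S" and taken < width:
                i += 1
                taken += 1
    return cycles

print(cycles_to_decode("SSSSCC", complex_pipes=1))  # 3 cycles
print(cycles_to_decode("SSSSCC", complex_pipes=2))  # 2 cycles
```

In this model, adding a second complex pipe only pays off when two back-to-back complex ops come from (assumed) independent contexts, which is exactly the relaxed dependency checking the post speculates about.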
The question of "Vertical Multi-Threading" is not the culprit either; actually it is what makes it clearly superior to any Intel uarch. VMT works upon 2 "open contexts", that is, it is incredibly fast at "internally changing the thread contexts" (otherwise, if it were OS dependent, it would be slower than a Pentium I (quite slow) lol)... and the operation is inherently "asynchronous" (quite difficult).
VMT works like this... contexts can change from 1 cycle to another:
AA or BB... of course A_ or B_ or _ _ can happen
(but there is NEVER AB or BA; that would be SMT, with no context changing needed) (edt)... stalls can also happen in SMT (simultaneous multithreading = Hyper-Threading)... or in single cores... cache misses always lead to "bubbles & stalls".
So VMT is not the problem, since it is not only decode that is VMT... so are fetch, branch, and dispatch, including the "FlexFPU frontend". The decode problem can be fixed by relaxing the "dependency" checking and constraints upon decoding complex operations (2 "independent" complex operations, necessarily from 2 thread contexts)... MOAR = brute force, which always leads to disappointing results.
VMT is so nice... and since a "module" is an optimized sharing of 2 cores... that is, a CPU core "jumped inside" another core to make a module... in the future I see a "module" jumping inside another module (sharing) to make a 4-core/4-thread module lol
I also see VMT used in a "Horizontal" way... Horizontal Multi-Threading using the same "asynchronous open contexts" of VMT (in a horizontal way), with each cluster (integer core) having more than 1 thread context, yet not being SMT (simultaneous multi-threading) but rather an evolution of the CMT (cluster multi-threading) concept.
[ UPDATE: actually the scheme of BD/PD is... A -> B -> A -> B -> ... it's "interleaving": one thread one cycle, the other thread the next cycle. The beauty of VMT is that you have fine-grain control over the execution, i.e., it can be A->A->A->A->B->B->B->B... or any granularity between 1 and 4 instructions...
SteamRoller changes this to A-A -> B-B -> A-A -> ... that is, the "minimal granularity" becomes "2 instructions", in any combination up to 4 (it can be AAAA or BBBB)... it simplifies things, probably wastes less power on context switches, and has no adverse effect on performance...
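The granularity idea above can be sketched as a simple front-end scheduler. This is an illustrative toy, assuming (as the post does, without confirmation) a fixed switch interval between the two contexts:

```python
# Sketch of interleaving granularity: a VMT-style front end that
# switches between thread contexts A and B every `granularity`
# cycles. granularity=1 gives the BD/PD-like A,B,A,B pattern;
# granularity=2 gives the Steamroller-like A,A,B,B pattern the post
# describes. Hypothetical model, not documented behaviour.

def interleave_pattern(cycles, granularity):
    pattern = []
    thread, run = "A", 0
    for _ in range(cycles):
        pattern.append(thread)
        run += 1
        if run == granularity:              # switch context after a full run
            thread = "B" if thread == "A" else "A"
            run = 0
    return "".join(pattern)

print(interleave_pattern(8, 1))  # ABABABAB  (BD/PD-style)
print(interleave_pattern(8, 2))  # AABBAABB  (Steamroller-style)
print(interleave_pattern(8, 4))  # AAAABBBB  (coarsest granularity)
```

Both schemes feed each thread the same total number of slots; the coarser granularity simply halves (or quarters) the number of context switches, which is where the power saving the post mentions would come from.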
So the advantage of VMT over SMT is exactly the possibility of sharing resources with fine-grain control: upon cache misses or other events, one thread never clogs the other. This is contrary to SMT/Hyper-Threading, which has had to have its resources augmented since Nehalem, since it often happened that with SMT/HT turned off the performance was greater than with SMT/HT on!
On VMT this is much better (ok, it doesn't approach the performance of 2 separate cores... but NOTHING will...)... if a thread just grabs all resources (and even waiting on instructions and/or data can grab resources, that is, doing nothing can grab resources lol), then the VMT control can simply switch to the other thread; it leads to much more efficient utilization of resources. ]
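The switch-on-event behaviour described in the UPDATE can be sketched as a toy issue loop. Everything here is hypothetical (the `run` function, the stall pattern, the single issue slot per cycle); it only illustrates why handing the slot to the other context means a stalled thread never clogs the machine:

```python
# Toy model of switch-on-event multithreading: each cycle the front
# end gives its one issue slot to the first thread that is ready, so
# a stalled thread (e.g. waiting on a cache miss) never holds the
# slot doing nothing. Illustrative sketch, not real VMT control logic.

def run(ready, cycles):
    """ready: dict name -> fn(cycle) -> bool (thread can issue that cycle).
    Returns instructions issued per thread."""
    done = {name: 0 for name in ready}
    for clk in range(cycles):
        for name in ready:          # pick the first ready thread
            if ready[name](clk):
                done[name] += 1     # slot is used, never wasted on a stall
                break
    return done

# hypothetical pattern: thread A stalls every 4th cycle (say, a cache
# miss); thread B is always ready and soaks up A's stall cycles
issued = run({"A": lambda c: c % 4 != 0, "B": lambda c: True}, 12)
print(issued)  # {'A': 9, 'B': 3}
```

In 12 cycles, A issues on its 9 ready cycles and B fills the 3 cycles A would otherwise have wasted, so no slot goes idle.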