AMD CPU speculation... and expert conjecture

Page 119
Status
Not open for further replies.

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860
SMT is simultaneous multithreading; Intel's HT (Hyper-Threading) is its implementation of SMT.
CMT is cluster multithreading, AMD's term for the modular CPU design.
CMP is chip multiprocessing (multiple cores on one chip). This started with the Athlon X2 and Core 2 Duo.
 

lilcinw

Distinguished
Jan 25, 2011
833
0
19,010
@8350

I think I understand where you are coming up with your '44 threads' theory (even though I am fairly certain there is no real-world case where that level of efficiency is achieved, and 'engineered maximums' are only relevant in a programmer's nirvana where The Code is handed down from on high by an ascended Linus Torvalds and the unicorns poop rainbows and pee energy drinks).

What I don't understand is a statement you made a while back that SR is supposed to include additional integer resources. The way I understand the current PD arch, each integer cluster has 2 ALUs and 2 AGUs, this is duplicated in each module, and each module has a shared floating-point unit consisting of 2 128-bit FMAC units and 2 MMX units. AMD has stated that they have removed one of the MMX units from the SR FPU, which yields 4 × [2 × (2 ALU + 2 AGU) + 2 FMAC + 1 MMX] = 44 'pipelines', whereas the current PD has 48 'pipelines' (4 additional MMX).
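The counting above can be checked with a few lines (a sketch of the post's arithmetic only, using the per-unit counts given in the post; not an official AMD figure):

```python
# Counting execution 'pipelines' per chip as described above.
# Per module: 2 integer clusters, each with 2 ALUs + 2 AGUs,
# plus a shared FPU with 2 FMAC units and some number of MMX units.

def pipes_per_chip(modules, mmx_units):
    integer_pipes = 2 * (2 + 2)        # 2 clusters x (2 ALU + 2 AGU) = 8
    fpu_pipes = 2 + mmx_units          # 2 FMAC + MMX units
    return modules * (integer_pipes + fpu_pipes)

print(pipes_per_chip(4, mmx_units=2))  # Piledriver: 4 x 12 = 48
print(pipes_per_chip(4, mmx_units=1))  # Steamroller per the post: 4 x 11 = 44
```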

Wouldn't that mean that SR is being nerfed instead of buffed?
 

jdwii

Splendid


Well lets not go that far.

Again, going with gamers' programming knowledge, I'm gonna have to side with him. Also, 44 IPC or whatever isn't going to improve much of anything, since most programs don't even have that kind of ILP. Also, what do you guys think TMT is? I Googled it and got: "TMT provides tools and services to rapidly and easily develop software that is fully parallelized and scales to perform to industry standards on “ManyCore CPU” architectures"
 

truegenius

Distinguished
BANNED
^ TMT is ™Truegenius
i.e., TradeMarkTruegenius :whistle:

@8350rock
do you mean:
option 1) we can execute more than 1 thread on 1 core concurrently, i.e. all 44 threads will be under execution at every instant of time instead of sitting in a waiting state

or do you mean:
option 2) all 44 threads will be loaded in memory but will execute 1 at a time at any instant, i.e. the scheduler is doing the job of making us feel that all threads are under execution?

for example,
let's take a single core of any cpu without ht/smt,
and say we run 1 thread of 7zip and 1 thread of winrar.
will they compete for cpu time, or will they each get full cpu time since the cpu can run multiple threads?

if the latter is your answer, then why do we see a ~50% performance hit on both applications when running them at the same time?
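The ~50% figure falls straight out of time-slicing: on one core without SMT, the OS scheduler alternates between the two runnable threads, so each gets roughly half the CPU time. A toy round-robin sketch (hypothetical work units and slice lengths, not a real scheduler):

```python
# Toy round-robin scheduler: one core, CPU-bound "threads".
# Each thread needs some units of work; the core runs one
# 10-unit time slice at a time, rotating among runnable threads.

def round_robin(work_needed, slice_len=10):
    remaining = dict(work_needed)
    elapsed = 0
    finish_time = {}
    while remaining:
        for name in list(remaining):
            step = min(slice_len, remaining[name])
            remaining[name] -= step
            elapsed += step          # only one thread runs at a time
            if remaining[name] == 0:
                finish_time[name] = elapsed
                del remaining[name]
    return finish_time

# Alone, a 100-unit job finishes at t=100; sharing the core,
# both jobs finish roughly twice as late.
print(round_robin({"7zip": 100}))
print(round_robin({"7zip": 100, "winrar": 100}))
```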
 

lilcinw

Distinguished
Jan 25, 2011
833
0
19,010
For clarity's sake here is the 'nerf' I am referring to:

Bulldozer's 10 'pipes' per module:
[image: bulldozer-die-2.jpg]


Piledriver's 12 with added(?) MMX units:
[image: 5-shared-floating-point.jpg]


Steamroller's 11:
[image: AMD-HOT-CHIPS-Keynote-FINAL_PRESS_Page_14.png]


Wikipedia mentions the MMX units in the original Bulldozer architecture, so maybe they just weren't exciting enough to make the marketing materials at the time (it was all about AVX/FMA, IIRC).

Regardless, it is reduced from 48 'pipelines' per chip to 44.
 

mayankleoboy1

Distinguished
Aug 11, 2010
2,497
0
19,810


Just point me to the source(s) saying that Ubuntu has poor support for HTT, that Linux programs have poor or no support for HTT unless the developers are paid by Intel, and that you have to specifically edit the boot settings to use HTT.


And I apologize for the bad language. My provocation was unneeded and uncalled for.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Bulldozer had MMX as well, but it's rather outdated now that SSE is used more heavily.

Here's a more detailed Bulldozer slide.
[image: block-diagram.png]

 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460


I noticed that too. But I highly doubt the OEM Models will allow OCing. It'll probably have some sort of restriction if it does.

Also, it's nice to see everyone playing nicely in the sandbox full of Einsteins we call Tom's Hardware. :D
 

it is possible to isolate apu power consumption. you need a multimeter to take voltage and current readings. hwmonitor reads from the apu's/cpu's internal sensors, before other parts' power use gets into the mix. even if you don't trust those, you should at least be able to trust amd - their setting the tdp at 100w for the 5800k and 65w for the 5700 means that in load scenarios those apus will use that much power respectively, which i've already shown. as the enclosure gets smaller, heat dissipation per unit volume increases. and it's easier to dissipate 65w of heat with the stock cooler - keeps noise lower.
at does recommend the 5800k for sff, but you should see in which kind of scenarios. their condition for using the 5800k is one where the 5800k is not loaded; they don't mention anything about gaming use except performance. they recommend pc parts but they don't actually build the thing and test its temps. i am discussing the load scenario. if you don't load the cpu or apu, even a hypothetical amd fx8300 (95w) with a hypothetical mini itx motherboard (e.g. 880g chipset) can run inside an sff pc. heck, if you don't put any load on it, a pentium or a core i3 will do. in reality, more people use amd's e-350/450 apus or sb/ivb pentiums and core i3s inside their sff pcs - but that's a different discussion.

 
8350, I believe you're confusing x86 threads and the actual micro-ops that get processed internally. In x86 CPU land it's 1 register stack = 1 thread, period, end of story.

Now the problem is that not all instructions are equal; some take longer than others or have more complex dependencies. By implementing redundant CPU resources we can effectively process multiple instructions per thread at once, though only one thread gets ownership of that CPU context (register stack) at any one point in time. When you're looking at the BD/PD CPU design you're seeing the internal resources that process micro-ops, not x86 machine code, so your "44" instructions are micro-op instructions, not x86 binary. Very large difference between the two. Also, CPUs have what's known as a register file: a location in internal CPU memory that allows multiple register stacks to exist. It's used for rapid context switches, to process instructions from an additional thread while the previous thread is stalled waiting on some external I/O event. All this put together gives the illusion of a single CPU element processing multiple threads.

Basically, go read up on and learn what superscalar architecture is.
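The register-file idea described above can be sketched in a few lines (a toy model; the names, the 16-register size, and the two-context layout are made up for illustration, not taken from any AMD documentation):

```python
# Minimal model of a register file holding multiple register contexts:
# when the running thread stalls on I/O, the core switches to another
# stored context instead of spilling registers out to memory.

class Core:
    def __init__(self, contexts):
        # One register stack (here, 16 registers) per stored context.
        self.register_file = {tid: [0] * 16 for tid in contexts}
        self.running = contexts[0]

    def context_switch(self, tid):
        # No memory traffic needed: the other thread's registers
        # already live in the register file.
        assert tid in self.register_file
        self.running = tid

core = Core(["thread-A", "thread-B"])
core.register_file["thread-A"][0] = 42    # thread A computes something
core.context_switch("thread-B")           # A stalls on I/O; B takes over
print(core.running)                       # thread-B
core.context_switch("thread-A")           # I/O done; A resumes
print(core.register_file["thread-A"][0])  # 42 - A's state was preserved
```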
 


The 5600K, 5700 and 5800K are essentially the same CPU with different settings. If you were to set the 5800K to the same settings as the 5700 then they would both have identical power draw, same with the 5600K. The reason they list a 100W TDP is because those two CPUs will both attempt to self-overclock (turbo) and that's the limit they're designed around. If there were a price difference between the 5700 and 5800K then I'd recommend the 5700, but as it stands they're the same price ($129 USD). So really, just buy the 5800K and use its multiplier setting to clock it at a lower multiplier if you need less TDP. With how amazing the 5800K is, I'm really wanting to see what else AMD can build for the SFF world.
 

i agree. the main reason the 5700 isn't recommended more often is the price similarity to the 5800k. amd's skus that have lower tdp but sorta similar settings (same number of shaders, in this case) seem to carry a higher price. that's why i said 'palatable' instead of 'must have'.
then again, if you're gonna downclock, what's the point of getting an unlocked apu (disregarding the price for now)? i know how it sounds... but i hear things like this on a regular basis - 'if you're gonna buy fx/k, why shouldn't you oc?' etc. with the 5800k, you're already hitting 100w on load in an sff build. so oc isn't an option until better cooling is introduced - which often costs more, considering the conditions. that's... until you consider that both the 5700 and 5800k cost [strike]the same[/strike] similar amounts.
edit: newegg says there's $1 difference, so... inb4 3rd party microcorrection! :pt1cable:
 
Well, the cost difference between the two is so small that you might as well get the 5800K; then you can choose which direction to clock it. If heat/power is a problem you can clock it down 200~300 MHz, or if you have the capability, clock it up.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I already explained this to him and gave him power consumption deltas under load. If he does not OC, then the 5800K will have slightly higher power consumption coming from the slightly higher clocks and turbo, the rest of the chip being the same.

Regarding future AMD products, what about Kabini SOCs already here?
 

8350rocks

Distinguished
So, after a lot of reading, CMT is basically a slightly reworked TMT. What they did was allow each module to split an additional thread between the 2 cores in the module. So, essentially, in TMT each core operates on 1 thread at a time but uses context switches to multithread; in CMT, each module does the same, but the 2 cores in the module can pick up an extra thread (1) between the 2 of them.

@Palladin, I am familiar with superscalar architecture.

So, I have found the hard numbers after looking through loads of Bulldozer information... there are 3 register stacks per core for BD/PD (this was not changed in PD)... so that's a total maximum of 24 threads that can be rotated between context switches on cores, and because of CMT you gain 1 additional per module (I'm still unclear on how exactly this works; I could not find a technical schematic or logic flowchart for CMT)... for a total of 28.

Now, for SR, they have not said how many register stacks are per core, but they have said they are increasing register file size and efficiency, and decreasing the memory size of a single thread in the register file. So, without any hard schematics or information, it's hard to say what this will end up looking like in terms of performance increase.
 
^^ i am sorry, but i am really confused here. it's bad enough that i have limited know-how about cpu architecture, but bulldozer's modules processing 3 threads was the last straw. i need some understandable explanations. from what i know, bd modules can process 2x integer instructions (operations?) and 1x 256-bit (or 2x 128-bit) floating point instructions at the same time. aren't (software) threads assigned by the os scheduler? i thought register stacks only come into play while processing instructions or micro-instructions. what constitutes a 'hardware thread', if that's a real thing?
 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780


I think he is talking about context switching.

SPARC does this as well. On SPARC, there are 4 windows (I believe), with 32 registers in each window. So what happens is program A uses up one window of 32 registers and then program B uses another window of 32 registers.

So now, two programs can use the CPU's registers without having to go to cache or system memory or anything, and the CPU can switch between the two programs without having to access cache or higher-level memory.

I am guessing x86 CPUs do something similar so that they don't have to go to cache and such every single time they change threads.
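The windowed scheme described above can be sketched like this (a toy model following the post's per-program framing; note that in real SPARC the windows rotate on function call/return via SAVE/RESTORE rather than per program, and the counts below are just the post's numbers):

```python
# Toy model of windowed registers: several fixed-size windows, each
# program gets its own, and switching programs just moves the
# current-window pointer instead of saving/restoring to memory.

N_WINDOWS, REGS_PER_WINDOW = 4, 32
windows = [[0] * REGS_PER_WINDOW for _ in range(N_WINDOWS)]
current = 0                      # current-window pointer

def switch_window(w):
    global current
    current = w % N_WINDOWS      # just move the pointer; no memory traffic

windows[0][5] = 111              # program A's state lives in window 0
switch_window(1)                 # switch to program B
windows[current][5] = 222        # B's state; A's registers are untouched
switch_window(0)                 # back to A
print(windows[current][5])       # 111 - A's value survived the switch
```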
 

8350rocks

Distinguished


Basically this.

Each core can work on as many as 3 threads per clock cycle... but it will only work on them 1 at a time. The core will use context switching to rotate among the 3 threads per cycle. The interesting part is how it accommodates an extra thread per module... I will have to find more technical sources on CMT to determine the hardware side of what's going on there...
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780
I think you are confusing work queues with multithreading.

Those are special registers that contain the address of the beginning of each thread; that is not at all the same as executing all of those threads at *the same time*.

In SPARC's *block-multithreading* model it is a different story, because the TLB (translation lookaside buffer) can be warmed up, and the L1, which may contain pre-decoded, pre-fetched instructions, is also warmed up for all of those threads (current versions run 8 threads per core). But the pipeline only executes one thread at a time, dictated by its internal PC (program counter) logic, not by the OS scheduler. Basically it takes advantage of the many pipeline bubbles that always exist in common code to perform fast internal context switches... In the end *the pipeline* only executes one thread at a time, but it can be quite efficient: even with only one thread to execute, the pipeline is always full of bubbles from one clock cycle to another, and if you have 2 or more threads on 1 core targeting those bubbles, the original thread can execute in almost exactly the same time frame as if it were alone, yet the core gives "the illusion" that it is executing more than one thread at a time.
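The bubble-filling argument above can be made concrete with a toy pipeline model (entirely illustrative; the streams and stall patterns are invented, and a real pipeline is far more complex):

```python
# Sketch of block multithreading: one issue slot per cycle, multiple
# instruction streams. 'x' = instruction ready to issue, '-' = a forced
# stall cycle (a bubble). A thread's bubbles elapse on the clock no
# matter what, and the issue slot goes to whichever thread has work ready.

def simulate(streams):
    streams = {t: list(s) for t, s in streams.items()}
    cycles = 0
    while any(streams.values()):
        cycles += 1
        issued = False
        for t, s in streams.items():
            if s and s[0] == 'x' and not issued:
                s.pop(0)          # this thread wins the issue slot
                issued = True
            elif s and s[0] == '-':
                s.pop(0)          # stall cycle elapses regardless
    return cycles

# Alone, this thread needs 8 cycles (4 instructions + 4 bubbles):
print(simulate({"A": "x-x-x-x-"}))                # 8
# With a second identical thread filling A's bubbles, both finish
# in barely more time than one thread alone:
print(simulate({"A": "x-x-x-x-", "B": "x-x-x-x-"}))  # 9
```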

SMT (simultaneous multithreading, a.k.a. HTT in Intel lingo) and AMD's CMT (cluster multithreading) are evolutions of this logic, in that 2 threads really are executed at the same time by the pipeline (with advantages and some drawbacks). The difference between the two is that CMT has dedicated hardware almost in the logic of co-processors (clusters) (the FlexFPU actually *is* a co-processor, meaning it can track the progress of a thread semi-independently). So 3 threads at the same time per module (I DON'T KNOW) could be possible, but only for a fraction of a split second upon an OS-dictated context switch, while the FPU finishes executing/writing back a few instructions left over from a previous thread. In practice it should be said that each module only executes 2 threads at a time, because the pipelines must be flushed (any register/cache must be written back to memory to provide consistency) upon those OS context switches.

IMHO AMD's CMT is clearly superior, not so much because of the dedicated resources for 2 threads (the integer cores of so much polemic), which could/should always boost 2 threads at the same time... but more because AMD uses multithreading logic throughout its pipeline, i.e. the pipeline is divided into thread <domains> in the sense explained above about block-multithreading... it is *vertical multithreading*...

In BD and PD the logic is one cycle per thread on each domain of the pipeline: one domain deals with one thread's instruction on one cycle, then the next cycle it switches to the other thread, and so on (very similar to Cray's interleaved-multithreading designs). In Steamroller the logic is changed to 2 cycles per thread, making it a true vertical block-multithreading scheme.
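The two interleaving patterns described above (as the post characterizes them; this is the poster's model, not confirmed AMD behavior) differ only in the switch period:

```python
# A pipeline domain alternating between two threads: every cycle
# (the post's BD/PD model) vs. every two cycles (its Steamroller model).

def schedule(threads, cycles_per_thread, total_cycles):
    order = []
    i = 0
    while len(order) < total_cycles:
        # Give the next thread its block of consecutive cycles.
        order.extend([threads[i % len(threads)]] * cycles_per_thread)
        i += 1
    return order[:total_cycles]

print(schedule(["T0", "T1"], 1, 8))
# ['T0', 'T1', 'T0', 'T1', 'T0', 'T1', 'T0', 'T1']
print(schedule(["T0", "T1"], 2, 8))
# ['T0', 'T0', 'T1', 'T1', 'T0', 'T0', 'T1', 'T1']
```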

In this context, there aren't exactly 2 decoders on Steamroller... it behaves as if there were, but that is an illusion. As revealed in an RWT thread (rumor or not, I don't know), SR will have the same 4 decode pipes, naturally considerably beefed up, but still 4. The way I see the difference is that it will have 2 dedicated decode-domain input buffers and 2 dedicated output buffers that share the 4 decode pipes in an SMT fashion (that is, decoding from 2 threads simultaneously, like the FlexFPU and the L2, but with a vertical logic) plus a block-multithreading fashion. It can be tremendously more effective with those same 4 pipes, courtesy of the vertical multithreading scheme.

This is just to show how versatile and superior this "vertical multithreading" is... as if each group of pipeline stages in a *domain bordered by input/output buffers* were in itself a "vertical co-processor" (well, a bit exaggerated, lol)... and how much easier it will be to replace or improve those <domains> without having to re-design a whole chip, as in the traditional/synchronous pipelines of Intel cores. The AMD BD uarch is an asynchronous/semi-synchronous pipeline: it could much more easily gain efficiency by letting parts run ahead... could much more easily change the resources/characteristics of each domain... could much more easily give each module 3 or 4 integer thread cores/clusters, or a number of *heterogeneous cores/clusters*... could even, without much trouble, put ARM AArch64 integer cores where the x86 cores are now, lol...

Yes, many opinions paint the BD uarch as a failure (propaganda is rampant in all competitive, lucrative businesses), but IMHO it is not a failure; its *POTENTIAL* is clearly superior to anything Intel has... this asynchronousness and modularity *potential* is very, very difficult to get right (no wonder it reportedly took 8 years to finish the first iteration), but it could provide an accelerated path for improvements in successive iterations.
 