AMD CPU speculation... and expert conjecture

Status
Not open for further replies.

mayankleoboy1

Distinguished
Aug 11, 2010
@ 8350 rocks :
Your CPU is capable of multi-tasking. This means that the maximum number of threads a CPU can run at once are engineered maximums before the pipelines fill up. There is an acceptable limit to the amount of clocktime the engineers originally designed to allow to be split up under heavy load.

What does this mean ?



The process you have described is correct. It is something like a "priority list based round robin" scheduling.
But you are awfully confused somewhere.
 

8350rocks

Distinguished
http://en.wikipedia.org/wiki/Multithreading_(computer_architecture)

http://en.wikipedia.org/wiki/MMX_(instruction_set)

http://en.wikipedia.org/wiki/Thread_(computer_science)

http://en.wikipedia.org/wiki/Multiprocessing

http://en.wikipedia.org/wiki/Temporal_multithreading

http://en.wikipedia.org/wiki/Simultaneous_multithreading

Yes, I agree, what you're discussing is SMT, what AMD is using is TMT. While not the same, they're not dissimilar.

Multithreading != Multiprocessing is also true.

But SMT is not the only form of Multithreading.

See, AMD CPUs have far more physical resources than Intel CPUs, so they use context switches and division of clock time to execute multiple threads, betting on the fact that most programs are not so heavily threaded. Even if they are, it can keep up with an i7 using HT in most applications. The difference becomes, when HT is not written for in the system (SMT)...then the AMD gains a speed advantage in multithreaded apps because it can still efficiently process multiple threads by using context switches.

The downfall of this is the fact that you do have to divide resources to run more than 8 threads. Albeit, Intel has to do this as well with HT, though it's a different way of dividing things up, if you will. Also, mispredicted branches and other errors arguably hurt TMT more than SMT, because the CPU has to manually clear its cache in order to resume; SMT operates a little differently. Which is why in some cases PD is not as efficient as it could be at highly multithreaded apps. It should theoretically destroy the i7-3770K in a lot of areas, but it runs a tight race instead and even loses in some arenas. With SR, mispredictions and a lot of other errors will be reduced dramatically, meaning the CPU will run far more efficiently and spend less time running the wrong processes, less time clearing cache and starting over, etc. This means that the performance increase will be dramatic.

SMT needs to be coded for to be efficient, TMT does not. That's why in heavily threaded apps on other operating systems, like Ubuntu, etc., AMD is quite a bit faster (Intel gains speed as well, just at a lesser rate). Since there is no HT support at all on Ubuntu, TMT is the only way you can really multithread beyond 1 thread per core.

TMT has been around for a long time, back from the original days of single core CPU multithreading...it's not a new technology...and it's not exciting, but it does work, and runs at a hardware level, no software loops like HT requires.

:)

EDIT: I realize you likely knew most of what is in the links I posted...but others may not, so the read may be good for others curious as to what I am talking about.
 

8350rocks

Distinguished


Above, I posted a link to the wikipedia entry for TMT, that should sort it out for you.

Perhaps my explanation was as clear as mud?
 

8350rocks

Distinguished


Ubuntu doesn't support hyperthreading...unless you're running a windows application that supports HT through WINE...in which case...sure. But native applications don't support HT.
 

8350rocks

Distinguished
Let me clarify...

Ubuntu itself does support HT, but not very many (if any) of the applications utilize code that supports hyperthreading...making HT in Ubuntu essentially worthless...outside of Windows applications that support HT run through WINE.

By the way, report whatever you want...I am not trolling anyone...

Intel can hardly get Windows developers to use code that supports HT, and they pay many of them to use it when they do put it into a program. Do you honestly think a bunch of open source volunteers and coders working in their spare time care at all about adding complexity to code just to support one specific brand of processor? Let me help you if you're unclear...(the answer's no, they don't care).
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I think that throttling is designed to prevent the chip from going beyond the rated TDP. Therefore I fail to see how it could affect only the K series, which has a larger TDP.

In my opinion, the small power consumption delta between the 5700 and the 5800K is because both are essentially the same chip, with one of them clocked slightly higher and with a bit higher turbo. Note that even under full load the guru3d review gives a delta smaller than the theoretical 35 W. In my opinion, AMD estimated the 35 W taking moderate overclocks on the K series into account.

Yes, power consumption figures depend on lots of factors: CPU/APU, motherboard, dGPU, HDD/SSD, memory, additional devices, operating system... It also depends on the PSU used if you are measuring at the wall. It is not rare to see differences of 40 or 60 W between different reviews of the same chip.

Both the 5700 and the 5800K are perfect choices for SFF. There are lots of people using them.
 

mayankleoboy1

Distinguished
Aug 11, 2010


This is the third time I am asking for a source.

BTW, apps don't support or dis-support HT. Coders write in a language, and the compiler produces code that runs on processors.
And AFAIK, the Linux kernel treats HT cores as real cores. So any running app does not know whether the threads it is being offered are full threads or HT threads.
 
For Linux, HT is supported at kernel level, so all Ubuntu versions support HT through SMT. That's the short answer. If Ubuntu touches the scheduler for the kernel to "ignore" SMT, it would be so stupid that I'd be amazed that this hasn't been pointed out before.
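This is easy to check from userspace: /proc/cpuinfo reports both `siblings` (logical CPUs per package) and `cpu cores` (physical cores). A minimal sketch, parsing that text rather than probing live hardware (the sample string below is made up):

```python
def smt_active(cpuinfo_text):
    """Return True if logical CPUs per package exceed physical cores.

    Reads the first 'siblings' and 'cpu cores' lines from
    /proc/cpuinfo-style text passed in as a string.
    """
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields.setdefault(key.strip(), value.strip())
    return int(fields["siblings"]) > int(fields["cpu cores"])

# Hypothetical cpuinfo excerpt for an HT-enabled quad-core:
sample = "model name : some i7\nsiblings : 8\ncpu cores : 4\n"
print(smt_active(sample))   # True: 8 logical threads on 4 cores
```

If `siblings` exceeds `cpu cores`, the kernel is exposing HT/SMT logical CPUs, regardless of what userland applications know about it.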

Regarding semantics of "thread handling" inside the CPU, IIRC the amount of stack registers and the decoding stage mandate how many [strike]threads[/strike] instructions you can handle at the same time, not the amount of FPUs or ALUs or AGUs you have inside a CPU.

Cheers!

EDIT: I see the problem here, haha. And gamerk pointed that out a few comments back. What we commonly refer to as a "thread" is a big bunch of Ops waiting to go into the CPU. It's all in the assembler, hahaha. It's all in the Assembler! If you want to do fine print, you can say a CPU can handle different threads at the same time by playing with the Ops; SMT is basically that approach, and its incarnation is HT, but that's about it. So far, you can only manage to handle 2 threads per CPU, independent of the number of Ops you feed to it. That's my understanding. Palladin, I'm pretty sure you can add something wise to this, haha.
 

truegenius

Distinguished
BANNED
Your CPU is capable of multi-tasking. This means that the maximum number of threads a CPU can run at once are engineered maximums before the pipelines fill up. There is an acceptable limit to the amount of clocktime the engineers originally designed to allow to be split up under heavy load.

let me explain this :D

a single core (no HT, no SMT, etc.) can process only 1 thread; you can have a pool of hundreds of threads waiting for CPU time, and every thread gets its CPU time via time slicing, which is managed by the scheduler

as you can see here, over 500 threads are in flight:

[screenshot: PV360_Task_Manager.png, Task Manager showing over 500 threads]
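The time slicing described above can be sketched as a toy round-robin loop (purely illustrative; the thread names, work units, and quantum are made up, and a real scheduler also weighs priorities, I/O waits, etc.):

```python
from collections import deque

def round_robin(threads, quantum):
    """Toy round-robin scheduler: each entry is (name, remaining_work).

    One core runs one thread at a time; the scheduler rotates through
    the ready queue, giving each thread `quantum` units of work before
    switching. Returns the order in which threads finish.
    """
    ready = deque(threads)
    finished = []
    while ready:
        name, remaining = ready.popleft()    # context switch: load next thread
        remaining -= quantum                 # run for one time slice
        if remaining > 0:
            ready.append((name, remaining))  # not done: back of the queue
        else:
            finished.append(name)
    return finished

# Three threads of unequal length sharing a single core:
print(round_robin([("a", 3), ("b", 1), ("c", 2)], quantum=1))  # ['b', 'c', 'a']
```

The shortest thread finishes first even though it was not first in the queue, which is exactly what time slicing buys you on a single core.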


seems like you mistakenly mixed up threading with IPC (BD = 44 IPC and 44 threads, according to you)
 

the apu's heat dissipation is its own. other parts can influence the power use in varying degrees, but it's the single biggest heat source in an apu-based sff pc with no discrete gfx. that's why i posted the hwmonitor links: to show that the 5800k uses much higher wattage than the 5700 on load, and to isolate the apu's power use. system consumption can differ, as you already know.
the 5700's significantly lower power use makes it more suitable than the 5800k. the 5800k might be usable in sff pcs as much as any other cpu/apu, only with appropriate cooling for dissipating the 100w out of the small enclosure. if a user decides to live with it, it's his/her personal choice.
 

8350rocks

Distinguished


Well, then provide your source for your information. I have provided tons of sources already.

I have researched, and as near as I can tell, it depends entirely on the distribution. The newest releases of Ubuntu (10.04+) should support HT, but you have to enable it on the "acpi=" command line; however, other Linux versions do not necessarily support it (I can only assume it depends on which kernel version they chose to use).



 

8350rocks

Distinguished


Kids can read this forum...you know...how about you get your language under control before you start insulting someone? Keep it civil or keep out.
 

griptwister

Distinguished
Oct 7, 2012
This is why I don't comment on things I have no idea about. mayankleoboy1, I should report you for the language and the insulting of THW members.

And seriously, guys? You were wrong on MULTIPLE subjects, and the guys who are right are even saying 8350rocks is right on this topic. Let it go.

Also, might I add, if you'd go back a few pages, you would find the sources.
 

8350rocks

Distinguished


Yes, based on TMT, it is the entire premise that the core can only execute 1 thread at a time; however, that core uses context switches to change from thread to thread up to X number of threads in the hardware design. So, it is dividing resources to do more than 1 thread per core per cycle.

Your POV on this is mostly accurate. Though I am not sure how I am confusing information?

EDIT: Ahh, hang on, I see what you're getting at on the hard numbers...let me check my sources and verify the numbers are the same...that could be an anomaly or inaccurate. IPC != Threads...yes.
 


From your wiki link:

The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given pipeline stage in a given cycle. In temporal multithreading the number is one, while in simultaneous multithreading the number is greater than one.
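That one-sentence definition can be turned into a toy issue-slot model (illustrative only; the thread names and the two-wide SMT width are assumptions, not any real pipeline):

```python
def issue_schedule(threads, cycles, width):
    """Return which threads occupy issue slots in each cycle.

    width=1 models temporal multithreading: one thread per pipeline
    stage per cycle, rotating between threads across cycles.
    width>1 models simultaneous multithreading: several threads
    share the issue slots of the same cycle.
    """
    schedule = []
    for cycle in range(cycles):
        slots = [threads[(cycle * width + s) % len(threads)]
                 for s in range(width)]
        schedule.append(slots)
    return schedule

print(issue_schedule(["T0", "T1"], cycles=2, width=1))  # TMT: one thread per cycle
print(issue_schedule(["T0", "T1"], cycles=2, width=2))  # SMT: both threads each cycle
```

With width=1 each cycle carries exactly one thread (temporal); with width=2 both threads share the same cycle (simultaneous).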

Secondly, AMD uses a CMT scheme, not TMT. Essentially, they duplicate most of the resources of a CPU core, except the scheduler and some FP units.

http://www.behardware.com/articles/833-2/amd-bulldozer-architecture.html

IMG0032133.png


One BD Module is a "core" with SMT in the classical sense. But since there's two separate register contexts, Task Manager and the OS see "two" cores per BD module. Same concept as Intel HTT, where a HTT core is shown on Task Manager.

See, AMD CPUs have far more physical resources than intel CPUs so they use context switches and division of clocktime to execute multiple threads, betting on the fact that most programs are not so heavily threaded.

But those resources (the second Integer scheduler, for instance) can't be used on a single thread, so they do nothing but increase power draw. Resources are useless if not used. Hence why BD does so poorly in single-threaded apps, as almost HALF the resources in a BD module go unused.

Secondly, switching threads is VERY computationally expensive. You want to avoid undergoing a context switch whenever possible. During a context switch, the CPU is doing NOTHING. Finally, context switches have been around since, I don't know, the first 8-bit CPUs? It's not like it's an AMD-exclusive feature here...

http://en.wikipedia.org/wiki/Context_switch
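A minimal sketch of what "the entire state be saved" amounts to (the register set here is a tiny stand-in, nowhere near the full architectural state a real switch saves):

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    # A tiny slice of the state a real switch must save/restore:
    # general-purpose registers, instruction pointer, flags...
    regs: dict = field(default_factory=dict)
    ip: int = 0

def context_switch(cpu_state, old, new):
    """Save the running thread's state, then load the next thread's.

    Every field copied here is work done while the CPU executes
    zero instructions of either thread -- the overhead in question.
    """
    old.regs = dict(cpu_state["regs"])   # save outgoing thread
    old.ip = cpu_state["ip"]
    cpu_state["regs"] = dict(new.regs)   # restore incoming thread
    cpu_state["ip"] = new.ip

cpu = {"regs": {"rax": 1}, "ip": 100}
t_old, t_new = Context(), Context(regs={"rax": 7}, ip=200)
context_switch(cpu, t_old, t_new)
print(cpu["ip"], t_old.ip)   # the CPU now resumes at 200; 100 was saved
```

Every copy in `context_switch` is time during which no instruction from either thread executes, which is the overhead being described.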

Even if they are, it can keep up with an i7 using HT in most applications. The difference becomes, when HT is not written for in the system (SMT)...then the AMD gains a speed advantage in multithreaded apps because it can still efficiently process multiple threads by using context switches.

Firstly, again, context switches are expensive to execute and should be avoided at all costs. If you want a different thread to run, it should be handled by the OS scheduler. Having to context switch in hardware basically means the pipeline stalled (for whatever reason), necessitating that the entire state be saved, a new thread loaded in, its state restored, and execution continued with that thread. It keeps the CPU going, but it's MUCH cheaper to run one thread until the OS schedules a different thread to run.

AMD gains in multithreaded apps mainly due to having a more powerful SMT implementation (~80% performance for CMT, ~15% performance for HTT) and a significantly faster base clock (3.4 versus 3.8). Even then, the poor per-core performance keeps the BD arch from trouncing i7's, despite more cores at a higher speed.

The downfall of this is the fact that you do have to divide resources to run more than 8 threads.

No you don't, because no more than 8 threads can execute at any given time. So no division of resources is necessary.

Seriously, this isn't that hard to grasp. To run a thread, you have to manipulate data. To manipulate data, you have to load the data into CPU registers. 8 sets of registers means you can only run 8 threads at a time. This is Computer Architecture 101 here people...
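The register-set argument can be restated as a small resource-allocation sketch (a software analogy, not a hardware model; the thread names are made up):

```python
class RegisterFile:
    """Analogy: N register contexts means at most N resident threads."""
    def __init__(self, contexts):
        self.free = list(range(contexts))
        self.resident = {}            # thread name -> context index

    def schedule(self, thread):
        if not self.free:             # all contexts busy: thread must wait
            return False              # (the OS queues it; it does not run)
        self.resident[thread] = self.free.pop()
        return True

    def retire(self, thread):
        self.free.append(self.resident.pop(thread))

cpu = RegisterFile(contexts=8)
granted = [cpu.schedule(f"t{i}") for i in range(10)]
print(granted.count(True))   # only 8 of the 10 threads get a context
```

With 8 contexts, the 9th and 10th threads simply do not get onto the CPU until another thread retires; nothing is "divided".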

Albeit, Intel has to do this as well with HT, though it's a different way of dividing things up, if you will. Also, mispredicted branches and other errors arguably hurt TMT more than SMT, because the CPU has to manually clear its cache in order to resume; SMT operates a little differently. Which is why in some cases PD is not as efficient as it could be at highly multithreaded apps. It should theoretically destroy the i7-3770K in a lot of areas, but it runs a tight race instead and even loses in some arenas. With SR, mispredictions and a lot of other errors will be reduced dramatically, meaning the CPU will run far more efficiently and spend less time running the wrong processes, less time clearing cache and starting over, etc. This means that the performance increase will be dramatic.

1: again, AMD uses a CMT scheme, discussed above.

2: Fixing a branch predictor will help their worst-case performance, but won't do a damn to improve best-case PD numbers.

SMT needs to be coded for to be efficient, TMT does not.

Not true. HTT needs special coding, due to the lack of anything other than the extra register context, which limits what a HTT core can do. But that's not a limitation of SMT.

Simple example: an Intel core and its HTT logical core share the ALU. So only one can handle math operations at a time. If you have two instructions that both require the ALU, guess what? One has to wait. Hence why HTT typically doesn't add much extra processing power. [That being said, register-register arithmetic could theoretically scale to 100%. I still occasionally use bit shifts in my code for exactly this reason.]
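The bit-shift aside in brackets refers to the classic trick of replacing multiply/divide by a power of two with a shift, which keeps the work in the simple integer units; for non-negative integers:

```python
def mul_pow2(x, k):
    """x * 2**k via a left shift (exact for Python ints)."""
    return x << k

def div_pow2(x, k):
    """x // 2**k via a right shift (floor division for non-negative x)."""
    return x >> k

print(mul_pow2(5, 3))    # 40, same as 5 * 8
print(div_pow2(40, 3))   # 5, same as 40 // 8
```

Whether this actually beats a hardware multiplier depends entirely on the CPU and compiler; on modern chips the compiler usually makes this substitution itself.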

That's why in heavily threaded apps on other operating systems, like Ubuntu, etc., AMD is quite a bit faster (Intel gains speed as well, just at a lesser rate). Since there is no HT support at all on Ubuntu, TMT is the only way you can really multithread beyond 1 thread per core.

As I pointed above, AMD uses CMT.

Secondly, you also have to take into account Drivers, configs, and the like when talking about Linux in general. I mean, comparing CCC to Nouveau when doing a NVIDIA to AMD comparison on Linux would hardly be fair...
 

juanrga

Distinguished
BANNED
Mar 19, 2013


It is not really possible to isolate CPU/APU power consumption. And HWMonitor does not really measure power consumption; for that you need a power meter.

The A10-5800K is one of the recommended chips for SFFs:

http://www.anandtech.com/show/6490/holiday-2012-small-form-factor-buyers-guide/4
 

noob2222

Distinguished
Nov 19, 2007


Ya, kinda funny how some people can go spouting off like this, but when you return the favor, you get told to stop. Also funny that someone keeps asking for sources of information and never provides any of his own, just tosses out degrading comments.
 
Let's not turn this into a he-said-she-said.
Sometimes we say things and need to learn; no big deal.
No one's stupid here, or drunk.
We come to learn and share. Step it up, you learned ones; it's you who are failing.
 

8350rocks

Distinguished

Yes, you misread that, what it means is, specifically what it says in the simplest terms. TMT will only function on one thread at a time, but it will use context switches to change the active thread throughout the clock cycle. SMT means that you can have more than one thread at a time running concurrently by using a virtual core or other set of resources.

This is the difference between a physical core using context switches and a virtual core running an operation using partial resources from a physical core. One is a round robin type system working on one thread at a time, but hitting several in the same clock cycle, the other is working on 2 threads simultaneously, but one at an accelerated rate.

I think you just misread that, or didn't read far enough into it...either way.

Secondly, AMD uses a CMT scheme, not TMT. Essentially, they duplicate most of the resources of a CPU core, except the scheduler and some FP units.

http://www.behardware.com/articles/833-2/amd-bulldozer-architecture.html


One BD Module is a "core" with SMT in the classical sense. But since there's two separate register contexts, Task Manager and the OS see "two" cores per BD module. Same concept as Intel HTT, where a HTT core is shown on Task Manager.

Erm...you mean CMP I am guessing? CMT is a poorly executed TV music channel so far as I know. As for CMP, of course the architecture is CMP (Chip level multiprocessing)...that is true of any multicore CPU. Intel chips utilize CMP as well as SMT...do you honestly think that AMD can't utilize both CMP architecture and TMT process? At one point there was discussion that BD would have SMT + TMT and it would have CMP by default because of architecture design, just like Intel has CMP and SMT.

Remember the discussion earlier? "Multiprocessing != Multithreading!" This is the distinction you are alluding to, you're just mixing the 2 together now.



But those resources (the second Integer scheduler, for instance) can't be used on a single thread, so they do nothing but increase power draw. Resources are useless if not used. Hence why BD does so poorly in single-threaded apps, as almost HALF the resources in a BD module go unused.

Secondly, switching threads is VERY computationally expensive. You want to avoid undergoing a context-switch whenever possible. During a context switch, the CPU is doing NOTHING. Finally, context switches have been around since, I don't know, the first 8-bit CPU's? Its not like its an AMD exclusive feature here...

http://en.wikipedia.org/wiki/Context_switch

Aha...but now you see how the inefficiency in the AMD architecture becomes magnified by misprediction errors and cache clears from improper execution, huh? This isn't something new, or unique to AMD; Intel went with SMT, and AMD went with TMT. However, you also have to consider the AMD business standpoint. Designing an architecture to use SMT, and have it be a completely new architecture from the ground up, is a daunting task to be sure. So they took the cheaper, albeit lazier, way out and decided to use 15-year-old, less efficient technology, relying simply on raw horsepower, where they have a 2-to-1 advantage in sheer brute force.

Now, that didn't work out so well with BD because the architecture was far less efficient than they initially suspected...so PD was rolled out pretty quickly to stem the tide. Now they have the time to sit down and do Steamroller correctly so that it turns out to be what it should. The issue is, when you use TMT, you have to make the hardware extremely efficient or you lose time and resources...as you well pointed out above and below. Which means that every percentage point gained in efficiency is basically worth twice as much in terms of performance.

If they had gone SMT like Intel, they would have released BD later, but it likely would have been far simpler to tweak and adjust, because SMT is far more forgiving on the hardware's internal logic systems, and so mispredictions and logical errors cost far less.



Firstly, against, context switches are expensive to execute and should be avoided at all costs. If you want a different thread to run, it should be handled by the OS scheduler. Having to context switch in hardware basically means the pipeline stalled (for whatever reason), necessitating the entire state be saved, a new thread loaded in, its state restored, and continuing with that thread. It keeps the CPU going, but its MUCH cheaper to run one thread until the OS schedules a different thread to run.

Yes, but if you're not set up for SMT, TMT is the only other way to multithread...CMP is not a legitimate way to carry a full load because you're limited to 1 thread per core. No realistic engineer is going to bet on purely 8 threads and nothing more. You always have to have a plan B. TMT was an easy implementation...whose use they underestimated, by my own estimation...hence PD arriving so quickly on the heels of BD.

The switches are inefficient, but you notice how much cache is in BD/PD? They did that for a reason: to accelerate the ability to store the threads for the cores to come back to. That's why it's relatively large compared to the previous generation. It's all to accommodate the low-level multithreading design of TMT and try to get as much out of it as they can.

AMD gains in multithreaded apps mainly due to having a more powerful SMT implementation (~80% performance for CMT, ~15% performance for HTT) and a significantly faster base clock (3.4 versus 3.8). Even then, the poor per-core performance keeps the BD arch from trouncing i7's, despite more cores at a higher speed.

You're partially right...greater CMP and higher clock speed amount to more "brute force"...but SMT and architectural efficiency catch Intel back up, where TMT holds AMD back because the hardware is not designed tightly enough to eliminate as many errors as it could have. SR will fix an enormous portion of those issues.



No you don't, because no more then 8 threads can execute at any given time. So no division of resources is necessary.

Seriously, this isn't that hard to grasp. To run a thread, you have to manipulate data. To manipulate data, you have to load the data into CPU registers. 8 sets of registers means you can only run 8 threads at a time. This is Computer Architecture 101 here people...

So, my question to you is...Why are there 4 register pipelines per core in PD and 8 in SR? If they are not executing multiple threads per clock cycle...why would they need more pipelines? Seems like an awful waste of engineering effort to design something with no purpose. Also, in SR, they're increasing the register queue from 8 to 16 places.

That's a lot of engineering for a system that isn't multithreading...don't you think?

They are using TMT.

1: again, AMD uses a CMT scheme, discussed above.

AMD and Intel both have CMP CPUs...it's the architecture they used to design them by making them multicore. It has nothing to do with the price of goats in Africa when we're discussing multithreading.

This is what you're referring to: http://en.wikipedia.org/wiki/Multi-core_(computing)

2: Fixing a branch predictor will help their worst-case performance, but won't do a damn to improve best-case PD numbers.

I disagree, there will be places where the hardware was bottlenecking itself, and the SR improvements will make a huge difference.



Not true. HTT needs special coding, due to the lack of anything other then the extra register context, which limits what a HTT core can do. But thats no a limitation of SMT.

Fair enough, I overgeneralized lumping SMT and HTT together...point noted. But it doesn't change what I said about HTT being true...as you concede clearly.

Simple example: An intel core and a HTT core share the ALU. So only one core can handle math operations at a time. If you have two instructions that both require the ALU, guess what? One has to wait. Hence why HTT typically doesn't add much extra processing power. [That being said, register-register arithmetic could theoretically scale to 100%. I still occasionally use bit-shifts in my code for exactly this reason.]

Yes, that is where HTT has a weakness versus more cores...I have pointed this out to some on several occasions.



As I pointed above, AMD uses CMP.

I fixed it for you... :) Again, nothing to do with multithreading...that's multiprocessing

Secondly, you also have to take into account Drivers, configs, and the like when talking about Linux in general. I mean, comparing CCC to Nouveau when doing a NVIDIA to AMD comparison on Linux would hardly be fair...

Yes, my statement was broad and generalized, and honestly, outside of Red Hat/Fedora, Debian, and Ubuntu...I haven't really played with too many other Linux distributions out there. So I cannot comment on the ones I have not messed with, but the ones I have used were significantly faster. So I will concede your point about being too broad, perhaps...there are literally thousands of different Linux distributions out there.

 