gamerk316 :
From your wiki link:
The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given pipeline stage in a given cycle. In temporal multithreading the number is one, while in simultaneous multithreading the number is greater than one.
Yes, you misread that. What it means is exactly what it says, in the simplest terms.
TMT will only execute one thread at a time, but it uses context switches to change which thread is active from cycle to cycle. SMT means you can have more than one thread running concurrently, by giving a second hardware context a share of a physical core's resources.
This is the difference between a physical core using context switches and a logical core running operations on partial resources of a physical core. One is a round-robin type system working on one thread at a time but cycling through several in quick succession; the other is working on 2 threads simultaneously on shared hardware.
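To make the distinction concrete, here's a toy Python model of the two schemes (all names invented; real pipelines are vastly more complex). In the TMT sketch only one thread occupies the pipeline in any given cycle, round-robin; in the SMT sketch two threads issue in the same cycle:

```python
def temporal_mt(threads, cycles):
    """Fine-grained TMT sketch: one thread per cycle, round-robin."""
    trace = []
    for cycle in range(cycles):
        active = threads[cycle % len(threads)]  # context switch between cycles
        trace.append([active])                  # exactly one thread in flight
    return trace

def simultaneous_mt(threads, cycles, width=2):
    """SMT sketch: up to `width` threads issue in the SAME cycle."""
    trace = []
    for _ in range(cycles):
        trace.append(list(threads[:width]))     # more than one thread per cycle
    return trace

tmt = temporal_mt(["A", "B"], 4)    # [['A'], ['B'], ['A'], ['B']]
smt = simultaneous_mt(["A", "B"], 4)
assert all(len(cycle) == 1 for cycle in tmt)   # TMT: max one thread per cycle
assert all(len(cycle) == 2 for cycle in smt)   # SMT: two threads per cycle
```

That's exactly the wiki's distinction: maximum threads per pipeline stage per cycle is 1 for TMT and greater than 1 for SMT.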
I think you just misread that, or didn't read far enough into it...either way.
Secondly, AMD uses a CMT scheme, not TMT. Essentially, they duplicate most of the resources of a CPU core, except the scheduler and some FP units.
http://www.behardware.com/articles/833-2/amd-bulldozer-architecture.html
One BD Module is a "core" with SMT in the classical sense. But since there's two separate register contexts, Task Manager and the OS see "two" cores per BD module. Same concept as Intel HTT, where a HTT core is shown on Task Manager.
Erm...you mean CMP, I am guessing? CMT is a poorly executed TV music channel so far as I know. As for CMP, of course the architecture is CMP (chip-level multiprocessing)...that is true of any multicore CPU. Intel chips utilize CMP as well as SMT...do you honestly think that AMD can't utilize both CMP architecture and TMT? At one point there was discussion that BD would have SMT + TMT, and it would have CMP by default because of the architecture design, just like Intel has CMP and SMT.
Remember the discussion earlier? "Multiprocessing != Multithreading!" This is the distinction you are alluding to, you're just mixing the 2 together now.
But those resources (the second Integer scheduler, for instance) can't be used on a single thread, so they do nothing but increase power draw. Resources are useless if not used. Hence why BD does so poorly in single-threaded apps, as almost HALF the resources in a BD module go unused.
Secondly, switching threads is VERY computationally expensive. You want to avoid undergoing a context switch whenever possible. During a context switch, the CPU is doing NOTHING. Finally, context switches have been around since, I don't know, the first 8-bit CPUs? It's not like it's an AMD-exclusive feature here...
http://en.wikipedia.org/wiki/Context_switch
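A quick way to feel that cost: force two Python threads to hand control back and forth and time the handoffs. This is a rough sketch, not a benchmark (the numbers swing wildly by OS and hardware, and Python adds its own overhead), but it shows that every switch burns time in which no useful work happens:

```python
import threading
import time

# Ping-pong between two threads: every handoff forces the OS to
# switch contexts, during which neither thread makes progress.
ITERATIONS = 10_000
ev_a, ev_b = threading.Event(), threading.Event()

def ping():
    for _ in range(ITERATIONS):
        ev_a.wait()
        ev_a.clear()
        ev_b.set()       # hand control back -> another switch

t = threading.Thread(target=ping)
t.start()
start = time.perf_counter()
for _ in range(ITERATIONS):
    ev_a.set()           # hand control to the other thread
    ev_b.wait()
    ev_b.clear()
t.join()
elapsed = time.perf_counter() - start
print(f"~{elapsed / (2 * ITERATIONS) * 1e6:.2f} us per handoff")
```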
Aha...but now you see how the inefficiency in the AMD architecture becomes magnified by misprediction errors and cache clears from improper execution, huh? This isn't something new, or unique to AMD; intel went with SMT, and AMD went with TMT. However, you also have to consider it from an AMD business standpoint. Designing an architecture to use SMT, and have it be a completely new architecture from the ground up, is a daunting task to be sure. So they took the cheaper, albeit lazier, way out and decided to use 15-year-old, less efficient technology relying simply on raw horsepower, where they have a 2-to-1 advantage in sheer brute force.
Now, that didn't work out so well with BD because the architecture was far less efficient than they initially suspected...so PD was rolled out pretty quickly to stem the tide. Now they have the time to sit down and do Steamroller correctly so that it turns out to be what it should. The issue is, when you use TMT, you have to make the hardware extremely efficient, or you lose time and resources...as you well pointed out above and below. Which means that every percentage point gained in efficiency is basically worth twice as much in terms of performance.
If they had gone SMT like intel, they would have released BD later, but it likely would have been far simpler to tweak and adjust, because SMT is far more forgiving on the hardware's internal logic systems, and so mispredictions and logical errors cost far less.
Firstly, again, context switches are expensive to execute and should be avoided at all costs. If you want a different thread to run, it should be handled by the OS scheduler. Having to context switch in hardware basically means the pipeline stalled (for whatever reason), necessitating the entire state be saved, a new thread loaded in, its state restored, and execution continuing with that thread. It keeps the CPU going, but it's MUCH cheaper to run one thread until the OS schedules a different thread to run.
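Here's that save/load/restore sequence in miniature (register names and addresses are made up for illustration). The point is that every step is pure overhead: while the state is being shuffled, neither thread retires a single instruction:

```python
# Hypothetical register file and saved-state table for two threads.
saved = {
    "thread-A": {"pc": 0x400, "sp": 0x7FF0, "r0": 1},
    "thread-B": {"pc": 0x500, "sp": 0x6FF0, "r0": 2},
}
registers = dict(saved["thread-A"])   # thread-A currently executing

def context_switch(registers, saved, old, new):
    saved[old] = dict(registers)      # 1. save the entire state
    registers.clear()
    registers.update(saved[new])      # 2. load the new thread's state
    return new                        # 3. continue with the new thread

current = context_switch(registers, saved, "thread-A", "thread-B")
assert current == "thread-B"
assert registers["pc"] == 0x500       # now fetching thread-B's stream
```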
Yes, but if you're not set up for SMT, TMT is the only other way to multithread...CMP is not a legitimate way to carry a full load because you're limited to 1 thread per core. No realistic engineer is going to bet on purely 8 threads and nothing more. You always have to have a plan B. TMT was an easy implementation...and by my own estimation they underestimated how heavily it would be leaned on...hence PD arriving so quickly on the heels of BD.
The switches are inefficient, but you notice how much cache is in BD/PD? They did that for a reason...to accelerate storing the threads' state so the cores can come back to them quickly. That's why it's relatively large compared to the previous generation. It's all to accommodate the low-level multithreading design of TMT and try to get as much out of it as they can.
AMD gains in multithreaded apps mainly due to having a more effective multithreading implementation (~80% scaling for CMT versus ~15% for HTT) and a significantly faster base clock (3.8 GHz versus 3.4 GHz). Even then, the poor per-core performance keeps the BD arch from trouncing i7's, despite more cores at a higher speed.
You're partially right...greater CMP and higher clock speed amount to more "brute force"...but SMT and architecture efficiency catch intel back up, where TMT holds AMD back because the hardware is not designed tightly enough to eliminate as many errors as it could. SR will fix an enormous portion of those issues.
No you don't, because no more than 8 threads can execute at any given time. So no division of resources is necessary.
Seriously, this isn't that hard to grasp. To run a thread, you have to manipulate data. To manipulate data, you have to load the data into CPU registers. 8 sets of registers means you can only run 8 threads at a time. This is Computer Architecture 101 here people...
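The register-set limit can be sketched in a few lines (purely illustrative; thread names invented). With 8 hardware contexts, a 9th and 10th thread simply have nowhere to put their registers and must wait:

```python
NUM_CONTEXTS = 8                      # 8 register sets in hardware
contexts = [None] * NUM_CONTEXTS      # which thread owns each set
waiting, running = [], []

for tid in [f"thread-{i}" for i in range(10)]:   # 10 threads want to run
    if None in contexts:
        contexts[contexts.index(None)] = tid     # grab a free register set
        running.append(tid)
    else:
        waiting.append(tid)           # no free registers: thread must wait

assert len(running) == 8              # only 8 can execute at a time
assert waiting == ["thread-8", "thread-9"]
```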
So, my question to you is...Why are there 4 register pipelines per core in PD and 8 in SR? If they are not executing multiple threads per clock cycle...why would they need more pipelines? Seems like an awful waste of engineering effort to design something with no purpose. Also, in SR, they're increasing the register queue from 8 to 16 places.
That's a lot of engineering for a system that isn't multithreading...don't you think?
They are using TMT.
1: again, AMD uses a CMT scheme, discussed above.
AMD and intel have CMP CPUs...it's the architecture they used to design them by making them multicore. It has nothing to do with the price of goats in Africa when we're discussing multithreading.
This is what you're referring to: http://en.wikipedia.org/wiki/Multi-core_(computing)
2: Fixing a branch predictor will help their worst-case performance, but won't do a damn to improve best-case PD numbers.
I disagree, there will be places where the hardware was bottlenecking itself, and the SR improvements will make a huge difference.
Not true. HTT needs special coding, due to the lack of anything other than the extra register context, which limits what a HTT core can do. But that's not a limitation of SMT.
Fair enough, I overgeneralized lumping SMT and HTT together...point noted. But it doesn't change what I said about HTT being true...as you concede clearly.
Simple example: An intel core and a HTT core share the ALU. So only one core can handle math operations at a time. If you have two instructions that both require the ALU, guess what? One has to wait. Hence why HTT typically doesn't add much extra processing power. [That being said, register-register arithmetic could theoretically scale to 100%. I still occasionally use bit-shifts in my code for exactly this reason.]
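For anyone curious, the bit-shift trick mentioned there looks like this: multiplies and divides by powers of two rewritten as shifts, which on many cores go to a cheaper (or separate) shifter rather than the multiplier. Modern compilers usually do this rewrite for you; the sketch just shows the equivalence:

```python
def scale_up(x, k):
    return x << k      # same as x * 2**k

def scale_down(x, k):
    return x >> k      # same as x // 2**k for non-negative x

assert scale_up(5, 3) == 5 * 8       # 40
assert scale_down(40, 3) == 40 // 8  # 5
```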
Yes, that is where HTT has a weakness versus more cores...I have pointed this out to some on several occasions.
As I pointed above, AMD uses CMP.
I fixed it for you...
Again, nothing to do with multithreading...that's multiprocessing
Secondly, you also have to take into account Drivers, configs, and the like when talking about Linux in general. I mean, comparing CCC to Nouveau when doing a NVIDIA to AMD comparison on Linux would hardly be fair...
Yes, my statement was broad and generalized, and honestly, outside of Red Hat/Fedora, Debian, and Ubuntu...I haven't really played with too many other Linux distributions out there. So I cannot comment on the ones I have not messed with, but the ones I have used were significantly faster. So I will concede your point about being too broad, perhaps...there are literally thousands of different Linux versions out there.