cdrkf :
RE CMT, I'm not saying I think CMT is a good idea now- my point was that back when they started designing Bulldozer (5+ years before it launched) I can understand why they went that way. The one thing CMT has allowed AMD to do (that other design schemes might not have) is to cram a lot of cores onto a reasonably sized die at a relatively large node (compared to what Intel is working with, at least). Also, the design scheme behind CMT doesn't actually dictate poor single-thread performance: if you're using one module for one thread, then much like HT all resources are dedicated to that thread and speed improves.
I understood your point, but I disagree. I think CMT was a bad idea back then too. That is why no other processor maker followed the 1996 DEC architecture except AMD.
AMD's CMT approach makes little sense (then and now) because it is a self-inconsistent hybrid of a throughput compute unit (TCU) and a latency compute unit (LCU):
(1) Cramming lots of cores into a relatively small die makes sense when you are designing a manycore, but this is a multicore, and multicore != manycore. This confusion was a major design flaw of CMT.
(2) As already covered in early FSA (the precursor to HSA), CPUs are for latency, GPUs for throughput. With CMT, however, AMD went for a throughput-optimized CPU, "moar cores", which not only makes little sense per se (serial code, branching...) but also contradicts what HSA is about: latency + throughput.
(3) The shared FPU was chosen because someone at AMD intended to use the CPU for integer work and, in the future, an external GPU as a kind of giant FPU. The CMT split into integer vs floating-point 'clusters' is a copy of the old DEC boxes. But this 'clustering' ignores that one also needs FP for latency workloads. Moreover, the shared FPU complicated the design, instead of simplifying it, compared to a CMP architecture, because the shared FPU has to implement a form of SMT.
Moreover, the two FMAC units can be fused to execute 256-bit code, but then they are accessible to only one of the cores in the module at a time, halving throughput again and eliminating any performance gain from running AVX-like code, which again makes no sense.
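A back-of-the-envelope illustration of that last point (just the usual peak-FLOP bookkeeping, assuming 2x 128-bit FMAC units per module as in Bulldozer's FlexFPU):

```python
# Rough peak-FLOP bookkeeping for one Bulldozer-style module.
# Assumption: 2x 128-bit FMAC units shared by the two cores.

SP_LANES_128BIT = 4   # 32-bit lanes in a 128-bit vector
FLOPS_PER_FMA = 2     # fused multiply-add = 2 FLOPs per lane

# Two threads each issuing 128-bit FMAs: each thread owns one FMAC.
per_thread_128 = SP_LANES_128BIT * FLOPS_PER_FMA   # 8 FLOP/cycle
module_128 = 2 * per_thread_128                    # 16 FLOP/cycle

# 256-bit AVX FMA: the two FMACs fuse into one 256-bit unit,
# so only one thread can issue per cycle.
per_issue_256 = 2 * SP_LANES_128BIT * FLOPS_PER_FMA  # 16 FLOP/cycle
module_256 = per_issue_256                           # still 16 FLOP/cycle

print(module_128, module_256)  # 16 16 -> no module-level gain from AVX
```

Module-level peak is unchanged, which is exactly the "no gain from AVX-like code" point.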
(4) Modules were aimed at providing more throughput, but the shared decoder brings a ~20% penalty, which ends with multithreaded code running faster when only one thread is scheduled per module and the companion cores are left unused. Modules were also aimed at improving power consumption, but in the above case the full module is active even when only one of its cores is working, increasing power consumption compared to a traditional CMP design, where each core can be parked when unused.
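To put a number on it (the ~20% front-end penalty is the figure quoted above; the rest is just arithmetic):

```python
# Aggregate throughput of two independent threads, normalized so that
# one thread alone on a module = 1.0. The ~20% shared-decoder penalty
# is the figure quoted above; treat it as illustrative.

FRONTEND_PENALTY = 0.20

two_threads_one_module = 2 * (1.0 - FRONTEND_PENALTY)   # 1.6
two_threads_two_modules = 2 * 1.0                        # 2.0

print(two_threads_one_module, two_threads_two_modules)
# 1.6 vs 2.0: spreading the threads across modules is ~25% faster,
# but it keeps both modules powered up.
```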
(5) As mentioned above, modules were also aimed at improving power consumption. The CMT approach tries to do this by reducing die area through shared elements:
Power ~ Area
But then IPC was reduced by the area constraints. Net throughput of CMT would still be superior thanks to "moar cores"; the problem is that low IPC would be a step backward for a CPU, so it had to be compensated by an emphasis on a high-frequency engine (~4GHz). But power is not linear with frequency:
Power ~ frequency^n
where n >= 2. Thus any power advantage from sharing resources was eliminated by the high frequencies, resulting in higher TDPs for the CMT design. Power consumption is the reason why throughput machines are clocked at low frequencies: 1-2 GHz.
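A minimal sketch of the trade-off, assuming the classic dynamic-power model P ~ C*V^2*f with voltage roughly tracking frequency (so n lands between 2 and 3; the exact exponent depends on process and operating point):

```python
# Rough power scaling: P ~ f^n with n >= 2, because voltage must rise
# with frequency and dynamic power C*V^2*f then grows superlinearly in f.

def relative_power(f_ratio, n=2):
    """Power of the same design clocked f_ratio times faster."""
    return f_ratio ** n

# Throughput-oriented parts sit at 1-2 GHz; Bulldozer targeted ~4 GHz.
for n in (2, 3):
    print(f"n={n}: 2x clock -> {relative_power(2, n):.0f}x power "
          f"for only 2x throughput")
# n=2: 4x power; n=3: 8x power. Any area/power savings from
# sharing module resources are easily eaten by this.
```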
(6) To tolerate the increased latencies needed to hit a high frequency target, branch prediction is critically important; however, in CMT the branch predictor is shared by the two cores in each module, leaving each core with effectively half the predictor resources.
(7) This is all about the hardware; CMT also introduces problems on the software side. A CMP scheduler is trivial. An SMT scheduler is easy (first fill real cores, then virtual ones). A CMT scheduler is complex. For throughput, independent threads have to be scheduled on free cores in separate modules with the companion core unused; otherwise performance is reduced by the front-end bottleneck. For data-dependent threads, it is better to schedule on the same module to avoid the performance penalty of moving cache data across modules. For efficiency, the scheduler has to follow yet other strategies, because scheduling two threads on two modules increases power consumption (see the toy sketch below).
In the end, no scheduler can extract maximum efficiency from a CMT approach.
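To make the conflict concrete, here is a toy model (entirely hypothetical code, not how Windows or Linux actually implement their schedulers) showing that the three goals pick different cores from the same state:

```python
# Toy CMT topology: module i contains cores 2*i and 2*i+1.
# Hypothetical sketch; only meant to show the policies conflict.

NUM_MODULES = 4

def module_of(core):
    return core // 2

def pick_core(busy, policy, partner=None):
    """Pick a core for a new thread. busy = set of busy core ids."""
    free = [c for c in range(2 * NUM_MODULES) if c not in busy]
    busy_modules = {module_of(b) for b in busy}
    if policy == "throughput":
        # Prefer a completely idle module: no shared front-end contention.
        for c in free:
            if module_of(c) not in busy_modules:
                return c
    elif policy == "cache_affinity" and partner is not None:
        # Prefer the sibling core of a data-dependent partner thread.
        sibling = partner ^ 1
        if sibling in free:
            return sibling
    elif policy == "power":
        # Prefer an already-active module so the others can stay parked.
        for c in free:
            if module_of(c) in busy_modules:
                return c
    return free[0] if free else None

busy = {0}  # one thread already running on core 0
print(pick_core(busy, "throughput"))         # 2 -> a fresh module
print(pick_core(busy, "cache_affinity", 0))  # 1 -> sibling of core 0
print(pick_core(busy, "power"))              # 1 -> keep module 0 hot
```

Each policy chooses a different core given the same state, which is exactly why no single heuristic can win on all three fronts at once.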
(8) This is all about CMT. Bulldozer introduced further flaws/problems on top of that.
cdrkf :
I agree with you on Kaveri- it is a shame that the GDDR5 DIMMs weren't available, as the iGPU in Kaveri + GDDR5 would be epic (1920x1080 @ 30+ fps in pretty much any title). I think as new memory technologies become available AMD's APUs are going to get better and better. I'm waiting for the day Tom's recommends an APU in either (or both?) of their Best CPU / GPU for the Money articles.
Agree. But at least we know that it was not AMD's fault, just a case of bad luck. In any case the GDDR5M solution was only temporary, because GDDR5 cannot scale. As I mentioned in a previous post, the new 2016 APUs will offer fast stacked DRAM for the high-end range.
It is expected that APUs will replace both CPUs and dGPUs around 2018, although Intel could accelerate things and kill discrete GPUs even sooner.