hcl123 :
So this is a very old argument and experiment; it's more than proven by now that in REAL high-performance terms, you must talk multi-threading. Perhaps there is a reason why Sony and MSFT chose 8 cores for their gaming consoles, where single-thread performance is clearly not a big issue (otherwise they would have chosen another kind of core), while in the PC world the trend seems to go "contra-natura" and no one seems willing to go beyond 4 cores/threads. Is Intel's clout being felt?.. Is there any doubt that Sony or MSFT might end up developing 8-thread games? ... It's an old story in any case, one that ends up smashed against the REALITY of the very poor ILP scaling inherent in current ISA paradigms (be they RISC or CISC).
Different types of systems. On a console, or any other integrated system, where you have exactly ONE hardware profile, and you don't have a heavy OS to lug around, you can code VERY low level, directly to the hardware, and extract significantly improved performance. I can, for instance, guarantee the contents of every single memory address at just about any point in any program I choose to run. I can guarantee what threads are running on the system. And so on.
Look, I've worked on systems where you code directly to the HW. It's a different way of life. About 90% of my work is VERY optimized assembly (because when you measure code and memory space in the KB realm, you really care about code efficiency). Trust me when I say, PCs are probably never getting more than about 50% of their theoretical maximum performance, simply due to overhead.
Long story short: the consoles seem to signal multi-threading intent, the entire server world has been heavily multi-threaded for ages, common OSs can now also be heavily multi-threaded, and compilers can support it fine (they just can't write the code for the developer... yet)... so why isn't, or can't, the PC world follow the trend?
Again, PCs are a different world. Consoles are integrated machines; you can code to a very low level. Servers are designed around parallel workloads (multiple users, large datasets). PCs, for the most part, aren't.
Yes, "dark silicon" is a serious concern, and it is made worse by wider, fatter cores. In any case, that doesn't mean multi-threading can't scale; on the contrary, nothing can really scale for the foreseeable future BUT multi-threading. The problem is that PC software isn't there, and seems unwilling to get there. But I think reality will creep in eventually, as with IBM's 8-issue-wide CPU attempt. I think Haswell's less-than-expected performance increase for a "tock" is just a warning...
Haswell was focused on the iGPU. No shock there.
It's a question of good development tools. It's a caricature that AMD pushes TBB (Intel Threading Building Blocks) harder than Intel itself does... Intel seems to have forgotten about good multi-threading development tools altogether. In the server world that's no big deal, since it already employs the "ninjas" of coding, but in the DT world good tools could be essential to get multi-threaded programming off the ground.
There is little any compiler can do in this area. If the OS scheduler decides to run thread A on core 0, guess what? That's where that thread is going to run. If the OS scheduler decides it has a high-priority interrupt to handle while your application's thread is running, guess what? You get kicked off the core. This is entirely the domain of the OS, not the compiler/optimizer.
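To illustrate the division of labor: the most an *application* (not a compiler) can do is request placement from the OS. A minimal sketch, assuming Linux (where `os.sched_setaffinity` is available); even with the mask set, the kernel still decides when the thread actually runs:

```python
# Linux-only sketch: ask the kernel to restrict this process to a single CPU.
# This is a *request* to the OS scheduler -- the compiler has no say in any of
# this, and the kernel can still preempt us on that CPU at any time.
import os

allowed = os.sched_getaffinity(0)   # pid 0 = the calling process
one_cpu = min(allowed)              # pick one CPU we are currently allowed on

os.sched_setaffinity(0, {one_cpu})  # request: run only on that CPU
mask = os.sched_getaffinity(0)
print("allowed CPUs now:", mask)    # the mask is honored...
# ...but a higher-priority task or interrupt can still kick us off the core.
```

Note what this does and does not buy you: it narrows *where* the thread may run, never *whether* it keeps running.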
Just wonder why Intel is pushing HTM (hardware transactional memory) into its new designs... I guess it will be turned off in DT and only active in server SKUs, when it could be really useful to everyone. HTM with only 4 or even 8 threads is simply a bad joke!..
HTM has its own downsides. The concept is simple: perform the action in question without placing a lock, and when you are done, confirm the memory in question has not been changed. If this works, then you saved a very minor amount of processing (no need to place a lock). If, however, some other thread DID change the contents, guess what? The transaction is undone, a lock is put in place, and the operation is performed a SECOND time, this time with the lock in place. So the "fail" case is going to be at least 2x as slow as conventional processing, in exchange for a very minor speedup in the "pass" case.
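The try-then-fallback pattern described above can be emulated in software with a version counter (no real HTM involved; `Account` and `deposit` here are illustrative names, not any real API): attempt the update optimistically, and if another thread committed in the meantime, redo the whole operation conventionally under the lock — the "2x" fail case.

```python
# Software emulation of the HTM fast-path/fallback pattern described above.
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0                    # bumped on every committed write
        self.lock = threading.Lock()

    def deposit(self, amount):
        # --- optimistic "transaction": compute without holding the lock ---
        seen = self.version
        new_balance = self.balance + amount
        with self.lock:
            if self.version == seen:        # nobody interfered: commit ("pass")
                self.balance = new_balance
                self.version += 1
                return "fast"
        # --- "fail" case: the work is redone, this time under the lock ---
        with self.lock:
            self.balance += amount
            self.version += 1
        return "slow"

# Usage: four threads hammering one account; no update is ever lost.
acct = Account(0)
workers = [threading.Thread(target=lambda: [acct.deposit(1) for _ in range(1000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(acct.balance)                         # 4000
```

Real HTM does the commit check in hardware instead of under a lock, but the cost structure is the same: the conflict path pays for the work twice.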
Hence most developers understand that if you have two threads that require potentially simultaneous access to the same data structure, you *probably* have a design issue that needs resolving. Basic threading principle: if you have to do a lot of reaching across the thread boundary, then your threading model is probably wrong.
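One common way to fix such a design, sketched below with the standard library (the `owner` worker and message shapes are made up for illustration): give the shared structure a single owner thread and have everyone else send it messages, so nothing ever reaches across the thread boundary.

```python
# Sketch: one owner thread holds the data structure; other threads only send
# messages via a queue, so the structure itself needs no locks at all.
import queue
import threading

def owner(inbox, results):
    totals = {}                          # touched ONLY by this thread
    while True:
        msg = inbox.get()
        if msg is None:                  # sentinel: shut down and report
            results.put(totals)
            return
        key, amount = msg
        totals[key] = totals.get(key, 0) + amount

inbox, results = queue.Queue(), queue.Queue()
t = threading.Thread(target=owner, args=(inbox, results))
t.start()
for i in range(100):                     # producers only send messages
    inbox.put(("a" if i % 2 else "b", 1))
inbox.put(None)
t.join()
totals = results.get()
print(totals)
```

The design choice here is ownership, not locking: contention is replaced by a queue hand-off, which is exactly the "don't reach across the boundary" principle above.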
Now, in a massively parallel situation, HTM would probably result in a significant speedup in system performance, simply because the overhead of the locks will likely be greater than the cumulative performance hit of all the retries. For minimally threaded workloads, I would expect a decline or no change in performance.