AMD CPU speculation... and expert conjecture


juanrga

Distinguished
BANNED
Mar 19, 2013


The PS4 may not be fully HSA-compliant, but it includes some closely related parts

http://www.extremetech.com/gaming/154924-secrets-of-the-ps4-heavily-modified-radeon-supercharged-apu-design/2

I was able to find the original AMD claim about the PS4 using hUMA, before Microsoft got angry and AMD retracted it.

Diana said that this gives the PS4 the edge in 3D performance. Talking to Heise, he said that heterogeneous uniform memory access is the key to the tremendous increase in performance of composite processors. The PS4 does not have a distinction between a CPU partition and a GPU partition. Both processors can use the same pieces of data at the same time.

Since you can use both processors at the same time for a single task, the system is extremely smart and extremely strong at the same time.

http://www.fudzilla.com/home/item/32310-ps4-is-better-because-it-has-a-sense-of-huma-says-amd

HERE you can find AMD's Hot Chips talk about Kaveri and HSA. On page 22 and the following pages you can find some details on how HSA synchronizes the iCPU and iGPU to operate on the same data tree simultaneously to increase performance. AMD also measured the performance boost that HSA introduces compared to a standalone CPU and compared to a traditional non-HSA APU. There is one benchmark therein.
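Not from AMD's slides, but here is a minimal sketch of what "both processors using the same data" looks like from the programmer's side, using OpenCL 2.0 fine-grained SVM (one API that exposes HSA-style shared memory on Kaveri-class hardware). The kernel, buffer size, and lack of error checking are purely illustrative, and it assumes a GPU/driver that actually supports fine-grained SVM:

#include <CL/cl.h>
#include <cstdio>

// GPU kernel: scales the shared buffer in place (illustrative only).
static const char *src =
    "kernel void scale(global float *data) {"
    "    data[get_global_id(0)] *= 2.0f;"
    "}";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "scale", nullptr);

    // One allocation visible to both CPU and GPU: no staging copy, no map/unmap.
    size_t n = 1024;
    float *data = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i) data[i] = float(i);   // CPU writes directly

    clSetKernelArgSVMPointer(k, 0, data);                // hand the raw pointer to the GPU
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);

    std::printf("data[10] = %f\n", data[10]);            // CPU reads the GPU's result
    clSVMFree(ctx, data);
    return 0;
}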
 

szatkus

Honorable
Jul 9, 2013


Everything related to 3D graphics for example :)
 
Hmm, if the consoles (or at least one of them) have HSA, then this is a good thing for AMD in the long run. I am still waiting for mainstream programmers and game coders to begin taking advantage of HSA, though. I wonder how long it will take...

Again, by itself, HSA brings no performance advantages whatsoever. Sure, having the CPU/GPU share memory space is nice, but you get hobbled by the GPU needing to communicate over the main memory bus every time it needs to do anything. And as we've discussed, that is the main performance bottleneck on current APUs. That's why GPUs still have a ton of VRAM built in, since you can do a large, slow memory transfer once, and never have to worry about it again.

Anyone who thinks HSA is a magic bullet has always been kidding themselves. It's a different way of doing things, with its own performance limitations to consider.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I will copy/paste

Chapter 13. FPU’s and multimedia

The computer is constantly performing calculations, which can be divided into two groups.

• Whole numbers
• Floating point numbers

The whole number calculations are probably the most important, certainly for normal PC use – using office programs and the like. But operations involving floating point numbers have taken on greater significance in recent years, as 3D games and sound, image and video editing have become more and more a part of everyday computing. Let’s have a brief look at this subject.
 

sapperastro

Honorable
Jan 28, 2014

Hmm, wouldn't that mean that floating point units would mainly work with parallel workloads? Hence, wouldn't it be more efficient for the GPU to do this instead of the CPU?

I still remember having a 386 back in the day, with its own 387 maths coprocessor, which was set up to do these kinds of loads (rare as they were back in those days of computing). I assume the maths coprocessor was the dinosaur version of the onboard FPU.

edit: I think a light went on in my head... this is what Mantle is really all about, along with the idea behind the new DX, OpenGL, and HSA... am I cutting through to the bone here?

 
FP <> parallel. Granted, many of the workloads that use FP tend to be parallel, but it's not automatic.

The main point of Mantle, the new DX, and OGL NG is to reduce driver overhead as much as possible and make multithreaded rendering easier. Currently, all games have two "heavy" threads: the main game engine, and the primary render thread. Those two heavy workloads require a high-IPC core to keep performance up, as lower-IPC cores will get overloaded, no matter how much you make the rest of the code parallel. The hope is that, by reducing the driver overhead, you increase the performance of weaker cores. This also has the side effect of making dual-cores more viable, as you have lower overall CPU utilization. As for multithreaded rendering, DX11 technically supports it already, but in very limited fashion. The idea is that, rather than having one thread do all the rendering (which is one of the major reasons for the "heavy" render thread), you can perform the rendering across many threads.
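To make "rendering across many threads" concrete, here is a toy sketch of the pattern (mine, not from any of those APIs): each worker thread records its own command list, and the main thread only submits the pre-recorded lists in order. Real Mantle/DX12 command lists obviously do far more, but the threading shape is the same:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Stand-in for an API command list: each worker records into its own,
// so no locking is needed during the recording phase.
struct CommandList {
    std::vector<std::string> cmds;
    void draw(int objectId) { cmds.push_back("draw object " + std::to_string(objectId)); }
};

int main() {
    const int numWorkers = 4;
    const int numObjects = 1000;
    std::vector<CommandList> lists(numWorkers);
    std::vector<std::thread> workers;

    // Recording phase: split the scene across worker threads.
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (int obj = w; obj < numObjects; obj += numWorkers)
                lists[w].draw(obj);
        });
    }
    for (auto &t : workers) t.join();

    // Submission phase: the one remaining "render thread" job is just
    // handing the pre-recorded lists to the GPU queue, in order.
    size_t total = 0;
    for (const auto &cl : lists) total += cl.cmds.size();
    std::printf("submitted %zu commands recorded on %d threads\n", total, numWorkers);
    return 0;
}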

The end result of the new render APIs will be that lower tier chips, mainly the Pentium, i3, and FX-4xxx lines should see significant performance boosts. Higher tier chips won't be affected much, however, as they aren't significantly CPU bottlenecked.
 

blackkstar

Honorable
Sep 30, 2012


This is what I mean when I say we need more cores. Basically, game developers stop trying to improve the parts of the game that are stuck in a single thread, and instead exploit the fact that Mantle (and derivative APIs) can take the task that used to be locked to one or two threads and spread it over a whole lot of them.

What we are doing isn't going to work, and if game developers continue with the same software and the same tools, games are going to hit a brick wall in terms of improvement.

Game logic is going to be extremely difficult to break into multiple threads (it might not even be possible!), but rendering with things like Mantle, OGL NG, and DX12 will allow at least one piece of the puzzle to scale to a lot of threads, and that in itself makes moar coars viable for gaming.

There are some tasks that do still scale well with many cores (rendering, transcoding video, compiling, etc), and some single thread tasks can scale to multiple cores by running multiple jobs (encoding several MP3s at a time instead of just one).

I do realize that there will be programs forever stuck in a single thread, but software developers in general are going to have to admit that not embracing multithreading isn't going to give you any sort of advantage, especially if someone else gets multithreading working better.

We are in the middle of a paradigm shift and the ones who figure all of this out are the ones that are going to prosper. AMD has a huge opportunity here with their hardware and software.
 
This is what I mean when I say we need more cores. Basically, game developers stop trying to improve the parts of the game that are stuck in a single thread, and instead exploit the fact that Mantle (and derivative APIs) can take the task that used to be locked to one or two threads and spread it over a whole lot of them.

Actually, game engines are getting a LOT more efficient these days. You can still get some improvement out of optimization, generally by re-approaching the problem, but these workloads simply do not scale, period.

What we are doing isn't going to work, and if game developers continue with the same software and the same tools, games are going to hit a brick wall in terms of improvement.

Define "improving" for me? Performance? Graphics? Features? I would argue game's haven't improved at all in the past decade.

Game logic is going to be extremely difficult to break into multiple threads (it might not even be possible!), but rendering with things like Mantle, OGL NG, and DX12 will allow at least one piece of the puzzle to scale to a lot of threads, and that in itself makes moar coars viable for gaming.

The opposite, actually.

The fatal mistake people keep making with regard to scalability is assuming that more threads automatically lead to more performance. This is not true. It comes down to workload per core; if any core gets overworked, then performance suffers. Point being, if you have a thread that stresses a single CPU core to 150% of its maximum workload, it doesn't matter how many threads or cores you have, because you are limited by that one overworked CPU core. If the thread causing that workload happens to be a primary thread for that program, you're sunk.
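A back-of-the-envelope illustration of that point, with invented numbers: if one mandatory thread needs 25 ms of CPU time per frame on a single core, the frame rate is capped at 40 fps no matter how many extra cores soak up the rest of the work:

#include <algorithm>
#include <cstdio>

int main() {
    // Invented numbers: per-frame CPU work in milliseconds.
    const double heavyThreadMs  = 25.0;   // serial main/render thread, can't be split
    const double parallelWorkMs = 60.0;   // everything that can be spread across cores

    for (int cores : {2, 4, 8, 16}) {
        // The frame takes as long as the slowest core: the heavy thread
        // pins one core while the parallel work divides over the rest.
        double frameMs = std::max(heavyThreadMs, parallelWorkMs / (cores - 1));
        std::printf("%2d cores -> %.1f ms/frame (%.0f fps)\n",
                    cores, frameMs, 1000.0 / frameMs);
    }
    return 0;  // past 4 cores the heavy thread caps it at 40 fps
}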

That's why BD's approach was wrong: not for going to more cores, but because IPC suffered. As a result, even if software scaled in the way AMD was hoping, a single high-workload thread would throw all those possible performance benefits out the window. That's why Intel still pulls ahead in gaming: despite the fact that games are using more and more threads (sometimes 12+) that do "significant" work, the heavy-workload threads are the limiting performance factor. Intel's stronger cores can deal with this; AMD's weaker cores can't.

That's why dual cores are going to benefit most from the new APIs, since one of the two primary "big" threads for games should become a thing of the past. That leaves one big thread and a bunch of smaller ones, which is easier for a strong dual-core CPU to deal with. These APIs are literally the best thing that could happen to Intel, since its Pentium/i3 line becomes a LOT more viable for OEM PCs and mobile.

There are some tasks that do still scale well with many cores (rendering, transcoding video, compiling, etc), and some single thread tasks can scale to multiple cores by running multiple jobs (encoding several MP3s at a time instead of just one).

And we're right back to my primary argument: These tasks should be using the GPU anyway, since they are not just parallel, they are MASSIVELY parallel. Why use 8 cores when you can use 500?

I do realize that there will be programs forever stuck in a single thread, but software developers in general are going to have to admit that not embracing multithreading isn't going to give you any sort of advantage, especially if someone else gets multithreading working better.

We are in the middle of a paradigm shift and the ones who figure all of this out are the ones that are going to prosper. AMD has a huge opportunity here with their hardware and software.

It's not a matter of embracing it, it's a matter of feasibility. The majority of workloads simply do not scale, period. The stuff that does will get offloaded to the GPU. That leaves the CPU to deal with the non-threaded stuff. And guess what? That's exactly what we're doing now.
 

8350rocks

Distinguished


You are implying that they cannot possibly have 512-bit FMAC pipelines...

Why do you think they could not have 512-bit FMAC pipelines?

Carrizo will have 2x 256-bit FMAC pipelines...

That is double the performance... and if they run 512-bit FMAC pipelines, it will be 4x.

Also, they will not be behind by 8x. Your math assumes half efficiency per core, yet you are not aware of what the new architecture will have; in all likelihood, they will not share FMAC pipelines again, which doubles your number off the bat, before the FMAC pipelines are widened...
 

con635

Honorable
Oct 3, 2013

I could wait no longer to upgrade, I was so bored, so I now have my first Intel CPU since a Pentium P166 way back in the day, which replaced my trusty Amiga :( What can I say, it was cheap and it found me. Still feel a bit dirty though!
Indeed, despite the lower clock speed, in certain places in BF4 64-player MP I'm getting more performance, e.g. Shanghai, at the top of the big road looking down: 40-50 fps vs 90-112 fps. I play with everything on low in MP except mesh, for obvious reasons, so it has to be a CPU limitation. There's one example of 100%.
 

jdwii

Splendid


That is indeed a good example, and I saw that in that game as well when I tested it in my bench suite; however, I wasn't talking about overall performance, just per clock, per core. When the FX is clocked at its intended speed and it gets to use all of its cores, we see some nice performance.
 

szatkus

Honorable
Jul 9, 2013


As far as I know, AMD planned to have slightly better IPC than K10. Something just went wrong.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I believe you are confounding IPC with IPS.

 

jdwii

Splendid


Actually Juan, that is the official report :). But according to this http://forums.anandtech.com/showpost.php?p=32421412&postcount=106
it's clear the engineering department hoped IPC would go up.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


If you have a serial workload that relies on FP calculations, it is better to run it on the CPU, because CPUs are optimized for serial processing.

If you have a parallel workload that relies on FP calculations, it is better to run it on the GPU, because GPUs are optimized for parallel processing.

Thus FP units on CPUs are needed, and all companies are increasing their performance. AMD, Intel, ARM, Nvidia, Fujitsu, NEC... all of them have presented, or will soon present, CPUs with increased FP units.
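A tiny illustration of the difference (my example, not anything from AMD): the first loop has a loop-carried dependency, so each iteration must wait on the previous one; that is the serial FP work a fast CPU core handles best. The second loop's iterations are all independent, so it maps naturally onto hundreds of GPU threads:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n);

    // Serial FP: each step depends on the previous result, so the work
    // cannot be split up -- a high-IPC, high-clock CPU core wins here.
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * 0.999f + x[i];

    // Parallel FP: every element is independent, so this is the kind of
    // loop a GPU (or a wide SIMD unit) chews through far faster.
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + 3.0f;

    std::printf("acc = %f, y[0] = %f\n", acc, y[0]);
    return 0;
}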

Mantle, DX12, and OpenGL Next are something different. Those are graphics APIs, i.e. software layers used to program the processes in all stages of computer graphics generation without accessing the concrete hardware.

HSA is yet another thing. It is both a hardware and a software architecture that describes how compute units of different kinds (e.g. CPU cores and GPU cores) can work together in a cooperative fashion.

 

szatkus

Honorable
Jul 9, 2013


^This

Thank you. I would never have found this post :)
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Uh? We are discussing what AMD engineers planned and executed. The quote that I provided explains what the engineers did in the design of Bulldozer.

Bulldozer was never a brainiac design; it was a speed-demon design.

What one marketing guy, with a bad reputation, spread in forums before launch doesn't count.

http://scalibq.wordpress.com/2012/08/14/john-fruehe-leaves-amd/

http://www.zdnet.com/blog/berlind/amd-answers-to-controversial-use-of-retired-benchmarks-in-china/420

http://www.benchmarkemail.com/blogs/detail/in-social-media-marketing-honesty-is-everything
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Exactly! If some workload can be well-threaded, then it makes no sense to run it on a CPU, wasting power and die space, when it can run better and faster on a GPU.
 

szatkus

Honorable
Jul 9, 2013


I used the word "planned", so...
 

juanrga

Distinguished
BANNED
Mar 19, 2013
AMD has explained plenty of times why the moar-cores paradigm is close to its end and has to be replaced by the heterogeneous paradigm. The next slide explains some of the limitations of the moar-cores design, such as power and poor scalability:

[slide: hsa_history_large.jpg]


But a simple practical example is more useful

We will start with the 12-compute-core Kaveri APU. This APU has 4 CPU cores and 8 GPU cores, and the CPU occupies about 50% of the die:

http://www.amd.com/en-us/innovations/software-technologies/compute-cores

Now imagine that we are given the possibility of porting Kaveri to 20nm and the option of increasing the die size by 50%. For simplicity, we are given two choices: either double the number of CPU cores or double the number of GPU cores (bottlenecks, thermal issues, and other technical issues are ignored here for simplicity).

OK, the total throughput of the APU splits as follows:

[chart: maximum GFLOPS split between CPU and GPU - maximum-gflops.jpg]


If we double the number of CPU cores up to 8 cores and keep the 8 GPU cores, then the total performance increases from [118.4 + 737.3] to [236.8 + 737.3]. This is an increase of 13.8%.

If, instead, we double the number of GPU cores up to 16 cores and keep the 4 CPU cores, then the total performance rises from [118.4 + 737.3] to [118.4 + 1474.6]. This is 86.2% more performance.
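Working those sums out explicitly, using the GFLOPS figures quoted above: (236.8 + 737.3) / (118.4 + 737.3) = 974.1 / 855.7 ≈ 1.138, i.e. +13.8%, whereas (118.4 + 1474.6) / (118.4 + 737.3) = 1593.0 / 855.7 ≈ 1.862, i.e. +86.2%.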

Thus we finish with two APUs of about the same size, but one of them is much more powerful than the other. The reason is the following: the CPU uses latency cores, while the GPU uses throughput cores. Throughput cores are a kind of core optimized for performance/watt and performance/area. This is why the APU with more GPU cores provides more performance for the same area than the APU with more CPU cores.

Of course, reality is much more complex than all of the above, but this simple example illustrates why the traditional moar-cores approach is close to being a dead paradigm in computer design and is being replaced by heterogeneity.

For 2016, I would like to see a 20 core APU with 4 strong* CPU cores, 16 GPU cores and plenty of bandwidth thanks to stacked RAM.

* I mean both big IPC and frequency.
 

szatkus

Honorable
Jul 9, 2013


So we are talking about what AMD wanted to do, not what they delivered and told journalists in the end.
Good that we agree.
 

jdwii

Splendid


I always read his blogs. I used to not care for him when I was a blind fanboy, but now I think he is accurate 90% of the time.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


There are three different but related concepts here: what AMD initially planned; what AMD was finally able to produce; and what one person (later fired from AMD) said in blogs and forums.
 