AMD CPU speculation... and expert conjecture


juanrga

Distinguished
BANNED
Mar 19, 2013


The PS4 may not be fully HSA-compliant, but it includes some closely related parts

http://www.extremetech.com/gaming/154924-secrets-of-the-ps4-heavily-modified-radeon-supercharged-apu-design/2

I was able to find the original AMD claim about the PS4 using hUMA, before Microsoft got angry and AMD retracted it.

Diana said that this gives the PS4 the edge in 3D performance. Talking to Heise, he said that heterogeneous uniform memory access is the key to the tremendous increase in performance of composite processors. The PS4 does not have a distinction between a CPU partition and a GPU partition. Both processors can use the same pieces of data at the same time.

Since you can use both processors at the same time for a single task, the system is extremely smart and extremely strong at the same time.

http://www.fudzilla.com/home/item/32310-ps4-is-better-because-it-has-a-sense-of-huma-says-amd

HERE you can find AMD's Hot Chips talk about Kaveri and HSA. On page 22 and the following pages you can find some details on how HSA synchronizes the iCPU and iGPU to operate on the same data tree simultaneously to increase performance. AMD also measured the performance boost that HSA introduces compared to a standalone CPU and compared to a traditional non-HSA APU. There is one benchmark therein.
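Not from AMD's slides, but here is a minimal sketch of what "both processors using the same data" looks like from the programmer's side, using OpenCL 2.0 fine-grained SVM (one API that exposes HSA-style shared memory on Kaveri-class hardware). The kernel, buffer size, and lack of error checking are purely illustrative, and it assumes a GPU/driver that actually supports fine-grained SVM:

#include <CL/cl.h>
#include <cstdio>

// GPU kernel: scales the shared buffer in place (illustrative only).
static const char *src =
    "kernel void scale(global float *data) {"
    "    data[get_global_id(0)] *= 2.0f;"
    "}";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "scale", nullptr);

    // One allocation visible to both CPU and GPU: no staging copy, no map/unmap.
    size_t n = 1024;
    float *data = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i) data[i] = float(i);   // CPU writes directly

    clSetKernelArgSVMPointer(k, 0, data);                // hand the raw pointer to the GPU
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);

    std::printf("data[10] = %f\n", data[10]);            // CPU reads the GPU's result
    clSVMFree(ctx, data);
    return 0;
}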
 

szatkus

Honorable
Jul 9, 2013


Everything related to 3D graphics for example :)
 
Hmm, if the consoles (or at least one of them) have HSA, then this is a good thing for AMD in the long run. I am still waiting for mainstream programmers and game coders to begin taking advantage of HSA, though. I wonder how long it will take...

Again, by itself, HSA brings no performance advantages whatsoever. Sure, having the CPU/GPU share memory space is nice, but you get hobbled by the GPU needing to communicate over the main memory bus every time it needs to do anything. And as we've discussed, that is the main performance bottleneck on current APUs. That's why GPUs still have a ton of VRAM built in, since you can do a large, slow memory transfer once, and never have to worry about it again.

Anyone who thinks HSA is a magic bullet has always been kidding themselves. It's a different way of doing things, with its own performance limitations to consider.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I will copy/paste

Chapter 13. FPU’s and multimedia

The computer is constantly performing calculations, which can be divided into two groups.

• Whole numbers
• Floating point numbers

The whole number calculations are probably the most important, certainly for normal PC use – using office programs and the like. But operations involving floating point numbers have taken on greater significance in recent years, as 3D games and sound, image and video editing have become more and more a part of everyday computing. Let’s have a brief look at this subject.
 

sapperastro

Honorable
Jan 28, 2014

Hmm, wouldn't that mean that floating point units would mainly work with parallel workloads? Hence, wouldn't it be more efficient for the GPU to do this instead of the CPU?

I still remember having a 386 back in the day, with its own 387 maths coprocessor, which was set up to do these kinds of loads (rare as they were back in those days of computing). I assume the maths coprocessor was the dinosaur version of the onboard FPU.

edit: I think a light went on in my head... this is what Mantle is really all about, along with the idea behind the new DX, OpenGL, and HSA... am I cutting through to the bone here?

 
FP <> parallel. Granted, many of the workloads that use FP tend to be parallel, but it's not automatic.

The main point of Mantle, the new DX, and OGL NG is to reduce driver overhead as much as possible and make multithreaded rendering easier. Currently, all games have two "heavy" threads: the main game engine, and the primary render thread. Those two heavy workloads require a high-IPC core to keep performance up, as lower-IPC cores will get overloaded, no matter how much you make the rest of the code parallel. The hope is that, by reducing the driver overhead, you increase the performance of weaker cores. This also has the side effect of making dual-cores more viable, as you have lower overall CPU utilization. As for multithreaded rendering, DX11 technically supports it already, but in very limited fashion. The idea is that, rather than having one thread do all the rendering (which is one of the major reasons for the "heavy" render thread), you can perform the rendering across many threads.
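To make "rendering across many threads" concrete, here is a toy sketch of the pattern (mine, not from any of those APIs): each worker thread records its own command list, and the main thread only submits the pre-recorded lists in order. Real Mantle/DX12 command lists obviously do far more, but the threading shape is the same:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Stand-in for an API command list: each worker records into its own,
// so no locking is needed during the recording phase.
struct CommandList {
    std::vector<std::string> cmds;
    void draw(int objectId) { cmds.push_back("draw object " + std::to_string(objectId)); }
};

int main() {
    const int numWorkers = 4;
    const int numObjects = 1000;
    std::vector<CommandList> lists(numWorkers);
    std::vector<std::thread> workers;

    // Recording phase: split the scene across worker threads.
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (int obj = w; obj < numObjects; obj += numWorkers)
                lists[w].draw(obj);
        });
    }
    for (auto &t : workers) t.join();

    // Submission phase: the one remaining "render thread" job is just
    // handing the pre-recorded lists to the GPU queue, in order.
    size_t total = 0;
    for (const auto &cl : lists) total += cl.cmds.size();
    std::printf("submitted %zu commands recorded on %d threads\n", total, numWorkers);
    return 0;
}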

The end result of the new render APIs will be that lower tier chips, mainly the Pentium, i3, and FX-4xxx lines should see significant performance boosts. Higher tier chips won't be affected much, however, as they aren't significantly CPU bottlenecked.
 

blackkstar

Honorable
Sep 30, 2012


This is what I mean when I say we need more cores. Basically, game developers stop trying to improve the parts of the game that are stuck in a single thread, and instead exploit the fact that Mantle (and derivative APIs) can take the task that used to be locked to one or two threads and spread it over a whole lot of them.

What we are doing isn't going to work, and if game developers continue with the same software and the same tools, games are going to hit a brick wall in terms of improvement.

Game logic is going to be extremely difficult to break into multiple threads (it might not even be possible!), but rendering with things like Mantle, OGL NG, and DX12 will allow at least one piece of the puzzle to scale to a lot of threads, and that in itself makes moar coars viable for gaming.

There are some tasks that do still scale well with many cores (rendering, transcoding video, compiling, etc), and some single thread tasks can scale to multiple cores by running multiple jobs (encoding several MP3s at a time instead of just one).

I do realize that there will be programs forever stuck in a single thread, but software developers in general are going to have to admit that not embracing multithreading isn't going to give you any sort of advantage, especially if someone else gets multithreading working better.

We are in the middle of a paradigm shift and the ones who figure all of this out are the ones that are going to prosper. AMD has a huge opportunity here with their hardware and software.
 
This is what I mean when I say we need more cores. Basically, game developers stop trying to improve the parts of the game that are stuck in a single thread, and instead exploit the fact that Mantle (and derivative APIs) can take the task that used to be locked to one or two threads and spread it over a whole lot of them.

Actually, game engines are getting a LOT more efficient these days. You can still get some improvement out of optimization, generally by re-approaching the problem, but these workloads simply do not scale, period.

What we are doing isn't going to work, and if game developers continue with the same software and the same tools, games are going to hit a brick wall in terms of improvement.

Define "improving" for me? Performance? Graphics? Features? I would argue game's haven't improved at all in the past decade.

Game logic is going to be extremely difficult to break into multiple threads (it might not even be possible!), but rendering with things like Mantle, OGL NG, and DX12 will allow at least one piece of the puzzle to scale to a lot of threads, and that in itself makes moar coars viable for gaming.

The opposite, actually.

The fatal mistake people keep making with regard to scalability is assuming that more threads automatically lead to more performance. This is not true. It comes down to workload per core; if any core gets overworked, then performance suffers. Point being, if you have a thread that stresses a single CPU core to 150% of its maximum workload, it doesn't matter how many threads or cores you have, because you are limited by that one overworked CPU core. If the thread causing that workload happens to be a primary thread for that program, you're sunk.
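A back-of-the-envelope illustration of that point, with invented numbers: if one mandatory thread needs 25 ms of CPU time per frame on a single core, the frame rate is capped at 40 fps no matter how many extra cores soak up the rest of the work:

#include <algorithm>
#include <cstdio>

int main() {
    // Invented numbers: per-frame CPU work in milliseconds.
    const double heavyThreadMs  = 25.0;   // serial main/render thread, can't be split
    const double parallelWorkMs = 60.0;   // everything that can be spread across cores

    for (int cores : {2, 4, 8, 16}) {
        // The frame takes as long as the slowest core: the heavy thread
        // pins one core while the parallel work divides over the rest.
        double frameMs = std::max(heavyThreadMs, parallelWorkMs / (cores - 1));
        std::printf("%2d cores -> %.1f ms/frame (%.0f fps)\n",
                    cores, frameMs, 1000.0 / frameMs);
    }
    return 0;  // past 4 cores the heavy thread caps it at 40 fps
}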

That's why BD's approach was wrong: not for going to more cores, but because IPC suffered. As a result, even if software scaled in the way AMD was hoping, a single high-workload thread would throw all those possible performance benefits out the window. That's why Intel still pulls ahead in gaming: despite the fact that games are using more and more threads (sometimes 12+) that do "significant" work, the heavy-workload threads are the limiting performance factor. Intel's stronger cores can deal with this; AMD's weaker cores can't.

That's why dual cores are going to benefit most from the new APIs, since one of the two primary "big" threads for games should become a thing of the past. That leaves one big thread and a bunch of smaller ones, which is easier for a strong dual-core CPU to deal with. These APIs are literally the best thing that could happen to Intel, since its Pentium/i3 line becomes a LOT more viable for OEM PCs and mobile.

There are some tasks that do still scale well with many cores (rendering, transcoding video, compiling, etc), and some single thread tasks can scale to multiple cores by running multiple jobs (encoding several MP3s at a time instead of just one).

And we're right back to my primary argument: These tasks should be using the GPU anyway, since they are not just parallel, they are MASSIVELY parallel. Why use 8 cores when you can use 500?

I do realize that there will be programs forever stuck in a single thread, but software developers in general are going to have to admit that not embracing multithreading isn't going to give you any sort of advantage, especially if someone else gets multithreading working better.

We are in the middle of a paradigm shift and the ones who figure all of this out are the ones that are going to prosper. AMD has a huge opportunity here with their hardware and software.

It's not a matter of embracing it, it's a matter of feasibility. The majority of workloads simply do not scale, period. The stuff that does will get offloaded to the GPU. That leaves the CPU to deal with the non-threaded stuff. And guess what? That's exactly what we're doing now.
 

8350rocks

Distinguished


You are implying that they cannot possibly have 512-bit FMAC pipelines...

Why do you think they could not have 512-bit FMAC pipelines?

Carrizo will have 2x 256-bit FMAC pipelines...

That is double the performance... and if they run 512-bit FMAC pipelines, it will be 4x.

Also, they will not be behind by 8x. Your math assumes half efficiency per core, yet you are not aware of what the new architecture will have; in all likelihood, they will not share FMAC pipelines again, which doubles your number off the bat, before the FMAC pipelines are widened...
 

con635

Honorable
Oct 3, 2013

I could wait no longer to upgrade, I was so bored, so I now have my first Intel CPU since a Pentium P166 way back in the day, which replaced my trusty Amiga :( What can I say, it was cheap and it found me. Still feel a bit dirty though!
Indeed, despite the lower clock speed, in certain places in BF4 64-player MP I'm getting more performance, e.g. Shanghai, at the top of the big road looking down: 40-50 fps vs 90-112 fps. I play with everything on low in MP except mesh, for obvious reasons, so it has to be a CPU limitation. There's one example of 100%.
 

jdwii

Splendid


That is indeed a good example, and I saw that in that game as well when I tested it in my bench suite; however, I wasn't talking about overall performance, just per clock, per core. When the FX is clocked at its intended speed and it gets to use all of its cores, we see some nice performance.
 

szatkus

Honorable
Jul 9, 2013


As far as I know, AMD planned to have slightly better IPC than K10. Something just went wrong.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I believe you are confounding IPC with IPS.

 

jdwii

Splendid


Actually Juan, that is the official report :). But according to this http://forums.anandtech.com/showpost.php?p=32421412&postcount=106
it's clear the engineering department hoped IPC would go up.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


If you have a serial workload that relies on FP calculations, it is better to run it on the CPU, because CPUs are optimized for serial processing.

If you have a parallel workload that relies on FP calculations, it is better to run it on the GPU, because GPUs are optimized for parallel processing.

Thus FP units on CPUs are needed, and all companies are increasing their performance. AMD, Intel, ARM, Nvidia, Fujitsu, NEC... all of them have presented, or will soon present, CPUs with increased FP units.
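A tiny illustration of the difference (my example, not anything from AMD): the first loop has a loop-carried dependency, so each iteration must wait on the previous one; that is the serial FP work a fast CPU core handles best. The second loop's iterations are all independent, so it maps naturally onto hundreds of GPU threads:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n);

    // Serial FP: each step depends on the previous result, so the work
    // cannot be split up -- a high-IPC, high-clock CPU core wins here.
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * 0.999f + x[i];

    // Parallel FP: every element is independent, so this is the kind of
    // loop a GPU (or a wide SIMD unit) chews through far faster.
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + 3.0f;

    std::printf("acc = %f, y[0] = %f\n", acc, y[0]);
    return 0;
}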

Mantle, DX12, and OpenGL Next are something different. Those are graphics APIs, i.e. software layers used to program the processes in all stages of computer graphics generation without accessing the concrete hardware.

HSA is yet another thing. It is both a hardware and a software architecture that describes how compute units of different kinds (e.g. CPU cores and GPU cores) can work together in a cooperative fashion.

 

szatkus

Honorable
Jul 9, 2013


^This

Thank you. I would never have found this post :)
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Uh? We are discussing what AMD engineers planned and executed. The quote that I provided explains what the engineers did in the design of Bulldozer.

Bulldozer was never a brainiac design; it was a speed-demon design.

What one marketing guy, with a bad reputation, spread in forums before launch doesn't count.

http://scalibq.wordpress.com/2012/08/14/john-fruehe-leaves-amd/

http://www.zdnet.com/blog/berlind/amd-answers-to-controversial-use-of-retired-benchmarks-in-china/420

http://www.benchmarkemail.com/blogs/detail/in-social-media-marketing-honesty-is-everything
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Exactly! If some workload can be well-threaded, then it makes no sense to run it on a CPU, wasting power and die space, when it can run better and faster on a GPU.
 

szatkus

Honorable
Jul 9, 2013


I used the word "planned", so...
 

juanrga

Distinguished
BANNED
Mar 19, 2013
AMD has explained plenty of times why the moar-cores paradigm is close to its end and has to be replaced by the heterogeneous paradigm. The next slide explains some of the limitations of the moar-cores design, such as power and poor scalability:

[slide: hsa_history_large.jpg]


But a simple practical example is more useful

We will start with the 12-compute-core Kaveri APU. This APU has 4 CPU cores and 8 GPU cores, and the CPU occupies about 50% of the die:

http://www.amd.com/en-us/innovations/software-technologies/compute-cores

Now imagine that we are given the possibility of porting Kaveri to 20nm and the option of increasing the die size by 50%. For simplicity, we are given two choices: either double the number of CPU cores or double the number of GPU cores (bottlenecks, thermal issues, and other technical issues are ignored here for simplicity).

OK, the total throughput of the APU splits as follows:

[chart: maximum GFLOPS split between CPU and GPU - maximum-gflops.jpg]


If we double the number of CPU cores up to 8 cores and keep the 8 GPU cores, then the total performance increases from [118.4 + 737.3] to [236.8 + 737.3]. This is an increase of 13.8%.

If, instead, we double the number of GPU cores up to 16 cores and keep the 4 CPU cores, then the total performance rises from [118.4 + 737.3] to [118.4 + 1474.6]. This is 86.2% more performance.
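Working those sums out explicitly, using the GFLOPS figures quoted above: (236.8 + 737.3) / (118.4 + 737.3) = 974.1 / 855.7 ≈ 1.138, i.e. +13.8%, whereas (118.4 + 1474.6) / (118.4 + 737.3) = 1593.0 / 855.7 ≈ 1.862, i.e. +86.2%.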

Thus we finish with two APUs of about the same size, but one of them is much more powerful than the other. The reason is the following: the CPU uses latency cores, while the GPU uses throughput cores. Throughput cores are a kind of core optimized for performance/watt and performance/area. This is why the APU with more GPU cores provides more performance for the same area than the APU with more CPU cores.

Of course, reality is much more complex than all of the above, but this simple example illustrates why the traditional moar-cores approach is close to being a dead paradigm in computer design and is being replaced by heterogeneity.

For 2016, I would like to see a 20 core APU with 4 strong* CPU cores, 16 GPU cores and plenty of bandwidth thanks to stacked RAM.

* I mean both big IPC and frequency.
 

szatkus

Honorable
Jul 9, 2013


So we are talking about what AMD wanted to do, not what they delivered and told journalists in the end.
Good that we agree.
 

jdwii

Splendid


I always read his blogs. I used to not care for him when I was a blind fanboy, but now I think he is accurate 90% of the time.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


There are three different but related concepts here: what AMD initially planned; what AMD was finally able to produce; and what one person (later fired from AMD) said in blogs and forums.
 