AMD CPU speculation... and expert conjecture

gamerk316 · Jul 23, 2014

Cazalan :

When Juan needs to win an argument? Efficiency != Performance. And lets not forget how insanely bad iGPUs have been in the past, so generational gains of 50% performance, in the short term, are easily accomplishible.

Seriously, Juan is basically claiming we'll see 50x performance gains over a 6 year period. I rest my case; there's nothing else that needs to be said.

szatkus · Jul 23, 2014

gamerk316 :

Well, some algorithms gain performance from tighter integration. I'm not sure if it's a main purpose of APU, but that's why they're working on HSA and release server APUs.

8350rocks · Jul 23, 2014

juanrga :

Now, here comes the issue, by the time you have 40 TFLOP iGPU, you will have roughly 250 TFLOP dGPU with the same power budget...

As I said, GPU tech is growing faster than APU tech. As Palladin pointed out, any gains made in APUs will carry over to dGPUs as well. Meaning they will only get more efficient and powerful as time goes on.

So your poppycock about APUs will be more efficient defies basic laws of physics.

In fact, Juan, I task you with presenting a physics proof using laws of thermodynamics that show that your solution is accurate, and putting it all on one piece of silicon will somehow allow you to "automagically" gain 10x more performance than projections show without breaking any laws of physics. Include projected transistor counts, material analysis for the substrate required, and the node required to reach the efficiency.

Remember, marketing != reality.

h2323 · Jul 23, 2014

colinp :

Yes, lol juan and 8350...haha. good one. It's hilarious how many times they have said I am not responding to you.

8350rocks · Jul 23, 2014

gamerk316 :

No, the purpose of APUs is to get a halfway decent GPU on the die so you don't require an external GPU, essentially creating a SoC platform that can be used as an entry into mobile markets.

Your entire argument that you can put a CPU and dGPU on the same die and outperform both as standalones within a very tight heat/power budget is insane. I can confidently state at no point in time will any APU beat a CPU/dGPU within the same power budget.

Pretty much this...

gamerk316 · Jul 23, 2014

^^ Boy, remember the days you and I argued tooth and nail?

juanrga · Jul 23, 2014

@szatkus, the K12-based APUs will include stacked DRAM. My sources say me that AMD is considering stacked DRAM for desktop Carrizo. The discussion is not technological but economic. If Hynix prices relatively low the memory modules then we will see HBM-Carrizo APU, otherwise we will see HBM in the post-Carrizo APUs.

@colinp, current CPU design is stagnant due to slow memory. Precisely ARM, IBM, Micron, and others joined to develop stacked memory because current memory technologies are too slow for their future CPUs

Check the HMC FAQ:

What is the problem that HMC solves?
Over time, memory bandwidth has become a bottleneck to system performance in high-performance computing, high-end servers, graphics, and (very soon) mid-level servers. Conventional memory technologies are not scaling with Moore's Law; therefore, they are not keeping pace with the increasing performance demands of the latest microprocessor roadmaps. Microprocessor enablers are doubling cores and threads-per-core to greatly increase performance and workload capabilities by distributing work sets into smaller blocks and distributing them among an increasing number of work elements, i.e. cores. Having multiple compute elements per processor requires an increasing amount of memory per element. This results in a greater need for both memory bandwidth and memory density to be tightly coupled to a processor to address these challenges. The term "memory wall" has been used to describe this dilemma.

Why is the current DRAM technology unable to fully solve this problem?
Current memory technology roadmaps do not provide sufficient performance to meet the CPU and GPU memory bandwidth requirements.

[...]

What are the measurable benefits of HMC
HMC is a revolutionary innovation in DRAM memory architecture that sets a new standard for memory performance, power, reliability, and cost. This major technology leap breaks through the memory wall, unlocking previously unthinkable processing power and ushering in a new generation of computing.
* Increased Bandwidth — A single HMC unit can provide more than 15X the bandwidth of a DDR3 module.
* Reduced Latency – With vastly more responders built into HMC, we expect lower queue delays and higher bank availability, which will provide a substantial system latency reduction.
* Power Efficiency — The revolutionary architecture of HMC allows for greater power efficiency and energy savings, utilizing 70% less energy per bit than DDR3 DRAM technologies.
* Smaller Physical Footprint — The stacked architecture uses nearly 90% less physical space than today’s RDIMMs.
* Pliable to Multiple Platforms — Logic layer flexibility allows HMC to be tailored to multiple platforms and applications.

AMD is not member of HMC consortium. AMD joined JEDEC and developed HBM which is a similar stacked DRAM technology.

The impact on TDP will depend on the technology used and bandwidth required.

juanrga · Jul 23, 2014

@gamerk316, I have made the math and can confirm that your prediction about APUs performance will suffer the same fate than your "nobody will use mantel" prediction.

Why is the 25x efficiency gain confused with the 50x performance gain?

We know that the target of 50x performance by 2020 cannot be achieved with outdated dCPU+dGPU architectures, which don't scale up. This is the reason why AMD, Nvidia, Intel, IBM, Fujitsu, and everyone else are developing a new architectural paradigm around heterogeneity

Everyone knows that HSA is about bringing more performance.

@8350rocks, your imagined 250TFLOPS dGPU on 300W defies basic laws of physics. A dGPU with same power budget only provides <30TFLOPS as I mentioned before

http://www.tomshardware.co.uk/forum/352312-28-steamroller-speculation-expert-conjecture/page-294#13779829

I already showed the math and the physics of HSA APUs before in this thread. I don't need to repeat the same demonstration again and again.

No sure why you mention marketing. I am discussing a well-known international project and the advances made by different academic groups and goverment funded companies. The above slides were presented in a recent professional forum. Are not slides for final consumers.

An example of marketing was the claim that K12 core will be faster than Skylake core. K12 is not faster.

8350rocks · Jul 23, 2014

@JUAN:

A dGPU will ALWAYS be more power efficient than APU. It is a simple fact.

cdrkf · Jul 23, 2014

gamerk316 :

I really think everyone is missing the point of the 'APU'... The early 'APU's' we have now are about a reasonable baseline for iGPU graphics performance (which is something that was needed for years, ridiculous that people could pick up a new machine and it would be incapable of even running a whole host of applications due to lack of proper graphics support).

However- the whole key to the APU is *integration*. The very nature of this means that *if done right* you can save power compared to the same components working separately.

Now let me be clear- I'm not suggesting that for a *very focused* single purpose application (e.g. games) that a piece of silicon that does multiple things will be able to out perform the same number of transistors dedicated to that one goal. What I am saying though is the combined silicon can beat it on efficiancy.

Don't beleive me- then why waste on-die silicon with an integrated IMC? Why waste silicon with a math-co processor (FPU)?

As things stand now- a discreet graphics card requires PCIe lanes- a separate bus which uses power. If you built a HPC system around an APU, you would do away with all that- an APU dedicated to *pure efficiency* wouldn't need external anything. All it needs is a simple board with a good interconnect to allow you to lash as many together as you can.

The other thing to bear in mind- there is a limit to the amount of throughput you can achieve with an external interace. Ok- were not at that limit yet, but making wider and wider IO for higher and higher throughput will use energy- eventually the power cost of the interface alone will swing the perf/w argument in favour of the integrated design.

TL; DR for HPC perf/w is everything, and it's fairly conclusive that an integrated soc that eliminates external controlers and buses is more efficient than a number of separate items lashed together (don't beleive me, why are all mobile phones and tablets using single chip solutions?). I'm not saying that the end result will look much like today's APU (it could be 90% GPU for example) and not even that it will be outright faster than a dGPU (which I don't expect to go anywhere any time soon) *but* I think it will ultimately win in perf / w, which means it will also win for HPC.

cdrkf · Jul 23, 2014

8350rocks :

One issue with this- with the APU you have one piece of silicon dealing with host processing and graphics.

With the dGPU- yes it might use less power than the APU *for the graphics work* but you can't count the power consumption of just the dGPU and compare to the APU. You have to account for the power consumed by the CPU and Chip set as well, then all of sudden the efficiency doesn't look so great...

juanrga · Jul 23, 2014

@8350rocks you are repeating something is untrue.

I know because I have done the math. AMD has done the math. Nvidia has done the math. IBM has done the math. Intel has done the math. Fujitsu has done the math. Cray has done the math... All of us agree.

During his "25x20" talk Papermaster mentioned why APUs are more efficient and scale up, unlike dGPUs. In the last professional forum, whose slides are reproduced by WCCFTECH, AMD has officially announced (I knew this in private before, check my posts) that will use HSA APUs for its exascale supercomputers.

I don't know what are you trying to argue here. AMD already took the correct decision and posts in this forum are not going to change anything.

cdrkf · Jul 23, 2014

juanrga :

Who are you replying too Juan?

If you read my post I was making the point that the overall efficiency of an integrated solution (an APU in this context) is going to be better than separate components (CPU + Chipset + dGPU)- I was agreeing with you (shock horror!)

Edit: *if it's done well at least*

8350rocks · Jul 23, 2014

cdrkf :

GCN will run SSE4 instructions natively on the GPU.

So, while it may be less efficient, it can be done. If you are programmed for massively parallel tasks, such as HPC functions, then rather than run your x86 instructions on 256 CPU cores, you can run them on 4000 GCN cores and tie up far fewer resources.

So, while your argument may be valid in some way...you do not account for the fact that HSA works across a bus system as well...and you can also fit 4 GPUs into a board with 1 CPU. Which really does not greatly diminish the efficiency of the specific cluster in question, if anything, being able to have 1 16 core opteron run potentially 16000+ GCN cores would increase the efficiency over the APU.

For example 1 16 core Opteron + 4 W9800 GPUs = 1550W @ ~23 TFLOPS

To get to 23 TFLOPS with A10-7850K parts, you need ~25 of them which = 2375W @ ~23 TFLOPS

Now, if we were running Dual GPU cards in the first rig it would look like this:

2150W @ 46 TFLOPs for the same setup with 4 295X2 equivalent cards

Versus requiring 50 A10-7850k APUs or 4750W @ 46 TFLOPS.

Now, which solution looks more efficient per cluster?

cdrkf · Jul 23, 2014

8350rocks :

I agree with you here, this is looking at current tech though. The crux is- if you can eliminate that bus and those 4 slots, and end up with similar total resources the overall efficiency will be higher.

The key point here is simply that *if designed correctly* the APU should be more efficient. In your example the issue is the APU would have far too many CPU cores and not enough GPU cores- it's simple a matter of the balance being wrong. What I think we'll see happen is that the GPU portion of APU will continue to grow, whereas the CPU won't- we already know 4 decent cores can keep up with a few thousand GPU cores so the *optimal* APU would need to match that balance.

I think the problem at the moment is that you simply can't squeeze enough transistors into a socked chip to realise that kind of configuration. There is going to come a point where the process node is small enough that you can reach an optimal balance (maybe 1 cpu core to 1000 gpu kinda level) where neither side of the chip is bottlenecked and at that point the APU should be more efficient.

Lots of if's and maybes I know (and quite probably something totally new will throw this whole argument out of the window) however I think that is the kinda goal they are looking at long term if things pan out as expected.

Cazalan · Jul 23, 2014

gamerk316 :

What? You don't want APUs the size of a whole wafer? :lol:

juanrga · Jul 23, 2014

cdrkf :

I know we agree. I was answering 8350rocks but your post appeared just before I submitted mine. Thus my reply was for the post just above the yours. I edited my post to make it clear I was answering 8350rocks.

jdwii · Jul 23, 2014

Juan i do not see it sorry i just don't i see using a dGPU+APU as the common gamer rig but not a APU alone since you have to share some of the transistors with the CPU and South bridge and north bridge and memory controller where on the dGPU you do not need to use that as part of the silicon you can just add more radeon cores or cuda cores making this whole argument kinda silly. Then when it comes to heat the more transistors you fit into a APU the hotter it gets. Not to mention its going to be hard for Amd/Iintel/Nvidia to get past 5nm making the dGPU even more needed in high end gaming rigs

Cazalan · Jul 23, 2014

So these new APU/CPUs will end up looking quite similar to these old Sun Sparc CPUs. Just instead of a PCB it will be a silicon interposer. And instead of SRAM chips they'll be 3D RAM stacks.

palladin9479 · Jul 23, 2014

Here ya go

http://s3-eu-west-1.amazonaws.com/ebayproductphotos/si/ns1147.jpg

CPU module for a V880/V480. Each module has two UltraSparc IIIi CPU's and their accompany memory banks. The V880 takes four cards, the V480 takes two.

Or for the really insane stuff

Here is what the CPU cards look like

The E6500 could hold 32 cards though only 30 could be used for CPU's as you needed one IO card and one storage card.

Talk about old school.

wh3resmycar · Jul 24, 2014

didn't palladin put a nail in the coffin from that juanrga argument that apu will replace dgpu/dcpu? COOLING? why is he on it again? 🙁

-Fran- · Jul 24, 2014

I still don't understand why you guys can't agree to disagree.

The ego boats are way too big it seems 😛

Cheers!

esrever · Jul 24, 2014

I wouldn't be surprised if local computing just stops being needed when almost all heavy computations could eventually be done on a cloud server. The technology in cloud computing is still fresh and not very ideal as of now but I think it can potentially be dominant in the years to come. There is only so many things you would need to do locally that you can't do when APUs are powerful enough to be 10s of terraflops. It will probably be a long time off but it could happen.

griptwister · Jul 24, 2014

Ooohhh, Something to speculate on guys 😀

https://www.youtube.com/watch?v=I7dReusG2aM

juanrga · Jul 24, 2014

jdwii :

A dGPU at 10nm only will have ~10% more transistors than iGPU. A dGPU requires transistors for discrete logic. iGPU has latency and bandwidth advantages. APUs benefit from HSA and similar approaches. At 10nm the power wall is of about 10x...

In short:

Both AMD and Nvidia abandon dCPU+dGPU and will build the fastest computers using APUs. Check AMD talks reproduced by WCCFTECH. Check the Nvidia slide that I reproduced, the slide where Nvidia GPUs are replaced by Nvidia HCNs (Heterogeneous Compute Nodes aka 'APUs'). Now check my signature.

Cazalan :

Future APUs/SoCs (HCNs in Nvidia parlance) also include SSDs. AMD only draws the SSD integrated in the SoC in a illustrative diagram given during a keynote. Nvidia gives more details. Their project mentions a 512 GB Flash or PCRAM SSD in the HCN.

AMD CPU speculation... and expert conjecture

Glorious

Honorable

Distinguished

Distinguished

Distinguished

Glorious

Distinguished

Distinguished

Distinguished

Judicious

Judicious

Distinguished

Judicious

Distinguished

Judicious

Distinguished

Distinguished

Splendid

Distinguished

Splendid

Distinguished

Glorious

Splendid

Distinguished

Distinguished

Share this page