AMD CPU speculation... and expert conjecture


nowhere on the page nor on the promo slide does it say that the cpu and gpu can manipulate the same data at once, nor how. the page and the slide talk about sharing resources, using the same memory space, and coherency, but not about the same data, at the same time. the phoronix link i posted says hsa is about letting different processor types share system resources more effectively, and mentions shared pageable memory as an example, but not "CPU and GPU manipulating cooperatively the same data at once". now it looks like you tried to b.s. through an argument, got caught, and tried to b.s. through again (edit: from the looks of it, with the first result from the first page of a google search instead of in-depth research)... and got caught, again.

if more knowledgeable people find my statement incorrect, feel free to prove it with a credible, technical explanation. i'd be happy to be wrong, since it'd mean amd has revolutionized throughput computing and hsa.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


It is true that I predicted this was going to happen (I did the math about a year ago), but here I want to emphasize what AMD already admits openly:

http://www.theregister.co.uk/2014/06/20/amd_25x20_power_efficiency_pledge/

http://www.amd.com/en-us/press-releases/Pages/amd-accelerates-energy-2014jun19.aspx

Mark Papermaster's recent talk about AMD's plans for the year 2020 doesn't mention dGPUs at any moment, because AMD knows that dGPUs don't scale up. Papermaster clearly mentioned that silicon alone cannot bring a 25x efficiency gain and that only APUs with HSA can achieve that goal.

Current A10-Kaveri: ~800 GFLOPS on ~100 W

Multiplying by the 25x efficiency gain:

APU of year 2020: ~20 TFLOPS on ~100 W

Intel will start selling a 200 W processor next year. If we consider that higher TDP, then:

APU of year 2020: ~40 TFLOPS on ~200 W

When Papermaster mentions "25x" he is rounding for marketing purposes. Moreover, the above numbers lack the on-die memory, because we use Kaveri as the baseline. I did all the math and I expect something closer to 250 W for a 40 TFLOPS APU. Nvidia's engineers also did the math, and they expect their own 40 TFLOPS APU to be rated at 300 W TDP.

AMD APU of year 2020: ~40 TFLOPS on ~250 W
Nvidia APU of year 2020: ~40 TFLOPS on ~300 W

TDPs can vary by ~50 W depending on the amount of on-die memory and its bandwidth.
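As a sanity check, the projection above in a few lines of Python (figures as quoted in this post; the 200 W point is a simple linear extrapolation):

```python
# Back-of-envelope check of the 25x projection (figures from this post).
kaveri_tflops, kaveri_watts = 0.8, 100   # A10 Kaveri: ~800 GFLOPS on ~100 W
gain = 25                                # Papermaster's 25x efficiency pledge

tflops_per_watt_2020 = (kaveri_tflops / kaveri_watts) * gain
print(tflops_per_watt_2020 * 100)        # ~20 TFLOPS at 100 W
print(tflops_per_watt_2020 * 200)        # ~40 TFLOPS at a 200 W TDP
```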

There is no way an engineer can obtain 40 TFLOPS from a 250--300 W dGPU. Silicon alone will only provide ~2.8x, thus:

R9-290X: ~5.6 TFLOPS on 290 W

dGPU of year 2020: ~16 TFLOPS on 290 W
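The same exercise for the dGPU case, a minimal sketch using the ~2.8x silicon-only gain from this post:

```python
# Silicon alone: only the ~2.8x process gain applies, at the same TDP.
r9_290x_tflops = 5.6                 # at 290 W, per this post
print(r9_290x_tflops * 2.8)          # ~15.7, i.e. ~16 TFLOPS on 290 W in 2020
```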

...................................................................................

We can continue with this interesting stuff. Using these APUs to build a fast supercomputer for the year 2020:

100000 APUs of year 2020: ~4000000 TFLOPS on ~20000000 W

This corresponds to an exascale supercomputer on 20 MW, which is precisely the DARPA goal for the years 2018-2020. This is why Cray, Nvidia, AMD, Intel, Fujitsu... are designing future supercomputers around APUs/SoCs. Nobody on the whole planet is using dGPUs, because they don't scale up.
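And the aggregation step, again with the per-APU figures used above (the 40 TFLOPS / 200 W point):

```python
# Aggregating the projected APUs into the DARPA-class machine described above.
apus = 100_000
tflops_each, watts_each = 40, 200           # per-APU figures used in this post
print(apus * tflops_each / 1e6, "EFLOPS")   # 4.0 EFLOPS (exascale)
print(apus * watts_each / 1e6, "MW")        # 20.0 MW
```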
 

blackkstar

Honorable
Sep 30, 2012
Juan, you still don't get my point. You act like the APU will solve all the problems of dCPU and dGPU and let everything scale. But you forget that HPC APUs will need more than one APU to scale, and at that point you end up in the same situation as you would with dGPU and dCPU: off-die dCPUs and dGPUs sharing coherent memory. Meaning that what I'm saying is going to have to happen anyway. The only difference with the APU is that each socket is shared between CPU and GPU. You will still need to transfer data across each APU as well; that bottleneck is still going to exist when you stitch all your APUs together. Yet you somehow think that invalidates dCPU and dGPU in every single market because HPC is pushing for many APUs.

So all these issues like PCIe 3.0 latency and such are going to have to be fixed eventually. There is no way around it. And that fix will apply to dGPU and dCPU systems as well.

And as others have noted, APUs will not scale to as many cores as their discrete counterparts. People are also forgetting that dGPUs and (probably) dCPUs will also scale in performance as time goes on. The APU is not the only device that will be increasing in performance until 2020.

The problem with the APU, the way you are describing it, is that it introduces many more bandwidth and latency problems between each APU, because you need more APUs to do the task of a single large GPU.

In 2008, six years ago, we were using the 8800 GTS as the high end and we were transitioning away from the AGP bus. In 2002 we had the 9700 from ATI. Do you see how the APU growing so much in 6 years does not make it a special snowflake?

Let that sink in for a moment.

Your other problem is that you are basically comparing a more efficient mid-range GPU on an APU to a high-end dGPU. Would you care to explain to me how having a CPU with low throughput bolted onto every single GPU in the system would be better than many smaller, more efficient GPUs on their own? Why have those CPUs sitting there contributing to TDP and power consumption when you can just have one CPU with many smaller, more efficient GPUs?

Who is to say that an efficient dGPU like the 7790 or 750 Ti won't scale just like you're saying the APU will? It's clear that high-end parts are having issues, but if you take the GPU from an APU that's scaling like that and do something like put it into a socket (sort of like Knights Landing), you save on having a useless CPU adding to the TDP.

And as I said earlier, dGPUs shared like that will have the exact same problems as sharing many APUs.
 

jdwii

Splendid


Can we stop with the beloved Nvidia crap? As it stands, they have a superior design that uses 30% less power for the same performance. Nvidia does a lot of closed crap, but they are good at what they do... graphics.
 

jdwii

Splendid


Show me a link proving that it has IPC on the level of Haswell; it's your claim, so prove it. Also, let's all sit back and love Juan for acting like TFLOPS are the only performance measure. If that were true, we would see AMD's 280X be around 25% faster ALL the time compared to a GTX 770. Also, claims mean nothing at all, and by his own findings he used 25%, which wasn't anywhere near enough for your claims of 20+ TFLOPS; and you still haven't told everyone that it's not the only performance metric.
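For what it's worth, the raw-TFLOPS gap referred to here can be reproduced from peak-throughput math alone; the shader counts and clocks below are my approximate reference figures (assumptions), not numbers from the post:

```python
# Peak FP32 throughput = 2 FLOPs x shaders x clock (GHz).
def peak_tflops(shaders, clock_ghz):
    return 2 * shaders * clock_ghz / 1000

r9_280x = peak_tflops(2048, 1.00)   # ~4.1 TFLOPS (approximate specs)
gtx_770 = peak_tflops(1536, 1.05)   # ~3.2 TFLOPS (approximate specs)
print(f"280X: ~{(r9_280x / gtx_770 - 1) * 100:.0f}% more peak TFLOPS")  # ~27%
```

That ~27% paper advantage rarely shows up as a ~27% framerate advantage, which is the point: peak TFLOPS is one metric among many.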
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I am telling you that I cannot predict what you asked me about framerates for the year 2016. What part of my message, "It is not easy to predict, because it depends on lots of factors: market evolution, foundry roadmaps, memory maker roadmaps,...", did you not get?

Second, don't forget that those claims about GPUs are not only mine, but also come from GPU companies such as Nvidia.

Third, the market couldn't care less about what you care about ;-) We know that future games will use the GPU more and more for compute, not only for rendering.

Moreover, leading game developers are already studying abandoning the rendering techniques used in current GPUs and replacing them with a new arsenal of compute-based rendering tools. One of the more interesting parts of those new techniques is that they avoid the intermediate layers of graphics APIs, and rely on general languages such as CUDA to directly access the hardware's general compute capabilities, bypassing the graphics pipeline completely.

This is a very, very old demo of rendering using CUDA, but it can give you an idea:

http://hothardware.com/News/NVIDIA-Shows-Interactive-Ray-Tracing-on-GPUs/
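To make the idea concrete, here is a minimal sketch (an illustration of mine, not code from the demo) of compute-based rendering: a sphere ray-traced entirely with general array compute (numpy standing in for CUDA), with no graphics API anywhere in the loop:

```python
import numpy as np

W, H = 320, 240
xs = np.linspace(-1, 1, W)
ys = np.linspace(-0.75, 0.75, H)
px, py = np.meshgrid(xs, ys)                    # one primary ray per pixel
dirs = np.stack([px, py, -np.ones_like(px)], axis=-1)
dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

center = np.array([0.0, 0.0, -3.0])             # sphere 3 units in front
radius = 1.0

# Ray-sphere test: t^2 + 2 t (d . oc) + |oc|^2 - r^2 = 0, ray origin at 0.
oc = -center
b = dirs @ oc                                   # (H, W) dot products
disc = b * b - (oc @ oc - radius * radius)
hit = disc > 0
t = -b - np.sqrt(np.where(hit, disc, 0.0))      # nearest intersection

normals = (dirs * t[..., None] - center) / radius
light = np.array([0.577, 0.577, 0.577])         # unit light direction
shade = np.clip(normals @ light, 0.0, 1.0)      # simple Lambert term
image = np.where(hit, shade, 0.0)               # H x W grayscale framebuffer
print(image.shape, round(float(image.max()), 3))
```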

Finally, it doesn't matter whether you care about KL or not. By outperforming a 290X, KL puts in their place those 'experts' here who claimed it would be impossible to hit 290X-level raw performance. In the second place, KL will put lots of pressure on the GPU divisions of both Nvidia and AMD. Both companies will need to accelerate their respective SoC plans.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


I already suspected that I would be wasting my time explaining this to you, because you would not understand anything and you would return with your perennial fantasies about "caught", "bs", "lies"...

I gave you a pair of links, but I am not surprised you cannot understand them either.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


Your point was understood and debunked earlier, but you act as if it never was.

Your strategy starts with a fallacy. In the post just above yours, I mentioned how one will need about 100000 APUs for a 20 MW HPC system. You start by ignoring what I said with your fallacy "you forget that HPC APUs will need more than one APU to scale".

Then you continue by adding another fallacy. You pretend that an APU in a socket needs to feed APUs in other sockets, which is false. Inside an APU, the CPU feeds the GPU. But an APU doesn't feed another APU. Only in some special cases can an APU require computation from another; moreover, the software is being explicitly designed to reduce such situations to a minimum and run computations locally.

In your CPU-in-one-socket-and-GPU-in-another-socket idea, we would always have to move computations between CPU and GPU, wasting latency, bandwidth, and energy. Your idea brings a much slower and more power-hungry system. This is the reason why no engineer considers your idea (APUs are used instead).
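A toy model of the energy side of that argument; the per-byte costs below are invented placeholders just to show the shape of the trade-off, not measured figures:

```python
# Illustrative constants only; real numbers depend heavily on the link and node.
ON_DIE_PJ_PER_BYTE = 1.0        # assumed: CPU handing data to the GPU on-die
OFF_SOCKET_PJ_PER_BYTE = 20.0   # assumed: crossing a socket-to-socket link

def transfer_joules(gigabytes, pj_per_byte):
    return gigabytes * 1e9 * pj_per_byte * 1e-12

traffic_gb = 100.0  # hypothetical CPU->GPU traffic for one job
print("on-die    :", transfer_joules(traffic_gb, ON_DIE_PJ_PER_BYTE), "J")
print("off-socket:", transfer_joules(traffic_gb, OFF_SOCKET_PJ_PER_BYTE), "J")
```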

Finally, you post another fallacy, pretending that CPUs and GPUs will scale magically forever, when we know this is untrue (you ignored everything I wrote about this).

The funny part is that what you propose, using GPUs in sockets, is an old idea that was evaluated and rejected by AMD many years ago:

AMD's leading software expert Neal Robison said that the Fusion architecture, which integrates general-purpose [x86] processing cores with the highly-parallel stream processors of Radeon GPUs, is a better solution for high-performance computing than installing special-purpose accelerators into CPU sockets. According to AMD, "it makes more sense from the software developers standpoint". Besides, the investment into tools "has already been made so we might as well use it". It looks like the once-proposed Torrenza platform is no longer even considered viable.

"APU is a better and cleaner solution than sticking a GPU in the same socket," said Neal Robison.

http://www.xbitlabs.com/news/other/display/20111211180811_AMD_GPGPU_Accelerators_in_CPU_Sockets_Make_No_Sense.html

I find it amazing that Gamerk expects that the green company that said dGPUs will be replaced by 2020 will continue making dGPUs for him, and that you expect the red company that killed its Torrenza platform to make GPUs in sockets for you.
 

your parroted links do not contain any remote indication of how this works. i call it parroting because you utterly failed to provide any personal understanding of how that would even work. it's safe to call you a liar now, since your own provided info does not contain any such claims made by amd. seems like you can't explain it because you don't even know how it would work. even if amd had such technology, you sure didn't provide any link to it.

edit: i re-read your reply and... "I already suspected that" - you already suspected that you'd get caught lying? or did you expect that if you lied you'd be called out? ...then don't resort to lying, and post credible and properly represented info, LOL.
 

jdwii

Splendid


No, you can't use the same memory at the same time. Also, for future reference, tell people what fallacy they are committing instead of just calling it a fallacy.
 

jdwii

Splendid


They won't get rid of them until they see fit. Juan claims they have no dGPUs planned, but why would they need to tell us what they are making in that market 6 years from now, when most of the information Juan provides mainly talks about servers, not gaming rigs? I could swear Juan claimed that we can't put sound on our video cards, yet sound cards exist; how does that even work using his "logic"?
 

you can use the same memory space - memory coherency. for example, multicore cpus use cache and system memory this way. the cores can read the data, but as soon as one core changes the data, the rest of the processor must be notified of the change.
or in multisocket servers (out of my knowledge scope), where hypertransport or qpi links are used to maintain coherence.
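A toy sketch of that invalidate-on-write behaviour (hypothetical code of mine, loosely MSI-style; real protocols track more states and don't write through like this):

```python
class Cache:
    """One core's private cache: addr -> (state, value)."""
    def __init__(self, name):
        self.name, self.lines = name, {}

    def read(self, addr, memory):
        if addr not in self.lines:            # miss: fetch a Shared copy
            self.lines[addr] = ('S', memory[addr])
        return self.lines[addr][1]

    def write(self, addr, value, memory, peers):
        for p in peers:                       # notify the other cores:
            p.lines.pop(addr, None)           # their copies get invalidated
        self.lines[addr] = ('M', value)       # our copy is now Modified
        memory[addr] = value                  # write-through, for brevity

memory = {0x10: 42}
c0, c1 = Cache('core0'), Cache('core1')
print(c0.read(0x10, memory), c1.read(0x10, memory))   # both cores share 42
c0.write(0x10, 99, memory, peers=[c1])                # core0 changes the data
print(c1.read(0x10, memory))                          # core1 re-fetches: 99
```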
 

jdwii

Splendid
http://cpuboss.com/cpus/Nvidia-Tegra-4-vs-Intel-Core-i7-4650U
Man, I can't even begin to find any benchmarks comparing ARM's best with Haswell; this is the closest thing I've got for IPC. They are clocked around the same (the ARM chip, Tegra 4, is clocked 200 MHz higher), yet it's still 38.4% slower on Geekbench while clocked ~11% higher, meaning Tegra is around half the performance per clock compared to Haswell (cough, PD). Based on proof and not claims, I don't see anyone using ARM getting Haswell IPC soon. Edit: also, looking into it, it seems like Geekbench favors SoCs.
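The per-clock arithmetic implied there, for the record (ratios as quoted in the post):

```python
score_ratio = 1 - 0.384   # Tegra 4 scores ~38.4% lower overall
clock_ratio = 1.11        # ...while running at an ~11% higher clock
print(f"per-clock throughput: ~{score_ratio / clock_ratio:.2f}x Haswell")  # ~0.55x
```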
 
dGPUs will eventually be like sound cards. It will take a while, though. It gets to a point where even gamers don't need or want more than what integrated will give them. Technically, the consoles are already gaming PCs that run integrated graphics. Give it a couple more generations and who knows.
 

szatkus

Honorable
Jul 9, 2013


Couldn't you find a worse benchmark? Use Phoronix; there are a lot of ARM/x86 benchmarks there.
 

jdwii

Splendid


None came up. I tried finding a benchmark for ARM vs Haswell IPC. Very hard.
 

colinp

Honorable
Jun 27, 2012

Don't take that condescending attitude. What I get is that you are making bold claims about dGPUs being obsolete by 2020, but can't even articulate what the picture will look like in 18 months.


Don't take that condescending attitude. The market does care about what its customers want. And gamers want high frame rates at native resolutions with all the sliders to the right.

Finally, to reiterate, can you please stop taking that condescending attitude? I'm here to have a discussion, ask some questions, get some answers, etc.
 

ah, discrete sound cards... good times. yeah, it'll get to a point where consumers will get "good enough" gaming performance (e.g. 1080p @ 60 fps and decent antialiasing) from a cpu and igpu at a cheap price. carrizo might do that.
i wonder how much power those socs will use after a node shrink, and whether someone will jailbreak a ps4 and run an x86 o.s. like ubuntu or freedos on it... an soc like that in an intel nuc or gigabyte brix form factor case would make a great living room gaming pc.
 

szatkus

Honorable
Jul 9, 2013


Because Haswell is so much different from SB and IB.
 

8350rocks

Distinguished


Except the name is decided, just not yet released.

Also, if they aim to win on PPW, and do so at a similar TDP to Intel's, how can you say their goal is not outright performance?

If PPW is greater on a chip with a ~70-80 W TDP than on another chip with a similar/same TDP, is outright performance better? Yes!
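That is just the identity performance = PPW x watts; with made-up numbers:

```python
# At equal TDP, the higher-PPW chip simply is the faster chip (numbers invented).
ppw_a, ppw_b, tdp_watts = 10.0, 8.0, 80
print(ppw_a * tdp_watts, "vs", ppw_b * tdp_watts)   # 800.0 vs 640.0
```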
 

blackkstar

Honorable
Sep 30, 2012


Actually, AVX2 can make a massive difference. But people insist on running Windows only and letting half their CPUs sit idle, because all they run is x87 or SSE2 code all day long, and they are so afraid of moving away that they get all grumpy and defend things like crazy.
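As a rough stand-in for the effect (scalar element-at-a-time code versus a SIMD-capable kernel; the gap here also includes interpreter overhead, so treat it as an analogy, not a measurement of AVX2):

```python
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

t0 = time.perf_counter()
s = 0.0
for x, y in zip(a, b):     # one element at a time, like unvectorised code
    s += x * y
t1 = time.perf_counter()

t2 = time.perf_counter()
s2 = float(a @ b)          # vector-unit-friendly kernel, same dot product
t3 = time.perf_counter()

print(f"scalar loop: {t1 - t0:.3f}s   vectorised dot: {t3 - t2:.5f}s")
```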

Also, Juan, my point still stands. Yeah, the previous slide does have APUs, but I took those slides to mean that they were going for APU implementations right now and leaving the door open to dGPU + dCPU HSA/hUMA later on.

No one said a dGPU in HPC needs to be on a discrete card, or needs to be the biggest GPU you can find. An APU is efficient because it has a GPU that offers very good performance per watt, not because it is an APU.

You compare the highest-end dGPU to an APU and then go "wow, the PPW sucks!" The GTX 750 Ti was twice as efficient at cryptocurrency mining as the 260X.

https://litecoin.info/Mining_hardware_comparison

A 290X, not overclocked much, does about 700 KH/s. That's a 300 W card. It is twice as inefficient as the GTX 750 Ti, which does 256 KH/s at 60 W.
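The efficiency math behind that comparison, using the rates and TDPs as quoted:

```python
r9_290x   = 700 / 300   # ~2.3 KH/s per watt
gtx_750ti = 256 / 60    # ~4.3 KH/s per watt
print(f"750 Ti is ~{gtx_750ti / r9_290x:.1f}x more efficient")  # ~1.8x, roughly the 2x claimed
```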

If you are going to compare the APU's GPU on PPW, try comparing it to a GPU of similar size and TDP. You will see that the benefit not only goes away, but that it is worse on an APU, because you need all the CPU circuitry on the same die. You also have to optimize the process for both GPU and CPU instead of just the GPU.

And as I've been saying, you're going to have the problem of inter-chip communication regardless of whether they're GPUs, CPUs, or APUs.
 

szatkus

Honorable
Jul 9, 2013


It's not only Windows; most Linux software is compiled for a Pentium 4-class CPU. Exceptions are distributions like Gentoo, but in general most Linux, Windows, and Mac users install precompiled software. There's also Java, which can be optimized for the particular CPU at runtime, but it's almost non-existent on desktops.

Phoronix is biased, because they use the -march=native flag when compiling their benchmarks, but it's the best we have.
 