AMD CPUs, SoC Rumors and Speculations Temp. thread 2

The most expensive part of putting 4096 SPUs on die in a SoC would be the fact that you now have to hardwire HBM to the package itself to get the performance that part needs.

Sure, you can fit the die together and make a big SoC now... but the memory to keep it fed is still relatively cost-prohibitive for large volume. Five years from now... it might not be that big a deal...
 


We may not know what percentage of the die will be CPU, but to say things aren't heading toward an SoC future would be a lie. There are simply too many benefits, and eventually these APUs will be able to handle power-user workloads like CAD and gaming. Hell, they can already do these things, just slowly.
 


In 5 years I would bet only enthusiasts have it, but in 10 years on-die RAM will be all anyone uses.
(Look at 2010: Llano was crappy, but enthusiasts gave it a try. My Llano APU played JC2, and I remember how amazed I was at that thing.)

Hell, in 10 years you could see only human interface devices plugging into the SoC: display port, USB ports (keyboard, mouse, etc.), network antenna, and battery, all plugging into what would look like a big heatsink, and it would be faster than today's i7. Just direct wires from the SoC to the HIDs, since on-die latency is the lowest. Desktops would all move to being AIOs; phones will be about the same. Batteries have to be batteries. But what the next decade will hold just gets me excited. I really think APUs are the future for all users.

Take a second, guys, and remember what 2005 was like! The new Carrizo APU is faster than the best gaming desktop of that day with 7800 SLI ($2,400 new), uses 15 W, and costs $600 in a laptop.
 


AMD management is well known for taking wrong decisions and pushing odd strategies. Rory brought some sanity and stabilized the company. Lisa has returned to that odd style of management.
 


The EHP APU described by AMD engineers has far more than 4096 streaming processors.
 


Indeed, the arguments about transistor counts, speed, power, TDP, etc. were all refuted in the older thread, where I showed that dGPUs have no future.

This year, AMD engineers published a paper in an IEEE journal about their future plans, and it has confirmed my earlier thoughts. I finally managed to get a copy of their work. I will reproduce the part where the AMD engineers explain why they rejected CPU-only and dGPU designs:

Beyond the envisioned heterogeneous system we have described, we should consider other candidate system organizations. For example, a node organization comprising only CPU cores (that is, no GPU acceleration) could provide programmability benefits. However, our analysis indicates that a CPU-only solution does not provide a sufficient performance-per-watt advantage to enable the target levels of execution throughput within the desired power budgets. Another approach would be to augment a conventional multicore CPU with external (discrete) GPU cards. We believe that the integrated GPU approach has the following advantages over external GPUs:


    ■ lower overheads (both latency and energy) for communicating between the CPU and GPU for both data movement and launching tasks/kernels,
    ■ easier dynamic power shifting between the CPU and GPU,
    ■ lower overheads for cache coherence and synchronization among the CPU and GPU cache hierarchies that in turn improve programmability, and
    ■ higher flops per m³ (performance density).
 
Juan, notice that nowhere in your paper does a conclusion of "better absolute performance" show up. The only benefits listed are efficiency- and overhead-related. As far as absolute performance goes, dGPUs are going to remain more powerful, for reasons that have been explained to you over and over and over again.

Yes, continued shrinkage will allow more powerful APUs to be created, but as long as dedicated GPUs offer more performance, which they will due to power/thermal/cost constraints, there will continue to be demand for them.
 


No. The paper makes it clear that the goal is to provide the fastest computer in history. The exascale goal is to achieve 50x the current performance. It is not viable to increase power consumption by 50x, which leaves engineers only one option: to increase efficiency by a factor of 25x in the next few years.

[Attached image: DARPA-goals.png]
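To put rough numbers on that claim (my own back-of-the-envelope, not figures quoted from the paper; the ~2x power headroom is the assumption implied by the 50x and 25x targets):

$$\text{efficiency gain needed} \;=\; \frac{\text{performance target}}{\text{allowable power increase}} \;\approx\; \frac{50\times}{\sim 2\times} \;=\; 25\times$$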




Precisely: the reason for all the exascale research is that one cannot achieve that performance level using only process node shrinks. You cannot take current CPUs and GPUs, port them to a future 10nm or 7nm node, and magically get a 50x faster computer. As I have explained to you plenty of times, the current architectures don't scale up enough and have to be abandoned. That is why new computer architectures are being explored and designed by engineers.
 


The thing is, eventually the performance benefit of a discrete GPU over an integrated solution will be so small that it will no longer make sense to produce them.

Currently the fastest APU sits around the performance of the R7 240 discrete GPU, which is capable (just about) of playing most games at 720p. This isn't an experience most gamers want, and as a result there is a big market for discrete cards.

Now, let's hypothesize that APUs were way further ahead than they are today, and that the top APU included a fast processor and a GPU in the R9 390 / GTX 970 range of performance. That would leave only one tier of GPU *above* the integrated solution. Now imagine that APU cost the same or slightly less than the equivalent CPU + discrete card. So the *only* situation in which a buyer wants the discrete card is if they need the very maximum performance, and they will have to add the cost of an expensive CPU + the GPU compared to a combined solution that costs considerably less.

Now, I agree there will probably be a small market for that top discrete card; however, with another node shrink the performance difference gets *even smaller* (maybe getting to an equivalent product position of Fury including the CPU, versus a discrete card equivalent to Fury X). At that point we are talking about a 5% performance gain for the discrete card.

Also at that point, all of a sudden the small gains from reduced latency and avoiding communication over an external bus start to make up that 5%. That is when dGPUs die off. We're nowhere near that now, I agree; however, I'll be interested to see what the Zen-based APU looks like on 14nm - that could potentially have an R9 270X-class GPU, or even R9 380 level. That would certainly start to reach mainstream gaming at the 'console +' levels of fidelity that PC gamers want.

So many other discrete accelerators have been absorbed into the CPU that I really can see how the CPU and GPU will be integrated in the future. The point people are missing, however, is that it's the CPU portion that will be absorbed into what will effectively become a CPGPU, as it's the GPU that scales up to all available die space. As so many have pointed out, for gaming or consumer workloads, more than 4 cores / 8 threads is pretty pointless right now. Maybe 8 cores / 16 threads will become useful in the future, but so much software relies on single-threaded code that there is likely to be a limit. This is what permits the integration of the two without significant loss of performance, assuming the total transistor budget is large enough to make that CPU portion effectively insignificant.
 


I don't believe Lisa has been in charge long enough to have had *any* real impact on company decisions or strategies, other than fighting fires left over from previous company heads. What is happening now was likely all put into motion by Rory and was predetermined before Lisa took her post. Look to what happens after the Zen launch to see the results of Lisa's leadership, IMO.
 
https://en.wikipedia.org/wiki/Gustafson%27s_law

As processing power increases, developers will find a way to use it. Sure, in a few years you make an APU with GTX 970 performance. Too bad games will require about 5x the GPU power by then, because of super-accurate lighting effects that eat performance, new AA modes, more advanced AI, physics, and the like.
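For reference, the standard textbook statement of Gustafson's law (my summary, not quoted from the linked article): with $N$ parallel processors and a serial fraction $s$ of the scaled-up workload, the achievable speedup is

$$S(N) \;=\; s + (1 - s)\,N$$

i.e., unlike Amdahl's law, it assumes the problem grows along with the hardware - which is exactly the "developers will find a way to use it" point.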

So you still have the same problem: APUs are going to be power/thermal limited, whereas dedicated GPUs are freer from that restriction, being on a separate package.
 


As I explained to you in the other thread, the scaling of silicon is non-linear, and at 10/7nm the cost of moving data off-chip will be about 10x higher than on current nodes. This is known as the locality principle (computations have to be local) in exascale research, and it is the reason why future designs are SoCs. I quoted AMD engineers before explaining why they rejected discrete GPUs. I will now quote Nvidia engineers explaining why they are doing the same:

In this time frame, GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture. Such a system eliminates some of accelerator architectures’ historical challenges, including requiring the programmer to manage multiple memory spaces, suffering from bandwidth limitations from an interface such as PCI Express for transfers between CPUs and GPUs, and the system-level energy overheads for both chip crossings and replicated chip infrastructure.
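For illustration only, and not code from either paper: a minimal CUDA sketch of the difference the Nvidia quote describes. The explicit-copy path is what a discrete card behind PCI Express imposes; the managed-memory path approximates the single unified address space an integrated design assumes. The kernel, sizes, and function names are placeholders.

```cpp
// Hedged sketch: explicit staging over PCIe vs. a unified address space.
// scale_kernel, discrete_gpu_style, unified_memory_style are made-up names.
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void discrete_gpu_style(const float* host_in, float* host_out, int n) {
    // Explicit copies across the bus: each transfer costs latency and energy.
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host_out, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

void unified_memory_style(int n) {
    // One address space shared by CPU and GPU: no staged copies to manage.
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes directly
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                      // CPU can read the same pointers after sync
    cudaFree(data);
}
```

The point isn't the few lines saved; it's that every cudaMemcpy in the first version is a chip crossing, which is exactly the overhead the quote says integration removes.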
 


She canceled the Skybridge project and the 20nm APUs, delayed K12, re-emphasized Zen, ...
 


Ah, to be honest, though, I guess she didn't have much choice.

There is no real market currently for high-performance ARM cores, and Skybridge would have been a great engineering exercise but wouldn't have driven revenue. Zen is currently the *only* product that makes sense in AMD's traditional markets, which is what they must stabilize *first* before moving on their ARM initiative. I'm not sure I'd call it an 'odd style'; more like moves dictated by necessity.

I mean, long term, K12 could be the more important design, as it allows AMD to fight Intel on different terms - however, the market isn't there yet.
 
Quotes from deleted posts removed by Moderator.

I don't see high-performance ARM going anywhere anytime soon. The most powerful ARM core currently is the one that powers my Galaxy S6, and via the Headline benchmark, my 2600K is still 10x more powerful. Even adjusting for clocks, I still get a 5x performance difference. And I'm sorry, K12 isn't closing that 500% per-core performance deficit.
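The clock adjustment is just a per-cycle normalization; working backwards from the quoted 10x and 5x figures, the implied clock ratio is roughly 2x (my inference, not measured frequencies):

$$\frac{S_{2600K}/f_{2600K}}{S_{S6}/f_{S6}} \;=\; \frac{S_{2600K}}{S_{S6}}\cdot\frac{f_{S6}}{f_{2600K}} \;\approx\; 10 \times \frac{1}{2} \;=\; 5$$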

Oh, and never mind the lack of software to power said CPUs in the performance markets.
 


But then you need to remember the performance deficit between Sandy Bridge and Atom: one is designed for performance and the other for low power consumption. Mobile ARM processors are built for minimum power consumption, even when they carry the "performance" label, like the A57.

K12 can (could?) be a lot faster than mobile ARM CPUs, as long as it is designed to be.

Also, just as Zen can have lower clocks and improved efficiency due to the process node, Sandy Bridge was built on a performance-optimized process. We can't just compare like this and extrapolate to K12; we need a server-grade ARM CPU to estimate its future.
 


Does the Silvermont core designed for phones and tablets give the same performance as the Haswell core designed for servers? Then why would an ARM core designed for a phone/tablet provide the same performance as an ARM core used in a server SoC?

The A57 (for phone/tablet) has about half the IPC of Haswell, whereas the AMD K12 core (for servers) targets 2x that IPC. The Vulcan core (for servers) is expected to provide ~90% of the single-thread performance of a Haswell core but to be faster on throughput (thanks to its support for 4 threads per core).

Everyone here agrees that Zen will be somewhere around Haswell-level IPC. And Keller said that K12 will be faster than Zen.
 


Agreed, there is no reason you technically *can't* build an ARM core that is as fast in raw performance as x86 unless I'm totally misunderstanding something. The main issue is lack of ecosystem / software to support such a product right now, although without the product there's no incentive to make the software. Classic 'chicken and egg' scenario 😛
 


It's been a while since I chipped in, but I've answered that question already.

At their heart, CPUs aren't complicated things; binary math hasn't changed in many decades. It's trivial to make an IC that adds binary digits together and outputs the result into memory. Problems arise when you start trying to make it "go fast" and execute unpredictable, arbitrary code, because while the math units are extremely easy to make, they can't do anything without an accompanying I/O subsystem. As you scale the workload up, that I/O subsystem starts being your bottleneck. It's no longer an issue of executing instructions, but of getting those instructions there in the first place. If you were to look at the die map of a modern x86 CPU, you would notice that an incredibly small area is actually devoted to executing instructions, 5~10% tops. The rest is devoted to control logic and various caching / prediction algorithms, aka "magic", in an attempt to alleviate the I/O bottleneck. All those extra transistors use power, lots of it.

Knowing that, take another look at the ARM design. ARM CPUs are specifically built without much of that "extra magic", precisely to keep their power usage down. As long as the ARM CPU keeps its throughput relatively low, it can continue to operate without the magic and thus with lower power usage. The moment someone tries to ramp throughput up into mainstream x86 / POWER / SPARC ranges, they slam right into that I/O bottleneck wall and need to start devoting more and more transistor space to "magic" to circumvent it. As they approach those higher performance levels, their power advantage becomes smaller and smaller until suddenly they aren't any different from anyone else. Out of all the CPU manufacturers, nobody comes close to Intel for how advanced that "magic" secret sauce is. People like to rag on AMD, but in all honesty AMD is about on par with everyone else in the industry for CPU designs; they just happen to be in the unenviable position of competing with chipzilla, who is so far ahead in control and I/O that it's not even funny anymore.
 


Not for at least twenty years, if not longer. And that's assuming some sort of phenomenal never-before-seen increase in processing technology. We're talking borderline "I found it in a secret alien ship" type increase here.

Take the Nvidia 980 Ti, which has approximately 8bn transistors, not counting memory. The Intel i7-6700K has somewhere around 1.75bn transistors, which includes its small onboard iGPU. The FX-8350 has around 1.6bn transistors, and the A10-7850K has 2.41bn transistors, including the iGPU.

That is the kind of power discrepancy that exists between dedicated vector processors (dGPU) and general-purpose central processors (CPU/APU). There is almost an order of magnitude difference between the most powerful iGPU and the most powerful dGPU, not to mention that dGPUs get specialized ultra-wide, ultra-fast memory buses dedicated to them, while iGPUs have to share the central memory implementation.
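Rough arithmetic on those counts (the 50/50 CPU/iGPU split of the A10-7850K budget is my assumption for illustration, not a published figure):

$$\frac{N_{\text{980 Ti}}}{N_{\text{iGPU}}} \;\approx\; \frac{8.0\times 10^{9}}{0.5 \times 2.41\times 10^{9}} \;\approx\; 6.6\times$$

and since the iGPU's share of the APU is realistically less than half, the real gap lands close to the order of magnitude mentioned above.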

We are decades away, at a minimum, from having "too much" vector processing power. We've barely scratched the surface of real-time ray tracing and physics, and we are still experimenting with various implementations of 3D. Imagine what the graphics processing requirements will be once holographic displays become a consumer reality. There is just too much that you can do with a powerful vector co-processor available to the system.

For graphics, the PCIe bus presents zero issues. Programs simply upload their data sets to the dGPU's memory prior to execution, and as execution happens the program just keeps putting data into that memory before it's needed. The interesting thing about vector-style processing is that it's incredibly predictable compared to general processing, and graphics memory is so large that there is never a problem of not having your data present prior to execution time. Thus the only issue with the PCIe (or any other) bus is latency, which only matters if you're trying to use the dGPU as an integrated math co-processor instead of a dedicated graphics / physics co-processor. That is the only real advantage of having a local vector co-processor, and while it's a really good advantage, it's not one that replaces the dGPU.
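A hedged sketch of that "keep putting data into memory before it's needed" pattern, again in CUDA and with made-up names (frame_data, process_frame); it double-buffers so the upload of the next frame's data overlaps the kernel working on the current one:

```cpp
// Hedged sketch, not any particular engine's code: double-buffered uploads
// so PCIe transfer time hides behind GPU execution.
#include <cuda_runtime.h>

__global__ void process_frame(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

void render_loop(float* const* frame_data, int frames, int n) {
    float* dev_in[2];
    float* dev_out;
    cudaStream_t copy_stream, exec_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);
    cudaMalloc(&dev_in[0], n * sizeof(float));
    cudaMalloc(&dev_in[1], n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));

    // Prime buffer 0 before the loop starts.
    cudaMemcpy(dev_in[0], frame_data[0], n * sizeof(float), cudaMemcpyHostToDevice);

    for (int f = 0; f < frames; ++f) {
        // Queue the next frame's upload while the current frame executes.
        // (Real overlap also needs pinned host memory, e.g. cudaMallocHost.)
        if (f + 1 < frames)
            cudaMemcpyAsync(dev_in[(f + 1) % 2], frame_data[f + 1],
                            n * sizeof(float), cudaMemcpyHostToDevice, copy_stream);
        process_frame<<<(n + 255) / 256, 256, 0, exec_stream>>>(dev_in[f % 2], dev_out, n);
        cudaStreamSynchronize(exec_stream);
        cudaStreamSynchronize(copy_stream);
        // Results stay on the GPU, as a renderer would keep them.
    }

    cudaFree(dev_in[0]); cudaFree(dev_in[1]); cudaFree(dev_out);
    cudaStreamDestroy(copy_stream); cudaStreamDestroy(exec_stream);
}
```

With the copies overlapped like this, the transfer cost disappears behind the computation, which is the sense in which the bus "presents zero issues" for throughput-style graphics work, leaving latency as the only real cost.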

What you're going to see isn't the iGPU replacing the dGPU, but rather the iGPU complementing it. You will have a powerful CPU with a low-to-medium iGPU and a powerful dGPU. The CPU handles general computing, with the iGPU acting as a co-processor, while the dGPU handles graphics, or physics when it's tied to graphics.
 


You said what I have been trying to say. ARM is great as it is because it is less complex and has less "magic". The second you put in the more complex features x86 has, the benefits go out the window.
 


I agree that you *cannot have enough vector processing*. You missed my point, however - I'm not suggesting that GPUs go away, but rather that as the manufacturing process becomes smaller, it's the CPU that becomes the insignificant component. Pure discrete CPUs are already tiny compared to a decent GPU - my thinking is that the CPU will be absorbed into the GPU rather than the other way around. How exactly the system is built around that I can't say yet, and the option to have multiple 'nodes' is still likely for the enthusiast market...
 


General-purpose and vector processing have entirely different requirements; any system optimized for one will not be optimized for the other. I mentioned earlier that doing binary math isn't difficult; it's the control logic and I/O subsystem that take up all the space and resources. That subsystem wouldn't work well with powerful vector co-processors, which need their own I/O subsystems. Also, processing power is still needed: now that cheap consoles have hit their efficiency wall and we've finally started moving away from 32-bit computing, we will start to see a return to demanding software titles.

For these reasons the two will remain separate for a very long time, longer than the foreseeable future. Remember, scalar computing is very bad at multiplexing complex math problems but amazing at arithmetic and control logic; conversely, vector computing is great at doing all that complex math but absolutely horrible at control logic. No single processor can do both well, and thus you'll always need two. The I/O subsystem that connects each will be very different.

For homework, look up the cache subsystem on both the Nvidia Maxwell design and the AMD Tahiti design, then compare them to Intel Skylake / Haswell / Broadwell and AMD Kaveri / Steamroller / Piledriver.
 