Currently, GPUs have a theoretical SP performance of around 5 TFLOPS. I think the R9 290X sits at ~5.5 TFLOPS at 250W; let's use that. For reference, Kaveri's A10-7850K is ~900 GFLOPS at 95W. To simplify further, assume TDP = watts consumed. Both are at 28nm, right? Using process nodes within striking distance of each other makes the comparison more interesting. I'll also use SP, since DP is capped artificially on consumer-level cards.
So, for HEDT, if you want to do heavy computing, you'd need 6 A10-7850Ks to reach ~5.4 TFLOPS, drawing ~570W (6 × 95W). That means you'd have to introduce a platform with 6 sockets, sufficient interconnects to cable up all that mess, AND enough slots for all the RAM banks you'll need (that's not even counting physical space). And that's before considering that you can multiplex PCIe lanes to run more than 2 cards per x16 split.
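As a sanity check, the arithmetic above can be replayed with the post's round numbers (which are approximations to begin with):

```python
# Back-of-envelope check: how many A10-7850Ks match one R9 290X?
# Figures are the approximate ones used in the post above.
APU_GFLOPS, APU_TDP_W = 900, 95    # Kaveri A10-7850K, single precision
GPU_GFLOPS, GPU_TDP_W = 5600, 250  # R9 290X, single precision

n_apus = GPU_GFLOPS // APU_GFLOPS          # APUs needed to get close
total_tflops = n_apus * APU_GFLOPS / 1000  # combined throughput
total_watts = n_apus * APU_TDP_W           # combined TDP

print(n_apus, total_tflops, total_watts)   # 6 5.4 570
```

So matching one 290X takes six APUs at 570W combined, before counting any interconnect or platform overhead.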
Holy cow... I'm sure GPUs are totally dead. Yessir.
Is there any other formula you want us to try, Juan?
Cheers!
Your performance and TDP values for the R9 290X are a bit off. Using correct values, we obtain 19.4 GFLOPS/W.
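For what it's worth, one way to arrive at that 19.4 GFLOPS/W figure is with the 290X's full specs: 2816 stream processors at ~1GHz and ~290W board power. These exact inputs are my reconstruction, not something stated in the post:

```python
# Reconstructing the 19.4 GFLOPS/W figure for the R9 290X.
# 2816 SPs x 2 ops/cycle (fused multiply-add) x 1.0 GHz clock.
sp_count, ops_per_cycle, clock_ghz = 2816, 2, 1.0
board_power_w = 290  # typical board power, not the 250 W quoted earlier

gflops = sp_count * ops_per_cycle * clock_ghz  # 5632 GFLOPS single precision
print(round(gflops / board_power_w, 1))        # 19.4
```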
Then you compare the efficiency of the dGPU against an APU, but that is a faulty comparison.
CPUs have poor efficiency because they are made of latency-optimized cores. By comparing the dGPU against the APU, you are including the poor efficiency of the iCPU while, at the same time, ignoring the poor efficiency of the dCPU.
Therefore, you are artificially favoring the dGPU. And I say "artificially" because I don't know of any dGPU that works standalone without a dCPU, do you?
You should compare the efficiency of the APU against the whole [ dCPU + dGPU ] system. If you make the computation correctly, and if you know basic silicon scaling laws, you will understand why AMD engineers are proposing HSA APUs for future supercomputers.
8350rocks :
Ok Juan, assume it is powered by one of your 5W APUs. In HPC that would likely be the case.
It does not change the situation; by that time, it may even be something like a 3W ARM processor, for all we know.
I was going to give a hypothetical scenario with an Apple A7 SoC as the "dCPU" of such system.
AMD will end up creating a GPU with an iCPU to have HSA numbers that justify such design decision. We'll be able to do raytracing in realtime, but the games will still run like crap because the CPU won't be fast enough for the regular "legacy" serial stuff.
The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model.
Perhaps I am reading this wrong... but it seems to me that they are discussing this in a manner that no longer requires a copy of the data to be stored in VRAM. Then again, perhaps I am missing additional information... I honestly have not talked to a software engineer about the implementation details; I have only gotten the "high level" information at this point. So, I can see where you are coming from, but I came away from the discussion with different impressions. Perhaps I am wrong in this instance... I will seek clarification.
It's impossible to do calculations without first having the data to do those calculations on. The data MUST be copied; nothing will save you from that. What they are talking about is that you, the programmer, don't need to copy the data manually or manage the coprocessor's (the GPU's, in this case) memory space.
Traditionally, you'd have to first copy your dataset into the coprocessor's local memory, then establish a session with it, send your instructions, receive the result, and send some more. Essentially you have to micromanage everything about the coprocessor; that micromanagement was a headache and presented some real limitations on coding. With something like HSA, the coprocessor manages itself. You just send the instructions with the pointers, and the coprocessor will fetch the data on its own; chances are the data is already there, as the prefetcher read the code ahead of time and started the copy before it was needed.
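The difference between the two programming models described above can be caricatured in a few lines of Python. This is a toy sketch, not a real GPU API; the class and method names are invented for illustration:

```python
class LegacyCoprocessor:
    """Traditional model: the host must explicitly copy data into
    device-local memory before dispatching work."""
    def __init__(self):
        self.local_mem = None
    def upload(self, data):
        self.local_mem = list(data)   # explicit copy into device memory
    def run(self, kernel):
        return [kernel(x) for x in self.local_mem]

class HSACoprocessor:
    """HSA-style model: host and device share one address space, so the
    host just hands over a pointer (here, a reference) - no manual copy,
    no session micromanagement."""
    def run(self, kernel, data_ref):
        return [kernel(x) for x in data_ref]  # reads shared memory directly

data = [1, 2, 3]
legacy = LegacyCoprocessor()
legacy.upload(data)                    # the copy step that HSA makes implicit
square = lambda x: x * x
print(legacy.run(square))              # [1, 4, 9]
print(HSACoprocessor().run(square, data))  # [1, 4, 9] - same result, no upload()
```

The point of the sketch: the copy still physically happens somewhere in the HSA case too, but it is no longer the programmer's job to orchestrate it.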
May be slightly off-topic, but I figure you guys probably know just as much as anyone - what ever happened to those enthusiast Kaveri laptop parts (sounds like an oxymoron, doesn't it?) like the FX-7600p? Black Friday has come and gone, and still no FX-7600p laptops - what happened to all those design wins AMD was talking about back in July?
Good question and one I've been asking too. Apart from one monster sized MSI laptop, all the Kaveri laptops I've seen have been ULV parts. And AMD GPUs are almost entirely absent from laptops at the moment.
I am absolutely certain dGPUs will always have more raw performance than APUs. The difference lies in how well that performance can be put to use - if you can solve 5 addition problems in your head in a second, but it takes you 5 seconds to read each problem you're finding the solution to - and another 5 seconds to write them down - the guy who takes 2 seconds for each stage will still get the homework done faster.
I guess what I'm trying to say is that there are tasks for which the low-latency, serial nature of CPUs and the massively parallel nature of GPUs are both needed. An example I thought of just recently deals with collision detection - one could have the iGPU check for collisions between moving objects and all other objects (3 levels of data parallelism - multiple moving objects, multiple objects to check against each moving object, multiple triangles to check for each object), have the CPU sort it in order of which collisions occur first, then do physics calculations in order of collisions, passing off work for re-checking collisions at each update to the GPU. The CPU does quick, sequential operations where they need to be done in order (in this case, in order of time) and the iGPU successfully accelerates the actual finding of collisions (and possibly the resulting physics, depending on how sophisticated they are) thanks to the inherent data parallelism. With the lacking bandwidth and high latency of the PCI-Express bus, I highly doubt such efficient communication between the devices would be plausible with a dGPU.
I could be wrong, of course. It's been known to happen.
Good post! Let us do some computations.
The iCPU on Kaveri occupies about 50% of the die. On a 7nm node, the iCPU would occupy 3% of the same die. Double the number of cores from 4 to 8 and it will occupy 6%. Now double the size of each core and the iCPU still occupies only 12% of the die.
But this is for a mainstream-size die. Supercomputers will not use mainstream parts but top-level high-performance parts. AMD engineers don't provide the size of their extreme-scale APU. I could try to guess it, but luckily Nvidia engineers give the size of their own extreme-scale 'APU': 650mm².
The above 8-core iCPU (each core the size of a Steamroller module) would occupy about 5% of the whole 650mm² die. A separate dGPU with all the die spent on GPU transistors would provide only 5% more "raw performance" than the APU. But this 5% is in an idealized world, assuming a perfect interconnect with zero latency and infinite bandwidth for the dGPU.
In the real world, a real interconnect will reduce performance due to latency and bandwidth bottlenecks, with the final result that the dGPU will be slower. This is the reason why neither AMD nor Nvidia engineers use dGPUs for their extreme-scale supercomputers.
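The 3%/6%/12% figures above follow from ideal area scaling: 28nm to 7nm is a 4x linear shrink, so ~16x density. That is an idealization (real nodes scale worse than the node names suggest), but it reproduces the numbers:

```python
# Ideal-scaling version of the die-area argument above.
kaveri_icpu_share = 0.50        # iCPU fraction of the 28 nm Kaveri die
area_shrink = (28 / 7) ** 2     # 16x density, assuming perfect scaling

share_4core = kaveri_icpu_share / area_shrink  # same 4 cores at 7 nm
share_8core = share_4core * 2                  # double the core count
share_fat = share_8core * 2                    # double each core's size

print(f"{share_4core:.1%} {share_8core:.1%} {share_fat:.1%}")  # 3.1% 6.2% 12.5%
```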
AMD will end up creating a GPU with an iCPU to have HSA numbers that justify such design decision. We'll be able to do raytracing in realtime, but the games will still run like crap because the CPU won't be fast enough for the regular "legacy" serial stuff.
No. The architectural improvements will speed up the regular "legacy" serial stuff as well. Each iCPU core has more IPC than current Piledriver/Steamroller cores. Moreover, the improved cache and memory subsystem provides an extra boost to serial "legacy" performance by reducing latencies compared to a dCPU such as the FX-8350.
Every single speculation about your theoretical APU is based on current-size AMD CPU cores/modules. It's as if you intentionally don't want to accept the fact that the CPU has to improve (in AMD's case, by a lot), which also means that the part of the die the CPU occupies will be a lot larger than 2-5%. If the CPU architecture and performance stayed the same, then you would be right, but nobody (who cares about CPU performance) is happy with the performance of current AMD parts, so the proportions would stay about the same. Also, what makes you think we will still be on the same PCIe connections between CPUs and GPUs, as if progress will stop completely on that front and everyone will just focus on APUs? I get it, you love AMD and their APUs, and you probably don't game or use your CPU much, but for those who do, the CPU performance is seriously lacking. The HEDT market wants bigger/faster cores and a dGPU with the whole die occupied just by GPU, and ideally the same for the CPU (no iGPU on the die), with a lot better performance than your theoretical APUs. The dGPU will not die, because of this simple fact.
First, there is no "speculation". I am using data given by AMD and Nvidia for their respective designs.
Second, it is not my "theoretical apu". I am discussing the APUs presented by AMD and Nvidia.
Third, I already said that each new CPU core from AMD will be faster than a Piledriver/Steamroller module and about as big. Nvidia's own cores will provide 50% more IPC than Denver, yet the iCPU still occupies less than 10% of the total die. Again, this is not "speculation"; it follows from the details of the designs given by AMD and Nvidia engineers.
Fourth, I did not even mention PCIe in my above post. I mentioned interconnects in general, because the limits I have mentioned apply to any future interconnect, be it PCIe 5.0, HTX 6.0, or otherwise.
Fifth, I am trying to explain to you why AMD engineers and Nvidia engineers are not using dGPUs for their extreme-scale designs. You can ignore my explanation, you can ignore the numbers, and you can ignore the physics behind them, but that will not change the fact that their designs for the future do not use dGPUs.
Sixth, the HEDT argument was debunked before and by more than one poster...
And yet the new to-be-built supercomputers will use Volta dGPUs paired with IBM POWER8 dCPUs, and not APUs.
AMD’s CFO Devinder Kumar talks about AMD’s Restructuring
http://semiaccurate.com/2014/12/05/amds-cfo-devinder-kumar-talks-amds-restructuring/
The article confirms that AMD continues accelerating the transition away from PC market to new markets: embedded, semi-custom...
Kumar confirms that the semi-custom business won a design for an ARM-based product, and the other is for a traditional x86 product. He doesn't give details, but I think they are the SoCs for Apple.
He confirms the good relationship with Globalfoundries, repeats that "AMD is now producing computing, graphics, and semi-custom products all at Global Foundries," and praises Globalfoundries for "the FinFET cooperation deal with Samsung." Does anyone still doubt that AMD's 14nm FinFET products will be made by Globalfoundries?
The article confirms that "AMD’s outlook on the discrete graphics business is less rosy" and "considers the performance of AMD’s graphics business to be disappointing".
And yet the new to-be-built supercomputers will use Volta dGPUs paired with IBM POWER8 dCPUs, and not APUs.
But those are not extreme-scale supercomputers; they are ~20x slower supercomputers... For your information, IBM engineers are working on a future extreme-scale supercomputer design based on a heterogeneous node with the accelerator included on the same die as the CPU (*) instead of using dCPU+dGPU.
(*) I also have a copy of their APU-like design. The difference from AMD or Nvidia is that instead of GPU cores, IBM engineers use another kind of throughput core for the on-die accelerator. I am not going to give details because this is an AMD thread; I will only mention that the IBM design has unified memory and uses a technology similar to the PIM in the design presented by AMD.
Yes, I know where AMD is headed very well. It's boring for me. HEDT will be grabbed by Intel and AMD will leave us to rot buying Intel only stuff. The "dCPU" + "dGPU" landscape won't change anytime soon. I can't see beyond 2016 though, so 2020 will be interesting.
The HPC pet project is a halo project, something for them to sit back and say, "see, we did that!" However, you are talking about something specifically designed for a specific set of circumstances: running massively parallel compute tasks with many other interconnected nodes.
In short...you would be looking at an APU something along the lines of a single x86 module with many GCN SPs, as many as you can fit on die in fact. With just enough x86 horsepower to keep the GCN cores fed while running fluid dynamic simulations.
Now, consider the case where APUs become 20 times more efficient... so, let us say 900 GFLOPS for ~4.5W, which is where this theoretical HPC APU Juan claims would live, based on current performance today.
By that time, assuming a 10% efficiency gain per generation, we are looking at 9 generations to get there. dGPUs are growing at 20% compute performance (bare minimum, sometimes more) per generation with approximately 10% less power consumption. That means that by the time that APU does ~1 TFLOPS for ~5W, dGPUs will be 516% more powerful. So, for ~110W, we will get approximately ~28.9 TFLOPS, assuming only a 20% performance improvement per generation and a 10% power consumption reduction.
So, to achieve the same ~29 TFLOPS, we will need 29 APUs to hit what one dGPU accomplishes.
Now...those 29 APUs would consume 145W...while the single dGPU card consumes a mere 110W.
Additionally, we have not considered all the interconnects and other baggage such a system with 29 APUs would need. Assuming just 1W per interconnect would make that a ~175W system... and most interconnects consume 2-3x what I am estimating.
Now, let us consider if you have 2 of these monster 29 TFLOP dGPUs in a motherboard with 2 expansion slots. For minimal power cost beyond the card itself, our system has now become a 220W 58 TFLOP HPC, meanwhile to do so with APUs would require another 29 interconnects for a grand total of 58.
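The nine-generation compounding above can be replayed directly (these are 8350rocks' growth assumptions, not measured data). The dGPU power after nine 10% cuts actually lands nearer ~97W than the ~110W quoted, but the shape of the argument is unchanged:

```python
# Replaying the nine-generation extrapolation with the stated growth rates.
GENS = 9
gpu_tflops, gpu_watts = 5.6, 250.0   # today's dGPU starting point

for _ in range(GENS):
    gpu_tflops *= 1.20   # +20% compute per generation
    gpu_watts *= 0.90    # -10% power per generation

print(round(gpu_tflops, 1), round(gpu_watts))  # 28.9 97
```

Dividing those ~28.9 TFLOPS by a ~1 TFLOPS future APU gives the 29-APU count used above.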
Now, there are other ways to extrapolate that, and all of it is very hypothetical. However, my point is: while GPUs have the luxury of focusing on performance AND efficiency, APUs do not, because a CPU is inherently much less efficient than a GPU. The serial nature of the instructions makes it so: you need much more aggressive logic/branch prediction, more cache layers, and other power-hungry structures in an APU.
So basically, if you go for weak, efficient parts, you end up with more interconnects, thus making everything more expensive.
I.e., a 2m/4c APU like Kaveri with 512 GCN cores, in order to compete with a big dGPU with, say, 4096 cores, would need 8 times the interconnects to spread all those GCN cores over 8 512-core APUs?
An HPC APU is not so viable unless you make huge APUs? Which means the entire thing isn't viable for HEDT, as a 650mm² APU would cost a load of money. We spend all this time arguing how important efficiency is, yet we ignore that a huge, efficient HPC cluster of ARM APUs would need a ton more interconnects, thus raising the price significantly.
When you switch to the new HPC benchmark, HPCG, as opposed to the ancient LINPACK, the results change quite a bit. The favor seems to go back to regular CPU (homogeneous) systems.
It's also saying that reaching exascale with a more realistic benchmark is going to take another decade or more. The most powerful supercomputer in the world doesn't even reach 1 petaflop with that benchmark.
#1 and #3 continue being heterogeneous machines. The K machine did jump from #4 to #2. The new benchmark reflects the cost of moving data outside the CPU. Prof. Dongarra confirms my claim that the performance problem is in the interconnect, and he notes that the problem is temporary and will disappear in a next generation of supercomputers where the GPU is on the same die as the CPU (aka an APU):
On that note, take a look at the results and notice the red markings. Those mean the presence of GPUs or coprocessors. If you start making a few quick connections, you'll see that the GPU-accelerated systems that tend to shine on LINPACK really don't pull the same weight on this real-world, application-oriented benchmark. As Dongarra said, "It's not unlike HPL, where GPUs and coprocessors have a lower achievable percent of peak because it's harder to extract performance from these. It's not just programming ease either, it's about the interconnect. When that problem goes away it will change the game dramatically."
Of course, this is not a message of doom and gloom when it comes to GPUs and coprocessors in these large machines—at least not in the future, since a great deal of the interconnect problem with coprocessors and GPUs will be a thing of the past once data never has to leave its chip home. In the meantime, it's important for this benchmark to get traction—as well as the codes and systems—for the new generation of processors that will kick off with the newest Knights-family parts and the work by the OpenPOWER Foundation and NVIDIA to nix the hop and keep the movement on the die.
I am absolutely certain dGPUs will always have more raw performance than APUs. The difference lies in how well that performance can be put to use - if you can solve 5 addition problems in your head in a second, but it takes you 5 seconds to read each problem you're finding the solution to - and another 5 seconds to write them down - the guy who takes 2 seconds for each stage will still get the homework done faster.
I guess what I'm trying to say is that there are tasks for which the low-latency, serial nature of CPUs and the massively parallel nature of GPUs are both needed. An example I thought of just recently deals with collision detection - one could have the iGPU check for collisions between moving objects and all other objects (3 levels of data parallelism - multiple moving objects, multiple objects to check against each moving object, multiple triangles to check for each object), have the CPU sort it in order of which collisions occur first, then do physics calculations in order of collisions, passing off work for re-checking collisions at each update to the GPU. The CPU does quick, sequential operations where they need to be done in order (in this case, in order of time) and the iGPU successfully accelerates the actual finding of collisions (and possibly the resulting physics, depending on how sophisticated they are) thanks to the inherent data parallelism. With the lacking bandwidth and high latency of the PCI-Express bus, I highly doubt such efficient communication between the devices would be plausible with a dGPU.
I could be wrong, of course. It's been known to happen.
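The split described above can be sketched in plain Python. This is only an illustration of the pipeline shape, not real GPU code: the "iGPU" step is a data-parallel map over object pairs, the "CPU" step a time-ordered sort, and all the object structures and helper names are made up:

```python
def time_of_impact(m, s):
    """1-D toy stand-in for a real intersection test:
    object m moves at constant velocity toward point s."""
    if m["vel"] == 0:
        return None
    t = (s["pos"] - m["pos"]) / m["vel"]
    return t if t >= 0 else None

def gpu_broad_phase(moving, static):
    """'iGPU' step: check every moving object against every candidate.
    Both loops are independent per pair, i.e. trivially data-parallel."""
    hits = []
    for m in moving:
        for s in static:
            t = time_of_impact(m, s)
            if t is not None:
                hits.append((t, m, s))
    return hits

def cpu_resolve(hits):
    """'CPU' step: serial, order-sensitive work; handle earliest impacts first."""
    return [(m["name"], s["name"]) for _, m, s in sorted(hits, key=lambda h: h[0])]

moving = [{"name": "a", "pos": 0.0, "vel": 2.0},
          {"name": "b", "pos": 5.0, "vel": 0.5}]
static = [{"name": "wall", "pos": 10.0}]
print(cpu_resolve(gpu_broad_phase(moving, static)))  # earliest impact first
```

On an APU the two steps share memory; over PCIe, the `hits` list would have to cross the bus every physics update.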
Good post! Let us do some computations.
The iCPU on Kaveri occupies about 50% of the die. On a 7nm node the iCPU would occupy about 3% of the same die. Double the core count from 4 to 8 and it will occupy 6%. Now double the size of each core and the iCPU still occupies only 12% of the die.
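Those percentages follow from assuming transistor area shrinks with the square of the feature size, which is an idealization (real scaling is worse). A back-of-envelope check:

```python
# Idealized area scaling: area shrinks with the square of the feature size.
node_now, node_future = 28, 7       # nm
icpu_share_now = 0.50               # iCPU is ~50% of Kaveri's die today

shrink = (node_future / node_now) ** 2   # (7/28)^2 = 1/16
icpu_share = icpu_share_now * shrink     # ~3% of the same die at 7nm
icpu_8core = icpu_share * 2              # double the core count -> ~6%
icpu_fat = icpu_8core * 2                # double each core size -> ~12%

print(f"{icpu_share:.1%}, {icpu_8core:.1%}, {icpu_fat:.1%}")
```

Even with both doublings applied, the CPU portion stays a small slice of the die under this assumption.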
But this is for a mainstream-size die. Supercomputers will not use mainstream parts but top-level high-performance parts. AMD engineers don't provide the size of their extreme-scale APU. I could try to guess it, but luckily Nvidia engineers give the size of their own extreme-scale 'APU': 650mm2.
The above 8-core iCPU (each core the size of a Steamroller module) would occupy about 5% of the whole 650mm2 die. A separate dGPU with the entire die spent on GPU transistors would therefore provide only about 5% more "raw performance" than the APU. But that 5% is in an idealized world, assuming a perfect interconnect with zero latency and infinite bandwidth for the dGPU.
In the real world, a real interconnect reduces performance through latency and bandwidth bottlenecks, with the final result that the dGPU will be slower. This is why neither AMD nor Nvidia engineers use dGPUs in their extreme-scale supercomputer designs.
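A toy model makes the argument concrete. All numbers below are hypothetical placeholders (they are not from the thread); the point is only that a small raw-FLOPS edge can be erased by a data-transfer term the APU doesn't pay:

```python
# Toy performance model: a dGPU with 5% more raw FLOPS still loses once
# its operands must cross a PCIe-class link. All numbers are illustrative.
flops_needed = 1e12        # work in one kernel (FLOPs), hypothetical
bytes_moved = 8e9          # operands shipped to/from the dGPU, hypothetical

apu_perf = 10e12                    # APU throughput (FLOP/s), hypothetical
dgpu_perf = apu_perf * 1.05         # the 5% raw-performance edge
pcie_bw = 16e9                      # ~PCIe 3.0 x16 effective bandwidth (B/s)
pcie_latency = 10e-6                # per-transfer latency (s), hypothetical

apu_time = flops_needed / apu_perf  # data already resident in shared memory
dgpu_time = flops_needed / dgpu_perf + bytes_moved / pcie_bw + pcie_latency

print(apu_time, dgpu_time)  # the transfer term dwarfs the 5% FLOP edge
```

Only when the kernel does enormous work per byte moved does the dGPU's edge survive, which is exactly the interconnect argument above.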
Every single speculation about your theoretical APU is based on the size of current AMD CPU cores/modules. It's as if you intentionally refuse to accept that the CPU has to improve (in AMD's case by a lot), which means the part of the die the CPU occupies will be a lot larger than 2-5%. If CPU architecture and performance stayed the same you would be right, but no one who cares about CPU performance is happy with current AMD parts, so the proportions wouldn't stay the same. Also, what makes you think we will still be on the same PCIe connections between CPUs and GPUs, as if progress will stop completely on that front and everyone will just focus on APUs? I get it, you love AMD and their APUs, and you probably don't game or use your CPU much, but for those who do, the CPU performance is seriously lacking. The HEDT market wants bigger/faster cores and a dGPU with the whole die occupied by just the GPU, and ideally the same for the CPU (no iGPU on the die); that would have a lot better performance than your theoretical APUs. The dGPU will not die, for this simple reason.
I'm honestly puzzled by this. Juan said 4 modules (8 Piledriver cores) would be 3% of the APU's total die space. You could triple that size (brute-forcing the design, which pretty surely isn't happening this time around) and the CPU would still use only 9% of the die space. I could go on and on about why more transistors doesn't always mean more performance, but I'll just let you look at Intel vs AMD benchmarks from 2008 onward.
AMD will end up creating a GPU with an iCPU just to have HSA numbers that justify such a design decision. We'll be able to do ray tracing in real time, but the games will still run like crap because the CPU won't be fast enough for the regular "legacy" serial stuff.
No. The architectural improvements will run the regular "legacy" serial stuff faster as well. Each iCPU core has more IPC than current Piledriver/Steamroller cores. Moreover, the improved cache and memory subsystem provides an extra boost to serial "legacy" performance by reducing latencies compared to a dCPU such as the FX-8350.
Yes, I know where AMD is headed, very well. It's boring to me. HEDT will be grabbed by Intel, and AMD will leave us to rot buying Intel-only stuff. The "dCPU" + "dGPU" landscape won't change anytime soon. I can't see beyond 2016 though, so 2020 will be interesting.
Cheers!
Yeah, let's see the type of iGPU power they have for us gamers. I just got done with their new 4600 GPU in Haswell and it can't even hold up to GameCube emulation graphics. I guess they'll have to improve that end if dGPUs get replaced.
Also, I wonder if they actually added "advanced" features to the "advanced" tab in the drivers... the "advanced" options are stupidly basic ones.
Cheers!
Not bad for the people who already own their hardware. Yes, they need to put adaptive v-sync in their drivers, and I like FXAA a lot too, but I'm not sure how others feel about it (since it works for all games in the Nvidia settings).
Yeah, let's see the type of iGPU power they have for us gamers. I just got done with their new 4600 GPU in Haswell and it can't even hold up to GameCube emulation graphics. I guess they'll have to improve that end if dGPUs get replaced.
Oh, I'm certainly not talking about their "APUs" overtaking the market, but they do trail AMD in OCL stuff on their own. I'm talking about how dCPU and dGPU will still be relevant for a lot more years, and since AMD's path is not focusing on CPU prowess anymore, Intel will take over HEDT too darn easily. I'm sure they'll improve their "APUs", but the CPU portion will be better than AMD.
But FreeSync requires a compatible monitor, which I don't have. Adaptive v-sync is easy to add as an option, I'm sure of that. RadeonPro does it outside the driver, so it can't be THAT hard to include.
Just to clarify for people who might not know what adaptive V-sync is: it's nothing more than a V-sync switch. Above 60Hz it turns V-sync on, and below 60Hz (or your monitor's refresh rate) it turns it off. Simple yet amazing. I use it for all my games.