blackkstar :
@juan, it seems like you think that the most important part of a GPU and GPGPU is latency.
A GPU/GPGPU is a TCU (a throughput compute unit) and those are rather insensitive to latency. This is why a GPU memory controller is optimized for bandwidth, not latency.
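A quick way to see why: by Little's law, the data that must be in flight to keep the memory system saturated is bandwidth times latency. A minimal sketch (compiles as a plain .cu file; the bandwidth and latency figures below are illustrative assumptions, not the specs of any real GPU):

    // Little's law back-of-envelope: how much data must be in flight to keep
    // a bandwidth-optimized memory controller busy. Numbers are assumptions.
    #include <cstdio>

    int main() {
        const double bandwidth_GB_s = 320.0; // assumed GPU memory bandwidth
        const double latency_ns     = 400.0; // assumed DRAM access latency
        // in-flight bytes = bandwidth * latency (the 1e9 and 1e-9 cancel)
        const double in_flight_bytes = bandwidth_GB_s * latency_ns;
        printf("~%.0f KB must be outstanding to hide the latency\n",
               in_flight_bytes / 1024.0);
        // A TCU covers this with tens of thousands of resident threads, each
        // holding a few outstanding loads; a latency-optimized CPU core cannot,
        // which is why the CPU cares about latency and the GPU about bandwidth.
        return 0;
    }

That is the sense in which a throughput machine is latency-insensitive: as long as enough requests are in flight, raising the latency barely changes the delivered bandwidth.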
blackkstar :
PCIe is a massive problem no matter how you all want to take it. If it were actually meeting the needs of heterogeneous computing, Nvidia and Intel would be using it. You would have Xeons connected by PCIe because it'd be fast enough.
And it's not.
At this point, Nvidia has NVLink and Intel has QPI. AMD would normally have Hypertransport, but it's in dire need of an upgrade.
I agree.
blackkstar :
Which is why I think that we're going to see AMD come out with a new version of Hypertransport or a replacement that is either more competitive than HT 3.1 (which was last updated in 2008, by the way) or even better than QPI and NVLink.
The biggest problem with everyone who thinks APU is the only future is that if there is no fast interlink between APUs, your HPC load is going to be bottlenecked by the same things that dCPU and dGPU are once you have a workload that uses more than one APU, because one APU won't be able to access the memory of another APU.
There is an important difference. The GPU works as a coprocessor to the CPU. Thus you need to offload the computation to the GPU, and move the result of the computation back to the CPU. This is why a CPU--GPU interconnect is slow and inefficient, and why this model is rejected for exascale computing.
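For context, this is what the coprocessor model looks like in code. A minimal CUDA sketch of the classic offload pattern (the kernel, sizes and names are made up for illustration); the two cudaMemcpy calls are the trips across the CPU--GPU interconnect:

    // Coprocessor/offload model on a dGPU: the data crosses the CPU--GPU
    // interconnect twice for every computation that is offloaded.
    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;                         // the actual work
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);               // CPU-side buffer
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        float *d;
        cudaMalloc(&d, bytes);                           // GPU-side buffer

        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // over the bus: CPU -> GPU
        scale<<<(n + 255) / 256, 256>>>(d, n);           // compute on the GPU
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // over the bus: GPU -> CPU

        cudaFree(d);
        free(h);
        return 0;
    }

For short kernels those two copies dominate the runtime, which is why the interconnect becomes the limiting factor as soon as the GPU is treated as a coprocessor.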
With APUs you don't need to offload the computation from one APU to another. The computations are local, and only in a few cases do you need to transfer data from one APU to another over the interconnect.
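Here is the same computation in the shared-memory style, for contrast. The thread is about HSA/hUMA, so take the sketch as an analogy only: I am using CUDA managed memory as a stand-in for true shared memory, and the names and sizes are again made up:

    // Shared-memory (APU-style) model: one allocation visible to both CPU and
    // GPU, no explicit host<->device copies. cudaMallocManaged is only a
    // stand-in here for HSA/hUMA shared memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));  // one pointer for CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes the data in place

        scale<<<(n + 255) / 256, 256>>>(x, n);     // GPU works on the same data
        cudaDeviceSynchronize();

        printf("x[0] after GPU pass: %f\n", x[0]); // CPU reads the result in place
        cudaFree(x);
        return 0;
    }

Only when a second APU needs part of this data does anything cross an external link, which is why an APU--APU interconnect is exercised far less often than a dCPU--dGPU one.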
This is the reason why Nvidia will first use NVLink to connect a CPU to a dGPU, but later will use NVLink only to connect their ultra-high-performance APUs. In fact, the interconnect between CPU and GPU cores inside the ultra-high-performance APU has to be so fast that Nvidia is not using a traditional bus system as in current CPUs/APUs.
AMD follows a similar approach to Nvidia. The internal docs show that they use a 100GB/s interconnect for the APU--APU connection. For the sake of comparison, Hypertransport peaks at only 26GB/s.
I doubt that AMD will improve Hypertransport. I believe it is dead. For the Seattle SoC, AMD is taking a different approach with the acquired Freedom Fabric.
blackkstar :
So even then, an APU-only ecosystem still needs the fast links I'm saying we need. But by the time you reach that point, you might as well be using dCPU and dGPU, since they'll both be using HSA over the same links that the APUs would be using anyway. Except dCPU and dGPU don't have to deal with the problems of APUs, such as:
1. GPU and CPU have different requirements from fabrication process.
2. GPU has to sacrifice density in fab process for CPU
3. CPU has to sacrifice clock speed in fab process for GPU
4. Both must contend for die space
5. Both must contend for heat dissipation (as mentioned earlier 100w CPU + 250w GPU = 350w+ total)
6. Power delivery of motherboard must also provide the power of GPU and CPU to a single socket.
7. Multiple APU system loses HSA shared memory advantage and is back to square one with multiple devices that have to talk over (currently) slow busses.
I don't know how else to explain it. To make an HSA APU, you need to make a CPU that supports HSA, and you also need to make a GPU that supports HSA. If you can provide solid bandwidth and latency between dCPU and dGPU, you might as well go that route, since you already have a CPU and GPU design that can do it. Except at this point you're at a massive advantage over the APU, because you can just keep adding devices. Imagine HSA between 4 8m/16c CPUs and 8 Hawaii-class GPUs, all sharing one giant memory pool. It'd be a bloodbath for Phi and Nvidia's NVLink.
1. No. AMD has had problems because it initially chose the wrong SOI process, but their move to bulk and next to FinFETs corrects that.
2. No. AMD's problems here are the result of their use of automated design tools.
3. No. Again, this is exclusive to AMD designs.
4. True for current APUs, but irrelevant for future CPU:GPU ratios as explained before.
5. True for current APUs, but irrelevant for future CPU:GPU ratios as explained before.
6. There is no problem with delivering power to a 300W socket.
7. Not a real problem, because hUMA matters for computations made on the same data. The CPU and GPU on the same die will be working on the same data, but an APU will rarely be working on data from another APU.
Yes, if one could invent a CPU--GPU interconnect that provides solid bandwidth and latency between dCPU and dGPU at the exascale level, then a traditional dCPU and dGPU architecture would be superior. The problem is that this magic interconnect cannot be invented, for the same reason that no engineer can invent a perpetual motion machine or a car that breaks the speed-of-light limit.
The laws of physics favour an APU solution. I already explained why and gave some basic data, like the wire-to-compute ratio. All the engineers know the laws of physics. This is why all the engineers working on exascale are developing APUs, despite some people here denying it like hell.
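To put a rough number on the wire-to-compute argument, here is a back-of-envelope sketch. The energy-per-bit figures are my own illustrative assumptions, in the ballpark quoted in exascale talks, not measurements of any product:

    // Back-of-envelope: power needed just to move data over an external link
    // versus over on-die wires. Energy-per-bit values are assumptions.
    #include <cstdio>

    int main() {
        const double traffic_GB_s    = 100.0; // assumed sustained traffic
        const double off_chip_pJ_bit = 10.0;  // assumed off-package link (SerDes)
        const double on_die_pJ_bit   = 0.5;   // assumed short on-die wires

        const double bits_per_s = traffic_GB_s * 1e9 * 8.0;
        printf("off-chip: ~%.1f W\n", bits_per_s * off_chip_pJ_bit * 1e-12);
        printf("on-die:   ~%.1f W\n", bits_per_s * on_die_pJ_bit   * 1e-12);
        // With these assumptions the external link costs ~20x the power of
        // on-die wires for the same traffic; at the multi-TB/s an exascale
        // node needs, that gap is what rules out the dCPU+dGPU option.
        return 0;
    }

Change the assumed numbers and the exact ratio moves, but the off-chip path stays an order of magnitude worse, which is the physics behind putting the CPU and GPU on the same die.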