AMD CPU speculation... and expert conjecture


juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790
There are no HSA dGPUs, only HSA APUs (bold emphasis mine)

http://www.amd.com/en-us/innovations/software-technologies/server-solution/hsa

The key to it all is Heterogeneous System Architecture (HSA). HSA seamlessly combines the specialized capabilities of the CPU, GPU and various other processing elements onto a single chip – the APU. By harnessing the untapped potential of the GPU, HSA promises to not only boost performance – but deliver new levels of performance (and performance-per-watt) that will fundamentally transform the way we interact with our devices.

Only HSA APUs can extract the full potential of unified system memory, on-die integration, and the absence of PCIe. Moreover, hUMA stands for heterogeneous Uniform Memory Access; "Uniform" implies a single memory pool.

[Image: non-HSA vs. HSA comparison chart]


http://www.amd.com/en-us/innovations/software-technologies/processors-for-business/compute-cores
 

8350rocks

Distinguished
@juanrga:

They are hyping it for APUs, but the technology is capable of dCPU and dGPU. I will stop there, but suffice it to say, it already works that way.

The other option, Reepca, albeit a slightly more expensive one, would be to convert HTX to operate as a hybrid PCIe for dedicated cards. This has already been patented for supercomputers, with HTX running at 3200 MHz.

Now, the other option would be something like Freedom Fabric integrated into the hardware to allow a direct path instead of using the slower PCIe 3 bus.

Now, keep in mind, PCIe 3 is only slow by comparison to your northbridge; if we compare it to the other buses in a given system, PCIe 3 is still relatively quick until you get to large-scale commercial deployments, where interconnect technology reaches beyond the public and small-business niches.
 


"The HSA design allows multiple hardware solutions to be exposed to software through a common standard low-level interface layer, called HSA Intermediate Language (HSAIL). (...) And HSAIL frees the programmer from the burden of tailoring a program to a specific hardware platform – the same code runs on target systems with different CPU/GPU configurations."

http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/

HSA is aimed at much more than just APUs, Juan... Good thing you didn't get your information from marketing slides 8)

In any case, ASICs are also part of what HSA aims to integrate. Currently, AMD only has the APUs to showcase the benefits; that's why you only see APUs being targeted (by marketing).

Cheers!

EDIT: This will also clear some doubts about HSA: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/hsa10.pdf
 

juanrga



I am not restricting the concept of APU to CPU+GPU; I am using a more general concept of APU:

An accelerated processing unit (APU, also advanced processing unit) is a computer's main processing unit that includes additional processing capability designed to accelerate one or more types of computations outside of a central processing unit (CPU). This may include a graphics processing unit (GPU) used for general-purpose computing (GPGPU), a field-programmable gate array (FPGA), or similar specialized processing system.

[...] APUs can also include video processing and other application-specific accelerators.

An example of "similar specialized processing system" is a DSP.

When I say HSA APU I am saying "LCU+TCU", sorry if this generated confusion.

I suppose that marketing is also the reason why some people associate "APU" with "CPU+GPU made by AMD". I have always found it interesting how some people in forums or in tech journalism call the A10-7850K an APU, but don't call the i5-4570R an APU, and don't call the heterogeneous processor running their phone an APU.
 


Then even regular CPUs are APUs.

Cheers! :p
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810
HSA's intention is efficiency, plain and simple. The issue is we just don't have hardware that can fully leverage it yet. Current HSA products still have low memory bandwidth compared to CPU+dGPU solutions. Even the PS4 with GDDR5 has lower bandwidth than a CPU+dGPU setup (dual-channel DDR3 + GDDR5). You'll need an HSA APU with 3D memory to really take advantage.
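For a rough sense of the bandwidth gap being described, peak memory bandwidth is just bus width times effective transfer rate. The bus widths and transfer rates below are the commonly published figures for these parts and should be treated as approximations:

```python
# Back-of-envelope peak bandwidth: bytes per transfer times transfers
# per second. Published figures, so treat the results as approximate.

def peak_bandwidth_gbs(bus_width_bits, transfer_rate_gts):
    """Peak bandwidth in GB/s: bus width in bytes times rate in GT/s."""
    return bus_width_bits / 8 * transfer_rate_gts

ps4_gddr5 = peak_bandwidth_gbs(256, 5.5)    # ~176 GB/s (PS4's unified pool)
dual_ddr3 = peak_bandwidth_gbs(128, 1.866)  # ~29.9 GB/s (dual-ch DDR3-1866)
r9_290x = peak_bandwidth_gbs(512, 5.0)      # ~320 GB/s (dGPU-local GDDR5)

print(ps4_gddr5, dual_ddr3, r9_290x)
```

So the PS4's single pool beats a desktop CPU's DDR3 handily, but a dGPU's local GDDR5 still roughly doubles it, which is Cazalan's point.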
 

8350rocks



You would need a vastly more complex, and more expensive, solution, with technologies not readily available to the general public integrated into it, to take advantage of what HSA could possibly be.

However, I know for a fact HSA will run on dCPU + dGPU over a bus... there is just no solution available to do it... yet.
 


No, you have absolutely zero data being saved. The exact same datasets are copied regardless, as the dGPU needs all the data locally before it can process it. What it does save you is the large amount of administrative overhead from micro-managing the dGPU, which is quite considerable. It's more important for iGPU-type situations where the system memory is also the graphics memory; then the dataset doesn't need to be copied over an external bus. You would still have the sizable administrative overhead of managing the semantics of memory addressing, but with HSA that's not necessary and the iGPU can manage itself. HSA is about reducing administrative overhead by creating a single standard for dissimilar processors to co-exist inside the same system, using the same memory pool, without needing a third party to micromanage them.
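The copy-versus-share distinction above can be made concrete with a toy Python sketch. Plain byte buffers stand in for system RAM and "VRAM" here; the real mechanics involve DMA over PCIe, which this deliberately ignores:

```python
# Toy illustration: a dGPU needs the whole dataset duplicated into
# device-local memory, while an HSA iGPU works on the very same buffer
# the CPU wrote (zero-copy). Python buffers stand in for both memories.

data = bytearray(b"frame-data" * 4)   # the CPU's working set

# dGPU-style: an explicit full copy over the bus into "VRAM".
vram_copy = bytes(data)               # duplicate of the dataset

# hUMA-style: a zero-copy view of the same physical bytes.
shared_view = memoryview(data)        # no duplication at all

data[0:5] = b"DELTA"                  # CPU updates the buffer...
print(shared_view[0:5].tobytes())     # ...the "iGPU" sees it immediately
print(vram_copy[0:5])                 # ...while the "VRAM" copy is stale
```

The stale copy is exactly why dGPU datasets need constant re-copying and synchronization, and why the administrative savings ETA describes only fully apply on shared-memory hardware.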
 

juanrga

HSA means Heterogeneous System Architecture. HSA's main goal is to fully integrate processors of different kinds for compute. The architecture proposes a combination of both hardware and software techniques. The software part removes the programming difficulties of current heterogeneous programming paradigms such as GPGPU programming. Earlier I gave a figure showing how HSA reduces the amount of code needed compared to traditional programming techniques such as OpenCL.

The hardware part of HSA is designed to support the new software but also to increase performance. hQ, hUMA... increase performance. In fact, one of the specific goals of HSA is the reduction of LCU/TCU "communication latency", as mentioned in the spec. hUMA increases performance by eliminating the latencies and bottlenecks associated with moving data among different memory pools. hQ increases performance by avoiding extra calls (bold emphasis mine):

HSA devices communicate with one another using queues. Queues are an integral part of the HSA architecture. Latency processors already send compute requests to each other in queues in popular task-queuing runtimes like ConcRT and Threading Building Blocks. With HSA, latency processors and throughput processors can queue tasks to each other and to themselves.

The HSA runtime performs all queue allocation and destruction. Once an HSA queue is created, the programmer is free to dispatch tasks into the queue. If the programmer chooses to manage the queue directly, then they must pay attention to space available and other issues. Alternatively, the programmer can choose to use a library function to submit task dispatches.

A queue is a physical memory area where a producer places a request for a consumer. Depending on the complexity of the HSA hardware, queues might be managed by any combination of software or hardware. Queue implementation internals are not exposed to the programmer.

Hardware-managed queues have a significant performance advantage in the sense that an application running on an LCU can queue work to a TCU directly, without the need for a system call. This allows for very low-latency communication between devices, opening up a new world of possibilities. With this, the TCU device can be viewed as a peer device, or a co-processor.

OpenCL doesn't have hQ; under OpenCL only an LCU can queue to a TCU. I also gave figures earlier with benchmarks of HSA versus OpenCL.
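As a loose analogy for the peer-to-peer, user-mode queueing described in that passage, here is a minimal Python sketch. Thread-safe queues stand in for HSA queues and threads stand in for the LCU and TCU; this is an analogy for the dispatch pattern, not the HSA runtime API:

```python
# Sketch of the hQ idea: producers dispatch work straight into a
# consumer's queue with no driver or system call in the middle, and
# either side can be producer or consumer (peers, not master/slave).

import queue
import threading

tcu_queue = queue.Queue()   # "throughput unit" inbox
lcu_queue = queue.Queue()   # "latency unit" inbox
results = []

def tcu_worker():
    while True:
        task = tcu_queue.get()
        if task is None:            # sentinel: no more work
            break
        # TCU does the data-parallel part, then queues a follow-up
        # task back to the LCU, as hQ allows peers to do.
        lcu_queue.put(("reduce", [x * x for x in task]))
    lcu_queue.put(None)

t = threading.Thread(target=tcu_worker)
t.start()

tcu_queue.put([1, 2, 3])    # LCU enqueues work directly
tcu_queue.put(None)

while True:
    msg = lcu_queue.get()
    if msg is None:
        break
    _op, payload = msg
    results.append(sum(payload))   # LCU finishes the serial part

t.join()
print(results)   # [14]
```

Under OpenCL's model, only the "LCU" side could enqueue; the TCU queueing work back to the LCU (or to itself) is the part hQ adds.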

It is evident that only HSA APUs (i.e. LCU+TCU on die) can extract the full potential of heterogeneous compute. dCPU+dGPU for HSA makes very little sense because it is several steps backward. First, the use of an off-die interconnect will increase latency, killing performance; second, it will require two memory pools, killing hUMA; third, PCIe or a similar bus will kill hQ and the peer character of TCUs; fourth, data will need to be copied and constantly synchronized between the LCUs and TCUs, killing performance; fifth, it will reduce efficiency by increasing the power consumption associated with moving data off die...

It is not a surprise that the node architecture of the extreme-scale supercomputer recently announced by AMD is based on a future HSA APU:

http://www.amd.com/en-us/press-releases/Pages/extreme-scale-hpc-2014nov14.aspx

There is no dCPU+dGPU in the design, despite someone here insisting on the contrary. Note: I know the details of the node architecture and even know the engineers behind it.
 

8350rocks

http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/

The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model.

Perhaps I am reading this wrong... but it seems to me that they are discussing this in a manner that no longer requires a data copy to be stored in VRAM. Now, perhaps I am missing additional information... as I honestly have not talked to a software engineer about the implementation details; I have only gotten the "high level" information at this point. So, I can see where you are coming from, but I came away from the discussion with different impressions. Perhaps I am wrong in this instance... I will seek clarification.
 

8350rocks



Juan, I do not dispute that they are going to try to build that with APUs.

However, I dispute the efficiency of it all. I seriously doubt this is going to be the "be all, end all" of designs considering the costs in terms of physics and thermodynamics to make it work from an efficiency standpoint.

What this is going to do is end up producing many very small node APUs designed for massively parallel workloads that are impractical beyond the vantage point of HPC.

No one in the real world who is not using an HPC now will use something designed for a very specific niche role. And while it may advance HPCs considerably, the impact on the public sector will be very minuscule in comparison, because the workloads are completely different.

You are proposing someone at home will use a dump truck to take their trash to the curb... (best analogy I could think of to show scale). It is impractical, unfeasible, and average consumer/small-business workloads are primarily single-threaded in nature.

So rant on about APUs, but for your average HEDT or server, this is worthless tech, and only a "halo" product leap for HPCs. I applaud AMD for going after it...I hope they succeed. However, very little, if any, of the water droplets from that pond will spill down to the HEDT sector. You may see them in workstations, at some point, in the distant future, once they are done with HPCs. However, that day is a long way off still. They have not even built this HPC, and you are already touting it as the be all, end all of all computers.
 

Reepca

Honorable
Dec 5, 2012
156
0
10,680
What I hear Juan saying: Trying to pair HSA features with a dGPU that has so many obstacles to efficiently using those features will cause little gains. For this reason, the planned supercomputer isn't using dGPUs (which could probably have more overall compute capability), instead going for an efficient iGPU.

What I hear Mr. 8350 saying: The average consumer isn't going to use a supercomputer, therefore APUs (or maybe he's just talking about HSA APUs in particular) aren't an improvement for the average consumer.

"We used stones to build the Great Wall"
"But most people want a house, not a Great Wall! Therefore stone is a useless resource for everyone else."

My comprehension must suck. I'm missing some connection. I mean personally, I think that parallel workloads on an easy-to-program-GPU would be great for the average consumer who does anything with physics, graphics, or large data (*cough* games *cough*).
 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780


For games, you could use a spare GPU to calculate

Physics like PhysX does
Global Illumination

and probably more. I think AMD hinted at a lot of this when they were talking about what you could do with Mantle. I recall one video where they talked about things like using one GPU to just render geometry, another to do shading, etc.; basically, completely abandoning the Crossfire and SLI approaches. However, the industry abandoning PhysX and going for DirectCompute or OpenCL physics would be good for everyone. OpenCL with Bullet would probably be best, though.

8350 hinted that HSA will work across dGPU and dCPU. He is also hinting that an APU is not worthwhile for consumers, by which I'm assuming he means HEDT. Given how we have already beaten to death that an APU will never scale to the performance of a dGPU + dCPU given socket TDP issues, I'm guessing he's saying HEDT will get HSA across dGPU and dCPU eventually, and that an APU won't be enough to satisfy the needs of HEDT users.

I suggested that HSA would also allow systems with two or more APUs all functioning as basically one HSA virtual device. Ergo, a system of four 8-core APUs would end up as a 32-core CPU plus one big GPU. However, I haven't seen 8350 mention anything about that, so it's probably just baseless conjecture at this point, though I would assume it would have to work for HPC APU systems. I saw a lot of potential in AMD uniting their HEDT platform with what is currently the FM platform, letting people just buy multiple APUs and put them in a single board. But I'm guessing that it's somehow significantly cheaper if that stays only in HPC than a single-socket motherboard with add-in cards like dGPUs.

As for the Jaguar SoC consoles, they are not a case for efficiency. The design goal of those devices was not to be the most efficient; it was to get the best performance possible in a given power envelope. Meaning if a 200 W PS4 was more efficient than the ~100 W PS4 we have today, they'd still be forced to take the ~100 W PS4, because 200 W is just too much.

It seems to me like efficiency has a sweet spot. If it's too low, it's not using much power but it's not fast enough (hence why we don't see massive arrays of 100 MHz CPUs). If it goes too high, it's faster than everything else (generally), but energy consumption is not increasing linearly with performance.

Mobile wants to be in the spot where power saving is more important than efficiency. HPC wants to be as efficient as possible. HEDT wants the most performance possible, and the only limits on heat and power consumption are literally whether you can keep the chips cool (and price).

 

juanrga

@Reepca, AMD will be reusing the design for the rest of the line. What happens is that the APU powering your future laptop or your future HEDT will be a cut-down version: cheaper, with less memory, and consuming less power. The consumer version will use the GPU for computational tasks outside of games.
 

juanrga



During the presentation of the asymmetric multi-GPU support on Mantle, AMD mentioned using the dGPU for graphics and the iGPU (aka APU) for post-processing/physics...

As explained before, HSA across dGPU and dCPU implies going several steps backward compared to existing HSA hardware. Not only has the 'argument' against APUs scaling up been rebutted innumerable times before, but things go in the other direction: AMD has finally confirmed that it will use HSA APUs for extreme-scale compute (as I predicted one or two years ago).

Your suggestion that HSA would also allow "systems with two or more APUs all functioning as basically one HSA virtual device" has all the difficulties mentioned before and some more.

The Jaguar cores on the PS4 were selected for their superior efficiency over Piledriver. Again, a simple computation explains why engineers chose that configuration for the console. The PS4 is bound to about 100 W, but any computational device from phones to supercomputers is power bound. HEDT is limited to about 1000 W. Thus HEDT users try "to get the best performance possible in a given power envelope" like the rest of users, because the equation

Performance = Efficiency x Power-consumption

holds for any computational device. If you are bound by 1000 W, then the only way to double the performance of the hardware is for engineers to increase the efficiency by 2x. And increasing efficiency is what engineers designing HEDT devices have been doing and will continue to do in the coming years.
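The fixed-budget reading of that relation can be spelled out numerically. The efficiency figures below are illustrative, not measurements:

```python
# Performance = Efficiency x Power: with the power budget fixed,
# performance moves only when efficiency does. Illustrative numbers.

def performance_gflops(efficiency_gflops_per_w, power_w):
    return efficiency_gflops_per_w * power_w

budget_w = 1000                            # the ~1000 W HEDT cap
gen1 = performance_gflops(20, budget_w)    # 20 GFLOPS/W -> 20 TFLOPS
gen2 = performance_gflops(40, budget_w)    # 2x efficiency -> 2x perf

print(gen1, gen2, gen2 / gen1)
```

The same arithmetic applies at every budget in juanrga's list, from a ~1 W phone to a ~20 MW supercomputer; only the constant changes.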
 

8350rocks

Juan:

No, they will not.

You have Freedom Fabric, which is the most efficient interconnect available for large-scale data movement. Any idea what the cost is to implement that in HEDT? A 5960X system would be a bargain comparatively.

Additionally, ray tracing is not far off. You will not be able to get a ray-tracing-capable GPU as an iGPU on a single-socket board. It will not happen in my lifetime unless a mass-producible room-temperature superconductor is marketed in the next 40-ish years... (not likely).

If AMD was so sure your perception was true... why would they be aiming to sell the most dGPUs and not aiming to put an APU in everything imaginable?

Think about that...why would they be pushing "dead" tech hard if they were killing it off themselves?

They would not.
 

logainofhades

Titan
Moderator


IDK, I think I would much rather have the stone house. Given the fact I live in the Midwest, a stone house would be a bit safer when tornadoes hit. :p
 

juanrga



I live in a stone house, but there are no tornadoes here. ;-)
 


Well, I wonder if an F1 (or was it T1?) is enough to pick up your entire house, even if it's made from stone or wood. I'm sure as well that with an F5 (T5?) it won't matter one bit, haha.

In any case... Performance and efficiency... Interesting.

Given that simplistic formula, it's easy to play ball and give dumb scenarios.

Currently, GPUs have a theoretical SP performance of around 5 TFLOPS? I think the R9 290X sits at ~5.5 TFLOPS at 250 W. Let's use that. And for reference, Kaveri's A10-7850K is ~900 GFLOPS at 95 W. To simplify further, TDP = watts consumed. Also, both are at 28 nm, right? Using process nodes within striking distance of each other makes the comparison more interesting. And I'll use SP, since DP is capped artificially for the consumer-level cards.

So, for HEDT, if you want to do heavy computing, you'd need 6 A10-7850Ks to reach ~5.4 TFLOPS, making ~570 W. That means you'd have to introduce a platform with 6 sockets, sufficient interconnects to cable all that mess, AND enough slots for all the RAM banks you'll need (that's not even counting space). And that's not even counting that you can multiplex PCIe lanes to have more than 2 cards per x16 split.

Holy cow... I'm sure GPUs are totally dead. Yessir.

Is there any other formula you want us to try, Juan?

Cheers!
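The arithmetic in that post can be checked with a quick script, using the same published figures (rounded) and the same TDP-equals-power simplification:

```python
# Checking the APU-farm-vs-dGPU numbers above: R9 290X ~5.5 TFLOPS SP
# at ~250 W, A10-7850K ~0.9 TFLOPS at ~95 W (published figures, rounded).

apus = 6
apu_tflops, apu_w = 0.9, 95
dgpu_tflops, dgpu_w = 5.5, 250

total_tflops = apus * apu_tflops          # ~5.4 TFLOPS from six sockets
total_w = apus * apu_w                    # 570 W for the APU farm

dgpu_eff = dgpu_tflops * 1000 / dgpu_w    # ~22 GFLOPS/W
apu_eff = apu_tflops * 1000 / apu_w       # ~9.5 GFLOPS/W

print(round(total_tflops, 1), total_w, round(dgpu_eff, 1), round(apu_eff, 1))
```

Six A10-7850Ks land just shy of one 290X while drawing more than twice the power, which is the post's point; juanrga's reply below argues the comparison itself is unfair because the APU's figure includes its CPU cores.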
 

jdwii

Splendid
Sadly, I'm siding with Juan a bit more once again.
"During the presentation of the asymmetric multi-GPU support on Mantle, AMD mentioned using the dGPU for graphics and the iGPU (aka APU) for post-processing/physics..."
That means we will probably see them both do work for gaming. What's wrong with that? If anything, that makes me want that product today.

Efficiency, as stated 3 pages ago, is the most important measure for designing a CPU architecture: "(especially of a system or machine) achieving maximum productivity with minimum wasted effort or expense." Again, for AMD's HEDT parts they will simply clock them up and make them less efficient, but when making the architecture they will be focusing on a design that gets as much performance as possible within a specific power envelope. That way they can make supercomputer parts with the design, as well as parts that consume as little as 3 watts.

As many of you already know, it's getting harder to shrink designs, which is why I think dGPUs will be around for quite a long time, unless CPUs consume only 10% of the die space, in which case that CPU still has to have strong single-threaded performance and probably 8 cores. Still wondering what the cost will be to the consumer when they do stacked memory, because currently it's way too expensive.
Anyways my rant is over.
 

8350rocks

No...

Let me elaborate once and for all.

The HPC pet project is a halo project, something for them to sit back and say, "see, we did that!" However, you are talking about something specifically designed for a specific set of circumstances: running massively parallel compute tasks with many other interconnected nodes.

In short... you would be looking at an APU something along the lines of a single x86 module with many GCN SPs, as many as you can fit on die in fact, with just enough x86 horsepower to keep the GCN cores fed while running fluid dynamics simulations.

Now, consider that even if APUs become 20 times more efficient... let us say 900 GFLOPS for ~4.5 W, which is where this theoretical HPC APU Juan claims will live, based on current performance today.

Now, by that time, assuming 10% efficiency gains per generation, we are looking at 9 generations to get there. dGPUs are growing at 20% compute performance (bare minimum, sometimes more) per generation, with approximately 10% less power consumption. That means that by the time that APU does ~1 TFLOPS for ~5 W, dGPUs will be 516% more powerful. So, for ~110 W, we will get approximately ~28.9 TFLOPS, assuming only 20% performance improvement per generation and 10% power consumption reduction.

So, to achieve the same ~29 TFLOPS, we would need 29 APUs to hit what one dGPU accomplishes.

Now... those 29 APUs would consume 145 W... while the single dGPU card consumes a mere 110 W.

Additionally, we have not considered all the interconnects and other baggage such a system with 29 APUs would require. Assuming just 1 W per interconnect would now make that a ~175 W system... and most interconnects consume 2-3x what I am speculating.

Now, let us consider if you have 2 of these monster 29 TFLOPS dGPUs in a motherboard with 2 expansion slots. For minimal power cost beyond the cards themselves, our system has now become a 220 W, 58 TFLOPS HPC; meanwhile, to do the same with APUs would require another 29 interconnects, for a grand total of 58.

Now, there are other ways to extrapolate that, and all of it is very hypothetical. However, my point is... while GPUs have the luxury of focusing on performance AND efficiency, APUs do not... because a CPU is inherently much less efficient than a GPU. The serial nature of its instructions makes it so... you need much more aggressive logic/branch prediction, more cache layers, and many more power-consuming components to go into an APU.
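The generational compounding in that post can be run as a quick check, using the post's own growth assumptions (20% more compute and 10% less power per dGPU generation over 9 generations, APU at ~1 TFLOPS/~5 W). The strict figures come out slightly below the post's (~28.4 TFLOPS at ~97 W, needing ~28 APUs rather than 29), but the ratios, which are the actual argument, are the same:

```python
# Compounding the post's assumed per-generation dGPU gains over the
# 9 generations it posits, starting from ~5.5 TFLOPS at ~250 W.

gens = 9
dgpu_tflops = 5.5 * 1.2 ** gens        # ~28.4 TFLOPS
dgpu_w = 250 * 0.9 ** gens             # ~97 W
apu_tflops, apu_w = 1.0, 5.0           # the post's hypothetical HPC APU

apus_to_match = round(dgpu_tflops / apu_tflops)   # ~28 APUs
apu_farm_w = apus_to_match * apu_w                # ~140 W, interconnects extra

print(round(dgpu_tflops, 1), round(dgpu_w), apus_to_match, apu_farm_w)
```

Note these are the post's speculative growth rates, not measured trends; change either percentage and the crossover moves substantially, which is why juanrga's reply attacks the premises rather than the arithmetic.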



 

juanrga



Your values of performance and TDP for the R9 290X are a bit off. Using correct values we obtain 19.4 GFLOPS/W:

http://en.wikipedia.org/wiki/AMD_Radeon_Rx_200_Series
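The corrected figure follows directly from the card's published specs (2816 stream processors, 2 FLOPs per clock, ~1.0 GHz, ~290 W board power):

```python
# Peak SP throughput of the R9 290X from its published specs,
# divided by its ~290 W board power.

shaders, flops_per_clock, clock_ghz = 2816, 2, 1.0
gflops = shaders * flops_per_clock * clock_ghz   # 5632 GFLOPS peak SP

print(round(gflops / 290, 1))                    # ~19.4 GFLOPS/W
```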

Then you compare the efficiency of the dGPU against an APU, but that is a faulty comparison.

CPUs have poor efficiency because they are made of latency-optimized cores. By comparing the dGPU against the APU you are including the bad efficiency of the iCPU but, at the same time, ignoring the bad efficiency of the dCPU.

Therefore, you are artificially favoring the dGPU. And I say "artificially" because I don't know of any dGPU that works standalone without a dCPU, do you?

You should compare the efficiency of the APU against the whole [dCPU + dGPU] system. If you make the correct computation, and if you know basic silicon scaling laws, you will understand why AMD engineers are proposing HSA APUs for the future supercomputer.
 

Reepca

I am absolutely certain dGPUs will always have more raw performance than APUs. The difference lies in how well that performance can be put to use: if you can solve 5 addition problems in your head in a second, but it takes you 5 seconds to read each problem and another 5 seconds to write the answer down, the guy who takes 2 seconds for each stage will still get the homework done faster.

I guess what I'm trying to say is that there are tasks for which the low-latency, serial nature of CPUs and the massively parallel nature of GPUs are both needed. An example I thought of just recently deals with collision detection: one could have the iGPU check for collisions between moving objects and all other objects (3 levels of data parallelism: multiple moving objects, multiple objects to check against each moving object, multiple triangles to check for each object), have the CPU sort the results in order of which collisions occur first, then do physics calculations in order of collisions, passing off work for re-checking collisions at each update to the GPU. The CPU does quick, sequential operations where they need to be done in order (in this case, in order of time), and the iGPU accelerates the actual finding of collisions (and possibly the resulting physics, depending on how sophisticated they are) thanks to the inherent data parallelism. With the lacking bandwidth and high latency of the PCI Express bus, I highly doubt such efficient communication between the devices would be plausible with a dGPU.
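That division of labour can be mimicked in a few lines of Python. One-dimensional motion stands in for real geometry, and the nested loop marks the part an iGPU would chew through in parallel; this is a simplification of the idea, not a claim about any real engine:

```python
# Sketch of the CPU/GPU split described above: a data-parallel pass
# finds candidate collision times for every (mover, obstacle) pair,
# and a serial pass sorts and resolves them earliest-first.

movers = [(0.0, 2.0), (10.0, -1.0), (3.0, 0.5)]   # (position, velocity)
obstacles = [4.0, 8.0]                             # points on the line

def gpu_pass(movers, obstacles):
    """Embarrassingly parallel: every pair is independent, which is
    exactly the shape of work an iGPU would take."""
    hits = []
    for m, (x, v) in enumerate(movers):
        for o in obstacles:
            if v != 0 and (o - x) / v > 0:          # will it get there?
                hits.append(((o - x) / v, m, o))    # (time, who, where)
    return hits

def cpu_pass(hits):
    """Serial and order-dependent: resolve earliest collisions first."""
    return sorted(hits)

for t, m, o in cpu_pass(gpu_pass(movers, obstacles)):
    print(f"t={t:g}: mover {m} reaches {o:g}")
```

With shared memory, the two passes hand each other nothing but a list in place; over PCIe, that list (and the scene data behind it) would have to cross the bus every update, which is Reepca's objection.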

I could be wrong, of course. It's been known to happen.
 

juanrga



Absolutely right! By improving efficiency AMD will provide the best possible performance in each of the power budgets:

phone (~1 W) < tablet (~3 W) < laptop (~30 W) < console (~100 W) < HEDT (~1000 W) < server (~100 kW) < extreme-scale supercomputer (~20 MW).



You got the number of cores right! The APU that AMD is designing for the extreme-scale supercomputer uses 8 strong cores for the iCPU. Each core has more performance than a Piledriver module (as said before, I know the details of this project).

Nvidia engineers are also working on an 'APU' for future supercomputers, but they call it HCN. The older design used 16 cores for the iCPU, but recent revisions of the design reduce the number of cores to 8. Those cores are a scaled-up version of Denver. If the data Nvidia gives is accurate, and if I am not making some mistake, I obtain that those future cores will provide about 50% more IPC than current Denver cores. The iCPU occupies less than 10% of the whole die.

The cost of those designs will be very high, because they will use the best available technology, pushed to its limits. For instance, the stacked memory will run at 4 TB/s of bandwidth.
 