AMD CPU speculation... and expert conjecture

Reepca · Dec 1, 2014

gamerk316 :

We have one thing good at serial workloads (which we're using for serial and parallel workloads) and one thing good at parallel ones (which we're using for a handful of parallel workloads). The goal is not to make a serial processor do parallel or vice versa, the goal is the opposite - to not mix and match. And for many parallel tasks, the initial transfer overhead is far too high because the GPU requires different data for each computation (I remember reading about GPU-accelerated collision detection that would be accelerated upwards of 100x if it didn't have to be copied back and forth between GPU and system memory). Not all parallel tasks are low-data high-processing, which is what dGPUs are best at.

I'm just imagining the nightmare that would occur if you tried to copy to the dGPU information about the state of a bunch of objects, have it check for collisions, copy the resulting collisions back to the CPU for it to determine which one occurred first and if there are dependencies among the collisions (because you can't analyze data dependencies in parallel), then copy the ones that do have dependencies back to the dGPU to re-evaluate with the updated information, and repeating until all the objects have had all collisions for the given time frame evaluated. It would be a pain to try to program, and an even bigger pain to try to make work efficiently.

I can imagine it working much better with HSA. But no, I don't think dGPUs are going to suddenly go extinct - GDDR5 system memory is still impractical, and the main use of graphics cards will remain rendering - one of the tasks where the initial transfer cost is well worth the high-speed RAM speedup. And I would imagine that trying to fit high-end graphics chips like the R9 290X (doesn't that have stock water cooling? Or am I thinking of the wrong one?) into the same chip as a CPU will remain impractical for a long time to come.

Summary: Graphics cards will remain around, to do graphics. Perhaps we should start thinking about APUs as... well... APUs instead of as a CPU + iGPU. Personally I would be thrilled to have an APU doing the main execution of a game I'm playing and having a dGPU doing the rendering (in terms of GFLOPS, doesn't the a10-7850k overall outdo intel's higher-end parts?).

gamerk316 · Dec 2, 2014

The whole purpose of HSA is to try and correct this issue. As it stands when your working on a bit of software that isn't 3D graphics but has large blocks of parralell codde in it, this code runs on the serial focused CPU which is slow. Now if the portions of the code are fairly short, using Open CL to manually move this to the GPU, perform you calculation, then move it back again will cost more time than the time saved by the GPU doing the work. With HSA (and an included GPU portion on die) the code can be passed to the GPU for the parallel portion with no moves or copying. This allows the best processing resource to work on the code when required, without the inherent costs of shoving across a slow bus.

False. Most parallel tasks end up requiring significant amounts of memory access, which means the time to copy data twice over the PCI-E bus is far less then the time needed to constantly hammer main memory over and over again. That's why those tasks, like encoding, see such speedups via OpenCL. HSA is trading one problem for another, and frankly doing it in a worse way.

8350rocks · Dec 2, 2014

Why use pcie 3.0 when htx could do it?

colinp · Dec 2, 2014

8350rocks :

Right, so they can have a Linux Opengl driver that implements an open standard they don't control, but which works on any distro, but they can't make a Linux Mantle driver that implements their proprietary API that is completely under their control.

And if they are going the open source route: who is the maintainer, what is the project name / source forge name (a - la Mesa), where is the API standard published and under what accreditation and where can I read about this exciting development on Phoronix?

8350rocks · Dec 2, 2014

colinp :

8350rocks :

Right, so they can have a Linux Opengl driver that implements an open standard they don't control, but which works on any distro, but they can't make a Linux Mantle driver that implements their proprietary API that is completely under their control.

And if they are going the open source route: who is the maintainer, what is the project name / source forge name (a - la Mesa), where is the API standard published and under what accreditation and where can I read about this exciting development on Phoronix?

Emphasis mine.

Try actually reading the post next time.

Also, for your enjoyment: http://www.pcper.com/news/Graphics-Cards/AMD-Planning-Open-Source-GameWorks-Competitor-Mantle-Linux

Straight from the gentleman I got it from...my, that only took ~60 seconds on google to find...

juanrga · Dec 2, 2014

gamerk316 :

No. In the first place, HSA is an architecture, OpenCL is only software. In the second place, OpenCL 2.0 implements new elements directly taken from the HSA specification (HSA foundation members are Khronos group members as well)

https://www.khronos.org/news/press/khronos-finalizes-opencl-2.0-specification-for-heterogeneous-computing

The Khronos Group’s OpenCL 2.0 is the first key, foundational, programming language to truly support the core capabilities of HSA enabled hardware.

In the third place, only current PC APUs are limited by slow DDR3 memory. The PS4 APU uses GDDR5. Moreover, this limitation of PC APUs is temporary, because post-Carrizo APUs will use HBM (which is faster than your beloved GDDR5).

In the fourth place, there are benchmarks showing the advantage of HSA over traditional OpenCL 1.x implementation.

hc4017-hsa-compilers-technology-by-debyendu-das-15-638.jpg

and there are benchmarks where HSA doesnt improve performance over OpenCL, but reduces the difficulty of programing by one order of magnitude

In the fifth place. The advantages of HSA are so obvious that everyone in the industry is developing something similar: Intel has its neo-heterogeneity approach, Nvidia has CUDA, IBM has its CAPI-based accelerator approach, ARM has big.LITTLE HMP... In fact just after HSA presented hUMA, Nvidia has included unified memory in last version of CUDA

colinp · Dec 2, 2014

8350rocks :

I did read your post and quite apart from the fact that it showed you have no idea how the open source model actually works, it just sounded like a lot of apologist drivel.

I saw that video back in June when it came out. I noted back then that there was no commitment to Linux, except to say, "we'll think about it." Got anything more recent, by the way?

And when I mean "open" I mean open like Opengl, nor like DirectX. Managed by an independent third party like Khronos with ISO accreditation.

Mantle isn't coming to Linux, I'll bet you a beer on that, and be happy to lose.

juanrga · Dec 2, 2014

colinp :

Yes, his last comments didn't make any sense. But what caught my attention is that now he gives excuses about the lack of Mantle on linux drivers, when some months ago he said us full mantle for linux was ready:

I would take AMD word about linux and we have AMD official position updated to November. The answer to the question "do you have plans to bring Mantle to Linux?" was

-Fran- · Dec 2, 2014

colinp :

Just to add a little bit: "Free" != "Open".

With openness, anyone can implement code on it's own accord using the spec. When free, you have to make do with the proprietary implementation or ask permission to implement your own using the spec (if it's disclosed and public might be easier, but you still do need permission - Java's JVM is a prime example).

In the case of HSA, from all the things I've read, they're aiming for "Free" and not "Open". And careful with semantics. Phrases like "Open for everyone to use" means "Free", but doesn't imply "Open". In any case, the practical difference between the 2, is how hard it will be for the Open Source community to develop something on their own, but at regular end user level, you can still use your software that is compatible with the spec (or uses the closed implementation).

Cheers!

EDIT: I got another quoted text from somewhere, haha. Weird.

blackkstar · Dec 2, 2014

8350rocks :

Hopefully AMD puts Mantle in the FLOSS Radeon driver and makes the distros sort things out.

Also, stop moving goalposts. HSA solves the problem of having to move entire working memory over a bus to a different device if that device is better suited for those calculations.

Is it going to get rid of all bottlenecks like memory speed and latency? No. Does it make the situation better? Yes.

Lets look at competing solutions to HSA.

Nvidia: generic GPGPU, same problems OpenCL on AMD hardware faced. Working on HSA-like system but is many steps behind
Intel: Xeon Phi, basically a separate computer on a board. You have to send working set to basically a machine inside your machine, let it calculate on a lot of weak cores, and then ship it back to main memory.

AMD are the only ones who have a solution here that doesn't involve transferring entire working memory over some sort of bus when you want to share something between different compute devices. Intel solution is specially awful because you have basically a neutered x86 GPU, meaning no good single thread performance.

Everyone loves to bash on AMD single thread performance, but Steamroller or Excavator single thread performance is going to be a lot better than running a single thread on a GPU or Phi.

juanrga · Dec 2, 2014

-Fran- :

We are discussing Mantle not HSA. In any case, what you say about HSA is not true:

The HSA Foundation members are building a heterogeneous compute software ecosystem built on open, royalty-free industry standards and open-source software: the HSA runtimes and compilation tools are based on open-source technologies such as LLVM and GCC.

HSA Foundation and it member are committed to the success of HSA and OpenSource project. HSA is natural fit for open source projects since the specification will be public and royalty free. Key members of the foundation are driving key technologies into open source comunity like IOMMU support in the linux kernel as well as bring memory management enhancement, driver stacks, exceptions handling, user mode interfaces, graphic driver interop and arbritration for the support for HSA enablement in operating system. HSA core compiler for generating HSAIL will be based on LLVM. They are also bring open source standards template library for Parallel Algorithms and Data Structures and other tools like simulators into the open source community. This is only the start of the activities.

-Fran- · Dec 2, 2014

juanrga :

-Fran- :

We are discussing Mantle not HSA. In any case, what you say about HSA is not true:

The HSA Foundation members are building a heterogeneous compute software ecosystem built on open, royalty-free industry standards and open-source software: the HSA runtimes and compilation tools are based on open-source technologies such as LLVM and GCC.

HSA Foundation and it member are committed to the success of HSA and OpenSource project. HSA is natural fit for open source projects since the specification will be public and royalty free. Key members of the foundation are driving key technologies into open source comunity like IOMMU support in the linux kernel as well as bring memory management enhancement, driver stacks, exceptions handling, user mode interfaces, graphic driver interop and arbritration for the support for HSA enablement in operating system. HSA core compiler for generating HSAIL will be based on LLVM. They are also bring open source standards template library for Parallel Algorithms and Data Structures and other tools like simulators into the open source community. This is only the start of the activities.

Ah, you're right. I meant "MANTLE" when mentioning "HSA". Sorry about that.

Cheers!

Reepca · Dec 2, 2014

So after looking through some HSA slides, one thing I noticed is that they talk about "passing pointers, not data" to the dGPU. I assume this is just a change in the programming model (to simplify stuff), and the data still has to be copied to GPU memory? Even if you actually copy a pointer into dGPU memory, it's still going to have to fetch what it's pointing to over the molasses-speed PCI-E bus, right?

Was wondering if you guys could provide some clarification?

8350rocks · Dec 2, 2014

No, GPU can now read system memory to work directly from RAM instead of copying.

@colinp: Yes Richard Huddy was on a conference call with myself and a few others. He said that AMD cannot commit to anything more than putting up an open source API at this time. He said they are working on drivers and would provide people with the same opportunity intel and nvidia have to use mantle. They are doing something similar to what they did with freesync. Then the rest remains for people to adopt it.

I am sure their linux driver will be able to use mantle based on the conversation, however, getting people to incorporate it into a distro is another matter. The example he gave was debian, and how many forks there are in debian alone. So, he said they would help as much as they could, but cannot dedicate a team to doing a comprehensive implementation for any single distro. He did comment that ubuntu would be most likely to begin work once the API was released though, and so that may trickle down to other distros accordingly.

Reepca · Dec 2, 2014

8350rocks :

320px-HSA-enabled_virtual_memory_with_distinct_graphics_card.svg.png

I guess I must be missing something here. Don't the thick arrows denote actual physical connections? If so, in order to access system memory, wouldn't the dGPU still need to transfer over PCI-E, albeit not to copy? PCI-E is still the physical connection between the dGPU and the rest of the system, right?

colinp · Dec 2, 2014

The only way I see Mantle making it to Linux is in partnership with a game studio who want a Linux version of their game to be Mantle enabled. No point in having an API no one uses or software waiting for an API. Chicken and egg. Civ:BE would be an obvious one.

8350rocks · Dec 2, 2014

Reepca :

Negative, in your diagram it says "Unified Virtual Memory". That means that the GPU and CPU are sharing total memory in the system, so that the GPU does not have to actually copy anything from system memory, or the CPU would not have to copy from VRAM to do operations on the data stored there.

8350rocks · Dec 2, 2014

colinp :

This I agree with wholeheartedly, and I would LOVE for Civ: BE to be the first Mantle on Linux title...

etayorius · Dec 2, 2014

juanrga :

colinp :

As much as i love what MANTLE does, i see little point in making it to Linux... first, not many games are being developed for Linux even after Valve announced SteamOS, second after the announcement of OpenGLN makes it even less relevant on Linux.

The only way i see that it may be worth it, is if AMD quickly releases MANTLE (and v2.0) on Linux before DX12 and OpenGLN... but MANTLE just barely got out of BETA, and DX12 arrives on the second half of 2015.

In any case DX12 and OpenGLN were heavily based on MANTLE, i just wanted MANTLE to be in at least 10 games before this year ends, but we barely got 6 MANTLE supported games as of now:

PlantsVSZombies Garden Warfare
Thief
Sniper Elite 3
Battlefield 4
Dragon Age Inquisition
Civilization Beyond Earth

That`s it and it`s quite far from the 14+ titles AMD promised before this year ends, so they now change it to "20 games Released or in Development", with only Thief and Civilization BE really taking advantage of MANTLE performance wise.

MANTLE is great on Windows but on Linux is kinda pointless now after OpenGLN announcement, and i wish more games would start supporting Linux, but evidence states the contrary...

palladin9479 · Dec 2, 2014

8350rocks :

No they can't, not even CPU's can work "directly from memory" and the memory in question happens to be attached to them. Processors don't work directly on memory, they work on registers. First the data must be fetched from memory and stored into the register. The results of the calculation are also stored in registers and must be copied from that register into a location in memory. This process is known as load / store. In the case of HSA, the data must still be copied into the registers on the processors inside the dGPU, just like that data would need to be copied into the registers inside the iGPU or the SIMD / SSE / FPU co-processor. The component that handles this the prefetcher and it's often deeply integrated with the cache controller since it's always a good idea to have the data you need in local cache before you need it. Which brings us back to this bad idea of how HSA works, accessing data across any non-local BUS is abhorrently slow, we are talking hundreds of cycles instead of dozens, something on the order of 10~12x latency increase going from L2 cache to system RAM, would be 100x latency increase going from cache across a multipurpose interconnect bus like PCIe (or any replacement). This is the reality of non-local data storage and the entire reason Intel is so much faster at single thread performance.

So no HSA doesn't magically mean a non-native processor would be working with data directly in memory, it means that you don't need to translate the address or maintain memory semantics across different processors. This is important because the CPU doesn't need to babysit and spoon feed data to the co-processor, instead it pass's the pointer and the co-processors own memory unit manages the movement of the data into local cache and putting the results of the processing back into system memory. It's a similar concept to what DMA is. This does provide a very beneficial advantage when working in a hybrid processor environment. You not only have a CPU but also several co-processors that all speak their own language and specialize in different types of calculations. Normally in that scenario a process on the CPU would need to maintain memory semantics and manage the contents of all the co-processors memory for them, otherwise you would have chaos and anarchy with all sorts of negative unpredictable results. Now the co-processors can all play nice together and use the same system memory without needing a third party to micromanage them. For programmers it would allow them to embed code for those co-processors without having to micro-manage their memory state. You can just do the instructions and pass the pointer to the memory address you want for those instructions and allow the hardware to do it's job. Especially useful in the case of APU's because while they have fairly weak CPU's they do have gigantic high-performance vector processors bolted onto them that are used for pixel rasterization and visual effects processing. You could easily re-task those high-performance co-processors for doing physics math inside a video game without first needing to establish a session and all that administrative crap that bogs the whole thing down.

8350rocks · Dec 3, 2014

@palladin:

Yes, you still have load/store, but my point was, in hsa gpgpu situations you will not have to copy the entire set of information into VRAM in order for the gpu to do so. You would merely have the most basic, low level, information traveling across a bus, not duplicate sets of data occupying both system and VRAM.

Of course you will always have load/store. We are talking about x86...

Reepca · Dec 3, 2014

So if I understand correctly, in HSA GPdGPU scenarios, I could just pass a pointer to an array of x integers to the dGPU and it would somehow access that location in system memory without... using... the PCIE bus? This is the part I don't understand. Even if the dGPU can find stuff in system memory on its own, won't it still need to operate on it via PCIE? This is why I was wondering if the main change is just to the programming model (like Palladin said, CPU not having to babysit GPU), since regardless of how the GPU accesses system memory (by pointers or by duplicating data), it's still happening across the PCI-E bus.

cdrkf · Dec 3, 2014

esrever :

Reepca :

I think the point they were making is that although there is still some transfer over the bus, what is being moved is small and more managable.

I mean processors are pretty simple things at their most basic level. You have a calculation to do on your dGPU, what does the dGPU actually need for that particular calculation? Probably a few bytes of data- which isn't an issue to send over a bus like PCIe... the problem currently is that those few bytes of data are likley part of a great big chunk of data. Currently you have to copy the whole thing before working on it. With pointers the dGPU can access exactly what it needs from main memory (copying just that section into it's cache for processing), without needing it's own duplicate in VRAM.

Reepca · Dec 3, 2014

cdrkf :

esrever :

Reepca :

I think the point they were making is that although there is still some transfer over the bus, what is being moved is small and more managable.

I mean processors are pretty simple things at their most basic level. You have a calculation to do on your dGPU, what does the dGPU actually need for that particular calculation? Probably a few bytes of data- which isn't an issue to send over a bus like PCIe... the problem currently is that those few bytes of data are likley part of a great big chunk of data. Currently you have to copy the whole thing before working on it. With pointers the dGPU can access exactly what it needs from main memory (copying just that section into it's cache for processing), without needing it's own duplicate in VRAM.

Thank you, that makes sense :~).

blackkstar · Dec 3, 2014

Reepca :

Let me give you a good comp sci 101 example.

We know the PCIe bus has little bandwidth compared to what a GPU needs.
We currently have to ship entire working sets to the GPU to work on it, ship it back to the CPU to work on it, and back and forth.

Now considering these restrictions, and considering a large data set (lets say you need to sort a hundred million 32-bit integers, what sounds better to transfer over a slow bus? A single memory pointer to the dataset stored somewhere else or the big dataset?

Of course, I'm assuming GPU and CPU will have memory controllers that talk directly to memory instead of relying on a bus like PCIe or HT. And as Palladin stated, this is somewhat simplified. The CPU will do what it can to put things into registers first, the L1, then L2, then L3, then L4 caches (of course a CPU without L3 will stop at L3 and go to system memory after L2 cache). And IIRC GPU will have registers and caches it wants to use as well.

So HSA is not a magic solution. There will probably still be times when CPU is hogging something in cache or registers and GPU is stalled. But not having to transfer a huge working set over a slow bus is a huge advantage that only AMD is going to have. The point is that it is a huge advantage (even with limitations) that Intel and Nvidia won't have.

AMD CPU speculation... and expert conjecture

Honorable

Glorious

Distinguished

Honorable

Distinguished

Distinguished

Honorable

Distinguished

Illustrious

Honorable

Distinguished

Illustrious

Honorable

Distinguished

Honorable

Honorable

Distinguished

Distinguished

Honorable

Splendid

Distinguished

Honorable

Judicious

Honorable

Honorable

Share this page