AMD's HBM Promises Performance Unstifled By Power Constraints


InvalidError

Titan
Moderator

Well, this article from March 2014 says Nvidia will start using HBM in 2016... so it has apparently been known for over a year that Nvidia was going HBM.
http://www.anandtech.com/show/7900/nvidia-updates-gpu-roadmap-unveils-pascal-architecture-for-2016

HMC might not be HBM, but they are fundamentally identical in most ways that matter to GPUs. GPUs primarily depend on bandwidth, and HBM is about on par with HMC there while using cheaper DRAM and less power. For latency and access concurrency, the GPU schedulers have been doing a fine job coping with GDDR5's limitations, and HBM will provide about 8X better concurrency on top of 3X as much bandwidth, so there should be no issue there any time soon. This might not be as good as the ~64X improvement with HMC, but there is no point in overkilling what is already overkill.
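
To put rough numbers on that comparison, here is a back-of-envelope sketch; the per-pin rates and channel counts are illustrative first-gen figures I'm assuming, not anything from the article:

```python
def peak_bw_gbps(bus_width_bits, pin_rate_gbps):
    """Peak bandwidth in GB/s = bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_width_bits * pin_rate_gbps / 8

gddr5_bw = peak_bw_gbps(512, 5.0)        # e.g. a 512-bit GDDR5 card at 5 Gb/s/pin
hbm_bw   = peak_bw_gbps(4 * 1024, 1.0)   # 4 HBM1 stacks, 1024 bits each, ~1 Gb/s/pin

gddr5_channels = 512 // 32               # GDDR5 presents 32-bit channels
hbm_channels   = 4 * 8                   # 8 independent 128-bit channels per stack

print(f"GDDR5: {gddr5_bw:.0f} GB/s over {gddr5_channels} channels")
print(f"HBM1:  {hbm_bw:.0f} GB/s over {hbm_channels} channels")
# Bank-level access in HMC multiplies the stream count again, which is
# where the much larger concurrency figures come from.
```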

Where HMC-style memory should shine is in applications with non-linear memory access patterns, like servers, routers, workstations compiling code, etc. I can imagine a raw HMC memory die stack (no stack controller) integrating nicely as a semi-programmable hardware resource in FPGAs.
 

gamersglory

Honorable
Nov 2, 2012
15
0
10,510
How about replacing system RAM with HBM as well, not just on graphics cards? Wouldn't that make the overall system much, much faster and less power hungry?

Just imagine the system spec:

RAM: 8 GB HBM
GFX: 8 GB HBM

:p

It would be great as a DDR replacement in the long term as well.
 

kcarbotte

Contributing Writer
Editor
Mar 24, 2015
1,995
2
11,785
How about replacing system RAM with HBM as well, not just on graphics cards? Wouldn't that make the overall system much, much faster and less power hungry?

Just imagine the system spec:

RAM: 8 GB HBM
GFX: 8 GB HBM

:p

One step at a time. There are only so many things that can be added to the interposer at this time.
The technology will allow for more integration in the future. There was no mention of a roadmap for system memory, other than a generic comment that there are many verticals this technology will benefit in the future.

I would expect to see APUs adopt this platform before we see regular system memory for standard CPUs done this way; however, this is likely not something we'll see until gen-2 HBM.
 

norseman4

Honorable
Mar 8, 2012
437
0
10,960
One thing about AMD: we really don't know what's coming, only that it should be announced during or a bit before E3. First the 300 series was a node shrink; now it's still 28nm. It was a rebrand for all but the top tiers, then it was an entire range, and now it's back to the top tier again. HBM was going to be on all tiers (hence the all-new GPUs), then it was only on the R9-390 and R9-390X, and the last leaked slide I saw had HBM only on the 390X, though both 390s were going to be Fiji.

I'm hoping for a killer, and that as soon as the NDAs are lifted, AMD provides some great Adaptive-Sync (FreeSync) enabled cards that are better in multiple ways than the 200 series. (And that the 390s cause people's jaws to drop, and that the 300 series in general will be affordable.)
 

chrissy4605

Honorable
Apr 19, 2013
13
0
10,510
I am looking forward to using the new architecture, but I will wait a while as it matures to more than 4GB. Then all bets are off. I suspect the product should be mature in about three years.
 

chrissy4605

Honorable
Apr 19, 2013
13
0
10,510
How about replacing system RAM with HBM as well, not just on graphics cards? Wouldn't that make the overall system much, much faster and less power hungry?

Just imagine the system spec:

RAM: 8 GB HBM
GFX: 8 GB HBM

:p

That's what AMD is trying to do with HSA. Once they put HBM on an APU die, it'll pretty much be exactly what you said :)

For your education, AMD is not solely a Chinese company. Their chips are made in countries like Taiwan, Malaysia, Ireland, and the USA, and the design work goes on in Korea and Japan as well as the good old USA!
 

j0ndafr3ak

Distinguished
Feb 11, 2012
409
1
18,815
That's where HSA is trying to get to, I guess
 


AMD is an American company based in Sunnyvale, CA, which is where all the R&D happens. They used to have their own fabs, one in California and one in Dresden, Germany, where their chips were produced, but they sold their fabs off; the manufacturing arm was spun out and became GlobalFoundries, leaving AMD "fab-lite."

They had a contract with GlobalFoundries to utilize them; however, they are able to look to other companies such as Samsung or TSMC to fab their products if they want.
 

Giroro

Splendid
HBM seems to be all about the speed increases right now (and I'll be surprised if they use the technology with anything but their highest-end next-gen GPU).

Where I'm really excited, though, is what the lower clock rate and simpler package will mean for bringing high performance to laptops and servers in a year or two, when they aren't constrained by the 4GB limit. Putting RAM in the same package as an APU will mean very good things for heat and power efficiency. Plus, the APU will need far fewer pins out, which means a smaller package and a simpler mobo overall (which is most exciting).

Imagine a high-performance dual CPU/GPU setup easily fitting within a 13-inch "we can't legally call these ultrabooks" form factor. If anything, enthusiast-level desktop graphics is the least interesting thing they can do with HBM; it's just a stepping stone and proof of concept until they can get to their real goal.
 

InvalidError

Titan
Moderator

HBM will likely get used across the whole mid-to-high-end range of GPUs starting with the 14/16nm parts next year, since it eliminates the expensive 8+ layer PCBs necessary to run a 256-512 bit wide GDDR5 bus from the GPU BGA to the GDDR5 chips.

4GB is not really a limiting factor, since it is entirely possible to have a heterogeneous memory architecture with HBM used primarily for the IGP and DDR3/4 for the CPU. Since HBM introduces an extra intermediate physical and logic layer between the CPU and DRAM, it likely has higher latency than DDR4, and that may translate into worse performance for latency-sensitive, CPU-intensive algorithms - any code that relies heavily on conditional branching: compiling code, parsing text, building decision trees, sparse arrays, etc. That's why there is HMC specifically for CPUs: it moves part of the DRAM addressing logic from the DRAM dies to the stack controller to eliminate the associated latency and redundant circuitry while improving access concurrency. Even then, you will still have external DDR4 for systems that require more than whatever memory is built on-package - people who need over 32GB of RAM are not going to disappear, and I doubt AMD/Intel will produce chips with 32GB or more of HBM/HMC on-package any time soon.
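
A toy sketch of that heterogeneous split, assuming a hypothetical 4GB HBM pool beside a 32GB DDR4 pool; the Pool class and placement rule are invented for illustration, since real placement would be handled by the OS/driver rather than application code:

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def alloc(self, size_gb: float) -> bool:
        """Reserve size_gb if it fits, else report failure."""
        if self.used_gb + size_gb > self.capacity_gb:
            return False
        self.used_gb += size_gb
        return True

# Hypothetical pools: small, high-bandwidth HBM beside plain DDR4.
hbm = Pool("HBM", capacity_gb=4)
ddr = Pool("DDR4", capacity_gb=32)

def place(size_gb: float, pattern: str) -> str:
    """Streaming buffers prefer HBM; latency-sensitive data prefers DDR."""
    order = (hbm, ddr) if pattern == "streaming" else (ddr, hbm)
    for pool in order:                 # fall back if the preferred pool is full
        if pool.alloc(size_gb):
            return pool.name
    raise MemoryError("both pools exhausted")

print(place(2.0, "streaming"))        # textures/framebuffers -> HBM
print(place(1.0, "pointer-chase"))    # decision trees, sparse arrays -> DDR4
```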

GPUs that rely entirely on built-in memory will likely become a thing next year. For CPUs though, I think DDR4 or future replacements will be around for the foreseeable future even with HMC, albeit as a secondary RAM pool to store extended working sets and disk caches - an intermediate storage space between primary working memory and the swap file.
 

Giroro

Splendid

Of course they will use it anywhere it can possibly cut costs in the future, but I'm talking about this year, in terms of using it on only one high-end card. Frankly, I think AMD will be totally hosed if they don't get this to market within months, because they will have lost their competitive advantage by 2016.

Maybe I'm misinterpreting where you think redundant memory-controlling logic will come into play, but I see no reason for that to happen if you're using HBM only within a system. HBM shouldn't sit between the CPU and DRAM; it eliminates the separate DRAM entirely. That is why I find the concept interesting - it allows you to bake all the memory into the APU package. A DDR/HBM hybrid system would be more complicated than current computers, meaning you lose all the advantages in efficiency and mobo size. That could be fine for the desktop workstation market, where size and efficiency are not issues, but I think using this to build overcomplicated workstations is pretty far from AMD's goals, or at the very least uninspired.

Other than that, and assuming HBM never advances beyond the limit of 4 stacks per processor, that could still mean 8 stacks per APU, with 4 stacks on the CPU and 4 on the GPU.
The way I envision this going, each APU is basically a full-powered system on a chip, minus the bulky interfaces that have to live on the motherboard and the things you would never need more than one of, like a sound card.

My question is, with the memory baked into the APU, what stops you from just adding more APUs when you need more RAM or more GPU power? One of the biggest challenges in multiprocessor systems is the memory interface and the sheer number of pins each processor needs to interface with said memory; all those interfaces need a lot of board space, after all. You also need pins for things like PCIe, but that need is lessened considerably if you don't need the bandwidth to talk to a separate GPU. Regardless, I believe that even with a lot of pins, an APU module wouldn't need to be bigger than what is used for DDR.

So, let's assume your 32GB limit and that you're building a desktop workstation. What can you do with a single computer with 128GB of memory that couldn't be done equally well with what could essentially be four or more computers, each with 32GB of memory, within the same general form factor? I don't think the concept I'm laying out is much different from what is used in supercomputers and data centers - it's just a question of whether you can fit many small, inexpensive computing units on a single board with a shared power supply/USB controller/whatever.

Granted, the actual engineering to make all those modules/resources play together nicely, like one big easy-to-use bucket, may be a big challenge. It might even require a dedicated, centralized processor just to organize things between all the CPUs (I know what I just wrote sounds silly). I do know AMD and DirectX 12 are already making progress toward using mixed GPU resources. We aren't at the point where this is possible yet, but AMD may be focusing on long-term success here. I at least hope the delays to their next GPU are more "we are going to get this new concept perfect" and less "this could never possibly be done, quickly come up with a plan B to ship."
 

InvalidError

Titan
Moderator

There is a fundamental misunderstanding of what HBM is right there: an HBM stack is 4-8 DRAM dies on top of the HBM interface chip, which connects to the GPU/APU through the silicon interposer. It does not "get rid" of DRAM; HBM and HMC are just different ways of partitioning and interfacing with DRAM to work around the fact that the DRAM manufacturing process simply sucks at high-speed signaling and logic. HBM DRAMs are basically standard DDR DRAM with a very wide data bus (128 bits per macro, up to two macros per die, instead of the usual 8-16 bits per die), while HMC DRAM exposes the control and data lines of the individual memory banks within each DRAM die, eliminating the traditional DRAM access multiplexing logic. Both still rely on DRAM at their core; it is just wired differently, with HMC enabling about 8X as much access-level concurrency as HBM does.
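
The stack arithmetic that description implies, treating the figures as approximations of first-gen HBM rather than spec text:

```python
bits_per_macro = 128     # one HBM channel
macros_per_die = 2       # "up to two macros per die"
dies_per_stack = 4       # stacks are 4-8 DRAM dies; take the low end

channels = macros_per_die * dies_per_stack      # 8 independent channels
stack_bus_bits = channels * bits_per_macro      # 1024 data bits per stack

print(f"{channels} channels, {stack_bus_bits}-bit interface per stack")
# Compare: a GDDR5 die exposes only 16-32 data bits, so a 512-bit bus
# needs 16 chips and hundreds of board-level traces instead of one stack.
```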

As for increasing memory by adding GPUs/CPUs/APUs: how would you route 200+ GB/s between sockets? You would end up with socket pinouts carrying a crapload of power-guzzling high-speed socket-to-socket interconnects and an uber-expensive 12-16 layer, $1,000+ server-like motherboard if you wanted to maintain memory access performance across sockets, not counting all the added complexity of implementing full ccNUMA support in your CPU/GPU/APU. Managing a heterogeneous memory pool from a single host on a 4-6 layer motherboard is far simpler and cheaper.

The other problem with using HBM as the only type of system memory and slapping in extra CPUs/GPUs/APUs to increase memory size is: how much will you end up paying per extra GB of RAM when all you need or want is extra RAM? Likely more than twice what you would pay for plain DDR4-3200, since high-capacity HBM/HMC will likely only be available on higher-end CPUs/GPUs/APUs. Instead of paying $150 for 16GB, you might end up paying $350 for a hypothetical 16GB i5-8670 or R9-670X. Having to buy a whole CPU/GPU/APU to upgrade RAM also means increasing power draw by 10-20W more than adding RAM alone, not counting the extra power for ccNUMA and the links between sockets when the memory is actually being used.
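
The cost argument as arithmetic, using the post's own hypothetical prices (there is no real 16GB i5-8670 or R9-670X):

```python
ddr4_price, ddr4_gb = 150, 16       # plain DDR4-3200 kit
part_price, part_gb = 350, 16       # hypothetical CPU/APU with 16GB on-package

print(f"DDR4 DIMMs:  ${ddr4_price / ddr4_gb:.2f}/GB")   # $9.38/GB
print(f"On-package:  ${part_price / part_gb:.2f}/GB")   # $21.88/GB
# Plus the whole CPU/GPU/APU you had to buy, and 10-20W of extra draw,
# just to get the additional RAM.
```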

My "32GB limit" is just for DDR3 on dual-channel systems. For quad-channel DDR4 systems, that goes up to 128GB. For server motherboards with LR-DIMM risers, which is what you may find in large servers and memory-intensive HPC systems, it may go over 1TB per socket. Performance-wise, splitting memory across multiple sockets with all the cache-coherence and latency that come with it is a great way to kill performance in memory-intensive tasks. This is one of those things you want to avoid doing unless you have no other choice, such as TB-scale memory-resident databases.

Long story short, I do not think HBM or HMC is going to replace DIMMs in desktop PCs and servers any time soon, due to the very wide requirement spread. They probably will nearly everywhere else, though.
 

toadhammer

Distinguished
Nov 2, 2012
118
3
18,685

It's a fragmentation issue. If you need to access lots of textures, shaders, etc., then memory latency becomes an important factor, because every access has to wait for the data to arrive. Although there are tricks (caches) to improve this, look at the trend in current graphics cards: high-speed memory (GHz) with relatively narrow data paths. Doubling to 256 or 512 bit wide memory buses is probably something board makers can do, likely raising the cost to more than double, but not prohibitively.
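
A quick average-memory-access-time (AMAT) model of that "caches hide latency" point; the latencies and hit rates are illustrative numbers, not measurements of any real card:

```python
def amat_ns(hit_rate, cache_ns, dram_ns):
    """Average access time: hits served by the cache, misses pay DRAM latency."""
    return hit_rate * cache_ns + (1 - hit_rate) * dram_ns

print(amat_ns(0.95, 2, 100))   # cache-friendly access  ->  6.9 ns average
print(amat_ns(0.50, 2, 100))   # fragmented access      -> 51.0 ns average
```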

And they can play games if they need a higher effective clock. If each stack is on its own quarter-cycle, you get an effective 4x clock rate.
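
A toy model of that quarter-cycle trick; the base clock and stack count are made-up numbers, and nothing in the article says any real controller works this way:

```python
base_clock_mhz = 500
stacks = 4

for phase in range(stacks):
    # each stack's transfers are offset by a quarter of the base period
    print(f"stack {phase} fires at t = {phase}/{stacks} of each cycle")

print(f"effective rate: {base_clock_mhz * stacks} MHz "
      f"from a {base_clock_mhz} MHz base clock")
```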
 

Jay Hopkins

Honorable
Sep 26, 2013
13
0
10,520
This seems very interesting and also promising. Wouldn't this design be ideal for regular memory too, though? If this is such an improvement, can we look forward to HBM DIMMs on our next-generation motherboards?
 

InvalidError

Titan
Moderator

The bus to each HBM stack is 1024 data bits wide (eight independent 128-bit channels). Standard DIMMs are 64 bits wide. Imagine what a 1024-bit-wide DIMM might look like; it would be wider than typical ATX motherboards are.

Even as a discrete chip soldered on a PCB, routing 1,000+ signals from the ~100sqmm HBM chip stack to the CPU/GPU would be nearly impossible due to the extreme signal density, precision, and number of PCB layers required.

Wide chip-to-chip interfaces like HBM are only practical on a silicon interposer or as direct stacking.
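
The width problem as rough arithmetic; the contact pitch and command/address overhead are assumed figures, just for scale:

```python
dimm_data_bits = 64
hbm_stack_bits = 1024
print(f"{hbm_stack_bits // dimm_data_bits}x the data pins of a DIMM")   # 16x

pitch_mm = 1.0                       # assumed usable pitch per edge contact
contacts = hbm_stack_bits + 400      # data plus assumed command/address/power
print(f"~{contacts * pitch_mm / 2:.0f} mm connector (two-sided)")       # ~712 mm
# An ATX board is ~305 mm wide, so the connector alone would overhang it.
```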
 
@notsleep

System RAM and graphics RAM are optimized for different tasks. The end of the first response here lays it out fairly well, IMHO: http://www.techspot.com/community/topics/whats-the-difference-between-ddr3-memory-and-gddr5-memory.186408/
 

cutekitsune

Honorable
Apr 14, 2013
13
0
10,510
If this is true, then AMD did it once again... they always seem to make something new. First 64-bit, then dual-core, then quad-core, way better APUs, now this. And who will benefit from this in the end? Yeah, Intel, as always.
 

InvalidError

Titan
Moderator

If you count Crystalwell's eDRAM as an HMC field test, then Intel is roughly two years ahead of AMD at putting DRAM on the CPU package.
 


And technically, Intel had both the dual-core and the quad-core first.
 