News SK Hynix Samples 24GB HBM3 Modules: Up to 819 GB/s

sickbrains

Honorable
Wait, how large is a single module? Could you potentially replace 2GB GDDR6X dies with these HBM3 dies? That would really disrupt the market segmentation going on right now, with 24GB being reserved for halo products and more for workstations. Imagine a 6090 with 8 of these HBM3 stacks! Or more likely a 9900XTX, since AMD already used HBM2 in earlier GPUs.

Or is my understanding flawed, and these HBM3 modules have to actually be on the GPU die? Which would mean only 1, maybe 2, modules per GPU die?
 

bit_user

Polypheme
Ambassador
Or is my understanding flawed, and these HBM3 modules have to actually be on the GPU die? Which would mean only 1, maybe 2, modules per GPU die?
Yeah, they have to be in the same package as the GPU die. The article mentions the interface per stack is 1024 data bits, which it's only feasible to route & drive through an interposer. That compares with 32 bits per GDDR6 chip. However, it runs at a much lower frequency and you don't have as many stacks as you typically have GDDR6 chips. So, it's only like 3-5 times the bandwidth, rather than 32x.
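
For a rough sense of where that 3-5x figure comes from, here's a back-of-envelope sketch. The per-pin rates are illustrative assumptions (roughly 6.4 Gb/s for HBM3, which lines up with the 819 GB/s per stack in the article, and roughly 16 Gb/s for GDDR6), not the specs of any particular card:

Code:
# Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gb/s) / 8
def bandwidth_gb_s(width_bits, pin_rate_gb_s):
    return width_bits * pin_rate_gb_s / 8

hbm3_stack = bandwidth_gb_s(1024, 6.4)   # ~819 GB/s per stack
gddr6_chip = bandwidth_gb_s(32, 16.0)    # ~64 GB/s per chip

print(hbm3_stack / gddr6_chip)               # ~12.8x per device
# But a card pairs a few stacks against many chips, e.g. 2 stacks vs. 8 chips:
print((2 * hbm3_stack) / (8 * gddr6_chip))   # ~3.2x for the whole card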

Cost-wise, HBM3 is currently much more expensive than the same capacity of GDDR6.
 
With how things are going, I am seeing consumer versions get the GDDR6X treatment and professional GPUs just have the GDDR6X replaced with HBM3 for higher bandwidth and VRAM.

I don't expect to see HBM in the consumer space anytime in the near future...
 

bit_user

Polypheme
Ambassador
With how things are going, I am seeing consumer versions get the GDDR6X treatment and professional GPUs just have the GDDR6X replaced with HBM3 for higher bandwidth and VRAM.
Nothing about it is a drop-in replacement, though. The memory controllers are very different, between the two. It's just one of many things that differentiate the AI/HPC processors from their rendering-oriented GPU cousins.
 

InvalidError

Titan
Moderator
I don't expect to see HBM in the consumer space anytime in the near future...
Necessity will probably bring HBM or something HBM-like with fewer or narrower channels to the GPU and CPU consumer space within the next five years. It'll be the only practical way to meet bandwidth requirements without stupidly high external memory bus power and related PCB costs.

Nothing about it is a drop-in replacement, though. The memory controllers are very different, between the two. It's just one of many things that differentiate the AI/HPC processors from their rendering-oriented GPU cousins.
The HBM interface is very similar to DDR5, apart from having separate access to the RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32 bits, and no bus termination at either end, which I'd say makes HBM simpler overall.

The only genuinely problematic difference IMO is needing eight of those slightly modified DDR5 controllers per stack.
 
Necessity will probably bring HBM or something HBM-like with fewer or narrower channels to the GPU and CPU consumer space within the next five years. It'll be the only practical way to meet bandwidth requirements without stupidly high external memory bus power and related PCB costs.


The HBM interface is very similar to DDR5, apart from having separate access to the RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32 bits, and no bus termination at either end, which I'd say makes HBM simpler overall.

The only genuinely problematic difference IMO is needing eight of those slightly modified DDR5 controllers per stack.


I highly doubt that will be the case. If they do that, it would be for the halo models.

Take a look at the 3070 and the 4070, for example: they reduced the memory bus from 256-bit to 192-bit, and the throughput remained about the same at 448 and 505 GB/s by using GDDR6 vs. GDDR6X.
 

bit_user

Polypheme
Ambassador
The HBM interface is very similar to DDR5, apart from having separate access to the RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32 bits, and no bus termination at either end, which I'd say makes HBM simpler overall.
At some level, DRAM is DRAM. I get it. But, HBM3 has 32-bit sub-channels, which means you have 32 of those per stack, rather than the 2 that you get per DDR5 DIMM. So, that's a pretty big deal, and not something you can just gloss over.
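
To make that gap concrete, here's a quick sketch using the widths mentioned above (a standard DDR5 DIMM exposes a 64-bit data bus split into two 32-bit sub-channels):

Code:
hbm3_stack_width = 1024   # data bits per HBM3 stack
ddr5_dimm_width = 64      # data bits per DDR5 DIMM
subchannel_width = 32     # bits per sub-channel in both cases

print(hbm3_stack_width // subchannel_width)   # 32 sub-channels per stack
print(ddr5_dimm_width // subchannel_width)    # 2 sub-channels per DIMM

Viewed the other way, that same 1024 bits carved into 128-bit legacy-style channels is where the "eight controllers per stack" figure above comes from.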

Another key difference, it seems to me, is the I/O drivers you'd need for communicating with off-package DIMMs, versus short on-package links. As you point out, the power requirements are very different.

Finally, since the interface of HBM3 runs at a much lower frequency, perhaps certain aspects of the memory controller could be simplified and made more efficient.
 

bit_user

Polypheme
Ambassador
Take a look at the 3070 and the 4070, for example: they reduced the memory bus from 256-bit to 192-bit, and the throughput remained about the same at 448 and 505 GB/s by using GDDR6 vs. GDDR6X.
They also increased the amount of L2 cache by about 10x. We saw how much Infinity Cache helped RDNA2, so it's a similar idea.

What's interesting is that you seem to assume the RTX 3070 needed that entire 448 GB/s of bandwidth. We don't know that, however. GPUs sometimes have more memory channels just as a way to reach a certain memory capacity. The memory chips they used for the RTX 4070 each have twice the capacity. If they'd kept the width at 256 bits, then it would've either stayed at 8 GB or gone all the way up to 16 GB. And 16 GB would've made it more expensive.
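
To put numbers on that, here's a rough sketch. The pin rates are the commonly cited 14 Gb/s (GDDR6 on the 3070) and 21 Gb/s (GDDR6X on the 4070), and the densities are 1GB vs. 2GB per 32-bit chip; treat them as approximations rather than exact specs:

Code:
def bandwidth_gb_s(width_bits, pin_rate_gb_s):
    return width_bits * pin_rate_gb_s / 8

print(bandwidth_gb_s(256, 14))   # ~448 GB/s (256-bit GDDR6)
print(bandwidth_gb_s(192, 21))   # ~504 GB/s (192-bit GDDR6X)

# Capacity = (bus width / 32 bits per chip) * chip density
print(256 // 32 * 1)   # 8 GB  with 1GB chips on a 256-bit bus
print(256 // 32 * 2)   # 16 GB with 2GB chips on a 256-bit bus
print(192 // 32 * 2)   # 12 GB with 2GB chips on a 192-bit bus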
 

InvalidError

Titan
Moderator
I highly doubt that will be the case. If they do that, it would be for the halo models.
HBM is still fundamentally the same technology as any other DRAM. The main reason it is more expensive is its relatively low production volume. All that would be necessary to bring the price down is for GPU and DRAM manufacturers to coordinate a hard switch.

It isn't much different from how next-gen DDRx is prohibitively expensive for the first 2-3 years after initial launch; then prices start dropping as more of the mainstream commits to the next-gen stuff, until it becomes the obvious new leader on price-performance.

At some level, DRAM is DRAM. I get it. But, HBM3 has 32-bit sub-channels, which means you have 32 of those per stack, rather than the 2 that you get per DDR5 DIMM. So, that's a pretty big deal, and not something you can just gloss over.
Nothing forces you to deploy independent memory controllers all the way down to the finest sub-banking option. You can operate each chip in the stack as a single 128-bit-wide channel too, and you can use stacks with fewer than eight chips if you don't need the largest capacity configuration.

The most logical thing to do in the future is to buy raw HBM DRAM stacks and supply your own base dies to adapt the stack whichever way you need. Then you can mux a full stack through a single 128-bit interface if all you want is the 8-32GB single-stack capacity. Doubly so if your design already splits its memory controllers out into chiplets/tiles like AMD is doing with the RX 7800-7900: you get your custom base die for the cost of the TSVs to attach the raw HBM stack to extra silicon that is already designed in.

With more advanced chiplet/tile-based designs featuring active interposers, which will likely be commonplace three years from now, the HBM base die functions could also be baked directly into the active interposers themselves.
 

bit_user

Polypheme
Ambassador
You can operate each chip in the stack as a single 128-bit-wide channel too, and you can use stacks with fewer than eight chips if you don't need the largest capacity configuration.
I'm not convinced that's how it works. In this article, they talk about a 12-high stack with a 1024-bit interface.

I'm talking about actual HBM3, not simply what's plausible to do with stacked DRAM. Sure, you can dream up lots of plausible options, but once you start cutting the width of the HBM stack's interface, it comes directly at the expense of its bandwidth. Increasing clocks will hit efficiency. So, it's better to go slow-and-wide. That's why they do it that way.
 

InvalidError

Titan
Moderator
I'm not convinced that's how it works. In this article, they talk about a 12-high stack with a 1024-bit interface.
The standard HBM interface is 128 bits per channel, intended as one or two channels per die in the stack. What the manufacturer probably did there to get to a 12-high stack while maintaining a 1024-bit interface is design its DRAM dies so a pair of single-ported 2GB dies can share a port with the two halves of a dual-ported 2GB die, with the triplet operating as a pair of 3GB dies.

I'm talking about actual HBM3, not simply what's plausible to do with stacked DRAM. Sure, you can dream up lots of plausible options, but once you start cutting the width of the HBM stack's interface, it comes directly at the expense of its bandwidth.
When you want to cut costs and drive adoption, sometimes sacrifices must be made. Not every application needs 1TB/s of bandwidth per stack. On-package memory for a relatively high-performance APU or lower-end GPU would be perfectly fine at 250-300 GB/s.
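
As an illustration of how a cut-down configuration could land in that range, here's a sketch with hypothetical widths and pin rates (chosen only to show the scaling, not taken from any product or spec):

Code:
def bandwidth_gb_s(width_bits, pin_rate_gb_s):
    return width_bits * pin_rate_gb_s / 8

print(bandwidth_gb_s(1024, 6.4))   # ~819 GB/s: full-width stack from the article
print(bandwidth_gb_s(512, 4.0))    # ~256 GB/s: half width at a relaxed pin rate
print(bandwidth_gb_s(512, 4.8))    # ~307 GB/s: still in the 250-300 GB/s ballpark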

With how pervasive modular chip designs are likely to be 3+ years down the road, everyone slapping raw HBM(-like) stacks on top of their own base die seems more than just plausible. My timetable may be off by a few years, but it is the inevitable logical conclusion that will ultimately get forced by necessity if it doesn't happen on any other merit first.
 