[SOLVED] What happened to HBM memory?

InvalidError

Titan
Moderator
HBM is still being used in niche hardware like datacenter GPGPUs, network switching gear, fabric switches, special CPUs, etc.

On the consumer side of things, AMD and Nvidia have decided to increase GPU buffer and cache sizes to stretch the viability of GDDR6(X) some more and shave some costs.
 

InvalidError

Titan
Moderator
Most games aren't limited by memory bandwidth, so it didn't make financial sense to use HBM over GDDR6X.
They would be, without the buffed-up buffers and caches. It just turns out that increasing on-chip SRAM, which is itself ludicrously expensive, to get by on GDDR6(X) is still cheaper overall than HBM, and good enough in the consumer space, at least for now.

That said, the main reason HBM costs more is low volume. If there were mass adoption of HBM2/3, it would be only marginally more expensive than GDDR6.
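
A rough back-of-the-envelope on why bigger caches stretch GDDR6(X): if some fraction of requests hit the on-die cache, only the misses ever touch DRAM, so the sustainable request rate scales up accordingly. The hit rates and the 512 GB/s raw figure below are made-up assumptions, purely to show the shape of the math:

    # Sketch: effective bandwidth amplification from a large on-die cache.
    # All figures are assumptions for illustration, not any particular card.
    def effective_bandwidth(raw_dram_bw_gbps: float, hit_rate: float) -> float:
        """Request bandwidth sustainable when only cache misses reach DRAM."""
        return raw_dram_bw_gbps / (1.0 - hit_rate)

    raw_bw = 512.0  # assumed raw GDDR6(X) bandwidth in GB/s
    for hit_rate in (0.0, 0.3, 0.5):
        print(f"hit rate {hit_rate:.0%}: ~{effective_bandwidth(raw_bw, hit_rate):.0f} GB/s effective")
    # 0% -> 512 GB/s, 30% -> ~731 GB/s, 50% -> ~1024 GB/s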
 
Solution
I may not be able to confirm this, but one of the main problems I see with HBM is that it decreases the tolerance for defects. Once you bond an IC onto the package substrate, it's probably really hard to rework it. And HBM-based systems have more points of failure per substrate than a single die on a substrate, and every one of those points has to pass for the package to work.
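
Rough sketch of the "every point has to pass" part: the yield of the finished package is roughly the product of the yields of every die and every bond on the substrate, so each extra HBM stack multiplies in another chance to scrap the whole thing. All yield figures below are invented for illustration:

    # Sketch: package yield as the product of per-component yields.
    # Every yield figure here is invented for illustration.
    def package_yield(component_yields, bond_yield=0.99):
        """The package works only if every die and every bond is good."""
        y = 1.0
        for cy in component_yields:
            y *= cy * bond_yield
        return y

    gpu_only = package_yield([0.90])                   # one big die on a substrate
    gpu_plus_hbm = package_yield([0.90] + [0.95] * 4)  # GPU die plus 4 HBM stacks
    print(f"single die: {gpu_only:.1%}, GPU + 4 HBM stacks: {gpu_plus_hbm:.1%}")
    # the multi-die package always yields worse than its worst component alone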

Also if the Semiengineering article posted before is correct, only Samsung and SK Hynix make HBM stacks. Unless you can get Samsung to make everything and produce the final package, you have to buy HBM stacks separately, find someone to assemble the final package, and hope that they have a good track record for doing so.

So basically, I think HBM is simply too complex to bring to scale and remain profitable.
 

InvalidError

Titan
Moderator
Once you bond an IC onto the package substrate, it's probably really hard to rework it. And HBM-based systems have more points of failure per substrate than a single die on a substrate, and every one of those points has to pass for the package to work.
Dies are usually tested while they are still on the wafer to avoid wasting time cutting and handling defective dies, so there shouldn't be many defective dies making it into HBM stacks, and the stacks themselves get tested again before shipping to the customer.

On the points-of-failure side of things, I think AMD's Zen 2/3 Ryzen/TR/EPYC lineups have proven that having thousands of signals between the CCDs and the IOD is not a major yield or reliability issue, and Intel's crazy 47-tile monster (Ponte Vecchio) shows that Intel is fairly confident in its fancy packaging abilities.
 
Dies are usually tested while they are still on the wafer to avoid wasting time cutting and handling defective dies, so there shouldn't be many defective dies making it into HBM stacks, and the stacks themselves get tested again before shipping to the customer.
The dies may be defect-free when they arrive for final assembly, but something can still go wrong during that assembly.

On the points-of-failure side of things, I think AMD's Zen 2/3 Ryzen/TR/EPYC lineups have proven that having thousands of signals between the CCDs and the IOD is not a major yield or reliability issue, and Intel's crazy 47-tile monster (Ponte Vecchio) shows that Intel is fairly confident in its fancy packaging abilities.
While you do bring up a fair point about MCM manufacturing being sufficient for Zen 2/3, I don't think there are "thousands of signals". Most of the diagrams I've seen say IF is a 32-bit-wide bus in each direction, so 128 lines total per CPU die (64 for data, 64 for ground). So at most that's 1,024 lines (half of which are ground) for an 8-die EPYC. The other thing is that the package itself, from what I can tell, is no different from a bog-standard PCB, whereas the interposer for HBM-based devices is typically silicon.
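
For what it's worth, that count written out, restating the post's own assumptions (32-bit bus per direction, one ground line per data line), looks like this:

    # Sketch of the line-count arithmetic above; these are the post's
    # assumptions, not official AMD specifications.
    bus_width_bits = 32                  # IF bus width per direction
    data_lines = bus_width_bits * 2      # read + write directions = 64 data lines
    lines_per_die = data_lines * 2       # one ground per data line = 128 lines per CCD
    epyc_dies = 8
    print(lines_per_die * epyc_dies)     # 1024 lines total, half of them ground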

I was also going to comment on the cache thing in GPUs. I would argue the primary reason for increasing cache is that NVIDIA, and later AMD, went to a tiled rasterization rendering scheme. This has the benefit of not requiring as much bandwidth between the GPU and VRAM, since only 32x32 tiles are being passed around, and a sufficiently large cache lets you hide memory latency.
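
To put some illustrative numbers on that: a 32x32 tile is tiny next to a full framebuffer, so binning and shading tile-by-tile out of on-chip SRAM saves a lot of round trips to VRAM. The pixel format and resolution below are assumptions picked just for the comparison:

    # Sketch: on-chip footprint of one 32x32 tile vs. a full 4K framebuffer.
    # Pixel format and resolution are assumptions for illustration.
    bytes_per_pixel = 4                                 # RGBA8
    tile_bytes = 32 * 32 * bytes_per_pixel              # 4 KiB, trivially cacheable
    framebuffer_bytes = 3840 * 2160 * bytes_per_pixel   # ~32 MiB at 4K
    print(f"tile: {tile_bytes // 1024} KiB, framebuffer: {framebuffer_bytes / 2**20:.0f} MiB")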
 

InvalidError

Titan
Moderator
Most of the diagrams I've seen say IF is a 32-bit-wide bus in each direction
There isn't much information out there besides the 32 B/fclk read and 16 B/fclk write figures on the marketing slides. Something is clearly not symmetrical there.

Well, CPUs are going to be the small fry of fancy packaging once chiplet GPUs become a thing. Those will need far more than 64 GB/s to spare developers the same headaches they had with the various SLI/CF modes.
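
Rough numbers on that, with the fclk as an assumption (1600 MHz, i.e. running 1:1 with DDR4-3200):

    # Sketch: per-CCD Infinity Fabric bandwidth from the 32 B/fclk read and
    # 16 B/fclk write figures quoted above; the fclk value is an assumption.
    fclk_hz = 1.6e9                       # assumed 1600 MHz fabric clock
    read_bw_gbps = 32 * fclk_hz / 1e9     # ~51.2 GB/s read per CCD
    write_bw_gbps = 16 * fclk_hz / 1e9    # ~25.6 GB/s write per CCD
    print(read_bw_gbps, write_bw_gbps)    # a chiplet GPU link would need several times this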