News Micron's New HBM3 Gen2 is World's Fastest at 1.2 TB/s, Teases Next-Gen at 2 TB/s+

bit_user

Titan
Ambassador
The HBM3 Gen 2 memory has the same channel arrangement as HBM3, with 16 channels and 32 virtual channels (two pseudo-channels per channel).
So, does anyone know what this actually means? Is it like DDR5, where you have 2x 32-bit channels per DIMM, or is it more like you have two virtual channels interleaved over each real 64-bit channel?

Also, is HBM typically interleaved? If so, at what granularity?
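For what it's worth, here's how I'd picture it (the pseudo-channel split is from the HBM3 spec as I understand it; the 256-byte interleave granularity is purely my own assumption for illustration):

```python
# Toy address-routing sketch: 16 channels, each 64-bit channel split into
# two 32-bit pseudo-channels. The interleave granularity is hypothetical.
CHANNELS = 16
PSEUDO_PER_CHANNEL = 2
INTERLEAVE_BYTES = 256  # assumption, not from the article

def route(addr):
    """Map a physical address to (channel, pseudo_channel)."""
    block = addr // INTERLEAVE_BYTES
    channel = block % CHANNELS
    pseudo = (block // CHANNELS) % PSEUDO_PER_CHANNEL
    return channel, pseudo

print(route(0))     # (0, 0)
print(route(256))   # (1, 0)  -> next block lands on the next channel
print(route(4096))  # (0, 1)  -> wraps to the other pseudo-channel
```

Real controllers can of course hash or swizzle the bits differently; this just shows the "spread consecutive blocks across channels first" idea.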
 

Diogene7

Prominent
Jan 7, 2023
Micron announced that its HBM3 Gen 2 memory, the fastest, densest, and most power-efficient yet, is now sampling to its partners.

Micron's New HBM3 Gen2 is World's Fastest at 1.2 TB/s, Teases Next-Gen at 2 TB/s+ : Read more


I am wondering how much more energy-efficient per bit HBM3 is compared to LPDDR5? And also how much more expensive it is at iso-capacity (e.g. in July 2023, say 32GB of LPDDR5-6400 costs $100 to $150 (fictitious numbers); how much would 32GB of HBM3 be?)

Why don’t we have such technology in smartphones and laptops? One or two 32GB stacks would provide a lot of memory and bandwidth to the CPU…

Also, a disruptive innovation/improvement would be a 32GB-or-more low-power non-volatile (VCMA) MRAM HBM stack: in theory, it could replace both LPDDR DRAM memory and NAND flash storage with one very fast non-volatile memory device. Most software actions would then likely feel near-instantaneous (little or no loading time), with nearly no time needed to boot the system (« instant-on »).
 

bit_user

Titan
Ambassador
I am wondering how much more energy-efficient per bit HBM3 is compared to LPDDR5? And also how much more expensive it is at iso-capacity (e.g. in July 2023, say 32GB of LPDDR5-6400 costs $100 to $150 (fictitious numbers); how much would 32GB of HBM3 be?)

Why don’t we have such technology in smartphones and laptops? One or two 32GB stacks would provide a lot of memory and bandwidth to the CPU…
Nvidia addressed some of these questions in their justification for choosing LPDDR5X for use with their Grace CPU:

Compared to an eight-channel DDR5 design, the NVIDIA Grace CPU LPDDR5X memory subsystem provides up to 53% more bandwidth at one-eighth the power per gigabyte per second while being similar in cost. An HBM2e memory subsystem would have provided substantial memory bandwidth and good energy efficiency but at more than 3x the cost-per-gigabyte and only one-eighth the maximum capacity available with LPDDR5X.

Source: https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/

Note that I think they're comparing on-package LPDDR5X-7400 against DDR5-4800 DIMMs, which is how they get their 53% figure and such a large energy savings.
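A quick back-of-the-envelope check of that figure (my assumptions, not Nvidia's: a 512-bit on-package LPDDR5X interface versus 8 × 64-bit DDR5 channels):

```python
# Peak bandwidth = transfers/s * bytes per transfer.
def bandwidth_gbs(mts, bus_bits):
    """Peak bandwidth in GB/s for a given transfer rate (MT/s) and bus width."""
    return mts * (bus_bits / 8) / 1000

ddr5 = bandwidth_gbs(4800, 8 * 64)  # 8-channel DDR5-4800 -> 307.2 GB/s
lpddr5x = bandwidth_gbs(7400, 512)  # on-package LPDDR5X-7400 -> 473.6 GB/s

print(f"DDR5:    {ddr5:.1f} GB/s")
print(f"LPDDR5X: {lpddr5x:.1f} GB/s")
print(f"Uplift:  {lpddr5x / ddr5 - 1:.0%}")  # ~54%, close to the quoted 53%
```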

Also, a disruptive innovation/improvement would be a 32GB-or-more low-power non-volatile (VCMA) MRAM HBM stack: in theory, it could replace both LPDDR DRAM memory and NAND flash storage with one very fast non-volatile memory device. Most software actions would then likely feel near-instantaneous (little or no loading time), with nearly no time needed to boot the system (« instant-on »).
NAND-based NVMe drives are already fast enough, with access latencies typically in the single or low double-digit microseconds. Whatever you're "feeling" isn't so much the storage device, but whatever overhead the OS and antivirus are adding. Or, maybe the app is doing more computation than you expect.
 
Reactions: Sluggotg

Diogene7

Prominent
Jan 7, 2023
Nvidia addressed some of these questions in their justification for choosing LPDDR5X for use with their Grace CPU:
Compared to an eight-channel DDR5 design, the NVIDIA Grace CPU LPDDR5X memory subsystem provides up to 53% more bandwidth at one-eighth the power per gigabyte per second while being similar in cost. An HBM2e memory subsystem would have provided substantial memory bandwidth and good energy efficiency but at more than 3x the cost-per-gigabyte and only one-eighth the maximum capacity available with LPDDR5X.​



https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/

Note that I think they're comparing on-package LPDDR5X-7400 against DDR5-4800 DIMMs, which is how they get their 53% figure and such a large energy savings.



Thanks, that at least gives me a ballpark idea, which I didn’t have. Thanks very much for that :).


NAND-based NVMe drives are already fast enough, with access latencies typically in the single or low double-digit microseconds. Whatever you're "feeling" isn't so much the storage device, but whatever overhead the OS and antivirus are adding. Or, maybe the app is doing more computation than you expect.

First, I would like to say that I am not a specialist, and English is not my native language, so my wording below may not be fully accurate. My apologies for that.

Even though I would agree there is overhead from the « file system », NAND flash memory dies (the media) do have quite a lot of latency (microseconds), are attached via PCIe lanes (NOT the memory bus), and are block-addressable (not byte-addressable).

When Intel was selling Optane persistent memory, it was possible to buy it either on a special persistent DRAM module or as an SSD, and the latency of the Optane « memory die » (media) is, I think, in the hundreds of nanoseconds.

With the Optane persistent DRAM module plugged into the memory channel, it is then possible to emulate a virtual disk (a RAM disk), but a persistent RAM disk (which sadly was not bootable, I would guess because UEFI wasn’t designed to take this possibility into account).

Even with the overhead of the file system, I think it was much faster to launch software from the persistent RAM disk, thanks to both the lower latency of the DRAM channel and the Optane media itself (I think I watched a YouTube video from Linus on that; sorry, I don’t have the link here).

One step further would be to optimize the operating system to be « persistent-memory aware » for « Direct Access », which I would think could lower or eliminate the overhead of the file system: this would further decrease the latency…

So I would think that with persistent memory (ideally something like a VCMA MRAM HBM stack) plus operating-system optimizations, loading most software would feel near-instantaneous (« always-on »). But then yes, I think you are right that processing big data files (e.g. video files) would still require time, due to the processing capabilities…

My belief is that the advent of « persistent memory » would be disruptive, and it is a much, much needed technology (especially in IoT) to enable VERY big improvements in the way IT systems are designed (related to « normally-off computing »).

It is a bit like how OLED technology is a key enabler in display technology (self-emissive diodes that also allow flexible, rollable, … displays and open new opportunities).
 

bit_user

Titan
Ambassador
Even though I would agree there is overhead from the « file system », NAND flash memory dies (the media) do have quite a lot of latency (microseconds), are attached via PCIe lanes (NOT the memory bus), and are block-addressable (not byte-addressable).
Even with all of those caveats, we're still talking about access times in the double-digits of microseconds:

[Chart: SSD random-read latency comparison, from the Solidigm P44 Pro review]

Source: https://www.tomshardware.com/reviews/solidigm-p44-pro-ssd-review/2

When Intel was selling Optane persistent memory, it was possible to buy it either on a special persistent DRAM module or as an SSD, and the latency of the Optane « memory die » (media) is, I think, in the hundreds of nanoseconds.
I know what they did. It made sense for memory tiering, in highly-scalable applications, but that's about it.

Even with the overhead of the file system, I think it was much faster to launch software from the persistent RAM disk, thanks to both the lower latency of the DRAM channel,
A RAM disk typically has a lighter-weight type of filesystem, and maybe antivirus isn't configured to perform on-access scanning?

One step further would be to optimize the operating system to be « persistent-memory aware » for « Direct Access », which I would think could lower or eliminate the overhead of the file system: this would further decrease the latency…
You could, and Intel tried. They had at least a proof of concept that enabled direct, userspace access to regions of PMem. I'm not sure if any applications ever used it, though. It would only make sense for things like high-volume databases or other server applications.
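The programming model is basically memory-mapping: a file on a PMem-backed filesystem gets mmap'd, and the application then uses plain loads and stores instead of read()/write() syscalls. A rough sketch of the pattern, with an ordinary temp file standing in for persistent memory:

```python
# Sketch of the mmap access pattern behind "direct access" to PMem.
# On a real DAX mount the same calls would hit the persistent media
# directly, bypassing the page cache; here a normal file stands in.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pmem.bin")
with open(path, "wb") as f:
    f.truncate(4096)  # size the "persistent" region

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        m[0:5] = b"hello"  # a plain store, no write() syscall
        m.flush()          # on PMem this would be a persistence point

with open(path, "rb") as f:
    print(f.read(5))  # b'hello'
```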

So I would think that with persistent memory (ideally something like a VCMA MRAM HBM stack) plus operating-system optimizations, loading most software would feel near-instantaneous
You don't need it. Do the math, sometime. With a decent NVMe drive, the bottlenecks aren't in the hardware.
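To put illustrative numbers on it (assumptions are mine, not measurements of any particular drive or app):

```python
# Rough "do the math" for app-launch time on a decent NVMe drive.
seq_read_gbs = 3.5          # assumed sequential read speed, GB/s
rand_read_latency_us = 60   # assumed QD1 4K random-read latency, us

app_size_mb = 500           # hypothetical app: binaries + assets
small_reads = 2000          # hypothetical count of scattered 4K reads

seq_ms = app_size_mb / (seq_read_gbs * 1000) * 1000
rand_ms = small_reads * rand_read_latency_us / 1000

print(f"Bulk read:   {seq_ms:.0f} ms")   # ~143 ms
print(f"Small reads: {rand_ms:.0f} ms")  # ~120 ms
```

Even with generous numbers, the raw storage time is a fraction of a second; anything beyond that is software overhead or computation.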

My belief is that the advent of « persistent memory » would be disruptive, and it is a much, much needed technology (especially in IoT) to enable VERY big improvements in the way IT systems are designed (related to « normally-off computing »).
Perhaps it could've been disruptive for mobile or IoT, where the power needed for DRAM-refresh is non-negligible. Oddly, they didn't really go after either of those markets, possibly because the active power demands of Optane seemed higher than NAND.
 
Reactions: TJ Hooker

Diogene7

Prominent
Jan 7, 2023
Even with all of those caveats, we're still talking about access times in the double-digits of microseconds:


I know what they did. It made sense for memory tiering, in highly-scalable applications, but that's about it.


A RAM disk typically has a lighter-weight type of filesystem, and maybe antivirus isn't configured to perform on-access scanning?


You could, and Intel tried. They had at least a proof of concept that enabled direct, userspace access to regions of PMem. I'm not sure if any applications ever used it, though. It would only make sense for things like high-volume databases or other server applications.


You don't need it. Do the math, sometime. With a decent NVMe drive, the bottlenecks aren't in the hardware.


Perhaps it could've been disruptive for mobile or IoT, where the power needed for DRAM-refresh is non-negligible. Oddly, they didn't really go after either of those markets, possibly because the active power demands of Optane seemed higher than NAND.

I think I see at least 3 challenges that prevented Phase Change Memory (PCM) / Optane from being used in mobile devices:
1. Much too high power consumption, which makes it unfit for mobile devices
2. Too limited write endurance (I think somewhere around 10^6 cycles)
3. Much too high cost for the mobile-phone market, which is much more cost-sensitive than (AI) data centers

As of 2023, according to some research papers published by the European research center IMEC, one of the non-volatile memories (NVM) that seems to be gaining traction/maturity is MRAM, of which different flavors exist with different trade-offs (Toggle, STT, SOT, VG-SOT, VCMA); as of 2023, VG-SOT-MRAM seems to combine many of the requirements (10^12-cycle endurance, …).
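To illustrate why endurance matters so much for a DRAM replacement, here is my own rough arithmetic, assuming a worst-case hot line rewritten every 100 ns with no wear-leveling at all:

```python
# How long does a continuously rewritten memory line survive?
write_interval_ns = 100  # assumed rewrite interval for a hot line

def lifetime_seconds(endurance_cycles):
    """Seconds until the line hits its write-cycle limit."""
    return endurance_cycles * write_interval_ns / 1e9

print(f"PCM-class (1e6 cycles):      {lifetime_seconds(1e6):.1f} s")
print(f"VG-SOT-MRAM (1e12 cycles):   {lifetime_seconds(1e12):,.0f} s")  # ~28 hours
```

So 10^6 cycles can be exhausted in a fraction of a second of DRAM-like traffic, while 10^12 survives long enough for wear-leveling to handle the rest.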

As a concept, just project the idea of scaling MRAM technology to create 64Gbit non-volatile/persistent VG-SOT-MRAM dies (as DRAM replacements), or even a 64GB VG-SOT-MRAM HBM stack (for AI servers at first, to absorb the R&D costs): I strongly believe the low latency, low power, and persistence would open many new opportunities for AI, and then, many years down the line, for IoT devices, …

The main thing that prevents this from happening is mainly cost (new, unfamiliar manufacturing tools need to be created, highly skilled employees trained to use them, and then extremely expensive fabs built), which, as of 2023/2025, would likely make an MRAM module 100x/1000x more expensive than a regular DDR5 module…

It is a bit like how NAND flash enabled music players (iPod), smartphones (iPhone), … which HDDs weren’t really fit for…