News Adata Demos Next-Gen Memory: CAMM, CXL, and MR-DIMM Modules

I'm really curious what the latency penalty on MR/MCR-DIMM memory looks like as from that picture a potential client version wouldn't have to be larger than current DDR5. This could be a way to increase memory bandwidth a fair bit without needing huge overclocks/shifting to a 256 bit memory bus.
 

bit_user

Polypheme
Ambassador
I'm really curious what the latency penalty on MR/MCR-DIMM memory looks like
I think I read that it's an additional cycle, at the memory bus speed. I'm not sure if that's relative to Registered DIMMs (which already add a ~1 cycle penalty), but I would assume so.

The bigger issue is probably that they increase the burst size from 64 bytes to 128 bytes. Unless the two halves of that burst can come from different addresses, that would make the practical speedup you'd get from them sub-linear (too often, we only want 64 bytes from a given address, due to how CPU caches currently work).

This could be a way to increase memory bandwidth a fair bit without needing huge overclocks/shifting to a 256 bit memory bus.
It's a good stop-gap on the way towards greater use of in-package memory. That's where you're going to see non-linear increases in memory bandwidth.

In-package memory is an inevitability. Not only because memory bandwidth hasn't kept pace with compute scaling, but also because it's necessary for efficiency reasons. And even that won't be enough. Longer-term, we're probably looking at hybrid die stacks of compute + DRAM.
 
Last edited:
  • Like
Reactions: thestryker
It's a good stop-gap on the way towards greater use of in-package memory. That's where you're going to see non-linear increases in memory bandwidth.
Yeah that's my line of thinking.
In-package memory is an inevitability. Not only because memory bandwidth hasn't kept pace with compute scaling, but also because it's necessary for efficiency reasons. And even that won't be enough. Longer-term, we're probably looking at hybrid die stacks of compute + DRAM.
I'm mostly curious what client desktop looks like in this circumstance, because I agree completely this is where it's going. The benefits of integration outweigh the problems in the case of laptops, but I'm not sure about desktop. I assume that we'll likely see more integration in servers and they'll use CXL for capacity, but I'm not sure that will work on client desktop unless we see changes in connectivity (which I'd welcome).
 

bit_user

Polypheme
Ambassador
I'm mostly curious what client desktop looks like in this circumstance, because I agree completely this is where it's going. The benefits of integration outweigh the problems in the case of laptops, but I'm not sure about desktop. I assume that we'll likely see more integration in servers and they'll use CXL for capacity, but I'm not sure that will work on client desktop unless we see changes in connectivity (which I'd welcome).
Look at what Apple achieved with in-package DRAM. The M1 Ultra came with 128 GB of DRAM, which is as much as you could put in a x86 desktop platform... until 48 GB DIMMs came along and elevated them to 192 GB. Then again, you could probably just put those same dies in-package, increasing it to match. So, capacity isn't necessarily the kind of problem you might think it is.

If you want to scale beyond that, then some form of slower, external DRAM will be necessary. It could be a conventional DIMM, but I think it will eventually shift to CXL.mem. If external DRAM is integrated via page-migration, then the latency impact of CXL (vs. directly-connected DIMM) should be negligible. The advantage of CXL is that you have the flexibility to use the same lanes for either memory, storage, or I/O. So, it's a lot more versatile than the current split where some pins on the CPU are dedicated to PCIe and others to DDR5.
 
Look at what Apple achieved with in-package DRAM. The M1 Ultra came with 128 GB of DRAM, which is as much as you could put in a x86 desktop platform... until 48 GB DIMMs came along and elevated them to 192 GB. Then again, you could probably just put those same dies in-package, increasing it to match. So, capacity isn't necessarily the kind of problem you might think it is.
FWIW you can't use DDR5 in package as the chip capacities are only 16/24Gb (max spec is 64Gb). LPDDR5/5X currently are 144/128Gb (I wasn't able to find max spec) which still would require minimum of 8 chips which takes up considerable package space.
 

bit_user

Polypheme
Ambassador
FWIW you can't use DDR5 in package as the chip capacities are only 16/24Gb (max spec is 64Gb). LPDDR5/5X currently are 144/128Gb (I wasn't able to find max spec) which still would require minimum of 8 chips which takes up considerable package space.
Yeah, the M1 Ultra has a 1024-bit memory interface, so that implies 16. But, that's just what Apple did in 2021. Nvidia's new Grace has 512 GB of LPDDR5X, and I'm pretty sure it just uses a 512-bit data bus. If we look at HBM, Intel's Xeon Max has up to 64 GB in 4 stacks.

In spite of how I was talking, I'm not really worried about capacity. The lowest-memory Apple M1 has just 8 GB of memory, and that's shared between the CPU and GPU. To make it work, they use hardware memory compression and swap. For basic stuff, it's fine.

The way it's looking like CXL.mem would work is as a faster substitute for swapping. So, you migrate pages to/from the CXL module(s) on demand. Ideally, this would have some level of hardware support. An interesting difference between swapping to a storage device and CXL.mem is that it's not too painful to access data on the memory module without migrating to faster, in-package memory. You'd only want to migrate it, if it starts getting a lot of accesses.
 
  • Like
Reactions: thestryker
The way it's looking like CXL.mem would work is as a faster substitute for swapping. So, you migrate pages to/from the CXL module(s) on demand. Ideally, this would have some level of hardware support. An interesting difference between swapping to a storage device and CXL.mem is that it's not too painful to access data on the memory module without migrating to faster, in-package memory. You'd only want to migrate it, if it starts getting a lot of accesses.
CXL implementations have been the most interesting part of forthcoming hardware to me. I'm mostly looking forward to MTL just to see the specifics on how Intel is using it within the CPU itself.
 

bit_user

Polypheme
Ambassador
CXL implementations have been the most interesting part of forthcoming hardware to me. I'm mostly looking forward to MTL just to see the specifics on how Intel is using it within the CPU itself.
Huh? I saw them mention that it uses UCIe, but that has a "raw" mode. CXL feels a little heavy-weight for the kind of chiplet interconnects on Meteor Lake. Did you see them say they're using it, or is that just conjecture?
 
Huh? I saw them mention that it uses UCIe, but that has a "raw" mode. CXL feels a little heavy-weight for the kind of chiplet interconnects on Meteor Lake. Did you see them say they're using it, or is that just conjecture?
Hotchips last year is when they talked about it. I don't recall if they specified where all they were using it, but they're using the protocol over their own interconnects. I caught it while watching Ian Cutress doing live commentary.

I forgot Chips and Cheese covered it and they're saying IGP: https://chipsandcheese.com/2022/09/10/hot-chips-34-intels-meteor-lake-chiplets-compared-to-amds/
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
Hotchips last year is when they talked about it. I don't recall if they specified where all they were using it, but they're using the protocol over their own interconnects. I caught it while watching Ian Cutress doing live commentary.

I forgot Chips and Cheese covered it and they're saying IGP: https://chipsandcheese.com/2022/09/10/hot-chips-34-intels-meteor-lake-chiplets-compared-to-amds/
Thanks for the link. In fact, it actually does say where they use it:

"Intel says the CPU and SoC tiles communicate with the IDI protocol, while the iGPU tile uses an iCXL protocol to talk to the SoC."
That makes a lot of sense. You don't want CXL between the CPU cores and memory controller, but putting it between the CPU and GPU isn't that different than using CXL or PCIe to talk to a dGPU.
 
  • Like
Reactions: thestryker

TRENDING THREADS