News: AMD Working to Bring CXL Technology to Consumer CPUs

Hm... I don't see how CXL would be really useful on consumer systems just yet; right now it's definitely just a "nice to have". I wonder if, by the time Intel and AMD have a platform with solid CXL support, we'll finally get proper examples of how CXL will benefit us poor plebs.

Anyone with some good examples? Other than "soopah fast gaems loadz", that is.

Regards.
 
  • Like
Reactions: TCA_ChinChin

rluker5

Distinguished
Jun 23, 2014
620
371
19,260
"Unlike Optane, which Intel is killing off due to poor adoption, CXL already enjoys broad industry support through an open protocol. In fact, AMD and Intel, among many others, are working together on the new specification. "
And we already have both a pagefile and more RAM than most need (32GB for good DDR5 performance). Windows 10 doesn't even fill the extra RAM well, and some people prefer it that way.

I'm a fan of Optane so hopefully I'm not too biased to see something here, but what are the benefits of including something with orders of magnitude worse latency than Optane in your RAM pool if Optane wasn't worth it? Optane DIMMs had a read latency as low as 250 ns. Nanoseconds! Compared to 100+ microseconds for NAND. Not even in the same prefix.

I totally wanted my OS suspended in persistent memory so I could forgo the penalties of using a traditional filesystem, and while it would still be nice, having my PC fishing for bits and pieces in NAND is halfway to just using an HDD. At that point, why even bother?
 

mamasan2000

Distinguished
BANNED
To my knowledge, CXL aims to "solve" the problem in machine learning and other similar tasks where you have huge datasets. You can't fit it all in RAM; it's just not possible. But you want the next storage tier to be as close to RAM as possible so it doesn't become a crazy bottleneck. It seems SSDs will serve as the backing tier for DRAM. And SSDs are much cheaper and bigger, in terms of capacity, than RAM, so it's about cost as well.
The other thing that seems to be a trend is that RAM can't keep up with CPUs anymore. Say a CPU has improved throughput by 50% while RAM only increased bandwidth by 20%, and it gets worse year by year. HBM seems to be one solution, but an expensive one. Then there are calculations done on the RAM sticks themselves: RAM with compute. These seem to be very simple computational tasks, but even a little work offloaded is better than none if the CPU is busy.
This might be of interest, even if it isn't an exact answer: https://www.youtube.com/watch?v=6eQ7P2TD7is
 

InvalidError

Titan
Moderator
Hm... I don't see how CXL would be really useful on consumer systems just yet; right now it's definitely just a "nice to have".
I can imagine at least one obvious use case: once CPUs have 8+GB of on-package RAM, the DDR interface can be ditched in favor of 32 additional PCIe links for CXL, and then you can simply use CXL-to-whatever riser cards for extra memory of whatever type your needs dictate. Need a good amount of cheap memory? Get a DDR4 riser. Need a massive amount of space for a resident-in-memory database? Get an NVDIMM riser. Want something for which additional DIMMs will still be available 4-5 years from now? Get a DDR5 riser. Need to add memory but DDR5 is no longer obtainable for a reasonable price after being supplanted by DDR6? Get a DDR6 riser. SO-DIMMs are significantly cheaper than DIMMs? Get a SO-DIMM riser instead of a standard DIMM one.

By switching out the DRAM interface for CXL, you could use whatever memory type you want in whatever form factor you want for expansion as long as there is an appropriate riser for it. No need to worry about which CPU is compatible with what memory type.

HBM seems to be one solution but an expensive one.
The main reason HBM is expensive is low volume. Start using it everywhere and the price should drop close to parity with normal memory since it is fundamentally the same, just with a different layout to accommodate the wide interface and stacking.
 

jasonf2

Distinguished
CXL, from what I have read, will allow a disaggregation of the entire memory structure from bound devices. So imagine a machine infrastructure where, instead of having dedicated memory for individual components, all memory for all devices coexists in a memory pool with many-to-many communication capabilities. The huge implication I could see for this in the consumer space is the ability to sidestep many of the physical memory limits presented by dedicated discrete GPU and DIMM modules running independently.

Honestly, the datacenter is going to get the most gain. Being able to have multiple CPUs share the memory pool through direct many-to-many communication will reduce TCO while improving performance. But in that same vein, it is at least plausible to have a GPU with no onboard memory (or a very small base amount) that pulls directly from the main system pool. Such a setup would have a cost advantage in mid-tier builds.

While shared memory has been in play since the 90s in the prebuilt scene, it really hasn't been that great because of the latency and overhead needed to communicate between multiple buses. If it is all on the same interconnect layer, a huge part of that issue should be mitigated. The many-to-many ability should also let a whole new set of accelerators be built that can precache slow-to-fast storage transfers and reduce CPU overhead.
 

SSGBryan

Reputable
Jan 29, 2021
136
116
4,760
Anyone with some good examples? Other than "soopah fast gaems loadz", that is.

Regards.

I do 3D art; anything that can get my render engine more memory is a fine thing, and something I am very interested in.

At the hobbyist level, I was running out of RAM (128GB) a decade ago. The ability to add a TB of RAM to my render engine would be really helpful, since we have too many idiot vendors who believe 8K texture maps are a good thing.
 

bit_user

Polypheme
Ambassador
Hm... I don't see how CXL would be really useful on consumer systems just yet; right now it's definitely just a "nice to have". I wonder if, by the time Intel and AMD have a platform with solid CXL support, we'll finally get proper examples of how CXL will benefit us poor plebs.
Imagine an era where CPUs have in-package DRAM, like Apple's M-series. That lowers latency, potentially increases speed (i.e. if they take advantage of being in-package by adding more memory channels), and could decrease motherboard cost/complexity, due to no DIMM slots or need to route connections to them.

Great! So, what's the down side? The main one would be capacity. High-end users will bristle at the limits of what can be had in-package, plus there's no ability to upgrade over time (i.e. without replacing your entire CPU + RAM, making it an expensive proposition).

So, the way out of this conundrum is to add CXL links, which can now even be switched (i.e. enabling them to fan out to a much larger number of devices), and put your memory on the CXL bus. The down side of this is that CXL memory would be much slower to access. Still, waaay faster than SSDs, but slower than directly-connected DIMMs. That can be mitigated by the use of memory tiers, where frequently-used memory pages can be promoted to the "fast" memory, in package. Infrequently-used pages can be evicted and demoted to "slow" memory, out on CXL. This isn't merely a fiction, either. Support for memory tiering and page promotion/demotion already exists in the Linux kernel.
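To make that less abstract: the kernel can do the promotion/demotion automatically, but the same idea can be sketched from user space with the standard move_pages(2) syscall. This is only a minimal illustration, assuming the CXL expander shows up as a CPU-less NUMA node; the node numbers (0 = in-package DRAM, 1 = CXL memory) are assumptions, not anything a real platform guarantees.

```c
/* Minimal sketch: explicitly "demote" a buffer's pages to a slower NUMA node.
 * Assumes node 0 = fast in-package DRAM and node 1 = CXL-attached memory;
 * both node numbers are illustrative assumptions, not a real topology.
 * Build: gcc demote.c -o demote -lnuma
 */
#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NPAGES 16

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(psz, NPAGES * psz);
    memset(buf, 0xAA, NPAGES * psz);   /* fault the pages in on the local ("fast") node */

    void *pages[NPAGES];
    int target[NPAGES], status[NPAGES];
    for (int i = 0; i < NPAGES; i++) {
        pages[i]  = buf + (long)i * psz;
        target[i] = 1;                 /* hypothetical CXL-backed node */
    }

    /* Ask the kernel to migrate the pages; the data stays load/store accessible,
       it just lives on the slower tier afterwards, which is what demotion means. */
    if (move_pages(0, NPAGES, pages, target, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");
    else
        printf("page 0 now on node %d\n", status[0]);

    free(buf);
    return 0;
}
```

The kernel's own tiering does the same kind of migration, just driven by access frequency instead of an explicit call.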

What this means is that you could scale out your memory to TBs, in a relatively low-cost workstation. You'd just need a motherboard with a CXL switch and enough slots. Or, if you get a cheaper board without a switch, you can still add RAM while enjoying the benefits that in-package memory stacks can provide.
 
  • Like
Reactions: -Fran-

InvalidError

Titan
Moderator
What this means is that you could scale out your memory to TBs, in a relatively low-cost workstation. You'd just need a motherboard with a CXL switch and enough slots.
Hypothetically, you could have a motherboard without a CXL switch and just use a riser/expansion card with its own CXL switch for even more CXL slots, as long as your application doesn't mind the extra ~100ns of latency for each additional CXL hop.
 

bit_user

Polypheme
Ambassador
To be honest, this is a gain that compounds as CXL gets wider adoption, which in turn makes a better incentive to use it.
I can imagine at least one obvious use case: once CPUs have 8+GB of on-package RAM, the DDR interface can be ditched in favor of 32 additional PCIe links for CXL, and then you can simply use CXL-to-whatever riser cards for extra memory of whatever type your needs dictate. Need a good amount of cheap memory? Get a DDR4 riser. Need a massive amount of space for a resident-in-memory database? Get an NVDIMM riser. Want something for which additional DIMMs will still be available 4-5 years from now? Get a DDR5 riser. Need to add memory but DDR5 is no longer obtainable for a reasonable price after being supplanted by DDR6? Get a DDR6 riser. SO-DIMMs are significantly cheaper than DIMMs? Get a SO-DIMM riser instead of a standard DIMM one.

I would love that solution, as it would mean my everyday workhorse could be shrunk down, maybe even replaced by a laptop. Imagine a dock that adds 256GB of RAM plus a GPU to your machine.
Right now I need two separate machines, one to generate and one to analyze; maybe I could get by with just a laptop and an expansion slot.
Modularity and plug-and-play pieces are the way of the future, but I am not sure companies would be happy.
100% someone will try to lock down their platform to their own "risers", like HP does with Wi-Fi and such.
 
  • Like
Reactions: mikewinddale

bit_user

Polypheme
Ambassador
Imagine a dock that adds 256GB of RAM plus a GPU to your machine.
Seems plausible. You'd just need the OS to have some strategy for dealing with "undocking". I imagine it would migrate the content of the CXL pages to a compressed swapfile on the laptop's SSD. Could take a little time, so you'd have to click some "remove device" icon to initiate it.

Do I think it'll happen? Probably not. Complex and too niche. Interesting to contemplate...
 
  • Like
Reactions: Rdslw

InvalidError

Titan
Moderator
Seems plausible. You'd just need the OS to have some strategy for dealing with "undocking". I imagine it would migrate the content of the CXL pages to a compressed swapfile on the laptop's SSD. Could take a little time, so you'd have to click some "remove device" icon to initiate it.
The simpler and quicker way would be to keep system-essential and local apps in local memory, and suspend/hibernate everything that was running on CXL memory so it can resume right where it left off when the dock gets reconnected.
 
Seems plausible. You'd just need the OS to have some strategy for dealing with "undocking". I imagine it would migrate the content of the CXL pages to a compressed swapfile on the laptop's SSD. Could take a little time, so you'd have to click some "remove device" icon to initiate it.

Do I think it'll happen? Probably not. Complex and too niche. Interesting to contemplate...
You could have memory "pools" and just stop the OS from using the outer one (integrated -> extended -> swap), so undocking would crash whatever random program was using it, but the OS would be fine. But IMHO, just crash and restart if you undock; docking would need a restart to be recognized as well. Still, the option to have extensions of some kind in laptops, kinda like Framework does right now, except at a level as low as RAM and GPU... I would dig that 100%.
Even if it's Lenovo-style, just a giant brick added to your laptop so you can do something special and then disconnect to continue with your day, that would just be neat. I have to use a 17" laptop because no 15" had 64GB of RAM when I bought it, and I utilize all 64 maybe once a month. I could settle for a 14" and some expansion brick, if that existed, which would be easier on my back in general.
 

cusbrar2

Honorable
Jan 19, 2019
5
0
10,510
I'm a fan of Optane so hopefully I'm not too biased to see something here, but what are the benefits of including something with orders of magnitude worse latency than Optane in your RAM pool if Optane wasn't worth it? Optane DIMMs had a read latency as low as 250 ns. Nanoseconds! Compared to 100+ microseconds for NAND. Not even in the same prefix.

At that point, why even bother?

Because NAND isn't as slow as you claim... the main reason NAND SSDs are that slow today is the controllers and the housekeeping they do. The wide, fast, low-latency part you can actually GET is always better than the single-supplier, technically better part that you cannot get. If you spend more money on the NAND controller, as in enterprise parts, you start approaching Optane performance... that is all there is to it. Then all Optane has going for it is durability.
 

rluker5

Distinguished
Jun 23, 2014
620
371
19,260
CXL has native support for DRAM. It's like accessing memory on your dGPU, except faster (and cache-coherent).
So a bleeding-edge, proprietary-packaging RAM module sitting further from the CPU would help the mainstream user how? Would it be cheaper, or help their performance somehow? I could see how loading everything that would have to be loaded into RAM for, say, a few games, programs, and the OS into a large, low-latency nonvolatile drive, then treating that as RAM and piping it directly to VRAM, the CPU, system memory, or wherever, would save a bunch of CPU work and speed things up, but that is what one of the uses of Optane was supposed to be.

Trying to replace that with the same thing on some new PCIe form factor, plus the most expensive chunk of RAM in an RGB box that has to be loaded again on every reboot, and maybe after sleep, sounds not as good. A new, expensive type of SSD that pales in performance compared to Optane, just to do some software trick, also sounds like trash.

I'm just talking from some plebeian, mainstream-user perspective here. Not some specialized use where you would spend $100k per workstation. The article's title is "AMD Working to Bring CXL Memory Tech to Future Consumer CPUs".

Because NAND isn't as slow as you claim... the main reason NAND SSDs are that slow today is the controllers and the housekeeping they do. The wide, fast, low-latency part you can actually GET is always better than the single-supplier, technically better part that you cannot get. If you spend more money on the NAND controller, as in enterprise parts, you start approaching Optane performance... that is all there is to it. Then all Optane has going for it is durability.
Could you imagine if a company that made Optane could put that controller tech into a NAND SSD? Guess what: Intel could, and even mixed the two on the same SSD, and the NAND was still the same latency. You can still buy those on eBay. The fact that PCIe 5.0 NVMe SSDs still have roughly the same latency as the first SATA ones from 10 years earlier seems to indicate NAND might just be slower.

Edit: the first, cheapest Optane released, the 16GB module, does 280MB/s in Q1T1 random reads. The fastest goes to about 400. What does a RAM disk get at Q1T1? 600MB/s? They're totally held back by where they sit and how they are read. Put Optane in DIMMs with the right format and it gets way faster, but still 4-5x the latency of DRAM. That comes from removing the bottleneck. NAND is so slow that it is the bottleneck wherever you put it. Just like an HDD.
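For what it's worth, the arithmetic behind those Q1T1 numbers (assuming the usual 4 KiB random-read block size, which the benchmarks don't always state): at queue depth 1, throughput is just block size divided by per-read latency, so the MB/s figures translate directly into latency.

```c
/* Back-of-envelope: at queue depth 1, sustained throughput ~= block size / latency,
 * so a Q1T1 MB/s figure implies a per-read latency. The 4 KiB block size is an
 * assumption (typical for Q1T1 random-read benchmarks); the MB/s figures are the
 * ones quoted above.
 */
#include <stdio.h>

int main(void)
{
    const double block_bytes = 4096.0;
    const char  *label[] = { "16GB Optane module", "fastest Optane SSD", "RAM disk" };
    const double mbps[]  = { 280.0, 400.0, 600.0 };

    for (int i = 0; i < 3; i++) {
        double latency_us = block_bytes / (mbps[i] * 1e6) * 1e6;
        printf("%-20s ~%.1f us per 4 KiB read at QD1\n", label[i], latency_us);
    }
    return 0;
}
```

Even the RAM disk comes out around 7 µs per 4 KiB read at QD1, so at that queue depth the software stack and interface set the floor, not the media, which is why Optane on the NVMe bus never shows its ~250 ns media latency.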
 
Last edited:

George³

Prominent
Oct 1, 2022
228
124
760
I'm just talking from some plebeian, mainstream-user perspective here. Not some specialized use where you would spend $100k per workstation. The article's title is "AMD Working to Bring CXL Memory Tech to Future Consumer CPUs".

Perhaps in this case the term "consumers" is used in its full sense. ;)
 
  • Like
Reactions: rluker5

TJ Hooker

Titan
Ambassador
So, the way out of this conundrum is to add CXL links, which can now even be switched (i.e. enabling them to fan out to a much larger number of devices), and put your memory on the CXL bus. The down side of this is that CXL memory would be much slower to access. Still, waaay faster than SSDs, but slower than directly-connected DIMMs. That can be mitigated by the use of memory tiers, where frequently-used memory pages can be promoted to the "fast" memory, in package. Infrequently-used pages can be evicted and demoted to "slow" memory, out on CXL. This isn't merely a fiction, either. Support for memory tiering and page promotion/demotion already exists in the Linux kernel.
As a layman, it's not obvious to me why this is different/better than the current situation of using a pagefile for arbitrary amounts of virtual memory, and having peripherals access memory via DMA. Can you elaborate?
 
As a layman, it's not obvious to me why this is different/better than the current situation of using a pagefile for arbitrary amounts of virtual memory, and having peripherals access memory via DMA. Can you elaborate?
I'd imagine one is an OS-level implementation (swap/virtual memory), while the other is specific to a single device type (or a small family of devices) via its own protocols.
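To make the contrast concrete, here's a minimal sketch, under the assumption that a CXL memory expander is exposed to the OS as a CPU-less NUMA node (node 1 here is made up for illustration): once the mapping is bound, the CPU reaches that memory with ordinary cache-coherent loads and stores, with no driver call, block I/O, or DMA transfer in the path, whereas touching a paged-out page in a pagefile traps into the kernel and waits on an SSD read.

```c
/* Sketch: pin an allocation to a (hypothetical) CXL-backed NUMA node.
 * Node 1 is an assumption for illustration; the real topology would come
 * from /sys/devices/system/node. Build: gcc cxl_bind.c -o cxl_bind -lnuma
 */
#include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 20;    /* 64 MiB */
    uint8_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << 1;   /* hypothetical CXL node 1 */
    if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, MPOL_MF_MOVE))
        perror("mbind");

    /* Plain stores and loads: the CPU talks to the CXL memory directly.
       Contrast with swap, where touching an evicted page faults into the
       kernel and waits on an SSD read before the access can complete. */
    for (size_t i = 0; i < len; i += 4096)
        p[i] = (uint8_t)i;
    printf("first byte: %u\n", p[0]);

    munmap(p, len);
    return 0;
}
```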

Regards.