News Jim Keller Shares Zen 5 Performance Projections

Tradeoffs that likely won't make any lick of sense by the time DDR6 becomes mainstream:
You're saying tradeoffs made today that won't make sense in the future? Sure, but any design involves tradeoffs, and they need to be made at some point. That's what I meant.

I'm just saying you can't design this stuff purely in your head. Whatever you dream up won't be optimal until you model, test, tune, and iterate.
 
You're saying tradeoffs made today that won't make sense in the future? Sure, but any design involves tradeoffs, and they need to be made at some point. That's what I meant.
I mean the tradeoffs that would come with Kamen's mega-scaling of everything else. Most of the things I hypothesized are either already happening in the server space, a mere continuation of existing trends or what seems to me like the logical conclusion of things.

They've tried that multiple times in the past; every time someone tried to go "Serial Interface", the industry has stated "NEIN!" and adoption failed.

I wouldn't expect HBM memory to be on any consumer client platforms; that seems to be relegated to Enterprise Customers due to cost reasons.

That's why I want the ½ GiB DRAM Package to remain an option for implementation.

You also won't need a dedicated DRAM package for ECC,
Serial RAM had no chance of working until CPUs had enough on-package DRAM to operate primarily from that and use off-package memory as a very fast swap file instead. On-package DRAM will almost certainly be a thing in the consumer space before DDR7 comes along as it already is in the server space, as is CXL-mem.

While HBM may be too expensive for the consumer space today, there is plenty of time for that to change and other variants to come along over the next 7-10 years before DDR7 might hypothetically come out. As I have written many times before, HBM is still fundamentally the same tech as any old DRAM; the only things keeping it expensive are low volume and TSVs. TSVs will be in everything soon enough: they are already in every Zen 3/4 CCD in case AMD needs to shave the back of the die and expose them to slap a V-cache die on there. 7+ years is a long time to increase capacity and bring down costs. Meanwhile, faster, more convoluted external parallel interfaces will keep getting more problematic and expensive.

Lastly, putting only 0.5GB of DRAM on-package doesn't make any sense when nobody makes modern DRAM smaller than 1GB and that already has only a ~60sqmm footprint. (SK-Hynix' 2nd-gen 1GB DDR5 is 56sqmm.)
 
Your information is obsolete. DDR5 already introduced active circuitry between the DIMM interface and the DRAM chips.
Are you referring to the PMIC or the Data Buffer?
[attached image: r3brDia.png]

Apple is using LPDDR5X in its consumer CPUs.
That's different from HBM; having a few chips next door is far cheaper to implement than HBM.

Without models or data, how can you plan even 1 generation, much less 8?
That's why I want to work with members within JEDEC on making suggestions, but they don't exactly let random people interface with them.
Their organization requires membership to even talk to them.

Apple ships Mac Minis with as little as 8 GB. They use memory compression to make up a lot of the deficit. Not that it handles all cases adequately, but it's a fact that they're doing it.
That's 8 GiB today. I'm talking many years into the future. But your suggestion of 32-64 GiB was correct for a min spec by that time frame.


Sounds expensive.
Not really, by that time (DDR 12-15) it would be pretty standard

Serial RAM had no chance of working until CPUs had enough on-package DRAM to operate primarily from that and use off-package memory as a very fast swap file instead. On-package DRAM will almost certainly be a thing in the consumer space before DDR7 comes along as it already is in the server space, as is CXL-mem.
I have a feeling that L4$ using on-package SRAM would be faster / lower latency than using DRAM.

Given that everybody wants performance, SRAM is inherently faster than DRAM given its fundamental structural design.
Also, SRAM is natively Full-Duplex versus the Half-Duplex nature of DRAM.
SRAM would work better as a Victim Cache for L3$.

I still see CXL Memory being the swap space for Main Memory, especially given the physical realities of the latency penalties that naturally occur.
[attached image: DbvSGPm.jpg]
It's literally in the same range as Optane's latency.
[attached image: CwgBuZV.jpg]


While HBM may be too expensive for the consumer space today, there is plenty of time for that to change and other variants to come along over the next 7-10 years before DDR7 might hypothetically come out. As I have written many times before, HBM is still fundamentally the same tech as any old DRAM; the only things keeping it expensive are low volume and TSVs. TSVs will be in everything soon enough: they are already in every Zen 3/4 CCD in case AMD needs to shave the back of the die and expose them to slap a V-cache die on there. 7+ years is a long time to increase capacity and bring down costs. Meanwhile, faster, more convoluted external parallel interfaces will keep getting more problematic and expensive.
You've stated yourself that the ideal solution for Memory is to have the Memory Die sit on top of the Memory Controller.
If you go with an OMI-style off-package Memory Controller, there is no reason why you can't have a very short hop, similar to RDNA3's MCDs, where the Controller sits physically next to the core die and HBM sits on top of the MCD for the shortest possible interface run.
That would be a much more viable way of slapping HBM on top of a Memory Controller Die designed for DRAM.
That would be the ultimate form of HBM.
Also cooling would become much more reasonable since you're only dealing with cooling the Memory Controller Die & HBM dies.


Lastly, putting only 0.5GB of DRAM on-package doesn't make any sense when nobody makes modern DRAM smaller than 1GB and that already has only a ~60sqmm footprint. (SK-Hynix' 2nd-gen 1GB DDR5 is 56sqmm.)
That's why I want the next update/iteration of the standard to make an allowance for that.
Unless the DRAM makers want to make 32 GiB the minimum size for a DIMM module (I'd personally be fine with that, but I think others would find that too big & expensive for the bottom end of the market, and we don't want to forget the bottom end; not everybody is loaded with money).
I think there should be an allowance in the spec for a smaller DRAM Package size from yesteryear.
And given how many DRAM packages I want to fit in one row (16x on each side, 32x when double-sided) by the time of DDR 12-15, having the smallest DRAM package size be ½ GiB should be easily doable.
 
You've stated yourself that the ideal solution for Memory is to have the Memory Die sit on top of the Memory Controller.
If you go with an OMI-style off-package Memory Controller, there is no reason why you can't have a very short hop, similar to RDNA3's MCDs, where the Controller sits physically next to the core die and HBM sits on top of the MCD for the shortest possible interface run.
OMI is irrelevant for on-package memory where chip manufacturers have absolute control over everything and can make their interfaces whatever they want including shuffling functionality between dies. As for putting HBM-like memory on top of MCDs, I was probably the first on the forum to bring that idea up when the MCD topic first came up. As I have mentioned previously, that is only a temporary work-around until memory can be stacked directly on CPUs/GPUs to eliminate the off-die hop altogether.

That's why I want the next update/iteration of the standard to make an allowance for that.
Unless the DRAM makers want to make 32 GiB the minimum size for a DIMM module (I'd personally be fine with that, but I think others would find that too big & expensive for the bottom end of the market, and we don't want to forget the bottom end; not everybody is loaded with money).
If you want to make smaller external memory sizes still viable when we get to 4GB ~100sqmm DRAM dies around the time DDR7 may come around, there is a simple cheap solution for that: half-DIMM slots.

If JEDEC sticks to the same general slot form factor, the other option would be for chip manufacturers to add 32-bits wide chips to their x4/x8/x16 lineup, then you can halve the DIMM size by putting half as many twice-as-wide (bus-wise) DRAM chips on it, no need to make physically under-sized chips that would waste an excessive amount of wafer space on saw lines between dies.
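As a rough sketch of that chip-count math (the 64-bit bus below is just an illustrative figure, since DDR5 actually splits it into two 32-bit subchannels, and the x32 entry is the hypothetical wider chip being proposed):

```python
# Rough sketch of the chip-count argument: for a fixed data-bus width,
# wider chips mean fewer packages per module. The 64-bit bus is an
# illustrative assumption; the x32 chip is the hypothetical part.
BUS_WIDTH_BITS = 64

for chip_width in (4, 8, 16, 32):
    chips_needed = BUS_WIDTH_BITS // chip_width
    print(f"x{chip_width:<2} chips -> {chips_needed} packages per rank")

# x4  chips -> 16 packages per rank
# x8  chips -> 8 packages per rank
# x16 chips -> 4 packages per rank
# x32 chips -> 2 packages per rank  (half the count of x16, as argued above)
```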
 
Are you referring to the PMIC or the Data Buffer?
Unbuffered DDR5, not LR.

That's different than HBM, having a few chips next door is far cheaper to implement than HBM.
The point was that it's in-package memory in consumer products. Mac Minis starting at $600.

That's 8 GiB today. I'm talking many years into the future. But your suggestion of 32-64 GiB was correct for a min spec by that time frame.
They start at 8 GB and go up from there. The base SoC can be ordered with 16 or 24 GB, while the M2 Pro can be ordered with up to 32 GB.

If you step up to the Mac Studio line, the M1 Max can take up to 64 GB, while the M1 Ultra can handle 128 GB, since it's just 2 M1 Max SoCs in a package.

Not really, by that time (DDR 12-15) it would be pretty standard
The number of chips, I mean.

You've stated yourself that the ideal solution for Memory is to have the Memory Die sit on top of the Memory Controller.
Only if it's in-package. That way, you can avoid the SERDES hit on latency, power, and bandwidth.
 
OMI is irrelevant for on-package memory where chip manufacturers have absolute control over everything and can make their interfaces whatever they want including shuffling functionality between dies. As for putting HBM-like memory on top of MCDs, I was probably the first on the forum to bring that idea up when the MCD topic first came up. As I have mentioned previously, that is only a temporary work-around until memory can be stacked directly on CPUs/GPUs to eliminate the off-die hop altogether.
But given AMD's GPU chiplet paradigm, where the Memory Controller was separated from the GPU into the GCD & MCD, how do you see on-package memory going in the GPU world?

nVIDIA famously wants to keep its memory controller attached to the sides of its GPU by sticking to monolithic dies, but that also creates a giant heat issue with the massive amount of heat on the GPU.

The CPU went from Monolithic, w/ the Memory Controller on the CPU package, to Chiplets, where the Memory Controller gets shunted to the cIOD or a different tile.
Both AMD & Intel have already done that with Ryzen & upcoming Meteor Lake.

And my suggestion of OMI shunts the Memory Controllers even further out for flexibility of platform design architecture.

Apple is the only one to have the LPDDR DRAM package shoved next door (not on the Memory Controller); that interposer must add some extra cost to manufacturing.

And with the massive heat issues, the extra thermal barriers, and the chiplet paradigm, I don't see the memory controller ever going back into the CPU/GPU/ICs anymore.
At best it's next door. At worst the Memory Controller is modular and the DRAM Package gets bonded on top as a chiplet with some sort of hop to the primary chiplet.

If you want to make smaller external memory sizes still viable when we get to 4GB ~100sqmm DRAM dies around the time DDR7 may come around, there is a simple cheap solution for that: half-DIMM slots.
That sounds like a compatibility nightmare to architect for.

We don't need another type of DIMM slot; we already have 3x types of DIMM slots in existence.
[attached image: 0vlVUEa.png]
Trying to make a new DIMM slot that allows a half size DIMM would be a nightmare to architect.
Especially when we're already entrenched w/ 3x different slot formats.


If JEDEC sticks to the same general slot form factor, the other option would be for chip manufacturers to add 32-bits wide chips to their x4/x8/x16 lineup, then you can halve the DIMM size by putting half as many twice-as-wide (bus-wise) DRAM chips on it, no need to make physically under-sized chips that would waste an excessive amount of wafer space on saw lines between dies.
Can't you shrink the saw lines down?

I remember reading an article on new techniques to min max everything on the die, especially with Hexagonal Dies.
[attached images: A5fjQil.png, UTfp8yR.png]

Combine Hexagonal Dies & Chemical Slicing and you have a recipe for massively increasing the # of dies per wafer.
 
Unbuffered DDR5, not LR.
I don't think Unbuffered DIMMs are going to stay around for much longer past DDR5.
[attached image: bAAjTUp.png]
Given the ever increasing DRAM capacity and the electrical load that it creates, I think Load Reduced DIMMs will become the normal required standard.
This should help make clocking the DIMMs to ever higher speeds easier while supporting ever increasing DRAM capacities.


The point was that it's in-package memory in consumer products. Mac Minis starting at $600.


They start at 8 GB and go up from there. The base SoC can be ordered with 16 or 24 GB, while the M2 Pro can be ordered with up to 32 GB.

If you step up to the Mac Studio line, the M1 Max can take up to 64 GB, while the M1 Ultra can handle 128 GB, since it's just 2 M1 Max SoCs in a package.
Doesn't Apple charge an obscene amount for their M2 Pro line and over-charge for memory, especially compared to equivalent DIMM costs?
I remember the M1 Max / Pro / Ultra being stupid expensive.
And their RAM mark-up is also obscene.

Should anything go bad with On-Package RAM, it's a pain to fix.

While for DIMMs, it's an easy swap.

The number of chips, I mean.
It's not the packaging of the chips that is expensive, or the number of chips realistically.
All of that is fully automated: placed & soldered by machine.
It's the total memory capacity you're buying into, along with the binning / speed grade + any extra BS features like RGB.

The nice part about lower Memory Capacity DRAM is that it's easier to clock higher.
Less Electrical Load on your Memory Controller = Higher Clock Speeds for OC-ing or lower timings.
The more Memory Capacity you have, the harder it'll be to drive it to ever higher speeds.
That's another reason to keep ½ GiB around as a special carve-out alongside 1-8 GiB in 1 GiB increments.
There's a place for many capacities in 1 GiB increments.

Only if it's in-package. That way, you can avoid the SERDES hit on latency, power, and bandwidth.
Given the chiplet world we're seeing and the move to de-couple the Memory Controller from the CPU / GPU / IC, I don't see that happening.
 
We don't need another type of DIMM slot; we already have 3x types of DIMM slots in existence.
Who cares about DIMM types that currently exist? Each new DRAM generation brings a clean slate of new DIMM specs. When I say "half-DIMM" I mean it as in making the official "full-size" DIMM half the size of current DIMMs.

My bet, in case you may have already forgotten my 100 previous posts on this topic, is on on-package memory making external memory obsolete for most applications before DDR7 hits the market. At that point, it will make sense to downsize external memory interfaces to cut costs and complexity now that external memory is relegated to bridging the gaps between needing just enough extra memory that a swapfile on NVMe isn't acceptable anymore and needing enough extra high-speed memory to justify going one CPU model/tier up to get it.
 
Who cares about DIMM types that currently exist? Each new DRAM generation brings a clean slate of new DIMM specs. When I say "half-DIMM" I mean it as in making the official "full-size" DIMM half the size of current DIMMs.

My bet, in case you may have already forgotten my 100 previous posts on this topic, is on on-package memory making external memory obsolete for most applications before DDR7 hits the market. At that point, it will make sense to downsize external memory interfaces to cut costs and complexity now that external memory is relegated to bridging the gaps between needing just enough extra memory that a swapfile on NVMe isn't acceptable anymore and needing enough extra high-speed memory to justify going one CPU model/tier up to get it.
Ok, DDR7 should easily be well within our life times. We'll see how that pans out =D
 
Ok, DDR7 should easily be well within our life times. We'll see how that pans out =D
It probably won't take that long to see the signs: on-package memory is already a thing in high-end servers, mobile SoCs and Apple's M1/M2. I doubt it will take more than 2-3 years before we see it proliferate into many more product categories and confirm my predictions of inevitability.
 
That's why I want to work with members within JEDEC on making suggestions, but they don't exactly let random people interface with them.
Their organization requires membership to even talk to them.
What I can't understand is why you think you have any ideas they haven't considered. We know that companies like Apple, Intel, and AMD are working with memory manufacturers directly, so it's not as if interests of CPU vendors aren't being represented.

If you're interested in hardware design, why not start by playing with simulators and FPGAs? Why do you think you should go straight to JEDEC, and why would they even take you seriously if you clearly haven't put in the work to learn all the fundamentals as they all had to do?

It's like if some online strategy game enthusiast tried to get in touch with the Joint Chiefs of Staff to advise them on force readiness and deployment strategies.

I still see CXL Memory being the swap space for Main Memory, especially given the physical realities of the latency penalties that naturally occur.
I think we're all saying it's rather like swap. I envision full 4 kB pages being migrated in and out of HBM. Compared to the page-miss penalty, that amount of latency is a non-issue.

Here's a fun thought-experiment. Compare the memory capacity of a machine with its memory bandwidth. You'll quickly see that machines on the higher-end of the capacity spectrum cannot have a working set that's very large, without being totally memory-starved. For instance, a top-end mainstream desktop would take nearly 2 seconds just to read all of its memory, at the max capacity it can hold. For servers, the ratio is even worse.
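A back-of-the-envelope version of that calculation (the 192 GB capacity and dual-channel DDR5-5600 bandwidth are illustrative assumptions for a current top-end mainstream desktop, not measurements):

```python
# Back-of-the-envelope check of the "~2 seconds to read all of RAM" claim.
# Both figures are illustrative assumptions, not measurements:
# 192 GB max capacity (4 x 48 GB DIMMs) and dual-channel DDR5-5600.
capacity_gb = 192
bandwidth_gb_per_s = 2 * 5600e6 * 8 / 1e9  # two 64-bit channels -> ~89.6 GB/s

seconds_to_read_everything = capacity_gb / bandwidth_gb_per_s
print(f"{seconds_to_read_everything:.1f} s")  # ~2.1 s, in line with the post above
```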

If you flip this on its head, that means because the working set of most software is far smaller than the max memory capacity, only a subset of that memory needs to be fast. And that's where having a chunk of fast, in-package memory is a big net win, even if it means you need to supplement with even slower memory to reach the same capacity.

AMD recently made the point that you cannot scale performance without improving efficiency. In-package memory is necessary, if the energy efficiency of memory is to improve. This is exactly the rationale Nvidia gave for packaging LPDDR5X with its Grace CPUs.

Doesn't Apple charge an obscene amount for their M2 Pro line and over-charge for memory, especially compared to equivalent DIMM costs?
I remember the M1 Max / Pro / Ultra being stupid expensive.
And their RAM mark-up is also obscene.
My point was simply that they eliminated external memory. That shows most consumers don't even need any external memory. Keep in mind that the in-package memory quantities cited in my last post are shared by both the CPU and GPU! They're essentially using memory compression as their tier between main memory and swap.

Should anything go bad with On-Package RAM, it's a pain to fix.

While for DIMMs, it's an easy swap.
Yes, but you could work around that with ECC and by having the ability to disable portions of it. I'm not sure which of those strategies, if any, that Apple uses.

We accept soldered memory on our graphics cards, so why not CPUs?

It's not the packaging of the chips that is expensive, or the number of chips realistically.
From someone so keen on reducing the trace count from CPUs, this is a rather surprising statement.
 
I think we're all saying it's rather like swap. I envision full 4 kB pages being migrated in and out of HBM. Compared to the page-miss penalty, that amount of latency is a non-issue.
[...]
If you flip this on its head, that means because the working set of most software is far smaller than the max memory capacity, only a subset of that memory needs to be fast. And that's where having a chunk of fast, in-package memory is a big net win, even if it means you need to supplement with even slower memory to reach the same capacity.
I was thinking the same, literally using PCIe-mapped memory for page-wise swaps almost exactly like a normal swapfile, minus the 10-50us of file system and SSD controller overhead.
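To put rough numbers on that comparison, here is a sketch with assumed ballpark latencies (none of these are benchmarks; the 10-50 us figure is the filesystem/SSD controller overhead mentioned above):

```python
# Ballpark comparison of servicing a 4 KiB page from different tiers.
# All numbers are assumed order-of-magnitude figures for illustration only.
assumed_latencies_us = {
    "on-package / local DRAM": 0.1,   # ~100 ns
    "CXL.mem expander (DRAM)": 0.4,   # a few hundred ns, Optane-ish range
    "NVMe swapfile": 0.4 + 30.0,      # media hop + ~10-50 us of FS/SSD overhead
}

baseline = assumed_latencies_us["on-package / local DRAM"]
for tier, lat in assumed_latencies_us.items():
    print(f"{tier:28s} ~{lat:6.1f} us  ({lat / baseline:5.0f}x local DRAM)")
```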

I typically use 10-16GB of RAM and like to have 16GB of spare for those times where I want to run a couple of extra things or just have plenty of room for file caching. I (and the vast majority of semi-normal people) would likely be perfectly fine with 16GB of on-package memory to host whatever I am actively using and let all of the background stuff get shuffled off to 16-32GB of M.2-RAM or whatever the post-on-package-memory-era off-package memory expansion standard may be if that is cheaper than getting the next CPU model up with more on-package RAM.

If you want an extra cache tier in there, it could be something like 4MB for IO buffering and command queuing on the M.2-RAM controller.
 
I was thinking the same, literally using PCIe-mapped memory for page-wise swaps almost exactly like a normal swapfile, minus the 10-50us of file system and SSD controller overhead.
Phoronix has been publishing periodic updates on the state of tiered memory support, in the Linux kernel. So far, the support already involves migration of pages between tiers. I had my own thoughts about this, but it serves as some confirmation that the industry is moving in this direction.



I (and the vast majority of semi-normal people) would likely be perfectly fine with 16GB of on-package memory to host whatever I am actively using and let all of the background stuff get shuffled off to 16-32GB of M.2-RAM or whatever the post-on-package-memory-era off-package memory expansion standard may be if that is cheaper than getting the next CPU model up with more on-package RAM.
As I mentioned, a lower-tier option that exists today is memory compression. I can't speak from personal experience, but I hear it's very usable in Linux. Apple has gone so far as to implement compression & decompression in hardware. I have no idea on the state of such a technology in Windows.
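A toy model of why compression works as a tier, with the 2:1 ratio and the fraction of compressed pages as pure assumptions:

```python
# Toy model of memory compression as an intermediate tier (zram/zswap style).
# The compression ratio and the fraction of RAM parked in the compressed
# pool are assumptions for illustration, not measured values.
physical_gb = 8.0
compressed_pool_fraction = 0.25   # quarter of RAM holding compressed pages
assumed_ratio = 2.0               # assumed typical ratio for idle pages

uncompressed_gb = physical_gb * (1 - compressed_pool_fraction)
effective_compressed_gb = physical_gb * compressed_pool_fraction * assumed_ratio
print(f"effective capacity ~{uncompressed_gb + effective_compressed_gb:.1f} GB "
      f"out of {physical_gb:.0f} GB physical")   # ~10 GB in this sketch
```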
 
What I can't understand is why you think you have any ideas they haven't considered. We know that companies like Apple, Intel, and AMD are working with memory manufacturers directly, so it's not as if interests of CPU vendors aren't being represented.
Of course their interests are being represented, but in what ways?
Everybody has different aspects of what they want to focus on.
What some people at these companies may care about isn't what somebody else would care about.
That's part of the diversity of thought process.

If you're interested in hardware design, why not start by playing with simulators and FPGAs? Why do you think you should go straight to JEDEC, and why would they even take you seriously if you clearly haven't put in the work to learn all the fundamentals as they all had to do?
Not everybody has that kind of $$$ to access those simulators or FPGAs. I'm not rich; I can't afford lots of high-end hardware or expensive-ass software licenses as easily as you seem to think. Those things aren't cheap for most average people to simulate with.

It's like if some online strategy game enthusiast tried to get in touch with the Joint Chiefs of Staff to advise them on force readiness and deployment strategies.
Weird <Mod Edit> does happen IRL:
James Baldwin: World's fastest gamer to real life racer

But anyways, different people want to do things differently.
Steve Jobs wasn't a very conventional person, but he got a lot of <Mod Edit> done that was pretty outlandish at the time.


I think we're all saying it's rather like swap. I envision full 4 kB pages being migrated in and out of HBM. Compared to the page-miss penalty, that amount of latency is a non-issue.

Here's a fun thought-experiment. Compare the memory capacity of a machine with its memory bandwidth. You'll quickly see that machines on the higher-end of the capacity spectrum cannot have a working set that's very large, without being totally memory-starved. For instance, a top-end mainstream desktop would take nearly 2 seconds just to read all of its memory, at the max capacity it can hold. For servers, the ratio is even worse.

If you flip this on its head, that means because the working set of most software is far smaller than the max memory capacity, only a subset of that memory needs to be fast. And that's where having a chunk of fast, in-package memory is a big net win, even if it means you need to supplement with even slower memory to reach the same capacity.

AMD recently made the point that you cannot scale performance without improving efficiency. In-package memory is necessary, if the energy efficiency of memory is to improve. This is exactly the rationale Nvidia gave for packaging LPDDR5X with its Grace CPUs.
But if you're going with in-package memory, why does it have to be DRAM, why not SRAM?

SRAM is Full-Duplex versus the Half-Duplex nature of DRAM, which makes it better for parallel data access between multiple threads.
SRAM can be scaled up at cost for massive speed. SRAM is also MASSIVELY faster than DRAM, especially when it's "In-Package".
You only have to look at 3D V-Cache to see its benefits.
[attached image: VZhqgnr.png]

If you want it "On-Package", SRAM offers far better bandwidth than DRAM due to the inherent nature of the technology.
If you're worried about "Energy Efficiency", accessing SRAM is WAY more energy efficient than DRAM.
Literally orders of magnitude more efficient.

[attached image: hZ6PCzx.png]


My point was simply that they eliminated external memory. That shows most consumers don't even need any external memory. Keep in mind that the in-package memory quantities cited in my last post are shared by both the CPU and GPU! They're essentially using memory compression as their tier between main memory and swap.
Yeah, but what if the end consumer wants more memory?
They don't want to suffer the latency penalty of CXL.mem.

Having extra DIMM slots helps with that.


Yes, but you could work around that with ECC and by having the ability to disable portions of it. I'm not sure which of those strategies, if any, that Apple uses.

We accept soldered memory on our graphics cards, so why not CPUs?
Ok, if you want soldered memory, are you going to try to force people to not have expandable DIMM slots?

What if the end user wants more memory than what you offer in your package?

Are you going to force them onto CXL.mem?

What if nobody wants to deal with the extra Over-head / latency of CXL.mem?

Are you going to tell them, "Sucks to be you, do what I say"?


From someone so keen on reducing the trace count from CPUs, this is a rather surprising statement.
That's the point of OMI: you have the Memory Controller separated out, with a simple (Low Pin-Count, High-Speed/Bandwidth) serial connection to the CPU.
The separate Memory Controller can be designed to take any number of pins / traces as needed by any new standards change.

The Serial Interface on the CPU's cIOD, or wherever it lives, is far simpler and lowers the total die area needed for it.

You also have the option to add MANY more Serial Memory Interfaces.

That's the 8:1 Pin-Count advantage.

Compare the 300-pin DIMM Slot format carrying 1x Memory Channel vs 75 pins for a Dual-Channel Memory Controller with 2x independently clockable Memory Channels.
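Spelling out that 8:1 arithmetic with the pin counts quoted above (taken at face value, not checked against any published pinout):

```python
# The 8:1 pin-count claim, spelled out with the figures from the post above
# (taken at face value, not verified against any published pinout).
pins_per_parallel_channel = 300   # conventional DIMM slot, one channel
pins_for_serial_controller = 75   # serial link feeding a dual-channel controller
channels_behind_serial_link = 2

pins_parallel = pins_per_parallel_channel * channels_behind_serial_link  # 600
ratio = pins_parallel / pins_for_serial_controller
print(f"{pins_parallel} vs {pins_for_serial_controller} pins -> {ratio:.0f}:1")
```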
 
Not everybody has that kind of $$$ to access those simulators or FPGAs. I'm not rich; I can't afford lots of high-end hardware or expensive-ass software licenses as easily as you seem to think. Those things aren't cheap for most average people to simulate with.
You can get hobby FPGA boards for as little as tens of dollars, and there is free software available for development/simulation.
 
Not everybody has that kind of $$$ to access those simulators or FPGAs. I'm not rich; I can't afford lots of high-end hardware or expensive-ass software licenses as easily as you seem to think. Those things aren't cheap for most average people to simulate with.
Who said anything about being rich? I'll bet there are some open source simulators you can play around with, as well as FPGA prototyping boards that are within most budgets.

Find a good hardware development community and I'm sure they can point you in the right directions.

Proper racing sims are a pretty good approximation of actual cars, so that counts as relevant experience. It would be somewhat like if you designed and prototyped a CPU core on a FPGA.

Steve Jobs wasn't a very conventional person, but he got a lot of <Mod Edit> done that was pretty outlandish at the time.
So, you're Steve Jobs, now? Remind me what part of Steve Jobs' life was spent writing about elaborate plans on bulletin boards or CompuServe.

But if you're going with in-package memory, why does it have to be DRAM, why not SRAM?
Cost.

If you're worried about "Energy Efficiency", accessing SRAM is WAY more energy efficient than DRAM.
Literally orders of magnitude more efficient.

"Energy Efficiency of basic operations"
hZ6PCzx.png
Please don't post these kinds of graphics without a clear source.

If you just use a little common sense, it's obviously not saying that 32 bits of storage consume so much energy. The units are wrong - it must refer to a read, write, or refresh.

Citing clear sources would clear up any confusion about not only what is being measured, but also the process nodes involved, since that's a dominant variable in the energy-efficiency of such operations. These details don't just matter - they're crucial. Getting them wrong can topple the entire tower of assumptions you build atop them.

Yeah, but what if the end consumer wants more memory?
They don't want to suffer the latency penalty of CXL.mem.
I already explained that part. I'm not going to repeat myself, if you can't read.

Ok, if you want soldered memory, are you going to try to force people to not have expandable DIMM slots?
Apple did it. That proves it's not the kind of necessity, for most users, that you seem to think it is.

What if the end user wants more memory than what you offer in your package?
Apple uses memory compression and then swap. Relative to that, I think having the expandability option of CXL.mem is a benefit.

Are you going to force them onto CXL.mem?
Sure, and people will accept it. The reason I'm so sure is just by looking at what's happened to the HEDT market. There are options to get 2x and 4x memory configurations than what's available on the mainstream desktop, but the additional performance isn't worth enough to most users to justify the additional cost.

Look how popular the 5800X3D and 7800X3D are? They're popular because they offer leading performance on most games, even if they suffer on a few others. That's similar to what would happen with HBM. It doesn't really matter if you can find a corner-case where conventional DRAM DIMMs are faster. For the small subset of users who care about those corner cases, they can buy CPUs without those caveats or even go HEDT.

That's the point of OMI,
It doesn't scale in speed or power. The only way it's usable is like CXL.mem, except it's slower than CXL (25 GT/s instead of 32). So, if you used OMI, you'd still need HBM, at which point the downside of a little extra latency for the "bulk memory" is a non-issue. And, if the latency differential is negligible (i.e. because we're talking about migrating whole pages, here), then it makes sense to go with CXL for more speed and the flexibility of having a pool of lanes you can use for either memory, accelerators, or I/O.
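For a rough sense of the per-lane gap being described (encoding and protocol overhead are ignored here, so treat these as upper bounds rather than delivered bandwidth; the 64 GT/s CXL 3.0 rate is the commonly cited figure):

```python
# Rough per-lane comparison of the link rates mentioned above.
# Encoding/protocol overhead is ignored, so these are upper bounds only.
GT_PER_S = {"OMI": 25.0, "CXL 2.0 (PCIe 5.0)": 32.0, "CXL 3.0 (PCIe 6.0)": 64.0}

for link, gt in GT_PER_S.items():
    gb_per_s_per_lane = gt / 8  # each transfer carries 1 bit per lane
    print(f"{link:20s} ~{gb_per_s_per_lane:4.1f} GB/s per lane per direction")
```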
 
Steve Jobs wasn't a very conventional person, but he got a lot of <Mod Edit> done that was pretty outlandish at the time
Seems like you're falling victim to the Galileo Gambit, or at least getting close.

Going against consensus is brave and right if you have compelling evidence that the consensus is flawed, but otherwise there's nothing inherently noble or good about it. Going against the consensus of experts in a field where you have no experience or expertise is just silly.
 
Who said anything about being rich? I'll bet there are some open source simulators you can play around with, as well as FPGA prototyping boards that are within most budgets.

Find a good hardware development community and I'm sure they can point you in the right directions.
Basically, you want me to leave and find a different community.
That's what it comes down to.

Proper racing sims are a pretty good approximation of actual cars, so that counts as relevant experience. It would be somewhat like if you designed and prototyped a CPU core on a FPGA.
But I'm not trying to prototype a CPU core, now am I.

So, you're Steve Jobs, now? Remind me what part of Steve Jobs' life was spent writing about elaborate plans on bulletin boards or CompuServe.
Not everybody is going to live the same lives, who knows how he kept his ideas and in what form.

Performance at the expense of cost.

Please don't post these kinds of graphics without a clear source.

If you just use a little common sense, it's obviously not saying that 32 bits of storage consume so much energy. The units are wrong - it must refer to a read, write, or refresh.

Citing clear sources would clear up any confusion about not only what is being measured, but also the process nodes involved, since that's a dominant variable in the energy-efficiency of such operations. These details don't just matter - they're crucial. Getting them wrong can topple the entire tower of assumptions you build atop them.
That slide was about the energy costs of executing certain instructions and accessing certain layers of memory.
I grabbed it a long time ago, so I don't remember exactly which talk it was from, but the point stands.
Accessing certain layers of memory is very energy expensive. DRAM in particular.


Apple did it. That proves it's not the kind of necessity, for most users, that you seem to think it is.


Apple uses memory compression and then swap. Relative to that, I think having the expandability option of CXL.mem is a benefit.
Apple also caters to a very different clientele than what many in the Linux and PC community care about.
Fashion trumps many design decisions there, and ever-slimmer SmartPhones, LapTops, & Tablets matter more than practical Computer Hardware designs that are easily repairable and serviceable by basic techs.


Sure, and people will accept it. The reason I'm so sure is just by looking at what's happened to the HEDT market. There are options to get 2x and 4x memory configurations than what's available on the mainstream desktop, but the additional performance isn't worth enough to most users to justify the additional cost.
Or maybe people can't afford that much RAM until it becomes cheaper.
There are also down-sides to populating more memory in those slots: your maximum memory speeds go down.
That's one of the trade-offs: you currently have to make a sacrifice between Memory Capacity and Memory Speeds.
So all that fancy fast memory becomes underwhelmingly slow once you slap in more Memory DIMMs, due to the extra electrical load on the Memory Controller from supporting "More DIMMs & More Memory Capacity".


Look how popular the 5800X3D and 7800X3D are? They're popular because they offer leading performance on most games, even if they suffer on a few others. That's similar to what would happen with HBM. It doesn't really matter if you can find a corner-case where conventional DRAM DIMMs are faster. For the small subset of users who care about those corner cases, they can buy CPUs without those caveats or even go HEDT.
If it's about Memory Capacity cost, DIMMs beat HBM in cost per GiB handily.
And costs matter for the mass consumer market.


It doesn't scale in speed or power. The only way it's usable is like CXL.mem, except it's slower than CXL (25 GT/s instead of 32). So, if you used OMI, you'd still need HBM, at which point the downside of a little extra latency for the "bulk memory" is a non-issue. And, if the latency differential is negligible (i.e. because we're talking about migrating whole pages, here), then it makes sense to go with CXL for more speed and the flexibility of having a pool of lanes you can use for either memory, accelerators, or I/O.
Except HBM is too expensive. OMI is flexible to the customer's cost needs in that they can choose how much memory they're investing in each server.
Speed is scalable based on how much you want to clock the Serial side of the OMI interface.
If you wanted faster transfers, that's do-able.
If you want more capacity, you buy the DIMM you want with the capacity you want.
If you need to repair your memory, you can quickly swap DIMMs instead of having to take down a server for hours to de-solder the HBM or replace the entire GPU card or CPU.

HBM has the disadvantage of massive costs on top of being fixed to the CPU/GPU, forcing you to buy into an expensive memory solution when not everybody wants or needs that kind of memory.

CXL.mem also works for routing regular memory; that means it inherently works with main memory and OMI.
CXL.mem is not just a set of add-in cards; it's also a protocol to enable access to any set of Main Memory and share its resources across the network.

And lower latency is better than the higher latency of CXL.mem add-in cards slapped on top of the PCIe bus.
 
Seems like you're falling victim to the Galileo Gambit, or at least getting close.

Going against consensus is brave and right if you have compelling evidence that the consensus is flawed, but otherwise there's nothing inherently noble or good about it. Going against the consensus of experts in a field where you have no experience or expertise is just silly.
That's my burden to bear.
 
Basically, you want me to leave and find a different community.
That's what it comes down to.
No, I'll try to say this clearly, so there's no room for misunderstanding.

You make many valuable contributions to the threads on this site. When you do, I like or up-vote them (but, only if there's not some part of your post I also disagree with).

What I think is misplaced is using these news threads as a place for OMI advocacy. These forums really aren't for advocacy campaigns, of any kind.

In general, I think some of your other thoughts and ideas would receive better & more knowledgeable feedback, if you'd post them on sites, boards, subreddits, newsgroups, etc. that are more focused on hardware architecture & design. But, I think they're not unwelcome here, so long as they're somewhat on topic. Many of us like to indulge in a bit of speculation.

Lastly, my personal advice is: don't post anything on a forum that will bother you too terribly, if people disagree, dislike, or try to shoot it down. There are other ways to air your ideas, without having to endure withering feedback, if all you want is a platform. However, if you keep an open mind and are flexible in your thinking, you might find that some feedback helps improve your ideas and sharpen your arguments.

Furthermore, don't get your ego so tied up in these ideas that you take it personally, when people disagree with them. That's antithetical to good engineering, because nobody is perfect. We all have our share of bad ideas, and need to be able to accept criticism and move on.

But I'm not trying to prototype a CPU core, now am I.
There are ways you can dabble in hardware design, and maybe even reach the level of expertise you need to enter the field professionally. If you're flexible, you can find ways to explore hardware design without a large investment of time, money, or energy.

But, you can't just walk in off the street and expect to make system architecture decisions. Don't be the kid who's a hard-core sports fan and refuses to contemplate any career that's not being a star quarterback for his favorite football team. That never turns out well, especially if he doesn't ever practice or play on his high-school team. Heck, even for the kids who do put in all the work, the chances of success are minuscule.

Performance at the expense of cost.
Ah, but that's the thing. You can't ignore perf/$ and perf/W. There are yet other practical factors, as well.

I don't remember exactly which talk it was from, but the point stands.
Accessing certain layers of memory is very energy expensive. DRAM in particular.
Does it? How can you know, without even knowing which node(s) or operations it's referring to?

What transaction size does the 32-bit DRAM access even presume? Is it assuming a 64-byte read, just to access 4 of the bytes? Does it include the cost of the cache hierarchy? What generation of DDR and speed?

Given so many crucial questions about what and how it was measured, I'd say it's pretty worthless.

Apple also caters to a very different clientele than what many in the Linux and PC community care about.
I think we like to tell ourselves that, but the typical Mac user isn't so different from the typical PC user.

If it's about Memory Capacity cost, DIMMs beat HBM in cost per GiB handily.
It doesn't need to reach parity, if you don't need as much of it (see memory compression; external expansion). It doesn't even need to be HBM, as far as I'm concerned. LPDDR5X would work fine for most.

Most of these we've already hashed out multiple times. It's no use trying to win by attrition. It doesn't make your ideas any better.

Speed is scalable based on how much you want to clock the Serial side of the OMI interface.
It's not. PCIe hit a wall at 32 GHz. Much beyond that, you have to go PAM4, at which point why not just use CXL 3.0?
 
That's my burden to bear.
How can you tell whether you're Galileo or a nutjob, before you've accomplished anything? Accomplishments are the only real way to know with any certainty. So, maybe put your energy into doing instead of talking, and the world will tell you which you are.

If we're just going by statistical odds, then I'd point out there aren't many Galileos out there.
 
But if you're going with in-package memory, why does it have to be DRAM, why not SRAM?

SRAM is Full-Duplex versus the Half-Duplex nature of DRAM, which makes it better for parallel data access between multiple threads.
SRAM can be scaled up at cost for massive speed. SRAM is also MASSIVELY faster than DRAM, especially when it's "In-Package".
Because DRAM has 20X higher density (2GB per 56sqmm vs 64MB per 36sqmm) and the CPU already has L1, L2 and L3 caches that take care of things the overwhelming majority of the time where L1 alone typically has a 95+% hit rate. L2 and L3 caches exist in an attempt to eke out most of that last 5%, which means you need to increase cache size by well over 100X (vs L1) to improve overall cache hit rate by less than 5%. 100X the effort for 1/20th the potential gain; that is some rapidly diminishing return right there.
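A quick sketch of that diminishing-returns arithmetic; the cache sizes and cumulative hit rates below are illustrative assumptions in the same spirit as the figures above, not measurements from any particular CPU:

```python
# Diminishing-returns sketch for ever-larger SRAM caches. Sizes and
# cumulative hit rates are illustrative assumptions, not measured values.
tiers = [
    ("L1 (32 KB)",          32 * 1024,      0.95),   # ~95% hit rate
    ("+L2 (1 MB)",          1 * 1024**2,    0.985),
    ("+L3 (32 MB)",         32 * 1024**2,   0.995),
    ("+huge SRAM (512 MB)", 512 * 1024**2,  0.998),
]

base_size, base_hits = tiers[0][1], tiers[0][2]
for name, size, cumulative_hit_rate in tiers:
    growth = size / base_size
    gain = (cumulative_hit_rate - base_hits) * 100
    print(f"{name:22s} {growth:8.0f}x the L1 capacity -> +{gain:4.1f}% hit rate")
```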

Practically all applications would benefit far more from faster local access to a much larger chunk of the total address space than an expensive sub-GB SRAM no matter how fast it may be.

As for SRAM being dual-ported, DRAM can be multi-ported too and dual-ported 'video' DRAM was patented by IBM in 1985. However, there is little practical benefit to dual-porting DRAM when you have the ability to pipeline reads and writes between internal banks and simply pump twice as much bandwidth through a twice-as-wide-or-fast interface instead, which is why the original VRAM got forgotten once SDRAM came along. Read/write pipelining across 32 internal memory banks is also how DDR5 is now able to saturate the bus with a single rank of chips and a single DIMM per channel.

Except HBM is too expensive. OMI is flexible to the customer's cost needs in that they can choose how much memory they're investing in each server.
HBM is still fundamentally the same as $3/GB DDR5. As HBM works its way down through product tiers, manufacturing volume will pick up and it should get cheaper. It is just taking a while for a steady demand to establish itself and get the on-package memory ball rolling.

What is the cost of a 32GB OMI DDIMM? Likely in the neighborhood of "if you have to ask, you cannot afford it" because the volume is so low that no memory manufacturer could be bothered to launch their own.
 
No, I'll try to say this clearly, so there's no room for misunderstanding.

You make many valuable contributions to the threads on this site. When you do, I like or up-vote them (but, only if there's not some part of your post I also disagree with).

What I think is misplaced is using these news threads as a place for OMI advocacy. These forums really aren't for advocacy campaigns, of any kind.

In general, I think some of your other thoughts and ideas would receive better & more knowledgeable feedback, if you'd post them on sites, boards, subreddits, newsgroups, etc. that are more focused on hardware architecture & design. But, I think they're not unwelcome here, so long as they're somewhat on topic. Many of us like to indulge in a bit of speculation.

Lastly, my personal advice is: don't post anything on a forum that will bother you too terribly, if people disagree, dislike, or try to shoot it down. There are other ways to air your ideas, without having to endure withering feedback, if all you want is a platform. However, if you keep an open mind and are flexible in your thinking, you might find that some feedback helps improve your ideas and sharpen your arguments.
So you want me to advocate elsewhere, not here.
Fine, I should just stop coming to Tom's Hardware's forums.

Furthermore, don't get your ego so tied up in these ideas that you take it personally, when people disagree with them. That's antithetical to good engineering, because nobody is perfect. We all have our share of bad ideas, and need to be able to accept criticism and move on.
I don't think OMI is fundamentally bad; it's functionally compatible with what CXL is trying to do. In fact, it is a more flexible design for increasing Main Memory Performance by separating off the Memory Controller into a generic interface on the CPU side and letting you slap on any memory you want.

There are ways you can dabble in hardware design, and maybe even reach the level of expertise you need to enter the field professionally. If you're flexible, you can find ways to explore hardware design without a large investment of time, money, or energy.
Easier said than done for most.

But, you can't just walk in off the street and expect to make system architecture decisions. Don't be the kid who's a hard-core sports fan and refuses to contemplate any career that's not being a star quarterback for his favorite football team. That never turns out well, especially if he doesn't ever practice or play on his high-school team. Heck, even for the kids who do put in all the work, the chances of success are minuscule.
I'm not even trying to be a star quarterback or anything like that.
I'm just trying to make sure that the really good idea IBM's engineers came up with, even though they didn't go 100% in the direction they should've, doesn't go to waste.
It's a very solid idea IMO; they just went the route that some people before them decided to go, and it got shot down for the same reasons.
They were this close to hitting the mark on how to approach it, but they went too close to the sun and their wings melted, instead of going with the more practical route.

Ah, but that's the thing. You can't ignore perf/$ and perf/W. There are yet other practical factors, as well.
SRAM should be superior on Perf/Watt.
As for Perf/$, that's going to come down to economies of scale and the fact that the big DRAM consortiums are holding HBM prices high, while Silicon is more in the control of the Fabs and their contractees in how much they can charge based on scale of output.

DRAM, you're at the mercy of the Microns, SK Hynix, Samsungs of the world

SRAM, you're at the mercy of IFS & TSMC.

Pick your poison.

Does it? How can you know, without even knowing which node(s) or operations it's referring to?

What transaction size does the 32-bit DRAM access even presume? Is it assuming a 64-byte read, just to access 4 of the bytes? Does it include the cost of the cache hierarchy? What generation of DDR and speed?

Given so many crucial questions about what and how it was measured, I'd say it's pretty worthless.
Ok, where are you going to quantify all the basic operations energy costs?
I already found a source a while ago and saved it.

I think we like to tell ourselves that, but the typical Mac user isn't so different from the typical PC user.
Depends on what audience you think your customers fit in.

It doesn't need to reach parity, if you don't need as much of it (see memory compression; external expansion). It doesn't even need to be HBM, as far as I'm concerned. LPDDR5X would work fine for most.
So instead of going with HBM, we're going with LPDDR

Most of these we've already hashed out multiple times. It's no use trying to win by attrition. It doesn't make your ideas any better.
Ok, then we're done talking.

It's not. PCIe hit a wall at 32 GHz. Much beyond that, you have to go PAM4, at which point why not just use CXL 3.0?
You mean 32 GT/s per lane on PCIe 5.0

Because CXL 3.0 will run on top of PCIe's PHY, including PCIe 6.0/7.0, since most of CXL is an alternate protocol using the same serial hardware and interface.

And CXL 3.0 will be operating at the same speeds as PCIe 6.0/7.0 since they'll be running at 64 GT/s & 128 GT/s per lane on their respective PCIe versions.

That means it'll be using PAM4 as well.
 
Because DRAM has 20X higher density (2GB per 56sqmm vs 64MB per 36sqmm) and the CPU already has L1, L2 and L3 caches that take care of things the overwhelming majority of the time where L1 alone typically has a 95+% hit rate. L2 and L3 caches exist in an attempt to eke out most of that last 5%, which means you need to increase cache size by well over 100X (vs L1) to improve overall cache hit rate by less than 5%. 100X the effort for 1/20th the potential gain; that is some rapidly diminishing return right there.

Practically all applications would benefit far more from faster local access to a much larger chunk of the total address space than an expensive sub-GB SRAM no matter how fast it may be.

As for SRAM being dual-ported, DRAM can be multi-ported too and dual-ported 'video' DRAM was patented by IBM in 1985. However, there is little practical benefit to dual-porting DRAM when you have the ability to pipeline reads and writes between internal banks and simply pump twice as much bandwidth through a twice-as-wide-or-fast interface instead, which is why the original VRAM got forgotten once SDRAM came along. Read/write pipelining across 32 internal memory banks is also how DDR5 is now able to saturate the bus with a single rank of chips and a single DIMM per channel.
What about L4$ and a massive SRAM stack for it?
That's also an option.


HBM is still fundamentally the same as $3/GB DDR5. As HBM works its way down through product tiers, manufacturing volume will pick up and it should get cheaper. It is just taking a while for a steady demand to establish itself and get the on-package memory ball rolling.

What is the cost of a 32GB OMI DDIMM? Likely in the neighborhood of "if you have to ask, you cannot afford it" because the volume is so low that no memory manufacturer could be bothered to launch their own.
Who said anything about using the OMI DDIMM model?
I told you, I'm against using that model; that's the same BS that Intel did back in the day with FB-DIMMs, and that initiative failed as well.
Any attempt to change the Parallel DIMM model with an on-DIMM controller is doomed to fail.
That was never the original goal of the Memory Module.

The original historical goal of the Memory Module is to leave the complex Memory Controller on the outside; the Memory Modules should be dumb & cheap.

[attached image: BuqLNRC.png]
This was the model they didn't bother using; they were hardcore proponents of DDIMMs with a Serial Controller on the DIMM and a new format.

This other proposal that they didn't want to push was IMO the correct route they should've gone.

Move the Memory Controller into its own package and solder it onto the MoBo in a convenient location to minimize parallel trace lengths.
Use "Bog Standard Industry DIMMs" and DIMM slots.
One DIMM per Memory Channel ONLY; it'd be cheap enough to even make "Quad Memory Channel" the default for consumer ATX boards w/ 4 DIMM slots, and physically locate the two DIMM slots in different locations for ease of routing.

If you're worried about MoBo vendors trying to put extra DIMM slots on each Memory Channel, you can firmware-lock that functionality away and tell the Vendors to live with 1x DIMM per Memory Channel from now on.

And if you want more DIMM slots, you just tell the CPU vendor how many you want for the platform, and they make adjustments for it or design enough slots to handle it.

Each Memory Controller Chip is a Dual Channel Memory Controller with a 75-pin Serial Interface back to the SoC on a simple connection.
Whatever number of pins goes to the DIMM doesn't really matter; it'll get taken care of.

And given the modern "Chiplet Mantra" where the Memory Controller is being moved out of the Monolithic CPU to a more Chiplet based design, this literally fits right in.
 
So you want me to advocate elsewhere, not here.
Fine, I should just stop coming to Tom's Hardware's forums.
You asked and I gave you my opinion. Luckily, it's not my responsibility to decide who & what's allowed. That's a job for the mods.

I'm not even trying to be a star quarterback or anything like that.
You are, in a way. You want to make the big calls in system architecture. Those are crucial strategic decisions, with a lot at stake, which is why the privilege of having a seat at the table is earned. Meritocracy and all that.

I'm just trying to make sure that the really good idea IBM's engineers came up with, even though they didn't go 100% in the direction they should've, doesn't go to waste.
There are a lot of engineers with a lot of ideas. A lot of ideas aren't as good as they seem. They either have some non-obvious fatal flaw, or they have a number of drawbacks that cumulatively overwhelm their benefits.

Sometimes, good ideas fail for other reasons, but it's pretty hard to keep a truly good idea down. They just have a way of coming up again and again, in one form or another.

Don't worry about the engineers. Most are mature enough to shake off some failures and setbacks. I'm guessing most of OpenCAPI's backers are now fully onboard the CXL bandwagon.

It's a very solid idea IMO,
And who are you to judge?

but they went too close to the sun and their wings melted.
What is the sun, in this case?

SRAM should be superior on Perf/Watt.
Citation please.

As for Perf/$, that's going to come down to economies of scale
Silicon gets cheaper when the cost per transistor goes down through density increasing faster than the cost of newer fab nodes. We know that's over, for SRAM. So, no. It's not going to get substantially cheaper per bit.

DRAM, you're at the mercy of the Microns, SK Hynix, Samsungs of the world
Unlike SRAM, DRAM is still scaling.

Ok, where are you going to quantify all the basic operations energy costs?
I already found a source a while ago and saved it.
Not my problem. You're the one making the claims. You need to cite your source and we really need answers to the various circumstantial questions I raised.

Depends on what audience you think your customers fit in.
Well, are you not talking about mainstream computing? If not, you should specify which markets you're talking about.

So instead of going with HBM, we're going with LPDDR
So far, that's what we're seeing in the consumer space. I can believe what @InvalidError has said about HBM's long-term cost competitiveness, but I don't know. Apple has shown how much you can do with just LPDDR5X.

You mean 32 GT/s per lane on PCIe 5.0
CXL 1.0 and 2.0 are both 32 GT/s. CXL 3.0 is 64 GT/s.

That means it'll be using PAM4 as well.
Yes, CXL 3.0 uses PAM4. That was my point.
 