I know it's been said before, but I think the sun might finally be setting on the era of "simple programming model; complex hardware". The main reason is energy-efficiency. If hardware exposes more details to the software and vice versa, the end result can be much more efficient.
You might as well tell other programmers to delete their simple programming languages and have everybody come back to C/C++/Rust.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/box-plot-summary-charts.html
Those are the fastest and most efficient languages to program in.
There are many junior programmers who don't like Memory Management and will whine/complain.
I'm all for C/C++/Rust, but not everybody is into it.
We didn't need that so much when Moore's Law was in full effect, but as node improvements taper off, there's going to be renewed enthusiasm for finding performance anywhere you can.
Everybody wants more performance, but some bean counters at the top don't care.
Tell that to MS Teams and how their IM client sucks balls and is slow, bloated, and needs to be completely re-written from the ground up.
But upper management doesn't care and won't refactor everything to be properly fast, written in a faster language.
BTW, Vulkan/DX12 is probably the first example of a major increase in programming complexity that went over pretty well with the software community, because they could appreciate the benefits. Don't underestimate just how many low-level details APIs like those already expose about the hardware.
That's also because the target audience, Video Game Devs, cares about performance and asked for those features to get "Console Like" exposure to the hardware.
They got what they requested and games perform better now for the most part, with some caveats.
They'll do it, if/when they think somebody else is about to do it too. There are leaks indicating Intel is going to put a tGPU that's over 3x the size of their biggest laptop iGPU to date in Arrow Lake. It's so big that I think it's going to require at least a chunk of in-package memory to keep it fed. Expect AMD to counter.
We'll see how well it pans out. Right now, AMD's iGPUs are doing well enough for their target audience.
If you want to game, get a dGPU.
And laptops are exactly where it makes the most sense to have things like bigger iGPUs and in-package memory. That's where Apple did it first, and that's where I expect Intel and AMD to do the same. I'm not expecting to see them do anything on par with a high-end laptop dGPU, but just something that extends into the low/mid-range dGPU.
Depends on the target market. If they see normies asking for faster iGPUs, AMD might listen and just refresh the iGPU with newer iterations of RDNA to get their performance boosts.
But until then, dGPUs will solve the problem for gaming.
Let's not forget how Intel & AMD already did that Kaby Lake-G Frankenstein that paired a Polaris GPU + HBM on the same "package" as a Kabylake CPU. So, they've already dipped a toe in this water. I think that GPU was about 32 CU, or half of a desktop Vega 64.
That didn't last long, and it got abandoned like a Red-Headed step child.
AMD is barely competitive with Intel's 96 EU Xe iGPU, so expect to see them counter Arrow Lake with something more like 32 CUs.
I'll believe it when I see it. I don't think AMD feels that threatened in the iGPU market. So far, their solution has been adding a dGPU for gaming, or including a low-end dGPU if they really care about that slightly faster performance.
You were comparing die costs and omitting nearly half of RX 7900's die area. That's a glaring omission.
Including the cost of the RX7900 naturally includes those die costs and everything else that goes with it.
It ain't over, till it's over. And I ain't giving up on OMI or the cloned equivalents.
Learn the rules, before you break them. Running against the trend can make sense if you know exactly what you're doing. Otherwise, it tends to go pretty badly.
I'm more than happy to take the risks and ride & die with it.
Perhaps in a metaphorical sense, but not in the sense of a network protocol stack that people usually have in mind when they say "x runs atop y".
Ok, now that we're clear on that aspect, can we move on?
Intel is bringing back HBM in their new Xeon Max, but I guess you mean the OmniPath aspect? Intel tried to dominate datacenter networking with OmniPath and the market rejected it. That sent them back to the drawing board and they've since bought Barefoot Networks.
I'm not surprised that HBM is being directly attached to the CPU package; it works well for certain data sets and problems.
That also highlights a difference, which is that Intel was integrating OmniPath primarily for inter-node communication. AMD is talking about it at the system level, which means even things like PCIe/CXL.
Intel has been looking at & talking about silicon photonics for a long time. So, definitely expect to hear & see more from them on that front.
Everybody is going to use Silicon Photonics; it's going to co-exist alongside copper for the PHY, depending on the use case and the distance that needs to be covered.
Not from anything I've seen. It doesn't make sense for that. Once you get beyond rack-scale, just use Ethernet.
https://www.fabricatedknowledge.com/p/cxl-protocol-for-heterogenous-datacenters
What about Ethernet?
Something unspoken here is what the hell is going to happen to ethernet? Ethernet today is the workhorse of datacenters because of the power and simplicity of fiber optics. I’m still quite a bull for top-of-rack ethernet (hello Inphi/Marvell), but I believe that CXL adoption will encourage PCIe connection within the rack.
I think short-term PCIe adoption will cannibalize 50G ethernet ports, but longer-term, I think that PCIe the fabric is toast. The dream is silicon photonics and CXL combined, and PCIe the fabric is well positioned given its close relation to CXL. I think that PCIe will eventually be superseded by a photonics-based fabric technology, but the protocol will be alive and well. The problem for investors or anyone watching is this is likely on a 10-year time frame.
So in the short term, this could be bad for commodity-like ethernet if it pushes adoption of rack-level interconnect using PCIe, but in the very long term, I think it's PCIe that's genuinely at risk. Of course, PCIe could start to adopt more SiPho-like aspects and live, but the jury is still out.
I think PCIe will continue to exist, whether it's copper-based or silicon photonics-based.
The concept of using the PCIe protocol, or any alt-mode protocols like CXL, will still be there.
I really think you're confusing it with NVMe, which does have optional support for networking - I think so that it can replace iSCSI.
No, I'm not, CXL will spread wide and far.
https://community.cadence.com/cadence_blogs_8/b/fv/posts/cxl-3-0-scales-the-future-data-center
CXL 3.0 features facilitate the move to distributed, composable architectures and higher performance levels for AI/ML and other compute-intensive or memory-intensive workloads. The CXL 3.0 protocol can support up to 4,096 nodes, going beyond the rack. Composable server architectures are when servers are broken apart into their various components and placed in groups where these resources can be dynamically assigned to workloads on the fly. CXL technology continues to enable game-changing innovations for the modern data center at scale.
The entire Data Center is going to be in on it, Network & All.
That's just a first step, eventually, they'll think bigger beyond just one DataCenter.
Please cite some references or drop the point, because I think you're spreading misinformation - even if unintentionally.
https://www.opencompute.org/blog/20...ays-cxls-implications-for-server-architecture
Meta announced its intentions to incorporate CXL into future server designs, especially for memory-intensive AI applications running in accelerated computing platforms. The company is planning to boost its data center investments by more than 60% this year, with a heavy emphasis on its accelerated computing infrastructure in order to increase engagement on its social media platforms and to lay the foundation for the metaverse. CXL would enable more advanced memory systems that could share memory across various hosts within the network, effectively improving memory utilization, as well as enabling asynchronous sharing of data and results over multiple hosts. The company also proposed the tiering of memory based on applications. For instance, applications such as caching that demand the lowest latency can use native memory (residing next to the CPU) for "hot" memory pages. In contrast, less latency-sensitive applications, such as data warehousing, can use CXL memory (residing in PCIe expander cards) for "cold" memory pages, as native memory tends to have 2X better latency than CXL memory. This hierarchy of memory allocation, which can utilize total system memory more effectively, would be beneficial for any accelerated computing platform.
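To picture that hot/cold split in practice: on Linux, a CXL memory expander typically shows up as its own CPU-less NUMA node, so software can steer allocations node by node. A minimal sketch, assuming node 0 is the local DRAM and node 2 is the CXL pool (both node numbers and the buffer sizes are placeholders for illustration):
[CODE]
// Hedged sketch: place "hot" data in local DRAM and "cold" data on a
// CXL-backed NUMA node. The node IDs below are assumptions; check
// `numactl --hardware` on the actual system.
// Build with: gcc tier_alloc.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOCAL_DRAM_NODE 0   // assumed: DRAM attached to the CPU socket
#define CXL_MEM_NODE    2   // assumed: CXL expander exposed as a NUMA node

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    size_t hot_sz  = 64UL << 20;   // 64 MiB of latency-sensitive data
    size_t cold_sz = 512UL << 20;  // 512 MiB of rarely touched data

    // Hot pages: keep them in native DRAM next to the CPU.
    char *hot = numa_alloc_onnode(hot_sz, LOCAL_DRAM_NODE);
    // Cold pages: push them out to the (slower) CXL-attached pool.
    char *cold = numa_alloc_onnode(cold_sz, CXL_MEM_NODE);
    if (!hot || !cold) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(hot, 0, hot_sz);    // touch the pages so they actually get placed
    memset(cold, 0, cold_sz);

    printf("hot buffer on node %d, cold buffer on node %d\n",
           LOCAL_DRAM_NODE, CXL_MEM_NODE);

    numa_free(hot, hot_sz);
    numa_free(cold, cold_sz);
    return 0;
}
[/CODE]
The kernel's memory-tiering support can also demote cold pages to a slower node on its own, but the idea is the same either way.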
It's a fairly old standard, by now. I think they did it because energy efficiency wasn't an issue at the kinds of speeds they were dealing with, back then, nor was it probably among their primary concerns. These days, you can't afford not to care about it.
I don't think that's the case. The IBM engineers' main focus was the scaling of DIMMs and getting more bandwidth to the CPU, by allowing either more or fewer DIMM slots and having a flexible OMI port to be used by vendors.
Look at the TDP of IBM mainframe CPUs and tell me energy-efficiency is a priority for them.
IBM is usually one step behind the latest process node.
Their latest CPU is IBM Power10.
https://newsroom.ibm.com/2020-08-17-IBM-Reveals-Next-Generation-IBM-POWER10-Processor
They're on Samsung 7nm, not exactly the most cutting edge of process nodes.
IBM POWER10 7nm Form Factor Delivers Energy Efficiency and Capacity Gains
IBM POWER10 is IBM's first commercialized processor built using 7nm process technology. IBM Research has been partnering with Samsung Electronics Co., Ltd. on research and development for more than a decade, including demonstration of the semiconductor industry's first 7nm test chips through IBM's Research Alliance.
With this updated technology and a focus on designing for performance and efficiency, IBM POWER10 is expected to deliver up to a 3x gain in processor energy efficiency per socket, increasing workload capacity in the same power envelope as IBM POWER9. This anticipated improvement in capacity is designed to allow IBM POWER10-based systems to support up to 3x increases in users, workloads and OpenShift container density for hybrid cloud workloads as compared to IBM POWER9-based systems.
This can affect multiple datacenter attributes to drive greater efficiency and reduce costs, such as space and energy use, while also allowing hybrid cloud users to achieve more work in a smaller footprint.
IBM has recently launched the first Power10 server. A new generation of servers, which due to changes in structure, components and functionality, provide significant improvements in terms of performance, computing power and energy consumption.
www.ibm.com
Because of its energy-efficient operation, Power10 can provide many organizations with substantial cost savings and a significantly lower footprint. In a study, IBM compared the footprint of Oracle Database Servers on Power10 with Power 9 and Intel servers. Two Power10 systems can handle the same amount of Oracle workloads as 126 Intel or 3 Power 9 servers. To do this, a Power10 server uses 20 kW of power compared to 30 kW for a Power 9 and 102 kW for the Intel servers. Translated into licenses, a Power10 requires 628 fewer licenses than Intel servers for the studied Oracle workloads.
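Just to run the quoted numbers (my reading of that study is that the kW figures are the totals for each setup handling the same Oracle workload, so treat this as a rough sketch):
[CODE]
// Rough ratios from the IBM study quoted above. Treating the kW figures as
// per-setup totals is my assumption, not something the quote spells out.
#include <stdio.h>

int main(void) {
    double power10_kw = 20.0;   // 2x Power10 setup, as quoted
    double power9_kw  = 30.0;   // 3x Power9 setup, as quoted
    double intel_kw   = 102.0;  // 126x Intel setup, as quoted

    printf("Intel vs Power10 power:  %.1fx\n", intel_kw / power10_kw);        // ~5.1x
    printf("Power9 vs Power10 power: %.1fx\n", power9_kw / power10_kw);       // 1.5x
    printf("Server consolidation (Intel -> Power10): %.0f:1\n", 126.0 / 2.0); // 63:1
    return 0;
}
[/CODE]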
IBM's Power10 should be competitive with Intel servers, so I wouldn't worry about it.
Everybody measures against Intel.
It's kind of weird how you cast things in binary terms. System design doesn't work like that. It's an optimization problem, where you're trying to hit competitive performance targets while balancing that against TCO. Because all the performance in the world is of no help if your TCO is uncompetitive. And more pins definitely hurt TCO in both up-front costs and probably also reliability. So, it's definitely not like they don't care about it. They just can't afford to blow their power budget to shave off a couple thousand pins.
Or keep the same # of pins and give you more DIMM slots or more PCIe slots or other forms of HSIO.
It's about making more effective use of what you got.
Because OMI will only become more uncompetitive as DRAM speeds increase.
Huh? How so? OMI is just an interconnect for DRAM; it scales as it needs to with DRAM bandwidth.
Don't take it personally. It's bad engineering to get your ego so wrapped up in something that you can't even see when it's not the best solution.
Best Solution for what problem? What are you trying to solve vs what am I trying to solve?
You're proposing that CXL will magically be a "Cure-All" for everything.
In this tiered memory world, CXL attached pooled memory will be a slower tier of memory compared to DIMMs or even Direct-Attached Memory.
So every tier has its place.
OMI is a functional subset of CXL. That's why it's a loser. CXL enables switching and pooling, and the article you linked says this is key for scalability.
OMI is a member of the CXL consortium, but what it solves is vastly different from what CXL is solving.
Both can complement each other just fine. I don't get why you're so narrow minded about my solution.
Like I said before, the future of the datacenter is: in-package memory for bandwidth + CXL.mem for capacity. It has to be this way. You can't get the bandwidth or the energy efficiency with off-package memory, and I already addressed the points about capacity and scalability. Everybody is doing it. Even Nvidia's Grace CPU, in case you missed my earlier post in this thread.
You WANT it to be that way; it doesn't HAVE to be that way.
The future of the Data Center is tiered memory.
A software person's perspective on new upcoming interconnect technologies. Existing Server Landscape: Servers are expensive. And difficult to maintain properly. That's why most people turn to the public cloud for their hosting and computing needs. Dynamic virtual server instances have been key to...
pmem.io
But there’s always a but. In the case of cloud storage, it is latency. Unsurprisingly, that’s also the case for CXL.mem. Memory connected through this interconnect will not be as quick to access as ordinary DIMMs due to inherent protocol costs. Since this is all upcoming technology, no one has yet published an official benchmark that would allow us to quantify the difference. However, it’s expected that the difference will be similar to that of a local vs remote NUMA node access [6]. At least for CXL attached DRAM. That’s still plenty fast. But is it fast enough for applications not to notice if suddenly some of its memory accesses take twice (or more) as long? Only time will tell, but I’m not so sure. For the most part, software that isn’t NUMA-aware doesn’t really scale all that well across sockets.
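The local-vs-remote comparison is easy enough to get a feel for with a pointer-chase microbenchmark: pin execution to one node's cores and move only the allocation around. A rough sketch (pass the NUMA node to test on the command line; on a system with a CXL expander, that pool is just another node number):
[CODE]
// Rough latency probe: chase a randomized pointer chain allocated on a
// chosen NUMA node while running on node 0's cores. Compare ns/access for
// the local node vs a remote (or CXL-backed) node. This is a sketch, not a
// polished benchmark.
// Build with: gcc numa_latency.c -O2 -lnuma
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHAIN_LEN (1UL << 24)   // 16M entries (~128 MiB), well past the LLC
#define STEPS     (1UL << 24)   // dependent loads to time

int main(int argc, char **argv) {
    if (numa_available() < 0) { fprintf(stderr, "libnuma not available\n"); return 1; }

    int node = (argc > 1) ? atoi(argv[1]) : 0;
    numa_run_on_node(0);                 // keep execution pinned to node 0's CPUs

    size_t bytes = CHAIN_LEN * sizeof(size_t);
    size_t *chain = numa_alloc_onnode(bytes, node);
    if (!chain) { fprintf(stderr, "alloc on node %d failed\n", node); return 1; }

    // Sattolo's algorithm: build a single random cycle so every load depends
    // on the previous one and the prefetcher can't help.
    for (size_t i = 0; i < CHAIN_LEN; i++) chain[i] = i;
    for (size_t i = CHAIN_LEN - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (size_t s = 0; s < STEPS; s++) idx = chain[idx];   // pointer chase
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("node %d: %.1f ns per access (checksum %zu)\n", node, ns / STEPS, idx);

    numa_free(chain, bytes);
    return 0;
}
[/CODE]
Run it once against node 0 and once against the far node and compare the ns/access figures.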
Every memory tier will have its place.
Be it on-package Memory.
DIMM-based Memory.
CXL-attached pooled Memory.
Just like we have L0-L3$, we're just adding more layers of Memory/Cache.
Welcome L4-L7; they'll all have their place based on speed and latency.
You clearly did not read this in its entirety, because it doesn't say a single thing about using it beyond rack-scale. All they talked about was that a server could use it to access memory in another chassis, but no implication was made that memory pools would be shared by large numbers of machines or that they wouldn't be physically adjacent to the machine using them.
Bye-bye bottlenecks. Hello composable infrastructure?
www.theregister.com
Finally, the spec hints at a CXL fabric with the introduction of multi-level switching.
A CXL network fabric will be key to extending the technology beyond the rack level. And there’s reason to believe this could appear in version 3.0 after Gen-Z — not to be confused with the generation of adults born after the turn of the century — donated its coherent-memory fabric assets to the CXL Consortium late last year.
Because CXL.mem is much more flexible. As I've said many times now, it doesn't predetermine your allocation of memory vs. I/O. And, by using switching & pooling, you can scale up capacity almost independent of how many links you opt to use. So, DDR5 DIMM slots will slowly give way to CXL memory slots.
CXL.mem is a tool; it doesn't replace the DIMM. The DIMM will have its place as the faster tier of memory.
DDR5 DIMM slots will co-exist with CXL memory slots, as the faster level of memory for instances attached to certain CPUs.
Am I the one being small-minded, here? I keep repeating answers to your questions and it's like you just ignore them and we go around all over again. So, let's review:
- Widespread industry support.
- Flexible allocation between memory channels and I/O - users get to decide how many lanes they want to use for which purpose, including even legacy PCIe.
- One type of controller & PHY, for both memory and PCIe.
- Switching enables scaling to larger capacities than a single CPU has lanes.
- Pooling enables multiple devices to share a memory pool, without a single CPU becoming a bottleneck.
- CXL now supported for chiplet communication, via UCIe.
It's interesting that AMD was also an OpenCAPI member, but never adopted it in any of their products, instead preferring to develop Infinity Link. Maybe that says OpenCAPI itself has deficiencies you don't know about.
Or maybe AMD had different priorities at that moment in time.
You also treat CXL.mem as a magical Silver Bullet; it isn't.
Yes, it's flexible and it can be allocated to any Host on the CXL network.
Doesn't mean there won't be faster memory tiers where certain devices will have priority.
And Faster Memory will always be in demand.
Bye-bye bottlenecks. Hello composable infrastructure?
www.theregister.com
Intel and others have tried and failed in the past to develop a standardized interconnect for accelerators, he tells us. Part of the problem is the complexity associated with these interconnects is shared between the components, making it incredibly difficult to extend them to third parties.
“When we at Intel tried to do this, it was so complex that almost nobody, essentially nobody, was ever able to really get it working,” Pappas reveals. With CXL, essentially all of the complexity is contained within the host CPU, he argues.
This asymmetric complexity isn’t without trade-offs, but Pappas reckons they're more than worth it. These come in the form of application affinity, specifically which accelerator gets priority access to the cache or memory and which has to play second fiddle.
This is mitigated somewhat, Pappas claims, by the fact that customers will generally know which regions of memory the accelerator is going to access versus those accessed by the host. Users will be able to accommodate by setting a bias in the BIOS.
I care more about learning than being right. If you have evidence that I'm wrong about something, you're free to share it.
Go read up more on CXL.mem; it isn't the magic bullet you make it out to be.
Is it powerful? Sure.
But I don't think it's going to replace DIMM slots anytime soon. It's going to co-exist with them, using DIMM slots as well as its own pools of memory allocated from elsewhere.
Diversity of approaches isn't as important as getting it right. If there are two equally viable approaches, I'm all for having a diversity of options. If one is clearly inferior to the other, then nobody really benefits by keeping the worse approach on some kind of life support.
But my solution isn't "Inferior" or "On Life Support".
We're not even solving the same thing.
LOL, if the DIMM makers want to hold back in-package memory, let 'em try.
🙄
DIMM makers can't hold back in-package memory.
But DIMM makers will be damned if you try to take away their market.
That's why DIMM slots are critical, and the more, the merrier.
Try telling an architect to mix in some bad design methods or building materials with the good ones they use, and just see how that goes over.
You don't get to be the one to determine if OMI is a bad design or material.
You're not de-risking anything.
How so?
Ecosystems thrive on common standards. That's the Network Effect. When everyone is using PCIe or CXL, then we have interoperability between the widest range of devices and systems. We also get a wide range of suppliers of all the IP and components, which gives device & system designers the widest range of options.
Guess what everybody has: DIMM slots on MoBos.
What does everybody want more of? Memory / RAM.
Everybody wants FASTER access to RAM / Memory!
What's the easy way to get FASTER RAM/Memory? Add in more DIMM slots attached to the CPU.
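And the bandwidth math behind that is just transfer rate x bus width x channel count. A back-of-the-envelope sketch, assuming DDR5-4800 and a 64-bit data path per channel (the channel counts are example inputs, not any specific platform):
[CODE]
// Peak DRAM bandwidth, back of the envelope:
//   transfer rate (MT/s) x bytes per transfer x number of channels.
// DDR5-4800 and the channel counts below are example inputs only.
#include <stdio.h>

int main(void) {
    double transfers_per_s    = 4800e6;  // DDR5-4800
    double bytes_per_transfer = 8.0;     // 64-bit data path per channel

    double per_channel_GBps = transfers_per_s * bytes_per_transfer / 1e9;  // 38.4 GB/s

    int channel_counts[] = {2, 8, 12};   // e.g. desktop, mainstream server, big server
    for (int i = 0; i < 3; i++) {
        printf("%2d channels of DDR5-4800: %6.1f GB/s peak\n",
               channel_counts[i], per_channel_GBps * channel_counts[i]);
    }
    return 0;
}
[/CODE]
Strictly speaking it's extra channels that buy the bandwidth (a second DIMM hanging off the same channel only adds capacity), which is exactly why hooking each slot up over its own serial OMI link is appealing.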
We've seen this play out with countless standards battles. Once a standard gains dominance and a good bit of momentum, there's a tipping point and the industry shifts to it.
Ok, glad OMI jumped ship to CXL and became part of the team.
You were literally trying to gin up some kind of conspiracy where DIMM vendors would kill the move towards in-package memory. You don't seriously believe that will happen, do you? I don't even see how they could.
They can't stop on-package memory.
They can stop their market of DIMMs from disappearing.
There's a difference.
I just call it like I see it. If you can't handle the idea that your idea might not be the best, or at least that some of us sure don't see it that way, then maybe don't put them out there for people to judge. Forum posts go both ways. You can post whatever you want, but you can't control what other people think of what you post or how they respond. Mainly, I think it's just such a strange thing to have an ego about. It's not even like you're a financial stakeholder in it.
I don't have to be a financial stakeholder in it, I support good ideas, concepts, or companies.
If I like it, I'll support it.
Just like Optane was great tech, I'll still support it even though it's not getting much love right now.
I enjoy reading the tech news, analyzing the trends, and trying to see where the industry will go. Sometimes I turn out to be wrong, or at least timing-wise, but it's a lot easier to take if you don't get too personally invested in a particular solution.
Oh well, I'm a person who is attached to particular solutions or ideas.
That's just how I am.
For instance, I thought the transition to ARM would happen way faster than it is. I also thought we'd have in-package memory sooner than we will. I thought AMD would be doing better in the AI and GPU-compute space. I didn't predict that Microsoft would come to embrace Linux. I was wrong about all these things, but it doesn't really bother me.
I never thought ARM would take over the world.
x86 is too entrenched.
ARM will have its place, but it'll be a small place, right by x86's side.
While RISC-V will be nipping at ARM's market share.
AMD has to battle Nvidia; that's an uphill battle.
I didn't see MS embracing Linux either.
So, there's one throw-away remark, made by some author of unknown depth:
"entire data centers might be made to behave like a sinlge system."
That's the long-term goal.
That's next to worthless. And yes, they misspelled "single", LOL. There's still nothing to suggest that it will be deployed beyond rack-scale, and that's okay. We don't need a single standard to scale from chiplets to the internet. I know people have done that with Ethernet, but I think those days will soon be behind us.
The whole point of CXL is to eventually get to DataCenter scale.
That's the point.
Am I the bully? Or maybe you're bullying me, just because I don't agree with you. We keep going around and around, as if you just can't let it go until I agree with you. That's not how the internet works. You simply can't force anyone to agree with you, or else internet arguments would never end.
Ok, if you don't feel like debating anymore, we can end this right now.
You're not convincing me, I'm not convincing you.
I'll be supporting OMI, you go do your thing.
One might appreciate that someone is even taking the time to listen to your ideas and respond to your points. We could just all ignore you. It's not like the world will be any different if you keep clinging to OMI, but I thought you might appreciate understanding why it's being left in the dustbin of history. I didn't realize it was quite so dear to you or that you'd take it so personally.
I don't think it's in the dustbin of history, it just hasn't been utilized yet.
You can't beat CXL, though. That's the problem. OMI loses to DDR5 on energy-efficiency, it loses to in-package memory on bandwidth, and it loses to CXL on scalability and flexibility. That leaves it with no legs left to stand on.
It's not about beating CXL.
That's what I'm trying to tell you.
OMI is a way of connecting more DIMMs, something that is otherwise limited by the # of contacts needed on the CPU package and its excessive growth.
The implementers will make the serialized connection back to the CPU as efficient as they can.
And OMI will use DDR5 or the latest DDR# DIMMs.
Or whatever RAM takes over from DDR#.
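For a sense of the serialization math: an OMI channel is just a small bundle of SerDes lanes, so per-channel bandwidth is lanes x line rate. A sketch with assumed numbers (x8 lanes at 32 GT/s and 16 channels per socket, roughly a POWER10-class configuration as I understand it, so treat them as illustrative):
[CODE]
// SerDes math for an OMI-style serial memory attach. Lane count, line rate,
// and channel count below are my assumptions for illustration; adjust to
// whatever the actual implementation uses.
#include <stdio.h>

int main(void) {
    double lanes_per_channel = 8.0;    // assumed: x8 OMI channel
    double gbit_per_lane     = 32.0;   // assumed: 32 GT/s signaling
    int    channels          = 16;     // assumed: 16 OMI channels per socket

    double per_channel_GBps = lanes_per_channel * gbit_per_lane / 8.0;  // per direction
    double per_socket_GBps  = per_channel_GBps * channels;

    printf("per channel: %.0f GB/s each way\n", per_channel_GBps);     // 32 GB/s
    printf("per socket:  %.0f GB/s each way (%.0f GB/s bidirectional)\n",
           per_socket_GBps, per_socket_GBps * 2);                      // 512 / 1024
    return 0;
}
[/CODE]
Each of those channels only needs a few dozen signal pins at the package, versus the much wider parallel interface a conventional DDR channel needs, which is the whole pin-count argument.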
CXL is about better utilizing the resources you have in the DataCenter.
If ___ memory in RAM isn't being utilized, it can get allocated to another VM.
So if _ memory area in this __ DIMM on this _ CPU host isn't being used, it can get allocated.
Same with any PCIe attached memory. It's just slower memory that is being utilized.
Nothing about OMI & CXL makes them enemies.
I don't get why you seem to see OMI as the enemy of CXL, when they are literally complementary technologies that help each other.
Both can easily co-exist and be used together.