News Jim Keller Shares Zen 5 Performance Projections

InvalidError · Apr 9, 2023

bit_user said:
That's really just an argument about yield. We have an example of Nvidia making viable 5 nm dies at 814 mm^2, meanwhile the Zen 4 CCD is only 66 mm^2. This tells me they opted for 8 cores for reasons other than yield.

CPU workloads usually have much tighter data locality than GPU workloads, which makes dividing large CPUs into smaller chiplets/tiles for yields and scalability using one single design more workable. For GPU stuff, seamlessly passing SM-to-SM traffic across die-to-die interfaces in a manner transparent to programmers is likely a major hurdle, same reason AMD's high-end consumer plans switched from having dual-GCDs to single bigger GCD while spitting out the memory controllers and L3 cache into MCDs.

bit_user said:
@InvalidError already shot down this idea for cost reasons, in the other thread.

I shot SRAM down as a remotely viable alternative to (on-package) system memory. He only meant more SRAM, which is more feasible but we have the cache sizes we have today due to a necessary compromise between cache latency, bandwidth, footprint, hit rate, etc. If cache sizes could be arbitrarily increased with performance gains somewhat proportional to costs, every CPU and GPU designer would be pumping cache already. Apart from GPUs where L3 caches have grown by a factor of 6-12X (from 3MB or less to 16+MB) to shrink the memory bus by 33-50%, caches sizes rarely change until they can be doubled at little or no cost in latency as any increase in latency will nuke most gains from improved hit rate.

bit_user · Apr 9, 2023

Thanks for chiming in.

Not to get onto a whole tangent about GPUs, but there are some key details I want to clear up:

InvalidError said:
Apart from GPUs where L3 caches have grown by a factor of 6-12X (from 3MB or less to 16+MB) to shrink the memory bus by 33-50%,

That was true of RDNA2, but RDNA3 has both wide memory bus and an order of magnitude more cache than RDNA1.

In the RTX 4000, Nvidia did something a little different. Like RDNA3, they kept with the wide memory bus (both go up to 384-bits) and have a lot of cache, but unlike RDNA2/RDNA3, they put the additional cache into L2, rather than adding a L3 level.

bit_user · Apr 9, 2023

Kamen Rider Blade said:
Oh please, don't be so boring such that you are limited to only what you can see right in front of you.
You seriously lack imagination if that's all you're willing to see as viable.

Their cache coherency protocol might not allow for what you're describing. Just because the chiplets have 2 links on them doesn't necessarily mean you can link them up however you want. Unless there's published information specifying otherwise, we shouldn't make such assumptions. Hence, my caution about letting your imagination get carried away.

Kamen Rider Blade said:
Within the same NUMA Nodes currently, if multiple CCD's want to talk to each other within the same Node, they have to send signals straight to the cIOD, and run back to the CCD within their NUMA node.

NUMA -> Non-Uniform Memory Architecture. It means some address spaces are slower to access than others, from a given core. Really what we're talking about here is ccNUMA (cache-coherent). Zen 1 had this property, because each compute die had local memory. To access other address ranges, it had to talk to a different compute die.

Since Zen 2, Epyc basically got rid of NUMA. It still exists in some small form, due to partitioning within the I/O Die, but the penalty is far smaller. So, what you're really talking about is cache coherency - not NUMA.

As for cores "talking to each other", it's important to understand what that means in practical terms. The way you describe it sounds almost as if there are two machines on a network that are pinging messages back and forth, but the real-world picture looks rather different.

When cores communicate, it's via shared data objects in memory. When one core tries to touch a cache line that was recently touched by a core in another CCD (and recently enough that it's still in that CCD's cache), that's when you feel the latency penalty between the CCD's. That's the only time it's visible to software. A well-threaded program will minimize the frequency that different threads are modifying the same memory locations.

You can come up with elaborate communication networks, but you quickly hit a point of diminishing returns, in a general-purpose CPU. For most of the software and uses cases out there, inter-CCD communication just isn't the bottleneck you seem to think it is. That's why the "hub-and-spokes" model has been so successful for AMD.

Kamen Rider Blade said:
Certain Leaks for future designs that I've seen are going to change how they seperate the NUMA nodes, and one of the benefits will behaving the CCD's IFOP's face each other on the CPU Substrate Layout, one potential advantage of doing that is for a EMIB like short hop between CCD's on one set of GMI connections and the other set to go straight back to the cIOD for communication.

I think that's really just about solving the scalability problem of N-to-N switching. If the IO Die already implements multi-level switching, then it could make sense to push the first level of switching upstream to the CCDs. Perhaps the main benefit would be energy savings - not routing. If the software partitions workloads (i.e. via thread, process, container, or VM-level scheduling) to a pair of CCDs, then such a hardware optimization would make sense to take advantage of the reduced coherency domain.

Anyway, I'm not going to spend a lot of time speculating about such rumors. It's enough just to try and understand the hardware that exists.

Kamen Rider Blade said:
So what, I'll be their advocate. Consensus can be changed in due time with advocacy.

It's not like advocating for a social or political cause. They make these decisions based on technical and business reasons. First, you need to make a good argument on those bases, and this gets into hard numbers that you've said you lack the knowledge or training to provide. Second, you need to talk to the key decision-makers, which are the main chip architects and definitely not anyone reading these forums.

Kamen Rider Blade said:
It's my time to spend, not yours. And if you don't want to talk about it, then we won't.

I'm in no position to censor you, nor do I really have a desire to do so. I just think it's a sad waste of time and energy, not to mention yet more text in these discussion threads that's of no use to anyone. You don't have to agree, but I'm always going point out that the ship has sailed on OMI, so that any unsuspecting readers of your posts don't get confused into thinking it's still a viable possibility - it's not.

Kamen Rider Blade said:
OMI fantasy? The tech is real, it's part of the same CXL family now.

It's not. They just took it over for administrative purposes. The reason that needed to happen is that OMI imploded. Everyone jumped ship, and yet the IP still required some administration.

Kamen Rider Blade said:
He has his reasons, I have my reasons to think my way.
We don't always have to agree on everything.
That's part of what a free society allows.

It also allows people to be unreasonable. If you want people to listen to you, then take more care that your audience stays receptive. If certain ideas don't find a receptive audience, then it's better for you to move on, instead of continue ranting about them.

Kamen Rider Blade said:
And I'm telling you that "Cost per Die" & Yields are important. It's easier to get more good yields when you have many smaller dies than larger dies.

I didn't say they aren't. The math is key, however. If you just blindly hewed to the logic that smaller = better, then you'd put each core on its own chiplet.

If you think yield is the reason they didn't go above 8 cores per chiplet, then I think you really need to provide some data to show the yield would've been markedly worse at 12 cores per die.

InvalidError · Apr 9, 2023

bit_user said:
That was true of RDNA2, but RDNA3 has both wide memory bus and an order of magnitude more cache than RDNA1.

The wide memory bus may be true at the halo-tier end of the product stack but everything below AMD's 700-tier got its VRAM interface neutered going from the 5k-series to the 6k-series much like what Nvidia did going from 3k to 4k so far.

bit_user said:
I didn't say they aren't. The math is key, however. If you just blindly hewed to the logic that smaller = better, then you'd put each core on its own chiplet.

If you think yield is the reason they didn't go above 8 cores per chiplet, then I think you really need to provide some data to show the yield would've been markedly worse at 12 cores per die.

There are practical limits to how small a die can be before the added costs, complexity, latency, etc. of chip-to-chip interface outweigh the potential gains from improved yields. I'm actually surprised that it apparently makes sense to go at least down to ~70sqmm.

As for why not go to 12 cores, the simplest reasons I can think of are that 1) server-side clients are happy with the setup as it is and 2) most client-side sales are 4-8 cores so there is no point in going larger than necessary with 12-cores-centric CCDs.

Once the mainstream shifts towards 8-16 cores 4-5 years from now, we'll likely see 16 cores ~120sqmm CCDs.

Kamen Rider Blade · Apr 9, 2023

bit_user said:
Their cache coherency protocol might not allow for what you're describing. Just because the chiplets have 2 links on them doesn't necessarily mean you can link them up however you want. Unless there's published information specifying otherwise, we shouldn't make such assumptions. Hence, my caution about letting your imagination get carried away.

Then how does the Zen cores make sure it's "Cache Coherent" when they cross CPU Socket boundaries via xGMI for Dual Socket Processors?
Those are literal single link, Bi-Directional Serial connections between CPU's.

NUMA -> Non-Uniform Memory Architecture. It means some address spaces are slower to access than others, from a given core. Really what we're talking about here is ccNUMA (cache-coherent). Zen 1 had this property, because each compute die had local memory. To access other address ranges, it had to talk to a different compute die.

Since Zen 2, Epyc basically got rid of NUMA. It still exists in some small form, due to partitioning within the I/O Die, but the penalty is far smaller. So, what you're really talking about is cache coherency - not NUMA.

You don't think they can maintain "Cache Coherency" while having a short-cut between CCD's & still using a Hub & Spoke model?

As for cores "talking to each other", it's important to understand what that means in practical terms. The way you describe it sounds almost as if there are two machines on a network that are pinging messages back and forth, but the real-world picture looks rather different.

It sort of is, on a smaller scale. Multiple Cores in seperate CCD's might as well be Multiple Machines, but on a much smaller network with much shorter data paths.
Of course, that's an analogy. The sharing of the same data must have a certain amount of coherency.

Given Ryzen's Cache Coherency model is based on L1 <-> L2 <-> L3 <-> cIOD Memory Controller.
It's L3$ isn't shared amongst all CCD's. So if data doesn't fit on local L3, and there is space on another CCD's L3$, data will go there before being considered to go to Main Memory.
Main Memory is always the last option.

When cores communicate, it's via shared data objects in memory. When one core tries to touch a cache line that was recently touched by a core in another CCD (and recently enough that it's still in that CCD's cache), that's when you feel the latency penalty between the CCD's. That's the only time it's visible to software. A well-threaded program will minimize the frequency that different threads are modifying the same memory locations.

True, but it's still much faster to go across CCD than to go to main memory.

You can come up with elaborate communication networks, but you quickly hit a point of diminishing returns, in a general-purpose CPU. For most of the software and uses cases out there, inter-CCD communication just isn't the bottleneck you seem to think it is. That's why the "hub-and-spokes" model has been so successful for AMD.

"Diminishing Returns" doesn't mean there isn't some gain. And most of modern CPU Design & Evolution is based on improving on already "Diminishing Returns" and refining small things to make some sort of overall net gain. That's why every generational change, the % gains are always small single/double digit

t % changes.

CPU architectures have been hitting Diminishing Returns for quite a while, we just happen to be in that phase of technological improvement right now where we're dependent on refining alot of small aspects spread throughout the CPU Architecture, CPU Physical layout, & Process Nodes simultaneously to make the existing small gains that we can get in each generation.

I think that's really just about solving the scalability problem of N-to-N switching. If the IO Die already implements multi-level switching, then it could make sense to push the first level of switching upstream to the CCDs. Perhaps the main benefit would be energy savings - not routing. If the software partitions workloads (i.e. via thread, process, container, or VM-level scheduling) to a pair of CCDs, then such a hardware optimization would make sense to take advantage of the reduced coherency domain.

Given modern VM's getting recommended to split a large Multi-CCD's resource assignment of Cores/CCD's amongst the recommended NODE boundaries if they want to split up an entire CPU into multiple VM's, it makes sense to have some optimizations between the CCD's sitting within those existing NUMA Nodes.

Not every single VM will eat up an entire CPU's Core Resources, some setups will split the VM's down to each CCD, some will split their VM's to each entire NUMA NODE.

I don't see how having a direct link between CCD's would be harmful if they can get it to work.

Anyway, I'm not going to spend a lot of time speculating about such rumors. It's enough just to try and understand the hardware that exists.

That's your perogative.

It's not like advocating for a social or political cause. They make these decisions based on technical and business reasons. First, you need to make a good argument on those bases, and this gets into hard numbers that you've said you lack the knowledge or training to provide. Second, you need to talk to the key decision-makers, which are the main chip architects and definitely not anyone reading these forums.

Ok, getting to talk to those people aren't a easy thing; they're generally insulated from the general populace and rarely allowed to talk to the general masses for various reasons.

I'm in no position to censor you, nor do I really have a desire to do so. I just think it's a sad waste of time and energy, not to mention yet more text in these discussion threads that's of no use to anyone. You don't have to agree, but I'm always going point out that the ship has sailed on OMI, so that any unsuspecting readers of your posts don't get confused into thinking it's still a viable possibility - it's not.

Says who? Just because IBM is the only one that uses it right now, doesn't mean the tech & concept isn't viable.
I don't necessarilly agree with IBM's exact use model of shoving the memory controller on the DIMM.
That's the route they took, and many in the industry doesn't like that model since it adds significant cost to the DIMM and requires a proprietary DIMM format.

That's why I'm a advocate of their other OMI model, the one that they don't want to use because it's the half-and-half approach.

It's not. They just took it over for administrative purposes. The reason that needed to happen is that OMI imploded. Everyone jumped ship, and yet the IP still required some administration.

It "imploded" or was limited to IBM because they went against the "Raison D'etre" of the DIMM.
That Memory Modules should be Cheap & Modular. That Design Philosophy was agreed upon a LONG TIME ago by many companies.
Fully Buffered DIMMs failed because they tried to move parts of the Memory Controller onto the Module.
OMI's existing model didn't take off because they have a tiny Memory Controller shunted onto a proprietary (Non-Standard) DIMM.

The best solution is to not have to change the DIMM format, but move out the Memory Controller to be on the MoBo and Physically much closer to the DIMM slot.
Ergo minimizing the distance that the DIMM's many parallel electrical signal needs to travel to the memory controller.
The remaining distance can be covered by a Serial Connection, ergo minimizing the pins on the CPU Substrate.

It also saves Die Area for cIOD with the Memory Controller's PHY implementation, that can be used for other things.

It also allows people to be unreasonable. If you want people to listen to you, then take more care that your audience stays receptive. If certain ideas don't find a receptive audience, then it's better for you to move on, instead of continue ranting about them.

But you aren't the only person on a open forum, I'm trying to reach as many people as possible.

I didn't say they aren't. The math is key, however. If you just blindly hewed to the logic that smaller = better, then you'd put each core on its own chiplet.

If you think yield is the reason they didn't go above 8 cores per chiplet, then I think you really need to provide some data to show the yield would've been markedly worse at 12 cores per die.

I didn't say smaller = better CPU Cores.
I said AMD's main priority was maintaining costs & Yields while making a modular CPU CCD that works for their entire product stack.
Design & Make Once, use everywhere.
That was the entire point of the evolution from 4-core to 8-core CCD's/CCX's.

It's all about Baby-Steps.
What's next after 4-core CCX w/ 2x CCX's per CCD -> 8-core CCX|CCD?
I theorize that adding a 12-core CCX/CCD is the right step to increase the portfolio and allow a wider product target.
It's gradual evolution, there's physical room to fit (although barely on the Ryzen platform, more on certain EPYC designs).

InvalidError · Apr 9, 2023

Kamen Rider Blade said:
The best solution is to not have to change the DIMM format, but move out the Memory Controller to be on the MoBo and Physically much closer to the DIMM slot.

The DIMMs are already as close to the CPU socket as they can be without encroaching on the CPU's mechanical keep-out zone where you aren't allowed to put anything because it may interfere with HSF mounting. You also need that space for trace length-matching across the whole DIMM interface which is going to get more complicated if you slap an extra chip in there and increase the distance discrepancy between shortest and longest ball to DIMM pin distance, which means even more space wasted around your chip back-tracking out of the DIMM area for that.

One of the old DIMM interface's greatest benefits was the ability to add more RAM by using two DIMMs per channel but that is already mostly dead with DDR5 if you want speeds much beyond 5GT/s, almost certainly going to be history with DDR6. We have also reached the point where most workloads benefit more from trading slightly worse latency in favor of increased memory IO concurrency by splitting DIMMs into 2x36 data interfaces sharing one command bus with DDR5 to 4x18bits with DDR6. Things are clearly progressing in the direction of cutting ties with how DIMMs used to work. Holding on to the archaic interface isn't going to do anyone any good for much longer, especially if you are going to throw kludge chips in there to make it work.

bit_user · Apr 9, 2023

InvalidError said:
The wide memory bus may be true at the halo-tier end of the product stack but everything below AMD's 700-tier got its VRAM interface neutered going from the 5k-series to the 6k-series

RDNA2 is the 6000-series. RDNA3 is the 7000-series. So far, RDNA3 looks to have both big L3 caches and wide memory interfaces.

InvalidError said:
much like what Nvidia did going from 3k to 4k so far.

The 3080's width was 320 bits, yet the 4080's is 256 bits. Traditionally, it's normal for the 80-tier to have 256 bits. I think the main reason they went to 320 bits was to expand capacity to 10 GB, not for bandwidth. The reduction in width was partially compensated for by an increase in memory frequency.

At the 70-tier, the 3070's width was 256 bits, yet the 4070's is 192 bits. Again, the reduction in width was partially compensated by an increase in memory frequency. As with the 3080, I think it's hard to know whether the 3070 had 256 bits more because it needed the bandwidth, or because that was the most economical way to supply it with 8 GB of memory.

Kamen Rider Blade · Apr 9, 2023

InvalidError said:
The DIMMs are already as close to the CPU socket as they can be without encroaching on the CPU's mechanical keep-out zone where you aren't allowed to put anything because it may interfere with HSF mounting. You also need that space for trace length-matching across the whole DIMM interface which is going to get more complicated if you slap an extra chip in there and increase the distance discrepancy between shortest and longest ball to DIMM pin distance, which means even more space wasted around your chip back-tracking out of the DIMM area for that.

I concur, the current DIMMs are as close to the CPU socket w/o encroaching on the CPU's mechanical "Keep-Out Zone".
As far as trace length matching across the whole DIMM interface, that should be a bit more relaxed since the DIMM Memory Sub Channels have split it into independent A/B channels.
And DDR6 is going to add a C/D channel on the other side of the DIMM.

This isn't the DDR4 and earlier days where you have to trace-length match all 288-pins, you only need about half of that.

And moving onto DDR6, I can see them only needing to Trace-length match 1/4th of the total pin-count at a time.

That being stated, moving the entire Memory Controller to be closer and outside of the "Keep-Out Zone" would shorten the parallel trace lengths along with making it easier for the board designers to figure out how to make each independent memory channels trace-length as short and manageable as possible.

That's why I'm pushing for OMI, it gives you a flexibility that you didn't have before for Trace-Length routing on the parallel side.

One of the old DIMM interface's greatest benefits was the ability to add more RAM by using two DIMMs per channel but that is already mostly dead with DDR5 if you want speeds much beyond 5GT/s, almost certainly going to be history with DDR6. We have also reached the point where most workloads benefit more from trading slightly worse latency in favor of increased memory IO concurrency by splitting DIMMs into 2x36 data interfaces sharing one command bus with DDR5 to 4x18bits with DDR6. Things are clearly progressing in the direction of cutting ties with how DIMMs used to work. Holding on to the archaic interface isn't going to do anyone any good for much longer, especially if you are going to throw kludge chips in there to make it work.

I agree, doubling up on DIMM slots per Memory Channel should die for the sake of higher Main Memory speeds per DIMM Memory Channel.
But the benefit of OMI, especially with this incarnation where the Memory Controller sits on the MoBo:

I think you meant the 2x(32/40)-bit Memory Channel interfaces, not 2x36.
But with DDR6, they are going 4-channels, but I don't agree on trying to go with 4x16-bits.
It's too early IMO to go with x16-bit channels. We haven't found ways to make that fast enough at a reasonable latency and power cost.

We should stick to 4x(32/40)-bit Memory Channel interfaces for DDR6.
Yes, I know that would almost double the Pin-count to ~600 pins per DIMM, but having double Row-Pins wouldn't be hard to implement on a industry wide basis.
We've had double-row pins in the past as well, it's nothing new. So the point stands that we need a well thought out road-map for DDR# into the future.

With DDR7, we'll get to 6x DIMM Memory Sub-Channels, each Memory Sub-Channel containing 4x RAM Package.
ECC will use the "Link-ECC" model which will negate the need for seperate dedicated ECC Memory Chip and creates more room on DIMM for more RAM.

With 12x RAM Packages per row on each side at 6x DIMM Memory Sub-Channels, we'll have more options for splitting as time goes by.

Then comes DDR8 with 4x groups of 3-'RAM Packages' per DIMM Memory Sub-channels for a total of 8x DIMM Memory Sub-Channels per Double-Sided row.

Then comes DDR9 with 6x groups of 2-'RAM Packages' per DIMM Memory Sub-channels for a total of 12x DIMM Memory Sub-Channels per Double-Sided row.

Then comes DDR10 with 7x groups of 2-'RAM Packages' per DIMM Memory Sub-channels for a total of 14x DIMM Memory Sub-Channels per Double-Sided row.

Then comes DDR11 with 8x groups of 2-'RAM Packages' per DIMM Memory Sub-channels for a total of 16x DIMM Memory Sub-Channels per Double-Sided row.

I literally have a road-map for up to DDR15 where we have 32 RAM Packages and 32 DIMM Memory Sub-Channels per row with each Column being it's own Memory Channel.

But hey, I like to dream big =D.

It's not so much of a kludge as so much of a moving of the Memory Controller to be optimal for positioning and packaging.

Kamen Rider Blade · Apr 9, 2023

bit_user said:
RDNA2 is the 6000-series. RDNA3 is the 7000-series. So far, RDNA3 looks to have both big L3 caches and wide memory interfaces.

That's the beauty of Chiplets =D.

The amount of Die Real-Estate that the GCD needs for Memory PHY is very tiny compared to the traditional model that nVIDIA engineers have been complaining about.

bit_user · Apr 9, 2023

Kamen Rider Blade said:
True, but it's still much faster to go across CCD than to go to main memory.

When I say "memory", I mean a memory address. Obviously, I'm not saying the data has to get flushed out to DRAM. I'm just explaining the programming model, since some people might be under the impression that cores are passing messages directly to each other, or that this is frequently even a bottleneck.

Kamen Rider Blade said:
"Diminishing Returns" doesn't mean there isn't some gain. And most of modern CPU Design & Evolution is based on improving on already "Diminishing Returns" and refining small things to make some sort of overall net gain.

In fact, "diminishing returns" implies there is some gain to be had. Of course that's not always true, but I'm presuming there even is some gain. The key is to make sure the incremental cost is less than the benefit. That's the only way you win by chasing diminishing returns. It's not clear to me that the benefit justifies the cost.

I think you've just seized on something that interests you and are free-wheeling with ideas about it. The hard part of innovation isn't actually coming up with ideas, but figuring out which ideas make sense and then all the nuts & bolts of implementing them successfully. "Innovation" gets so much play in today's culture that it's easy to overlook the basic fundamentals of engineering.

Kamen Rider Blade said:
Lots of minor changes here & there for a measily 13% Generational IPC Uplift."]t % changes.

[/SPOILER]

You have no idea just how much modeling and data went into each and every one of the little decisions that added up to such gains. That's often the only way to know for sure what's a good idea and what's not. Letting a few bad ideas slip into the mix can quickly overwhelm the gains achieved by all the good ones.

Kamen Rider Blade said:
That's your perogative.

Speculation without data tends to be pretty worthless. The space of possible ideas is infinite. Data and modeling is what separates the plausible from the infeasible and the bad ideas from the good ones.

Kamen Rider Blade said:
"Two roads diverged in a wood, and I— I took the one less traveled by, And that has made all the difference."

You forgot to learn the rules, before breaking them. The reason being that it's only by understanding the rules that you can know whether, where, and when it makes sense to break them. Blindly ignoring the rules is a recipe for disaster.

Kamen Rider Blade said:
OMI's existing model didn't take off because they have a tiny Memory Controller shunted onto a proprietary (Non-Standard) DIMM.

How do you know that's why it failed? I think that's just what you want to believe, because it explains OMI's market failure without technically invalidating its premise. Wishful thinking has no place in engineering.

If you really want to know whether OMI is technically viable, you should do some more digging to find out as much as you can about the real reason it fizzled. I think you're afraid of what answer you might turn up, if you looked. Prove me wrong.

Kamen Rider Blade said:
The best solution is to not have to change the DIMM format, but move out the Memory Controller to be on the MoBo and Physically much closer to the DIMM slot.

Show me the math. How does energy consumption compare between that and DDR5 of the equivalent bandwidth? What happens to trace lengths, as you try to scale up to the DDR5-17600 we're told MRDIMMs will ultimately reach?

You'll never convince anyone it's a good idea, if you can't prove it's competitive on an energy-efficiency front and that it won't stop scaling before DDR5 does.

Kamen Rider Blade said:
But you aren't the only person on a open forum, I'm trying to reach as many people as possible.

It's the wrong place for that. Also, posts are supposed to be on-topic. This article wasn't about OMI. You're thread-jacking to try and work your agenda.

Kamen Rider Blade said:
It's all about Baby-Steps.

A baby step isn't a virtue, when a bigger one would be optimal. Competition eats you alive when you don't take the right size step the situation calls for. Reasoning by analogy will get you run over, in business and engineering. It's all about the data and making the best cost/benefit tradeoffs.

Kamen Rider Blade · Apr 9, 2023

bit_user said:
It's the wrong place for that. Also, posts are supposed to be on-topic. This article wasn't about OMI. You're thread-jacking to try and work your agenda.

If you don't want me to continue this, then just say so.

We're done.

bit_user · Apr 9, 2023

Kamen Rider Blade said:
If you don't want me to continue this, then just say so.

It's not about what I want. I also said it's also not my place or desire to censor you.

I think short tangents are okay, but this definitely seems like it's gone past that point. So, ending it here seems like a good choice.

Kamen Rider Blade · Apr 9, 2023

bit_user said:
It's not about what I want. I also said it's also not my place or desire to censor you.

You could've fooled me, it seems like you're here to rain on my parade at every encounter.

I think short tangents are okay, but this definitely seems like it's gone past that point. So, ending it here seems like a good choice.

Sure, all I'm going to remember is that you don't want me here, so I guess I won't be contributing as much since you don't like the way I post and what I post.

bit_user · Apr 9, 2023

Kamen Rider Blade said:
You could've fooled me, it seems like you're here to rain on my parade at every encounter.

Ideas should be challenged. Good ideas will have no trouble passing this test. The quicker that bad ideas can be ruled out, the more we can focus on the good ones. It's a basic tenet of research.

Kamen Rider Blade said:
Sure, all I'm going to remember is that you don't want me here, so I guess I won't be contributing as much since you don't like the way I post and what I post.

If you have an idea that doesn't gain traction, wouldn't you rather know why not? Then, you can either change your approach or focus on ideas that have a better chance of succeeding. That's why I take the time to point out & explain some of the holes I see.

This stuff isn't easy. Many smart people make careers of working on these problems. Many a doctoral thesis has been written about these subjects. You need to have both a realistic expectation of what it takes to have a successful idea, and how to pursue and promote good ideas so that they get the attention they deserve. I can't rule out the possibility that some of your ideas are in fact good ones, but you're not doing the necessary work to demonstrate that. If you're seriously interested in hardware architecture, there's no substitute for doing the work. There are also better forums for discussing it than this one.

As far as what I think, I'm just another random person on the internet. You decide how much my opinion matters to you.

Kamen Rider Blade · Apr 9, 2023

bit_user said:
You decide how much my opinion matters to you.

I already did.

InvalidError · Apr 9, 2023

Kamen Rider Blade said:
This isn't the DDR4 and earlier days where you have to trace-length match all 288-pins, you only need about half of that.

You don't length-match the ~140 power, ground and low-speed support pins.

Kamen Rider Blade said:
It's too early IMO to go with x16-bit channels. We haven't found ways to make that fast enough at a reasonable latency and power cost.

We should stick to 4x(32/40)-bit Memory Channel interfaces for DDR6.

https://resources.altium.com/p/ddr5...les#pcb-design-challenges-in-ddr5-vs-ddr6-ram
Altium, which is used by many if not most advanced PCB designers, is already telling designers to get ready for DDR6 having 24bits-wide channels, 16x data + 8x ECC. Looks like they are expecting memory bus error rates to become a huge issue beyond 10GT/s.

That would be one more reason to believe parallel interfaces for external DRAM will reach the end of the road with DDR6. Here come M.2-RAM sticks!

Kamen Rider Blade · Apr 9, 2023

InvalidError said:
You don't length-match the ~140 power, ground and low-speed support pins.

Fair enough.

https://resources.altium.com/p/ddr5...les#pcb-design-challenges-in-ddr5-vs-ddr6-ram
Altium, which is used by many if not most advanced PCB designers, is already telling designers to get ready for DDR6 having 24bits-wide channels, 16x data + 8x ECC. Looks like they are expecting memory bus error rates to become a huge issue beyond 10GT/s.

That would be one more reason to believe parallel interfaces for external DRAM will reach the end of the road with DDR6. Here come M.2-RAM sticks!

I've read through that article before, that's where I found out how they wanted to split the channel & keep the pin-count the same (up to 300-pin limit).

I kinda disagree with keeping the 300-pin limit of current DIMMs and want them to add-in a second row of pins that are half-offset, right below the original row.
This would allow for my plan to work in the long term with many more parallel DIMM Memory Sub-Channels.

I also want them to shrink the pin-pitch to have 250 pins per row with 500 pins on the double row and 1000 pins on both sides.

There have already been SO-DIMM & Mini-DIMM's with more Pins in a smaller physical width than the primary DIMM, so asking for 250 pins per row seems to be within a reasonable request.

bit_user · Apr 10, 2023

Kamen Rider Blade said:
I kinda disagree with keeping the 300-pin limit of current DIMMs and want them to add-in a second row of pins that are half-offset, right below the original row.
This would allow for my plan to work in the long term with many more parallel DIMM Memory Sub-Channels.

I also want them to shrink the pin-pitch to have 250 pins per row with 500 pins on the double row and 1000 pins on both sides.

All of these decisions involve cost, power, reliability, signal-integrity, mechanical, and other tradeoffs. Without any models, how are you able to know which tradeoffs are justifiable or optimal?

If you want to find out whether your ideas are viable, why aren't you posting them on a site or Reddit thread for board designers? More than anything you write, it's where you're writing that tells me you're not at all serious about your ideas. And if you're not serious about them, don't blame us for not taking them seriously.

Kamen Rider Blade · Apr 10, 2023

bit_user said:
All of these decisions involve cost, power, reliability, signal-integrity, mechanical, and other tradeoffs. Without any models, how are you able to know which tradeoffs are justifiable or optimal?

If you want to find out whether your ideas are viable, why aren't you posting them on a site or Reddit thread for board designers? More than anything you write, it's where you're writing that tells me you're not at all serious about your ideas. And if you're not serious about them, don't blame us for not taking them seriously.

So Tom's Hardware isn't "Serious Business", but Reddit is?

bit_user · Apr 10, 2023

Kamen Rider Blade said:
So Tom's Hardware isn't "Serious Business", but Reddit is?

The in-depth technical content of the articles here is pretty light, as is the amount of coverage of electronics industry. You need to find the sites and communities where chip and board designers hang out.

I don't frequent Reddit, but a quick search turned up this: https://www.reddit.com/r/chipdesign/

Maybe look at sites oriented towards EDA.

There clearly must be better places to chat with chip and board designers than here. The majority of people on here are gamers - why do you think any hardware designers would be interested in hanging out here?

Kamen Rider Blade · Apr 10, 2023

bit_user said:
The in-depth technical content of the articles here is pretty light, as is the amount of coverage of electronics industry. You need to find the sites and communities where chip and board designers hang out.

I don't have all day to go searching high and low across the internet.

I don't frequent Reddit, but a quick search turned up this: https://www.reddit.com/r/chipdesign/

I generally don't use Reddit.

Maybe look at sites oriented towards EDA.

There clearly must be better places to chat with chip and board designers than here. The majority of people on here are gamers - why do you think any hardware designers would be interested in hanging out here?

Because Tom's Hardware is a general Tech Site that is supposed to be welcoming to all Tech Enthusiasts of all shapes, stripes, and experiences.

Anyways, you want me to leave, that's what it comes down to.

My opinions aren't welcome here is what you're stating.

bit_user · Apr 10, 2023

Kamen Rider Blade said:
I don't have all day to go searching high and low across the internet.

I generally don't use Reddit.

In the time it takes you to write one of your long posts, I'm sure you could find better places to post them than here. If you were serious about pitching these ideas, I'm sure you would.

Kamen Rider Blade said:
Because Tom's Hardware is a general Tech Site that is supposed to be welcoming to all Tech Enthusiasts of all shapes, stripes, and experiences.

It's welcoming to all people, but that's different than saying it caters to all interests those people might have. If you want to talk chip design & board design, then you should find the appropriate communities and forums for that.

Kamen Rider Blade said:
Anyways, you want me to leave, that's what it comes down to.

No, but I think you should post your ideas about chip & board design in places where you can get feedback from people with more expertise in the respective fields. If you're serious about them, that's what I'd expect you'd do. Especially if you want them to get visibility among people who could actually implement them.

InvalidError · Apr 10, 2023

Kamen Rider Blade said:
I kinda disagree with keeping the 300-pin limit of current DIMMs and want them to add-in a second row of pins that are half-offset, right below the original row.

The problem with single-ended parallel interfaces is that the amount of ground bounce (noise) you get is a function of how many pins are going high vs going low. The more single-ended signals you have, the worse it gets, especially when signals have to travel through socket and slot interfaces. I'll hazard a guess that part of the reason for DDR6 going 16D+8ECC is because they use some of the code word expansion to do TMDS-like balancing, albeit for balancing the number of parallel bits flipping directions instead of DC bias over time for a serial lane to prevent coupling capacitor or transformer saturation.

If you are going to implement many serial-like features in your parallel bus just to make bits survive going through the DIMM slot, may as well go full serial.

bit_user said:
All of these decisions involve cost, power, reliability, signal-integrity, mechanical, and other tradeoffs.

Tradeoffs that likely won't make any lick of sense by the time DDR6 becomes mainstream: by then, CPUs will have 64+MB of L3$ as baseline and it is highly probable we'll have or be about to have on-package HBM3+ memory too.

To me, the way DDR6 splinters the DIMM interface into bite-sized chunks looks like a prelude to actually halving the per-module bus width: if each DIMM channel can only have one DIMM each in order to accommodate bandwidth scaling, splitting the interface width is the only way to keep four slots for expansion without massively increasing CPU pin count, board complexity and price. (And there is no point in maintaining a full 128bits worth of parallel external memory on mainstream platforms when the bulk of people won't need more than the 32-64GB already on package.)

Kamen Rider Blade · Apr 10, 2023

InvalidError said:
The problem with single-ended parallel interfaces is that the amount of ground bounce (noise) you get is a function of how many pins are going high vs going low. The more single-ended signals you have, the worse it gets, especially when signals have to travel through socket and slot interfaces. I'll hazard a guess that part of the reason for DDR6 going 16D+8ECC is because they use some of the code word expansion to do TMDS-like balancing, albeit for balancing the number of parallel bits flipping directions instead of DC bias over time for a serial lane to prevent coupling capacitor or transformer saturation.

If you are going to implement many serial-like features in your parallel bus just to make bits survive going through the DIMM slot, may as well go full serial.

They've tried to do that in the past multiple times; every time they did try to go "Serial Interface", the industry has stated "NEIN!".
And Adoption failed.
The various groups within JEDEC that influence the standards has stated that they want to stick to the parallel interface and not make Memory Modules serial in connection nature.
OMI is just the latest attempt to shove a Serial Memory Controller onto the DIMM and turn the DIMM into a Serial interface.
It didn't catch on, just like FB DIMM before it many years ago by Intel.

The original reason for the standard Memory Modules being simple PCB's with parallel interfaces was to make it cheaper for all.
I got that from reading alot of historical info on the formation of the SIMM/DIMM standard and what all the players wanted out of the common standardized Memory Module.

There was not to be any "Controlling Logic" put on the memory module, the memory controller (located outside of the SIMM/DIMM), would handle all the hard memory controlling logic necessary.

Changing that paradigm goes against the original cost effective ethos behind the formation of the Parallel Memory Module.

So tell it to JEDEC and all it's members who have skin in the game, ask them whether or not they can be willing to accept a Tiny Serialized Memory Controller on the DIMM.

So far, it has failed at least twice, in a big way.

That's why I'm a proponent of the alternate implementation of OMI.

The Memory Controller isn't on the Memory Module, but it is moved closer to the Memory Module and the serial connection allows flexible placement on the MoBo.

The shorter parallel traces for the Data Lines should make it easier to raise over-all clocks and bandwidth while maintaining reasonable power consumption.

InvalidError said:
Tradeoffs that likely won't make any lick of sense by the time DDR6 becomes mainstream: by then, CPUs will have 64+MB of L3$ as baseline and it is highly probable we'll have or be about to have on-package HBM3+ memory too.

I wouldn't expect HBM memory to be any consumer client platforms, that seems to be relegated to Enterprise Customers due to cost reasons.

To me, the way DDR6 splinters the DIMM interface into bite-sized chunks looks like a prelude to actually halving the per-module bus width: if each DIMM channel can only have one DIMM each in order to accommodate bandwidth scaling, splitting the interface width is the only way to keep four slots for expansion without massively increasing CPU pin count, board complexity and price. (And there is no point in maintaining a full 128bits worth of parallel external memory on mainstream platforms when the bulk of people won't need more than the 32-64GB already on package.)

That's assuming the average consumer/client CPU's get on-package memory vs sticking with the existing DIMM model.

I'm not betting on that to happen for the vast majority of folks, especially when they are used to selecting how much memory they're buying into.

I do agree on the split of the DIMM Memory Sub-Channels into 4x, I came to the same conclusion on my own when thinking about the future of DDR past DDR5.
Long before I decided to research what JEDEC was planning for DDR6.

The only part that I didn't agree with was how they wanted to do the bus width per module.
Going 24-bit at this time isn't a good idea at the moment.
There will be a time for 24-bit bus width per DIMM Memory Sub-Channels, but it isn't with DDR6 IMO, it's a future iteration.

But anyways, you know what I have planned more or less.

I have DDR6-15 in my head planned out to some degree, including a major change in how ECC is done to "Link ECC" method.
This way we'll all get ECC Memory, regardless of Client / Enterprise Computers.
You also won't need a dedicated DRAM package for ECC, every DRAM Package will have a single built in ECC line to send verification data along with the Memory Controller doing the appropriate checksum & ECC validations of the received data on it's end.
This allows a "Compare & Contrast" to see if the data sent is consistent with the ECC values sent over the dedicated ECC line on each DRAM package.

I hate how Intel has held us back on that front and I want to make that change to force everybody to finally benefit from having ECC Memory.

I do agree that the bulk of average people won't need more than 32-64 GiB of RAM for their average basic PC system (Gaming Rigs will need more, but that's a different story)
That's why I want the ½ GiB DRAM Package to remain a option for implementation.
This would let future Minimum Low Profile, Double-Sided, Single Row DIMM's be 16 GiB (even all the way up to DDR15 generation) as a bare minimum spec for the average consumer.

With my end goal to have 32 DRAM Packages on a Double-Sided, Single Row DIMM as the end game with 32x DIMM Memory Sub Channels, that should allow for "Very High Asynchronous Parallel" data throughput starting with DDR12-15.
Then improvements in how much data each DRAM package can send out will improve during the span of DDR12-15, allowing higher bandwidth & parallelization in each DIMM Memory Sub Channels.

bit_user · Apr 10, 2023

Kamen Rider Blade said:
The original reason for the standard Memory Modules being simple PCB's with parallel interfaces was to make it cheaper for all.
I got that from reading alot of historical info on the formation of the SIMM/DIMM standard and what all the players wanted out of the common standardized Memory Module.

There was not to be any "Controlling Logic" put on the memory module, the memory controller (located outside of the SIMM/DIMM), would handle all the hard memory controlling logic necessary.

Changing that paradigm goes against the original cost effective ethos behind the formation of the Parallel Memory Module.

Your information is obsolete. DDR5 already introduced active circuitry between the DIMM interface and the DRAM chips.

Kamen Rider Blade said:
I wouldn't expect HBM memory to be any consumer client platforms, that seems to be relegated to Enterprise Customers due to cost reasons.

Apple is using LPDDR5X in its consumer CPUs.

Kamen Rider Blade said:
I have DDR6-15 in my head planned out to some degree,

Without models or data, how can you plan even 1 generation, much less 8?

Kamen Rider Blade said:
I do agree that the bulk of average people won't need more than 32-64 GiB of RAM for their average basic PC system

Apple ships Mac Minis with as little as 8 GB. They use memory compression to make up a lot of the deficit. Not that it handles all cases adequately, but it's a fact that they're doing it.

Kamen Rider Blade said:
With my end goal to have 32 DRAM Packages on a Double-Sided, Single Row DIMM

Sounds expensive.

News Jim Keller Shares Zen 5 Performance Projections

Titan

Titan

Titan

Titan

Distinguished

Titan

Titan

Distinguished

Distinguished

Titan

Distinguished

Titan

Distinguished

Titan

Distinguished

Titan

Distinguished

Titan

Distinguished

Titan

Distinguished

Titan

Titan

Distinguished

Titan

Share this page