News Jim Keller Shares Zen 5 Performance Projections

Uhhhh, do you not see in the graph you provided that the Ryzen 5 7600X's single-thread efficiency handily opens a can of whoop on the Core i5-13600K's score, which is the highest single-thread efficiency of all the Intel parts tested?
I'm sure he'll want the efficiency scores to be bracketed by performance tier. That's pretty fair, because it'd otherwise be too easy to game the 1T efficiency score by cutting clocks and/or cores.

He's found the biggest weakness of Zen 4 on the desktop, and he knows it. AMD won't have a proper answer for energy-efficiency of single-threaded tasks that can also deliver competitive performance, until they bring their monolithic APUs to the desktop. Otherwise, AMD's chiplet architecture seems to have overheads that you need multi-threaded tasks to amortize away.

Is it an important weakness? I'd argue not as much as Intel's crazy power consumption under highly-threaded loads.

That said, I think 1T efficiency isn't terribly germane to the discussion, because the CPU and system-level overheads get in the way of properly evaluating microarchitecture (i.e. core) efficiency. And that's what the article is really about.

Just look at the single-thread efficiency ramp-up for Zen 4: 7950X @ 5.7 GHz = 41.7 pts, 7700X @ 5.4 GHz = 83 pts (i.e. a +100% single-thread efficiency gain from dropping frequency by 300 MHz), then 7600X @ 5.3 GHz = 124.3 pts (i.e. a +50% efficiency gain from dropping frequency by another 100 MHz).
When you cross the threshold from 2 CCDs to 1, there's a big jump in efficiency. Can't miss it.
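
For what it's worth, the percentage jumps in that quote check out against the chart's own numbers. Here's a quick sketch of the arithmetic (the "pts" values are just the quoted efficiency scores, nothing re-measured on my end):

Code:
# Relative single-thread efficiency jumps, using the quoted chart values.
scores = {
    "7950X @ 5.7 GHz": 41.7,
    "7700X @ 5.4 GHz": 83.0,
    "7600X @ 5.3 GHz": 124.3,
}

chips = list(scores.items())
for (prev_name, prev_pts), (name, pts) in zip(chips, chips[1:]):
    gain = (pts / prev_pts - 1) * 100
    print(f"{prev_name} -> {name}: {gain:+.0f}% efficiency")
# 7950X -> 7700X: +99% (the "+100%" in the quote)
# 7700X -> 7600X: +50%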

All these forums are basically bit_user vs Terry.
I'm not in every thread where Terry hangs out and vice versa. Plus, several others have disputed his claims, even in this very thread!

I have nothing against Terry. If I see what I think is a misinterpretation or misrepresentation, I'm going to try and correct it. That's all.
 
Jim Keller is awesome. For a long time, K10 was pretty much my favorite architecture.
I suspected he was over-hyped, until I read the interviews of him on Anandtech. Those won my respect.

He's the reason I bought in on the Tenstorrent offering. So I'm pretty happy to finally see news releases on Tenstorrent. They need to keep using Keller as a tech celebrity to just provide expert views like this.
They IPO'd?

I haven't kept up on their recent strategic adjustments. Back when they were a pure AI-chip company, I thought they had a better than average chance of succeeding, but they're up against some really powerful competition, in an overcrowded market. Maybe that's why they're shifting more towards general-purpose RISC-V cores. Anyway, I'd classify it as a somewhat risky investment, but you could certainly do a lot worse.

If they succeed, I could see them getting bought by AMD, Dell, or HPE. IBM or even Microsoft might be a long shot. Or, maybe someone out of left field, like Siemens, NXP, Hitachi, or NEC, but probably not.
 
I'm not in every thread where Terry hangs out and vice versa. Plus, several others have disputed his claims, even in this very thread!

I have nothing against Terry. If I see what I think is a misinterpretation or misrepresentation, I'm going to try and correct it. That's all.
Haha... no, just the two I've been in a lot lately. It's hard for me to see Terry's viewpoint, but it might be because I don't like Intel. I really dislike the way they treat customers, artificially locking features behind performance tiers and changing sockets all the time.

At least in the case of this article, the media is giving AMD some pretty great publicity.
 
I suspected he was over-hyped, until I read the interviews of him on Anandtech. Those won my respect.


They IPO'd?
Oops, I'm wrong. They didn't IPO. It was something else I bought when I was trying to invest in them: a company connected to a bunch of other funding that I hoped was related. So I don't think I've got a horse in that race.

Yes, Tenstorrent is a good buyout candidate.
 
Man, I do love that Ascalon Cluster & AEGIS Architecture.

It's a thing of beauty and how "Real Tiling" Chiplets should work.

That potential for "Infinite Tiling" 2D Flat Torus or a set of nested Rings is "Brilliant" from a Tiling / Chiplet Architectural standpoint.

Just add more rings around your center if you want a bigger CPU =D

================================================================================

Not what Intel and its ultra-simplistic Quad-Tiles w/ Mesh inside turned out to be.
It doesn't even take advantage of the physical Geometric 2D Square Shape for infinite repeated tiling / tessellation.
 
I'm sure he'll want the efficiency scores to be bracketed by performance tier. That's pretty fair, because it'd otherwise be too easy to game the 1T efficiency score by cutting clocks and/or cores.

He's found the biggest weakness of Zen 4 on the desktop, and he knows it. AMD won't have a proper answer for energy-efficiency of single-threaded tasks that can also deliver competitive performance, until they bring their monolithic APUs to the desktop. Otherwise, AMD's chiplet architecture seems to have overheads that you need multi-threaded tasks to amortize away.

Is it an important weakness? I'd argue not as much as Intel's crazy power consumption under highly-threaded loads.

That said, I think 1T efficiency isn't terribly germane to the discussion, because the CPU and system-level overheads get in the way of properly evaluating microarchitecture (i.e. core) efficiency. And that's what the article is really about.


When you cross the threshold from 2 CCDs to 1, there's a big jump in efficiency. Can't miss it.


I'm not in every thread where Terry hangs out and vice versa. Plus, several others have disputed his claims, even in this very thread!

I have nothing against Terry. If I see what I think is a misinterpretation or misrepresentation, I'm going to try and correct it. That's all.
Single-thread power for 2-CCD CPUs is about 4 watts higher than for a single-CCD CPU at the same frequency, but that doesn't disprove that 5.6-5.7 GHz is past the optimal voltage/frequency range, which is the reason Ryzen 9 is so inefficient at single thread. It has nothing to do with 2 CCDs besides the addition of ~4 watts (i.e. a Ryzen 7700X @ 5.4 GHz pulls 23 watts single-thread, and a Ryzen 9 7900 @ 5.4 GHz pulls 27 watts).

Also, since the Ryzen 9 7950X at any of its Eco settings performs better in multi-threaded applications than the 13900K at the same wattage, it further reinforces that the Zen core architecture is more efficient than Intel's Raptor Lake architecture in spite of the added interconnect power usage. It is the fact that AMD pushed the 7900X and 7950X beyond the optimal range of 5nm's frequency/voltage curve that makes single-threaded efficiency so terrible on Ryzen 9 Zen 4 CPUs. It's like overclocking, when you have to add 0.1 volts just to get 100 MHz more out of your CPU, then 0.15 volts to get an additional 100 MHz; i.e. you have gone past the optimal range into the range of diminishing returns. That's what AMD did to approach roughly the same single-thread boost clock as Intel, using a process node built for mass compatibility, not all-out performance like Intel 7. In fact, I predict Intel 4, 3, etc. will lose upper-end frequency/voltage efficiency due to the mass-compatibility requirement for these nodes under IFS. We will see.
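
To put the diminishing-returns point into rough numbers: dynamic power scales roughly with V² × f, so every extra 100 MHz that has to be bought with more voltage costs disproportionately more power. The voltages below are hypothetical, chosen only to mirror the 0.1 V / 0.15 V example above; they are not measured Zen 4 values:

Code:
# Rough dynamic-power model: P ~ C * V^2 * f (capacitance folded into a constant).
# Hypothetical voltage steps, mirroring the "0.1 V for 100 MHz, then 0.15 V more" example.
points = [
    (5.3, 1.20),  # (GHz, core volts) - assumed "optimal" point
    (5.4, 1.30),  # +100 MHz for +0.10 V
    (5.5, 1.45),  # +100 MHz for +0.15 V
]

base_f, base_v = points[0]
base_p = base_v**2 * base_f
for f, v in points:
    p = v**2 * f
    print(f"{f:.1f} GHz @ {v:.2f} V: "
          f"{(f / base_f - 1) * 100:+5.1f}% clock, "
          f"{(p / base_p - 1) * 100:+5.1f}% dynamic power")
# The last +100 MHz (~ +2% clock) costs roughly +52% dynamic power in this toy example.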
 
That potential for "Infinite Tiling" 2D Flat Torus or a set of nested Rings is "Brilliant" from a Tiling / Chiplet Architectural standpoint.

Just add more rings around your center if you want a bigger CPU =D
There's a lot we don't know about the on-die interconnect of GPUs.

Not what Intel and its ultra-simplistic Quad-Tiles w/ Mesh inside turned out to be.


It doesn't even take advantage of the physical Geometric 2D Square Shape for infinite repeated tiling / tessellation.
In reality, these things don't need to scale infinitely. There's a fairly narrow window between what you can comfortably do with a single chiplet and a CPU so large, expensive, and hot that no one wants it. It seems likely Intel would've benefited from making their chiplets a little smaller, but increasing chiplet count beyond 8 or so probably would've been pointless.

I like the symmetry of Sapphire Rapids and the original Epyc, where each die has its own local memory controller and inter-processor bus slice. If you pair this with optimal software partitioning, it's also more efficient.
 
There's a lot we don't know about the on-die interconnect of GPUs.
Here's the talk about the new processors from the man who designed it.

In reality, these things don't need to scale infinitely. There's a fairly narrow window between what you can comfortably do with a single chiplet and a CPU so large, expensive, and hot that no one wants it. It seems likely Intel would've benefited from making their chiplets a little smaller, but increasing chiplet count beyond 8 or so probably would've been pointless.
Is that also how you feel about the new EPYC 9000 series w/ Genoa and its 12 CCDs & cIOD to unify them?
I bet you that they'll be scaling beyond 12x CCDs in the future as well.

I like the symmetry of Sapphire Rapids and the original Epyc, where each die has its own local memory controller and inter-processor bus slice. If you pair this with optimal software partitioning, it's also more efficient.
I'm more partial to the current Hub & Spoke model that AMD has chosen to adopt with its cIOD & chiplets.
It's a classic model that works reliably well and is very well understood, along with its PROs & CONs.

I'm not counting on "Optimal Software Partitioning".

That's the hard lesson that AMD learned: more often than not, you don't get to deal with optimal software.

The developers are mostly going to do whatever they want, and that just may be very incompatible with your architecture.
When you're in the 2nd place seat, they won't optimize for your architecture either.
So you adapt your hardware model to the most common & prevalent coding models to make them work.
 
Is that also how you feel about the new EPYC 9000 series w/ Genoa and its 12 CCDs & cIOD to unify them?
Yes, I thought that would come up. We don't know if AMD chose to stick with the current 8-core CCD because it was optimal for Epyc or desktops. The fact that the chiplets are shared has the potential to force some compromises on each side.

Another point would be to consider the CCD interconnect link, which you'd have to scale up for a bigger CCD. The HPC versions of Epyc actually use a pair of links per CCD, which shows it's indeed a bottleneck. Also, the CCD's L3 cache slice now has more consumers, and it was only in Zen 3 that AMD fused that L3 slice for sharing by all 8 cores.

The other point I'd make is about yield, and this can be a variable thing. A low-yielding process can incentivize very small chiplets, whereas a mature, high-yielding process can make very large dies viable. Look at how big Nvidia's dies are*, on TSMC 4N, and then tell me AMD had to use such small CCDs.

I bet you that they'll be scaling beyond 12x CCDs in the future as well.
Who knows, but I wouldn't count on it. I think the package will start to get crowded, as they cram in DRAM and other things.

Speaking of chiplets, Jim Keller has a pretty good take on why we should expect to see more, and it's nothing really to do with size or yield:

BTW, the best quote from that talk is nothing to do with chiplets:

"The '3 nm' transistor is something like 1000 x 1000 x 1000 atoms. That's a billion atoms. I know we know how to shrink that."

I'm more partial to the current Hub & Spoke model that AMD has chosen to adopt with its cIOD & chiplets.
It's a classic model that works reliably well and is very well understood, along with its PROs & CONs.
It doesn't scale for things like GPUs, however. Even for CPUs, it introduces avoidable compromises, if you can do some software partitioning around a more heavily NUMA approach.

I'm not counting on "Optimal Software Partitioning".
Yes, especially in the general-purpose CPU world.

* The H100 is reportedly 814 mm²
 
I'm not in every thread where Terry hangs out and vice versa. Plus, several others have disputed his claims, even in this very thread!

I have nothing against Terry. If I see what I think is a misinterpretation or misrepresentation, I'm going to try and correct it. That's all.
I think you are in many more threads than Terry (or at least more threads I am interested in), but when you make it a bit personal just due to differing opinions, it sure does seem like you have something against Terry other than his contrarian opinions↓
So far, I can't figure out what your agenda is, other than to try and minimize or distract from anything that's embarrassing to Intel. I've got to say that every one of your posts that try to put some positive spin on Intel or paint AMD in a negative light makes me feel just a little worse towards Intel. If you were a brand ambassador for team Intel, I'd say you're scoring some own-goals.
I took Terry's above assessment as nuanced, stating that the P-core at its efficiency band (best clock on the power/performance curve) was better than Zen 4 at the same criteria... I am not sure that is true, but they are both very close competitively. AMD seems to be much more competitive on productivity if you keep the power thresholds low; if you ignore power limits, Intel ekes out a win. Your die picture comparing both really shows AMD has a large performance-per-chip-density win, and maybe with disaggregation AMD has found new ways to shift circuitry off the core die to the IO die or fabric and make the most of expensive litho. I find it a fascinating discussion and it's the crux of who is going to be more competitive next gen. What is Meteor Lake's IPC on Intel 4, and how will transistor density compare to Zen 4/5? Execution from Intel will determine who it competes with.

Can one have a contrarian opinion and not be labeled as a brand ambassador? How about agreeing to disagree or admitting perhaps we don't understand how they came to such a contrarian conclusion? Keeping it civil and kind
 
I took Terry's above assessment as nuanced, stating that the P-core at its efficiency band (best clock on the power/performance curve) was better than Zen 4 at the same criteria...
And I addressed why I think that's not accurate. He started out by saying we should separate out "core" power from "CPU" power, and then picked a metric which went almost as far as possible in the opposite direction.

I am not sure that is true but they are both very close competitively.
Not in terms of efficiency, which was the point.

Your die picture comparing both really shows AMD has a large performance-per-chip-density win, and maybe with disaggregation AMD has found new ways to shift circuitry off the core die to the IO die or fabric and make the most of expensive litho.
It was comparing core + L2, only. That's about as functionally identical as you can get. There are two variables at play: lithography and micro-architecture.

I find it a fascinating discussion and it's the crux of who is going to be more competitive next gen. What is Meteor Lake's IPC on Intel 4, and how will transistor density compare to Zen 4/5? Execution from Intel will determine who it competes with.
While I'm aware this is an article about Zen 5, I'm making no predictions about either it or Meteor Lake.

Can one have a contrarian opinion and not be labeled as a brand ambassador?
First, a contrarian is one who disputes the dominant opinion, whatever it is. That's not what I've been seeing. Find me contrarian positions Terry has taken that don't involve Intel or AMD. Find me a single negative thing Terry ever said about Intel, even when the majority opinion tended to favor Alder Lake vs. Zen 3.

Second, my "brand ambassador" was a hypothetical. I started out by saying I don't know what Terry's agenda is, but if it's a pro-Intel agenda, then I think he's going so far as to harm his own cause. Please read more carefully, before you start picking sides and making counter-accusations.

How about agreeing to disagree or admitting perhaps we don't understand how they came to such a contrarian conclusion?
Because this is a place where we share information and debate ideas. When someone is consistently misrepresenting or misinterpreting information, it can do harm by spreading misinformation (whether intentional or not). Misinformation should be countered and corrected, and it's not just me who's taking issue with many of Terry's claims. If you value peace and harmony at the expense of letting people actively spread misinformation, then I will admit that perhaps I don't understand how you came to such a view.
 
Because this is a place where we share information and debate ideas. When someone is consistently misrepresenting or misinterpreting information, it can do harm by spreading misinformation (whether intentional or not). Misinformation should be countered and corrected, and it's not just me who's taking issue with many of Terry's claims. If you value peace and harmony at the expense of letting people actively spread misinformation, then I will admit that perhaps I don't understand how you came to such a view.
Honestly, it's too much drama... Misinformation is another can of worms. We have data and conflicting data, and perhaps both are right; it's not black or white, it truly depends on the particulars of each instance. You can find corner cases for any platform to say it is the best. It's what marketing gets paid for.

Debating ideas is fine; it's when you make it personal or ascribe malicious intent that a line is crossed.

The original article is claiming Zen 5 is going to be king of integer performance, and we got statements in the weeds comparing die density and performance efficiency on currently released products...

My personal opinion is Intel and AMD are neck and neck; all that matters is execution [release cadence in volume]. Even if AMD comes out with Zen 5 and it is the best, if they didn't pre-order enough TSMC capacity, Intel will maintain and probably increase market share even with an inferior product. Intel seems to be pulling ahead in desktop, so perhaps they can do nothing for a generation (Raptor Lake refresh??? Seriously getting Comet Lake vibes). Mobile is where they are weak, and Meteor Lake cannot come out soon enough.
 
Honestly, it's too much drama...
Then why are you getting involved? You're only making this more personal, not less.

Every point Terry makes that I take issue with is purely on the basis of the claim being made, the supporting evidence, the interpretation, or the conclusions being drawn. You're the one saying it's personal. It's not. When Terry says something I agree with (and he sometimes does), then I upvote the post. I have no hesitation in doing so, because it's not personal for me.

Misinformation is another can of worms,
It's simply whenever someone posts inaccurate facts, incomplete facts that lead to a wrong interpretation, an incorrect interpretation or conclusion from the facts supplied, or controversial claims which cannot be substantiated.

Anything which looks like one of these situations should be challenged, because leaving it unchallenged could lead to others believing it's true and spreading it.

It's not hard. It's not a can of worms. If it's challenged and the original claim can be substantiated, then fine. Even if not, I'm also willing to walk away, so long as the problem isn't compounded by yet more problematic claims, which is what often happens.

we have data and conflicting data and perhaps both are right,
Data is a starting point. The next step is interpretation. Neither is infallible, and thus both are grounds for discussion. More data usually helps, because experimental variations are real and can account for some conflicting data.

it's not black or white,
Facts and absolutes are real things. Reaching clear conclusions is hard enough, when it doesn't seem like certain people have an agenda.

Debating ideas is fine; it's when you make it personal or ascribe malicious intent that a line is crossed.
Again, you're the one saying that. I just pointed out that Terry's action might be self-defeating.

The original article is claiming Zen 5 is going to be king of integer performance, and we got statements in the weeds comparing die density and performance efficiency on currently released products...
You know how it goes. One person makes a rash generalization, another talks it down but still manages to slip in some misleading statements... and pretty soon, we're in the weeds.

It's not like there's really anything to say about Zen 5. We don't really know anything, other than rumors about what node it's meant to be on and that it's supposedly a much bigger design-change than Zen 4.

If you look at my first posts in the thread, it was to clarify misstatements about integer performance and to ask a question about the actual slide that was shown. I'm not the one who steered us off into the weeds.

And, where possible, I've tried to steer the discussion back towards focusing on the respective microarchitectures, as that's what the article was about.

My personal opinion is Intel and AMD are neck and neck,
Performance-wise, their (P) cores are pretty close. I posted a link to that effect, in post #20, in case you didn't notice.

Efficiency is a different story, and a somewhat more complicated one. "Complicated" doesn't mean there are no right answers, just that the answers involve more qualifications and conditions.

all that matters is execution [release cadence in volume].
As far as I'm concerned, all that matters is the products on the market. If we have data on unreleased products, it's unreliable and therefore not something I'm going to expend a lot of time or energy speculating about.

Even if AMD comes out with Zen 5 and it is the best, if they didn't pre-order enough TSMC capacity
One would hope they've learned their lesson. The few things I've heard about TSMC would suggest there should be plenty of capacity on offer.
 
Why? Life is too short; people have ignorant opinions all the time. You think you can change them?
It's not about changing the mind of someone with a firm point-of-view. That's effectively impossible.

What we can do, and what I believe is worthwhile, is to avoid suspect or problematic claims going unchallenged.

With AI language models scraping the internet in their thirst for knowledge, the stakes of letting any potential misinformation stand unchallenged just got even higher. That being said, LLMs remain a hypothetical risk - if I were training one, I wouldn't use forums or discussion boards, not if I wanted it to be any good!
; )
 
How about agreeing to disagree or admitting perhaps we don't understand how they came to such a contrarian conclusion? Keeping it civil and kind
Because when you've worked in the technology industry at the OS or firmware level, if you're not a hardware engineer already, you'd understand that what they're saying simply doesn't make sense, period, because it's not how it actually works.

As a simple example, let's say someone was using one of those "Free RAM" utilities or something. They're convinced it's actually working because, if they look in Task Manager, a bunch of RAM suddenly does become available. What they don't realize is that all the tool is doing is hogging up all the RAM it can, then immediately releasing it, which pushes everything else into the page file. So the utility isn't actually helping at all. But they see the magic in Task Manager, so to them it does work, regardless of what anyone says.
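
If you want to see the trick for yourself, here's a rough sketch of what those utilities effectively do: allocate a huge buffer, touch it so the OS has to back it with physical pages (evicting everything else to the page file), then release it. It's illustrative only; the real tools use OS-specific APIs, and you shouldn't run it on a machine that's already short on memory:

Code:
# Crude illustration of a "Free RAM" utility. Illustrative only.
SIZE = 4 * 1024**3           # try to claim ~4 GB (arbitrary figure)

hog = bytearray(SIZE)        # allocate a large buffer
for i in range(0, SIZE, 4096):
    hog[i] = 1               # touch one byte per page so it's actually committed
del hog                      # release it all at once

# Task Manager now shows lots of "available" RAM, but only because everything
# else got pushed out to the page file. Nothing was actually sped up.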
 
Yes, I thought that would come up. We don't know if AMD chose to stick with the current 8-core CCD because it was optimal for Epyc or desktops. The fact that the chiplets are shared has the potential to force some compromises on each side.

Another point would be to consider the CCD interconnect link, which you'd have to scale up for a bigger CCD. The HPC versions of Epyc actually use a pair of links per CCD, which shows it's indeed a bottleneck. Also, the CCD's L3 cache slice now has more consumers, and it was only in Zen 3 that AMD fused that L3 slice for sharing by all 8 cores.

The other point I'd make is about yield, and this can be a variable thing. A low-yielding process can incentivize very small chiplets, whereas a mature, high-yielding process can make very large dies viable. Look at how big Nvidia's dies are*, on TSMC 4N, and then tell me AMD had to use such small CCDs.


Who knows, but I wouldn't count on it. I think the package will start to get crowded, as they cram in DRAM and other things.
I'm pretty sure AMD chose to make such small CCDs for yield & cost reasons.
Their main goal was to make manufacturing their entire product line very affordable, ergo good margins while still having a high-performing part.
They accomplished that goal.
But moving forward, I don't think they can depend on a "One-Size-Fits-All" approach, and they'll need CCDs/CCXs with larger and smaller Core configs.

Moving forward, I think they'll be needing at least a few more CCD / CCX variants to meet all the potential customers & competitors who want more Cores.
_4x Core CCD <- Something for the bottom end / budget end
_8x Core CCD <- Keep What they're doing now, but add more members to the family.
12x Core CCD
16x Core CCD (Which is just a _8x Core CCD (Die Area) with Zen #C cores doubling up the Core count)
24x Core CCD (Which is just a 12x Core CCD (Die Area) with Zen #C cores doubling up the Core count)
32x Core CCD (Which is just a 16x Core CCD (Die Area) with Zen #C cores doubling up the Core count)

They'll also need 2x more dedicated Platforms besides Ryzen & EPYC
That's where Ryzen TR (ThreadRipper) & Ryzen FX would come in.
A dedicated 4x CCD CPU design & a dedicated 6x CCD CPU design, based off Socket SP6, that can leverage the new larger CCDs to their full potential.

Ryzen FX 4x CCD CPU for the HEDT / WorkStation / SoHo SMB market.
Ryzen TR 6x CCD CPU for the Ultra High End / Enterprise WorkStation / SMB Server market.

EPYC for their normal Enterprise customers.

========================================

As far as Transmission Speeds, it's a matter of what they think they'll need, but besides doubling up on Bandwidth by using 2x GMI3 connections to the cIOD, I can see other interesting uses for them in their CCD setups on their chips.

Some of the future CCD configs show the potential of having two CCDs facing each other. That would allow for a very short hop & a direct GMI3-to-GMI3 connection, like 1st Gen ThreadRipper, while maintaining a direct connection to the cIOD on the other GMI3 connection. This would allow the Zen cores on the two CCDs to have a direct shortcut to each other instead of being forced to make a round trip through the cIOD just to communicate, which would improve inter-NUMA-node latency significantly by not having to travel that far just to visit your neighbor's CCX.

The GMI links on each Zen CCD are basically really wide PCIe PHY lanes with a custom Protocol setup.
And with each lane moving at 36 Gbps, that should be an 18 GHz serial connection w/ DDR on the transmission line.
The same speeds as their xGMI connection that they use for Socket to Socket communications.
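
Just sanity-checking the per-lane math (nothing here assumes the actual GMI lane count, only the figures quoted above):

Code:
# Per-lane GMI/xGMI arithmetic from the numbers above.
clock_ghz = 18            # 18 GHz serial clock
transfers_per_clock = 2   # DDR: two transfers per clock (both edges)
gbps_per_lane = clock_ghz * transfers_per_clock   # 36 Gbps per lane
gbytes_per_lane = gbps_per_lane / 8               # 4.5 GB/s per lane, raw, before encoding overhead
print(gbps_per_lane, "Gbps =", gbytes_per_lane, "GB/s per lane")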

Speaking of chiplets, Jim Keller has a pretty good take on why we should expect to see more, and it's nothing really to do with size or yield:
Cool, I'll go watch that when I get the time.

BTW, the best quote from that talk is nothing to do with chiplets:

"The '3 nm' transistor is something like 1000 x 1000 x 1000 atoms. That's a billion atoms. I know we know how to shrink that."
I know that he knows that other people know how to shrink that.
^_-​
I think that TSMC should start using A9, A8, A7, A6, A5, A4, A3, A2, A1 for their Process Nodes once we get under 1nm according to their marketing charts.
Eventually, once we get under 1 Angstrom,
TSMC should start using PM## for the Picometer era.

It doesn't scale for things like GPUs, however. Even for CPUs, it introduces avoidable compromises, if you can do some software partitioning around a more heavily NUMA approach.
True, the Hub & Spoke model doesn't work well for GPUs, but as the engineers in the RTG group stated, they couldn't do chiplets the same way the Ryzen team did.

Moving their Memory Controllers with built-in SRAM or Infinity Cache AKA L3$ off the main die was a brilliant move, and I see them eventually going down a similar road with OMI to solve the massive Contact Pin issues on their CPUs.

That would give them a LOT of extra PCIe lanes on each CPU Socket in the future.

For a 300-Pin DIMM, you would get a Dual Channel OMI Serial connection to the Memory Controller for the cost of 75 Contact pins, while shoving the Memory Controller much closer to the DIMMs. That's literally an 8:1 Pin-Count advantage by going OMI (or whatever branding AMD & Intel choose to call it).
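
Spelling out where that 8:1 figure comes from, i.e. two conventional channels' worth of pins vs. one dual-channel OMI link, using the pin counts above:

Code:
# Pin-count comparison implied above.
ddr_pins_per_channel = 300    # ~300 contact pins for a conventional DIMM channel
omi_pins_dual_channel = 75    # pins for a dual-channel OMI serial link (figure quoted above)

parallel_pins = 2 * ddr_pins_per_channel          # 600 pins for two channels the parallel way
ratio = parallel_pins / omi_pins_dual_channel
print(f"{parallel_pins} vs {omi_pins_dual_channel} pins -> {ratio:.0f}:1")   # 8:1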

I know you have a hard on for "On-Package DRAM Memory" and CXL.mem.

But I don't see a world where CXL.mem isn't just another add-on card to supplement existing DIMM channels on the MoBo, especially with the CXL.mem Latency penalty.

But for now, let's agree to disagree on the future of how Computers / CPU's will be configured (Main Memory wise).

I think L4$ w/ lots of SRAM will be an easier solution for AMD to bolt on instead of on-package DRAM, and the Bandwidth potential of L4$ has a long way to scale, with the bottleneck being only the IFOP / Dual GMI3 connections, which will grow in time as they improve.

So I wouldn't be worried about on-package DRAM.

Yes, especially in the general-purpose CPU world.

* The H100 is reportedly 814 mm²
The Process Node Reticle Limit that will halve the maximum possible die area is coming, so if you make a design that is fully dependent on a "Large Die Area", then good luck with that when your maximum Process Node Reticle Limit of 858 mm² on TSMC 5nm gets cut in half in the future, down to 429 mm².

And we both know that SRAM density is essentially capped at 5nm and isn't improving at the moment as we go down in Process Node sizes.

So if you aren't ready to design around that limitation as we head to 3nm, you're going to get BONED so hard.

That's going to be EXTRA hard on a company like nVIDIA who was dependent on ALL that GINORMOUS L2$ in their chip design for the 4090.
The amount of Die Area that their L2$ eats up is HUGE, and "Un-Official" nVIDIA engineers have been complaining about their Memory Controller PHY sizes and how they will basically "REFUSE" to go Chiplet unless they absolutely have to.

And if they're going to use Chiplets, it's to go the Intel Route to make a MUCH more MASSIVE combined Chip via Chiplets.

They refuse to go down AMD's path of disaggregation, pointing to the issues that AMD has had with going Chiplets on GPU as the main reason for not going that route.

nVIDIA would rather pass the costs on to the customer and say "Suck it up, and pay out" than put in the engineering hours to go Chiplets for cost savings.
 
I'm pretty sure AMD chose to make such small CCDs for yield & cost reasons.
Their main goal was to make manufacturing their entire product line very affordable, ergo good margins while still having a high-performing part.
That's really just an argument about yield. We have an example of Nvidia making viable 5 nm dies at 814 mm², while the Zen 4 CCD is only 66 mm². This tells me they opted for 8 cores for reasons other than yield.

As far as Transmission Speeds, it's a matter of what they think they'll need, but besides doubling up on Bandwidth by using 2x GMI3 connections to the cIOD,
That sounds like an acknowledgement of my point that GMI bandwidth appears to be an issue. I'll take it.

I can see other interesting uses for them in their CCD setups on their chips.
What you can imagine and what the chiplets actually support are two different things. Unless and until they actually discuss this option, we shouldn't assume it's possible.

Some of the future CCD configs show the potential of having two CCDs facing each other. That would allow for a very short hop & a direct GMI3-to-GMI3 connection, like 1st Gen ThreadRipper, while maintaining a direct connection to the cIOD on the other GMI3 connection. This would allow the Zen cores on the two CCDs to have a direct shortcut to each other instead of being forced to make a round trip through the cIOD just to communicate, which would improve inter-NUMA-node latency significantly by not having to travel that far just to visit your neighbor's CCX.
Bypassing the IOD, for CCD-to-CCD communication, would add a lot of complexity for probably rather minimal value to most use cases. You talk of NUMA domains as if they're the same between Zen 1 and later, but they're not. You're rattling off solutions, but I've yet to be convinced you understand the underlying problem you're trying to solve.

I see them eventually going down a similar road with OMI to solve the massive Contact Pin issues on their CPUs.
It's dead, Jim. Let it go.

No roadmaps show anyone adopting it and no one is talking about it. Whether or not you understand why they rejected it, you should respect the fact that the industry has reached a consensus. So, please stop wasting your time & ours talking about it.

I know you have a hard on for "On-Package DRAM Memory" and CXL.mem.
That's not fair. I'm simply describing where the industry seems to be headed, based on actual product developments, roadmap announcements, and overall level of activity around CXL. Please don't place my view of these technologies on the same level as your OMI fantasy.


But I don't see a world where CXL.mem isn't just another add-on card to supplement existing DIMM channels on the MoBo, especially with the CXL.mem Latency penalty.

CXL Memory Latencies within the Latency Pyramid:
In the future, in-package memory would take the place of "Main Memory". Look at Nvidia's Grace, to see where things are headed. They packed in 512 GB of LPDDR5X and it can just use CXL for the rest. No need for any conventional DIMMs.

I think L4$ w/ lots of SRAM will be an easier solution for AMD to bolt on instead of on-package DRAM,
@InvalidError already shot down this idea for cost reasons, in the other thread.

The Process Node Reticle Limit that will halve the maximum possible die area is coming, so if you make a design that is fully dependent on a "Large Die Area", then good luck with that when your maximum Process Node Reticle Limit of 858 mm² on TSMC 5nm gets cut in half in the future, down to 429 mm².
You're taking that out of context. It was meant to be a footnote, to show that Zen 4 CCDs would've been viable with a lot more than 8 cores, which blows up the yield argument for why they kept with that size. I'm only talking about TSMC N5, here. For the sake of that argument, future nodes are irrelevant. And I'm not getting side-tracked onto a whole tangent about Nvidia.
 
That's really just an argument about yield. We have an example of Nvidia making viable 5 nm dies at 814 mm², while the Zen 4 CCD is only 66 mm². This tells me they opted for 8 cores for reasons other than yield.
That's why I mentioned "Cost & Yields". It costs more for a larger die while smaller dies are much cheaper to produce & bin.

That sounds like an acknowledgement of my point that GMI bandwidth appears to be an issue. I'll take it.
It's only an issue if they can't keep the CPUs fed when they actually need to do work or transmit data so they can do work.
So far, it's an option.

What you can imagine and what the chiplets actually support are two different things. Unless and until they actually discuss this option, we shouldn't assume it's possible.
Oh please, don't be so boring that you're limited to only what you can see right in front of you.
You seriously lack imagination if that's all you're willing to see as viable.

Bypassing the IOD, for CCD-to-CCD communication, would add a lot of complexity for probably rather minimal value to most use cases. You talk of NUMA domains as if they're the same between Zen 1 and later, but they're not. You're rattling off solutions, but I've yet to be convinced you understand the underlying problem you're trying to solve.
Currently, if multiple CCDs within the same NUMA node want to talk to each other, they have to send signals all the way to the cIOD and back to the other CCD within their NUMA node. That's a very long distance to travel, and it adds latency.
Certain leaks for future designs that I've seen are going to change how they separate the NUMA nodes, and one of the benefits will be having the CCDs' IFOPs face each other on the CPU Substrate Layout. One potential advantage of doing that is an EMIB-like short hop between CCDs on one set of GMI connections, with the other set going straight back to the cIOD for communication.
This would dramatically cut the latency down for inter-node communications between CCDs.

The other solution is a small ring-like connection between CCDs: since each CCD has 2x IFOPs, they can arrange the connections like Zen 1 TR and have a way to directly communicate with each other.

That's what I'm talking about: shortening the latency penalty, because the Zen Architecture philosophy is about moving data the shortest distance possible.
What's more efficient, going from CCD -> cIOD -> CCD?
Or going from CCD -> CCD?
One will obviously have a shorter route for electrons to travel, ergo more energy efficiency.
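
As a toy model of the two routes (every number below is invented purely for illustration; these are not measured IFOP or cIOD latencies):

Code:
# Toy hop model for the two routes described above. All figures are made up.
ifop_hop_ns = 10       # hypothetical cost of crossing one IFOP/GMI link
ciod_transit_ns = 15   # hypothetical cost of traversing the cIOD fabric

via_ciod = ifop_hop_ns + ciod_transit_ns + ifop_hop_ns   # CCD -> cIOD -> CCD
direct = ifop_hop_ns                                     # CCD -> CCD (EMIB-like short hop)
print(f"via cIOD: {via_ciod} ns, direct: {direct} ns")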

It's dead, Jim. Let it go.
No, NEVER! You'll have to kill me first!

No roadmaps show anyone adopting it and no one is talking about it. Whether or not you understand why they rejected it, you should respect the fact that the industry has reached a consensus. So, please stop wasting your time & ours talking about it.
So what, I'll be their advocate. Consensus can be changed in due time with advocacy.
It's my time to spend, not yours. And if you don't want to talk about it, then we won't.
But I'll forever advocate for a more modular model of DIMM connections.

That's not fair. I'm simply describing where the industry seems to be headed, based on actual product developments, roadmap announcements, and overall level of activity around CXL. Please don't place my view of these technologies on the same level as your OMI fantasy.
Ah, the great CXL road-map that will change the world.
OMI fantasy? The tech is real, it's part of the same CXL family now.
Now it's about getting the big boys to see the error of their ways and to have a more "Flexible" model moving forward.

In the future, in-package memory would take the place of "Main Memory". Look at Nvidia's Grace, to see where things are headed. They packed in 512 GB of LPDDR5X and it can just use CXL for the rest. No need for any conventional DIMMs.
Good for them, nVIDIA's custom proprietary solution has on-package DRAM.
That's nVIDIA's choice.

@InvalidError already shot down this idea for cost reasons, in the other thread.
He has his reasons, I have my reasons to think my way.
We don't always have to agree on everything.
That's part of what a free society allows.

You're taking that out of context. It was meant to be a footnote, to show that Zen 4 CCDs would've been viable with a lot more than 8 cores, which blows up the yield argument for why they kept with that size. I'm only talking about TSMC N5, here. For the sake of that argument, future nodes are irrelevant. And I'm not getting side-tracked onto a whole tangent about Nvidia.
And I'm telling you that "Cost per Die" & Yields are important.
It's easier to get good yields with many smaller dies than with larger dies.
Especially when you're depending on the Foundries' Nodes, which will have variable Yields over time due to many factors.