News Intel's Diamond Rapids will use LGA9324 packaging

Status
Not open for further replies.
The post made me realize something... idk how many contact pins EPYC has. Only just realized AMD doesn't make it as easy as Intel to know the count. Anyone know? (One thing Intel does well is that the socket name gives you the info.)
 
  • Like
Reactions: rtoaht

bit_user

Titan
Ambassador
HEDT isn't usually the path for the truly massive CPUs. At that level you would just buy the enterprise platform.

HEDT is aimed at a price point between consumer and commercial/enterprise. Several thousand dollars, not tens of thousands per chip.
Somebody must've blown the breaker on your humor circuit. I detected a distinct note of sarcasm in that post. Either way, it'd be pretty ridiculous! :D
 

bit_user

Titan
Ambassador
The article said:
also up from the 7,529 contacts used by Intel's Xeon 5 Granite Rapids and Sierra Forest processors.
Oops, no. These are Xeon 6.

It doesn't help that Intel is calling the 6th-gen Xeon SP like that. I'm sure it's already caused some confusion, with people thinking it refers to the thousands digit of the model numbers instead of the generation.
 

JRStern

Distinguished
Mar 20, 2017
151
59
18,660
So, this would be a monolithic package of how many chips inside?
Is this a good idea?
Would anything be lost if it was split into four or eight smaller packages?
 

bit_user

Titan
Ambassador
So, this would be a monolithic package of how many chips inside?
Monolithic... package? I don't understand the question. When are these packages ever not monolithic? It will obviously contain multiple dies/tiles/chiplets, but I assume you weren't asking about that.

Is this a good idea?
Intel seems to think so. The downside is that the more components you cram into a single package, the greater the chance of failures. And if it does experience a catastrophic failure, the cost of replacing the whole thing is quite high.

On the other hand, if just a single core fails, you can simply take it offline. I recently saw some news about Intel working on field diagnostics, which are probably becoming a necessity for continuing to scale individual CPUs like this.

Would anything be lost if it was split into four or eight smaller packages?
I/O bandwidth between packages is much lower, as well as the energy per bit transferred being higher. That's the main reason. If you keep all chips on the same package, you can use much wider, faster, and more efficient links.
 

JRStern

Distinguished
Mar 20, 2017
151
59
18,660
Monolithic... package? I don't understand the question. When are these packages ever not monolithic? It will obviously contain multiple dies/tiles/chiplets, but I assume you weren't asking about that.

Intel seems to think so. The downside is that the more components you cram into a single package, the greater the chance of failures. And if it does experience a catastrophic failure, the cost of replacing the whole thing is quite high.

On the other hand, if just a single core fails, you can simply take it offline. I recently saw some news about Intel working on field diagnostics, which are probably becoming a necessity for continuing to scale individual CPUs like this.

I/O bandwidth between packages is much lower, as well as the energy per bit transferred being higher. That's the main reason. If you keep all chips on the same package, you can use much wider, faster, and more efficient links.
"Monolithic" is the (old) term for stuffing many chips under one cover. Even so, maybe you can take it too far. If you have a 128 cores, how much do they even want or need to talk to each other? Maybe keeping "just" 16 or 32 of them in a package would cover 90% of any need, if there is any need? With mobs of cores like that they are still hardwired to whichever L1, L2 caches they are using so are already somewhat constrained on memory and data fragmentation or duplication, so putting them all under a single cover is already misleading.
 

bit_user

Titan
Ambassador
maybe you can take it too far. If you have a 128 cores, how much do they even want or need to talk to each other?
Modern software (both kernel and userspace) assumes memory accesses are cache-coherent. This means a certain amount of traffic between the cores to support cache coherency, even when they're not actually exchanging data.
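To make that concrete, here's a minimal C sketch (my own illustration, not taken from any Intel or AMD material) of "false sharing": two threads that never read each other's data still generate coherency traffic, because their counters happen to sit in the same cache line. The loop counts and the assumption of 64-byte lines are just illustrative.

```c
/* Illustrative only: two threads update *different* counters that share a
 * cache line. Neither reads the other's data, yet the coherency protocol
 * still bounces the line between cores. Padding the fields onto separate
 * 64-byte lines makes the traffic (and the slowdown) go away.
 * Build with: gcc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>

struct counters {
    volatile long a;   /* written only by thread A */
    volatile long b;   /* written only by thread B, but on the same line as a */
};

static struct counters c;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++)
        c.a++;         /* each write invalidates the line in the other core's cache */
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++)
        c.b++;
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, bump_a, NULL);
    pthread_create(&tb, NULL, bump_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}
```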

Maybe keeping "just" 16 or 32 of them in a package would cover 90% of any need, if there is any need?
For at least 5 years or so, AMD has supported partitioning EPYC CPUs into separate NUMA domains. I'm not sure, but I think it could be simply talking about de-interleaving memory accesses to an extent. Here's what the Zen 4 EPYC doc says about it:

"Beginning with AMD EPYC 7002 Series processors, non-uniform latency was reduced dramatically by locating memory controllers onto the I/O die. NUMA domains were flattened more by the move to 32 MB of L3 cache in the EPYC 7003 Series. In 4th Gen EPYC processors, optimizations to the Infinity Fabric interconnects reduced latency differences even further.

Using EPYC 9004 Series processors, for applications that need to squeeze the last one or two percent of latency out of memory references, creating an affinity between memory ranges and CPU dies (‘Zen 4’ or ‘Zen 4c’) can improve performance. Figure 9 illustrates how this works. If you divide the I/O die into four quadrants for an ‘NPS=4’ configuration, you will see that six DIMMs feed into three memory controllers, which are closely connected via Infinity Fabric (GMI) to a set of up to three ‘Zen 4’ CPU dies, or up to 24 CPU cores.

Most applications don’t need to be concerned about using NUMA domains, and using the AMD EPYC processor as a single domain (NPS=1) gives excellent performance. The AMD EPYC 9004 Architecture Overview provides more details on NUMA configurations and tuning suggestions for specific applications"

Source: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf

I'm pretty sure that's just a memory-partitioning trick and assumes the OS is careful not to migrate threads between NUMA domains (otherwise, performance penalties would result).
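If you want to play with that kind of affinity yourself, here's a rough sketch using libnuma on Linux (my own example, not from the white paper; node 0 and the 64 MB buffer size are arbitrary). It pins the calling thread to one NUMA node and backs a buffer with memory from that same node, which is essentially the memory-range/CPU-die affinity the document describes.

```c
/* Rough sketch, assumes Linux + libnuma (link with -lnuma). Pins the
 * calling thread to one NUMA node and allocates a buffer on that node,
 * i.e. an explicit affinity between a memory range and a set of cores. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int node = 0;                              /* e.g. one quadrant in an NPS=4 setup */
    printf("NUMA nodes visible to the OS: %d\n", numa_max_node() + 1);

    numa_run_on_node(node);                    /* run this thread on that node's CPUs */

    size_t len = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, node);  /* memory physically on the same node */
    if (!buf) return 1;

    memset(buf, 0, len);                       /* first touch stays node-local */
    numa_free(buf, len);
    return 0;
}
```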

I think the main argument for packing so many cores in a single CPU is that it allows cloud operators to pack more cores per rack unit. Also, each physical machine incurs some management overhead for them. So, it works out to a cost-savings for them.

There are also some workloads that really want lots of cores and/or lots of memory capacity or bandwidth. If you're a cloud operator, you can charge a premium for such instances, which wouldn't be possible on smaller CPUs, and still fill out any remaining capacity on the machine with smaller VMs that have much more modest requirements.

With mobs of cores like that they are still hardwired to whichever L1, L2 caches they are using so are already somewhat constrained on memory and data fragmentation or duplication, so putting them all under a single cover is already misleading.
Misleading? What's misleading about having distributed caches? Go back as far as you like, even to the days before multi-core CPUs. Back then, each CPU had its own cache(s). There was never a point, in probably at least the last 3 decades, when a multi-core system didn't have such a cache architecture.
 
Last edited:
  • Like
Reactions: JRStern

bit_user

Titan
Ambassador
I will add that, absent cloud (or at least the concept of VM hosting), it's not clear to me that CPUs would've scaled to such core counts. I don't know if there's enough demand for having so many cores visible to a single application to justify such large CPUs, especially when we can still scale by adding more CPU packages. Some of Intel's CPUs scale up to 8 sockets, and I think AMD's would at least scale beyond 2S if they couldn't so easily just add more cores per package, or if the demand for doing so were less.
 
Last edited:

JRStern

Distinguished
Mar 20, 2017
151
59
18,660
Modern software (both kernel and userspace) assumes memory accesses are cache-coherent. This means a certain amount of traffic between the cores to support cache coherency, even when they're not actually exchanging data.
...
Presume you need the NUMA domains even with all the processors in one package.

Back about, oh, twenty years ago, the standard servers had two or more processor chips on single motherboards. I never did look into just how they did cache consistency. More than that, I've done a lot of work with SQL Server, which really used multiple cores/threads well and allowed NUMA partitions, "soft NUMA", and Ghu knows what else, but there was never any data on just how that was really being done. And then when wrapped into VMs as well, OMG.
 

bit_user

Titan
Ambassador
Presume you need the NUMA domains even with all the processors in one package.
What NUMA actually means is somewhat fuzzy. As the above text I quoted from the Zen 4 EPYC manual says, the penalty for accessing the farthest memory controller wasn't terribly big and has only gotten smaller.

The main thing that's NUMA about modern multiprocessor servers is how each CPU has its own memory controller(s). Until the end of the Core 2 era, Intel had a centralized memory controller in the North Bridge chip.

Back about, oh, twenty years ago, the standard servers had two or more processor chips on single motherboards.
I think 2P is still the standard server workhorse. Most of Intel and AMD's server CPU models support at least 2P configurations. As for the ARM world, it's the same with Ampere Altra (and I presume AmpereOne) and Nvidia's Grace.
 

JRStern

Distinguished
Mar 20, 2017
151
59
18,660
What NUMA actually means is somewhat fuzzy. As the above text I quoted from the Zen 4 EPYC manual says, the penalty for accessing the farthest memory controller wasn't terribly big and has only gotten smaller.

The main thing that's NUMA about modern multiprocessor servers is how each CPU has its own memory controller(s). Until the end of the Core 2 era, Intel had a centralized memory controller in the North Bridge chip.


I think 2P is still the standard server workhorse. Most of Intel and AMD's server CPU models support at least 2P configurations. As for the ARM world, it's the same with Ampere Altra (and I presume AmpereOne) and Nvidia's Grace.
NUMA is an attempt to have multiple channels to separate banks of memory to avoid collisions, waiting, and other bandwidth problems. You need controllers, and channels (conductors), and memory organized in banks - and ways to cross the boundaries without too much penalty. But the penalties are substantial on their own and what's worse is if six processors all start contending for one bank. Again, I wish I'd ever seen anything detailed about how SQL Server handled NUMA boundaries, but I never saw a word of it - you can ask for this or that, but you're doing it without knowing how the system is organized.
 

bit_user

Titan
Ambassador
NUMA is an attempt to have multiple channels to separate banks of memory to avoid collisions, waiting, and other bandwidth problems.
I think even that's an overcomplicated definition. At its heart, what it says is that the path isn't symmetrical between all processing elements and all memories. This functional definition hints at the reasons you list for pursuing such an architecture, but there are others (e.g. modularity) and the rationale for NUMA really is an adjunct, as there are other potential solutions for addressing those motives.
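One way to see that asymmetry directly is to dump the kernel's node-distance table. Here's a tiny libnuma sketch (mine, not anything official; it just prints what `numactl --hardware` also shows). Local nodes report a distance of 10, remote nodes report more, and that difference in path cost is really all that "non-uniform" asserts.

```c
/* Sketch: print the node-to-node distance matrix the kernel exposes
 * (same data as /sys/devices/system/node/nodeN/distance). Requires
 * libnuma on Linux; link with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int nodes = numa_max_node() + 1;
    for (int i = 0; i < nodes; i++) {
        for (int j = 0; j < nodes; j++)
            printf("%4d", numa_distance(i, j));   /* 10 = local, larger = farther */
        printf("\n");
    }
    return 0;
}
```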

You need controllers, and channels (conductors), and memory organized in banks - and ways to cross the boundaries without too much penalty. But the penalties are substantial on their own and what's worse is if six processors all start contending for one bank.
Contention is pretty straightforwardly addressed via caches and queues, but there are further tricks, like prefetching. None of this is specific to NUMA, either. Even more so than with the rationale, you quickly get bogged down if you over-specify what NUMA actually means, implementation-wise.
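As a small illustration of the prefetching trick (my own sketch; __builtin_prefetch is a GCC/Clang builtin, and the 16-element lookahead is an arbitrary, untuned choice), you can hint the next chunk of an array into cache while still working on the current one:

```c
/* Sketch: software prefetch as one of the "further tricks" for hiding
 * memory latency. The lookahead distance (16 elements) is illustrative,
 * not a tuned value. Works with GCC/Clang. */
#include <stddef.h>

long sum_with_prefetch(const long *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 1);  /* read hint, low temporal locality */
        sum += data[i];
    }
    return sum;
}
```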
 
Last edited: