News Intel's Diamond Rapids will use LGA9324 packaging

Admin · Aug 21, 2024

Intel's Oak Stream platform to adopt 9324-pin packaging for Diamond Rapids CPU.

Intel's Diamond Rapids will use LGA9324 packaging : Read more

Eximo · Aug 21, 2024

Over 9000!

It's Over 9000! - Wikipedia

en.wikipedia.org

slightnitpick · Aug 21, 2024

Eximo said:
Over 9000!

ENIAC had over 5 Million hand-soldered joints. I'm not impressed.

hotaru251 · Aug 21, 2024

the post made me realize something...idk how many contact pins Epyc has....only just realized amd doesnt make it as easy as intel to know their count anyone know? (one thing intel does well is the socket name giving the info)

Eximo · Aug 21, 2024

4094 for the older sockets. Latest two use 4844 pins

https://en.wikipedia.org/wiki/Socket_SP3

https://en.wikipedia.org/wiki/Socket_sTR5

https://en.wikipedia.org/wiki/Socket_SP6

emike09 · Aug 21, 2024

Bring this to the next HEDT platform!

Eximo · Aug 21, 2024

slightnitpick said:
ENIAC had over 5 Million hand-soldered joints. I'm not impressed.

I guess, not quite the same thing. The number of interconnects inside of a modern CPU is presumably a staggeringly large number, particularly if you just go by transistor count.

Each Zen 4 CCD has 6.57 billion transistors, the big chips have 12 of them.

Eximo · Aug 21, 2024

emike09 said:
Bring this to the next HEDT platform!

HEDT isn't usually the path for the the truly massive CPUs. At that level you would just buy the enterprise platform.

HEDT is aimed at a price point between consumer and commercial/enterprise. Several thousand dollars, not tens of thousands per chip.

bit_user · Aug 21, 2024

Eximo said:
HEDT isn't usually the path for the the truly massive CPUs. At that level you would just buy the enterprise platform.

HEDT is aimed at a price point between consumer and commercial/enterprise. Several thousand dollars, not tens of thousands per chip.

Somebody must've blown the breaker on your humor circuit. I detected a distinct note of sarcasm in that post. Either way, it'd be pretty ridiculous!
: D

bit_user · Aug 21, 2024

The article said:
also up from the 7,529 contacts used by Intel's Xeon 5 Granite Rapids and Sierra Forest processors.

Oops, no. These are Xeon 6.

It's not very consistent how that Intel is calling 6th gen Xeon SP like that. I'm sure it's already caused some confusion with people thinking it's referring to the thousands digit of model numbers, instead of the generation.

Eximo · Aug 21, 2024

bit_user said:
Somebody must've blown the breaker on your humor circuit. I detected a distinct note of sarcasm in that post. Either way, it'd be pretty ridiculous!
: D

Meant to be a response in kind in a sense. We didn't exactly start this thread off on a serious note.

JRStern · Aug 21, 2024

So, this would be a monolithic package of how many chips inside?
Is this a good idea?
Would anything be lost if it was split into four or eight smaller packages?

Amdlova · Aug 21, 2024

At this rate you will need put a motherboard into your cpu...

bit_user · Aug 21, 2024

JRStern said:
So, this would be a monolithic package of how many chips inside?

Monolithic... package? I don't understand the question. When are these packages ever not monolithic? It will obviously contain multiple dies/tiles/chiplets, but I assume you weren't asking about that.

JRStern said:
Is this a good idea?

Intel seems to think so. The downside is that the more components you cram into a single package, the greater the chance of failures. And if it does experience a catastrophic failure, the cost of replacing the whole thing is quite high.

On the other hand, if just a single core fails, you can simply take it offline. I recently saw some news about Intel working on field diagnostics, which are probably becoming a necessity for continuing to scale individual CPUs like this.

JRStern said:
Would anything be lost if it was split into four or eight smaller packages?

I/O bandwidth between packages is much lower, as well as the energy per bit transferred being higher. That's the main reason. If you keep all chips on the same package, you can use much wider, faster, and more efficient links.

JRStern · Aug 22, 2024

bit_user said:
Monolithic... package? I don't understand the question. When are these packages ever not monolithic? It will obviously contain multiple dies/tiles/chiplets, but I assume you weren't asking about that.

Intel seems to think so. The downside is that the more components you cram into a single package, the greater the chance of failures. And if it does experience a catastrophic failure, the cost of replacing the whole thing is quite high.

On the other hand, if just a single core fails, you can simply take it offline. I recently saw some news about Intel working on field diagnostics, which are probably becoming a necessity for continuing to scale individual CPUs like this.

I/O bandwidth between packages is much lower, as well as the energy per bit transferred being higher. That's the main reason. If you keep all chips on the same package, you can use much wider, faster, and more efficient links.

"Monolithic" is the (old) term for stuffing many chips under one cover. Even so, maybe you can take it too far. If you have a 128 cores, how much do they even want or need to talk to each other? Maybe keeping "just" 16 or 32 of them in a package would cover 90% of any need, if there is any need? With mobs of cores like that they are still hardwired to whichever L1, L2 caches they are using so are already somewhat constrained on memory and data fragmentation or duplication, so putting them all under a single cover is already misleading.

bit_user · Aug 22, 2024

JRStern said:
maybe you can take it too far. If you have a 128 cores, how much do they even want or need to talk to each other?

Modern software (both kernel and userspace) assumes memory accesses are cache-coherent. This means a certain amount of traffic between the cores to support cache coherency, even when they're not actually exchanging data.

JRStern said:
Maybe keeping "just" 16 or 32 of them in a package would cover 90% of any need, if there is any need?

For at least 5 years or so, AMD has supported partitioning EPYC CPUs into separate NUMA domains. I'm not sure, but I think it could be simply talking about de-interleaving memory accesses to an extent. Here's what the Zen 4 EPYC doc says about it:

"Beginning with AMD EPYC 7002 Series processors, non-uniform latency was reduced dramatically by locating memory controllers onto the I/O die. NUMA domains were flattened more by the move to 32 MB of L3 cache in the EPYC 7003 Series. In 4th Gen EPYC processors, optimizations to the Infinity Fabric interconnects reduced latency differences even further.

Using EPYC 9004 Series processors, for applications that need to squeeze the last one or two percent of latency out of memory references, creating an affinity between memory ranges and CPU dies (‘Zen 4’ or ‘Zen 4c) can improve performance. Figure 9 illustrates how this works. If you divide the I/O die into four quadrants for an ‘NPS=4’ configuration, you will see that six DIMMs feed into three memory controllers, which are closely connected via Infinity Fabric (GMI) to a set of up to three ‘Zen 4’ CPU dies, or up to 24 CPU cores.

Most applications don’t need to be concerned about using NUMA domains, and using the AMD EPYC processor as a single domain (NPS=1) gives excellent performance. The AMD EPYC 9004 Architecture Overview provides more details on NUMA configurations and tuning suggestions for specific applications"

Source: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf

I'm pretty sure that's just a memory-partitioning trick and assumes the OS is careful not to migrate threads between NUMA domains (otherwise, performance penalties would result).

I think the main argument for packing so many cores in a single CPU is that it allows cloud operators to pack more cores per rack unit. Also, each physical machine incurs some management overhead for them. So, it works out to a cost-savings for them.

There are also some workloads that really want lots of cores and/or lots of memory capacity or bandwidth. If you're a cloud operator, you can charge a premium for these instances, that wouldn't be possible on smaller CPUs, and still fill out any remaining capacity on the machine with smaller VMs that have much more modest requirements.

JRStern said:
With mobs of cores like that they are still hardwired to whichever L1, L2 caches they are using so are already somewhat constrained on memory and data fragmentation or duplication, so putting them all under a single cover is already misleading.

Misleading? What's misleading about having distributed caches? Go back as far as you like, even to the days before multi-core CPUs. Back then, each CPU had its own cache(s). There was never a point, in probably at least the last 3 decades, when a multi-core system didn't have such a cache architecture.

bit_user · Aug 22, 2024

I will add that, absent cloud - or, at least the concept of VM hosting - it's not clear to me that CPUs would've scaled to such core counts. I don't know if there's enough demand for having so many cores visible to a single application that it would justify having such large CPUs, especially when we can still scale by adding more CPU packages. Some of Intel's CPUs scale up to 8 sockets and I think AMD's would at least scale beyond 2S if they couldn't so easily just add more cores per package or the demand for doing so were less.

JRStern · Aug 22, 2024

bit_user said:
Modern software (both kernel and userspace) assumes memory accesses are cache-coherent. This means a certain amount of traffic between the cores to support cache coherency, even when they're not actually exchanging data.
...

Presume you need the NUMA domains even with all the processors in one package.

Back about, oh, twenty years ago, the standard servers had two or more processor chips on single motherboards. I never did look into just how they did cache consistency. More than that I've done a lot of work with SQL Server that really used multiple cores/threads well, and allowed NUMA partitions, and "soft NUMA", and Ghu knows what else, but there was never any data on just how that was really being done. And then when wrapped into VMs as well, OMG.

bit_user · Aug 23, 2024

JRStern said:
Presume you need the NUMA domains even with all the processors in one package.

What NUMA actually means is somewhat fuzzy. As the above text I quoted from the Zen 4 EPYC manual says, the penalty for accessing the farthest memory controller wasn't terribly big and has only gotten smaller.

The main thing that's NUMA about modern multiprocessor servers is how each CPU has its own memory controller(s). Until the end of the Core 2 era, Intel had a centralized memory controller in the North Bridge chip.

JRStern said:
Back about, oh, twenty years ago, the standard servers had two or more processor chips on single motherboards.

I think 2P is still the standard server workhorse. Most of Intel and AMD's server CPU models support at least 2P configurations. As for the ARM world, it's the same with Ampere Altra (and I presume AmpereOne) and Nvidia's Grace.

JRStern · Aug 25, 2024

bit_user said:
What NUMA actually means is somewhat fuzzy. As the above text I quoted from the Zen 4 EPYC manual says, the penalty for accessing the farthest memory controller wasn't terribly big and has only gotten smaller.

The main thing that's NUMA about modern multiprocessor servers is how each CPU has its own memory controller(s). Until the end of the Core 2 era, Intel had a centralized memory controller in the North Bridge chip.

I think 2P is still the standard server workhorse. Most of Intel and AMD's server CPU models support at least 2P configurations. As for the ARM world, it's the same with Ampere Altra (and I presume AmpereOne) and Nvidia's Grace.

NUMA is an attempt to have multiple channels to separate banks of memory to avoid collisions, waiting, and other bandwidth problems. You need controllers, and channels (conductors), and memory organized in banks - and ways to cross the boundaries without too much penalty. But the penalties are substantial on their own and what's worse is if six processors all start contending for one bank. Again, I wish I'd ever seen anything detailed about how SQL Server handled NUMA boundaries, but never seen a word of that - you can ask for this or that, but without knowing how the system is organized.

bit_user · Aug 25, 2024

JRStern said:
NUMA is an attempt to have multiple channels to separate banks of memory to avoid collisions, waiting, and other bandwidth problems.

I think even that's an overcomplicated definition. At its heart, what it says is that the path isn't symmetrical between all processing elements and all memories. This functional definition hints at the reasons you list for pursuing such an architecture, but there are others (e.g. modularity) and the rationale for NUMA really is an adjunct, as there are other potential solutions for addressing those motives.

JRStern said:
You need controllers, and channels (conductors), and memory organized in banks - and ways to cross the boundaries without too much penalty. But the penalties are substantial on their own and what's worse is if six processors all start contending for one bank.

Contention is pretty straight-forwardly addressed via caches and queues, but there are further tricks like prefetching. None of this is specific to NUMA, either. Even more so than with rationale, you quickly get bogged down if you over-specify what NUMA actually means, implementation-wise.

News Intel's Diamond Rapids will use LGA9324 packaging

Administrator

Titan

Upstanding

Splendid

Titan

Distinguished

Titan

Titan

Titan

Titan

Titan

Distinguished

Distinguished

Titan

Distinguished

Titan

Titan

Distinguished

Titan

Distinguished

Titan

Share this page