News Intel Launches Granite Rapids Xeon 6900P series with 128 cores — matches AMD EPYC’s core counts for the first time since 2017

atmapuri

Distinguished
Sep 26, 2011
20
6
18,515
The question remains whether Intel still plans to sub-license parts of the logic on these new chips, as was the plan with the previous generation: one license for AVX-512, one license for the Neural Engine, etc. Or will this also be direct price competition with AMD?

Even if the product is better, that alone will not be enough if the CPU, the platform, the memory, etc. cost 2x more than the competition.
 

JRStern

Distinguished
Mar 20, 2017
172
64
18,660
I have a hundred very geeky questions about how these high core counts work in common IT scenarios, having been a performance tuner for medium to large-scale systems over the past decade or two, and having seen how many problems there are with "noisy neighbors" on shared systems, plus or minus VMs all over the place. It makes it devilishly hard to get performance data to even analyze, when your performance can vary 2x or 10x because the neighbors are throwing a party, and with 128 neighbors it seems like someone is always throwing a party.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
This time Intel did a great job: Phoronix
Yeah, but you have to keep in mind that those are comparing against AMD's previous generation EPYCs - not the Zen 5 they're about to launch next month. Plus, you should consider that these are using 500 W each, whereas Genoa is normally limited to just 360 W.

I did skim through the benchmarks on Phoronix. AMD still got some wins, and the geomean is probably a little closer than I would've expected. It should be easy for Turin to retake the lead, overall.

STH also has benchmarks posted:

 

bit_user

Titan
Ambassador
The question remains whether Intel still plans to sub-license parts of the logic on these new chips, as was the plan with the previous generation: one license for AVX-512, one license for the Neural Engine, etc. Or will this also be direct price competition with AMD?
Do you mean like requiring people to pay extra to use those features? Intel did indeed roll out "Intel On Demand", but it's not at the granularity of what you're saying. Instead, it applied to more specialized accelerators.

Even if the product is better, that alone will not be enough if the CPU, the platform, the memory, etc. cost 2x more than the competition.
Yeah, these look huge. Cost and power should be significant considerations. So far, I have yet to see power consumption data, so all we have to go on is their 500 W TDP.
 
  • Like
Reactions: cyrusfox

bit_user

Titan
Ambassador
I have a hundred very geeky questions about how these high core counts work in common IT scenarios, having been a performance tuner for medium to large-scale systems over the past decade or two, and having seen how many problems there are with "noisy neighbors" on shared systems, plus or minus VMs all over the place. It makes it devilishly hard to get performance data to even analyze, when your performance can vary 2x or 10x because the neighbors are throwing a party, and with 128 neighbors it seems like someone is always throwing a party.
Well, they have some QoS mechanisms, which can be used to keep other cores from hogging too much L3 cache, for instance. I'm not sure if memory bandwidth is budgeted in a similar way.

AMD might have some advantages in this area, given how their L3 is segmented. If your VMs are generally small enough to fit on individual CCDs, then you'd probably tend to find EPYC scales better. However, I don't have any direct experience in the matter.
 
  • Like
Reactions: JRStern
Seems like the extra bandwidth from the MCRDIMMs ought to give Intel an edge AMD won't be able to match this generation, at least for appropriate workloads. I'm looking forward to seeing where Zen 5 stands out, since we're finally getting matched core counts and TDPs. I have no doubt there will be workloads that the new architecture just runs away with.
So far, I have yet to see power consumption data, so all we have to go on is their 500 W TDP.
STH cited a peak estimate of 1.2 kW for their 2S reference system, before cooling, and said the CPUs can absolutely use all 500 W. They also had an Intel-provided slide showing power/performance scaling over the curve. I imagine we won't have meaningful numbers until OEM systems are out, though.

edit: I did enjoy Patrick's commentary on how fussy the system was
 

JRStern

Distinguished
Mar 20, 2017
172
64
18,660
Well, they have some QoS mechanisms, which can be used to keep other cores from hogging too much L3 cache, for instance.
That's interesting, and it should help, but it's going to be tricky as all get-out to manage.
When there are 127 people between you and the buffet, QoS is difficult.
 

bit_user

Titan
Ambassador
That's interesting, and it should help, but it's going to be tricky as all get-out to manage.
When there are 127 people between you and the buffet, QoS is difficult.
This page gives an overview of some (all?) of Intel's QoS technologies. Given that Xen has implemented them, I would assume most other hypervisors have, as well.

Exactly how the hypervisor presents QoS to administrators is another matter, but you can now be aware of some of the technologies, their capabilities and limitations, that are used under the hood.
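
For the curious, here's roughly what driving one of those mechanisms by hand looks like on Linux, which exposes Intel RDT through the resctrl filesystem. This is just a minimal sketch, assuming resctrl is mounted and the CPU supports cache and bandwidth allocation; the group name, PID, bitmask, and percentage are made-up examples.

```python
# Minimal sketch: confining a "noisy" process to a slice of L3 and a
# fraction of memory bandwidth via Linux resctrl (Intel RDT: CAT + MBA).
# Assumes: mount -t resctrl resctrl /sys/fs/resctrl  (root required).
# The mask, percentage, and PID below are illustrative placeholders.
import os

RESCTRL = "/sys/fs/resctrl"

def confine(group: str, pid: int, l3_mask: str = "00f", mb_pct: int = 20) -> None:
    """Give `pid` only the L3 ways in `l3_mask` and ~`mb_pct`% of memory
    bandwidth on cache domain 0."""
    path = os.path.join(RESCTRL, group)
    os.makedirs(path, exist_ok=True)      # creating a directory creates a control group
    with open(os.path.join(path, "schemata"), "w") as f:
        f.write(f"L3:0={l3_mask}\n")      # capacity bitmask: 4 ways of a 12-way cache here
        f.write(f"MB:0={mb_pct}\n")       # memory-bandwidth throttle, in percent
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))                 # moving the task applies the budget

confine("noisy_tenant", pid=12345)
```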
 
  • Like
Reactions: JRStern
Honestly... these would need to be a good 15-20%+ better than AMD's EPYC.

AMD's are MUCH more power efficient, so if these can't be comparable given the huge power difference, server buyers and the like will not buy them, as they cost more long term (in power & cooling).
 

TheHerald

Respectable
BANNED
Feb 15, 2024
1,633
502
2,060
Honestly... these would need to be a good 15-20%+ better than AMD's EPYC.

AMD's are MUCH more power efficient, so if these can't be comparable given the huge power difference, server buyers and the like will not buy them, as they cost more long term (in power & cooling).
They are 3 to 5 times (that's 300 to 500%) faster than EPYC in anything that uses AMX (mainly AI, I think?)
 
Honestly... these would need to be a good 15-20%+ better than AMD's EPYC.

AMD's are MUCH more power efficient, so if these can't be comparable given the huge power difference, server buyers and the like will not buy them, as they cost more long term (in power & cooling).
I wouldn't be so sure about Zen 5 being more efficient in the data center. All available information points towards the top parts having a 500 W TDP as well. Further down the stack might be lower, but there are a lot of varying products from Intel to compete there.
 

bit_user

Titan
Ambassador
I wouldn't be so sure about Zen 5 being more efficient in the data center. All available information points towards the top parts having a 500 W TDP as well.
Efficiency is perf/W. Those 500 W parts will be running 128 Zen 5 cores or 192 Zen 5C cores. So, more power, but also more and wider cores (with the amount of cache scaling linearly per core).

If we just look at TDPs, 500 W represents a 38.9% increase over the Genoa and Bergamo generation. Going from 96 -> 128 cores is 33.3% more cores. Going from 128 -> 192 cores is 50% more. So, in the case of the Zen 5 Turin, per-core performance needs to improve by just 4.2% for it to have equal efficiency as Genoa. In the case of the Zen 5C version, per-core performance can actually decrease and it'd still have the same or better efficiency than Bergamo.
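
To make that arithmetic easy to check, here's the break-even calculation spelled out as a snippet, using just the TDP and core-count figures above. It assumes total performance scales linearly with cores, which is a simplification:

```python
# Break-even per-core performance factor for equal perf/W after a TDP bump
# and a core-count bump. Assumes perf scales with core count (a simplification).
def breakeven_per_core(old_tdp_w, new_tdp_w, old_cores, new_cores):
    return (new_tdp_w / old_tdp_w) / (new_cores / old_cores)

print(breakeven_per_core(360, 500, 96, 128))   # ~1.042 -> +4.2% needed (Genoa -> 128-core Turin)
print(breakeven_per_core(360, 500, 128, 192))  # ~0.926 -> ~7.4% of slack (Bergamo -> 192-core Turin)
```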

We know from the Ryzen 9000 benchmarks, especially those by Phoronix, that Zen 5 is easily capable of those sorts of gains. Overall performance of the 9950X improved 17.8% over the 7950X. Efficiency increased by 22.1%.
 
Mar 19, 2024
33
14
35
Look at how much smarter NVIDIA is about its business compared to our two server CPU makers, Intel and AMD. NVIDIA packs 6-8 GPUs, connected in parallel by a fast link, on each baseboard. Each such board is managed by one, rarely two, 32-64 core CPUs. That is a typical supercomputer node. A block of 8 GPUs is 10-20 times more powerful than the CPU, but it also consumes 10-20 times more electrical power. Obviously, packing 10x more chips on the board, NVIDIA will have 10x larger profit. Who prohibits Intel and AMD from doing the same with these new, much more powerful CPUs, which are now becoming really competitive 1:1 CPU:GPU in the HPC field? What, is AMD afraid of the unlucky number 4 to start the expansion?
 
Last edited:

bit_user

Titan
Ambassador
Look at how much smarter NVIDIA is about its business compared to our two server CPU makers, Intel and AMD.
Did you know that AMD and Intel also make GPUs?

NVIDIA packs 6-8 GPUs, connected in parallel by a fast link, on each baseboard. Each such board is managed by one, rarely two, 32-64 core CPUs.
Not true of Grace-Hopper setups, where the CPUs are enmeshed in the same NVLink fabric as the GPUs.

[Image: Grace Hopper NVLink fabric diagram]


Obviously, packing 10x more chips on the board, NVIDIA will have 10x larger profit.
AMD did this as well.

[Image: AMD Frontier CPU/GPU node]


Who prohibits Intel and AMD from doing the same with these new, much more powerful CPUs, which are now becoming really competitive 1:1 CPU:GPU in the HPC field?
I'm curious whether you've heard about AMD's MI300A, which combines CPU and GPU cores in the same package.

[Image: AMD Instinct MI300A chiplets]
[Image: AMD Instinct MI300A cross-section]
[Image: AMD Instinct MI300A system schematic]


What, is AMD afraid of the unlucky number 4 to start the expansion?
Not sure what you're even referring to here. Zen 4 powered the Genoa, Bergamo, and Siena EPYCs, which have been very successful for them. If you include the Zen+ refresh, then it was actually their 5th generation of EPYCs, and the new Turin will be Gen 6.

Sources:
 

JRStern

Distinguished
Mar 20, 2017
172
64
18,660
This page gives an overview of some (all?) of Intel's QoS technologies. Given that Xen has implemented them, I would assume most other hypervisors have, as well.

Exactly how the hypervisor presents QoS to administrators is another matter, but you can now be aware of some of the technologies, their capabilities and limitations, that are used under the hood.
Thanks. Very interesting. Really NOT the kind of thing you want to do by hand, though.
I'm not sure it should run at the hypervisor level, or at least not only there; an on-prem system with no VMs should still have access to these. Even the statistics are useful, but the tuning probably needs to be dynamic and automated to be really useful.
 
Mar 19, 2024
33
14
35
Did you know that AMD and Intel also make GPUs?


Not true of Grace-Hopper setups, where the CPUs are enmeshed in the same NVLink fabric as the GPUs.

AMD did this as well.

I'm curious whether you've heard about AMD's MI300A, which combines CPU and GPU cores in the same package.

Not sure what you're even referring to here. Zen 4 powered the Genoa, Bergamo, and Siena EPYCs, which have been very successful for them. If you include the Zen+ refresh, then it was actually their 5th generation of EPYCs, and the new Turin will be Gen 6.

Sources:
Thanks for the picturesque response.
Looks like you completely missed my point and the basics behind it.

Point #1 is that neither Intel nor AMD, the main server CPU producers, makes purely CPU-based motherboards with more than 2 CPUs on them, though they could put 4, 6, or 8 there. This would be especially good for regular end users. Intel and AMD do not promote fast ways to connect existing motherboards into a single parallel system for end users. It seems they gave in to the notion, often a myth, that GPUs always beat CPUs in compute power and wall-plug efficiency.

Point #2 is that the typical 8 GPUs on the supercomputer baseboards we tested do not give speedups of more than 5-8x on our tasks compared to the CPU on the same board. These tasks are the usual heavy FP64 vector and scalar simulations, not the recent fashion of AI. And some of the supercomputers are from the TOP500 list (your first picture shows the diagram for a typical one, and the most powerful so far). There are no recent CPU/GPU design incarnations there; when they appear, we will certainly see and test them too.

Point #3 is that the total electric power consumption of these GPUs on the baseboard is essentially on par with the power consumed by the same number of CPUs, if CPUs were used instead of the GPUs.

And it seems you also missed my joke in point #4, about 4 being an unlucky number in some countries as the reason we might not see 4-processor motherboards :) Intel made them years ago, but not anymore.
 
Last edited:
Point #1 is that neither Intel nor AMD, the main server CPU producers, makes purely CPU-based motherboards with more than 2 CPUs on them, though they could put 4, 6, or 8 there. This would be especially good for regular end users. Intel and AMD do not promote fast ways to connect existing motherboards into a single parallel system for end users. It seems they gave in to the notion, often a myth, that GPUs always beat CPUs in compute power and wall-plug efficiency.
Intel has been the only home for the 8S x86 server for over a decade, and 4S has been dead almost as long. It's a very niche market due to the cost and lack of broad usefulness. CPUs, by nature of being generalized compute, can never be as efficient as anything purpose-built. Intel has done custom Xeons for customers with high enough volume and has continued to add accelerators to its products, but these things are just enhancements on the base capability rather than pure replacements for something specialized.
There are no recent CPU/GPU design incarnations there; when they appear, we will certainly see and test them too.
The June 2024 TOP500 list has:
GH200 at #6, though the CPU and GPU are separate dies despite being in the same package.

MI300A at #46, #47, and #48 (these should all be El Capitan test systems, I believe), which is most certainly a combined CPU/GPU part.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
Looks like you completely missed my point and the basics behind it.
Yes, I didn't find it very clear, but thanks for the additional clarifications.

Point #1 is that neither Intel nor AMD, the main server CPU producers, makes purely CPU-based motherboards with more than 2 CPUs on them, though they could put 4, 6, or 8 there.
Like Stryker said, Intel still maintains up to 8-socket scalability, but at a considerable price premium. I think you'll find that it's generally not cost-effective to scale that high, which is why > 2-socket servers aren't popular.

It's getting seriously difficult to fit more than 2 CPUs and all their RAM on a single board. You can find some pictures of AMD server boards with 2 sockets and 48 DIMM slots. There's definitely not room for another two CPUs and their memory. Just eyeballing it, I couldn't say if there'd be enough room to fit 4 of them with only one DIMM per channel, so you might have to look at some kind of blade-based setup.


Keep in mind that Intel's top-end CPUs have the same number of UPI channels, regardless of whether they support only dual-socket or 8-socket configurations. Even in the latest generation, it's still not enough for point-to-point connectivity between 8 CPUs. Also, if your data isn't equally distributed across the other CPUs' memory, you'll hit more inter-processor communication bottlenecks than you would in a 2-CPU configuration, where all of the UPI channels are used to link the two CPUs.
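
To illustrate the topology problem, here's the back-of-the-envelope version. The per-CPU UPI link count is treated as a free parameter, since it varies by SKU and generation; 4 is just an example value:

```python
# A glueless point-to-point topology needs n-1 links per CPU for n sockets.
# The per-CPU link count is a parameter here, not a spec lookup.
def fully_connected(sockets: int, links_per_cpu: int) -> bool:
    return links_per_cpu >= sockets - 1

for n in (2, 4, 8):
    print(f"{n} sockets with 4 UPI links each: "
          f"{'point-to-point' if fully_connected(n, 4) else 'multi-hop required'}")
# 2S and 4S can be fully meshed with 4 links; 8S cannot, so some remote
# accesses must hop through an intermediate socket (extra latency, shared links).
```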

Intel and AMD do not promote fast ways to connect existing motherboards into a single parallel system for end users.
Intel had been pushing OmniPath, with versions of Skylake SP that even had it wired into the CPU package itself. However, the market rejected OmniPath, as it was a proprietary format and required OmniPath networking gear.

As for other high-speed networking options, why do AMD and Intel need to be the ones pushing it? They're both members of the CXL consortium, and have both implemented 1.x support in Sapphire Rapids and Genoa, respectively.

It seems they gave in to the notion, often a myth, that GPUs always beat CPUs in compute power and wall-plug efficiency.
They both have AVX-512. Does that count for nothing? Zen 5 even beefed it up quite considerably, not only implementing it at the native 512-bit width, but also adding more dispatch ports.

Point #2 is that the typical 8 GPUs on the supercomputer baseboards we tested do not give speedups of more than 5-8x on our tasks compared to the CPU on the same board. These tasks are the usual heavy FP64 vector and scalar simulations, not the recent fashion of AI.
Well, GPUs don't only offer more raw compute - they also offer roughly an order of magnitude more memory bandwidth. So, I'd say you're in an interesting position, having a workload that's not only so heavily dependent on FP64, but also has such a high compute-to-memory-bandwidth ratio.
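
A toy roofline model shows why that ratio matters. The peak and bandwidth numbers below are round illustrative placeholders, not any specific product's specs:

```python
# Toy roofline: attainable FLOP/s = min(peak compute, bandwidth * intensity).
# Round placeholder numbers; not real product specs.
def attainable_tflops(peak_tflops, bw_tbps, flops_per_byte):
    return min(peak_tflops, bw_tbps * flops_per_byte)

cpu = dict(peak_tflops=5.0, bw_tbps=0.5)    # hypothetical CPU: 5 FP64 TFLOP/s, 0.5 TB/s
gpu = dict(peak_tflops=50.0, bw_tbps=3.0)   # hypothetical GPU: 50 FP64 TFLOP/s, 3 TB/s

for ai in (1, 10, 100):  # arithmetic intensity, in FLOPs per byte moved
    s = attainable_tflops(**gpu, flops_per_byte=ai) / attainable_tflops(**cpu, flops_per_byte=ai)
    print(f"intensity {ai:>3} FLOP/byte -> GPU/CPU speedup ~{s:.0f}x")
# Bandwidth-bound code tracks the bandwidth ratio (6x here); only very
# compute-dense code approaches the raw compute ratio (10x here).
```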

Plus, it matters specifically which GPUs you benchmarked. Compared to the A100, Nvidia's H100 made a huge leap in fp64 performance, in order to better compete with AMD, which went to full 64-bit registers and 1:1 performance vs. fp32 in CDNA.

If I had an app with poor scaling on GPUs, I'd definitely want to understand exactly why. It could just be as you say - that the task didn't use the GPUs very efficiently (branchy code? limited concurrency?), but it could also be that there's over-synchronization, excessive contention, etc. CPUs are designed to optimize those things a lot better than GPUs, with their weak memory models.

And it seems you also missed my joke in point #4, about 4 being an unlucky number in some countries as the reason we might not see 4-processor motherboards :) Intel made them years ago, but not anymore.
I got the cultural reference about 4 being unlucky, but it wasn't clear to me that you meant it in reference to quad-CPU configurations, rather than counting generations of EPYC CPUs, or something like that.
 
  • Like
Reactions: thestryker