News Intel's unreleased Emerald Rapids CPU impresses in leaked benchmarks — 48-core chips deliver big gains over Sapphire Rapids predecessors

Admin · Nov 25, 2023

Leaked benchmarks of the upcoming Xeon Platinum 8551C and 8558P (Emerald Rapids) processors have emerged.

Intel's unreleased Emerald Rapids CPU impresses in leaked benchmarks — 48-core chips deliver big gains over Sapphire Rapids predecessors : Read more

thestryker · Nov 25, 2023

It's a pretty safe bet that EMR will clock higher than SPR given that it uses refreshed cores and the XCC CPUs are two tiles instead of four. Hopefully we'll get full core details on December 14th when the launch is supposed to be.

My dream is enough cheaper w25xx than w24xx that I might be able to justify the purchase.

bit_user · Nov 25, 2023

thestryker said:
It's a pretty safe bet that EMR will clock higher than SPR given that it uses refreshed cores and the XCC CPUs are two tiles instead of four. Hopefully we'll get full core details on December 14th when the launch is supposed to be.

Faster memory will also be a big win, given that it's only running on 8 channels. We saw how much HBM helped certain workloads, so there are definitely memory bottlenecks at play.

Regarding the number of tiles, are you aware of analysis of the scaling from the single-tile to the quad-tile version of SPR?

thestryker · Nov 25, 2023

bit_user said:
Regarding the number of tiles, are you aware of analysis of the scaling from the single-tile to the quad-tile version of SPR?

I never saw much in the way of anything on the MCC Xeons at all really. Chips and Cheese only did their coverage on XCC as far as I know.

I know all the split controllers and the required 10 EMIB connections were a big manufacturing problem. I'm not sure there will be a performance advantage (beyond the obvious clocks/DRAM bandwidth) to the two tiles vs four, but there will be a notable improvement to the manufacturing side (mostly packaging).

For some EMR details (it could change from here, but I doubt there will be any significant differences) https://www.semianalysis.com/p/intel-emerald-rapids-backtracks-on

DavidC1 · Nov 26, 2023

Calling Emerald Rapids and Raptorlake Refresh as both being a Refresh is a big stretch isn't it?

Emerald Rapids changes tile configuration and even number of tiles while Raptorlake Refresh barely increases clocks.

@thestryker According to Semianalysis, he's hinting that the new configuration is indeed to aim for better performance. That's why they went to a bigger two-Tile setup despite costing even more to manufacture than the four Tile SPR.

While some applications will only benefit from other aspects, many other server applications will benefit from the low-level changes the changed config brings.

bit_user · Nov 26, 2023

DavidC1 said:
Calling Emerald Rapids and Raptorlake Refresh as both being a Refresh is a big stretch isn't it?

Yeah, I was on the fence about nitpicking that point, but decided the article went into enough detail that people would hopefully understand the distinction.

DavidC1 said:
Emerald Rapids changes tile configuration and even number of tiles while Raptorlake Refresh barely increases clocks.

Based on what I've seen, Raptor Refresh doesn't even deserve to be called a refresh. If it's the same silicon as before, then it's a rebrand. That's all.

As for Emerald Rapids, does it have larger caches? Or are those unchanged from Sapphire Rapids.

thestryker · Nov 26, 2023

bit_user said:
As for Emerald Rapids, does it have larger caches? Or are those unchanged from Sapphire Rapids.

Cache varies by SKU, but the top end EMR has much more L3 cache than anything on SPR (6.25MB/core vs 2.625MB/core).

DavidC1 said:
@thestryker According to Semianalysis, he's hinting that the new configuration is indeed to aim for better performance. That's why they went to a bigger two-Tile setup despite costing even more to manufacture than the four Tile SPR.

The wafer cost is higher, but packaging cost is much lower.

bit_user · Nov 26, 2023

thestryker said:
Cache varies by SKU, but the top end EMR has much more L3 cache than anything on SPR (6.25MB/core vs 2.625MB/core).

Nice!

I'll bet folks at Intel have a little uptick in their blood pressure every time they see claims about AMD's large L3 cache figures, even though it's segmented and not unified like Intel's L3 caches. I think AMD's is probably more like a L2.5 cache, on their multi-CCD CPUs?

thestryker said:
The wafer cost is higher, but packaging cost is much lower.

Relatively speaking, but if it costs Intel that much more to put 4 tiles per package than 2, it seems like they haven't mastered Foveros quite like they've been portraying.

thestryker · Nov 26, 2023

bit_user said:
Relatively speaking, but if it costs Intel that much more to put 4 tiles per package than 2, it seems like they haven't mastered Foveros quite like they've been portraying.

I don't believe SPR employs Foveros it's just EMIB connections (EMR is 3 and SPR is 10) and that's a physical limitation where fewer is better no matter what.

bit_user · Nov 26, 2023

thestryker said:
I don't believe SPR employs Foveros it's just EMIB connections

Sorry, I get the terminology mixed up.

thestryker said:
(EMR is 3 and SPR is 10) and that's a physical limitation where fewer is better no matter what.

Uh, 10 for the Xeon Max models, with HBM? Otherwise, I don't see how you get to 10.

thestryker · Nov 26, 2023

bit_user said:
Uh, 10 for the Xeon Max models, with HBM? Otherwise, I don't see how you get to 10.

Nope, the Max are 14.

10x EMIB on Sapphire Rapids

Sapphire Rapids is going to be using four tiles connected with 10 EMIB connections using a 55-micron connection pitch. Normally you might think that a 2x2 array of tiles would need equal EMIBs per tile-to-tile connection, so in this case with 2 EMIBs per connection, that would be eight – why is Intel quoting 10 here? That comes down to the way Sapphire Rapids is designed.

Because Intel wants SPR to look monolithic to every operating system, Intel has essentially cut its inter-core mesh horizontally and vertically. That way each connection through the EMIB is seen purely as the next step on the mesh. But Intel’s monolithic designs are not symmetric in either of those dimensions – usually features like the PCIe or QPI are on the edges, and not in the same place in every corner. Intel has told us that in Sapphire Rapids, this is similarly the case, and one dimension is using 3 EMIBs per connection while the other dimension is using 2 EMIBs per connection.

Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles

www.anandtech.com

bit_user · Nov 26, 2023

thestryker said:
Nope, the Max are 14.

Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles

www.anandtech.com

Thanks. I thought you were counting the number of chiplets, not the number of connections.

Why is connection-count the dominant cost factor, rather than the actual number contact balls/pads or the number of chiplets?

thestryker · Nov 26, 2023

bit_user said:
Why is connection-count the dominant cost factor, rather than the actual number contact balls/pads or the number of chiplets?

There hasn't been enough information put out publicly (probably because TSMC hasn't fielded their version yet), but from what has the predominant problematic part is that the EMIB bridge has to be embedded in the substrate. There's some failure percentage when you connect the chips to the bridge as well and while it is likely extremely small the failure rate is multiplied with number of connections.

You're also absolutely right about number of ~~chiplets~~ tiles (edit: I should use Intel terminology here) and contacts having a part to play in packaging costs as well and in theory these costs should be lower.

bit_user · Nov 26, 2023

thestryker said:
You're also absolutely right about number of ~~chiplets~~ tiles (edit: I should use Intel terminology here)

Heh, I even thought to use "tiles", but what I ended up typing was "chiplets".
😅

JayNor · Nov 27, 2023

"These numbers won't make Emerald Rapids an AMD EPYC Genoa killer (or even tied with Genoa if we're being realistic)"

SPR-HBM is the Genoa killer for AI inference processing.

Similar story for the SPR-EE for vRAN.

bit_user · Nov 27, 2023

JayNor said:
SPR-HBM is the Genoa killer for AI inference processing.

Similar story for the SPR-EE for vRAN.

Yes, they certainly have niches where they do well.

The AI Inferencing wins depend mostly on AMX. Sadly, the current Xeon Max doesn't offer enough HBM to hold the larger LLMs.

TerryLaze · Nov 28, 2023

bit_user said:
Yes, they certainly have niches where they do well.

The niche in datacenter/server and so on is pure compute code (number crunching) that only works on x86 cores and can't be ported to arm or gpu...
And that has been the case for many years now.

DavidC1 · Nov 28, 2023

bit_user said:
Nice!

I'll bet folks at Intel have a little uptick in their blood pressure every time they see claims about AMD's large L3 cache figures, even though it's segmented and not unified like Intel's L3 caches. I think AMD's is probably more like a L2.5 cache, on their multi-CCD CPUs?

Sapphire Rapids is 1.875MB/core, while Emerald Rapids is 5MB/core.

Intel "Emerald Rapids" Doubles Down on On-die Caches, Divests on Chiplets

Finding itself embattled with AMD's EPYC "Genoa" processors, Intel is giving its 4th Gen Xeon Scalable "Sapphire Rapids" processor a rather quick succession in the form of the Xeon Scalable "Emerald Rapids," bound for Q4-2023 (about 8-10 months in). The new processor shares the same LGA4677...

www.techpowerup.com

Also since Skylake it hasn't been unified for L3. It's L1/L2 + L3.

For being on the same process it's a very decent improvement. Only downside is the basis for improvement is Sapphire Rapids. From what I hear the development time was very short so the team is doing much better. With Sierra Forest and Granite Rapids they will be in a much better position.

Sapphire Rapids in addition to making it very complex Intel fired most of the validation team back in the Kraznich era, hence the troubles.

thestryker · Nov 28, 2023

DavidC1 said:
Also since Skylake it hasn't been unified for L3. It's L1/L2 + L3.

Pretty sure bit_user is talking about the difference between having one L3 connected to all cores and each CCD having its own L3 or in the case of SPR connected via EMIB which is significantly faster than cross CCD.

JayNor · Nov 28, 2023

"is pure compute code (number crunching) that only works on x86 cores and can't be ported to arm or gpu..."

perhaps you need to do a google search on oneAPI.

TerryLaze · Nov 29, 2023

JayNor said:
"is pure compute code (number crunching) that only works on x86 cores and can't be ported to arm or gpu..."

perhaps you need to do a google search on oneAPI.

OneAPI takes code and compiles it for the most relevant hardware, be it cpu,gpu,fpga, or any other, or all of them. It confirms my point, it exists because most heavily parallel software doesn't run as well on x86 cores as it does on anything else.
OneAPI is the exact opposite of something only working on x86.

News Intel's unreleased Emerald Rapids CPU impresses in leaked benchmarks — 48-core chips deliver big gains over Sapphire Rapids predecessors

Administrator

Judicious

Titan

Judicious

Distinguished

Titan

Judicious

Titan

Judicious

Titan

Judicious

Titan

Judicious

Titan

Honorable

Titan

Titan

Distinguished

Judicious

Honorable

Titan

Share this page