Discussion Xeon Max HBM2e vs DDR5 benchmarks are in!

bit_user

Titan
Ambassador
Phoronix just posted benchmarks of the Xeon Max with 64 GiB of HBM2e, in both caching and exclusive modes.

Some of us have been talking a lot about the prospect of HBM in CPUs (@InvalidError , @Kamen Rider Blade ). Exciting prospects of things to come (hopefully)!

Here's the GeoMean (sorry, can't embed image):

It shows an 18.5% to 20.4% advantage (depending on CPU model) for HBM2e-exclusive mode vs. 8-channel DDR5-only. An important caveat is that this is only an average over select HPC and AI benchmarks, and doesn't attempt to characterize performance across a broader range of server workloads.
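For anyone unfamiliar, the GeoMean is just the n-th root of the product of the per-benchmark speedups, so one outlier test can't dominate the average the way it would in an arithmetic mean. A minimal sketch with made-up ratios (not the actual Phoronix result set):

```python
import math

# Hypothetical per-benchmark speedups (HBM2e-exclusive vs. DDR5-only);
# the real result set is much larger, these values are made up.
speedups = [1.05, 1.32, 1.02, 1.41, 1.18]

# Geometric mean: n-th root of the product of the ratios.
geomean = math.prod(speedups) ** (1 / len(speedups))
print(f"GeoMean speedup: {geomean:.3f} (~{(geomean - 1) * 100:.1f}% advantage)")
```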

Probably more interesting is the system power consumption, where increased HBM2e usage correlated with slightly lower power draw.

I'm not surprised by that, but I wouldn't have bet on it since increasing bandwidth should decrease idle cycles of the cores (hence, the faster performance). So, the fact that it can improve performance while slightly decreasing power consumption is interesting.

Then again, I'm betting the CPUs spent a lot of time being power-limited. So, all it had to do was reduce power in the small number of cases that weren't already bumping the upper limit. The story could still be different for a desktop CPU with HBM-class memory.
 

Deleted member 2731765

Guest
Interesting! Wait, I'm going through this now.
 

Deleted member 2731765

Guest
Okay. Impressive performance. How about doing a latency comparison between the two?

The tests also show slightly higher power usage as measured by the RAPL/PowerCap sysfs interface? I don't think that's accurate, even when using HBM-only mode.
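For reference, the measurement in question is just the per-package energy counters that the powercap/RAPL driver exposes under sysfs. A rough sketch of reading them (zone layout varies by kernel and platform, and the counters may be readable only as root):

```python
import time
from pathlib import Path

# Average package power from the RAPL/PowerCap sysfs interface.
# Assumes the common intel-rapl zone layout; adjust paths as needed.
WINDOW_S = 5
zones = [z for z in Path("/sys/class/powercap").glob("intel-rapl:*")
         if (z / "name").read_text().startswith("package")]

before = {z: int((z / "energy_uj").read_text()) for z in zones}
time.sleep(WINDOW_S)
after = {z: int((z / "energy_uj").read_text()) for z in zones}

for z in zones:
    delta_uj = after[z] - before[z]   # counter wraparound ignored for brevity
    name = (z / "name").read_text().strip()
    print(f"{name}: {delta_uj / WINDOW_S / 1e6:.1f} W average over {WINDOW_S} s")
```

Keep in mind RAPL only covers the package domains (plus DRAM, where exposed), so it can legitimately diverge from wall-socket measurements.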

Also, Intel should lift the 64 GB limit on addressable system memory per socket in HBM-only mode, if not for the 1LM and 2LM modes.

FWIW, Intel had a similar memory scheme in the Xeon Phi x200, where the fast memory was MCDRAM rather than HBM: basically the same "Flat" vs. "Cache" modes, among others. I can also see a similarity with the tiered DRAM caching used with Intel Optane DIMMs (RIP Optane!).

Speaking of the Xeon Max series in general, I think the sub-NUMA and UMA clustering used in its design allows cores to work directly with the HBM and local DDR5 attached to the same compute tile as the x86 cores.

So apps can minimize data movement across the chip and thereby increase performance.
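In flat/1LM mode, the HBM typically shows up as CPU-less NUMA nodes, so software can target it explicitly. A rough sketch for spotting which nodes are HBM, assuming the usual sysfs layout (node numbering and sizes are platform- and SNC-setting-specific):

```python
from pathlib import Path

# List NUMA nodes with their sizes and attached CPUs, to tell the
# CPU-less HBM nodes apart from the DDR5 nodes in flat/1LM mode.
nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"),
               key=lambda p: int(p.name[4:]))
for node in nodes:
    meminfo = (node / "meminfo").read_text()
    total_kb = int(next(line for line in meminfo.splitlines()
                        if "MemTotal" in line).split()[3])
    cpulist = (node / "cpulist").read_text().strip()
    kind = f"CPUs {cpulist}" if cpulist else "no CPUs (likely HBM)"
    print(f"{node.name}: {total_kb / 2**20:.1f} GiB, {kind}")
```

From there, a job can be pinned with numactl --cpunodebind/--membind so its memory lands on the HBM node closest to the cores it's running on.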

And since the Xeon Max series scales to only 56 cores at 350 W, it seems Intel is trading maximum core count for HBM. But it's good to know that Xeon Max will use the full 4-compute-tile design across all of its SKUs, as that's required for the four stacks of HBM2e.

Xeon Max 9468 is my pick though, despite the lower core count.
 

bit_user

Titan
Ambassador
Okay. Impressive performance. How about doing a latency comparison between the two?
Would be nice, but Phoronix tends to focus on application benchmarks rather than synthetics or microbenchmarking. Hopefully, ChipsAndCheese will get access to one before too long. It probably depends on there being cloud instances.

I can also see a similarity with the tiered DRAM caching used with Intel Optane DIMMs (RIP Optane!).
Optane is only relevant in that context because Intel anticipated better capacity scaling. In the CXL era, you can easily scale out CXL.mem capacity even beyond how much Optane you could deploy.

it's good to know that Xeon Max will use the full 4-compute-tile design across all of its SKUs, as that's required for the four stacks of HBM2e.
I think the kicker for this approach to chiplets is higher power utilization. That's probably why AMD moved away from mesh-like connectivity between chiplets and toward a hub-and-spoke model.

FWIW, I do slightly prefer the mesh approach, particularly because clever software partitioning can yield better latency by providing local and virtually-exclusive access.