News Intel Claims Sapphire Rapids up to 7X Faster Than AMD EPYC Genoa in AI and Other Workloads

FWIW, the compiler side of the initial AMX enablement work was already started in GCC 11, and has also been part of LLVM 12 since its late 2020 release. That being said, AMX is just a matrix math overlay for the AVX-512 vector math units.

More like a “TensorCore” type unit for the CPU.
 
AMX is, so far, only BF16 and INT8 matrix multiplications and nothing else (technically dot products).
BUT it is foundational, with more variants coming in the pipeline (reading Intel's docs): FP16, etc.
They are also talking about complex-valued data types in Granite Rapids and beyond.

What makes matrix multiplication so important in AI is that it is the most expensive operation in the innermost loop of training, dominating both the compute complexity and, importantly, the memory bandwidth.

By reducing the matrix multiplication to a handful of assembly instructions:
1 tile load instruction: Throughput/Latency = 8/45
1 tile multiplication instruction: Throughput/Latency = 16/52
1 tile store (result): Throughput/Latency = 16

you achieve orders-of-magnitude improvements in overall latency and throughput:
1. You only read the source matrix data once, not multiple times as with any other method (AVX, AVX2, AVX-512, ...).
2. At 8+16+16 = 40 clock cycles for a tile multiplication (a small 16x16 matrix), it is an order of magnitude faster than any other method on a CPU.

It is an incredibly good use of silicon for AI apps.
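
For anyone curious what that sequence looks like from C++, here is a minimal sketch using the AMX intrinsics in <immintrin.h>. Assumptions not from this thread: a Linux 5.16+ kernel (which requires an explicit opt-in before tile state can be used), a single 16x16 output tile, and B already re-packed into the pair-interleaved layout the TMUL instruction expects; the arch_prctl constants come from the kernel documentation, so verify them locally.

```cpp
// Minimal sketch (not production code): one 16x16 BF16 tile multiply with the
// AMX intrinsics, accumulating into FP32.
// Build (hypothetical file name): g++ -O2 -mamx-tile -mamx-bf16 amx_sketch.cpp
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// 64-byte tile-configuration block consumed by LDTILECFG (_tile_loadconfig).
struct alignas(64) TileConfig {
    uint8_t  palette_id;      // palette 1 = the 8x 1 KB tile register file
    uint8_t  start_row;
    uint8_t  reserved0[14];
    uint16_t colsb[8];        // bytes per row for tmm0..tmm7
    uint8_t  reserved1[16];
    uint8_t  rows[8];         // rows for tmm0..tmm7
    uint8_t  reserved2[8];
};

int main() {
    // Linux gates the 8 KB of tile state behind an explicit opt-in
    // (constants per the kernel's x86 docs; assumption, double-check them).
    constexpr long ARCH_REQ_XCOMP_PERM = 0x1023;
    constexpr long XFEATURE_XTILEDATA  = 18;
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return 1;   // no AMX, or permission denied

    // tmm0 = FP32 accumulator (16 rows x 16 cols x 4 B = 64 B per row),
    // tmm1 = A tile (16 rows x 32 BF16 = 64 B per row),
    // tmm2 = B tile (BF16, already re-packed into the pair-interleaved layout).
    TileConfig cfg{};
    cfg.palette_id = 1;
    for (int t = 0; t < 3; ++t) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }
    _tile_loadconfig(&cfg);

    alignas(64) uint16_t A[16 * 32] = {};   // BF16 bit patterns
    alignas(64) uint16_t B[16 * 32] = {};   // BF16, pre-packed for TDPBF16PS
    alignas(64) float    C[16 * 16] = {};

    _tile_zero(0);                // clear the accumulator tile
    _tile_loadd(1, A, 64);        // one tile load for A (stride = 64 bytes/row)
    _tile_loadd(2, B, 64);        // one tile load for B
    _tile_dpbf16ps(0, 1, 2);      // tmm0 += A * B: the single TMUL instruction
    _tile_stored(0, C, 64);       // one tile store for the FP32 result
    _tile_release();              // hand the tile state back
    return 0;
}
```

Point 1 above is visible directly: each source tile is loaded exactly once, and all the reuse happens inside the 1 KB tile registers rather than back through the cache hierarchy.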
 

AI Workloads

First off, there's simply no way a CPU without a comparable matrix-multiply can compete with AMX. So, let's get that out of the way, up front. Of course, for heavy AI workloads, I don't expect most people to be using a CPU as their main AI compute engine. Note that they're not comparing against GPUs or other AI accelerators!

Credit to @PaulAlcorn , as I had noted the same things but he was already ahead of me:
"We also see a ~5.5X advantage in BertLarge natural language processing with BF16, but that is versus Genoa with FP32, so it isn't an apples-to-apples test. Intel notes that BF16 datatypes were not supported with AMD's ZenDNN (Zen Deep Neural Network) library with TensorFlow at the time of testing, which leads to a data type mismatch in the BertLarge test. The remainder of the benchmarks used the same data types for both the Intel and AMD systems, but the test notes at the end of the above image album show some core-count-per-instance variations between the two tested configs -- we've followed up with Intel for more detail [EDIT: Intel responded that they swept across the various ratios to find the sweet spot of performance for both types of chips]."​

Further observations:
  • Used Genoa with 2 DIMMs per channel (see the end notes) - doesn't that incur a speed penalty? They also equipped the Xeon with 2 DIMMs per channel; I wonder if its penalty is as large.
  • Genoa had NPS=1 (NPS=4 typically yields better performance).
  • Of course, they're using CPUs with the same core-count, when one of the main selling points of Genoa is that it has more cores.

Regarding that last point:
"per-core software licensing fees being the company's rationale for why these remain comparable."​

None of the software in their benchmarks has per-core licensing. I'm pretty sure it's all open source, even.

General Workloads

I don't have much to say here, except that AMD is clearly using higher core-counts in opposition to Intel's increased reliance on accelerators. So, it seems logical to use another factor, like price, to determine which CPUs to match up to each other.

Also, where specified, most tests used NPS=1, except for the FIO test, GROMACS, and LAMMPS.

Finally, some of the tests used RHEL or Rocky Linux, with a 4.18 kernel. You really have to wonder how many of the more recent optimizations got backported to these ancient kernels, for the respective CPUs.

HPC Workloads

In this category, it would be really nice to have AMD's 3D V-cache equipped CPUs, but I guess they still have yet to launch the Genoa version? Maybe AMD is planning to do that at the Tuesday event.

Again, I'm struck by how many of these benchmarks used an ancient 4.18 kernel. I would expect HPC users to be a lot more interested in running newer kernels, in order to extract the most performance from their massive hardware and energy expenditures. Not only that, but such old distros won't have the compiler optimizations needed to enable features like AVX-512 on Genoa. However, in some cases, they do seem to make a point of compiling with AVX2 on both CPUs.

I'm pleased to see NPS=4 in all cases except STREAM. I guess they felt they had enough bandwidth to spare that they could allow it.
 
FWIW, the compiler side of the initial AMX enablement work was already started in GCC 11, and has also been part of LLVM 12 since its late 2020 release.
That actually makes sense, relative to when Sapphire Rapids was supposed to launch.

FWIW, I think it really doesn't make much difference when they started building it into compilers, because it's not the type of feature I'd expect a compiler to utilize automatically. You'd have to explicitly insert intrinsics into your code if you want the compiler to emit these ops, although maybe the compiler at least manages the tile registers for you?

At least 99% of users will simply be utilizing it through a handful of libraries optimized by Intel, so it won't matter much whether those had to be written in assembly language or not.
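
To illustrate that: with oneDNN (the library behind Intel's TensorFlow and PyTorch CPU paths), application code never mentions tiles or intrinsics at all; you just describe a matmul and the library dispatches to its AMX kernels when the CPU supports them. A rough sketch, assuming oneDNN's v3.x C++ API (names and constructor shapes should be checked against the version you actually have):

```cpp
// Rough sketch, not a tuned benchmark: a single BF16 matmul through oneDNN,
// which picks AMX-backed kernels on Sapphire Rapids by itself.
// Assumes the oneDNN v3.x C++ API; link with -ldnnl.
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    const memory::dim M = 256, K = 1024, N = 512;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Describe A (MxK) and B (KxN) in BF16, and the FP32 result C (MxN).
    memory::desc a_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32,  memory::format_tag::ab);

    // oneDNN chooses the kernel; the caller never sees tiles or intrinsics.
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul mm(pd);

    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);
    mm.execute(strm, {{DNNL_ARG_SRC, a_mem},
                      {DNNL_ARG_WEIGHTS, b_mem},
                      {DNNL_ARG_DST, c_mem}});
    strm.wait();
    return 0;
}
```

Running with ONEDNN_VERBOSE=1 should show whether an AMX-flavoured implementation was actually dispatched.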

That being said, AMX is just a matrix math overlay for the AVX-512 vector math units.
😵‍💫

No, I'm pretty sure it's not.

Intel claims it has far higher narrow (int8) FMA throughput:

[Image: Intel slide on AMX int8 FMA throughput]


Here, Locuza shows us it takes a real chunk of die space in each Sapphire Rapids version of Golden Cove:

And it adds 8x 1 kB tile ISA registers, which could theoretically be implemented via overlays on the same register pool used for ZMM registers, but I sure doubt it! Given that AMX has its own logic, and no instructions for direct interchange with the ZMM registers, it wouldn't be very practical to use the same underlying registers.

Where did you even hear that? In all I've read about AMX, I've never come across such a statement.

More like a “TensorCore” type unit for the CPU.
What's funny is that it's even more like an actual core than Nvidia's Tensor cores are. Those actually do use the same CUDA registers as Nvidia's SIMD instructions. In both Nvidia's and Intel's cases, they're dispatched from the same instruction stream, making neither a proper core in its own right.
 
AMX has its own operations, storage, and register files.
See the fuse.wikichip.org article from June 29, 2020, titled "The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations; To Debut with Sapphire Rapids".
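
One supporting detail: the tile registers are exposed as their own XSAVE state component (XTILEDATA), separate from the AVX-512/ZMM components, and you can ask the CPU about it directly. A small sketch; the CPUID leaf and bit numbers below come from Intel's documentation rather than this thread, so treat them as assumptions to verify:

```cpp
// Sketch: query the CPU about AMX. The tile registers show up as their own
// XSAVE state component (XTILEDATA, component 18), distinct from the ZMM state.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;

    // CPUID leaf 7, sub-leaf 0, EDX: bit 22 = AMX-BF16, 24 = AMX-TILE, 25 = AMX-INT8.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;
    std::printf("AMX-TILE:%u AMX-BF16:%u AMX-INT8:%u\n",
                (edx >> 24) & 1, (edx >> 22) & 1, (edx >> 25) & 1);

    // CPUID leaf 0xD, sub-leaf 18 describes the XTILEDATA save area:
    // EAX = its size in bytes (8192 = eight 1 KB tiles on Sapphire Rapids),
    // EBX = its offset within the XSAVE area, separate from the AVX-512 components.
    if (!__get_cpuid_count(0xD, 18, &eax, &ebx, &ecx, &edx)) return 1;
    std::printf("XTILEDATA: %u bytes at XSAVE offset %u\n", eax, ebx);
    return 0;
}
```

If the tiles were just an overlay on the ZMM register pool, you wouldn't expect a separate 8 KB save area for them.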
 
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
I mostly agree.

The only reason to run LLMs on CPUs is because that is what you have: you don't have enough GPUs available to run the LLMs on. I think Intel were comparing against the current CPU/GPU distribution in the datacentre, where there are buttloads more CPUs than GPUs.

However, in the future, I'd expect anyone who works on LLMs (outside 'prosumer' or occasional casual use) to be angling for GPUs (Grace Hopper and MI300X) rather than CPUs to run on. I think GPUs will take a larger % of the CPU/GPU market as the hardware becomes available and the usage of LLMs continues to grow.

Sure, having a CPU that has some accelerators useful for LLMs (assuming they don't require additional licensing!) will be useful for when you can't get a GPU, but I think GPUs are going to take a larger share of the deployed compute in the future and thus CPU compute for LLMs will become less important.
 
No need to argue against this type of marketing benchmark. Intel obviously don't think they can fool DC customers with this. The audience for this material is people riding the AI hype train with zero technical knowledge whatsoever. Everybody knows how SPR got obliterated by Genoa in CPU workloads. It will get even uglier when compared to MI300 in AI workloads. 😶 At least SPR can still sell to enterprises or cloud vendors who are too lazy to make the jump.
 
AI Workloads
Note that they're not comparing against GPUs or other AI accelerators!

All you need here is a breakthrough in AI that uses a different algorithm, and one generation of hardware accelerators is history.

Intel got itself into a tough spot due to market segmentation.

The dedicated accelerators are the "supreme" form and the last stage of market segmentation. Initially, you only try to charge people differently for your (same) product depending on how much money the customer has. Charging separately for using different parts of the (same) CPU is the final stage of this process. This will work if you have an absolutely monopolistic position in the market. And here we have Intel coming in with its best-of-breed market segmentation while at the same time losing its leading, monopolistic market position.

Now, suddenly, Intel can't overcharge its customers the way it was meant to.
 
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
But you did notice that Intel sells GPUs, right?! They even sell server GPUs.
This is just another tool to use if you are into AI; it's not the only tool Intel provides for AI.
If your workload works better on GPUs, then you can use only GPUs. If it can work distributed, you can even run it on the GPUs and the CPUs at the same time; more work done is still more work done.

You have also heard about Intel Max CPUs plenty of times by now.
[Image: Intel Max Series product information]



AMD's response is clear: integrate CPU and GPU dies (with HBM) into the same package:
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
Wow, did that spinning around create an audible crack in your spine?!
How much more are people going to pay for the MI300 that they will never use?!
 
No need to argue against this type of marketing benchmark. Intel obviously don't think they can fool DC customers with this. The audience for this material is people riding the AI hype train with zero technical knowledge whatsoever.
Eh, I saw less "funny business" than I expected. The worst parts were things like comparing differently-priced CPUs, using RHEL (with an ancient kernel) for HPC benchmarks, and maybe some sub-optimal DIMM configuration and NPS settings.

It will get even uglier when compared to MI300 in AI workloads.
Those are aimed at Nvidia's Grace+Hopper superchip.

Both price- and performance-wise, they should be in a different league than even Xeon Max.
 
All you need here is a breakthrough in AI that uses a different algorithm, and one generation of hardware accelerators is history.
Yeah, which is why you see people throwing 1-2 generation old Nvidia GPUs in the trash...
/s

Initially, you only try to charge people differently for your (same) product depending on how much money the customer has. Charging separately for using different parts of the (same) CPU is the final stage of this process. This will work if you have an absolutely monopolistic position in the market.
Reading this, I can't tell if you're aware that Intel just started doing that. What's weird is that they excluded AMX from it.

 
Wow, did that spinning around create an audible crack in your spine?!
How much more are people going to pay for the MI300 that they will never use?!
You're mixing two different things. My response was written to @rtoaht saying that "AMD needs an answer to the AI accelerators."

So, I suggested that for people who want AI accelerators, they can use a solution like MI300.

However, people who have conventional server workloads can continue to use their mainstream EPYC and not pay for AI accelerators they're not using.

So, there's no contradiction here. I didn't think I'd have to spell that out.