News Intel Claims Sapphire Rapids up to 7X Faster Than AMD EPYC Genoa in AI and Other Workloads

FWIW, the compiler side of the initial AMX enablement work was already started in GCC 11, and has also been part of LLVM 12 since its late 2020 release. That being said, AMX is just a matrix math overlay for the AVX-512 vector math units.

More like a “TensorCore” type unit for the CPU.
 
AMX is, so far, only BF16 and INT8 matrix multiplications and nothing else (technically dot products).
BUT it is foundational, with more variants coming in the pipeline (reading Intel's docs): FP16, etc.
They are also talking about complex-valued data types in Granite Rapids and beyond.

What makes matrix multiplication so important in AI is that it is the most expensive operation in the innermost loop of training, dominating both the compute complexity and, importantly, the memory bandwidth.

By reducing the matrix multiplication to a handful of assembly instructions:
1 tile load instruction: Throughput/Latency = 8/45
1 tile multiplication instruction: Throughput/Latency = 16/52
1 tile store (result): Throughput/Latency = 16

you achieve orders-of-magnitude improvements in overall latency and throughput:
1. You only read the source matrix data once, not multiple times as with any other method (AVX, AVX2, AVX-512, ...).
2. At 8+16+16 = 40 clock cycles for a tile multiplication (a small 16x16 matrix), it is an order of magnitude faster than any other method on a CPU.

It is an incredibly good use of silicon for AI apps.
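
For anyone curious what that sequence looks like from C++, here is a minimal sketch using the AMX intrinsics in <immintrin.h>. Assumptions not from this thread: a Linux 5.16+ kernel (which requires an explicit opt-in before tile state can be used), a single 16x16 output tile, and B already re-packed into the pair-interleaved layout the TMUL instruction expects; the arch_prctl constants come from the kernel documentation, so verify them locally.

```cpp
// Minimal sketch (not production code): one 16x16 BF16 tile multiply with the
// AMX intrinsics, accumulating into FP32.
// Build (hypothetical file name): g++ -O2 -mamx-tile -mamx-bf16 amx_sketch.cpp
#include <immintrin.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// 64-byte tile-configuration block consumed by LDTILECFG (_tile_loadconfig).
struct alignas(64) TileConfig {
    uint8_t  palette_id;      // palette 1 = the 8x 1 KB tile register file
    uint8_t  start_row;
    uint8_t  reserved0[14];
    uint16_t colsb[8];        // bytes per row for tmm0..tmm7
    uint8_t  reserved1[16];
    uint8_t  rows[8];         // rows for tmm0..tmm7
    uint8_t  reserved2[8];
};

int main() {
    // Linux gates the 8 KB of tile state behind an explicit opt-in
    // (constants per the kernel's x86 docs; assumption, double-check them).
    constexpr long ARCH_REQ_XCOMP_PERM = 0x1023;
    constexpr long XFEATURE_XTILEDATA  = 18;
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return 1;   // no AMX, or permission denied

    // tmm0 = FP32 accumulator (16 rows x 16 cols x 4 B = 64 B per row),
    // tmm1 = A tile (16 rows x 32 BF16 = 64 B per row),
    // tmm2 = B tile (BF16, already re-packed into the pair-interleaved layout).
    TileConfig cfg{};
    cfg.palette_id = 1;
    for (int t = 0; t < 3; ++t) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }
    _tile_loadconfig(&cfg);

    alignas(64) uint16_t A[16 * 32] = {};   // BF16 bit patterns
    alignas(64) uint16_t B[16 * 32] = {};   // BF16, pre-packed for TDPBF16PS
    alignas(64) float    C[16 * 16] = {};

    _tile_zero(0);                // clear the accumulator tile
    _tile_loadd(1, A, 64);        // one tile load for A (stride = 64 bytes/row)
    _tile_loadd(2, B, 64);        // one tile load for B
    _tile_dpbf16ps(0, 1, 2);      // tmm0 += A * B: the single TMUL instruction
    _tile_stored(0, C, 64);       // one tile store for the FP32 result
    _tile_release();              // hand the tile state back
    return 0;
}
```

Point 1 above is visible directly: each source tile is loaded exactly once, and all the reuse happens inside the 1 KB tile registers rather than back through the cache hierarchy.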
 

AI Workloads

First off, there's simply no way a CPU without a comparable matrix-multiply can compete with AMX. So, let's get that out of the way, up front. Of course, for heavy AI workloads, I don't expect most people to be using a CPU as their main AI compute engine. Note that they're not comparing against GPUs or other AI accelerators!

Credit to @PaulAlcorn , as I had noted the same things but he was already ahead of me:
"We also see a ~5.5X advantage in BertLarge natural language processing with BF16, but that is versus Genoa with FP32, so it isn't an apples-to-apples test. Intel notes that BF16 datatypes were not supported with AMD's ZenDNN (Zen Deep Neural Network) library with TensorFlow at the time of testing, which leads to a data type mismatch in the BertLarge test. The remainder of the benchmarks used the same data types for both the Intel and AMD systems, but the test notes at the end of the above image album show some core-count-per-instance variations between the two tested configs -- we've followed up with Intel for more detail [EDIT: Intel responded that they swept across the various ratios to find the sweet spot of performance for both types of chips]."​

Further observations:
  • Used Genoa with 2 DIMMs per channel (see the end notes) - doesn't that incur a speed penalty? They also equipped the Xeon with 2 DIMMs per channel; I wonder if its penalty is as large.
  • Genoa had NPS=1 (NPS=4 typically yields better performance).
  • Of course, they're using CPUs with the same core-count, when one of the main selling points of Genoa is that it has more cores.

Regarding that last point:
"per-core software licensing fees being the company's rationale for why these remain comparable."​

None of the software in their benchmarks has per-core licensing. I'm pretty sure it's all open source, even.

General Workloads

I don't have much to say here, except that AMD is clearly using higher core-counts in opposition to Intel's increased reliance on accelerators. So, it seems logical to use another factor, like price, to determine which CPUs to match up to each other.

Also, where specified, most tests used NPS=1, except for the FIO test, GROMACS, and LAMMPS.

Finally, some of the tests used RHEL or Rocky Linux, with a 4.18 kernel. You really have to wonder how many of the more recent optimizations got backported to these ancient kernels, for the respective CPUs.

HPC Workloads

In this category, it would be really nice to have AMD's 3D V-cache equipped CPUs, but I guess they still have yet to launch the Genoa version? Maybe AMD is planning to do that at the Tuesday event.

Again, I'm struck by how many of these benchmarks used an ancient 4.18 kernel. I would expect HPC users to be a lot more interested in running newer kernels, in order to extract the most performance from their massive hardware and energy expenditures. Not only that, but such old distros won't have the compiler optimizations needed to enable features like AVX-512 on Genoa. However, in some cases, they do seem to make a point of compiling with AVX2 on both CPUs.

I'm pleased to see NPS=4 in all cases except STREAM. I guess they felt they had enough bandwidth to spare that they could allow it.
 
FWIW, the compiler side of the initial AMX enablement work was already started in GCC 11, and has also been part of LLVM 12 since its late 2020 release.
That actually makes sense, relative to when Sapphire Rapids was supposed to launch.

FWIW, I think it really doesn't make much difference when they started building it into compilers, because it's not the type of feature I'd expect a compiler to utilize automatically. You'd have to explicitly insert intrinsics into your code if you want the compiler to emit these ops, although maybe the compiler at least manages the tile registers for you?

At least 99% of users will simply be utilizing it through a handful of libraries optimized by Intel, so it won't matter much whether those had to be written in assembly language or not.
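
To illustrate that: with oneDNN (the library behind Intel's TensorFlow and PyTorch CPU paths), application code never mentions tiles or intrinsics at all; you just describe a matmul and the library dispatches to its AMX kernels when the CPU supports them. A rough sketch, assuming oneDNN's v3.x C++ API (names and constructor shapes should be checked against the version you actually have):

```cpp
// Rough sketch, not a tuned benchmark: a single BF16 matmul through oneDNN,
// which picks AMX-backed kernels on Sapphire Rapids by itself.
// Assumes the oneDNN v3.x C++ API; link with -ldnnl.
#include <dnnl.hpp>

int main() {
    using namespace dnnl;
    const memory::dim M = 256, K = 1024, N = 512;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Describe A (MxK) and B (KxN) in BF16, and the FP32 result C (MxN).
    memory::desc a_md({M, K}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32,  memory::format_tag::ab);

    // oneDNN chooses the kernel; the caller never sees tiles or intrinsics.
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul mm(pd);

    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);
    mm.execute(strm, {{DNNL_ARG_SRC, a_mem},
                      {DNNL_ARG_WEIGHTS, b_mem},
                      {DNNL_ARG_DST, c_mem}});
    strm.wait();
    return 0;
}
```

Running with ONEDNN_VERBOSE=1 should show whether an AMX-flavoured implementation was actually dispatched.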

That being said, AMX is just a matrix math overlay for the AVX-512 vector math units.
😵‍💫

No, I'm pretty sure it's not.

Intel claims it has far higher narrow (int8) FMA throughput:

[Image: Intel slide on AMX int8 FMA throughput]


Here, Locuza shows us it takes a real chunk of die space in each Sapphire Rapids version of Golden Cove:

And it adds 8x 1 kB tile ISA registers, which could theoretically be implemented via overlays on the same register pool used for ZMM registers, but I sure doubt it! Given that AMX has its own logic, and no instructions for direct interchange with the ZMM registers, it wouldn't be very practical to use the same underlying registers.

Where did you even hear that? In all I've read about AMX, I've never come across such a statement.

More like a “TensorCore” type unit for the CPU.
What's funny is that it's even more like an actual core than Nvidia's Tensor cores are. Those actually do use the same CUDA registers as Nvidia's SIMD instructions. In both Nvidia's and Intel's cases, they're dispatched from the same instruction stream, making neither a proper core in its own right.
 
AMX has its own operations, storage, and register files.
See the fuse.wikichip.org article from June 29, 2020, titled "The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations; To Debut with Sapphire Rapids".
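
One supporting detail: the tile registers are exposed as their own XSAVE state component (XTILEDATA), separate from the AVX-512/ZMM components, and you can ask the CPU about it directly. A small sketch; the CPUID leaf and bit numbers below come from Intel's documentation rather than this thread, so treat them as assumptions to verify:

```cpp
// Sketch: query the CPU about AMX. The tile registers show up as their own
// XSAVE state component (XTILEDATA, component 18), distinct from the ZMM state.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;

    // CPUID leaf 7, sub-leaf 0, EDX: bit 22 = AMX-BF16, 24 = AMX-TILE, 25 = AMX-INT8.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;
    std::printf("AMX-TILE:%u AMX-BF16:%u AMX-INT8:%u\n",
                (edx >> 24) & 1, (edx >> 22) & 1, (edx >> 25) & 1);

    // CPUID leaf 0xD, sub-leaf 18 describes the XTILEDATA save area:
    // EAX = its size in bytes (8192 = eight 1 KB tiles on Sapphire Rapids),
    // EBX = its offset within the XSAVE area, separate from the AVX-512 components.
    if (!__get_cpuid_count(0xD, 18, &eax, &ebx, &ecx, &edx)) return 1;
    std::printf("XTILEDATA: %u bytes at XSAVE offset %u\n", eax, ebx);
    return 0;
}
```

If the tiles were just an overlay on the ZMM register pool, you wouldn't expect a separate 8 KB save area for them.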
 
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
I mostly agree.

The only reason to run LLMs on CPUs is because that is what you have: you don't have enough GPUs available to run the LLMs on. I think Intel were comparing against the current CPU/GPU distribution in the datacentre, where there are buttloads more CPUs than GPUs.

However, in the future, I'd expect anyone who works on LLMs (outside 'prosumer' or occasional casual use) to be angling for GPUs (Grace Hopper and MI300X) rather than CPUs to run on. I think GPUs will take a larger % of the CPU/GPU market as the hardware becomes available and the usage of LLMs continues to grow.

Sure, having a CPU that has some accelerators useful for LLMs (assuming they don't require additional licensing!) will be useful for when you can't get a GPU, but I think GPUs are going to take a larger share of the deployed compute in the future and thus CPU compute for LLMs will become less important.
 
No need to argue against this type of marketing benchmark. Intel obviously don't think they can fool DC customers with this. The audience for this material is people riding the AI hype train with zero technical knowledge whatsoever. Everybody knows how SPR got obliterated by Genoa in CPU workloads. It will get even uglier when compared to MI300 in AI workloads. 😶 At least SPR can still sell to enterprises or cloud vendors who are too lazy to make the jump.
 
AI Workloads
Note that they're not comparing against GPUs or other AI accelerators!

All you need here is a breakthrough in AI that uses a different algorithm, and one generation of hardware accelerators is history.

Intel got itself into a tough spot due to market segmentation.

The dedicated accelerators are the "supreme" form and the last stage of market segmentation. Initially, you only try to charge people differently for your (same) product depending on how much money the customer has. Charging separately for using different parts of the (same) CPU is the final stage of this process. This will work if you have an absolutely monopolistic position in the market. And here we have Intel coming in with its best-of-breed market segmentation while at the same time losing its leading, monopolistic market position.

Now, suddenly, Intel can't overcharge its customers the way it was meant to.
 
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
But you did notice that Intel sells GPUs, right?! They even sell server GPUs.
This is just another tool to use if you are into AI; it's not the only tool Intel provides for AI.
If your workload works better on GPUs, then you can use only GPUs. If it can work distributed, you can even run it on the GPUs and the CPUs at the same time; more work done is still more work done.

You have also heard about Intel Max CPUs plenty of times by now.
[Image: Intel Max Series product information]



AMD's response is clear: integrate CPU and GPU dies (with HBM) into the same package:
Yes, but the question is whether it's any match for GPUs (or purpose-built AI accelerators). Again, note that they opted to compare it only to another CPU!

Also, in the above die shot, you can see it hardly comes for free. For all those server apps which don't use AMX, they're still paying for it!
Wow, did that spinning around create an audible crack in your spine?!
How much more are people going to pay for the MI300 that they will never use?!
 
No need to argue against this type of marketing benchmark. Intel obviously don't think they can fool DC customers with this. The audience for this material is people riding the AI hype train with zero technical knowledge whatsoever.
Eh, I saw less "funny business" than I expected. The worst parts were things like comparing differently-priced CPUs, using RHEL (with an ancient kernel) for HPC benchmarks, and maybe some sub-optimal DIMM configuration and NPS settings.

It will get even uglier when compared to MI300 in AI workloads.
Those are aimed at Nvidia's Grace+Hopper superchip.

Both price- and performance-wise, they should be in a different league than even Xeon Max.
 
All you need here is a breakthrough in AI that uses a different algorithm, and one generation of hardware accelerators is history.
Yeah, which is why you see people throwing 1-2 generation old Nvidia GPUs in the trash...
/s

Initially, you only try to charge people differently for your (same) product depending on how much money the customer has. Charging separately for using different parts of the (same) CPU is the final stage of this process. This will work if you have an absolutely monopolistic position in the market.
Reading this, I can't tell if you're aware that Intel just started doing that. What's weird is that they excluded AMX from it.

 
Wow, did that spinning around create an audible crack in your spine?!
How much more are people going to pay for the MI300 that they will never use?!
You're mixing two different things. My response was written to @rtoaht saying that "AMD needs an answer to the AI accelerators."

So, I suggested that for people who want AI accelerators, they can use a solution like MI300.

However, people who have conventional server workloads can continue to use their mainstream EPYC and not pay for AI accelerators they're not using.

So, there's no contradiction here. I didn't think I'd have to spell that out.