News Intel fires back at AMD's AI benchmarks, shares results claiming current-gen Xeon chips are faster at AI than next-gen EPYC Turin

bit_user · Jun 14, 2024

CmdrShepard said:
But you expect us to believe some emo hippie on YouTube? Give me a break.

Steve Burke isn't just "some emo hippie". He's one of the more knowledgeable and trustworthy tech tubers out there.

CmdrShepard said:
Last time I checked, Intel CPUs were still the top results in all relevant charts from performance to sales.

AMD's 7800X3D beats it in numerous games and AMD CPUs usually beat Intel in Linux.

CmdrShepard said:
Also, Intel launched into video card business and is doing very good considering ...

Losing money is not "doing very good", regardless of what you consider. The cards were years late to market and still nowhere near ready, let alone competitive.

CmdrShepard said:
how long both AMD and NVIDIA have been entrenched there as the only contenders.

How long has Intel's iGPU business been around? Shouldn't they know more about building GPU hardware and drivers, by now?

CmdrShepard said:
4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle and 1,024 BF16 operations per cycle.

And it's still not as fast as a humble gaming GPU costing 1/10th as much!

CmdrShepard · Jun 14, 2024

bit_user said:
Steve Burke isn't just "some emo hippie".

You're right, he is also an arrogant obnoxious attention _____ (5 letter word, starts with w).

bit_user said:
He's one of the more knowledgeable and trustworthy tech tubers out there.

Considering the competion the knowledgeable part isn't that hard. As for trustworthy? Maybe for you, but not for me.

bit_user said:
AMD's 7800X3D beats it in numerous games and AMD CPUs usually beat Intel in Linux.

Yeah, it beats it by 5% average in games where average FPS is already 200+ in 1080p and has like 1% difference in 4K where games aren't CPU limited. Besides, all benchmarks done so far should be redone with proper Intel BIOS settings to see the actual performance which might even end up being better with reduced power draw and less throttling.

Linux performance? WGAF about that?

I'd hardly call that a beating.

bit_user said:
Losing money is not "doing very good", regardless of what you consider. The cards were years late to market and still nowhere near ready, let alone competitive.

I don't see how any company could enter PC GPU market and do anything but lose money in the start, even if they did have integrated GPU experience. The cards are decent, and very competitive for the asking price.

bit_user said:
And it's still not as fast as a humble gaming GPU costing 1/10th as much!

Which GPU you are talking about exactly here? If Xeon is $1,000 I'd like to see a (new, not used and current, not old) $100 GPU that can beat it in AI inference because for $100 you won't even get enough VRAM to load a decent model on it.

Droidfreak · Jun 14, 2024

TL;DR always treat vendor-provided benchmarks as BS and wait for independent testing.

bit_user · Jun 14, 2024

CmdrShepard said:
You're right, he is also an arrogant obnoxious attention _____ (5 letter word, starts with w).

Oof. People in glass houses shouldn't throw stones.

CmdrShepard said:
Which GPU you are talking about exactly here? If Xeon is $1,000 I'd like to see a (new, not used and current, not old) $100 GPU that can beat it in AI inference because for $100 you won't even get enough VRAM to load a decent model on it.

$1000 will hardly get you in the door, with Xeon pricing. Don't you have a Xeon W? I shouldn't have to tell you that!

No, I was assuming you were talking about a serious number of cores, because AI and all that. Now, I was also thinking about the server Xeons, but if you want to do Xeon W, we can go down that road instead and see how they match up.

CmdrShepard · Jun 14, 2024

bit_user said:
Oof. People in glass houses shouldn't throw stones.

I am not the one bitching and moaning on YouTube while earning money from ad impressions of impressionable idiots soaking up their next opinion from me because they don't have and can't form their own -- I can therefore throw whatever I want.

bit_user said:
$1000 will hardly get you in the door, with Xeon pricing. Don't you have a Xeon W? I shouldn't have to tell you that!

Oh no, we aren't shifting the goalposts -- you made a comparison of a CPU to GPU performance and cost WITHOUT bringing external elements into play.

I've quoted Xeon w5-2455X CPU price and asked for a 1/10th of a price GPU to match. You made that ridiculous claim, not me. And of course you can't have a GPU working by itself either and you need the full system to feed it just like you need for a CPU so we can disregard that cost -- just say which GPU you had in mind.

bit_user said:
No, I was assuming you were talking about a serious number of cores, because AI and all that.

12 cores 24 threads. each AMX tile of each core has 8 registers each 1 KB in size and a TMUL unit. Oh and there's also AVX-512 too.

bit_user said:
Now, I was also thinking about the server Xeons, but if you want to do Xeon W, we can go down that road instead and see how they match up.

You wanted server Xeons because that would have given you more headroom for more expensive and thus more powerful GPU. Nope.

rluker5 · Jun 14, 2024

NeoMorpheus said:
Thanks for validating both points i made. 😀

The scripts Intel ran are available on GitHub FYI.

NeoMorpheus · Jun 14, 2024

rluker5 said:
The scripts Intel ran are available on GitHub FYI.

bit_user · Jun 15, 2024

CmdrShepard said:
I am not the one bitching and moaning on YouTube while earning money from ad impressions of impressionable idiots soaking up their next opinion from me because they don't have and can't form their own -- I can therefore throw whatever I want.

Actually, you just undermined your original point. You alleged he has a character flaw, and then proceeded to provide a justification for it.

CmdrShepard said:
Oh no, we aren't shifting the goalposts -- you made a comparison of a CPU to GPU performance and cost WITHOUT bringing external elements into play.

You already established:

"4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle and 1,024 BF16 operations per cycle. "

My statement assumed an element of reasonableness that I'm beginning to see is a mistake, when dealing with you. Someone interested in AMX performance would likely go for a mid-range Xeon model with the best perf/$, since the AMX performance is most directly tied to how many cores it has. I didn't mean to imply that one could take any Xeon SP model and find a dGPU costing 1/10th as much that would out-perform it. That would obviously be impossible, considering the cheapest Sapphire Rapids Xeon costs just $425 and no dGPU exist that can be purchased for 1/10th of that price.

https://ark.intel.com/content/www/u...nze-3408u-processor-22-5m-cache-1-80-ghz.html

The value in having these debates is to establish the relative merits of the products and their approaches. Trying to win on a technicality is at odds with that and suggests you're really just more interested in burnishing your ego than arriving at deeper truths.

CmdrShepard said:
I've quoted Xeon w5-2455X CPU

Oh, well now that we're nit-picking, I'm going to hold you to your original statement, which pertained to the Xeon Scalable product line. You don't want any goalpost-moving, right?

CmdrShepard said:
And of course you can't have a GPU working by itself either and you need the full system to feed it just like you need for a CPU so we can disregard that cost -- just say which GPU you had in mind.

The way that works is you get a cheap CPU with lots of PCIe lanes. Then, you can pack in several GPUs to amortize the overhead costs of the host.

CmdrShepard said:
12 cores 24 threads. each AMX tile of each core has 8 registers each 1 KB in size and a TMUL unit. Oh and there's also AVX-512 too.

Not sure why you're listing anything but the number of cores, since that's what the AMX unit is tied to. Also, I know how many registers it has and how big they are.

TheHerald · Jun 15, 2024

bit_user said:
Steve Burke isn't just "some emo hippie". He's one of the more knowledgeable and trustworthy tech tubers out there.

Is he? Strongly disagree.

For example, he made an entire video to prove that the 7800x 3d is more efficient at ISO power compared to the 14700k in MT. The only actual test he did at MT was blender, the 14700k topped the entire chart in efficiency and he proceeded to downplay it by saying "nobody is going to run it at this configuration". Sorry, can't find that very trustworthy.

bit_user · Jun 15, 2024

TheHerald said:
he proceeded to downplay it by saying "nobody is going to run it at this configuration".

Are you sure you're not omitting some context? What did he mean by "this configuration"?

TheHerald · Jun 15, 2024

bit_user said:
Are you sure you're not omitting some context? What did he mean by "this configuration"?

He meant powerlimitting it to the 7800x 3d power draw. Basically he made a video to disprove that the 14700k is more efficient at ISO power in MT, and when the only thing he actually tested (blender) had the 14700k sitting on top (it only lost to a 4.999€ threadripper) he said it's not relevant cause nobody runs it like that.

bit_user · Jun 15, 2024

TheHerald said:
He meant powerlimitting it to the 7800x 3d power draw. Basically he made a video to disprove that the 14700k is more efficient at ISO power in MT, and when the only thing he actually tested (blender) had the 14700k sitting on top (it only lost to a 4.999€ threadripper) he said it's not relevant cause nobody runs it like that.

Okay, so he guessed wrong. At least he did the testing and showed the results, even though they made him look bad.

IMO, it's not surprising that a 20 core/28 thread CPU would be more efficient than an 8/16 one, especially given that rendering doesn't tend to respond well to the extra L3 cache.

TheHerald · Jun 15, 2024

bit_user said:
Okay, so he guessed wrong. At least he did the testing and showed the results, even though they made him look bad.

IMO, it's not surprising that a 20 core/28 thread CPU would be more efficient than an 8/16 one, especially given that rendering doesn't tend to respond well to the extra L3 cache.

The problem isn't that he guessed wrong, that's fine / whatever. The problem is he quickly glossed over the results and even made a comment about it not mattering cause who runs like that. He wouldn't be so quickly to gloss over it if it was any other way around.

CmdrShepard · Jun 15, 2024

bit_user said:
Actually, you just undermined your original point. You alleged he has a character flaw, and then proceeded to provide a justification for it.

Both can be true so no.

bit_user said:
Someone interested in AMX performance would likely go for a mid-range Xeon model with the best perf/$, since the AMX performance is most directly tied to how many cores it has.

And you don't think the core base / boost clock affects the AMX speed?

bit_user said:
I didn't mean to imply that one could take any Xeon SP model and find a dGPU costing 1/10th as much that would out-perform it. That would obviously be impossible, considering the cheapest Sapphire Rapids Xeon costs just $425 and no dGPU exist that can be purchased for 1/10th of that price.

That's how it sounded and how non-tech-savvy people reading your post would've understood it -- you should have clarified better what you meant by Xeon (SP/MP, how many cores, price range, etc) instead of making a blanket statement like that with a singular goal to dismiss AMX as a gimmick compared to even the cheapest of dGPUs.

bit_user said:
The value in having these debates is to establish the relative merits of the products and their approaches. Trying to win on a technicality is at odds with that and suggests you're really just more interested in burnishing your ego than arriving at deeper truths.

Says the guy who made a blanket statement about merits to begin with.

bit_user said:
Oh, well now that we're nit-picking, I'm going to hold you to your original statement, which pertained to the Xeon Scalable product line. You don't want any goalpost-moving, right?

And the architectural difference between Xeon Scalable (which was in Intel's quote, not my choice) and Xeon W is what (apart from obvious lack of scalability in workstation class CPU)? When it comes to AMX they are functionally the same.

bit_user said:
The way that works is you get a cheap CPU with lots of PCIe lanes. Then, you can pack in several GPUs to amortize the overhead costs of the host.

Then you would actually want Xeon W because they can have up to 112 PCI-Express lanes compared to 80 lanes in Xeon Scalable. On the other hand, in order to saturate those GPUs you can't exactly get the cheapest possible CPU either and if you are already paying for more powerful CPU you might as well use it.

Thankfully there are already some additional 3rd-party benchmarks which can validate that Intel AMX isn't something to sneeze at:

https://www.phoronix.com/review/intel-xeon-amx/6

bit_user · Jun 15, 2024

CmdrShepard said:
And you don't think the core base / boost clock affects the AMX speed?

I specifically said "most directly".

Boost clocks won't enter the picture and I don't trust Intel's base clocks on this, either. I had direct experience with a Skylake-era Xeon SP dropping its clock speeds well below the specified base clocks, when hit with a moderate AVX-512 workload. So, to make any remotely serious performance estimate, I would actually do some digging around to se what happens to clocks under a heavy AMX load.

That's how it sounded and how non-tech-savvy people reading your post would've understood it -- you should have clarified better what you meant by Xeon (SP/MP, how many cores, price range, etc) instead of making a blanket statement like that with a singular goal to dismiss AMX as a gimmick compared to even the cheapest of dGPUs.

CmdrShepard said:
And the architectural difference between Xeon Scalable (which was in Intel's quote, not my choice) and Xeon W is what

That's why I was willing to allow for it, as a point of comparison, before I saw that you were going to be more pedantic than curious.

CmdrShepard said:
Then you would actually want Xeon W because they can have up to 112 PCI-Express lanes compared to 80 lanes in Xeon Scalable.

Or maybe go with an EPYC or TR Pro, which not only has more PCIe lanes but more memory channels with which to feed them.

CmdrShepard said:
On the other hand, in order to saturate those GPUs you can't exactly get the cheapest possible CPU either

How many CPU cores to specify really depends on what you're doing. Ideally, the CPUs really aren't doing the heavy number-crunching.

CmdrShepard said:
Thankfully there are already some additional 3rd-party benchmarks which can validate that Intel AMX isn't something to sneeze at:

https://www.phoronix.com/review/intel-xeon-amx/6

It doesn't answer the question, since Phoronix Test Suite has no OpenVINO benchmarks in common, between the CPU and GPU device. I'd suspect that's intentional, but then he only defined four benchmarks for the GPU. So, it seems more like he regards OpenVINO as a CPU benchmarking tool than a GPU one.

CmdrShepard · Jun 15, 2024

bit_user said:
Boost clocks won't enter the picture and I don't trust Intel's base clocks on this, either. I had direct experience with a Skylake-era Xeon SP dropping its clock speeds well below the specified base clocks, when hit with a moderate AVX-512 workload. So, to make any remotely serious performance estimate, I would actually do some digging around to se what happens to clocks under a heavy AMX load.

There's nothing to dig -- there is a MSR value called TMUL (negative) offset which can be set in BIOS if overclocking is supported on the CPU and mainboard (which is for Xeon W).

I just ran ONNX runtime micro benchmark with AMX at 4.6 GHz (so offset 0) on my w5-2455X, and it is suprisingly power efficient compared to AVX-512 -- it used up to 200W of power.

AVX-512 on the other hand can pull well over 300W and I can't run it at 4.6 GHz because current cooling setup can't sustain that much dissipation without being too loud for my taste.

bit_user said:
That's why I was willing to allow for it, as a point of comparison, before I saw that you were going to be more pedantic than curious.

I am always curious, it's my main character flaw.

bit_user said:
Or maybe go with an EPYC or TR Pro, which not only has more PCIe lanes but more memory channels with which to feed them.

Or that, yeah.

bit_user said:
It doesn't answer the question, since Phoronix Test Suite has no OpenVINO benchmarks in common, between the CPU and GPU device. I'd suspect that's intentional, but then he only defined four benchmarks for the GPU. So, it seems more like he regards OpenVINO as a CPU benchmarking tool than a GPU one.

The point of that benchmark was to compare AMX on/off with AMD. To me, it throws a monkey wrench into those "AMD beating / spanking Intel" fanboy claims.

bit_user · Jun 15, 2024

CmdrShepard said:
There's nothing to dig -- there is a MSR value called TMUL (negative) offset which can be set in BIOS if overclocking is supported on the CPU and mainboard (which is for Xeon W).

I just ran ONNX runtime micro benchmark with AMX at 4.6 GHz (so offset 0) on my w5-2455X, and it is suprisingly power efficient compared to AVX-512 -- it used up to 200W of power.

AVX-512 on the other hand can pull well over 300W ...

Thanks for the info.

CmdrShepard said:
The point of that benchmark was to compare AMX on/off with AMD. To me, it throws a monkey wrench into those "AMD beating / spanking Intel" fanboy claims.

AMX doesn't apply to much beyond AI and it AMX doesn't even magically speed up all AI. It only supports operations on dense matrices and only supports int8 and bf16.

Now, let's look at Intel's claims about it. They say it does inferencing on a simple image classification network (Resnet50) at 50% faster than Nvidia A30. The A30 is rated at 165 half-precision tensor TFLOPS. The RTX 4080 Super is rated at 209 (dense), which is 26.7% faster, so not quite there. The RTX 4090 jumps all the way up to 330, which should correspond to 33.3% faster than their AMX test vehicle.

https://simplecore.intel.com/oneapi...uez-AI-Software-and-Hardware-Acceleration.pdf

That doc doesn't directly say which CPU they used, but I followed the link and found they're using Xeon Platinum 8480+ for several benchmarks in this class. That has a list price of $10.7k

https://ark.intel.com/content/www/u...tinum-8480-processor-105m-cache-2-00-ghz.html

So, if utilizing sparsity adds enough performance to the RTX 4080 Super for it to match or exceed the Xeon, then it's exactly as I said - a 10:1 price ratio. Or, if you use a network that won't fit in the CPU's L3 cache, since Resnet50 is quite old and most networks now in use are much larger, then it's a similar situation where the RTX 4080 Super will beat the Xeon by the sheer force of its memory bandwidth.

However, in the case most favorable to Intel (because it's the one they chose to highlight), in the test conditions most favorable to them, then the ratio is a mere 6.2:1. ...but, if we look at performance/$, we get 2.24 fps/$ for the Xeon and 18.6 fps/$ for the RTX 4090. That's a ratio of 8.3:1, on Intel's chosen workload.

CmdrShepard · Jun 15, 2024

bit_user said:
AMX doesn't apply to much beyond AI and it AMX doesn't even magically speed up all AI. It only supports operations on dense matrices and only supports int8 and bf16.

I don't see why someone can't use it for something else too (DCT quantization of image/video comes to mind as a possible use and so does audio DSP), and as for AI it is kind of magical because it is supported in PyTorch and ONNX and likely in other frameworks already.

bit_user said:
Now, let's look at Intel's claims about it. They say it does inferencing on a simple image classification network (Resnet50) at 50% faster than Nvidia A30. The A30 is rated at 165 half-precision tensor TFLOPS. The RTX 4080 Super is rated at 209 (dense), which is 26.7% faster, so not quite there. The RTX 4090 jumps all the way up to 330, which should correspond to 33.3% faster than their AMX test vehicle.

RTX 4090 also has 450W TDP, limited RAM compared to a CPU, and increased latency of processing (you can't eliminate PCI-e bus latency for host->device->host transfers).

bit_user said:
That doc doesn't directly say which CPU they used, but I followed the link and found they're using Xeon Platinum 8480+ for several benchmarks in this class. That has a list price of $10.7k

So, if utilizing sparsity adds enough performance to the RTX 4080 Super for it to match or exceed the Xeon, then it's exactly as I said - a 10:1 price ratio. Or, if you use a network that won't fit in the CPU's L3 cache, since Resnet50 is quite old and most networks now in use are much larger, then it's a similar situation where the RTX 4080 Super will beat the Xeon by the sheer force of its memory bandwidth.

However, in the case most favorable to Intel (because it's the one they chose to highlight), in the test conditions most favorable to them, then the ratio is a mere 6.2:1. ...but, if we look at performance/$, we get 2.24 fps/$ for the Xeon and 18.6 fps/$ for the RTX 4090. That's a ratio of 8.3:1, on Intel's chosen workload.

And what is the performance of 4090 with a 90 GB model? Zero fps/$. You are comparing silly things. If you want to compare video cards then compare the professional cards with proper amount of VRAM and you will see they aren't cheaper than $10k Xeon.

Look, I am not saying GPUs aren't faster, what with all those Tensor and CUDA cores, just that CPUs with AI acceleration have their own place in the market where larger models are needed and where VRAM is prohibitively expensive.

Intel currently has a better performing product in that segment than AMD despite AMD's claims to the contrary and that was my point since that's what we were discussing here.

bit_user · Jun 15, 2024

CmdrShepard said:
I don't see why someone can't use it for something else too (DCT quantization of image/video comes to mind as a possible use

That's because you didn't bother to look at the actual instructions it supports. It just computes dot products. That's it. For DCTs used in image & video compression, the blocks are so small it's probably not worth the overhead of using AMX.

CmdrShepard said:
and so does audio DSP),

Honestly, what can you do with audio at only 8 bits? BF16 isn't much better, as the mantissa is still only 8 bits. Maybe it's good enough for certain kinds of audio analysis, but nobody would want to listen to audio processed using BF16.

CmdrShepard said:
and as for AI it is kind of magical because it is supported in PyTorch and ONNX and likely in other frameworks already.

For AI, it's really aimed at convolutional neural networks. Big convolutions, involving non-separable 2D kernels. That's great for image analysis, like their Resnet50 example.

CmdrShepard said:
RTX 4090 also has 450W TDP, limited RAM compared to a CPU, and increased latency of processing (you can't eliminate PCI-e bus latency for host->device->host transfers).

For sure, the RAM limit is an issue for training. As far as latency goes, inferencing is usually batched anyhow. You can stream your videos or images into the GPU's memory, have it decompress them, and then asynchronously collect the results as they finish.

CmdrShepard said:
I am not saying GPUs aren't faster, what with all those Tensor and CUDA cores, just that CPUs with AI acceleration have their own place in the market where larger models are needed and where VRAM is prohibitively expensive.

What's notable about CPUs and AI is that some people have actually resorted to using them for it, due to the scarcity of GPUs capable of training large models. Because of that, AMX probably did end up being a win for Intel.

The down side is that it made their server P-cores that much bigger, which really hurt their core density until Sierra Forest finally launched.

CmdrShepard said:
Intel currently has a better performing product in that segment than AMD despite AMD's claims to the contrary and that was my point since that's what we were discussing here.

Eh, you're spoiling for a fight, but nobody cares. What I find funny is how you were so dismissive of AMD's Linux performance and yet CPU AI performance is way less important than that!

slightnitpick · Jun 15, 2024

purposelycryptic said:
Unlike with Apple, the ARM architecture will fracture the Windows software ecosystem if successful, with likely significant negative repercussions to the overall perceived stability and reliability of the Windows platform, as a PC can no longer be relied upon to "just work" with software, due to the split architecture.

For most people I'd assume that most new software would come from the Windows store. In other cases I'd expect software publishers to have an install script wrapper that autodetects the platform and downloads the compatible version.

purposelycryptic said:
Emulation offers a band-aid solution for the time being, but for the average consumer, and especially the large SMB market, who don't fully understand the difference and simply need their software to work, which for many companies can be positively ancient and barely able to run on modern x86 systems, this is likely to result in significant frustration which, if not handled just right, may collapse majority opinion on Windows ARM machines, resulting in a death spiral they cannot recover from.

For the reason you indicate ("barely able to run on modern x86 systems") this old software typically doesn't get installed on new machines. It gets used on the old machines. So no problem there.

Even smaller companies typically have a third-party IT department that they contract with. Basically no one except IT themselves is deciding what computers to buy. I expect the regular IT certification process to be adequate to preventing frustration here.

Don Frenser · Jun 15, 2024

I will not believe anything any shintel rep says.

The fraude company should have been banned from the universe 20 years ago whenbthey were found out about the bribes they paid

CmdrShepard · Jun 16, 2024

bit_user said:
That's because you didn't bother to look at the actual instructions it supports. It just computes dot products. That's it. For DCTs used in image & video compression, the blocks are so small it's probably not worth the overhead of using AMX.

DCT blocks vary in sizes for video and I am sure with some clever tricks you could do more than one block at a time given the tile size.

bit_user said:
Honestly, what can you do with audio at only 8 bits?

Newsflash: you can do quality audio even with 1-bit ADC. But to answer your question, if your dynamic range is controlled you don't really need 16 bits to get good enough quality for speech.

bit_user said:
BF16 isn't much better, as the mantissa is still only 8 bits. Maybe it's good enough for certain kinds of audio analysis, but nobody would want to listen to audio processed using BF16.

What if it's a machine listening to it? Like picking out trigger words out of a bunch of low bandwidth audio channels?

bit_user said:
For AI, it's really aimed at convolutional neural networks. Big convolutions, involving non-separable 2D kernels. That's great for image analysis, like their Resnet50 example.

From what I can find, it can work for other stuff as well:
https://huggingface.co/blog/stable-diffusion-inference-intel

It is nowhere near performance of RTX 4090 of course, but it provides good speedup for CPU inference.

bit_user said:
For sure, the RAM limit is an issue for training.

Not only for training.

bit_user said:
As far as latency goes, inferencing is usually batched anyhow. You can stream your videos or images into the GPU's memory, have it decompress them, and then asynchronously collect the results as they finish.

That's not what I am saying -- think of a hypothetical code completion model for example. You type a character in text editor, it is sent to the GPU along with context, model analyzes it and returns results to display. GPU roundrip is always going to have more latency than CPU processing here and you can't really hide it by batching.

bit_user said:
What's notable about CPUs and AI is that some people have actually resorted to using them for it, due to the scarcity of GPUs capable of training large models. Because of that, AMX probably did end up being a win for Intel.

Considering the prices of such GPUs that seems like a smart move. Not only you control the core count and RAM amount allowing you to scale out as needed, but you also aren't limited to using CUDA only.

bit_user said:
The down side is that it made their server P-cores that much bigger, which really hurt their core density until Sierra Forest finally launched.

Sierra Forest are E-core Xeons which have different accelerators -- they don't have AVX-512 and AMX. You want Emerald Rapids (5th Gen) or Granite Rapids (6th Gen) Xeons for comparison.

There's also another upside (provided that the tech works as advertised) to using CPUs for AI instead of GPUs -- Intel TDX. You can have VMs fully isolated from hypervisor and everything else with encrypted memory. As far as I know GPUs don't support this level of protection yet so there's no such thing as private AI inference in the cloud (and that's why Apple is building their own).

bit_user said:
What I find funny is how you were so dismissive of AMD's Linux performance and yet CPU AI performance is way less important than that!

I was dismissive of Linux performance in general, not AMD Linux performance. That's important to people running Linux servers, not to consumers. AI performance on the other hand is important to everyone who would like to use AI.

bit_user · Jun 16, 2024

CmdrShepard said:
DCT blocks vary in sizes for video and I am sure with some clever tricks you could do more than one block at a time given the tile size.

I trust you know what a dot product is, right? It has one answer. A matrix multiply of matrices A and B consists of A_M * B_N dot products, where the size of the operands is A_N (which should be the same as B_M). I'm no math genius, but I don't see a generalized way to build a joint matrix multiply out of that. For an 8x8 matrix, I think it always amounts to 64 dot products of 8-element vectors. Those vectors are so small that even the largest 32x32 blocks can fit in regular AVX registers. AMX has no special advantage, here.

CmdrShepard said:
Newsflash: you can do quality audio even with 1-bit ADC.

There's no free lunch. After delta-sigma conversion, your bitrate is usually even higher and processing is now much more complex! I think most non-trivial processing on delta-sigma bitstreams probably converts back to PCM, before anything else. Delta-sigma was mostly done as a hack to make high-performance DACs cheaper.

CmdrShepard said:
But to answer your question, if your dynamic range is controlled you don't really need 16 bits to get good enough quality for speech.

Oh, sure. If speech at the quality of legacy telephone systems is all you want, then 8 bits is plenty. As I said, you can do some audio analysis @ 8-bits (including speech recognition). However, when people speak more broadly of audio signal processing, I figure they're including applications like processing for studio production and live music performance.

CmdrShepard said:
It is nowhere near performance of RTX 4090 of course, but it provides good speedup for CPU inference.

Not only for training.

Yeah, but because you can use cheaper client GPUs for inference, and memory size doesn't tend to be such a limiting factor, I think the main use case for AMX is in training.

CmdrShepard said:
That's not what I am saying -- think of a hypothetical code completion model for example. You type a character in text editor, it is sent to the GPU along with context, model analyzes it and returns results to display. GPU roundrip is always going to have more latency than CPU processing here and you can't really hide it by batching.

Yes, realtime inferencing in a closed loop is an example of where you can't use batching. Realtime control applications, like robotics or self-driving is where you tend to find examples of that. However, those things typically don't involve a big server CPU. Instead, they tend to use something like these Nvidia AGX systems:

NVIDIA AGX Systems

The world's first family of systems for Autonomous Machines, Self Driving Cars, and Medical Imaging.

www.nvidia.com

CmdrShepard said:
Considering the prices of such GPUs that seems like a smart move. Not only you control the core count and RAM amount allowing you to scale out as needed,

Oh, but your scaling is effectively limited to 8 sockets. Training needs high inter-processor bandwidth, hence Nvidia's focus on NVLink. Sure, you can scale out to multiple chassis, but it doesn't scale linearly in either performance or especially cost. People do it, as I said, but it's not a panacea.

CmdrShepard said:
you also aren't limited to using CUDA only.

Yes, and that's a good thing. The near-term practical benefits of avoiding CUDA in AI are virtually nonexistent, but I want to see us move away from CUDA as much as anyone.

CmdrShepard said:
Sierra Forest are E-core Xeons which have different accelerators -- they don't have AVX-512 and AMX.

My point was that Sapphire & Emerald Rapids have AMX accelerators - you're paying for that silicon, even if you don't need them (e.g. for web serving or general database serving, etc.). This problem wasn't solved until Sierra Forest, but that came years late.

CmdrShepard said:
There's also another upside (provided that the tech works as advertised) to using CPUs for AI instead of GPUs -- Intel TDX. You can have VMs fully isolated from hypervisor and everything else with encrypted memory. As far as I know GPUs don't support this level of protection yet

You really should give Nvidia more credit, given how long they've been in the cloud computing game. If features like that are important for CPUs, you can bet Nvidia's customers have been asking for the same sorts of capabilities in their GPUs. I don't know much about it, but Nvidia claims to have a feature called Confidential Computing:

"While data is encrypted at rest in storage and in transit across the network, it’s unprotected while it’s being processed. NVIDIA Confidential Computing addresses this gap by protecting data and applications in use. The NVIDIA Hopper architecture introduces the world’s first accelerated computing platform with confidential computing capabilities.

With strong hardware-based security, users can run applications on-premises, in the cloud, or at the edge and be confident that unauthorized entities can’t view or modify the application code and data when it’s in use. This protects confidentiality and integrity of data and applications while accessing the unprecedented acceleration of H200 and H100 GPUs for AI training, AI inference, and HPC workloads."

NVIDIA Hopper GPU Architecture

World’s most advanced GPU.

www.nvidia.com

Let's be clear about something: even Intel didn't expect AMX would effectively counter GPUs!

https://www.tomshardware.com/pc-com...di-3-will-cost-half-the-price-of-nvidias-h100

https://www.tomshardware.com/pc-com...cchio-to-focus-on-falcon-shores-gaudi-2-and-3

I think AMX was part of a hedge on the bets Intel was making in Ponte Vecchio (GPU Max) and Habana Labs (Gaudi). They knew Xeon Phi was going away and they'd have to transition to those new product lines, but they weren't sure how good or quick uptake would be, or how many customers would come along.

CmdrShepard · Jun 16, 2024

I know what dot product is, read this to see what I had in mind (skip to This isn’t about Machine Learning or Artificial Intelligence sub-heading if you are impatient):

tdpbuud: Average Color – Wunk

Using artificial intelligence and machine learning instructions to get the average color of an image

wunkolo.github.io

People find all kinds of creative uses for new instructions all the time.

As for NVIDIA CC, it relies on SEV-SNP (AMD) or TDX (Intel) -- it basically just covers the GPU code and data paths and can't do anything with the system itself.

bit_user · Jun 16, 2024

CmdrShepard said:
I know what dot product is, read this to see what I had in mind (skip to This isn’t about Machine Learning or Artificial Intelligence sub-heading if you are impatient):

tdpbuud: Average Color – Wunk

Using artificial intelligence and machine learning instructions to get the average color of an image

wunkolo.github.io

Yes, like I said: it's really a convolution engine. That's a degenerate case.

It's funny to me that he thinks multiplying by lots of zeros is better than the conventional approaches of de-interleaving. Planar image formats are better for most purposes, anyways. If he started with a planar format, he could get an easy ~4x speedup! (And if you don't have a planar image, then you can write a simple AVX-based deinterleaver that's almost as fast as a memcpy().)

News Intel fires back at AMD's AI benchmarks, shares results claiming current-gen Xeon chips are faster at AI than next-gen EPYC Turin

Titan

Prominent

Reputable

Titan

Prominent

Distinguished

Reputable

Titan

Respectable

Titan

Respectable

Titan

Respectable

Prominent

Titan

Prominent

Titan

Prominent

Titan

Upstanding

Reputable

Prominent

Titan

Prominent

Titan

Share this page