News Intel fires back at AMD's AI benchmarks, shares results claiming current-gen Xeon chips are faster at AI than next-gen EPYC Turin


bit_user

Polypheme
Ambassador
But you expect us to believe some emo hippie on YouTube? Give me a break.
Steve Burke isn't just "some emo hippie". He's one of the more knowledgeable and trustworthy tech tubers out there.

Last time I checked, Intel CPUs were still the top results in all relevant charts from performance to sales.
AMD's 7800X3D beats it in numerous games and AMD CPUs usually beat Intel in Linux.

Also, Intel launched into video card business and is doing very good considering ...
Losing money is not "doing very good", regardless of what you consider. The cards were years late to market and still nowhere near ready, let alone competitive.

how long both AMD and NVIDIA have been entrenched there as the only contenders.
How long has Intel's iGPU business been around? Shouldn't they know more about building GPU hardware and drivers, by now?

4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle and 1,024 BF16 operations per cycle.
And it's still not as fast as a humble gaming GPU costing 1/10th as much!
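For scale, here's a back-of-envelope peak from that quoted per-cycle figure, assuming it is per core and taking a hypothetical 12-core part at 4.6 GHz (sustained clocks under a heavy AMX load will be lower):

```latex
% Peak INT8 throughput, assuming 2,048 ops/cycle is a per-core figure:
% 2048 ops/cycle/core x 12 cores x 4.6 GHz
\[
  2048 \times 12 \times 4.6\times10^{9} \approx 1.13\times10^{14}
  \ \tfrac{\text{INT8 ops}}{\text{s}} \approx 113\ \text{TOPS (theoretical peak)}
\]
```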
 
Last edited:

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
Steve Burke isn't just "some emo hippie".
You're right, he is also an arrogant obnoxious attention _____ (5 letter word, starts with w).
He's one of the more knowledgeable and trustworthy tech tubers out there.
Considering the competition, the knowledgeable part isn't that hard. As for trustworthy? Maybe for you, but not for me.
AMD's 7800X3D beats it in numerous games and AMD CPUs usually beat Intel in Linux.
Yeah, it beats it by 5% on average in games where the average FPS is already 200+ at 1080p, and there's like a 1% difference at 4K where games aren't CPU limited. Besides, all benchmarks done so far should be redone with proper Intel BIOS settings to see the actual performance, which might even end up being better with reduced power draw and less throttling.

Linux performance? WGAF about that?

I'd hardly call that a beating.
Losing money is not "doing very good", regardless of what you consider. The cards were years late to market and still nowhere near ready, let alone competitive.
I don't see how any company could enter the PC GPU market and do anything but lose money at the start, even if they did have integrated GPU experience. The cards are decent, and very competitive for the asking price.
And it's still not as fast as a humble gaming GPU costing 1/10th as much!
Which GPU are you talking about exactly here? If a Xeon is $1,000, I'd like to see a (new, not used; current, not old) $100 GPU that can beat it in AI inference, because for $100 you won't even get enough VRAM to load a decent model on it.
 
  • Like
Reactions: Metal Messiah.

bit_user

Polypheme
Ambassador
You're right, he is also an arrogant obnoxious attention _____ (5 letter word, starts with w).
Oof. People in glass houses shouldn't throw stones.

Which GPU are you talking about exactly here? If a Xeon is $1,000, I'd like to see a (new, not used; current, not old) $100 GPU that can beat it in AI inference, because for $100 you won't even get enough VRAM to load a decent model on it.
$1000 will hardly get you in the door, with Xeon pricing. Don't you have a Xeon W? I shouldn't have to tell you that!

No, I was assuming you were talking about a serious number of cores, because AI and all that. Now, I was also thinking about the server Xeons, but if you want to do Xeon W, we can go down that road instead and see how they match up.
 
  • Like
Reactions: helper800

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
Oof. People in glass houses shouldn't throw stones.
I am not the one bitching and moaning on YouTube while earning money from ad impressions of impressionable idiots soaking up their next opinion from him because they don't have and can't form their own -- I can therefore throw whatever I want.
$1000 will hardly get you in the door, with Xeon pricing. Don't you have a Xeon W? I shouldn't have to tell you that!
Oh no, we aren't shifting the goalposts -- you made a comparison of a CPU to GPU performance and cost WITHOUT bringing external elements into play.

I've quoted the Xeon w5-2455X CPU price and asked for a GPU at 1/10th of that price to match. You made that ridiculous claim, not me. And of course you can't have a GPU working by itself either -- you need a full system to feed it, just like you do for a CPU, so we can disregard that cost. Just say which GPU you had in mind.

No, I was assuming you were talking about a serious number of cores, because AI and all that.
12 cores, 24 threads. The AMX unit in each core has 8 tile registers, each 1 KB in size, plus a TMUL unit. Oh, and there's AVX-512 too.
Now, I was also thinking about the server Xeons, but if you want to do Xeon W, we can go down that road instead and see how they match up.
You wanted server Xeons because that would have given you more headroom for a more expensive and thus more powerful GPU. Nope.
 

NeoMorpheus

Reputable
Jun 8, 2021
223
250
4,960
[Image: "woosh" comic-zoom illustration]
The scripts Intel ran are available on GitHub FYI.
 

bit_user

Polypheme
Ambassador
I am not the one bitching and moaning on YouTube while earning money from ad impressions of impressionable idiots soaking up their next opinion from him because they don't have and can't form their own -- I can therefore throw whatever I want.
Actually, you just undermined your original point. You alleged he has a character flaw, and then proceeded to provide a justification for it.

Oh no, we aren't shifting the goalposts -- you made a comparison of a CPU to GPU performance and cost WITHOUT bringing external elements into play.
You already established:

"4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle and 1,024 BF16 operations per cycle. "

My statement assumed an element of reasonableness that I'm beginning to see is a mistake, when dealing with you. Someone interested in AMX performance would likely go for a mid-range Xeon model with the best perf/$, since the AMX performance is most directly tied to how many cores it has. I didn't mean to imply that one could take any Xeon SP model and find a dGPU costing 1/10th as much that would out-perform it. That would obviously be impossible, considering the cheapest Sapphire Rapids Xeon costs just $425 and no dGPU exists that can be purchased for 1/10th of that price.

The value in having these debates is to establish the relative merits of the products and their approaches. Trying to win on a technicality is at odds with that and suggests you're really just more interested in burnishing your ego than arriving at deeper truths.

I've quoted the Xeon w5-2455X CPU
Oh, well now that we're nit-picking, I'm going to hold you to your original statement, which pertained to the Xeon Scalable product line. You don't want any goalpost-moving, right?

And of course you can't have a GPU working by itself either -- you need a full system to feed it, just like you do for a CPU, so we can disregard that cost. Just say which GPU you had in mind.
The way that works is you get a cheap CPU with lots of PCIe lanes. Then, you can pack in several GPUs to amortize the overhead costs of the host.

12 cores, 24 threads. The AMX unit in each core has 8 tile registers, each 1 KB in size, plus a TMUL unit. Oh, and there's AVX-512 too.
Not sure why you're listing anything but the number of cores, since that's what the AMX unit is tied to. Also, I know how many registers it has and how big they are.
 
  • Like
Reactions: helper800

TheHerald

Upstanding
Feb 15, 2024
390
99
260
Steve Burke isn't just "some emo hippie". He's one of the more knowledgeable and trustworthy tech tubers out there.
Is he? Strongly disagree.

For example, he made an entire video to prove that the 7800X3D is more efficient at iso power compared to the 14700K in MT. The only actual MT test he did was Blender; the 14700K topped the entire chart in efficiency, and he proceeded to downplay it by saying "nobody is going to run it at this configuration". Sorry, can't find that very trustworthy.
 

TheHerald

Upstanding
Feb 15, 2024
390
99
260
Are you sure you're not omitting some context? What did he mean by "this configuration"?
He meant power-limiting it to the 7800X3D's power draw. Basically he made a video to disprove that the 14700K is more efficient at iso power in MT, and when the only thing he actually tested (Blender) had the 14700K sitting on top (it only lost to a €4,999 Threadripper), he said it's not relevant because nobody runs it like that.
 

bit_user

Polypheme
Ambassador
He meant power-limiting it to the 7800X3D's power draw. Basically he made a video to disprove that the 14700K is more efficient at iso power in MT, and when the only thing he actually tested (Blender) had the 14700K sitting on top (it only lost to a €4,999 Threadripper), he said it's not relevant because nobody runs it like that.
Okay, so he guessed wrong. At least he did the testing and showed the results, even though they made him look bad.

IMO, it's not surprising that a 20 core/28 thread CPU would be more efficient than an 8/16 one, especially given that rendering doesn't tend to respond well to the extra L3 cache.
 

TheHerald

Upstanding
Feb 15, 2024
390
99
260
Okay, so he guessed wrong. At least he did the testing and showed the results, even though they made him look bad.

IMO, it's not surprising that a 20 core/28 thread CPU would be more efficient than an 8/16 one, especially given that rendering doesn't tend to respond well to the extra L3 cache.
The problem isn't that he guessed wrong; that's fine, whatever. The problem is he quickly glossed over the results and even made a comment about it not mattering because who runs it like that. He wouldn't be so quick to gloss over it if it were the other way around.
 
  • Like
Reactions: CmdrShepard

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
Actually, you just undermined your original point. You alleged he has a character flaw, and then proceeded to provide a justification for it.
Both can be true so no.
Someone interested in AMX performance would likely go for a mid-range Xeon model with the best perf/$, since the AMX performance is most directly tied to how many cores it has.
And you don't think the core base / boost clock affects the AMX speed?
I didn't mean to imply that one could take any Xeon SP model and find a dGPU costing 1/10th as much that would out-perform it. That would obviously be impossible, considering the cheapest Sapphire Rapids Xeon costs just $425 and no dGPU exist that can be purchased for 1/10th of that price.
That's how it sounded and how non-tech-savvy people reading your post would've understood it -- you should have clarified better what you meant by Xeon (SP/MP, how many cores, price range, etc.) instead of making a blanket statement like that with the singular goal of dismissing AMX as a gimmick compared to even the cheapest of dGPUs.
The value in having these debates is to establish the relative merits of the products and their approaches. Trying to win on a technicality is at odds with that and suggests you're really just more interested in burnishing your ego than arriving at deeper truths.
Says the guy who made a blanket statement about merits to begin with.
Oh, well now that we're nit-picking, I'm going to hold you to your original statement, which pertained to the Xeon Scalable product line. You don't want any goalpost-moving, right?
And the architectural difference between Xeon Scalable (which was in Intel's quote, not my choice) and Xeon W is what (apart from the obvious lack of scalability in a workstation-class CPU)? When it comes to AMX they are functionally the same.
The way that works is you get a cheap CPU with lots of PCIe lanes. Then, you can pack in several GPUs to amortize the overhead costs of the host.
Then you would actually want Xeon W because they can have up to 112 PCI-Express lanes compared to 80 lanes in Xeon Scalable. On the other hand, in order to saturate those GPUs you can't exactly get the cheapest possible CPU either, and if you are already paying for a more powerful CPU you might as well use it.

Thankfully there are already some additional 3rd-party benchmarks which can validate that Intel AMX isn't something to sneeze at:

https://www.phoronix.com/review/intel-xeon-amx/6
 

bit_user

Polypheme
Ambassador
And you don't think the core base / boost clock affects the AMX speed?
I specifically said "most directly".

Boost clocks won't enter the picture and I don't trust Intel's base clocks on this, either. I had direct experience with a Skylake-era Xeon SP dropping its clock speeds well below the specified base clocks, when hit with a moderate AVX-512 workload. So, to make any remotely serious performance estimate, I would actually do some digging around to see what happens to clocks under a heavy AMX load.

That's how it sounded and how non-tech-savvy people reading your post would've understood it -- you should have clarified better what you meant by Xeon (SP/MP, how many cores, price range, etc.) instead of making a blanket statement like that with the singular goal of dismissing AMX as a gimmick compared to even the cheapest of dGPUs.

And the architectural difference between Xeon Scalable (which was in Intel's quote, not my choice) and Xeon W is what
That's why I was willing to allow for it, as a point of comparison, before I saw that you were going to be more pedantic than curious.

Then you would actually want Xeon W because they can have up to 112 PCI-Express lanes compared to 80 lanes in Xeon Scalable.
Or maybe go with an EPYC or TR Pro, which not only has more PCIe lanes but more memory channels with which to feed them.

On the other hand, in order to saturate those GPUs you can't exactly get the cheapest possible CPU either
How many CPU cores to specify really depends on what you're doing. Ideally, the CPUs really aren't doing the heavy number-crunching.

Thankfully there are already some additional 3rd-party benchmarks which can validate that Intel AMX isn't something to sneeze at:

https://www.phoronix.com/review/intel-xeon-amx/6
It doesn't answer the question, since Phoronix Test Suite has no OpenVINO benchmarks in common between the CPU and GPU devices. I'd suspect that's intentional, but then he only defined four benchmarks for the GPU. So, it seems more like he regards OpenVINO as a CPU benchmarking tool than a GPU one.
 

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
Boost clocks won't enter the picture and I don't trust Intel's base clocks on this, either. I had direct experience with a Skylake-era Xeon SP dropping its clock speeds well below the specified base clocks, when hit with a moderate AVX-512 workload. So, to make any remotely serious performance estimate, I would actually do some digging around to see what happens to clocks under a heavy AMX load.
There's nothing to dig -- there is an MSR value called TMUL (negative) offset which can be set in the BIOS if overclocking is supported on the CPU and mainboard (which it is for Xeon W).

I just ran an ONNX Runtime micro-benchmark with AMX at 4.6 GHz (so offset 0) on my w5-2455X, and it is surprisingly power efficient compared to AVX-512 -- it used up to 200W of power.

AVX-512 on the other hand can pull well over 300W and I can't run it at 4.6 GHz because my current cooling setup can't sustain that much dissipation without being too loud for my taste.
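For anyone who wants to run something comparable, a minimal sketch of timing CPU inference with ONNX Runtime -- the model path, input shape, and run counts here are placeholders, not the exact micro-benchmark referenced above:

```python
# Rough CPU inference timing with ONNX Runtime (placeholder model/shape).
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(5):                      # warm-up runs
    sess.run(None, {inp.name: x})

n = 100
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, {inp.name: x})
print(f"{n / (time.perf_counter() - t0):.1f} inferences/s")
```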
That's why I was willing to allow for it, as a point of comparison, before I saw that you were going to be more pedantic than curious.
I am always curious, it's my main character flaw.
Or maybe go with an EPYC or TR Pro, which not only has more PCIe lanes but more memory channels with which to feed them.
Or that, yeah.
It doesn't answer the question, since Phoronix Test Suite has no OpenVINO benchmarks in common, between the CPU and GPU device. I'd suspect that's intentional, but then he only defined four benchmarks for the GPU. So, it seems more like he regards OpenVINO as a CPU benchmarking tool than a GPU one.
The point of that benchmark was to compare AMX on/off with AMD. To me, it throws a monkey wrench into those "AMD beating / spanking Intel" fanboy claims.
 

bit_user

Polypheme
Ambassador
There's nothing to dig -- there is an MSR value called TMUL (negative) offset which can be set in the BIOS if overclocking is supported on the CPU and mainboard (which it is for Xeon W).

I just ran an ONNX Runtime micro-benchmark with AMX at 4.6 GHz (so offset 0) on my w5-2455X, and it is surprisingly power efficient compared to AVX-512 -- it used up to 200W of power.

AVX-512 on the other hand can pull well over 300W ...
Thanks for the info.

The point of that benchmark was to compare AMX on/off with AMD. To me, it throws a monkey wrench into those "AMD beating / spanking Intel" fanboy claims.
AMX doesn't apply to much beyond AI, and AMX doesn't even magically speed up all AI. It only supports operations on dense matrices and only supports int8 and bf16.

Now, let's look at Intel's claims about it. They say it does inferencing on a simple image classification network (Resnet50) 50% faster than an Nvidia A30. The A30 is rated at 165 half-precision tensor TFLOPS. The RTX 4080 Super is rated at 209 (dense), which is 26.7% faster, so not quite there. The RTX 4090 jumps all the way up to 330, which should correspond to 33.3% faster than their AMX test vehicle.
That doc doesn't directly say which CPU they used, but I followed the link and found they're using a Xeon Platinum 8480+ for several benchmarks in this class. That has a list price of $10.7k.

So, if utilizing sparsity adds enough performance to the RTX 4080 Super for it to match or exceed the Xeon, then it's exactly as I said - a 10:1 price ratio. Or, if you use a network that won't fit in the CPU's L3 cache, since Resnet50 is quite old and most networks now in use are much larger, then it's a similar situation where the RTX 4080 Super will beat the Xeon by the sheer force of its memory bandwidth.

However, in the case most favorable to Intel (because it's the one they chose to highlight), under the test conditions most favorable to them, the ratio is a mere 6.2:1. ...but, if we look at performance/$, we get 2.24 fps/$ for the Xeon and 18.6 fps/$ for the RTX 4090. That's a ratio of 8.3:1, on Intel's chosen workload.
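For transparency, the arithmetic behind those ratios -- the throughput and prices below are simply the values implied by the fps/$ and list-price figures above, not independently verified:

```python
# Reproduce the perf/$ ratios quoted above (all inputs are assumptions
# implied by the figures in this post, not measured numbers).
xeon_fps, xeon_price = 24_000, 10_700          # ~2.24 fps/$
gpu_fps = xeon_fps * 1.333                     # RTX 4090 ~33.3% faster (per tensor ratings)
gpu_price = 1_720                              # ~18.6 fps/$

print(round(xeon_fps / xeon_price, 2))                                  # 2.24
print(round(gpu_fps / gpu_price, 1))                                    # 18.6
print(round((gpu_fps / gpu_price) / (xeon_fps / xeon_price), 1))        # 8.3
```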
 
  • Like
Reactions: helper800

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
AMX doesn't apply to much beyond AI, and AMX doesn't even magically speed up all AI. It only supports operations on dense matrices and only supports int8 and bf16.
I don't see why someone can't use it for something else too (DCT quantization of image/video comes to mind as a possible use and so does audio DSP), and as for AI it is kind of magical because it is supported in PyTorch and ONNX and likely in other frameworks already.
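For context, a minimal sketch of enabling bf16 CPU execution in PyTorch -- the tiny model is purely illustrative, and whether the matmuls actually dispatch to AMX depends on the CPU and the oneDNN build:

```python
# Hedged sketch: bf16 autocast on CPU, which oneDNN can route to AMX
# on capable Xeons (model and shapes are placeholders).
import torch

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(32, 1024)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```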
Now, let's look at Intel's claims about it. They say it does inferencing on a simple image classification network (Resnet50) 50% faster than an Nvidia A30. The A30 is rated at 165 half-precision tensor TFLOPS. The RTX 4080 Super is rated at 209 (dense), which is 26.7% faster, so not quite there. The RTX 4090 jumps all the way up to 330, which should correspond to 33.3% faster than their AMX test vehicle.
The RTX 4090 also has a 450W TDP, limited RAM compared to a CPU, and increased processing latency (you can't eliminate PCI-e bus latency for host->device->host transfers).
That doc doesn't directly say which CPU they used, but I followed the link and found they're using a Xeon Platinum 8480+ for several benchmarks in this class. That has a list price of $10.7k.

So, if utilizing sparsity adds enough performance to the RTX 4080 Super for it to match or exceed the Xeon, then it's exactly as I said - a 10:1 price ratio. Or, if you use a network that won't fit in the CPU's L3 cache, since Resnet50 is quite old and most networks now in use are much larger, then it's a similar situation where the RTX 4080 Super will beat the Xeon by the sheer force of its memory bandwidth.

However, in the case most favorable to Intel (because it's the one they chose to highlight), under the test conditions most favorable to them, the ratio is a mere 6.2:1. ...but, if we look at performance/$, we get 2.24 fps/$ for the Xeon and 18.6 fps/$ for the RTX 4090. That's a ratio of 8.3:1, on Intel's chosen workload.
And what is the performance of a 4090 with a 90 GB model? Zero fps/$. You are comparing silly things. If you want to compare video cards, then compare the professional cards with a proper amount of VRAM and you will see they aren't cheaper than a $10k Xeon.

Look, I am not saying GPUs aren't faster, what with all those Tensor and CUDA cores, just that CPUs with AI acceleration have their own place in the market where larger models are needed and where VRAM is prohibitively expensive.

Intel currently has a better-performing product in that segment than AMD, despite AMD's claims to the contrary, and that was my point, since that's what we were discussing here.
 
  • Like
Reactions: helper800

bit_user

Polypheme
Ambassador
I don't see why someone can't use it for something else too (DCT quantization of image/video comes to mind as a possible use
That's because you didn't bother to look at the actual instructions it supports. It just computes dot products. That's it. For DCTs used in image & video compression, the blocks are so small it's probably not worth the overhead of using AMX.
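As a rough reference for what that dot-product hardware amounts to (conceptually only; the function name is made up and the real instructions work on 4-byte groups within tiles):

```python
# Conceptual AMX-style tile multiply-accumulate: int8 inputs, int32 accumulation.
import numpy as np

def tile_multiply_accumulate(C: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """C (int32, MxN) += A (int8, MxK) @ B (int8, KxN)."""
    C += A.astype(np.int32) @ B.astype(np.int32)
    return C
```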

and so does audio DSP),
Honestly, what can you do with audio at only 8 bits? BF16 isn't much better, as the mantissa is still only 8 bits. Maybe it's good enough for certain kinds of audio analysis, but nobody would want to listen to audio processed using BF16.

and as for AI it is kind of magical because it is supported in PyTorch and ONNX and likely in other frameworks already.
For AI, it's really aimed at convolutional neural networks. Big convolutions, involving non-separable 2D kernels. That's great for image analysis, like their Resnet50 example.

The RTX 4090 also has a 450W TDP, limited RAM compared to a CPU, and increased processing latency (you can't eliminate PCI-e bus latency for host->device->host transfers).
For sure, the RAM limit is an issue for training. As far as latency goes, inferencing is usually batched anyhow. You can stream your videos or images into the GPU's memory, have it decompress them, and then asynchronously collect the results as they finish.
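A minimal sketch of that pattern in PyTorch -- model and loader are placeholders, and the async copy only truly overlaps compute when the host buffers are pinned:

```python
# Batched GPU inference with asynchronous host->device copies (hedged sketch).
import torch

@torch.no_grad()
def run_batches(model, loader, device="cuda"):
    model = model.to(device).eval()
    outputs = []
    for batch in loader:                             # CPU-side decode/preprocess
        batch = batch.to(device, non_blocking=True)  # copy can overlap prior compute
        outputs.append(model(batch).cpu())           # collect results as they finish
    return outputs
```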

I am not saying GPUs aren't faster, what with all those Tensor and CUDA cores, just that CPUs with AI acceleration have their own place in the market where larger models are needed and where VRAM is prohibitively expensive.
What's notable about CPUs and AI is that some people have actually resorted to using them for it, due to the scarcity of GPUs capable of training large models. Because of that, AMX probably did end up being a win for Intel.

The down side is that it made their server P-cores that much bigger, which really hurt their core density until Sierra Forest finally launched.

Intel currently has a better-performing product in that segment than AMD, despite AMD's claims to the contrary, and that was my point, since that's what we were discussing here.
Eh, you're spoiling for a fight, but nobody cares. What I find funny is how you were so dismissive of AMD's Linux performance and yet CPU AI performance is way less important than that!
 
  • Like
Reactions: helper800

slightnitpick

Upstanding
Nov 2, 2023
164
102
260
Unlike with Apple, the ARM architecture will fracture the Windows software ecosystem if successful, with likely significant negative repercussions to the overall perceived stability and reliability of the Windows platform, as a PC can no longer be relied upon to "just work" with software, due to the split architecture.
For most people I'd assume that most new software would come from the Windows store. In other cases I'd expect software publishers to have an install script wrapper that autodetects the platform and downloads the compatible version.
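A minimal sketch of what such a wrapper could look like (URLs and file names are placeholders):

```python
# Pick an x86-64 or ARM64 build based on the reported machine architecture.
import platform
import urllib.request

BUILDS = {
    "AMD64": "https://example.com/app-x64.msi",    # 64-bit x86 Windows reports "AMD64"
    "ARM64": "https://example.com/app-arm64.msi",
}

arch = platform.machine().upper()
url = BUILDS.get(arch)
if url is None:
    raise SystemExit(f"No build available for architecture: {arch}")
urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```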
Emulation offers a band-aid solution for the time being, but for the average consumer, and especially the large SMB market, who don't fully understand the difference and simply need their software to work, which for many companies can be positively ancient and barely able to run on modern x86 systems, this is likely to result in significant frustration which, if not handled just right, may collapse majority opinion on Windows ARM machines, resulting in a death spiral they cannot recover from.
For the reason you indicate ("barely able to run on modern x86 systems") this old software typically doesn't get installed on new machines. It gets used on the old machines. So no problem there.

Even smaller companies typically have a third-party IT department that they contract with. Basically no one except IT themselves is deciding what computers to buy. I expect the regular IT certification process to be adequate to prevent frustration here.
 
  • Like
Reactions: bit_user

Don Frenser

Reputable
Mar 29, 2020
31
8
4,535
I will not believe anything any shintel rep says.

The fraud company should have been banned from the universe 20 years ago, when they were found out for the bribes they paid.
 

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
That's because you didn't bother to look at the actual instructions it supports. It just computes dot products. That's it. For DCTs used in image & video compression, the blocks are so small it's probably not worth the overhead of using AMX.
DCT blocks vary in size for video, and I am sure with some clever tricks you could do more than one block at a time given the tile size.
Honestly, what can you do with audio at only 8 bits?
Newsflash: you can do quality audio even with a 1-bit ADC. But to answer your question, if your dynamic range is controlled you don't really need 16 bits to get good enough quality for speech.
BF16 isn't much better, as the mantissa is still only 8 bits. Maybe it's good enough for certain kinds of audio analysis, but nobody would want to listen to audio processed using BF16.
What if it's a machine listening to it? Like picking out trigger words out of a bunch of low bandwidth audio channels?
For AI, it's really aimed at convolutional neural networks. Big convolutions, involving non-separable 2D kernels. That's great for image analysis, like their Resnet50 example.
From what I can find, it can work for other stuff as well:
https://huggingface.co/blog/stable-diffusion-inference-intel

It is nowhere near the performance of an RTX 4090, of course, but it provides a good speedup for CPU inference.
For sure, the RAM limit is an issue for training.
Not only for training.
As far as latency goes, inferencing is usually batched anyhow. You can stream your videos or images into the GPU's memory, have it decompress them, and then asynchronously collect the results as they finish.
That's not what I am saying -- think of a hypothetical code completion model, for example. You type a character in a text editor, it is sent to the GPU along with context, the model analyzes it and returns results to display. The GPU roundtrip is always going to have more latency than CPU processing here, and you can't really hide it by batching.
What's notable about CPUs and AI is that some people have actually resorted to using them for it, due to the scarcity of GPUs capable of training large models. Because of that, AMX probably did end up being a win for Intel.
Considering the prices of such GPUs that seems like a smart move. Not only do you control the core count and RAM amount, allowing you to scale out as needed, but you also aren't limited to using CUDA only.
The down side is that it made their server P-cores that much bigger, which really hurt their core density until Sierra Forest finally launched.
Sierra Forest chips are E-core Xeons which have different accelerators -- they don't have AVX-512 or AMX. You want Emerald Rapids (5th Gen) or Granite Rapids (6th Gen) Xeons for comparison.

There's also another upside (provided that the tech works as advertised) to using CPUs for AI instead of GPUs -- Intel TDX. You can have VMs fully isolated from the hypervisor and everything else, with encrypted memory. As far as I know GPUs don't support this level of protection yet, so there's no such thing as private AI inference in the cloud (and that's why Apple is building their own).
What I find funny is how you were so dismissive of AMD's Linux performance and yet CPU AI performance is way less important than that!
I was dismissive of Linux performance in general, not AMD Linux performance. That's important to people running Linux servers, not to consumers. AI performance on the other hand is important to everyone who would like to use AI.
 
  • Like
Reactions: helper800

bit_user

Polypheme
Ambassador
DCT blocks vary in size for video, and I am sure with some clever tricks you could do more than one block at a time given the tile size.
I trust you know what a dot product is, right? It has one answer. A matrix multiply of matrices A and B consists of A_M × B_N dot products, where the length of the operands is A_N (which must be the same as B_M). I'm no math genius, but I don't see a generalized way to build a joint matrix multiply out of that. For an 8x8 matrix, I think it always amounts to 64 dot products of 8-element vectors. Those vectors are so small that even the largest 32x32 blocks can fit in regular AVX registers. AMX has no special advantage here.
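In standard notation, for A of size M×K and B of size K×N, the product C = AB is M·N dot products, each of length K:

```latex
C_{ij} \;=\; \sum_{k=1}^{K} A_{ik}\,B_{kj},
\qquad 1 \le i \le M,\quad 1 \le j \le N .
```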

Newsflash: you can do quality audio even with a 1-bit ADC.
There's no free lunch. After delta-sigma conversion, your bitrate is usually even higher and processing is now much more complex! I think most non-trivial processing on delta-sigma bitstreams probably converts back to PCM, before anything else. Delta-sigma was mostly done as a hack to make high-performance DACs cheaper.

But to answer your question, if your dynamic range is controlled you don't really need 16 bits to get good enough quality for speech.
Oh, sure. If speech at the quality of legacy telephone systems is all you want, then 8 bits is plenty. As I said, you can do some audio analysis @ 8-bits (including speech recognition). However, when people speak more broadly of audio signal processing, I figure they're including applications like processing for studio production and live music performance.

It is nowhere near the performance of an RTX 4090, of course, but it provides a good speedup for CPU inference.

Not only for training.
Yeah, but because you can use cheaper client GPUs for inference, and memory size doesn't tend to be such a limiting factor, I think the main use case for AMX is in training.

That's not what I am saying -- think of a hypothetical code completion model, for example. You type a character in a text editor, it is sent to the GPU along with context, the model analyzes it and returns results to display. The GPU roundtrip is always going to have more latency than CPU processing here, and you can't really hide it by batching.
Yes, realtime inferencing in a closed loop is an example of where you can't use batching. Realtime control applications, like robotics or self-driving, are where you tend to find examples of that. However, those things typically don't involve a big server CPU. Instead, they tend to use something like Nvidia's AGX systems.


Considering the prices of such GPUs that seems like a smart move. Not only do you control the core count and RAM amount, allowing you to scale out as needed,
Oh, but your scaling is effectively limited to 8 sockets. Training needs high inter-processor bandwidth, hence Nvidia's focus on NVLink. Sure, you can scale out to multiple chassis, but it doesn't scale linearly in either performance or especially cost. People do it, as I said, but it's not a panacea.

you also aren't limited to using CUDA only.
Yes, and that's a good thing. The near-term practical benefits of avoiding CUDA in AI are virtually nonexistent, but I want to see us move away from CUDA as much as anyone.

Sierra Forest chips are E-core Xeons which have different accelerators -- they don't have AVX-512 or AMX.
My point was that Sapphire & Emerald Rapids have AMX accelerators - you're paying for that silicon, even if you don't need them (e.g. for web serving or general database serving, etc.). This problem wasn't solved until Sierra Forest, but that came years late.

There's also another upside (provided that the tech works as advertised) to using CPUs for AI instead of GPUs -- Intel TDX. You can have VMs fully isolated from the hypervisor and everything else, with encrypted memory. As far as I know GPUs don't support this level of protection yet
You really should give Nvidia more credit, given how long they've been in the cloud computing game. If features like that are important for CPUs, you can bet Nvidia's customers have been asking for the same sorts of capabilities in their GPUs. I don't know much about it, but Nvidia claims to have a feature called Confidential Computing:

[Image: Nvidia Hopper architecture confidential computing diagram]


"While data is encrypted at rest in storage and in transit across the network, it’s unprotected while it’s being processed. NVIDIA Confidential Computing addresses this gap by protecting data and applications in use. The NVIDIA Hopper architecture introduces the world’s first accelerated computing platform with confidential computing capabilities.

With strong hardware-based security, users can run applications on-premises, in the cloud, or at the edge and be confident that unauthorized entities can’t view or modify the application code and data when it’s in use. This protects confidentiality and integrity of data and applications while accessing the unprecedented acceleration of H200 and H100 GPUs for AI training, AI inference, and HPC workloads."


Let's be clear about something: even Intel didn't expect AMX would effectively counter GPUs!

I think AMX was part of a hedge on the bets Intel was making in Ponte Vecchio (GPU Max) and Habana Labs (Gaudi). They knew Xeon Phi was going away and they'd have to transition to those new product lines, but they weren't sure how good or quick uptake would be, or how many customers would come along.
 
Last edited:
  • Like
Reactions: helper800

CmdrShepard

Prominent
Dec 18, 2023
416
307
560
I know what a dot product is; read this to see what I had in mind (skip to the "This isn’t about Machine Learning or Artificial Intelligence" sub-heading if you are impatient):


People find all kinds of creative uses for new instructions all the time.

As for NVIDIA CC, it relies on SEV-SNP (AMD) or TDX (Intel) -- it basically just covers the GPU code and data paths and can't do anything with the system itself.
 
  • Like
Reactions: helper800

bit_user

Polypheme
Ambassador
I know what a dot product is; read this to see what I had in mind (skip to the "This isn’t about Machine Learning or Artificial Intelligence" sub-heading if you are impatient):

Yes, like I said: it's really a convolution engine. That's a degenerate case.

It's funny to me that he thinks multiplying by lots of zeros is better than the conventional approaches of de-interleaving. Planar image formats are better for most purposes, anyways. If he started with a planar format, he could get an easy ~4x speedup! (And if you don't have a planar image, then you can write a simple AVX-based deinterleaver that's almost as fast as a memcpy().)
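For illustration, a conceptual deinterleave in NumPy -- an AVX version would do the same per row with shuffles, but the data movement is identical:

```python
# (H, W, 4) interleaved RGBA -> (4, H, W) planar copy.
import numpy as np

def deinterleave_rgba(interleaved: np.ndarray) -> np.ndarray:
    return np.ascontiguousarray(interleaved.transpose(2, 0, 1))
```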
 
Last edited:
  • Like
Reactions: helper800
