News: Intel Launches Sapphire Rapids Fourth-Gen Xeon CPUs and Ponte Vecchio Max GPU Series


bit_user

I am curious to see what next-gen Non-Volatile Memory (NVM) options connecting through CXL will appear on the market in a couple of years?

And what will the latency be, given that it will be using the PCIe 5 or PCIe 6 bus, whereas Intel Optane DC PMem was pluggable on the DRAM bus?
Last things first: CXL doesn't run over PCIe; it runs instead of it. They share the same PHY specification but use different protocols. Two big differences are CXL's support for cache coherency and its lower latency.

With that out of the way, let's dig into the latency question. The first piece of this is how much latency Optane PMem modules have. According to this helpful article I found, the answer seems to be about 350 ns.


The same article also speaks to how much latency CXL is likely to add. I'm not sure it answers that directly, but they quote a figure of 170 to 250 ns, which they claim is on par with accessing DRAM connected to a different CPU in a NUMA system (presumably 2-socket). So, anyone using Optane PMem as a "cheap" or more-scalable alternative to DRAM should find the transition to CXL.mem fairly painless. With a good memory tiering implementation, you could even see a noticeable performance improvement.
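
To make that concrete, here's a rough back-of-envelope sketch. The 350 ns and 250 ns figures are the ones quoted above; the ~100 ns local-DRAM latency and the hot-page hit rates are my own assumptions, purely for illustration.

```python
# Back-of-envelope: flat Optane PMem vs. a DRAM + CXL.mem tier.
# 350 ns and 250 ns come from the figures quoted above; the 100 ns local
# DRAM latency and the hit rates below are illustrative assumptions.
DRAM_NS = 100          # assumed local DRAM access latency
OPTANE_PMEM_NS = 350   # Optane PMem on the DRAM bus (figure quoted above)
CXL_MEM_NS = 250       # upper end of the 170-250 ns CXL figure quoted above

def tiered_latency(hot_hit_rate: float) -> float:
    """Average latency when hot pages sit in DRAM and cold pages in CXL.mem."""
    return hot_hit_rate * DRAM_NS + (1.0 - hot_hit_rate) * CXL_MEM_NS

for hit in (0.5, 0.9, 0.99):
    print(f"hot-page hit rate {hit:.0%}: "
          f"flat PMem {OPTANE_PMEM_NS} ns vs tiered {tiered_latency(hit):.0f} ns")
```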

Any guess? Nantero carbon-nanotube NRAM? Everspin / Avalanche magnetic MRAM (STT-MRAM, SOT-MRAM, VG-SOT-MRAM, …)?
For anyone who truly needs persistence, I think NAND-backed DRAM will be hard to beat, unless some of those technologies can really blow past the GB/$ of DDR5.

It is a very slow process, and some government [investment] and/or regulation (e.g. a law mandating that by 2035 all IT systems must use bistable Non-Volatile Memory (NVM) instead of DRAM…) would tremendously help accelerate the transition from volatile memory to NVM.

I believe that NVM is the holy grail of any IT system, as it would be really disruptive in the way IT systems are architected (no more boot-up time), and I really, really want it to happen as soon as possible.
No, that's insane. First, software is buggy and it's generally good to have the ability to start from a clean slate, every so often. Hardware is also buggy; same conclusion applies. And you want startup & shutdown to be well-tested, which would be less the case if it were intended to be an even rarer event.

Secondly, datacenters essentially have uninterrupted power, giving you as much uptime as you want. And with VMs that you can suspend and resume at will, you don't need persistent memory just to avoid rebooting when you don't want to.

I had in mind more the IoT devices, laptop computers, tablets,
IoT devices either need to stay on 24/7, or are probably slow enough that they could even use existing NV memory, if they need the power savings badly enough.

Laptops and tablets do fine with sleeping. In neither case would you probably want to take the performance hit of using NV memory instead of RAM. However, if it were fast enough, I guess it'd be cool to have a pseudo-hibernation where you page out all of DRAM to NV storage.

desktop computers… that are also « at risk » in case of power failure (you lose everything you were working on).
Autosave has finally been incorporated in most software. And gosh, if you really did suddenly lose power, maybe some of the program's working state would've been corrupted in the process. Did you ever consider that?

Also, for IoT devices, when there is no activity, even for weeks or months, the memory in standby would not consume power (add some power-harvesting technology, and you might have a remote IoT sensor device that could run for months or years).
I'm pretty sure they already manage this, with existing technologies.
 

bit_user

Can someone explain the accelerator on demand thing? Are the chips cheaper to produce with unactivated accelerator cores? How does this work? The cores have to function ahead of time right? So it's a software thing? It's all very weird to me.

Accelerators are usually created for specific use cases ...
Okay description of accelerators, but the slides featured in the article go into excruciating detail regarding exactly which accelerators it has and what they're useful for. As you said, the software needs to be aware of them. Where you're wrong is in saying that any of them are handled by the compiler. They're not. The slides go as far as indicating which Intel libraries software needs to use in order to add support for them.

You completely missed the "on demand" part of the question, which I think is what @gg83 was mainly asking about. To the extent that I think we know how this will work, it's described here:



The slides go into detail about which CPU models support on-demand, as well as how many accelerators of which type they have enabled by default.
 

bit_user

For those that are deep into Comp Sci and professional hard-core devs: Intel just released an updated version of the Optimization and S/W dev manuals. You might find them a useful intro into how to unleash the power of the accelerators. In particular, I recommend Chapter 19 (FP16 instructions), Chapter 20 (AMX and AI - highly recommended for Comp Sci folks who work in or study the AI field) and Chapter 22 (QuickAssist). For some reason they have not updated Chapter 21 (Crypto) - it may not be ready yet...
It helps if you are already pretty comfortable with Intel assembly and the Intel CPU architecture.
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Some of the other manuals are also updated:
Intel® 64 and IA-32 Architectures Software Developer Manuals
Sure, if you like details and want to know how the accelerators really work. But to use them for anything, you really ought to use their recommended libraries:
[Slide: Intel's recommended libraries for the Sapphire Rapids accelerators]

Also mentioned in the slides: Intel Data Mover Library and Intel Query Processing Library.

I've spent a few hours on this so far, just scratching the surface, primarily on Chapters 19 (FP16) and 20 (AMX).

In my view, it is pretty impressive work by Intel!
Some really deep thinking went into this! E.g., how to update your existing neural-net implementations, adjust/update GEMM and other core components of matrix multiplication in a neural net, connect to PyTorch, etc., to get maximum value out of AMX with BF16. And I mean both at the conceptual level and in detail - like complete code at the assembly-instruction level!
You'll often find similar stuff for other AI accelerators. Ultimately, it's a very small number of people who should probably be writing that code. The rest of us can benefit simply by using the frameworks with the proper backend for your hardware.
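
As a sketch of that framework-level path (my own example, not from Intel's slides): the application just asks PyTorch for bfloat16 inference on the CPU, and whether the underlying oneDNN kernels dispatch to AMX on a Sapphire Rapids part is entirely the library's business.

```python
# Minimal sketch: CPU inference in bfloat16 via PyTorch.
# On a Sapphire Rapids Xeon with a recent PyTorch + oneDNN build, the matmul
# kernels may be dispatched to AMX automatically; this code neither knows nor cares.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1000),
).eval()

x = torch.randn(32, 1024)

with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.shape, y.dtype)  # torch.Size([32, 1000]) torch.bfloat16
```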

I'm almost sold. When these are available for the W790 platform, I'm getting a W34xx and building a new system (if I can afford one 😂)!
Or just get a fairly recent, higher-end Nvidia GPU, put it in a regular desktop PC, and it'll run circles around any W3400 CPU. GPU tensor cores have more TFLOPS, and GPUs have more memory bandwidth. Nvidia's 4000-series also has oodles of SRAM to help keep it fed.

all the big ones. Amazon, for instance
Amazon has GPU instances. You don't think they use them, too?

The thing is that using the GPU or a specific AI accelerator adds latency to the whole process (moving data back and forth between the CPU and the GPU/AI accelerator).
Sure, launching a CUDA kernel typically takes about 10 microseconds. A typical AI inference takes several milliseconds, and you typically batch and pipeline them. GPUs have video & image decoding hardware, so you can send the compressed data over PCIe and do the decoding directly on the GPU.
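
A quick sanity check on those orders of magnitude (the batch size and per-batch inference time are hypothetical; the 10 µs launch figure is the ballpark above):

```python
# How much a ~10 µs kernel launch matters against a multi-millisecond,
# batched inference. All numbers are rough/hypothetical, per the text above.
LAUNCH_US = 10      # typical CUDA kernel-launch overhead
INFER_MS = 5        # hypothetical per-batch inference time
BATCH = 32          # hypothetical batch size

print(f"launch overhead: {LAUNCH_US / (INFER_MS * 1000):.2%} of one batched inference")
print(f"per sample: {LAUNCH_US / BATCH:.2f} µs of launch overhead")
```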

The point is that most of the work before and after inference has to be done on the CPU anyway
Not the before part, necessarily. Anyway, whatever the CPU has to do to get the input data and whatever it does with the output data are pretty cheap by comparison. I mean, you might as well argue that we shouldn't use GPUs to render frames in video games, "because the data just comes from the CPU, so why not also render it there?"

Seriously, why do you think Intel made Knights Mill and Ponte Vecchio, or bought Nervana and Habana Labs? They certainly know GPUs and AI accelerators make sense!

AMX is a pretty good matrix multiplication accelerator.
For low-precision data only. Also, it seems to lack support for sparsity, which AI accelerators and GPUs have been optimizing for a couple of generations already.

Right, they use FPGAs (well advertised), BUT they also use Intel Xeons (not as well advertised) - AWS is huge, with tons of customers running many different AI implementations.
AWS is now mostly populated by their own CPUs. And Graviton 3 now supports SVE (ARM's answer to AVX).

Sure, you can still rent Xeon or EPYC instances, but I think Amazon's internal stuff is probably running > 95% on Graviton.

Update:

So, I was curious how its deep learning performance actually compares with GPUs. For the Xeon CPU, I'm using the 8480+, which has a base clock of 2 GHz (and let's not kid ourselves, it should struggle to maintain even that speed if you're saturating the AMX pipelines of all cores). For the GPUs, I'm comparing tensor/matrix compute performance on fp16 or bf16, according to the data I found in: https://www.tomshardware.com/reviews/nvidia-geforce-rtx-4090-review and https://www.tomshardware.com/reviews/amd-radeon-rx-7900-xtx-and-xt-review-shooting-for-the-top

Make     Model                      Tensor TFLOPS (fp16/bf16)
Intel    Xeon 8480+                 115
Nvidia   H100                       1979*
Nvidia   RTX 4090                   661*
Nvidia   RTX 3090                   142
Intel    Data Center GPU Max 1550   839
Intel    A770                       138
AMD      Instinct MI250X            383
AMD      RX 7900 XTX                123
AMD      RX 6950 XT                 47

* I think I recall someone claiming that Nvidia's RTX 4090 spec is for performance on sparse matrices, and that it's up to 2x their performance on dense matrices. In that case, the figure should really be about 331 for an apples-to-apples comparison. That said, sparsity is quite common, which is why they bothered to accelerate it. So, it's not entirely unfair for them to quote the sparse number, if that's in fact what they did.

As I mentioned, memory bandwidth is wildly different. The non-Max Xeons have about 300 GB/s, while the top-end AMD and Nvidia cards are about 1 TB/s. For the Xeon to hold its own on that metric, you'd have to spring for the $13k Xeon Max 9480. For that price you could get about eight RTX 4090s, with an aggregate performance 49 times as high! (Or maybe just 25 times, if we're comparing performance on dense matrices.)
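
Here's the rough arithmetic behind those multipliers, using the table above. The Xeon Max 9480 TFLOPS figure is my own estimate (clock-scaling the 8480+ entry), and the RTX 4090 street price is likewise an assumption, so the exact output depends on those assumptions; it lands in the same ballpark as the figures quoted above.

```python
# Rough price/performance sketch from the table above.
# Assumptions (mine, not from the article): the Xeon Max 9480 sustains the
# same per-core AMX throughput as the 8480+ at a ~1.9 GHz base clock, and
# an RTX 4090 costs roughly $1,600.
xeon_8480p_tflops = 115                                  # bf16 AMX, from the table
xeon_max_9480_tflops = xeon_8480p_tflops * 1.9 / 2.0     # clock-scaled estimate
xeon_max_9480_price = 13_000

rtx_4090_sparse_tflops = 661                             # fp16 tensor, sparse
rtx_4090_dense_tflops = rtx_4090_sparse_tflops / 2       # if the spec is 2x dense
rtx_4090_price = 1_600                                   # assumed street price

n_gpus = xeon_max_9480_price // rtx_4090_price           # ~8 cards for the money
print(f"{n_gpus} x RTX 4090 vs one Xeon Max 9480:")
print(f"  sparse: ~{n_gpus * rtx_4090_sparse_tflops / xeon_max_9480_tflops:.0f}x the TFLOPS")
print(f"  dense:  ~{n_gpus * rtx_4090_dense_tflops / xeon_max_9480_tflops:.0f}x the TFLOPS")
```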

So, no. This doesn't represent the end of GPUs or purpose-built AI accelerators for inferencing or training. Not even close. It even makes sense if you just look at the relative amount of die area used by their respective tensor or matrix multipliers. And that also suggests that CPUs will never match the compute performance of GPUs or purpose-built AI accelerators.
 

JamesJones44

Okay description of accelerators, but the slides featured in the article go into excruciating detail regarding exactly which accelerators it has and what they're useful for. As you said, the software needs to be aware of them. Where you're wrong is in saying that any of them are handled by the compiler. They're not. The slides go as far as indicating which Intel libraries software needs to use in order to add support for them.

You completely missed the "on demand" part of the question, which I think is what @gg83 was mainly asking about. To the extent that I think we know how this will work, it's described here:



The slides go into detail about which CPU models support on-demand, as well as how many accelerators of which type they have enabled by default.

Not every accelerator needs to be addressed by the end application developer. For example, MMX instructions can be substituted in when an application is compiled by an end developer using media API calls (VMs do this as well). So, while the underlying SDKs/VMs must be updated to use the instructions, the upstream developers do not need to change their code to get the benefits. They simply recompile.
 

bit_user

Not every accelerator needs to be addressed by the end application developer. For example, MMX instructions can be substituted in when an application is compiled by an end developer using media API calls
So, that's an example of where a library is implementing the functionality. Intel listed some software packages (including libraries) that have optimizations to use the accelerators.

developers do not need to change their code to get the benefits. They simply recompile.
You don't necessarily even need to recompile. If the upgraded library has the same interface and is implemented as a DLL, you can just upgrade the DLL. In the case of something like a database, you might just need to upgrade that software package.
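
A small illustration of that "swap the library, not the app" point (my own example, not from Intel's material): an application that compresses through a stable interface doesn't change at all if the shared library behind it is later replaced by, say, a hypothetical build that offloads DEFLATE to an accelerator such as IAA.

```python
# The caller binds only to the stable zlib interface; whether the underlying
# shared library is stock zlib, zlib-ng, or (hypothetically) a build that
# offloads to an accelerator, this code is neither changed nor recompiled.
import zlib

data = b"hello " * 10_000
compressed = zlib.compress(data, level=6)
print(len(data), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == data
```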

The slides which detail all of the accelerators take the trouble of listing some software packages (potentially) updated to support them, as well as Intel-provided libraries & SDKs you can use to integrate them into your own code. OS updates (drivers or maybe OS kernel) might also be necessary, in some cases.
 

JamesJones44

So, that's an example of where a library is implementing the functionality. Intel listed some software packages (including libraries) that have optimizations to use the accelerators.


You don't necessarily even need to recompile. If the upgraded library has the same interface and is implemented as a DLL, you can just upgrade the DLL. In the case of something like a database, you might just need to upgrade that software package.

The slides which detail all of the accelerators take the trouble of listing some software packages (potentially) updated to support them, as well as Intel-provided libraries & SDKs you can use to integrate them into your own code. OS updates (drivers or maybe OS kernel) might also be necessary, in some cases.

Right, this is where I was going with my post originally. I dumbed it down and said "compiler", but what I meant was that most software developers don't necessarily have to do anything to benefit from the accelerators. It strongly depends on what the accelerator does and what the developer needs/wants to do with it. If the use case can be generalized, they don't have to do much, if anything. If it can't, then they have to use the functionality explicitly (SSE2 often falls into this camp).
 

bit_user

The SPR talk in here is amazing for this thread. Worth a listen:
...
Regards.
Thanks, but MLiD seems very divisive. And is it really 2.5 hours? If the SPR stuff is from the L1 Techs guy, wouldn't it be easier just to get it straight from the source?

Anyway, I try pretty hard to stay off youtube. But thanks for thinking of us.
 
Thanks, but MLiD seems very divisive. And is it really 2.5 hours? If the SPR stuff is from the L1 Techs guy, wouldn't it be easier just to get it straight from the source?

Anyway, I try pretty hard to stay off youtube. But thanks for thinking of us.
Careful with prejudice. Plus, it's an interview, not a review. Wendell is still working on SPR and getting information from his Micro server loan.

Plus, there are a few sites with reviews out there already; this is just a subjective take on what SPR means beyond the numbers, from Wendell's and Tom's perspective.

Regards.
 

bit_user

Careful with prejudice.
Not prejudice - reputation. You cannot deny that MLiD is one of the most controversial tech tubers out there, today. I've seen more debate over him and his predictions than mere mentions of all others, combined.

Anyway, I already said I stay off youtube, so it's a moot point. But 2.5 hours, sheesh! Sounds like he needs to hire an editor!
 
Not prejudice - reputation. You cannot deny that MLiD is one of the most controversial tech tubers out there, today. I've seen more debate over him and his predictions than mere mentions of all others, combined.

Anyway, I already said I stay off youtube, so it's a moot point. But 2.5 hours, sheesh! Sounds like he needs to hire an editor!
Fair. But you're still using hearsay as fact (or so it sounds). Keep that in mind.

I would invite you to make up your own mind about Tom and his information (generally speaking). In the particular video I linked, there's conjecture and speculation, as would naturally come out of an interview when you ask "what do you think of this or that?"

But again, fair. If you don't want to spend that much time listening to them talk about it, I can't blame you, haha. It's not easy to do. Personally, I put on podcasts and such when I play games that allow me to "mind-split", so it's a nice thing to have in the background, and I pay full attention from time to time when they touch on something I find interesting.

Regards.
 

bit_user

Fair. But you're still using hearsay as fact
No, the hearsay is a fact. No matter what side of it you're on, you cannot deny the controversy. It's literally the only thing I know about MLiD. I do not have time for controversial predictions from anyone. I prefer to use my limited time reading actual news, reviews, and analysis.

That is not a value judgement on you. Feel free to consume content and support whomever you like.

Personally, I put on podcasts and such when I play games
I can't. I listen to news when I'm doing chores. When I'm doing anything else, all I can have on is music, because I usually end up tuning it out. And there are way too many news & newsy podcasts in my queue. It's already overflowing.
 