News Intel Puts 10nm Ice Lake CPUs on an M.2 Stick, Meet the Nervana NNP-I Accelerator

Holy multicoring Batman!
Does anybody know how fast or slow M.2/PCIe is compared to Infinity Fabric?
Not that it matters for AI or the data center in general, but if this trickles down to consumer products, now that would be some big.LITTLE we could get behind.
 
Holy multicoring Batman!
Does anybody know how fast or slow M.2/PCIe is compared to Infinity Fabric?
Not that it matters for AI or the data center in general, but if this trickles down to consumer products, now that would be some big.LITTLE we could get behind.
It's designed as an add-on A.I. accelerator for servers. Intel used Ice Lake CPUs instead of the usual ARM-based CPUs (such as those used in Intel's FPGAs). M.2 slots are common on many motherboards, including mainstream desktop ones. Have you seen any standardized slots for Infinity Fabric?
 
How do standardized slots affect speed or latency?
I just asked how much slower it would be, compared to IF, to cross-communicate between CPUs through PCIe/M.2.
The problem is that currently Infinity Fabric is only used by AMD internally for their own chips and can be considered proprietary. The PCI Express used on the M.2 slot is standardized and the most common. If you are talking about communication between Intel's Ice Lake CPUs and the NNP-I, then that is the fastest path (likely through the ring/cache interconnect) because it's all internal to a single chip. If you are talking about communication between the server CPU (either Intel Xeon, AMD Opteron or AMD Epyc) and the add-on A.I. accelerator module, then of course that would be much slower, with higher latency and less bandwidth.
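For a rough sense of scale, here is the back-of-envelope arithmetic for what the M.2 slot itself can carry. This is only the standard PCIe math, not anything measured on the NNP-I, and I'm deliberately not putting a hard number on Infinity Fabric since its link speed varies by product and generation.

```python
# Theoretical PCIe bandwidth per direction, ignoring protocol overhead.
# Standard figures: PCIe 3.0 = 8 GT/s per lane with 128b/130b encoding,
# PCIe 2.0 = 5 GT/s per lane with 8b/10b encoding.
ENCODING = {"3.0": 128 / 130, "2.0": 8 / 10}   # line-code efficiency
GT_PER_S = {"3.0": 8.0, "2.0": 5.0}            # transfers per second per lane

def pcie_gbytes_per_s(gen: str, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s for a given generation and lane count."""
    return GT_PER_S[gen] * ENCODING[gen] * lanes / 8  # bits -> bytes

print(f"PCIe 3.0 x4 (typical M.2 slot): {pcie_gbytes_per_s('3.0', 4):.2f} GB/s")
print(f"PCIe 2.0 x4:                    {pcie_gbytes_per_s('2.0', 4):.2f} GB/s")
# On-package links like Infinity Fabric or a ring/mesh interconnect are assumed
# here to sit well above this, and more importantly at far lower latency.
```

So roughly 3.9 GB/s each way for a PCIe 3.0 x4 M.2 slot; against an internal fabric, the latency difference is probably the bigger factor than raw bandwidth.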
 
There is an NNP-I article from March 17 on ServeTheHome that describes boards with up to 12 M.2 slots. There is also an article from March 16 with more info on the NNP-L1000 learning chip.
 
The problem is that currently Infinity Fabric is only used by AMD internally for their own chips and can be considered proprietary. The PCI Express used on the M.2 slot is standardized and the most common. If you are talking about communication between Intel's Ice Lake CPUs and the NNP-I, then that is the fastest path (likely through the ring/cache interconnect) because it's all internal to a single chip. If you are talking about communication between the server CPU (either Intel Xeon, AMD Opteron or AMD Epyc) and the add-on A.I. accelerator module, then of course that would be much slower, with higher latency and less bandwidth.

Intel has their own interconnect they would use either way, OmniPath.

This also might be a way to test it out and see how it does before moving on to their Foveros ideas. I could imagine stacking an NNP into a chip would be beneficial.
 
I could imagine stacking an NNP into a chip would be beneficial.
Depends on what for.
The NNP is just the iGPU modified to be better at AI, so most chips would just need these changes to their existing iGPU.
The additional cores are, I believe, just a by-product; for them it's just cheaper to use existing full CPUs than to figure out how to make an iGPU an external, independent thing.
 
Intel has their own interconnect they would use either way, OmniPath.

This also might be a way to test it out and see how it does before moving on to their Foveros ideas. I could imagine stacking an NNP into a chip would be beneficial.
OmniPath is usually used for rack-to-rack interconnects in servers (much like Ethernet). Within the chip itself, however, either the ring interconnect or the mesh interconnect would usually be used.

The NNP is just the iGPU modified to be better at AI, so most chips would just need these changes to their existing iGPU.
The additional cores are, I believe, just a by-product; for them it's just cheaper to use existing full CPUs than to figure out how to make an iGPU an external, independent thing.
The NNP-I is not a modified integrated GPU but rather a dedicated A.I. accelerator architecture. Intel's inclusion of Ice Lake cores removes the need to pay licensing fees for using ARM-based cores.
 
if this trickles down to consumer products, now that would be some big.LITTLE we could get behind.
If you have an Nvidia RTX card, then you already have something on par with it. We'll have to see how the specs ultimately match up, but their statement about it being "multiple orders" faster than a CPU or GPU is almost certainly excluding GPUs with Tensor cores.
 
Intel has their own interconnect they would use either way, OmniPath.
No. OmniPath is for connecting multiple boxes (or blades) - not chips that are this tightly integrated. And they already said that PCIe is used for communication between this board and the host.

The Intel equivalent of Infinity Fabric is not OmniPath - it's UPI. In future, perhaps that will be replaced by CXL.
 
What I want to know is which of those dies is which? I assume the bigger one is the Nervana chip?

And how much RAM do these blades have? One reason GPUs excel at AI is the oodles of really fast RAM they've got. As models are often hundreds of MB, I'm skeptical that Nervana is going to have enough on-chip storage for bigger ones.
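To put rough numbers on "hundreds of MB" - these are just commonly cited parameter counts for a few well-known networks, nothing specific to Nervana, and the per-weight size is the usual FP32/FP16 assumption:

```python
# Rough model-weight footprint: parameter count x bytes per weight.
# Parameter counts are approximate, commonly cited figures.
def model_mbytes(params_millions: float, bytes_per_weight: int = 4) -> float:
    """Weight storage in MB (4 bytes/weight = FP32, 2 = FP16)."""
    return params_millions * 1e6 * bytes_per_weight / 1e6

for name, params_m in [("ResNet-50", 25.6), ("VGG-16", 138.0), ("BERT-base", 110.0)]:
    print(f"{name:10s} ~{model_mbytes(params_m):4.0f} MB FP32, "
          f"~{model_mbytes(params_m, 2):4.0f} MB FP16")
```

Even at FP16 or INT8 that's tens to hundreds of MB of weights, so either there's a decent amount of DRAM on these blades or the models get streamed over that PCIe link.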

BTW, I had a good chuckle at that slide where they have a down-arrow labelled "NN Performance" that's meant to show increasing NN performance toward the bottom of the pyramid. Pretty poor graphic, there.
 
So whatever it does will be constrained by 32 Gbps PCI-e 3.0 x4 lanes of bandwidth...
Okay, so think about image classification - 4 GB/sec is a heck of a lot of images! And yes, I'm talking about compressed images - decompressing them would probably be the job of the Ice Lake CPU. That would scale a lot better than having the host CPU try to do all the decoding.
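Quick back-of-envelope on that; the ~100 KB per compressed image is my own assumption for a typical JPEG, not anything from Intel:

```python
# How many compressed images fit through a PCIe 3.0 x4 link per second.
PCIE3_X4_BYTES_PER_S = 3.94e9   # ~theoretical PCIe 3.0 x4, one direction
AVG_JPEG_BYTES = 100e3          # assumed average compressed image size

images_per_second = PCIE3_X4_BYTES_PER_S / AVG_JPEG_BYTES
print(f"~{images_per_second:,.0f} compressed images/s through the link")
```

That's on the order of 40,000 images per second, far more than the inferencing side is likely to classify, so the link itself shouldn't be the bottleneck for this kind of workload.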

Now, that brings up one thing they might not have considered, which is video processing. If they removed the iGPU, then they probably also lost the hardware video decoder, which is critical for being able to analyze video - something Nvidia knows quite well. Nvidia substantially beefed up the decoder hardware in their Tesla T4 GPU relative to previous generations, presumably to match the increased inferencing horsepower provided by its tensor cores.