News Intel Demoes 8-Core, 528-Thread PUMA Chip with 1 TB/s Silicon Photonics

If you are going to build an application-specific (graph analysis) CPU with 66 hardware threads per core, you likely don't want a complex instruction decoder bottlenecking the whole thing and adding a handful of cycles latency penalty to every branch prediction miss and every hardware thread switch.
My guess is that the choice of RISC was driven primarily by the fact that this is a research chip, and therefore they went with the easiest thing to implement that would suit their needs. RISC should also be more area-efficient and energy-efficient. Finally, it sounds like they didn't need a lot of the functionality of x86, particularly the vector instructions and bulky vector registers that come with them.

66 threads is an odd number to settle on. Must have barely missed mandatory performance targets with 64.
Oops. As this slide clearly shows, they got to 66 threads because each core has 4x 16-thread pipelines + 2x single-thread pipelines.

[slide image]


Because Intel's fabs are still at 10nm, while TSMC is at 5nm.
I find it's usually worth the time to click through the slides. As you can see, it says TSMC 7nm.

[slide image]


Interesting that they called it PUMA, because Intel already had a product called Puma: DOCSIS chips for cable modems. They sold that business unit in 2020 for $150M.
Short memories at Intel?
Given that this is a research project, it probably won't have gone through the standard product naming process. Considering that cable modem chips would be developed in a different business unit, yeah it's totally unsurprising there's a name conflict. Different namespaces, though.
 
So what OS/software can this custom chip run? Also, is Intel signaling the end of x86 and the dawn of some new RISC-based architecture?
I'm sure it's a custom software stack, like most embedded ASICs and accelerators. The fact that they note it has PCIe x8 "for communication with the host" implies there's one or more host CPUs running the actual OS.

There are plenty of Linux variants out there that can run on RISC-V and other RISC architectures. My guess would be a fork of one of those, with an updated kernel to handle the custom instructions.
Running a Linux kernel on this thing would make about as much sense as running it on your GPU. In other words: none.

This is made to be a special-purpose, programmable compute accelerator - not a general-purpose CPU.

[slide image]


Also, I think this slide deserves to be highlighted. It provides the motivation for a custom microarchitecture, rather than just instantiating existing, industry-standard CPU cores:

[slide image]


What immediately jumped out at me, about their approach, is how GPU-like it is. They used extreme SMT for latency-hiding. Another feature it shares with GPUs is direct-addressed, on-chip SRAM. BTW, 32 registers per thread * 8 bytes per register * 66 threads = 16.5 kiB per core. If the cell size is similar to that of their scratchpad RAM, then it's not bad.
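
Spelled out, in case anyone wants to sanity-check that register-file math (the 32 regs/thread and 66 threads/core come from the slides; the 8-byte register size is my assumption):

```python
# Back-of-the-envelope register-file size per PIUMA core.
regs_per_thread = 32
bytes_per_reg = 8          # assuming 64-bit registers
threads_per_core = 66      # 4x 16-thread pipelines + 2x single-thread pipelines

reg_file_bytes = regs_per_thread * bytes_per_reg * threads_per_core
print(f"{reg_file_bytes} bytes = {reg_file_bytes / 1024:.1f} KiB per core")
# -> 16896 bytes = 16.5 KiB per core
```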

BTW, that direct-addressed SRAM probably means there's no context-switching of these SMT threads. Load-balancing probably happens at a higher level.
 
I'm not sure what all this chatter is about the death of x86. It's likely this will be used in x86 systems.

This is a 1Tb/s NIC that can communicate with really dense multipoint connections. It has accelerators on board because the performance targets couldn't be hit if the CPU had to do the traffic management.
 
This is a 1Tb/s NIC that can communicate with really dense multipoint connections. It has accelerators on board because the performance targets couldn't be hit if the CPU had to do the traffic management.
What??? Not at all.

This is the compute accelerator! It's not a general-purpose NIC, and they do mean 1 Terabyte/s, not 1 Terabit/s. All of this is right in the article!

"The eight-core chip features 32 optical I/O ports that operate at 32 GB/s/dir apiece, thus totaling 1TB/s of total bandwidth. The chips drop into an eight-socket OCP server sled, offering up to 16 TB/s of total optical throughput for the system"

16 TB/s = 128 Tb/s. Furthermore, it's not even the chip's cores that are doing the routing - that's handled in the router blocks.

Furthermore, with just a PCIe 4.0 x8 host interface, this would make a lousy NIC. That host connection is only 16 GB/s, which is 1/64th of the aggregate link bandwidth it supports and only half the peak bandwidth of a single link!

What strikes me as a little weird is how mismatched its interconnect bandwidth is vs. memory bandwidth. However, maybe that's the type of thing they'd address in a production version.
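
If anyone wants to check those ratios, here's the arithmetic using only the figures quoted from the article (the 1/64th and half-a-link numbers fall straight out):

```python
# Link-bandwidth ratios from the article's figures (per direction unless noted).
ports = 32
port_bw_GBs = 32                      # GB/s per direction per optical port
chip_bw_GBs = ports * port_bw_GBs     # 1024 GB/s ~ 1 TB/s per chip

pcie4_x8_GBs = 16                     # PCIe 4.0 x8 host link, ~16 GB/s per direction

print(chip_bw_GBs / pcie4_x8_GBs)     # 64.0 -> host link is 1/64th of the aggregate optical bandwidth
print(pcie4_x8_GBs / port_bw_GBs)     # 0.5  -> half the bandwidth of a single optical port
print(16 * 8)                         # 128  -> the sled's 16 TB/s expressed in Tb/s
```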
 
And how close is 200Gb/s to 1Tb/s?
This tech will connect multiple racks with each rack having multiple sockets.
Having 400ns latency, basically from one "computer" to the next, is pretty good.
[slide image]
Between dies in a single package. The entire point of this chip is to demo interconnects between packages in different racks.
It makes about as much sense as complaining that a 100m 10Gb/s ethernet connection is unimpressive because DDR5 is 50GB/s.
Thanks for the clarification. I still think the tech specs aren't super useful though. It really doesn't exceed PCIe or USB 3.2 bandwidth or decrease latency between different machines YET.

EDIT: I did my math wrong and used bits instead of bytes. So I'm off on the above by an order of magnitude.

So what's the use case? Simulated neural nets? When else do you need THAT MANY CPUs connected to each other instead of doing specific tasks on distributed computing?
 
What??? Not at all.

This is the compute accelerator! It's not a general-purpose NIC, and they do mean 1 Terabyte/s, not 1 Terabit/s. All of this is right in the article!
"The eight-core chip features 32 optical I/O ports that operate at 32 GB/s/dir apiece, thus totaling 1TB/s of total bandwidth. The chips drop into an eight-socket OCP server sled, offering up to 16 TB/s of total optical throughput for the system"​

16 TB/s = 128 Tb/s. Furthermore, it's not even the chip's cores that are doing the routing - that's handled in the router blocks.

Furthermore, with just a PCIe 4.0 x8 host interface, this would make a lousy NIC. That host connection is only 16 GB/s, which is 1/64th of the aggregate link bandwidth it supports and only half the peak bandwidth of a single link!

What strikes me as a little weird is how mismatched its interconnect bandwidth is vs. memory bandwidth. However, maybe that's the type of thing they'd address in a production version.
That's kind of what I was getting at. The numbers they're throwing around aren't special. But maybe the fact that it works at all, while competitive in performance with current tech, is the accomplishment?
 
So what's the use case? Simulated neural nets? When else do you need THAT MANY CPUs connected to each other instead of doing specific tasks on distributed computing?
The paper says it all: PB-scale sparse graph analysis. This is Intel's version of PB-scale distributed, meshed compute-near-memory: it avoids having to haul all of the data from whichever memory node hosts a piece of the graph back to the CPU all of the time, and they claim 1000x better power efficiency doing it that way.

You don't know when a graph or combination of graphs may nuke any given node with multiple accesses, so you need each node to have enough local compute to deliver adequate worst-case performance.
 
The numbers they're throwing around aren't special.
Excuse me, but how can you call 1 TB/s of link bandwidth per chip "not special"? Especially for a 75 W chip??

By comparison, the aggregate NVLink bandwidth of Nvidia's 700 W H100 is only 1.8 TB/s. Nvidia's product literature claims 3.6 TB/s, but they're counting each direction, in which case Intel's PIUMA would be 2 TB/s - still a 1.8x ratio. More importantly, with Nvidia, you have to factor in the NVSwitch, which adds cost and power, and occupies board space. Intel's chip is switchless, as each chip integrates switching and routing capability.
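
Rough numbers, using the figures above (treat the NVLink values as my approximation, not a spec sheet):

```python
# Per-chip link bandwidth vs. power, both directions counted.
piuma_TBs, piuma_W = 2.0, 75      # figures from the article / this thread
h100_TBs,  h100_W  = 3.6, 700     # my approximation of H100 NVLink aggregate

print(h100_TBs / piuma_TBs)                          # ~1.8x -> H100's raw link-bandwidth advantage
print((piuma_TBs / piuma_W) / (h100_TBs / h100_W))   # ~5.2x -> PIUMA's bandwidth-per-watt advantage
```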

But maybe the fact that it works at all, while competitive in performance with current tech, is the accomplishment?
It's a research chip. It was fabricated to demonstrate feasibility of the techniques, designs, and technologies they're investigating. It's made on TSMC N7 (which is actually a much more recent node than academic research chips typically use). It is not supposed to be directly competitive with cutting-edge products, at least outside the narrow range of its application area.
 
It's a research chip. It was fabricated to demonstrate feasibility of the techniques, designs, and technologies they're investigating...
I agree, it's a cool research chip.
Excuse me, but how can you call 1 TB/s of link bandwidth per chip "not special"? Especially for a 75 W chip??

By comparison, the aggregate NVLink bandwidth of Nvidia's 700 W H100 is only 1.8 TB/s...
Sorry, my math was bad. I read "GB/s" for PCIe bandwidth and thought that even PCIe 4.0 easily got that on 60 lanes. Considering that Threadripper exposed 128 PCIe 4.0 lanes, it eclipsed 1TB/s easily. But it's 1Tb/s for Threadripper...or an order of magnitude less bandwidth.

I'm just repeatedly not making any sense today.
 
The paper says it all: PB-scale sparse graph analysis. This is Intel's version of PB-scale distributed, meshed compute-near-memory: it avoids having to haul all of the data from whichever memory node hosts a piece of the graph back to the CPU all of the time, and they claim 1000x better power efficiency doing it that way.

You don't know when a graph or combination of graphs may nuke any given node with multiple accesses, so you need each node to have enough local compute to deliver adequate worst-case performance.
Okay, that's pretty cool. I need to start sleeping more before reading anything technical. I should have been more impressed.
 
What??? Not at all.

This is the compute accelerator! It's not a general-purpose NIC, and they do mean 1 Terabyte/s, not 1 Terabit/s. All of this is right in the article!
"The eight-core chip features 32 optical I/O ports that operate at 32 GB/s/dir apiece, thus totaling 1TB/s of total bandwidth. The chips drop into an eight-socket OCP server sled, offering up to 16 TB/s of total optical throughput for the system"​

16 TB/s = 128 Tb/s. Furthermore, it's not even the chip's cores that are doing the routing - that's handled in the router blocks.

Furthermore, with just a PCIe 4.0 x8 host interface, this would make a lousy NIC. That host connection is only 16 GB/s, which is 1/64th of the aggregate link bandwidth it supports and only half the peak bandwidth of a single link!

What strikes me as a little weird is how mismatched its interconnect bandwidth is vs. memory bandwidth. However, maybe that's the type of thing they'd address in a production version.
I took the article to indicate this was a specific accelerator and not a general purpose processor. Meaning it would plug in to a host with a (potentially x86) CPU and more traditional OS for management.

Is that not the way you see it?
 
Sorry, my math was bad. I read "GB/s" for PCIe bandwidth and thought that even PCIe 4.0 easily got that on 60 lanes. Considering that Threadripper exposed 128 PCIe 4.0 lanes, it eclipsed 1TB/s easily. But it's 1Tb/s for Threadripper...or an order of magnitude less bandwidth.
PCIe 4.0 is roughly 16 Gbps per lane, meaning you're talking about 2 Tbps or 256 GB/s for x128 lanes.

Actually being able to sustain that is a different matter. At DDR4-3200, an 8-channel configuration only gets you a nominal throughput of 204.8 GB/s (simplex). Even if a good amount of that PCIe traffic were device <-> device, we don't know the throughput limits of the switch integrated into the IO Die, but I think it's a good bet it can't run any-to-any traffic patterns at the max theoretical aggregate rate of the links.

In contrast, I do expect the switch network in Intel's chip can probably run all of the links at close to their limit. According to the stats given in the article, they dominate the die area - not the cores!
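
For the lane math, spelled out (nominal per-direction numbers, my rounding):

```python
# PCIe 4.0 x128 vs. 8-channel DDR4-3200, nominal per-direction throughput.
pcie4_lane_Gbps = 16
lanes = 128
print(pcie4_lane_Gbps * lanes)        # 2048 Gb/s ~ 2 Tb/s
print(pcie4_lane_Gbps * lanes / 8)    # ~256 GB/s

ddr4_channel_GBs = 3200e6 * 8 / 1e9   # 25.6 GB/s per 64-bit channel
print(ddr4_channel_GBs * 8)           # 204.8 GB/s for 8 channels
```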

I'm just repeatedly not making any sense today.
We all have those days.
 
I swear, if they use something like this to make SkyNet...
What I think is likely is that we'll see this optical mesh interconnect technology surface in future Intel products. From the article:

"the design could eventually enable systems with two million cores to be directly connected with under 400ns latency."

At 8 cores per chip, that's 256k chips. Imagine if you instead had a mesh of 256k of H100-caliber GPUs!
 
I took the article to indicate this was a specific accelerator and not a general purpose processor. Meaning it would plug in to a host with a (potentially x86) CPU and more traditional OS for management.

Is that not the way you see it?
As @InvalidError pointed out, it's for Petabyte-scale graph analytics, where both the compute and memory are distributed. Each of these chips has 528 threads for running graph algorithms + the routing hardware capable of performing (relatively) low-latency data accesses from remote nodes (as well as bridging requests from peer nodes).

The entire mesh would function like a monolithic (potentially) multi-rack, programmable accelerator - not dissimilar to Nvidia's NVLink-based multi-GPU systems. The cool thing about meshes is that they scale pretty well, which is probably why Intel seems to like them so much.

The thing I really took issue with is your calling it a "NIC" - Network Interface Card. It's really not. The chip is a mesh processing node, and includes all the compute, memory, and communication facilities that would imply. It's clearly not designed primarily to do communication on behalf of the host. It does network communication, but as an intrinsic part of its computation - the "Unified Memory Architecture" part of its name is a nod to the fact that its communication is very specialized.
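
To make the compute-near-memory idea a bit more concrete, here's a toy sketch of how a graph kernel might be partitioned across such a mesh. Everything in it (the node count, the owner() hash, the remote_fetch call) is hypothetical illustration on my part, not Intel's actual programming model:

```python
# Toy owner-computes partitioning of a sparse graph across mesh nodes.
# All names and numbers are made up for illustration.
MESH_NODES = 256 * 1024                 # hypothetical mesh size

def owner(vertex_id: int) -> int:
    """Which mesh node holds this vertex's adjacency list (simple hash partition)."""
    return vertex_id % MESH_NODES

def visit(my_node: int, vertex_id: int, local_adjacency: dict, remote_fetch):
    """One step of a graph kernel: local data is read directly, remote data goes over the mesh."""
    if owner(vertex_id) == my_node:
        return local_adjacency[vertex_id]               # cheap local memory access
    return remote_fetch(owner(vertex_id), vertex_id)    # one (relatively) low-latency mesh access
```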
 
What I think is likely is that we'll see this optical mesh interconnect technology surface in future Intel products. From the article:

"the design could eventually enable systems with two million cores to be directly connected with under 400ns latency."

At 8 cores per chip, that's 256k chips. Imagine if you instead had a mesh of 256k of H100-caliber GPUs!

IDK, that sure does sound a lot like Skynet. Some sort of massive AI to rule over us all.
 