News 432-Core Chiplet-Based RISC-V Chip Nearly Ready to Blast Into Space

bit_user

Polypheme
Ambassador
the dual-tile CPU can deliver 0.75 FP64 TFLOPS
In absolute terms, that's really not impressive, especially if it includes dedicated matrix-multiply hardware.
"The top-bin Epyc 9654 part comes in a 320-400 watt TDP and provides 5.38 teraflops of peak double-precision performance running at max boost frequency of 3.5 GHz"​

So, it can only win on FLOPS/W or FLOPS/mm^2. And even on those terms, I don't expect it will hold a candle to HPC GPUs.

The tiles are made by GlobalFoundries using its 14LPP fabrication process.
Hmmm... I guess we should compare it to 1st-gen EPYC, then. Data on that is much harder to find, but this paper measured 1.03 TFLOPS on a dual 32-core Naples system (in contrast, I think the above figures are theoretical).

What might make all the difference is fault-tolerance. Depending on how they handle that, it could further explain the performance delta.

I'd love to know more about the actual cores they used.
 

InvalidError

Titan
Moderator
"An unknown number of FPUs"

There aren't too many options: 0.75 TFLOPS / 432 total cores / 1 FLOP/cycle ≈ 1.74 GHz.

Looks like the CPU has only one FMA64 unit per core, assuming it runs at 1.8-2 GHz with near-100% occupancy, or two if it runs slower or at a much lower unit utilization rate.
 

kjfatl

Reputable
Apr 15, 2020
To some, this might not seem impressive, but it is intended for a space-based application where it must be radiation-hardened. Cosmic rays do nasty things to standard electronics.
 

InvalidError

Titan
Moderator
To some, this might not seem impressive, but it is intended for a space-based application where it must be radiation-hardened. Cosmic rays do nasty things to standard electronics.
I doubt that a 432-core CPU will be used to run any operations-critical equipment. Non-essential stuff where occasional crashes and errors are only a minor inconvenience doesn't need to be radiation-hardened.

Also, to cram 216 cores into 73 mm², they are almost certainly using high-density libraries instead of radiation-hardened ones.
 

bit_user

Polypheme
Ambassador
There aren't too many options: 0.75 TFLOPS / 432 total cores / 1 FLOP/cycle ≈ 1.74 GHz.

Looks like the CPU has only one FMA64 unit per core, assuming it runs at 1.8-2 GHz with near-100% occupancy, or two if it runs slower or at a much lower unit utilization rate.
FMA is usually counted as 2 ops. Then again, its throughput could be 0.5 per cycle, still resulting in 1 FLOP/cycle.
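Just to put numbers on it, here's a quick back-of-envelope sketch in C (purely illustrative; the 0.75 TFLOPS and 432-core figures are from the article, and the 1-vs-2 FMA-units-per-core cases are just the assumptions being tested):

```c
#include <stdio.h>

/* Peak FP64 model: FLOPS = cores * FMA_units_per_core * 2 FLOPs/FMA * clock.
 * Solve for the clock frequency implied by the article's 0.75 TFLOPS figure. */
int main(void)
{
    const double peak_flops = 0.75e12;  /* FP64 FLOPS quoted for the dual-tile chip */
    const int    cores      = 432;

    for (int fma_units = 1; fma_units <= 2; fma_units++) {
        double flops_per_core_cycle = fma_units * 2.0;  /* FMA counted as 2 ops */
        double clock_ghz = peak_flops / (cores * flops_per_core_cycle) / 1e9;
        printf("%d FMA64 unit(s)/core -> implied clock ~%.2f GHz\n",
               fma_units, clock_ghz);
    }
    return 0;
}
```

With FMA counted as 2 ops, one unit per core implies roughly 0.87 GHz; a half-rate FMA (1 FLOP/cycle net) at ~1.74 GHz gives the same total.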
 

kjfatl

Reputable
Apr 15, 2020
188
131
4,760
I doubt that a 432-core CPU will be used to run any operations-critical equipment. Non-essential stuff where occasional crashes and errors are only a minor inconvenience doesn't need to be radiation-hardened.

Also, to cram 216 cores into 73 mm², they are almost certainly using high-density libraries instead of radiation-hardened ones.
I agree that it is highly unlikely that this part will be used in a critical system. I'm also making the assumption that some level of radiation hardening is needed to keep from getting latch-up conditions that would damage the silicon. It's probably done in a library that has some level of increased radiation hardening. A few glitches in an image are no big deal. Gates that turn into SCRs are a big deal.

It will be interesting to see if the ESA publishes this sort of detail.
 

bit_user

Polypheme
Ambassador
Speaking of radiation hardening, I thought it was fascinating that NASA's Ingenuity drone used a standard Snapdragon 801 phone SoC.

I think that's because all it had to do was demonstrate that controlled flight was possible, and its core mission would be considered successful. Furthermore, getting enough compute power into a radiation-hardened package for autonomous flight might've been a challenge.

Meanwhile, the Perseverance rover was plodding along with a ~20-year-old 133 MHz PowerPC CPU.

BTW, the panoramic photos Perseverance captured are of stunning resolution and clarity. I cropped and scaled one down to use as wallpaper, for my multi-monitor setup. It's not the most picturesque landscape, but it really gives the feeling of what it'd be like to stand on Mars.
 

Steve Nord_

Prominent
Nov 7, 2022
In space, no one can hear your screaming overclocked APU serve 130 shards of conference? Which of der8auer's cats is going? Is it going to juggle a bunch of neodymium magnets around Mars and let it accumulate an atmosphere again?
 
May 10, 2023
I am part of the team that designed Occamy. While we are very happy that our project has gotten some attention, the article (and the HPCWire article that started this) is not accurate and has misrepresented several parts. We wrote a short article explaining the background and what Occamy is and isn't:
https://pulp-platform.org/occamy/

We would love to send our designs into space, and we hope we will also get there at some point, but Occamy is not designed for space or in collaboration with ESA, it is not a product, and we do not think it can be passively cooled when running at full speed.
 

bit_user

Polypheme
Ambassador
I am part of the team that designed Occamy. While we are very happy that our project has gotten some attention, the article (and the HPCWire article that started this) is not accurate and has misrepresented several parts. We wrote a short article explaining the background and what Occamy is and isn't:
Thanks!

This part jumped out at me:

"Each chiplet has a private 16GB high-bandwidth memory (HBM2e) and can communicate with a neighboring chiplet over a 19.5 GB/s wide, source-synchronous technology-independent die-to-die DDR link."
So, I'm guessing you get like 1 TB/s of HBM2e bandwidth, but you're limited to like 1/50th of that for die-to-die communication? So, then, are they even cache-coherent? Or is the idea basically "cluster on a chip"?

It looks like each group of cores has ScratchPad Memory ("SPM"). How fast do they access it? I'm guessing it's entirely software-managed? Are accesses to it cache-coherent between the cores in the group? The architecture seems designed for very tight collaboration of cores in a group, so I'm guessing you want a more efficient way for them to communicate than relying on whatever level of the cache hierarchy they have in common.

The pairing of 8 SIMD cores with one communication & coordination core seems reminiscent of IBM's Cell, BTW. Was there any thought given to that, or you basically just ended up at the same place?

Also, what's ZMem?
 
May 10, 2023
So, I'm guessing you get like 1 TB/s of HBM2e bandwidth, but you're limited to like 1/50th of that for die-to-die communication? So, then, are they even cache-coherent? Or is the idea basically "cluster on a chip"?
There are no (data) caches in Occamy; all local memories are scratchpads, and data transfer is managed through DMA transfers. The plan was to use a different die-to-die interface (with more BW), but the timeline for that IP was not compatible with the tape-out date, so we had to use a backup solution that would allow us to demonstrate the principle but, as you point out, does not have the BW on this chip. Once we have Occamy working, we think it would be relatively easy to show how the measured performance could scale with a higher-BW interface.

It looks like each group of cores has ScratchPad Memory ("SPM"). How fast do they access it? I'm guessing it's entirely software-managed? Are accesses to it cache-coherent between the cores in the group? The architecture seems designed for very tight collaboration of cores in a group, so I'm guessing you want a more efficient way for them to communicate than relying on whatever level of the cache hierarchy they have in common.
Cores in a cluster basically have single-cycle access to the SPM. As you point out, SPM content is entirely SW-managed; no caches involved. This is designed for data-centric applications, not general-purpose processing, and follows a design we have been using successfully for a number of years. At the IoT level, GreenWaves has commercialized processors with a single cluster (GAP8, GAP9).
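To make the programming model concrete, here is a rough, hypothetical C sketch of the usual double-buffered DMA-into-SPM pattern (the dma_start/dma_wait names, the TILE size, and the buffer placement are made-up placeholders for illustration, not the actual Occamy API):

```c
#include <stddef.h>

#define TILE 1024  /* elements per tile; made-up size */

/* Placeholder declarations: a real runtime would provide the SPM allocation
 * and the DMA engine interface. */
extern double spm_buf[2][TILE];                            /* lives in the cluster SPM */
void dma_start(void *dst, const void *src, size_t bytes);  /* asynchronous copy */
void dma_wait(void);                                       /* wait for the last copy */
void compute_tile(double *tile, size_t n);                 /* runs on the compute cores */

/* While the compute cores chew on tile i (single-cycle SPM accesses),
 * the DMA engine streams tile i+1 in from main (HBM) memory. */
void process(const double *hbm_src, size_t n_tiles)
{
    dma_start(spm_buf[0], hbm_src, TILE * sizeof(double));
    for (size_t i = 0; i < n_tiles; i++) {
        dma_wait();                                   /* tile i is now in the SPM */
        if (i + 1 < n_tiles)
            dma_start(spm_buf[(i + 1) & 1],
                      hbm_src + (i + 1) * TILE,
                      TILE * sizeof(double));         /* prefetch the next tile */
        compute_tile(spm_buf[i & 1], TILE);
    }
}
```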

The pairing of 8 SIMD cores with one communication & coordination core seems reminiscent of IBM's Cell, BTW. Was there any thought given to that, or you basically just ended up at the same place?
Technically, Cell uses 6 cores for computing, 1 is a reserve, and 1 is for the OS. We use 8 compute cores in a cluster, and a ninth one that orchestrates the memory transfers in parallel.

Also, what's ZMem?
The Zero memory: a small HW trick to have a physical memory respond to array transfers that would be all zeroes (instead of copying data that was initialized to zeroes from main memory). I know it sounds silly, but it comes in handy at times, and does not cost that much.
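If that helps, the effect is roughly this (hypothetical sketch; the ZMEM address and the dma_start call are made up for illustration, not the real memory map or API):

```c
#include <stddef.h>

/* Hypothetical illustration: any read from the ZMEM region returns zero, so a
 * DMA whose *source* is ZMEM zero-fills the destination without spending
 * main-memory bandwidth on a buffer that only ever contained zeroes. */
#define ZMEM_BASE ((const void *)0x10000000u)              /* made-up address */

void dma_start(void *dst, const void *src, size_t bytes);  /* placeholder API */

static inline void spm_zero_fill(void *dst, size_t bytes)
{
    /* instead of dma_start(dst, zeroed_dram_buffer, bytes) ... */
    dma_start(dst, ZMEM_BASE, bytes);
}
```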
 

bit_user

Polypheme
Ambassador
There are no (data) caches in Occamy; all local memories are scratchpads, and data transfer is managed through DMA transfers.
Wow, so do cores each have a private scratchpad memory, or are they sharing it with the 8 other cores in their group?

I can appreciate the efficiency benefits of avoiding cache lookups and coherency, but a cacheless architecture is going to create a large burden for porting legacy software. I guess you should be able to write a decent compiler for languages like OpenCL, though.

I've long thought the best approach might be to have software-managed caches with (on-demand) hardware accelerated lookups and coherency. In other words, you have scratchpad memory, but also a CAM and maybe some transactional extensions along the lines of what Intel or ARM has. That way, you only pay the energy and latency costs of a cache when you actually need to, and otherwise you get the efficiency of a dedicated scratchpad memory.

Cores in a cluster basically have single-cycle access to the SPM.
Wow. When I programmed a custom RISC core, in an ASIC my old company made (20 years ago), the read latency of our scratchpad memory was like 8 cycles. The core was no slouch, being designed by a few former architects and designers of DEC Alpha, so it's quite impressive you got it down to single-cycle!

Technically, Cell uses 6 cores for computing, 1 is a reserve, and 1 is for the OS. We use 8 compute cores in a cluster, and a ninth one that orchestrates the memory transfers in parallel.
I think that was true of how they used it in the PlayStation 3, except one was disabled for yield and the 8th was devoted to DRM. The SPEs would be pretty bad at "OS" stuff, because they're in-order and rely on DMA to get data in/out of their scratchpad memory, like your CPU. The PPE runs at a lower clock speed, but has a data cache and can execute out-of-order - I'm sure that's where the OS runs.

So, the actual chip had 8 SPEs and one PPE, and I think IBM sold some accelerator boards which had all 8 SPEs enabled.

The Zero memory: a small HW trick to have a physical memory respond to array transfers that would be all zeroes (instead of copying data that was initialized to zeroes from main memory). I know it sounds silly, but it comes in handy at times, and does not cost that much.
I've heard Apple does that, in their SoCs. I know Linux has fault-based page allocation and I think new pages are zero-initialized. So, it would be useful there.
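For what it's worth, the Linux side is easy to demo: freshly faulted-in anonymous pages read back as zero (standard mmap behavior, nothing Occamy-specific):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

/* Anonymous mappings on Linux are zero-fill-on-demand: the first read of each
 * page is serviced from the shared zero page until something writes to it. */
int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);

    unsigned long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];        /* faults the pages in; they all read as zero */
    printf("sum over fresh pages = %lu\n", sum);   /* prints 0 */

    munmap(p, len);
    return 0;
}
```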
 
May 10, 2023
Wow, so do cores each have a private scratchpad memory, or are they sharing it with the 8 other cores in their group?
All cores in a cluster (8+1 in Occamy) share one scratchpad memory with multiple banks.

I can appreciate the efficiency benefits of avoiding cache lookups and coherency, but a cacheless architecture is going to create a large burden for porting legacy software. I guess you should be able to write a decent compiler for languages like OpenCL, though.
The goal is not to port legacy SW, but to use such architectures as an accelerator attached to a general-purpose computing system (i.e., something that boots and runs Linux) for data-centric computations. We use OpenMP as the main programming approach.
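To give a flavor of that host-plus-accelerator split, here is a generic OpenMP target-offload example in C (standard OpenMP 4.5+ syntax for offloading a data-parallel loop to an attached device; illustrative only, not code from the Occamy toolchain):

```c
#include <stdio.h>

/* Generic OpenMP offload pattern: the host (the Linux-capable system) maps the
 * arrays to the device, and the loop runs across the accelerator's cores. */
void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    enum { N = 1024 };
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    axpy(N, 3.0, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* prints 5.0 */
    return 0;
}
```

Compiled with an offload-capable compiler (e.g. -fopenmp plus a device target), the loop runs on the accelerator; without one, it falls back to the host.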
Wow. When I programmed a custom RISC core, in an ASIC my old company made (20 years ago), the read latency of our scratchpad memory was like 8 cycles. The core was no slouch, being designed by a few former architects and designers of DEC Alpha, so it's quite impressive you got it down to single-cycle!
It basically boils down to the tradeoff between the size of the memory and the latency you want to have. Note that we are also not pushing the clock speed as much through deeper pipelining (1 GHz in 12 nm), so that we can find a spot where single-cycle latency works.

We actually spend quite a bit of time investigating alternatives for these architectures; you can find our publications (many with publicly accessible papers) at: https://pulp-platform.org/publications.html
 

bit_user

Polypheme
Ambassador
All cores in a cluster (8+1 in Occamy) share one scratchpad memory with multiple banks.
I'm no hardware designer, but all of those muxes and ports sound pretty area- and power-intensive. GPUs seem to have a nice idea in having both some private scratchpad memory and some locally-shared memory.

The goal is not to port legacy SW, but to use such architectures as an accelerator attached to a general-purpose computing system
Yeah, but the obvious temptation is to say: "hey, it's RISC-V! Why not make it run general-purpose software?" I get it, though. You're building this truly as a GPU-like accelerator.

My next thought is that you'd immediately be at a disadvantage without SMT. Many-core CPUs can get away without SMT as long as the cores are tiny, but I think if you add big enough vector pipelines, then it really starts to be worth the trouble of adding SMT. However, then it occurs to me that DMA engines probably fill a similar role. You just have to hope you can queue up enough DMA transfers to keep the cores from stalling too much. In the end, the best mix might be a combination of low-order SMT + DMAs, rather than leaning too heavily on either one.

Also, now that I properly understand it as a GPU-like accelerator, the appropriate basis for comparison seems to be Radeon Instinct MI25 or Nvidia V100. That makes me curious how the number and size of compute units compares. How wide are your vectors?

We use OpenMP as the main programming approach
I haven't really kept up with OpenMP, but my sense is that OpenCL works a lot better in non-trivial scenarios.