There are no (data) caches in Occamy; all local memories are scratchpads, and data transfer is managed through DMA.
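To make that concrete, the usual idiom on DMA-plus-scratchpad machines is explicit double buffering: the core works on one tile while the DMA engine fills the other. This is just a sketch of the pattern, not Occamy's actual API — `dma_start`/`dma_wait` are hypothetical names, simulated here with a plain `memcpy`:

```c
#include <string.h>

#define TILE 256

/* Hypothetical DMA API -- simulated with a blocking memcpy.
   On real hardware these would program the cluster DMA engine
   and return immediately. */
static void dma_start(double *dst, const double *src, int n) {
    memcpy(dst, src, n * sizeof(double));
}
static void dma_wait(void) { /* no-op in this simulation */ }

/* Sum a large array by streaming tiles through two scratchpad
   buffers: while the core reduces tile t, the DMA engine
   (conceptually) prefetches tile t+1 into the other buffer. */
double sum_streamed(const double *mem, int n) {
    static double spm[2][TILE];        /* stand-in "scratchpad" */
    double total = 0.0;
    int tiles = n / TILE;

    dma_start(spm[0], mem, TILE);      /* prefetch first tile */
    for (int t = 0; t < tiles; t++) {
        dma_wait();                    /* tile t has landed */
        if (t + 1 < tiles)             /* kick off next transfer */
            dma_start(spm[(t + 1) % 2], mem + (t + 1) * TILE, TILE);
        for (int i = 0; i < TILE; i++)
            total += spm[t % 2][i];    /* compute overlaps the copy */
    }
    return total;
}
```

The point of the shape is that the copy and the compute overlap, so the core never stalls waiting on main memory the way a cache miss would stall it.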
Wow, so do cores each have a private scratchpad memory, or are they sharing it with the 8 other cores in their group?
I can appreciate the efficiency benefits of avoiding cache lookups and coherency, but a cacheless architecture is going to create a large burden for porting legacy software. I guess you should be able to write a decent compiler for languages like OpenCL, though.
I've long thought the best approach might be to have software-managed caches with (on-demand) hardware accelerated lookups and coherency. In other words, you have scratchpad memory, but also a CAM and maybe some transactional extensions along the lines of what Intel or ARM has. That way, you only pay the energy and latency costs of a cache when you actually need to, and otherwise you get the efficiency of a dedicated scratchpad memory.
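For concreteness, here's a minimal sketch of the software half of that idea: a direct-mapped tag array managed entirely in software, where the CAM imagined above would replace the tag compare (and maybe the fill) with a hardware fast path. All names and sizes here are made up for illustration:

```c
#include <string.h>

#define LINES 64
#define LINE_WORDS 16

typedef struct {
    long tag[LINES];               /* -1 = invalid line */
    int  data[LINES][LINE_WORDS];  /* the scratchpad backing the "cache" */
} swcache;

static void swcache_init(swcache *c) {
    for (int i = 0; i < LINES; i++) c->tag[i] = -1;
}

/* Look up a word address; on a miss, fill the line from backing
   memory in software. A CAM plus a HW fill engine would make the
   hit path cost nothing beyond a normal scratchpad access. */
static int swcache_read(swcache *c, const int *backing, long addr) {
    long line = addr / LINE_WORDS;
    long set  = line % LINES;
    if (c->tag[set] != line) {     /* miss: software-managed fill */
        memcpy(c->data[set], backing + line * LINE_WORDS,
               LINE_WORDS * sizeof(int));
        c->tag[set] = line;
    }
    return c->data[set][addr % LINE_WORDS];
}
```

When a kernel's access pattern is statically known you skip all of this and DMA directly; the lookup machinery only gets paid for by the irregular accesses that actually need it.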
Cores in a cluster basically have single-cycle access to the SPM.
Wow. When I programmed a custom RISC core in an ASIC my old company made (20 years ago), the read latency of its scratchpad memory was something like 8 cycles. The core was no slouch, being designed by a few former architects and designers of the DEC Alpha, so it's quite impressive you got it down to single-cycle!
Technically, Cell uses 6 cores for computing; 1 is held in reserve and 1 is for the OS. We use 8 compute cores in a cluster, plus a ninth that orchestrates the memory transfers in parallel.
I think that was true of how they used it in the PlayStation 3, except one SPE was disabled for yield and the 8th was devoted to DRM. The SPEs would be pretty bad at "OS" stuff, because they're in-order and rely on DMA to get data in and out of their scratchpad memory, like your CPU. The PPE runs at a lower clock speed, but it has a data cache and can do out-of-order - I'm sure that's where the OS runs.
So, the actual chip had 8 SPEs and one PPE, and I think IBM sold some accelerator boards which had all 8 SPEs enabled.
The Zero memory is a small HW trick: a physical memory region that responds to transfers with all zeroes (instead of copying data that was initialized to zero from main memory). I know it sounds silly, but it comes in handy at times and does not cost that much.
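A rough sketch of how that looks from software, with a simulated DMA engine; `ZERO_MEM` and `dma_copy` are invented names, and the `memset` stands in for hardware that simply drives zero on the bus:

```c
#include <string.h>

#define ZERO_MEM (-1L)  /* hypothetical source "address" of the zero region */

/* Simulated DMA copy into a scratchpad buffer. src is an offset
   into main memory, or ZERO_MEM. A transfer sourced from the zero
   region produces zeroes with no DRAM traffic at all; in hardware
   it's a tiny block that just returns zero for every beat. */
static void dma_copy(double *spm_dst, long src, const double *dram, int n) {
    if (src == ZERO_MEM)
        memset(spm_dst, 0, n * sizeof(double));  /* zero memory responds */
    else
        memcpy(spm_dst, dram + src, n * sizeof(double));
}
```

So clearing an accumulator tile in the scratchpad before a reduction is one ordinary DMA descriptor pointed at the zero region, instead of either a round trip to a zeroed DRAM buffer or burning core cycles on a memset.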
I've heard Apple does that in their SoCs. I know Linux has fault-based page allocation, and I think new pages are zero-initialized, so it would be useful there.