News China planning 1,600-core chips that use an entire wafer — similar to American company Cerebras' 'wafer-scale' designs

bit_user

Titan
Ambassador
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.
 
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.
I think this is more of a proof of concept and research design than something being pushed prematurely into a product. I imagine this would lead to domestic capability in the same realm as what Cerebras does currently, once it actually matures.
 

toffty

Distinguished
Feb 11, 2015
79
73
18,610
Two main issues with this approach are:
1. Keeping the lanes from each core to memory the same length
2. Cooling such a behemoth

Let alone imperfections. Depending on the transistor size, they'll never get a fully working chip.
 

bit_user

Titan
Ambassador
Two main issues with this approach are:
1. Keeping the lanes from each core to memory the same length
Why?

2. Cooling such a behemoth
They can run it at a low enough clock speed to make the heat manageable. Here are some specs on Cerebras' CS-1:

This shows exploded views + info about the CS-2:

Let alone imperfections. Depending on the transistor size, they'll never get a fully working chip.
Cerebras reported full yields of their WSE-1. They built enough redundancy into each die that they didn't even have to disable any of them.
 

ThomasKinsley

Notable
Oct 4, 2023
385
384
1,060
Let me get this straight. China is preparing to produce a wafer chip, the likes of which only Cerebras has made. And this is happening amid a CIA investigation into a potential leak of Cerebras technology into China's hands from a United Arab Emirates company headed by an ethnic Chinese CEO who renounced his American citizenship for UAE citizenship?
 

George³

Prominent
Oct 1, 2022
228
124
760
Let me get this straight. China is preparing to produce a wafer chip, the likes of which only Cerebras has made. And this is happening amid a CIA investigation into a potential leak of Cerebras technology into China's hands from a United Arab Emirates company headed by an ethnic Chinese CEO who renounced his American citizenship for UAE citizenship?
You apparently failed to understand that they already have a 256-core model that they hope they can scale up further. If they had copied, they would already be using whole silicon wafers, with no intermediate stages.
 

ThomasKinsley

Notable
Oct 4, 2023
385
384
1,060
You apparently failed to understand that they already have a 256-core model that they hope they can scale up further. If they had copied, they would already be using whole silicon wafers, with no intermediate stages.
The 256-core model is not at wafer scale, as the new 1,600-core chip is. The article indicates Cerebras finally figured out how to do it after overcoming significant manufacturing complexity. The timing of this is peculiar, given that there is an international investigation analyzing whether G42 gave Cerebras IP to China. It's not proof, but it's indicative that there may have been a technology transfer.
 
  • Like
Reactions: bit_user
Dec 22, 2023
3
1
10
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.
Current GPUs have 10k-20k cores in them, all connected to a common RAM bus. It's not rocket science. Just connect the CPUs together and link them to RAM.
 
  • Like
Reactions: purpleduggy

Notton

Commendable
Dec 29, 2023
865
764
1,260
To me, this looks like an experiment to test how good domestic chip production is.
Even if it doesn't work well, they will gain experience from it.
 

bit_user

Titan
Ambassador
Current GPUs have 10k-20k cores in them,
That's different, and it's really just Nvidia saying that. Nvidia is pretending that each SIMD lane is a "core", even though they're tied together. By that definition, AMD's Bergamo 128-core EPYC would actually have 6144 "cores", because they can issue up to six 256-bit AVX instructions per clock.

Anyway, so no. These aren't like GPU "cores", they're real RISC-V CPU cores that can each execute a different program.

all connected to a common RAM bus.
First, the term "bus" doesn't apply. I won't go into the details, but you can't do it with a bus. You need a packet-switched network of some kind.

It's not rocket science. Just connect the CPUs together and link them to RAM.
Heh, that's cute. No, it's not as simple as you make it sound. Cache coherence is actually pretty challenging - especially as you try to scale it up.

Beyond that, you need both enough memory bandwidth and relatively low latency. Now, let's just look at the bandwidth aspect. The industry seems to have settled on having at least one 64-bit DDR5 DIMM per 8 cores. Since these are smaller, simpler cores, even a formula like 1 DIMM per 16 cores would mean having at least 100 DIMMs in the box with the 1600 core version.

Where are you going to put them all, and how are you going to connect them? If we consider that each of the latest EPYC CPUs can support up to 24 DIMMs, the 2-CPU boards with 48 DIMM slots hardly have room for anything else!
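Putting rough numbers on the DIMM math (the cores-per-DIMM ratios are the ones above; the DDR5-4800 per-DIMM bandwidth is my own assumed figure, not a spec from the article):

```python
# Rough sketch of the DIMM count and aggregate bandwidth. The per-DIMM
# bandwidth assumes one 64-bit DDR5-4800 channel (4800 MT/s * 8 B), which
# is an illustrative assumption.

CORES = 1600
GB_S_PER_DIMM = 38.4  # 4800 MT/s * 8 bytes = 38.4 GB/s

for cores_per_dimm in (8, 16):
    dimms = CORES // cores_per_dimm
    agg_tb_s = dimms * GB_S_PER_DIMM / 1000
    print(f"{cores_per_dimm} cores per DIMM -> {dimms} DIMMs, "
          f"~{agg_tb_s:.1f} TB/s aggregate")
```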

 
  • Like
Reactions: snemarch

purpleduggy

Prominent
Apr 19, 2023
167
44
610
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.
Put the RAM on the wafer itself: HBM.
 
  • Like
Reactions: bit_user

purpleduggy

Prominent
Apr 19, 2023
167
44
610
That's different, and it's really just Nvidia saying that. Nvidia is pretending that each SIMD lane is a "core", even though they're tied together. By that definition, AMD's Bergamo 128-core EPYC would actually have 6144 "cores", because they can issue up to six 256-bit AVX instructions per clock.

Anyway, so no. These aren't like GPU "cores", they're real RISC-V CPU cores that can each execute a different program.


First, the term "bus" doesn't apply. I won't go into the details, but you can't do it with a bus. You need a packet-switched network of some kind.


Heh, that's cute. No, it's not as simple as you make it sound. Cache coherence is actually pretty challenging - especially as you try to scale it up.

Beyond that, you need both enough memory bandwidth and relatively low latency. Now, let's just look at the bandwidth aspect. The industry seems to have settled on having at least one 64-bit DDR5 DIMM per 8 cores. Since these are smaller, simpler cores, even a formula like 1 DIMM per 16 cores would mean having at least 100 DIMMs in the box with the 1600 core version.

Where are you going to put them all, and how are you going to connect them? If we consider that each of the latest EPYC CPUs can support up to 24 DIMMs, the 2-CPU boards with 48 DIMM slots hardly have room for anything else!
Those shader cores are actual cores, not just SIMD lanes. Simplifying them to SIMD lanes is just an often-used analogy. If they were solely SIMD lanes, you would not be able to run LLMs, AI, or ray tracing on them. They are that and much more, with each core having its own resources and cache. They really are cores. If you simplify them to SIMD lanes, then you can simplify any CPU core to a SIMD lane as well. The analogy works, but it's just an analogy.
 

expunged

Distinguished
Jun 15, 2015
61
8
18,545
This will never be feasible. China cannot even build an apartment building that does not fall over. TSMC is miles ahead of China and still has defects on wafers. If they try this, I would guesstimate they get 1 working wafer out of 1,000. They would be better off with the AMD approach: cut out the non-defective chips and glue them together on silicon.
 

Pierce2623

Prominent
Dec 3, 2023
485
368
560
Those shader cores are actual cores, not just SIMD lanes. Simplifying them to SIMD lanes is just an often-used analogy. If they were solely SIMD lanes, you would not be able to run LLMs, AI, or ray tracing on them. They are that and much more, with each core having its own resources and cache. They really are cores. If you simplify them to SIMD lanes, then you can simplify any CPU core to a SIMD lane as well. The analogy works, but it's just an analogy.
Nvidia does NOT have the number of discrete cores that they list as "CUDA cores". Each core has two SIMD pipes, so they count them twice. They issue a wave32 wavefront to each SM.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
Put the RAM on the wafer itself: HBM.
The article mentions that. Assuming it's practical, I think it's the best option. Even better would be to sandwich SRAM between the DRAM and compute dies.

Capacity scaling would be limited, but I think it'd be enough for many general-purpose use cases.

I wonder if Cerebras will do die stacking of the SRAM, at least.
 

bit_user

Titan
Ambassador
Those shader cores are actual cores, not just SIMD lanes. Simplifying them to SIMD lanes is just an often-used analogy.
No, they're definitely not. If we define a core as something with its own program counter and fully independent execution flow, then no. What Nvidia's docs call a "thread" is not the same thing as a CPU thread; what they call a "warp" is equivalent to a CPU thread.

Here's a view of a Hopper SM (Streaming Multiprocessor). Inside each one, there are basically 4 cores.

[Image: H100 Streaming Multiprocessor (SM) block diagram]

Source: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

They play some games that narrow the gap between their SIMD lanes and a conventional CPU thread, but for the sake of comparing with CPUs, a Nvidia "core" is most definitely not equivalent to a CPU core.
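If it helps, here's a toy model (purely illustrative, not how the hardware is actually implemented) of why a warp behaves like one thread: a single instruction stream is stepped across all 32 lanes, and a data-dependent branch just masks lanes on and off, so the lanes can't go their separate ways the way real cores can.

```python
# Toy model of lockstep execution within a "warp" (illustrative only).
# One program counter drives all 32 lanes; an if/else is handled by
# masking, so both paths consume issue slots even when only some lanes
# need them. That's why a lane isn't an independent core.

WARP_SIZE = 32

def run_warp(xs):
    out = [0] * WARP_SIZE
    mask = [x % 2 == 0 for x in xs]       # "if (x % 2 == 0)"

    for lane in range(WARP_SIZE):         # then-path, masked per lane
        if mask[lane]:
            out[lane] = xs[lane] * 2

    for lane in range(WARP_SIZE):         # else-path also takes a pass
        if not mask[lane]:
            out[lane] = xs[lane] + 1
    return out

print(run_warp(list(range(WARP_SIZE))))
```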
 

bit_user

Titan
Ambassador
TSMC is miles ahead of China and still has defects on wafers. If they try this, I would guesstimate they get 1 working wafer out of 1,000. They would be better off with the AMD approach: cut out the non-defective chips and glue them together on silicon.
Here's a Hot Chips presentation which describes Cerebras' approach to defects. Basically, they build redundancy into the dies, and also have a mechanism for disabling entire bad dies (which I think I read they've never had to do).
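To illustrate why that redundancy matters, here's a toy Poisson yield model (the defect density and die area are my own illustrative guesses, not Cerebras' or any fab's real numbers):

```python
# Toy Poisson yield model. Defect density and die area are illustrative
# guesses, not real fab data. The point: a monolithic wafer-sized die is
# essentially never defect-free, but a few dozen random defects are easy
# to absorb when every row of tiny cores has spares.

import math

defects_per_cm2 = 0.1     # assumed defect density
wafer_area_cm2 = 462.0    # roughly wafer-scale die area (assumption)

expected_defects = defects_per_cm2 * wafer_area_cm2
p_defect_free = math.exp(-expected_defects)   # Poisson P(0 defects)

print(f"expected defects per wafer: {expected_defects:.0f}")
print(f"chance of a defect-free wafer with no redundancy: {p_defect_free:.1e}")
```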
 
  • Like
Reactions: purpleduggy

purpleduggy

Prominent
Apr 19, 2023
167
44
610
No, they're definitely not. If we define a core as something with its own program counter and fully independent execution flow, then no. What Nvidia's docs call a "thread" is not the same thing as a CPU thread; what they call a "warp" is equivalent to a CPU thread.

Here's a view of a Hopper SM (Streaming Multiprocessor). Inside each one, there are basically 4 cores.
[Image: H100 Streaming Multiprocessor (SM) block diagram]

They play some games that narrow the gap between their SIMD lanes and a conventional CPU thread, but for the sake of comparing with CPUs, a Nvidia "core" is most definitely not equivalent to a CPU core.
You can apply this same tangent to CPU cores. The way the instruction sets work and handle threads is exactly SIMD lanes, e.g. MMX, SSE, AVX, etc.
All cores are essentially SIMD lanes.
 

bit_user

Titan
Ambassador
You can apply this same tangent to CPU cores. The way the instruction sets work and handle threads is exactly SIMD lanes, e.g. MMX, SSE, AVX, etc.
All cores are essentially SIMD lanes.
You can program CPU cores as if they're SIMD, but it's actually a misnomer to call these vector instructions "SIMD", since that's only one programming model they support. For instance, it's common to do 3D vector arithmetic with them, by treating them as vector quantities and doing horizontal operations like dot-product. That's not SIMD and it's not how GPUs are programmed.
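As a concrete (if simplified) illustration of the two usage models, with plain Python standing in for the register lanes (the 4-lane width and the values are just assumptions for the example):

```python
# Simplified illustration of two ways a 4-wide vector register is used.
# Plain Python lists stand in for register lanes; no intrinsics here.

a = [1.0, 2.0, 3.0, 4.0]   # lanes of one register
b = [5.0, 6.0, 7.0, 8.0]   # lanes of another

# SIMD / GPU-style: the same elementwise op on every lane, where each lane
# logically belongs to a different work item.
elementwise = [x * y for x, y in zip(a, b)]

# CPU 3D-math style: each register holds one geometric vector, and a
# horizontal reduction (dot product) combines lanes of the same register.
# That's vector arithmetic, not the SIMD programming model.
dot = sum(x * y for x, y in zip(a, b))

print(elementwise, dot)
```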
 
  • Like
Reactions: purpleduggy

purpleduggy

Prominent
Apr 19, 2023
167
44
610
You can program CPU cores as if they're SIMD, but it's actually a misnomer to call these vector instructions "SIMD", since that's only one programming model they support. For instance, it's common to do 3D vector arithmetic with them, by treating them as vector quantities and doing horizontal operations like dot-product. That's not SIMD and it's not how GPUs are programmed.
I get what you are saying, but you're differentiating what cores are based on how directly a core is connected to the hardware infrastructure, i.e. what buses it has connected to it and the usual hardware intricacies like clock speed. Just because a CUDA core doesn't have any buses like a CPU core doesn't mean it doesn't qualify as a core. A good example is cores emulated on an FPGA: while the physical FPGA is directly connected to hardware, the core being emulated is still a core. The FPGA is pretending to be that core, latencies and inefficiencies aside. Anyway, this is semantics, and I'll cede the point because you are right; I just wanted to test whether I could defend this hypothetical position. Even if you are right, CPU cores are still very similar if you think about it. In truth there is only one CPU; the "cores" are virtual groupings of resources and not really separate CPUs. Even more, multithreaded applications still suck and single-core IPC is still what matters most, as core 0 does all the work and the other virtual cores just chip in every now and then, if they are allowed to without reducing speed or increasing latency. Amdahl's law still applies regardless of what marketing claims.
 

bit_user

Titan
Ambassador
I get what you are saying, but you're differentiating what cores are based on how directly a core is connected to the hardware infrastructure, i.e. what buses it has connected to it and the usual hardware intricacies like clock speed.
Basically, I'm saying a real core has its own instruction stream and branch unit. In modern GPUs, there's both a scalar unit and a vector unit - if that doesn't tell you each lane of the vector unit isn't a "core", then I'm not really sure you can define what else a core would be. If we still called each SIMD lane a "core", then what is all of that other stuff? You'd have to invent some new construct, which would basically end up being synonymous with a CPU "core".

Even if you are right, CPU cores are still very similar if you think about it. In truth there is only one CPU; the "cores" are virtual groupings of resources and not really separate CPUs.
They're physical implementations of thread execution state. You have real, physical registers, ALUs, and all the rest of it, and they're all connected together in a way that you can't just have parts of one core work with other parts of another core. In that sense, there's nothing virtual about them.

SMT does introduce the notion of a virtual core, but an SMT thread is executing on a real, physical core at any given point in time. The physical cores are then explicitly designed to support this fiction, by maintaining multiple sets of execution state and keeping track of which resources belong to which SMT thread.

Even more, multithreaded applications still suck and single-core IPC is still what matters most, as core 0 does all the work and the other virtual cores just chip in every now and then, if they are allowed to without reducing speed or increasing latency.
Leaving aside the topic of big.LITTLE hybrid architectures, CPUs don't generally have a preference among cores. Intel has sort of introduced that, by designating the highest-clocking core as the preferred one (I think their name for this feature is Turbo Boost Max, and not all of their CPUs have it), but the cores are architecturally peers and there are even mechanisms to ensure that interrupts get distributed evenly.

As for how multithreaded applications are designed, it really depends. You're right that there's often a main thread that's responsible for launching & synchronizing with worker threads, but it's not the only processing model out there. For instance, you can write a multithreaded program as a set of state machines, with each state transition triggering another work item. Start up a thread pool, queue up the initial work items, and then the main thread can join the workers and they all share in processing work items until the end state is reached.
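A minimal sketch of that work-queue style (everything here, including the stage names and worker count, is made up for the example, not from any real codebase):

```python
# Minimal sketch of the state-machine / work-queue model described above.
# Workers pull items from a shared queue; finishing an item may enqueue
# the next state's item, so no single "main" thread serializes the work
# beyond seeding the queue.

import queue
import threading

work = queue.Queue()

def handle(item):
    stage, value = item
    if stage == "start":
        work.put(("finish", value * 2))   # state transition -> more work
    # "finish" items are terminal in this toy example

def worker():
    while True:
        item = work.get()
        if item is None:                  # poison pill: shut down
            break
        handle(item)
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in range(8):                        # seed the initial work items
    work.put(("start", i))

work.join()                               # all items (and follow-ons) done
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
```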

Amdahl's law still applies regardless of what marketing claims.
The key is to parallelize the overhead. That's an example of where dataflow architectures really come into their own. The following year, Cerebras gave a talk focused on the programming model of their CS-1.
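For a sense of scale, plugging some illustrative serial fractions into Amdahl's law (my own numbers, just to show why the overhead term dominates at 1,600 cores):

```python
# Amdahl's law with illustrative serial fractions (my picks, not
# measurements). At 1600 cores, even a tiny serial/overhead fraction
# caps the achievable speedup, hence the focus on parallelizing the
# overhead itself.

def amdahl_speedup(parallel_fraction, n_cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

for serial in (0.10, 0.01, 0.001):
    s = amdahl_speedup(1.0 - serial, 1600)
    print(f"serial fraction {serial:.1%}: speedup on 1600 cores ~ {s:.0f}x")
```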
 
  • Like
Reactions: purpleduggy