News China planning 1,600-core chips that use an entire wafer — similar to American company Cerebras 'wafer-scale' designs

Admin · Jan 4, 2024

Chinese Academy of Sciences builds 256-core Zhejiang 'Big Chip,' looking towards wafer-scale chips .

China planning 1,600-core chips that use an entire wafer — similar to American company Cerebras 'wafer-scale' designs : Read more

bit_user · Jan 4, 2024

I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.

TCA_ChinChin · Jan 4, 2024

bit_user said:
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.

I think this is more of a proof of concept and research design rather than something pushed to premature product. I imagine this would lead to domestic capability in the same realm as what Cerebras does currently, when it actually matures.

toffty · Jan 4, 2024

Two main issues with this approach are:
1. Keeping lanes the same length to each core to memory
2. Cooling such a behemoth

Let alone imperfections. Depending on the transistor size, they'll never get a fully working chip

bit_user · Jan 4, 2024

toffty said:
Two main issues with this approach are:
1. Keeping lanes the same length to each core to memory

Why?

toffty said:
2. Cooling such a behemoth

They can run it at a low enough clock speed to make the heat manageable. Here are some specs on Cerebras' CS-1:

https://www.eetimes.com/powering-and-cooling-a-wafer-scale-die/

This shows exploded views + info about the CS-2:

https://www.cerebras.net/cs2virtualtour

toffty said:
Let alone imperfections. Depending on the transistor size, they'll never get a fully working chip

Cerebras reported full yields of their WSE-1. They built enough redundancy into each die that they didn't even have to disable any of them.

ThomasKinsley · Jan 4, 2024

Let me get this straight. China is preparing to produce a wafer chip, the likes of which only Cerebras has made. And this is happening amid a CIA investigation into a potential leak of Cerebras technology into China's hands from a United Arab Emirates company headed by an ethnic Chinese CEO who renounced his American citizenship for UAE citizenship?

George³ · Jan 4, 2024

ThomasKinsley said:
Let me get this straight. China is preparing to produce a wafer chip, the likes of which only Cerebras has made. And this is happening amid a CIA investigation into a potential leak of Cerebras technology into China's hands from a United Arab Emirates company headed by an ethnic Chinese CEO who renounced his American citizenship for UAE citizenship?

You apparently failed to understand that they already have a 256 core model that they hope they can increase further. If they copied, they would already be eating whole silicon wafers, with no intermediate stages.

ThomasKinsley · Jan 4, 2024

George³ said:
You apparently failed to understand that they already have a 256 core model that they hope they can increase further. If they copied, they would already be eating whole silicon wafers, with no intermediate stages.

The 256 core model is not at wafer scale as the new 1,600 core chip is. The article indicates Cerebras finally figured out how to do it after overcoming significant manufacturing complexity. The timing of this is peculiar given that there is an international investigation analyzing whether G42 gave China Cerebras IP. It's not proof, but it's indicative that there may have been a technology transfer.

eryenakgun · Jan 4, 2024

bit_user said:
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.

Current

eryenakgun · Jan 4, 2024

bit_user said:
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.

Current GPUs has 10k-20k cores in it, while all of them connected in common RAM bus. It is not a rocket science. Just connect CPUs together. And link them to RAM.

Notton · Jan 4, 2024

To me, this looks like an experiment to test how good the domestic chip production is.
Even if it doesn't work well, they will gain experience from it.

bit_user · Jan 4, 2024

eryenakgun said:
Current GPUs has 10k-20k cores in it,

That's different, and it's really just Nvidia saying that. Nvidia is pretending that each SIMD lane is a "core", even though they're tied together. By that definition, AMD's Bergamo 128-core EPYC would actually have 6144 "cores", because they can issue up to six 256-bit AVX instructions per clock.

Anyway, so no. These aren't like GPU "cores", they're real RISC-V CPU cores that can each execute a different program.

eryenakgun said:
while all of them connected in common RAM bus.

First, the term "bus" doesn't apply. I won't go into the details, but you can't do it with a bus. You need a packet-switched network of some kind.

eryenakgun said:
It is not a rocket science. Just connect CPUs together. And link them to RAM.

Heh, that's cute. No, it's not as simple as you make it sound. Cache coherence is actually pretty challenging - especially as you try to scale it up.

Cache coherency protocols (examples) - Wikipedia

en.wikipedia.org

Beyond that, you need both enough memory bandwidth and relatively low latency. Now, let's just look at the bandwidth aspect. The industry seems to have settled on having at least one 64-bit DDR5 DIMM per 8 cores. Since these are smaller, simpler cores, even a formula like 1 DIMM per 16 cores would mean having at least 100 DIMMs in the box with the 1600 core version.

Where are you going to put them all, and how are you going to correct them? If we consider that each of the latest EPYC CPUs can support up to 24 DIMMs, the 2-CPU boards with 48 DIMM slots hardly have room for anything else!

Gigabyte has a 48 DIMM 2P AMD EPYC Genoa GPU Server at SC22

At SC22, we saw a Gigabyte GPU server with 48x DDR5 DIMM slots. We also saw a STH video on display and an Ampere Altra Arm with NVIDIA server

www.servethehome.com

purpleduggy · Jan 5, 2024

bit_user said:
I think this doesn't have too much potential, as a general-purpose architecture. The main problem is how to connect up enough RAM to support all of those cores running general-purpose workloads. Even if you could connect the RAM and have enough bandwidth, maintaining cache coherency over 1600 cores would seem to be quite taxing.

Right now, the best way to use such dense compute is in dataflow computing, like what Cerebras does.

put the ram on the wafer itself, HBM.

purpleduggy · Jan 5, 2024

bit_user said:
That's different, and it's really just Nvidia saying that. Nvidia is pretending that each SIMD lane is a "core", even though they're tied together. By that definition, AMD's Bergamo 128-core EPYC would actually have 6144 "cores", because they can issue up to six 256-bit AVX instructions per clock.

Anyway, so no. These aren't like GPU "cores", they're real RISC-V CPU cores that can each execute a different program.

First, the term "bus" doesn't apply. I won't go into the details, but you can't do it with a bus. You need a packet-switched network of some kind.

Heh, that's cute. No, it's not as simple as you make it sound. Cache coherence is actually pretty challenging - especially as you try to scale it up.

Cache coherency protocols (examples) - Wikipedia

en.wikipedia.org

Beyond that, you need both enough memory bandwidth and relatively low latency. Now, let's just look at the bandwidth aspect. The industry seems to have settled on having at least one 64-bit DDR5 DIMM per 8 cores. Since these are smaller, simpler cores, even a formula like 1 DIMM per 16 cores would mean having at least 100 DIMMs in the box with the 1600 core version.

Where are you going to put them all, and how are you going to correct them? If we consider that each of the latest EPYC CPUs can support up to 24 DIMMs, the 2-CPU boards with 48 DIMM slots hardly have room for anything else!

Gigabyte has a 48 DIMM 2P AMD EPYC Genoa GPU Server at SC22

At SC22, we saw a Gigabyte GPU server with 48x DDR5 DIMM slots. We also saw a STH video on display and an Ampere Altra Arm with NVIDIA server

www.servethehome.com

those shader cores are actual cores, not just simd lanes. simplifying them to simd lanes is just an often used analogy. if they were just solely simd lanes you would not be able to run LLM, AI or Raytracing on them. they are that and much more, with each core having its own resources and cache. they really are cores. if you simplify them to simd lanes, then you can simplify any cpu core to a simd lane as well. the analogy works. but its just an analogy.

expunged · Jan 5, 2024

This will never be feasible. China cannot even build an apartment building that does not fall over. TSMC is miles ahead of china and still has defects on platters. If they try this i would guestimate they get 1 working platter out of 1000. They would be better off with the AMD approach, cut out the non-defective chips and glue them together on silicone.

Pierce2623 · Jan 5, 2024

purpleduggy said:
those shader cores are actual cores, not just simd lanes. simplifying them to simd lanes is just an often used analogy. if they were just solely simd lanes you would not be able to run LLM, AI or Raytracing on them. they are that and much more, with each core having its own resources and cache. they really are cores. if you simplify them to simd lanes, then you can simplify any cpu core to a simd lane as well. the analogy works. but its just an analogy.

Nvidia does NOT have the number of discrete cores that they list as “Cuda Cores”. Each core has two simd pipes so they count them twice. They issue a wave32 wavefront to each SM.

bit_user · Jan 5, 2024

purpleduggy said:
put the ram on the wafer itself, HBM.

The article mentions that. Assuming it's practical, I think it's the best option. Even better would be to sandwich SRAM between the DRAM and compute dies.

Capacity scaling would be limited, but I think it'd be enough for many general-purpose use cases.

I wonder if Cerebras will do die stacking of the SRAM, at least.

bit_user · Jan 5, 2024

purpleduggy said:
those shader cores are actual cores, not just simd lanes. simplifying them to simd lanes is just an often used analogy.

No, they're definitely not. If we define a core a something with its own program counter and fully-independent execution flow, then no. In Nvidia's docs, a thread is not the same thing as a CPU thread. What they call a "warp" is equivalent to a CPU thread.

Here's a view of a Hopper SM (Streaming Multiprocessor). Inside each one, there are basically 4 cores.

Source: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

They play some games that narrow the gap between their SIMD lanes and a conventional CPU thread, but for the sake of comparing with CPUs, a Nvidia "core" is most definitely not equivalent to a CPU core.

bit_user · Jan 5, 2024

expunged said:
TSMC is miles ahead of china and still has defects on platters. If they try this i would guestimate they get 1 working platter out of 1000. They would be better off with the AMD approach, cut out the non-defective chips and glue them together on silicone.

Here's a Hot Chips presentation which describes Cerebras' approach to defects. Basically, they build redundancy into the dies, and also have a mechanism for disabling entire bad dies (which I think I read they've never had to do).

Source: https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

purpleduggy · Jan 5, 2024

bit_user said:
No, they're definitely not. If we define a core a something with its own program counter and fully-independent execution flow, then no. In Nvidia's docs, a thread is not the same thing as a CPU thread. What they call a "warp" is equivalent to a CPU thread.

Here's a view of a Hopper SM (Streaming Multiprocessor). Inside each one, there are basically 4 cores.

Source: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

They play some games that narrow the gap between their SIMD lanes and a conventional CPU thread, but for the sake of comparing with CPUs, a Nvidia "core" is most definitely not equivalent to a CPU core.

you can apply this same tangent to cpu cores. the way the instruction sets work and handle threads are exactly simd lanes. for eg. MMX, SSE, AVX etc
all cores are essentially simd lanes.

purpleduggy · Jan 5, 2024

cooling a whole wafer can be done from both sides, with the PCIe lanes and buses on the surrounding edge.

bit_user · Jan 6, 2024

purpleduggy said:
you can apply this same tangent to cpu cores. the way the instruction sets work and handle threads are exactly simd lanes. for eg. MMX, SSE, AVX etc
all cores are essentially simd lanes.

You can program CPU cores as if they're SIMD, but it's actually a misnomer to call these vector instructions "SIMD", since that's only one programming model they support. For instance, it's common to do 3D vector arithmetic with them, by treating them as vector quantities and doing horizontal operations like dot-product. That's not SIMD and it's not how GPUs are programmed.

purpleduggy · Jan 6, 2024

bit_user said:
You can program CPU cores as if they're SIMD, but it's actually a misnomer to call these vector instructions "SIMD", since that's only one programming model they support. For instance, it's common to do 3D vector arithmetic with them, by treating them as vector quantities and doing horizontal operations like dot-product. That's not SIMD and it's not how GPUs are programmed.

i get what you are saying, but you're differentiating what cores are based on how directly it is connected to hardware infrastructure, ie. what buses it has connected to it and the usual hardware intricacies like clock speed. just because a core like CUDA doesn't have any buses like a cpu core, doesn't mean it doesn't qualify as a core. a good example are cores emulated on an FPGA. while the the physical FPGA is directly connected to hardware, the core being emulated is still a core. the fpga is pretending to be that core, latencies and inefficiencies as a result of this aside. anyway this is semantics and i'll cede my point because you are right and i just wanted to test whether i could defend this hypothetical position. even if you are right, cpu cores are still very similar if you think about it. in truth there is only 1 cpu, the "cores" are virtual groupings of resources and not really separate cpus. even more, multithreaded applications still suck and single core IPC is still what matters most, as core 0 does all the work and the other virtual cores just chip in every now and then if they are allowed to without reducing speed or increasing latency. amdahls law still applies regardless of what marketing claims.

bit_user · Jan 6, 2024

purpleduggy said:
i get what you are saying, but you're differentiating what cores are based on how directly it is connected to hardware infrastructure, ie. what buses it has connected to it and the usual hardware intricacies like clock speed.

Basically, I'm saying a real core has its own instruction stream and branch unit. In modern GPUs, there's both a scalar unit and a vector unit - if that doesn't tell you each lane of the vector unit isn't a "core", then I'm not really sure you can define what else a core would be. If we still called each SIMD lane a "core", then what is all of that other stuff? You'd have to invent some new construct, which would basically end up being synonymous with a CPU "core".

purpleduggy said:
even if you are right, cpu cores are still very similar if you think about it. in truth there is only 1 cpu, the "cores" are virtual groupings of resources and not really separate cpus.

They're physical implementations of thread execution state. You have real, physical registers, ALUs, and all the rest of it, and they're all connected together in a way that you can't just have parts of one core work with other parts of another core. In that sense, there's nothing virtual about them.

SMT does introduce the notion of a virtual core, but a SMT thread is executing on a real, physical core at any given point in time. The physical cores are then explicitly designed to support this fiction, by maintaining multiple sets of execution state and keeping track of which resources belong to which SMT thread.

purpleduggy said:
even more, multithreaded applications still suck and single core IPC is still what matters most, as core 0 does all the work and the other virtual cores just chip in every now and then if they are allowed to without reducing speed or increasing latency.

Leaving aside the topic of Big.Little hybrid architectures, CPUs don't generally have a preference among cores. Intel has sort of introduced that, by designating the highest-clocking core as the preferred one (I think this their name for this feature is Turbo Boost Max, and not all of their CPUs have it), but the cores are architecturally peers and there are even mechanisms to ensure that interrupts get distributed evenly.

As for how multithreaded applications are designed, it really depends. You're right that there's often a main thread that's responsible for launching & synchronizing with worker threads, but it's not the only processing model out there. For instance, you can write a multithreaded program as a set of state machines, with each state transition triggering another work item. Start up a thread pool, queue up the initial work items, and then the main thread can join the workers and they all share in processing work items until the end state is reached.

purpleduggy said:
amdahls law still applies regardless of what marketing claims.

The key is to parallelize the overhead. That's an example of where dataflow architectures really come into their own. The following year, Cerebras gave a talk focused on the programming model of their CS-1.

Source: https://www.anandtech.com/show/16006/hot-chips-2020-live-blog-cerebras-wse-programming-300pm-pt

News China planning 1,600-core chips that use an entire wafer — similar to American company Cerebras 'wafer-scale' designs

Administrator

Titan

Distinguished

Distinguished

Titan

Notable

Respectable

Notable

Estimable

Titan

Prominent

Prominent

Distinguished

Commendable

Titan

Titan

Titan

Prominent

Prominent

Titan

Prominent

Titan

Share this page