News Cerebras launches 900,000-core 125 PetaFLOPS wafer-scale processor for AI — theoretically equivalent to about 62 Nvidia H100 GPUs


gg83

Distinguished
Jul 10, 2015
So Cerebras is better per transistor. If one Cerebras chip has 4 trillion transistors and it's "equivalent" to 62 H100s @ 80 billion transistors each, I get 4 trillion vs. 4.96 trillion on the Nvidia side. Maybe my math is nonsense. If Cerebras can pull this off, could Tenstorrent have a harder time getting into the top AI game?
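Just to sanity-check my own napkin math (figures straight from the article, nothing official from either vendor):

Code:
# Napkin math only, using the figures quoted in the article.
h100_transistors = 80e9            # per H100
wafer_transistors = 4e12           # Cerebras WSE, whole wafer
equivalent_h100s = 62

nvidia_side = equivalent_h100s * h100_transistors
print(nvidia_side)                        # 4.96e12, vs. 4e12 on the Cerebras side
print(nvidia_side / wafer_transistors)    # ~1.24x more transistors for the same claimed compute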
 

bit_user

Titan
Ambassador
Chips like this are why Sam Altman wants to scale up global fab capacity by like 10x.

I'm a little curious why none of their slides (at least of those shown here) highlighted energy efficiency. I expect that's also an area where they have a considerable lead over Nvidia. I'd be even more curious to know how perf/$ compares!
 

bit_user

Titan
Ambassador
One chip to rule them all.
Technically one chip, but the tiles are effectively different chips.

54 GB of on-chip memory! It has more cache than I have memory on my computer.
Accessing SRAM on another tile probably takes longer than it takes your CPU to read any part of your DRAM. That's not to say one is better or worse than the other, in the abstract.

Their solution is very well suited to dataflow processing and is nearly ideal, if you can fit all of your weights in the 44 GB* on a single wafer. You also want those weights to roughly align with the computational density of the network; otherwise, there could be some underutilization of either compute or memory.

[Image: Cerebras slide showing the WSE's specifications, including 44 GB of on-chip SRAM]


* Image indicates 44 GB of SRAM.
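For a rough sense of scale, here's the kind of back-of-the-envelope estimate I mean (the BF16 precision and treating GB as 10^9 bytes are my assumptions, not anything Cerebras has stated):

Code:
# Rough estimate of how many weights fit entirely on-wafer, assuming BF16.
sram_bytes = 44e9          # 44 GB of on-die SRAM, treating GB as 10^9 bytes
bytes_per_weight = 2       # BF16/FP16
print(sram_bytes / bytes_per_weight / 1e9)   # ~22 billion parameters, ignoring activations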
 

JTWrenn

Distinguished
Aug 5, 2008
Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute? How are they going to deal with chip manufacturing issues at that scale to ramp production up? It's an interesting idea but....this just seems like a weird approach.

I'll be glad to see some real competition for Nvidia, but this idea seems like it may have a lot of issues. Maybe I'm missing something.
 

bit_user

Titan
Ambassador
Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute? How are they going to deal with chip manufacturing issues at that scale to ramp production up? It's an interesting idea but....this just seems like a weird approach.
You're getting mixed up between system specifications and chip specifications. A "DGX H100" system contains 8x H100 chips[1].

Compute-wise, it's worth pointing out that Nvidia touts the DGX H100's 32 petaFLOPS of FP8 performance[2], which is probably why Cerebras is focusing on training performance. On that front (BF16), the DGX H100 has only 16 petaFLOPS.
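Running the numbers, with the 125 PFLOPS figure taken from the article's headline (treat this as a rough cross-check, not anything official):

Code:
# Rough cross-check of the "8x" claim.
wse3_pflops = 125             # Cerebras' headline figure from the article
dgx_h100_pflops_bf16 = 16     # BF16 for an 8-GPU DGX H100, per the figure above
print(wse3_pflops / dgx_h100_pflops_bf16)   # ~7.8, i.e. roughly "8x"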

References:
  1. https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
  2. https://nvidianews.nvidia.com/news/...ds-most-advanced-enterprise-ai-infrastructure
 

nightbird321

Distinguished
Sep 15, 2012
One chip to rule them all.
54 GB of on-chip memory! It has more cache than I have memory on my computer.
I'm going to go with 44GB since that is what is in the article.

44GB over 900,000 cores is only 50KB per core of cache memory, which is tiny compared with consumer CPUs. As a special-purpose chip with specifically written code, it likely doesn't need a big memory for random access. (Unless I'm misunderstanding what they mean by on-chip memory.)
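Spelling out that arithmetic (just my own division, with GB taken as 10^9 bytes):

Code:
# Per-core share of on-chip SRAM.
sram_bytes = 44e9      # 44 GB, treating GB as 10^9 bytes
cores = 900_000
print(sram_bytes / cores)   # ~48,900 bytes, i.e. roughly 50 KB per core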
 

bit_user

Titan
Ambassador
44GB over 900,000 cores is only 50KB per core
If you compute the amount of silicon area or transistors per core, it's pretty clear that these are probably similar to what Nvidia calls a "core". In other words, something more like a SIMD lane.

of cache memory which is tiny compared with consumer CPUs.
I doubt it works as a cache; it's probably directly addressable. Cache lookups require associative memory, which wastes die space and power for something that's really not necessary when your access patterns are predictable. The normal way this works is that you have a double-buffering type scheme, with a DMA engine that drains/fills the inactive buffer while you're computing on the contents of the active buffer.
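Here's a minimal sketch of what I mean, in Python pseudocode rather than anything resembling Cerebras' (or Cell's) actual programming model. In real hardware the DMA fill of the inactive buffer overlaps with compute on the active one; here it's written sequentially just to show the alternation:

Code:
# Illustrative double-buffering loop. All names and sizes are made up;
# a real DMA engine would fill the inactive buffer *while* compute runs.

CHUNK = 4
weight_stream = list(range(16))          # stand-in for weights sitting in DRAM
buffers = [[0] * CHUNK, [0] * CHUNK]     # two scratchpad buffers in on-die SRAM

def dma_fill(buf, offset):
    """Pretend DMA: copy the next chunk of the weight stream into a scratchpad buffer."""
    buf[:] = weight_stream[offset:offset + CHUNK]

def compute(buf):
    """Pretend kernel: do some work on whatever is in the active buffer."""
    return sum(buf)

active = 0
dma_fill(buffers[active], 0)                      # prime the first buffer
total = 0
for offset in range(CHUNK, len(weight_stream) + CHUNK, CHUNK):
    inactive = 1 - active
    if offset < len(weight_stream):
        dma_fill(buffers[inactive], offset)       # would overlap with compute in hardware
    total += compute(buffers[active])             # work on the already-filled buffer
    active = inactive                             # swap roles and repeat

print(total)   # 120 == sum(range(16)): every chunk was processed exactly once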

The IBM Cell processor, famously used in Sony's PS3, worked this way. Each of its 8 SPEs had 256 kB of SRAM that it used like this. The SPEs had no direct access to system memory; they relied on DMAs to copy everything in/out of their scratchpad memory.

As a specific purpose chip with specifically written code, it likely doesn't need a big memory for random access. (Unless I am misunderstanding what they mean by on-chip memory)
I think a big use case for external DRAM access is streaming in weights, in the event you don't have enough SRAM to keep them all on-die. That's probably what eats most of the massive memory bandwidth on Nvidia's Hopper, for instance.
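To put a number on that (all of these figures are my own assumptions for the sake of the arithmetic, not anything from the article): if the weights live in HBM, every generation step has to re-read them, so bandwidth sets a hard ceiling on single-stream throughput.

Code:
# Illustrative only: assumed model size and bandwidth, batch size 1.
params = 70e9                  # e.g. a 70B-parameter model
bytes_per_param = 2            # BF16 weights
hbm_bandwidth = 3.35e12        # ~3.35 TB/s, roughly an H100 SXM
print(hbm_bandwidth / (params * bytes_per_param))   # ~24 tokens/s upper bound, just from re-reading weights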
 

bit_user

Titan
Ambassador
SRAM accessible by all cores would have huge latency compared with L1 cache
Who said it's accessible by all cores? The analogy I used was Cell's local scratchpad memory, which was private to each SPE.

Your IBM example is L1 cache, let's not split hairs there.
It's not a cache. The Wikipedia link provides quite a decent explanation of how it works.

The Cell can be a little confusing, since it has two classes of cores. The SPEs are the ones with scratchpad memory, while the PPE is more of a general-purpose core with regular L1 I/D caches and a 512 kB L2 cache. The Wikipedia article claims the DMA engines servicing each of the cores maintain coherence with the PPE's caches.

[Diagram: block diagram of the IBM Cell processor]


In this diagram, the SPEs' scratchpad memory is labelled "LS" (Local Store).
 

JTWrenn

Distinguished
Aug 5, 2008
You're getting mixed up between system specifications and chip specifications. A "DGX H100" system contains 8x H100 chips[1].

Compute-wise, it's worth pointing out that Nvidia touts the DGX H100's 32 petaFLOPS of FP8 performance[2], which is probably why Cerebras is focusing on training performance. On that front (BF16), the DGX H100 has only 16 petaFLOPS.

References:
  1. https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
  2. https://nvidianews.nvidia.com/news/...ds-most-advanced-enterprise-ai-infrastructure
If your chip is this big, it's meant to be a system replacement, not a chip replacement, right? So shouldn't it be compared to systems, not chips? The difference is that this is all integrated, so... I don't see why you wouldn't compare system to system here... and even more, dollar-to-dollar is the really important comparison.
 

bit_user

Titan
Ambassador
If your chip is this big, it's meant to be a system replacement, not a chip replacement, right? So shouldn't it be compared to systems, not chips?
Yes, that's what they did. Their statement indicated 8x the compute of a DGX system. However, you seemed to think that was a comparison against a single H100 chip:

Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute?
 