News Cerebras launches 900,000-core 125 PetaFLOPS wafer-scale processor for AI — theoretically equivalent to about 62 Nvidia H100 GPUs


gg83

Distinguished
Jul 10, 2015
So Cerebras is better per transistor. If one Cerebras chip has 4 trillion transistors and it's "equivalent" to 62 H100s @ 80 billion transistors each, I get 4 trillion vs. 4.96 trillion on the Nvidia side. Maybe my math is nonsense. If Cerebras can pull this off, could Tenstorrent have a harder time getting into the top AI game?
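Just to sanity-check my own napkin math (figures straight from the article, nothing official from either vendor):

Code:
# Napkin math only, using the figures quoted in the article.
h100_transistors = 80e9            # per H100
wafer_transistors = 4e12           # Cerebras WSE, whole wafer
equivalent_h100s = 62

nvidia_side = equivalent_h100s * h100_transistors
print(nvidia_side)                        # 4.96e12, vs. 4e12 on the Cerebras side
print(nvidia_side / wafer_transistors)    # ~1.24x more transistors for the same claimed compute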
 

bit_user

Titan
Ambassador
Chips like this are why Sam Altman wants to scale up global fab capacity by like 10x.

I'm a little curious why none of their slides (at least of those shown here) highlighted energy efficiency. I expect that's also an area where they have a considerable lead over Nvidia. I'd be even more curious to know how perf/$ compares!
 

bit_user

Titan
Ambassador
One chip to rule them all.
Technically one chip, but the tiles are effectively different chips.

54 GB of on-chip memory! It has more cache than I have memory on my computer.
Accessing SRAM on another tile probably takes longer than it takes your CPU to read any part of your DRAM. That's not to say one is better or worse than the other, in the abstract.

Their solution is very well suited to dataflow processing and is nearly ideal, if you can fit all of your weights in the 44 GB* on a single wafer. You also want those weights to roughly align with the computational density of the network; otherwise, there could be some underutilization of either compute or memory.

[Image: Cerebras slide showing the WSE's specifications, including 44 GB of on-chip SRAM]


* Image indicates 44 GB of SRAM.
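For a rough sense of scale, here's the kind of back-of-the-envelope estimate I mean (the BF16 precision and treating GB as 10^9 bytes are my assumptions, not anything Cerebras has stated):

Code:
# Rough estimate of how many weights fit entirely on-wafer, assuming BF16.
sram_bytes = 44e9          # 44 GB of on-die SRAM, treating GB as 10^9 bytes
bytes_per_weight = 2       # BF16/FP16
print(sram_bytes / bytes_per_weight / 1e9)   # ~22 billion parameters, ignoring activations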
 

JTWrenn

Distinguished
Aug 5, 2008
Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute? How are they going to deal with chip manufacturing issues at that scale to ramp production up? It's an interesting idea but....this just seems like a weird approach.

I'll be glad to see some real competition for Nvidia, but this idea seems like it may have a lot of issues. Maybe I'm missing something.
 

bit_user

Titan
Ambassador
Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute? How are they going to deal with chip manufacturing issues at that scale to ramp production up? It's an interesting idea but....this just seems like a weird approach.
You're getting mixed up between system specifications and chip specifications. A "DGX H100" system contains 8x H100 chips[1].

Compute-wise, it's worth pointing out that Nvidia touts the DGX H100's 32 petaFLOPS of FP8 performance[2], which is probably why Cerebras is focusing on training performance. On that front (BF16), the DGX H100 has only 16 petaFLOPS.
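Running the numbers, with the 125 PFLOPS figure taken from the article's headline (treat this as a rough cross-check, not anything official):

Code:
# Rough cross-check of the "8x" claim.
wse3_pflops = 125             # Cerebras' headline figure from the article
dgx_h100_pflops_bf16 = 16     # BF16 for an 8-GPU DGX H100, per the figure above
print(wse3_pflops / dgx_h100_pflops_bf16)   # ~7.8, i.e. roughly "8x"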

References:
  1. https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
  2. https://nvidianews.nvidia.com/news/...ds-most-advanced-enterprise-ai-infrastructure
 

nightbird321

Distinguished
Sep 15, 2012
One chip to rule them all.
54 GB of on-chip memory! It has more cache than I have memory on my computer.
I'm going to go with 44GB since that is what is in the article.

44GB over 900,000 cores is only 50KB per core of cache memory, which is tiny compared with consumer CPUs. As a special-purpose chip with specifically written code, it likely doesn't need a big memory for random access. (Unless I'm misunderstanding what they mean by on-chip memory.)
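Spelling out that arithmetic (just my own division, with GB taken as 10^9 bytes):

Code:
# Per-core share of on-chip SRAM.
sram_bytes = 44e9      # 44 GB, treating GB as 10^9 bytes
cores = 900_000
print(sram_bytes / cores)   # ~48,900 bytes, i.e. roughly 50 KB per core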
 

bit_user

Titan
Ambassador
44GB over 900,000 cores is only 50KB per core
If you compute the amount of silicon area or transistors per core, it's pretty clear that these are probably similar to what Nvidia calls a "core". In other words, something more like a SIMD lane.

of cache memory which is tiny compared with consumer CPUs.
I doubt it works as a cache; it's probably directly addressable. Cache lookups require associative memory, which wastes die space and power for something that's really not necessary when your access patterns are predictable. The normal way this works is that you have a double-buffering type scheme, with a DMA engine that drains/fills the inactive buffer while you're computing on the contents of the active buffer.
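Here's a minimal sketch of what I mean, in Python pseudocode rather than anything resembling Cerebras' (or Cell's) actual programming model. In real hardware the DMA fill of the inactive buffer overlaps with compute on the active one; here it's written sequentially just to show the alternation:

Code:
# Illustrative double-buffering loop. All names and sizes are made up;
# a real DMA engine would fill the inactive buffer *while* compute runs.

CHUNK = 4
weight_stream = list(range(16))          # stand-in for weights sitting in DRAM
buffers = [[0] * CHUNK, [0] * CHUNK]     # two scratchpad buffers in on-die SRAM

def dma_fill(buf, offset):
    """Pretend DMA: copy the next chunk of the weight stream into a scratchpad buffer."""
    buf[:] = weight_stream[offset:offset + CHUNK]

def compute(buf):
    """Pretend kernel: do some work on whatever is in the active buffer."""
    return sum(buf)

active = 0
dma_fill(buffers[active], 0)                      # prime the first buffer
total = 0
for offset in range(CHUNK, len(weight_stream) + CHUNK, CHUNK):
    inactive = 1 - active
    if offset < len(weight_stream):
        dma_fill(buffers[inactive], offset)       # would overlap with compute in hardware
    total += compute(buffers[active])             # work on the already-filled buffer
    active = inactive                             # swap roles and repeat

print(total)   # 120 == sum(range(16)): every chunk was processed exactly once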

The IBM Cell processor, famously used in Sony's PS3, worked this way. Each of its 8 SPEs had 256 kB of SRAM that it used like this. The SPEs had no direct access to system memory; they relied on DMAs to copy everything in/out of their scratchpad memory.

As a specific purpose chip with specifically written code, it likely doesn't need a big memory for random access. (Unless I am misunderstanding what they mean by on-chip memory)
I think a big use case for external DRAM access is streaming in weights, in the event you don't have enough SRAM to keep them all on-die. That's probably what eats most of the massive memory bandwidth on Nvidia's Hopper, for instance.
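To put a number on that (all of these figures are my own assumptions for the sake of the arithmetic, not anything from the article): if the weights live in HBM, every generation step has to re-read them, so bandwidth sets a hard ceiling on single-stream throughput.

Code:
# Illustrative only: assumed model size and bandwidth, batch size 1.
params = 70e9                  # e.g. a 70B-parameter model
bytes_per_param = 2            # BF16 weights
hbm_bandwidth = 3.35e12        # ~3.35 TB/s, roughly an H100 SXM
print(hbm_bandwidth / (params * bytes_per_param))   # ~24 tokens/s upper bound, just from re-reading weights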
 

bit_user

Titan
Ambassador
SRAM accessible by all cores would have huge latency compared with L1 cache
Who said it's accessible by all cores? The analogy I used was Cell's local scratchpad memory, which was private to each SPE.

Your IBM example is L1 cache, let's not split hairs there.
It's not a cache. The Wikipedia link provides quite a decent explanation of how it works.

The Cell can be a little confusing, since it has two classes of cores. The SPEs are the ones with scratchpad memory, while the PPE is more of a general-purpose core with regular L1 I/D caches and a 512 kB L2 cache. The Wikipedia article claims the DMA engines servicing each of the cores maintain coherence with the PPE's caches.

[Diagram: block diagram of the IBM Cell processor]


In this diagram, the SPEs' scratchpad memory is labelled "LS" (Local Store).
 

JTWrenn

Distinguished
Aug 5, 2008
You're getting mixed up between system specifications and chip specifications. A "DGX H100" system contains 8x H100 chips[1].

Compute-wise, it's worth pointing out that Nvidia touts the DGX H100's 32 petaFLOPS of FP8 performance[2], which is probably why Cerebras is focusing on training performance. On that front (BF16), the DGX H100 has only 16 petaFLOPS.

References:
  1. https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
  2. https://nvidianews.nvidia.com/news/...ds-most-advanced-enterprise-ai-infrastructure
If your chip is this big, it's meant to be a system replacement, not a chip replacement, right? So shouldn't it be compared to systems, not chips? The difference is that this is all integrated, so... I don't see why you wouldn't compare system to system here... and even more, dollar-to-dollar is the really important comparison.
 

bit_user

Titan
Ambassador
If your chip is this big, it's meant to be a system replacement, not a chip replacement, right? So shouldn't it be compared to systems, not chips?
Yes, that's what they did. Their statement indicated 8x the compute of a DGX system. However, you seemed to think that was a comparison against a single H100 chip:

Wait....only 8 times faster on the AI FLOPs? 50 times the transistors and only 8 times the compute?
 