News AI GPU clusters with one million GPUs are planned for 2027 — Broadcom says three AI supercomputers are in the works

Can this possibly be true?
If so it is madness at an unprecedented level.
Let's just do some math: if we're talking $30,000 per B200 GPU, a million of them is thirty billion dollars. And that's probably about half the cost of the completed, installed system. Not to mention ongoing facility costs and the electric bill. Depreciate it over, what, ten years, finance it at even 5% opportunity cost, ... add even minimal staffing, and you're talking $5B-$10B per year or more to own and operate such a thing.
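To make that arithmetic concrete, here's a rough back-of-envelope sketch; the $30k price, the 2x installed-cost multiplier, the ten-year depreciation, and the 5% rate are just the assumptions from the paragraph above, not real figures:

```python
# Rough back-of-envelope sketch of the ownership math above. The $30k
# per-GPU price, the "chips are ~half the installed cost" multiplier,
# the ten-year depreciation, and the 5% rate are assumptions, not quotes.

gpu_count = 1_000_000
price_per_gpu = 30_000                    # assumed B200 price, USD

gpu_capex = gpu_count * price_per_gpu     # $30B for the chips alone
total_capex = gpu_capex * 2               # chips assumed to be ~half of installed cost

years = 10                                # assumed depreciation period
opportunity_rate = 0.05                   # assumed 5% cost of capital

annual_depreciation = total_capex / years
annual_financing = total_capex * opportunity_rate

print(f"GPU capex:           ${gpu_capex / 1e9:.0f}B")
print(f"Installed system:    ${total_capex / 1e9:.0f}B")
print(f"Yearly depreciation: ${annual_depreciation / 1e9:.0f}B")
print(f"Yearly financing:    ${annual_financing / 1e9:.0f}B")
# Depreciation plus financing alone is ~$9B/year, before power,
# facilities, and staff -- consistent with the $5B-$10B+ guess.
```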

Now, perhaps by using a much cheaper chip, slower but with not too different price/performance, you could get down to a tenth of that. Still not cheap, but you get the "million GPU" bragging rights.

Still, it's my perception that the technology has already moved on; a cluster of even 1,000 B200s should be enough for any team, though you might have ten or twenty teams going. So if these things are overbuilt because of some technoid FOMO, they won't ever operate 90% of it on any regular basis.
 
How many cores will be in each GPU? 1,000,000 (GPUs) x 20,000 (cores) is a lot of cores. Just to run a crappy computer version of a 4-year-old child with instant access to all the world's Twitter statements.
 
The article said:
The company believes that in 2027, the serviceable addressable market (SAM) for AI XPU and networking will be between $60 and $90 billion, and the firm is positioned to command a leading share of this market.
Meanwhile, AMD and Intel are squabbling over table scraps.

I wonder about software support for Broadcom's AI accelerators. Do popular machine learning frameworks already have backends for them? I think they must, but it hasn't come up on my radar. This seems to be one of the issues AMD has long struggled with, and Intel recently said Gaudi 3 will miss its sales projections because its software support is running behind schedule. That makes me really curious what Broadcom has been doing!

Also, what are the chances they include a small version of one of the AI accelerator blocks in the next Raspberry Pi SoC?
 
Can this possibly be true?
If so it is madness at an unprecedented level.
Let's just do some math: if we're talking $30,000 per B200 GPU, a million of them is thirty billion dollars. And that's probably about half the cost of the completed, installed system. Not to mention ongoing facility costs and the electric bill. Depreciate it over, what, ten years, finance it at even 5% opportunity cost, ... add even minimal staffing, and you're talking $5B-$10B per year or more to own and operate such a thing.
I think your numbers are a bit high, but you're in the right order of magnitude. I doubt customers buying 1M units will pay that $30k list price, even in spite of the extremely high demand. I'd peg the final build cost of the facility at not much more than $30B, although that's a pretty insane amount of money. It makes me wonder what other sorts of construction projects have similar price tags!

Now, perhaps by using a much cheaper chip, slower but with not too different price/performance, you could get down to a tenth of that. Still not cheap, but you get the "million GPU" bragging rights.
I doubt Nvidia's price/perf is that far off what's possible. I could believe better price/perf by maybe a factor of 2, but not 10. Not if we're talking about training, that is.

Still, it's my perception that the technology has already moved on; a cluster of even 1,000 B200s should be enough for any team,
How do you figure that?

though you might have ten or twenty teams going. So if these things are overbuilt because of some technoid FOMO, they won't ever operate 90% of it on any regular basis.
A lot of the article focuses on hyperscalers, which means they definitely will have many customers using subsets of those pools. It's definitely not going to be 1M GPUs all training a single model, or anything silly like that.
 
How many cores will be in each GPU? 1,000,000 (GPUs) x 20,000 (cores) is a lot of cores.
They're not cores in the CPU sense of the word. They're just ALU pipelines, bundled into SIMD-32 blocks (SIMD-32 is 1024 bits wide, if you'd prefer to look at it that way). The H100 has 528 (enabled) SIMD-32 engines, arranged into 132 SMs (Streaming Multiprocessors). So, depending on how you look at it, it's either 528 or 132 cores per GPU. The 528 figure treats each SIMD-32 engine as roughly on par with an Intel server core, since such a core has 2x AVX-512 FMA ports, which is also 1024 bits of FMA width.
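If anyone wants to check the arithmetic, here's a quick sketch; the per-SM breakdown (4 SIMD-32 sub-partitions of 32 FP32 lanes each) is my reading of Nvidia's published H100 specs, so treat it as an assumption:

```python
# Quick sanity check of the H100 numbers above. The per-SM breakdown
# is an assumption based on Nvidia's published specs for the SXM part.

sms = 132                    # enabled Streaming Multiprocessors
simd_engines_per_sm = 4      # each SM is split into 4 SIMD-32 sub-partitions
lanes_per_engine = 32        # 32 FP32 lanes per sub-partition

simd_engines = sms * simd_engines_per_sm        # 528 "SIMD-32 engines"
fp32_lanes = simd_engines * lanes_per_engine    # 16,896 -- Nvidia's "CUDA core" count
simd_width_bits = lanes_per_engine * 32         # 1024-bit SIMD, as described above

print(simd_engines, fp32_lanes, simd_width_bits)  # 528 16896 1024
```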
 
I doubt Nvidia's price/perf is that far off what's possible. I could believe better price/perf by maybe a factor of 2, but not 10. Not if we're talking about training, that is.
Same price/perf but someone might use 10 slower chips at 1/10 the price, just for variety.
How do you figure that?
Well, even Nvidia bills the new chips as 20x faster (by using FP4, though that only applies to maybe half the training). Say they're right. Then 1,000 B200s is like 20,000 H100s or whatever.

But more than that ... see next item.
A lot of the article focuses on hyperscalers, which means they definitely will have many customers using subsets of those pools. It's definitely not going to be 1M GPUs all training a single model, or anything silly like that.
The hunger for huge numbers of GPUs came from Altman's "scale is everything!" mantra five years ago, but even the training for GPT-4o was done in four pieces, using rather less than 100k GPUs of slower vintage.

Now, they are moving more work out of training and into inference time, but that's probably the right move, too. But it means all work is done in much smaller chunks, giving huge economies of scale.

Plus the search is on for more continuous, human-like learning. Nobody has to wipe your brain in order to accommodate reading one more book. And, some other stuff, too.

No doubt someone still wants to try their hand at mega-machine monolithic models, but the "scale, scale, scale" idea never made actual sense when computational cost rises exponentially with scale, scale, scale. That's not how the history of computation works: algorithms generally improve as fast as or faster than hardware, so things get exponentially easier.
 
Same price/perf but someone might use 10 slower chips at 1/10 the price, just for variety.
Training is difficult to scale like that. There's a lot of communication, which is why NVLink is such a beast. Communication doesn't scale linearly, so by having lots more nodes, your communication overhead is going to increase by an even greater amount, possibly even to the point where you spend more energy on communication than computation. That's why I think a chip that's half or a third as fast might be viable, but 1/10th probably isn't.
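To show the shape of that argument, here's a toy step-time model; every number in it (FLOPs per step, gradient size, link bandwidth, latency) is invented purely for illustration, not a measurement of any real chip:

```python
# Toy model of the compute-vs-communication split described above.
# All parameters are invented illustrations, not real measurements.

def step_time(n_chips, chip_tflops, link_gb_s=450.0, latency_s=10e-6,
              work_tflop=100_000.0, grad_gb=10.0):
    """Return (compute_s, comm_s) for one data-parallel training step
    that ends with a ring all-reduce over the gradients."""
    compute = work_tflop / (n_chips * chip_tflops)            # splits across chips
    allreduce = (2 * (n_chips - 1) / n_chips) * grad_gb / link_gb_s \
                + 2 * (n_chips - 1) * latency_s               # does not split
    return compute, allreduce

# Same aggregate TFLOPS: 1,000 fast chips vs. 10,000 chips at 1/10 the speed.
for n, tflops in [(1_000, 1_000.0), (10_000, 100.0)]:
    compute, comm = step_time(n, tflops)
    total = compute + comm
    print(f"{n:>6} chips: step {total * 1e3:4.0f} ms, comm share {comm / total:.0%}")
# In this toy model the 10x-bigger cluster spends ~70% of each step on
# communication and ends up roughly twice as slow per step, despite
# having the same total raw compute.
```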

Well even NVidia bills the new chips at 20x faster (by using FP4, though that only applies to maybe half the training).
No, they said it's only 4x as fast at training as Hopper.

Now, they are moving more work out of training and into inference time, but that's probably the right move, too. But it means all work is done in much smaller chunks, giving huge economies of scale.
Link?

Plus the search is on for more continuous, human-like learning. Nobody has to wipe your brain in order to accommodate reading one more book. And, some other stuff, too.
There's a long-practiced concept called "transfer learning", which is exactly what you're talking about. It's been standard practice for at least 6 years.
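In case the term is unfamiliar, here's a minimal sketch of the idea (PyTorch, with torchvision's resnet18 standing in as an arbitrary pretrained backbone): keep the pretrained weights, freeze them, and train only a small new head for the new task.

```python
# Minimal transfer-learning sketch. torchvision's resnet18 is used
# purely as an example of a pretrained backbone.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so its learned features are preserved
# ("nobody wipes the brain").
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with a fresh head for the new task (say, 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```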
 
Ever notice how all the climate doomsday types are okay with AI centers sucking up electricity as if it were free? Not a peep from the compliant media. I personally don't care as long as I am not required to help pay for more powerplants for them. I merely point out the hypocrisy.
 
Ever notice how all the climate doomsday types are okay with AI centers sucking up electricity as if it were free? Not a peep from the compliant media.
No, I did not. I've heard lots of people complaining about how much power AI is using. It's literally one of the top complaints I hear about it, even in non-techie contexts and media.

Even on this site, it's been covered quite heavily, from multiple different angles. Here's just a small sampling of recent articles discussing the subject:

I personally don't care as long as I am not required to help pay for more powerplants for them.
It always has to get paid for, somehow. As powerless consumers, we're sure to foot some of that bill whether we like it or not. I think pretty much the only thing you can do is just try to boycott AI-based services and features as much as you can, in order to make it less profitable for the companies using it.