News AMD talks 1.2 million GPU AI supercomputer to compete with Nvidia — 30X more GPUs than world's fastest supercomputer

A Stoner

Distinguished
Jan 19, 2009
378
145
18,960
And even with that, they would not even have the processing power of an insect at their disposal. We still have no factual intelligence from all the AI spending that has happened. No AI knows anything at all as of yet.
 

JRStern

Distinguished
Mar 20, 2017
178
67
18,660
Well, Musk was just out there raising money for 300,000 GPUs; we're talking billions or trillions before they're all installed and usable, not to mention gigawatts of power to run. OTOH this is crazy stuff, IMHO, and perhaps Elon isn't hip to the news that much smaller LLMs are now being seen as workable, so maybe nobody will need a single training system with more than 300, or perhaps 3,000, GPUs to do a retrain within 24 hours. And maybe whole-hog retrains won't be as necessary anymore, either.

So what this comes down to is that AMD is just trolling, and is unlikely to actually build it out.
 

Pierce2623

Commendable
Dec 3, 2023
503
386
1,260
The record Dynex set recently was only a quantum record, and the record they beat wasn't even real quantum computing. The record they beat only involved 896 GPUs.
 
Last edited by a moderator:
  • Like
Reactions: Sluggotg

DS426

Prominent
May 15, 2024
278
207
560
Usually business is all about ROI and profit, but... really, c'mon, someone show me the math on how investments like this pay off without losing money?? We're also talking about cooling, electric bills, sysadmins, and so on, so... wtf is so magical about a (relatively?) well-trained and advanced LLM or such that justifies this cost?

Seriously, not being a hater just to hate, but again, being on the business side of things in IT, I need to see some math.

On another note, at least some folks are seeing the value in not paying ridiculous cash just to have "the best" (Nvidia), whereas AMD can honestly and probably provide a better return on investment. Kind of that age-old name-brand vs. generic argument.

Still mindblown over here. How many supercomputers have more than 1.2 million CPUs? I know this doesn't account for core counts, but holy smokes, we're clearly not talking apples to apples here!! Pretty sure a mini power plant literally needs to sit beside a datacenter/supercomputing facility like this.
 

oofdragon

Distinguished
Oct 14, 2017
327
292
19,060
I honestly don't get it. OK, so someone like Elon is considering 300 thousand GPUs like Blackwell, spending on the order of billions just to buy them, and then you have the electric bill and maintenance every month as well. In what way can he possibly make a profit out of this situation?
 

ThomasKinsley

Notable
Oct 4, 2023
385
384
1,060
Not to get all cynical, but this sounds like a bit of a stretch to me. The reporter gave the random number 1.2 million and the AMD staff member responded with, "It’s in that range? Yes." A range needs more than one number. Are we talking 700,000? 1 million? 1.4 million? There's no way to know.
 

kjfatl

Reputable
Apr 15, 2020
216
157
4,760
If Musk is serious about the 300,000 GPUs, it makes perfect sense that the design would support an upgrade path where compute modules could be replaced with future modules with 2X or 4X the capacity.
The most obvious use for such a machine is constant updates to self-driving vehicle software. Daily or even minute-by-minute updates are needed for this to be seamless. This is little different from what Google or Garmin does with maps. When 'interesting' data is seen by vehicles, it would be sent to the compute farm for processing. Real-time data from a landslide just before the driver ran off the side of the road would qualify as 'interesting'. Preventing the crash in the next landslide would be the goal.

This sort of system is large enough to justify custom compute silicon supporting a limited set of models. That alone might cut the hardware requirements by a factor of 4. Moving to Intel 14A or the equivalent from TSMC or Samsung might give another factor of 8 in density. Advanced packaging techniques might double it again. Combining all of these could provide a machine with the same footprint and power envelope as today's 30,000-GPU supercomputers.
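Multiplying those factors together shows the scale of the claim; here's a quick sketch (all three factors are the guesses above, not measured numbers):

```python
# Back-of-the-envelope combination of the assumed density gains.
custom_silicon = 4   # limited-model custom compute silicon (assumption)
process_node   = 8   # Intel 14A / equivalent TSMC or Samsung node (assumption)
packaging      = 2   # advanced packaging techniques (assumption)

combined = custom_silicon * process_node * packaging
print(f"Combined density factor: {combined}x")                       # 64x
print(f"1,200,000 GPUs / {combined} = {1_200_000 // combined} GPU-equivalents")
```

At 64x, 1.2 million of today's GPUs would shrink to under 20,000 GPU-equivalents, roughly the footprint of a current top supercomputer.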
 

scottscholzpdx

Honorable
Sep 14, 2017
19
14
10,515
Well, Musk was just out there raising money for 300,000 GPUs; we're talking billions or trillions before they're all installed and usable, not to mention gigawatts of power to run. OTOH this is crazy stuff, IMHO, and perhaps Elon isn't hip to the news that much smaller LLMs are now being seen as workable, so maybe nobody will need a single training system with more than 300, or perhaps 3,000, GPUs to do a retrain within 24 hours. And maybe whole-hog retrains won't be as necessary anymore, either.

So what this comes down to is that AMD is just trolling, and is unlikely to actually build it out.
Save us some time and just say "I own Tesla stock".
 
  • Like
Reactions: Thunder64

sygreenblum

Distinguished
Feb 25, 2016
32
24
18,535
How much power would a million GPUs consume? It seems off the charts if all of them are fully used!!!
Well, one Blackwell GPU can consume 1 kW of electricity, so 1.2 million of them is 1.2 GW. This is more than one of Diablo Canyon's full-size nuclear reactors, or roughly 5 percent of the entire state of California's power grid.
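The arithmetic, for anyone who wants to poke at it (the 1 kW per-GPU figure is a round-number assumption; actual Blackwell board power varies by SKU):

```python
# Back-of-the-envelope fleet power draw. The per-GPU wattage is an
# assumed round number, not a measured figure.
gpus = 1_200_000
watts_per_gpu = 1_000                            # ~1 kW per Blackwell-class GPU (assumption)
total_gw = gpus * watts_per_gpu / 1e9
print(f"{total_gw:.1f} GW for the GPUs alone")   # 1.2 GW, before cooling and other overhead
```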
 
  • Like
Reactions: Amdlova
Well, one Blackwell GPU can consume 1 kW of electricity, so 1.2 million of them is 1.2 GW. This is more than one of Diablo Canyon's full-size nuclear reactors, or roughly 5 percent of the entire state of California's power grid.
You're only quoting the GPUs... add the cooling, manpower, and other infrastructure to get this working :)

And the AMD Frontier hardware-failure plague, did they ever resolve that???

:)
 

Silas Sanchez

Proper
Feb 2, 2024
109
65
160
And even with that, they would not even have the processing power of an insect at their disposal. We still have no factual intelligence from all the AI spending that has happened. No AI knows anything at all as of yet.
This is nonsense. You can't compare a biological mind to a computer. In terms of operations per second, the human brain is extraordinarily slow compared to even an old, slow computer. The number of objects, numbers, etc. a human mind can reliably hold at a given instant is again extraordinarily tiny, and the speed and reliability with which the mind processes things like inferences from senses, memories, and knowledge is extraordinarily poor. This is compounded by the fact that our mind is a slave to physics and arguably has no free will, e.g. how easily our mind gets distracted or sidetracked, and how thoughts uncontrollably pop up out of nowhere, interfering with our internal reasoning. But of course computers are extraordinarily bad at making what seem to us like easy decisions and conclusions on many matters. They can't understand jokes.
As for even strong or weak AI, AI can't ever just start knowing things; it doesn't magically do that, despite all the mainstream popular misunderstanding. That is actually the big mystery of artificial intelligence: how can it know anything when every single thing it does has to literally be the result of a designer? It's not designed to know anything, not a single thing as we would recognize it; it can be extraordinarily intelligent and pass the Turing test with flying colors, and yet it doesn't need to know anything. There is no known bridge as of yet between high intelligence and actual knowing. "AI" currently is dumb as rocks; it's a very affordable, easy-to-implement, cheesy tech, and it bears little resemblance to what researchers envisioned weak AI being.
The problem is made worse by the fact that we charge an AI with having the same knowing that we do, but we don't know where and how that knowing would arise, literally no idea under the sun. Since computers have a very different structure, it's ultimately possible that some kind of knowing already arises from a computer, but it would be impossible to be aware of it or have any idea what it is like.
 
And the AMD Frontier hardware-failure plague, did they ever resolve that???
At supercomputer scale, these things fail a lot more often than you think. Frontier hit its failure-rate goal and is working. Aurora is still going through its teething stage. It doesn't matter who makes the CPUs and GPUs; you are going to have failures every day in a supercomputer.
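A rough sketch of why daily failures are the expected baseline at this scale (the per-GPU MTBF is an assumed figure for illustration; real numbers vary widely by part and workload):

```python
# Expected failures per day for a large GPU fleet, assuming independent
# failures and a constant per-GPU mean time between failures (MTBF).
gpus = 1_200_000
mtbf_hours = 500_000                                # assumed per-GPU MTBF (illustrative)
failures_per_day = gpus * 24 / mtbf_hours
print(f"~{failures_per_day:.0f} failures per day")  # ~58 per day at these assumptions
```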

 
  • Like
Reactions: Amdlova

jp7189

Distinguished
Feb 21, 2012
532
303
19,260
This is nonsense. You can't compare a biological mind to a computer. In terms of operations per second, the human brain is extraordinarily slow compared to even an old, slow computer. The number of objects, numbers, etc. a human mind can reliably hold at a given instant is again extraordinarily tiny, and the speed and reliability with which the mind processes things like inferences from senses, memories, and knowledge is extraordinarily poor. This is compounded by the fact that our mind is a slave to physics and arguably has no free will, e.g. how easily our mind gets distracted or sidetracked, and how thoughts uncontrollably pop up out of nowhere, interfering with our internal reasoning. But of course computers are extraordinarily bad at making what seem to us like easy decisions and conclusions on many matters. They can't understand jokes.
As for even strong or weak AI, AI can't ever just start knowing things; it doesn't magically do that, despite all the mainstream popular misunderstanding. That is actually the big mystery of artificial intelligence: how can it know anything when every single thing it does has to literally be the result of a designer? It's not designed to know anything, not a single thing as we would recognize it; it can be extraordinarily intelligent and pass the Turing test with flying colors, and yet it doesn't need to know anything. There is no known bridge as of yet between high intelligence and actual knowing. "AI" currently is dumb as rocks; it's a very affordable, easy-to-implement, cheesy tech, and it bears little resemblance to what researchers envisioned weak AI being.
The problem is made worse by the fact that we charge an AI with having the same knowing that we do, but we don't know where and how that knowing would arise, literally no idea under the sun. Since computers have a very different structure, it's ultimately possible that some kind of knowing already arises from a computer, but it would be impossible to be aware of it or have any idea what it is like.
Nature is full of mysteries; computers, not so much. There is nothing unknowable about "AI". It's still just 1s and 0s put together in massive quantities, and they produce specific results.

For those who haven't seen GPT in Excel, I highly recommend you play with it. It steps through the LLM calculations in an Excel spreadsheet so you can see exactly how an LLM arrives at its responses. There's no magic, just some fairly simple math.
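For a taste of that "fairly simple math", here's a minimal sketch of the attention step an LLM repeats at every layer (toy sizes and random stand-in weights, not a real model):

```python
import numpy as np

# Toy scaled dot-product attention: the core operation an LLM applies
# over and over. The matrices are random stand-ins, not trained weights.
rng = np.random.default_rng(0)
seq_len, d = 4, 8                      # 4 tokens, 8-dimensional embeddings
Q = rng.standard_normal((seq_len, d))  # queries
K = rng.standard_normal((seq_len, d))  # keys
V = rng.standard_normal((seq_len, d))  # values

scores = Q @ K.T / np.sqrt(d)          # how strongly each token attends to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V                   # weighted mix of value vectors
print(output.shape)                    # (4, 8): one updated vector per token
```

Multiplications, additions, and a softmax; stack enough of them and you get a chatbot.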
 

Pierce2623

Commendable
Dec 3, 2023
503
386
1,260
If Musk is serious about the 300,000 GPUs, it makes perfect sense that the design would support an upgrade path where compute modules could be replaced with future modules with 2X or 4X the capacity.
The most obvious use for such a machine is constant updates to self-driving vehicle software. Daily or even minute-by-minute updates are needed for this to be seamless. This is little different from what Google or Garmin does with maps. When 'interesting' data is seen by vehicles, it would be sent to the compute farm for processing. Real-time data from a landslide just before the driver ran off the side of the road would qualify as 'interesting'. Preventing the crash in the next landslide would be the goal.

This sort of system is large enough to justify custom compute silicon supporting a limited set of models. That alone might cut the hardware requirements by a factor of 4. Moving to Intel 14A or the equivalent from TSMC or Samsung might give another factor of 8 in density. Advanced packaging techniques might double it again. Combining all of these could provide a machine with the same footprint and power envelope as today's 30,000-GPU supercomputers.
The funny part is Musk is actually taking compute away from the self-driving Tesla program to throw it at xAI, his "AI company" that literally just develops an LLM that sucks.
 

NinoPino

Respectable
May 26, 2022
496
310
2,060
Well, one Blackwell GPU can consume 1 kW of electricity, so 1.2 million of them is 1.2 GW. This is more than one of Diablo Canyon's full-size nuclear reactors, or roughly 5 percent of the entire state of California's power grid.
With this number of GPUs, I suppose we are talking about custom components.
So maybe the per-GPU performance is not so high and, more importantly, the power usage is lowered a lot.
 
It is a form of distributed computing. This type of thing has been done for quite a long time already, so this is nothing new; now it just has a better marketing term. It is also being used for a lot of things, like Bitcoin mining, which is probably going to end up as its main use.
 
Last edited by a moderator:

Pierce2623

Commendable
Dec 3, 2023
503
386
1,260
Usually business is all about ROI and profit, but... really, c'mon, someone show me the math on how investments like this pay off without losing money?? We're also talking about cooling, electric bills, sysadmins, and so on, so... wtf is so magical about a (relatively?) well-trained and advanced LLM or such that justifies this cost?

Seriously, not being a hater just to hate, but again, being on the business side of things in IT, I need to see some math.

On another note, at least some folks are seeing the value in not paying ridiculous cash just to have "the best" (Nvidia), whereas AMD can honestly and probably provide a better return on investment. Kind of that age-old name-brand vs. generic argument.

Still mindblown over here. How many supercomputers have more than 1.2 million CPUs? I know this doesn't account for core counts, but holy smokes, we're clearly not talking apples to apples here!! Pretty sure a mini power plant literally needs to sit beside a datacenter/supercomputing facility like this.
I would think the business case here would be a specific "AI training" setup that gets consistent business from OpenAI, Meta, Amazon, Google, etc.