News Frontier trained a ChatGPT-sized large language model with only 3,000 of its 37,888 Radeon GPUs — the world's fastest supercomputer blasts through...

Status
Not open for further replies.
Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids

And Intel is in second place.

More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
How many GPUs seems less relevant than how much time is needed.

And according to Wikipedia:

[Aurora] has around 10 petabytes of memory and 230 petabytes of storage. The machine is estimated to consume around 60 MW of power. For comparison, the fastest computer in the world today, Frontier uses 21 MW while Summit uses 13 MW.
 
Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids

And Intel is in second place.

More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
From what I took from the article, they only used as many GPUs as the memory required (more GPUs weren't beneficial, but they still needed the memory).
 
How long did it take to train?
Exactly. The most crucial piece of information was omitted. I glanced at the paper, but a quick search didn't find any unit of time. Maybe someone wants to have a closer read?

AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
How can you say they're trailing, when you don't know how much time (or power) either used? If Intel used 1/10th the GPUs, but took 20x as long and used more power in the process, can we really count that as a win?
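To make that hypothetical concrete, here is a rough sketch of the GPU-hours arithmetic. The GPU count and run time are placeholders chosen for illustration; only the 1/10th and 20x ratios come from the post, and neither article reported actual training times.

```python
# Hypothetical numbers: only the 1/10th and 20x ratios come from the post above.
def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Total GPU-hours consumed by a training run."""
    return num_gpus * wall_clock_hours

baseline_gpus = 3000   # placeholder GPU count for the larger system
baseline_hours = 1.0   # placeholder run time, arbitrary unit

big_run = gpu_hours(baseline_gpus, baseline_hours)
small_run = gpu_hours(baseline_gpus // 10, baseline_hours * 20)

# A ratio above 1 means the smaller system burned more GPU-hours overall.
cost_ratio = small_run / big_run
print(f"smaller system uses {cost_ratio:.1f}x the GPU-hours")
```

Under those assumed ratios the system with fewer GPUs ends up consuming more GPU-hours in total, which is why a GPU count alone says little.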
 
more GPUs weren't beneficial, but they still needed the memory
Given that GPT-3 was rumored to take 10k A100 GPUs an entire month to train, I don't see how you can claim more GPUs aren't beneficial.

BTW, I'm sure GPT-3 was trained on a far larger dataset, which would explain the apparent contradiction in compute power used for that vs. these examples.
 
Exactly. The most crucial piece of information was omitted. I glanced at the paper, but a quick search didn't find any unit of time. Maybe someone wants to have a closer read?


How can you say they're trailing, when you don't know how much time (or power) either used? If Intel used 1/10th the GPUs, but took 20x as long and used more power in the process, can we really count that as a win?
That is all true, which is why I said more specifics are probably needed for an accurate comparison.
But this article, since it lacks any information other than the number of GPUs used, makes it look like AMD needs 8x as many and is trailing badly, when we do not know that to be the case.

We don't know how long it took, whether the models were equally complex, or even whether the "training" was done to the same standards.

But it is still entertaining to put that in perspective against the headline's loaded language: "Frontier trained a ChatGPT-sized large language model with only 3,000 of its 37,888 Radeon GPUs — the world's fastest supercomputer blasts through one trillion parameter model with only 8 percent of its MI250X GPUs"

When Intel used 0.64% of Aurora's GPUs to do what sounds like the same thing a few months back.

Since there is not enough information to even know which GPU is faster at this point, I'll just gloat at throwing egg at the baiting headline.
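For anyone who wants to check the percentages being thrown around, a quick sketch follows. It uses only the GPU counts quoted in this thread; Aurora's total GPU count is not given here, so the last line only back-computes what total the 0.64% figure would imply, which is an inference and not a sourced number.

```python
# GPU counts as quoted in the thread and the headline.
frontier_used = 3072
frontier_total = 37888
aurora_used = 384  # 64 nodes * 6 GPUs apiece

# Share of Frontier's GPUs used for the run (the headline's "8 percent").
frontier_share = frontier_used / frontier_total

# How many times more GPUs Frontier used than Aurora.
gpu_count_ratio = frontier_used / aurora_used

# Total Aurora GPU count implied by the 0.64% claim (assumption: the
# percentage was computed as aurora_used / total).
implied_aurora_total = aurora_used / 0.0064

print(f"Frontier share: {frontier_share:.1%}")
print(f"GPU count ratio: {gpu_count_ratio:.0f}x")
print(f"Implied Aurora total: {implied_aurora_total:,.0f} GPUs")
```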
 
Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids

And Intel is in second place.

More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
Each MI250X has 383 TOPS of Int8 capability, while each Data Center GPU 1550 series has 1678 TOPS of Int8, meaning more than 4x as much per GPU. Power is 500 W for the MI250X and 600 W for the GPU 1550. The GPU 1550 also has 408 MB of "Rambo Cache", while the MI250X has only 16 MB of L2, which will have real-world benefits.

By that metric alone, each Aurora GPU counts for about four MI250Xs, so Aurora's 384 GPUs are comparable to roughly 1536 Frontier GPUs, against the 3072 Frontier actually used.

Intel's GPU also has twice as many transistors and is much harder to fabricate. One would think it would be faster in something.
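Here is a back-of-the-envelope sketch of that comparison, using only the per-GPU figures quoted in this post (they have not been independently verified here). It uses the unrounded Int8 ratio instead of "4x", and it counts GPU board power only, so it says nothing about training time or whole-system power draw.

```python
# Per-GPU figures as quoted in the post above; not independently verified.
MI250X_INT8_TOPS = 383
GPU1550_INT8_TOPS = 1678
MI250X_WATTS = 500
GPU1550_WATTS = 600

FRONTIER_GPUS = 3072
AURORA_GPUS = 384  # 64 nodes * 6 GPUs apiece

# How many MI250Xs one GPU 1550 is worth on peak Int8 throughput alone.
per_gpu_ratio = GPU1550_INT8_TOPS / MI250X_INT8_TOPS
aurora_in_mi250x_units = AURORA_GPUS * per_gpu_ratio

# Aggregate peak Int8 throughput of the GPUs used in each run.
frontier_total_tops = FRONTIER_GPUS * MI250X_INT8_TOPS
aurora_total_tops = AURORA_GPUS * GPU1550_INT8_TOPS

# Aggregate GPU board power in kW (GPUs only, no CPUs, network, or cooling).
frontier_kw = FRONTIER_GPUS * MI250X_WATTS / 1000
aurora_kw = AURORA_GPUS * GPU1550_WATTS / 1000

print(f"One GPU 1550 ~ {per_gpu_ratio:.2f} MI250X (Int8)")
print(f"384 Aurora GPUs ~ {aurora_in_mi250x_units:.0f} MI250X")
print(f"Peak Int8: Frontier {frontier_total_tops} TOPS, Aurora {aurora_total_tops} TOPS")
print(f"GPU power: Frontier {frontier_kw} kW, Aurora {aurora_kw} kW")
```

Peak Int8 throughput is a crude proxy for LLM training speed, so treat the result as a sanity check on the GPU counts and not as a performance ranking.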
 
Did you not read the article?

"a new problem: parallelism. Throwing more GPUs at an LLM requires increasingly better communication to actually use more resources effectively. Otherwise, most or all of that extra GPU horsepower would be wasted."

"Strong scaling refers to increasing processor count without changing the size of the workload, and this tends to be where higher core counts become less useful"
With a more realistic training dataset, the workload would naturally increase by a lot. Their limited training data is probably why it didn't take them inordinately long to run this experiment with only 3k GPUs.
 
Frontier needs 3072 GPUs for a trillion parameter model, Aurora needs 384 GPUs (64 nodes*6GPUs apiece) for a trillion parameter model: https://www.tomshardware.com/news/intel-supercomputing-2023-aurora-xeon-max-gpu-gaudi-granite-rapids

And Intel is in second place.

More specifics are probably needed for an accurate comparison, but AMD looks to be trailing very badly. I wonder how much more power Frontier had to consume to perform the same task?
I own an RTX 4090 and this is just astroturf. There are decent competing products from both Nvidia and AMD. Being so supportive of only one side is cringe. These are multibillion dollar companies that you are simping for.
 
I own an RTX 4090 and this is just astroturf. There are decent competing products from both Nvidia and AMD. Being so supportive of only one side is cringe. These are multibillion dollar companies that you are simping for.
Nvidia is definitely in first place in AI.

The headline of this article being so supportive of one side while completely ignoring the other two better performing companies is cringe and is what I was reacting to.
It practically reads "Amazing AMD blasts through AI model like no company ever has before", even though there are two companies that get little fanfare in the article for doing more, earlier.
And then the article leaves out specifics needed for an even comparison.

Nothing against you and your 4090. I would have bought one if I had a 20 series to upgrade from. I'm just tired of seeing biased simping for AMD.

AMD has some good stuff. I like how they get good results from speeding up data access with their large caches. Their desktop CPU chiplets are very efficient with continuous heavy loads. Their older GPUs have better compatibility with newer games than pre-Maxwell GPUs, even if they predate Kepler. Their newer GPUs are easier on CPUs. They get a ton of cores in server chips. But AMD isn't perfect. It is a multibillion dollar company and shouldn't get hype and credit where it isn't due.

I was just going by the numbers that were available for comparison. The task performed was training of a trillion parameter large language AI model like ChatGPT3. AMD needed 3072 currently used GPUs and Intel needed 384. That is all the information we were given. On its face it makes AMD look much worse than the sensationalist headline suggests.

Maybe if we knew the time each setup took to complete the task and the relative complexity of each setup's task, a better comparison could be made. AMD's GPU performance would very likely look better with more information. But it is silly to hype AMD's AI prowess from an achievement that is underwhelming and late compared even to dGPU newcomer Intel, much less Nvidia.
 
Nvidia is definitely in first place in AI.
I'm not sure you realize that AMD and Nvidia both have the same majority stake investors: BlackRock and Vanguard. There is no real competition. Both companies are just focused on different segments with some overlap, and both are made at the same fabs. I just buy what gives the best bang for the buck. I don't care about a side. Both are decent.
 