News Elon Musk and Larry Ellison begged Nvidia CEO Jensen Huang for AI GPUs at dinner

Status
Not open for further replies.

JRStern

Distinguished
Mar 20, 2017
The customer is always right, ROFLMAO.
Ellison says that frontier AI models coming in the next three years will cost $100 billion to train, echoing Anthropic CEO Dario Amodei’s thoughts on the matter.
Yeah, and Sam Altman, who wants to raise $1T so he can be a player (that means he takes 10% off the top).
But they're wrong.
OpenAI's talk about their new o1 model, the one that "thinks before it answers", is that it does more work at inference time rather than putting all the load on pre-training. D'y'all see what that means?
First, it means needing more GPUs at inference time, OK, but maybe not that many. Actually, a GPU may not even be the right architecture to optimize this kind of inference, but that's complicated.
But it means a LOT fewer GPUs for training.
It even suggests that trillion-parameter models will be decomposed, saving exponential amounts of training work.

IOW I think it's the right path.
And also that it will cut even the big boys' demand for mass quantities of GPUs by a lot, by far more than will be incrementally needed for inference.
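A back-of-envelope sketch of what I mean (every number here is hypothetical, and the ~6*N*D training / ~2*N-per-token inference FLOP figures are just the usual rules of thumb):

```python
# Back-of-envelope only: hypothetical parameter counts and token budgets,
# using the common approximations of ~6*N*D FLOPs to pre-train a dense
# N-parameter model on D tokens, and ~2*N FLOPs per generated token.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops(n_params, tokens_generated):
    return 2 * n_params * tokens_generated

# A giant model that answers directly vs. a 10x smaller one that "thinks"
# 10x longer before answering (both trained on the same token count).
big_params, small_params, train_tokens = 1e12, 1e11, 15e12

print(f"pre-train big:   {train_flops(big_params, train_tokens):.1e} FLOPs")
print(f"pre-train small: {train_flops(small_params, train_tokens):.1e} FLOPs")
print(f"one answer, big (1k tokens):    {infer_flops(big_params, 1_000):.1e} FLOPs")
print(f"one answer, small (10k tokens): {infer_flops(small_params, 10_000):.1e} FLOPs")
```

Same cost per answer, an order of magnitude less pre-training compute. Whether GPUs are even the right tool for the inference half is a separate question.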
Have a nice day.
 

bit_user

Titan
Ambassador
Jensen be eating good! More power to him, ...
He should enjoy it, while it lasts. Such a power imbalance never lasts forever.

If two rich guys are begging another guy to take their $, they should be paying more in taxes.
This is funny, but it's actually their businesses that want the GPUs from his business.

Now, say what you will about real corporate tax rates...
(if it weren't off-topic, that is. So, actually please don't.)

Why does this sound like a scene from the movie The Godfather? :ROFLMAO::ROFLMAO:
It definitely leaves a bad taste when backroom deals can gain someone such an important advantage in what's supposed to be an open market. I'll bet Jensen would rather Larry not have posted about it, but at least it offered a glimpse of transparency.
 

jp7189

Distinguished
Feb 21, 2012
The customer is always right, ROFLMAO.

Yeah, and Sam Altman, who wants to raise $1T so he can be a player (that means he takes 10% off the top).
But they're wrong.
OpenAI's talk about their new o1 model, the one that "thinks before it answers", is that it does more work at inference time rather than putting all the load on pre-training. D'y'all see what that means?
First, it means needing more GPUs at inference time, OK, but maybe not that many. Actually, a GPU may not even be the right architecture to optimize this kind of inference, but that's complicated.
But it means a LOT fewer GPUs for training.
It even suggests that trillion-parameter models will be decomposed, saving exponential amounts of training work.

IOW I think it's the right path.
And also that it will cut even the big boys' demand for mass quantities of GPUs by a lot, by far more than will be incrementally needed for inference.
Have a nice day.
I'd actually say the future is inferencing moving outward toward low-power end-user devices: a few large training datacenters and distributed inferencing on very efficient models.
 

bit_user

Titan
Ambassador
I'd actually say the future is inferencing moving outward toward low-power end-user devices: a few large training datacenters and distributed inferencing on very efficient models.
I think it depends a lot on how big the models are. You're not going to be using something like GPT 4 on a cell phone any time soon, just due to its sheer size. Not only does it chew up lots of storage, but also download bandwidth. Then, there's the issue of battery power, if you're inferencing huge models very much, like with some kind of Alexa/Siri assistant.

There would also be IP concerns about letting models run on edge devices, for those which aren't already open source. All someone needs to do is find one device with a known exploit to bypass memory encryption and now your model leaks out into the world.
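Some very rough numbers on the size/bandwidth/battery point (the parameter count, quantization, bandwidth, and power figures are all guesses, not measurements):

```python
# All figures are assumptions for illustration, not measured values.
params = 1.8e12            # rumored GPT-4-class parameter count (unconfirmed)
bytes_per_param = 0.5      # aggressive 4-bit quantization
size_bytes = params * bytes_per_param
print(f"weights: ~{size_bytes / 1e9:.0f} GB")                       # ~900 GB

download_mbps = 100        # decent home broadband
print(f"download: ~{size_bytes * 8 / (download_mbps * 1e6) / 3600:.0f} hours")

# Battery: assume the SoC sustains ~5 W for a 60-second assistant answer.
watts, seconds = 5.0, 60.0
wh_per_answer = watts * seconds / 3600
battery_wh = 15.0          # roughly a flagship phone battery
print(f"answers per full charge (inference alone): ~{battery_wh / wh_per_answer:.0f}")
```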
 

jp7189

Distinguished
Feb 21, 2012
486
287
19,060
I think it depends a lot on how big the models are. You're not going to be using something like GPT 4 on a cell phone any time soon, just due to its sheer size. Not only does it chew up lots of storage, but also download bandwidth. Then, there's the issue of battery power, if you're inferencing huge models very much, like with some kind of Alexa/Siri assistant.

There would also be IP concerns about letting models run on edge devices, for those which aren't already open source. All someone needs to do is find one device with a known exploit to bypass memory encryption and now your model leaks out into the world.
I'm thinking of the pruning and distillation work that, e.g., Mistral et al. are doing. The focus is on maintaining accuracy while greatly reducing size and processing power. Minitron 8B runs on fairly low power.
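For the curious, the core distillation objective is simple enough to sketch. This is a generic teacher/student loss with random logits standing in for real models, not Mistral's or NVIDIA's actual recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between temperature-softened teacher and student.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true next tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 positions over a 32k-token vocabulary.
student = torch.randn(4, 32_000)
teacher = torch.randn(4, 32_000)
labels = torch.randint(0, 32_000, (4,))
print(distillation_loss(student, teacher, labels))
```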
 

JRStern

Distinguished
Mar 20, 2017
I think it depends a lot on how big the models are. You're not going to be using something like GPT 4 on a cell phone any time soon, just due to its sheer size. Not only does it chew up lots of storage, but also download bandwidth.
If there were value in it, I suppose a terabyte model could be run on a phone; you'd just download it once a year or so, perhaps from some local genius bar where you pay $29 for the privilege (or purchase it on ROM and just buy an upgrade and plug it in as available). I think 1 TB would cover GPT-4, and if not, I'll bet it could be compressed some just for edge distribution, with little or no impact on performance.

Now, the problem is that this new form of inference may need more horsepower than one typically finds in a phone. The old form probably is OK; the new form, I'm going to guess, not so much. But again, going to the other side, *your* phone may very well learn *your* patterns of inference and be able to cache the most important parts specifically to perform YOUR inferences. That's the real promise of edge computing, that it can be as individual as you are, assuming that's a good thing, LOL.
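The obvious mechanism for that would be something like a least-recently-used cache over weight shards. Here's a toy sketch where the shard names, sizes, and flash loader are all made up:

```python
from collections import OrderedDict

class ShardCache:
    """Keep the most recently used weight shards in RAM, evict the rest."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.cache = OrderedDict()                  # shard_id -> bytes

    def get(self, shard_id, load_from_flash):
        if shard_id in self.cache:                  # fast path: already resident
            self.cache.move_to_end(shard_id)
            return self.cache[shard_id]
        data = load_from_flash(shard_id)            # slow path: pull from storage
        while self.used + len(data) > self.capacity and self.cache:
            _, evicted = self.cache.popitem(last=False)   # drop least recently used
            self.used -= len(evicted)
        self.cache[shard_id] = data
        self.used += len(data)
        return data

# Toy usage: 64 MB budget, fake 16 MB shards; repeated "topics" hit the cache.
cache = ShardCache(64 * 2**20)
fake_flash = lambda shard_id: bytes(16 * 2**20)
for shard_id in ["chess", "recipes", "chess", "chess"]:
    cache.get(shard_id, fake_flash)
print(f"resident shards: {list(cache.cache)}, {cache.used / 2**20:.0f} MB used")
```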
 

bit_user

Titan
Ambassador
I'm thinking of the pruning and distillation work that, e.g., Mistral et al. are doing. The focus is on maintaining accuracy while greatly reducing size and processing power.
It's not like LLMs are renowned for their accuracy. Yeah, a pruned version will still be useful for certain things, but current LLMs have more issues than just their resource requirements.
 

bit_user

Titan
Ambassador
If there were value in it, I suppose a terabyte model could be run on a phone; you'd just download it once a year or so, perhaps from some local genius bar where you pay $29 for the privilege (or purchase it on ROM and just buy an upgrade and plug it in as available). I think 1 TB would cover GPT-4, and if not, I'll bet it could be compressed some just for edge distribution, with little or no impact on performance.
If the model is 1 TB, then inferencing speeds are going to be limited by how long it takes to read 1 TB from your phone's storage, which typically isn't the fastest NAND out there. It's also going to burn yet more power, continually having to read it in.
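To put hypothetical numbers on it (the flash bandwidth and read power below are guesses for a phone-class device, not measurements):

```python
# Worst case for a dense model: every generated token touches all weights,
# so anything that doesn't fit in RAM gets streamed back in from flash.
model_bytes = 1e12                  # the 1 TB model from the post above
flash_read_gbps = 3.0               # sequential read, GB/s (optimistic for a phone)
seconds_per_token = model_bytes / (flash_read_gbps * 1e9)
print(f"~{seconds_per_token:.0f} s per token just re-reading weights")

read_watts = 2.0                    # storage + controller power while reading (guess)
battery_joules = 15 * 3600          # a ~15 Wh phone battery
tokens_per_charge = battery_joules / (read_watts * seconds_per_token)
print(f"~{tokens_per_charge:.0f} tokens per full charge on storage reads alone")
```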
 

Pierce2623

Prominent
Dec 3, 2023
The customer is always right, ROFLMAO.

Yeah, and Sam Altman, who wants to raise $1T so he can be a player (that means he takes 10% off the top).
But they're wrong.
OpenAI's talk about their new o1 model, the one that "thinks before it answers", is that it does more work at inference time rather than putting all the load on pre-training. D'y'all see what that means?
First, it means needing more GPUs at inference time, OK, but maybe not that many. Actually, a GPU may not even be the right architecture to optimize this kind of inference, but that's complicated.
But it means a LOT fewer GPUs for training.
It even suggests that trillion-parameter models will be decomposed, saving exponential amounts of training work.

IOW I think it's the right path.
And also that it will cut even the big boys' demand for mass quantities of GPUs by a lot, by far more than will be incrementally needed for inference.
Have a nice day.
Buy the hype. Have fun with a model where they take shortcuts in training it.
 

JRStern

Distinguished
Mar 20, 2017
If the model is 1 TB, then inferencing speeds are going to be limited by how long it takes to read 1 TB from your phone's storage, which typically isn't the fastest NAND out there. It's also going to burn yet more power, continually having to read it in.
No query touches more than a tiny fraction of that.
It would be interesting to have actual numbers.
It might be as low as 1 MB; call it 10 MB, which would be about 0.001% of the model, roughly the same as one medium-res picture.
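Checking that arithmetic, with the 1 TB figure from above and a made-up photo size for comparison:

```python
model_bytes = 1e12                 # the 1 TB model discussed above
touched_bytes = 10e6               # my guess: ~10 MB actually touched per query
print(f"fraction touched: {touched_bytes / model_bytes:.5%}")       # 0.00100%
print(f"vs. a ~5 MB medium-res photo: {touched_bytes / 5e6:.0f}x its size")
```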
 

JRStern

Distinguished
Mar 20, 2017
Buy the hype. Have fun with a model where they take shortcuts in training it.
LOL. If you looked into how they "train" it now you wouldn't touch the current versions.
"Shortcuts" are more like going straight to the destination instead of knocking on every door until you happen to hit the right one.
Warp speed.
 

JRStern

Distinguished
Mar 20, 2017
The problem is which fraction?
Sure, well, that's why it's not exactly like fetching a single picture, or even a video 1000x larger.
So it's a little more like a complex SQL query against a database.
But it's 1 TB because it holds data for 1000 topics, and your query is typically just one facet of one topic.
 

bit_user

Titan
Ambassador
No query touches more than a tiny fraction of that.
It would be interesting to have actual numbers.
It might be as low as 1 MB; call it 10 MB, which would be about 0.001% of the model, roughly the same as one medium-res picture.
I don't doubt that a minority of the weights are needed, but your figure seems extremely low, to me. What's your source?

Also, as @USAFRet pointed out, the weights you need aren't all going to be compactly organized. NAND is read in fixed-size pages, and you pretty much have to read a whole page just to access one small part of it. So, the overhead and read amplification will make reading quite a bit more expensive than if you just loaded a file the same length as the specific weights would occupy.
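A toy read-amplification estimate (the page size and access granularity are assumptions, not specs for any particular part):

```python
page_bytes = 16 * 1024             # a common NAND page size (assumption)
useful_bytes_per_access = 256      # tiny, scattered slices of weights (assumption)
useful_total = 10_000_000          # the ~10 MB of weights a query supposedly needs

accesses = useful_total // useful_bytes_per_access
bytes_read = accesses * page_bytes
print(f"read ~{bytes_read / 1e6:.0f} MB to use 10 MB "
      f"(~{bytes_read / useful_total:.0f}x amplification)")
```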

it's 1 TB because it holds data for 1000 topics, and your query is typically just one facet of one topic.
It's not neatly taxonomized. I expect there will be some coherency and ad hoc structure to it, in order to achieve the necessary representational efficiency, but it won't be partitioned, per se.
 