News DeepSeek might not be as disruptive as claimed, firm reportedly has 50,000 Nvidia GPUs and spent $1.6 billion on buildouts

Nobody lied.

DeepSeek was extremely transparent that the $5.576 million figure covers only the variable training cost, not the capital costs associated with prior research, hardware acquisition, or fixed data costs. It's the people who misquote them, and who don't understand AI, who are misrepresenting things.


https://arxiv.org/html/2412.19437v1

DeepSeek-V3 Technical Report



DeepSeek was very open that this is purely the training compute cost, not the total "fixed + variable" cost. It's the media who misquote them and misrepresent it as an "all-in-one" cost that should be criticized.
Sure. However, they did seem to mislead to make bigger headlines. If proven true, they didn't reveal that they distilled their model from OpenAI's. Leaving that out, it would seem, is awfully conspicuous. Who knows how much that helped them achieve what they did in the timeline they cited. Should they be celebrated for this? Or are we saying distillation is fair game, just as LLMs sucking up copyrighted content from across the web to train is fair game? No easy answers there unless you start with some prejudice toward the country of origin.

Aside from that, DeepSeek showed some innovative steps. Turning a portion of the GPUs they had into L3 memory cache controllers was pretty brilliant. My understanding is that cache-memory access is a big limiter in training performance, so that's huge from a performance-tuning perspective. I'm wondering whether just revealing this technique will cause a major jump in GPU usage efficiency for everyone, or whether companies like OpenAI have already been doing this without revealing as much. Releasing the model open source could be disruptive, but I'm not sure. Better models are coming. Will they pull this off again to update/improve the model? I doubt it. How far will this open-source model carry a company that attempts to build off it before it becomes obsolete? Seems risky, given it still is not cheap to buy even low-end GPUs to run this model.

As such, I think at this point we're all wildly speculating on the net impact of DeepSeek's release on AI LLM development. Anyone saying big-tech LLM development is dead is reaching. Anyone claiming this is all a big Chinese-orchestrated scam is reaching. Let's wait and see.

I am thoroughly enraptured by the headlines in this space almost daily. It is fascinating both learning about the tools as they improve and using them more and more every day. I can't imagine this is just some hype train that will never arrive at the station from an investment perspective. All the while, I look forward to reading the news in a couple of months to see what people have done attempting to replicate DeepSeek's approach and/or building on their open source. Some more disruption and competition is good, even if it shakes my fellow US investors. We need this to become more efficient for smaller companies to build and run. I don't want 2-3 giant tech companies controlling access to this emerging technology because it needs a nuclear reactor's worth of energy output to build/run. Never mind the supply constraint for the chips.
 
If I collected all the "if", "suppose", "believe", their synonyms, and every phrase with that meaning from the comments and from the SemiAnalysis "report", I would have enough for a whole year. I would eat them raw, fried, baked, and boiled in soup.
 
It's not necessarily a conspiracy by the state. DeepSeek is run by a hedge fund, which might've profited off the fall in Nvidia shares when they made their announcement of how much it cost to train.
I would agree if they were public, but being private and going open source made it seem much more like a state-to-state swing at the markets. I think it was just a way to weaken the US markets ahead of the possibility of tariffs.
 
I wonder how long it'll take for this thread to be locked due to politics. Besides that, I don't trust the $1.6 billion number if they lied about the $6 million number. Who knows, maybe it cost $1 trillion and the power cost alone was $1.6 billion.
will have enough for a whole year. I will eat them raw, fried, baked, boiled in soup.
Boil em mash em stick em in a stew
 
The question I have based on this article is just how this Chinese company, which hires almost exclusively from mainland China, was able to buy 50,000 NVIDIA datacenter GPUs given there is a complete ban on selling them to China. And not just the older, discontinued A100s as previously reported, or even consumer cards, but the latest Hopper GPUs?
From what I've read, they already know where to look: Singapore. The sudden spike in exports to Singapore is clearly visible in NVIDIA's financial reports. Do you think someone will do anything about this? Investigate further? No, they're already flooding the internet with articles to save NVIDIA's stock price...

We're screwed. We have a massive giant with an inflated market capitalization that looms like a specter over the stock market, and if it crashes, it risks causing damage throughout the entire market. This is what happens when there are such concentrations. The problem is NVIDIA. Too many people, including analysts who didn't understand anything, not even the difference between training and inference, have inflated it senselessly, and now with every little issue it faces, the entire market trembles (because everyone knows that it's inflated... they want to make money with it, but they know that sooner or later...). Anything even remotely related to them, in the same sector or not, is now at risk, even the companies that supply screws for their servers. 😆
 
Sure. However, they did seem to mislead to make bigger headlines. If proven true, they didn't reveal that they distilled their model from OpenAI's. Leaving that out, it would seem, is awfully conspicuous. Who knows how much that helped them achieve what they did in the timeline they cited. Should they be celebrated for this? Or are we saying distillation is fair game, just as LLMs sucking up copyrighted content from across the web to train is fair game? No easy answers there unless you start with some prejudice toward the country of origin.

I feel that if their "compute power used" claims prove true, then even if their model was made by running their "distillation" process on open-source models, they've still told the truth. Regardless of the methods they used, if they trained an equivalent-performing model on a fraction of the resources, then they've still essentially told the truth. My only concern is if the amount of compute power they claimed to have used turns out to be BS.
 
Sure. However, they did seem to mislead to make bigger headlines. If proven true, they didn't reveal that they distilled their model from OpenAI's. Leaving that out, it would seem, is awfully conspicuous. Who knows how much that helped them achieve what they did in the timeline they cited. Should they be celebrated for this? Or are we saying distillation is fair game, just as LLMs sucking up copyrighted content from across the web to train is fair game? No easy answers there unless you start with some prejudice toward the country of origin.

Actually, the real reason they left out the distillation part isn't to grab headlines or earn money, it's to please the leadership. If they stated they distilled from OpenAI, they couldn't claim to have overtaken the world and to be independent of any US sanctions, which is what the leadership wanted to promote. If they don't fulfil that, they will have much more to worry about than money.
 
One thing to note: it's 50,000 Hoppers (older H20s and H800s) to make DeepSeek, whereas xAI needed 100,000 H100s to make Grok, and Meta 100,000 H100s to make Llama 3. So even if you compare fixed costs, DeepSeek needed about 50% of the fixed cost (and on less capable GPUs) for 10-20% better performance in their models, which is a hugely impressive feat.
Why do you assume they're using all of their GPUs to train a single model? This seems unlikely, to me. They have many researchers investigating different approaches, different tunings, and even developing other models.

One thing that people don't understand is that, no matter what model OpenAI publishes, DeepSeek will distill the output and make it free/publicly available
Unless the legality of doing so is successfully challenged. I'm not saying whether it will be, just that OpenAI is almost certain to try.

Also, OpenAI can implement measures to try and detect people doing this and cut them off.
 
One thing that people don't understand is that, no matter what model OpenAI publishes, DeepSeek will distill the output and make it free/publicly available (V3 is distilled 4o, R1 is distilled o1, and they are going to clone o3, etc.). So 90% of the AI LLM market will be "commoditized", with the remainder occupied by the very top-end models, which will inevitably be distilled as well. OpenAI's only "hail mary" to justify the enormous spend is trying to reach "AGI", but can that be an enduring moat if DeepSeek can also reach AGI and make it open source?
I am pretty sure they will, but that's exactly when innovation starts to get killed off. It's essentially piracy: when it makes everything commoditized, you kill off those who invested in hundreds of thousands of H100s or similar, plus all the experts who created the model, and once you've outcompeted them, good luck having anything keep progressing forward.
 
A company outside of China with no bad history buys H100s, sells them to China, profits. If you won't need the H100s for yourself in the future, there's no real risk in doing it once for a company.
That's why a new set of restrictions was about to go into effect last month. I haven't heard what happened with these:

The aspect you're missing is financial. The tier-1 countries on that list are probably all members of the financial transaction monitoring network used to track organized crime, terrorism finance, etc.

As soon as you start doing import/export of thousands of these GPUs, you're moving tens of millions of dollars, or more. That's not trivial to hide, and it's also illegal if you don't go through all the proper formalities that would reveal what you're doing. Obviously, the systems in place for spotting such smuggling aren't perfect, but it's another set of tools and trained individuals whose job it is to watch out for these sorts of illegal movements.
 
Sure. However, they did seem to mislead to make bigger headlines. If proven true, they didn't reveal that they distilled their model from OpenAI's. Leaving that out, it would seem, is awfully conspicuous. Who knows how much that helped them achieve what they did in the timeline they cited. Should they be celebrated for this? Or are we saying distillation is fair game, just as LLMs sucking up copyrighted content from across the web to train is fair game? No easy answers there ...
IMO, distillation is like just memorizing the problems and answers likely to be on a test, rather than really learning the material. As long as you stay within the areas covered by the benchmarks used to evaluate it, the performance is probably pretty good. However, I'd expect it to be much more uneven and to have some hard limits on what it knows or can do.
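For anyone wondering what distillation actually looks like mechanically, here's a minimal sketch in PyTorch: a small "student" is trained to match the softened output distribution of a frozen "teacher". The models and sizes below are made-up stand-ins for illustration, not anything DeepSeek or OpenAI actually uses.

Python:
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: any teacher/student that map hidden states to vocab logits.
teacher = torch.nn.Linear(512, 32000)   # "big" frozen model
student = torch.nn.Linear(512, 32000)   # "small" model being trained
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(hidden_states, temperature=2.0):
    """One distillation step: pull the student's distribution toward the teacher's."""
    with torch.no_grad():                         # the teacher is never updated
        teacher_logits = teacher(hidden_states)
    student_logits = student(hidden_states)

    # Classic Hinton-style distillation: KL divergence between softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, 512)))          # one step on a random batch

In the scenario people are arguing about here, the "teacher" would only be reachable through an API, so the student would be trained on sampled text rather than raw logits. That's weaker, but the memorize-the-teacher character is the same, which is why the test-cramming analogy above fits.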
 
That's an impressive investment! $1.6 billion in AI hardware really shows how serious the push is in the AI space right now. It'll be interesting to see how these resources are utilized and how they impact Nvidia's market position.
I guess you didn't hear about Musk buying 100k H100 GPUs? At the street price of $30k each, that's a cool $3B. He's planning on expanding it to 2-3 times the size:

Even bigger is the recently announced Stargate project, which aims to start with an initial $100B investment and plans to eventually ramp up to $500B.

Of course, that was before all this DeepSeek commotion. So, we'll have to see if they get cold feet and scale back those plans quite substantially.
 
The most relevant thing isn't whether DeepSeek has damaged or will damage NVIDIA in the future, but rather that as soon as the news spread (and by the way, we already had information about DeepSeek in December, so well before), a small breeze was enough to make NVIDIA's stock collapse.

It was like a test. And this is what's really interesting, because it's tangible proof of how solid NVIDIA's position in the AI sector is perceived to be: practically not solid at all. All it takes is a group of skilled programmers in China showing off an inference model (and who knows what could emerge in the coming months) and everyone runs away from NVIDIA. Solid foundations, indeed.
 
DeepSeek has strong backing with 50,000 Nvidia GPUs and $1.6B investment, but real disruption remains to be seen.
They don't have that GPU access after all. They can't even train V2, because they don't have access to tens of thousands of H100s and H800s like people claimed. V1 really was super small to train (like less than 10% of the GPU hours) compared to stuff with similar capabilities like ChatGPT 3.5, while V2 is pretty big to train but still MUCH SMALLER than other current competitors. It's probably only slightly smaller than ChatGPT 3. Regardless, it's STILL about an order of magnitude fewer GPU hours compared to where ChatGPT is now.
 
They don't have that GPU access after all. They can't even train V2, because they don't have access to tens of thousands of H100s and H800s like people claimed. V1 really was super small to train (like less than 10% of the GPU hours) compared to stuff with similar capabilities like ChatGPT 3.5, while V2 is pretty big to train but still MUCH SMALLER than other current competitors. It's probably only slightly smaller than ChatGPT 3. Regardless, it's STILL about an order of magnitude fewer GPU hours compared to where ChatGPT is now.
That's because they are utilizing the distillation method, i.e. getting tons of ChatGPT accounts, feeding them questions, and mimicking the results; no wonder that takes a lot fewer GPU hours. But the problem is, without the OG ChatGPT, it is useless.
 
That's because they are utilizing the distillation method, i.e. getting tons of ChatGPT accounts, feeding them questions, and mimicking the results; no wonder that takes a lot fewer GPU hours. But the problem is, without the OG ChatGPT, it is useless.
OpenAI's chief research scientist congratulated DeepSeek for their research paper and optimization techniques, which significantly reduce compute requirements. He did not make any accusations of cheating or lying.

It's likely DeepSeek trained on public internet data, and, not surprisingly, the internet is filled with ChatGPT-generated content. It's not necessarily distillation, as the number of API calls needed to create a 671-billion-parameter model (3X the size of ChatGPT-4o, the most popular model) would easily be caught and would be prohibitively expensive.
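As a rough sanity check on that claim (the numbers below are assumptions for illustration, not measured figures): the V3 technical report linked earlier cites a pretraining corpus of roughly 14.8 trillion tokens, so sourcing even a small slice of that from a paid API adds up fast.

Python:
# Back-of-envelope only; the corpus size is from the V3 report, the API price is assumed.
pretraining_tokens = 14.8e12       # ~14.8 trillion tokens reported for V3 pretraining
assumed_price_per_million = 10.0   # assumed $ per 1M output tokens from a frontier API

for fraction in (0.001, 0.01, 0.1):
    tokens = pretraining_tokens * fraction
    cost = tokens / 1e6 * assumed_price_per_million
    print(f"{fraction:.1%} of the corpus -> {tokens:.2e} tokens -> ~${cost:,.0f}")

Whether that counts as "prohibitive" depends on the price and volume you assume, but it does show why generating any meaningful share of a training corpus through someone else's API would be expensive and hard to hide.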

DeepSeek also employed a novel combination of optimization techniques to reduce RAM and compute requirements: multi-token prediction instead of single-token prediction, mixture-of-experts, compression of the key-value cache, 8-bit instead of 32-bit precision, and even assembly-like PTX programming for execution efficiency. All of these reduce compute requirements significantly, and these optimization techniques have been adopted by leading LLM makers like Meta, Google, etc.
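To make the mixture-of-experts part concrete, here's a minimal sketch of top-k expert routing in PyTorch: each token only activates a couple of small expert networks, which is how a huge total parameter count can keep per-token compute and memory traffic low. The sizes and names are invented for illustration; this is not DeepSeek's actual architecture.

Python:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(16, 64)).shape)                # torch.Size([16, 64])

Multi-token prediction and 8-bit arithmetic are separate tricks layered on top, but routing like this is the main reason a model with a very large total parameter count only has to touch a fraction of its weights for each token.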

What's eye-opening is that this is all published research information, yet so much cope and misinformation abound.
 
OpenAI's chief research scientist congratulated DeepSeek for their research paper and optimization techniques, which significantly reduce compute requirements. He did not make any accusations of cheating or lying.
That was before the claims emerged of their supposed compute requirements for training their R1 model. You're starting with a conclusion and working backwards to try and justify it.

It's likely DeepSeek trained on public internet data, and, not surprisingly, the internet is filled with ChatGPT-generated content.
No, that's not what @YSCCC was referring to.

the number of API calls needed to create a 671-billion-parameter model (3X the size of ChatGPT-4o, the most popular model) would easily be caught and would be prohibitively expensive.
OpenAI did catch it and subsequently banned them.

DeepSeek also employed a novel combination of optimization techniques to reduce RAM and compute requirements: multi-token prediction instead of single-token prediction, mixture-of-experts, compression of the key-value cache, 8-bit instead of 32-bit precision, and even assembly-like PTX programming for execution efficiency. All of these reduce compute requirements significantly, and these optimization techniques have been adopted by leading LLM makers like Meta, Google, etc.
This is copium. A lot of those optimization techniques are used by the other big AI houses; they just don't publish them because they regard them as either common sense or trade secrets.
 
That was before the claims emerged of their supposed compute requirements for training their R1 model. You're starting with a conclusion and working backwards to try and justify it.
You mean before Sam Altman said DeepSeek was a national security threat and suggested Congress ban it? No conflict of interest there.
No, that's not what @YSCCC was referring to.


OpenAI did catch it and subsequently banned them.
Then, theoretically, there should be no more V3 iteration updates due to the lack of distillation access? Yet V3 has received an update since, and they will continue to improve their model, so the premise that DeepSeek is simply a distillation of ChatGPT is an exaggeration and a fallacy pushed by competitors seeking to downplay its achievements.
This is copium. A lot of those optimization techniques are used by the other big AI houses; they just don't publish them because they regard them as either common sense or trade secrets.
OpenAI's chief research scientist has said on the public record that these optimization techniques employed in concert are novel, even if the individual components were already known. Some aren't even commonly used, like bypassing CUDA and instead using the assembly-like PTX language; none of the leading model makers do that, because none of them are sanctioned out of CUDA. Challenges and barriers encourage innovation; it's copium to dismiss DeepSeek as a mere distillation and a mere copycat.
 
Then, theoretically, there should be no more V3 iteration updates due to the lack of distillation access?
Well, now they have much more funding, so they can afford more hardware.

OpenAI's chief research scientist has said on the public record that these optimization techniques employed in concert are novel, even if the individual components were already known. Some aren't even commonly used, like bypassing CUDA and instead using the assembly-like PTX language; none of the leading model makers do that, because none of them are sanctioned out of CUDA.
No, that's BS. PTX is specified by Nvidia precisely because customers demand it. If Nvidia really didn't want people to use it, they would simply stop publishing the spec and the tools for it.

And I can assure you some of the biggest AI houses do use it. They just have no reason to crow about it, because all that would do is potentially convince any of their competitors not using it to start.
 
No, that's BS. PTX is specified by Nvidia precisely because customers demand it. If Nvidia really didn't want people to use it, they would simply stop publishing the spec and the tools for it.

And I can assure you some of the biggest AI houses do use it. They just have no reason to crow about it, because all that would do is potentially convince any of their competitors not using it to start.
PTX might be public, but that doesn't mean it's commonly used the way DeepSeek is using it. Most big AI labs stick to CUDA or higher-level tools like Triton because PTX is low-level, fragile, and not stable across driver versions. It's not that you can't use PTX, it's that almost nobody does at scale because it's a pain to maintain (unless you are heavily sanctioned).

What makes DeepSeek’s approach stand out is how aggressively they bypass CUDA and lean into PTX for performance. That’s not typical, and it’s exactly why OpenAI’s team called it out as novel. Just because the tools exist doesn’t mean others are using them the same way.
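For readers following the abstraction-level argument: this is roughly what the "higher-level tools like Triton" option looks like, a standard vector-add kernel written in Python and compiled by Triton down to PTX for you (it's the canonical tutorial example, not anything DeepSeek published). Hand-writing PTX means taking over that last lowering step yourself.

Python:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)              # how many program instances to launch
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
print(torch.allclose(add(a, b), a + b))         # True

Dropping below this to hand-tuned PTX buys finer control over instruction scheduling and register use, which is the kind of control DeepSeek reportedly leaned on, at the cost of exactly the fragility and maintenance concerns raised above.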
 
PTX might be public, but that doesn’t mean it’s commonly used the way DeepSeek is using it. Most big AI labs stick to CUDA or higher-level tools like Triton because PTX is low-level, fragile, and not stable across driver versions.
How do you know what big AI houses use internally? You don't even know what Nvidia's own libraries use. I'm certain cuDNN contains PTX or lower, for instance.

It’s not that you can’t use PTX, it’s that almost nobody does at scale because it’s a pain to maintain
This wasn't true before and it's not true now. What is true is that you must usually have a need that makes it worth the effort. When we're talking about literally billions of dollars in hardware and a hyper-competitive business environment, such optimizations are easily justifiable, especially given they only need to work on a couple of generations of server GPUs.
 
That's because they are utilizing the distillation method, i.e. getting tons of ChatGPT accounts, feeding them questions, and mimicking the results; no wonder that takes a lot fewer GPU hours. But the problem is, without the OG ChatGPT, it is useless.
I'm aware of their distillation method. They basically just train it to give the same responses as ChatGPT/Llama etc. with the minimal code base and training possible. I've got zero issues with that method UNLESS they're using whole chunks of open-source code, but there's been no evidence of that.
 
PTX might be public, but that doesn't mean it's commonly used the way DeepSeek is using it. Most big AI labs stick to CUDA or higher-level tools like Triton because PTX is low-level, fragile, and not stable across driver versions. It's not that you can't use PTX, it's that almost nobody does at scale because it's a pain to maintain (unless you are heavily sanctioned).

What makes DeepSeek’s approach stand out is how aggressively they bypass CUDA and lean into PTX for performance. That’s not typical, and it’s exactly why OpenAI’s team called it out as novel. Just because the tools exist doesn’t mean others are using them the same way.
I can tell you for sure that xAI and OpenAI don't stick purely with CUDA when they're getting to the point of $100 million in GPU hours over a 6-8 week period for the final "main" training run on their next new model. At that point, sticking to CUDA costs millions of extra dollars in a short period of time. Unfortunately for OpenAI, they've had so much talent drain recently, with Zuckerberg and Musk and Google absolutely raiding their people, that they might have to go back to pure CUDA lol.