News: Meta to build 2GW data center with over 1.3 million Nvidia AI GPUs — invest $65B in AI in 2025

DeepSeek V3 and R1, trained on a ~$5 million budget with 2,048 H800s and with performance exceeding GPT-4o and o1, have already rendered the massive GPU farm clusters obsolete. Same with the Sky-T1 model out of Berkeley, which trained an o1 peer for $450.
 
Performance according to what metric? I don't see any of those on the Open LLM Leaderboard.
Huggingface only has open-source models on its leaderboard, so no GPT or Claude.

Here is an ELO-style leaderboard with OpenAI, Anthropic, Google, and DeepSeek models included, where users blind-vote on anonymous AI chat responses.

https://lmarena.ai/?leaderboard
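For anyone unfamiliar with how an ELO-style ranking works, here is a rough sketch in Python (my own illustration, not lmarena's actual code; the site reportedly fits a statistical model over all votes rather than updating ratings one vote at a time) of how a single blind head-to-head vote nudges two models' ratings:

```python
# Illustrative Elo update from one blind A-vs-B vote (not lmarena's real pipeline).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return (rating_a + k * (s_a - e_a),
            rating_b + k * ((1.0 - s_a) - (1.0 - e_a)))

# Two models start at 1000; model A's anonymous response wins one vote.
print(update_elo(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```

The point is that every vote is between two anonymous responses to the same prompt, so a model only climbs by winning those head-to-head comparisons.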
 
Huggingface only has open-source models on its leaderboard, so no GPT or Claude.
It's not just GPT and Claude missing. I didn't see any of the models you listed!

Here is an ELO-style leaderboard with OpenAI, Anthropic, Google, and DeepSeek models included, where users blind-vote on anonymous AI chat responses.

https://lmarena.ai/?leaderboard
That's just based on votes by the public. They're not ranking it on skills tests, like Huggingface's scores do.

What I'm getting at is that I can believe you can train a specialized model with a lot fewer resources and end up with something competitive. Training one that's as versatile as leading contenders on Huggingface's LLM leaderboard is a lot harder and probably takes disproportionately more resources.
 
I never thought I'd be a fan of Meta/Facebook, but I am a HUGE fan of Llama specifically because it's open-source/open-weight and allows people to add their own fine-tune training on top of it.
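For anyone curious what that looks like in practice, here's a minimal fine-tuning sketch assuming the Hugging Face transformers/peft/datasets libraries; the checkpoint name and the my_posts.jsonl file are placeholders for whatever open-weight Llama and data you actually use:

```python
# Minimal LoRA fine-tuning sketch on top of an open-weight Llama checkpoint.
# Model name and "my_posts.jsonl" are placeholders for your own choices/data.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B"              # any open-weight Llama works here
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token     # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable LoRA adapters instead of updating all of the base weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Placeholder dataset: one JSON object per line with a "text" field.
data = load_dataset("json", data_files="my_posts.jsonl")["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-lora", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
model.save_pretrained("llama-lora")           # saves only the small adapter weights
```

That's the appeal of open weights: the adapter is tiny compared to the base model, and you don't need Meta's training cluster to customize it.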
 
That's just based on votes by the public.
These are ranked by blind votes on which anonymous AI response is better, not by some popularity contest.
They're not ranking it on skills tests, like Huggingface's scores do.
For skills-based tests, those are available in DeepSeek's publication:
[attached: benchmark chart from the DeepSeek V3 paper]


Also, from DeepSeek R1's publication:
[attached: benchmark chart from the DeepSeek R1 paper]


The MMLU Pro and MMLU scores for DeepSeek/GPT/Claude/Google really blow everything out of the water compared to any model on HuggingFace (which peaks at about 70 points for the top model).
What I'm getting at is that I can believe you can train a specialized model with a lot fewer resources and end up with something competitive. Training one that's as versatile as leading contenders on Huggingface's LLM leaderboard is a lot harder and probably takes disproportionately more resources.
Except, as shown by the MMLU Pro scores, DeepSeek/GPT/Claude/Google blow every open-source model on HuggingFace's LLM leaderboard out of the water, and they aren't even listed on that leaderboard. That's likely because submission is voluntary.
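For context on what those "skills tests" actually measure: MMLU-style benchmarks simply check how often the model picks the correct answer to multiple-choice questions. Here is a rough sketch, assuming the transformers library; the checkpoint and the single hard-coded question are placeholders, and real harnesses run thousands of questions per subject:

```python
# Rough illustration of MMLU-style scoring: pick the answer choice the model
# assigns the highest likelihood to. Checkpoint and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

question = "Which planet is known as the Red Planet?"
choices = ["Venus", "Mars", "Jupiter", "Saturn"]      # correct answer: "Mars"
prompt = f"Question: {question}\nAnswer: "

def choice_logprob(answer: str) -> float:
    """Sum of the model's log-probabilities for the answer tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Position i predicts token i+1, so the answer tokens are predicted by
    # positions prompt_len-1 .. full length-2.
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

scores = [choice_logprob(c) for c in choices]
print("model picked:", choices[scores.index(max(scores))])
```

Accuracy over a large, diverse question set is what separates a genuinely versatile model from one that just happens to do well on a narrow slice of prompts.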
 
These are ranked by blind votes on which anonymous AI response is better, not by some popularity contest.
But the questions are just a pool submitted by the public and not sorted by skill. So, you can still have one specialized model that's really good at certain prompts winning out over a more general and versatile model.

For skills-based tests, those are available in DeepSeek's publication:
[attached: benchmark chart from the DeepSeek V3 paper]
Why didn't they submit it to Hugging Face's LLM leaderboard, at least in addition to their comparison against GPT and Claude?
 
Why didn't they submit it to Hugging Face's LLM leaderboard, at least in addition to their comparison against GPT and Claude?
Maybe because HuggingFace's LLM leaderboard is not good?

HuggingFace's leaderboards show how truly blind they are, because they are actively hurting the open-source movement by tricking it into creating a bunch of models that are useless for real usage.

That explains why no respectable LLM is listed on there: the benchmarks are useless (except maybe MMLU Pro).
 
Why didn't they submit it to Hugging Face's LLM leaderboard, at least in addition to their comparison against GPT and Claude?

Probably because Hugging Face hasn't completed the tests yet? I just checked the leaderboard, and three DeepSeek-R1 distill models have finished the tests and have scores there:
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

Anyway, there are other tests done by third parties that aim to be more challenging (since the current tests have become too simple/too familiar for new models), like "Humanity's Last Exam", which actually shows DeepSeek can solve hard questions better than o1 (or any other model).

I personally also compared those two models: I uploaded the code framework and asked them to give opinions and analysis. DeepSeek-R1 does provide WAY BETTER insights than o1.