News: DeepSeek research suggests Huawei's Ascend 910C delivers 60% of Nvidia H100 inference performance

I just want to point out that the H100 burns a lot of die area on stuff that's not relevant to inference. For instance, you don't need quite so much inter-GPU bandwidth and connectivity, when just doing inference. Also, H100 has quite a bit of FP64 horsepower, for HPC. If you're building a pure AI processor, you wouldn't need that stuff. I actually expected Nvidia to have separated off their AI and HPC products by now. Maybe in the next generation, they will finally do this.

Finally, I wonder how many people are even using H100 for inference. It'd be cheaper to distribute your model over a set of L40 cards.
 
"Rumored To Come Pretty Close With NVIDIA’s H100"... 60% is pretty close? I am not sure when being one->two generations behind, performance wise, was considered close but I for one find that math suspect as also assumes the purported performance is as described. My guess is it closer to 50% or less in non-cherry picked workloads. This "news" seems to be anything but... Until it is tested rigorously by non-Chinese reviewers I'll take this news with a truck load of salt.
 
I just want to point out that the H100 burns a lot of die area on stuff that's not relevant to inference. For instance, you don't need quite so much inter-GPU bandwidth and connectivity, when just doing inference. Also, H100 has quite a bit of FP64 horsepower, for HPC. If you're building a pure AI processor, you wouldn't need that stuff. I actually expected Nvidia to have separated off their AI and HPC products by now. Maybe in the next generation, they will finally do this.

Finally, I wonder how many people are even using H100 for inference. It'd be cheaper to distribute your model over a set of L40 cards.
The Ascend 910 can do more than inference, too. It was originally marketed as a high-power training solution, and it still has a somewhat respectable amount of fp32 compute. It's not solely a pure tensor/matrix math accelerator like a Google Coral or whatever it was called.
 
But not HPC, correct? You mostly need fp64 for that.
Correct. It’s dogshit at fp64 from what I’ve seen, as it has to use two fp32 data paths to run fp64 calculations. That applies to pretty much everything outside of AMD Instinct, though. AMD is literally the only one still serving the fp64 market. An MI300X has more than double the fp64 throughput of anything Nvidia offers, since they’ve gone all-in on AI and less precise data formats.
 
Well... if anyone truly wants to find out the performance of Ascend, I guess they can just rent an Ascend cluster on Huawei Cloud and play with it:
https://www.huaweicloud.com/intl/en-us/product/modelarts.html

Ascend cards can be used for model training, too. Huawei has a cooperation case study (with another Chinese AI company) on their official website and has introduced their training solutions:
https://e.huawei.com/cn/case-studies/solutions/storage/iflyte

Huawei also has its own models (like the one that serves as the AI assistant in Huawei's HarmonyOS), and according to Huawei Developer Conference announcements, all of those models were trained on their Ascend clusters.

Furthermore, all major Chinese companies with AI requirements (such as ByteDance, Alibaba, Tencent, etc.) have placed mass orders for Ascend cards. The recent numbers I read estimate that Huawei will ship 1 million 910Cs in 2025.

Even though one could argue that Nvidia can sell 2x more H100s than 910Cs, and that the H100 is roughly 2x more powerful than the 910C (let alone the more powerful B200), the point is that the strategy of preventing China from getting the computing power needed for AI development through embargoes and sanctions doesn't seem to be working.
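
Rough back-of-the-envelope on those aggregate numbers, sketched in Python. Both the 1M-unit shipment estimate and the per-chip performance ratio are unverified assumptions from this thread, not measured figures:

```python
# Back-of-the-envelope: aggregate compute implied by the (unverified) figures above.
units_2025 = 1_000_000          # rumored 910C shipment estimate for 2025
perf_vs_h100 = 0.5              # assume one 910C ~= half an H100 (pessimistic end)

h100_equivalents = units_2025 * perf_vs_h100
print(f"~{h100_equivalents:,.0f} H100-equivalents of compute")
# -> ~500,000 H100-equivalents, even under the pessimistic assumption.
```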
 
what I’ve seen as it has to use two fp32 data paths to run fp64 calculations. That applies to pretty much everything outside of AMD Instinct though. AMD is literally the only one still serving the fp64 market.
Yeah, AMD's MI250X definitely leapfrogged Nvidia on fp64, but then the H100 nearly caught up. MI300X went double-or-nothing, at which point you're right that B100 said "no thank you", barely upping its fp64 and instead opting to devote most of its additional resources to AI.

Weirdly, this has left AMD with the strongest offering in HPC, a market where Nvidia first established CUDA and the bona fides of its GPU compute solution.
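
For reference, a quick comparison using the approximate datasheet FP64 peaks I've seen quoted. These are vendor-nominal numbers, not measured HPC results:

```python
# Approximate peak FP64 throughput from vendor datasheets (TFLOPS).
# Nominal peaks only, not measured HPC results.
fp64_peak_tflops = {
    #                    (vector, matrix/tensor)
    "NVIDIA H100 SXM":   (34.0,  67.0),
    "AMD MI300X":        (81.7, 163.4),
}

h100 = fp64_peak_tflops["NVIDIA H100 SXM"]
mi300x = fp64_peak_tflops["AMD MI300X"]
print(f"MI300X vs H100, fp64 vector: {mi300x[0] / h100[0]:.1f}x")  # ~2.4x
print(f"MI300X vs H100, fp64 matrix: {mi300x[1] / h100[1]:.1f}x")  # ~2.4x
```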
 
Even though one could argue that ... the H100 is roughly 2x more powerful than the 910C,
The article just said inference, and on one model (which also happened to run disproportionately well on AMD's RX 7900 XTX, which has only 37% of the AI TOPS of the RTX 4090, suggesting DeepSeek is more bandwidth-hungry than compute-intensive). Where do you get such a figure for training?
 
So if the 910C is 60% of the H100's performance, what about per-unit price, power, and workload? At what point does it make more sense to use multiple 910Cs locally instead of a single H100?
Besides the fact that the H100 is embargoed in China, the 910C should be in a much lower price class. If that claim of 60% performance is true, it would be a no-brainer.

That is a colossal "if", though. If specs are to be believed, the H100 is about 4 times as capable in INT8 and FP16. Thus, we'd have to believe they managed to extract over twice the efficiency using CANN as the H100 does using CUDA. The claim also ignores the possibility that whatever efficiency gains they managed on the Huawei card could likely be achieved on the Nvidia one in a similar fashion.
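
To spell out the arithmetic behind that "over twice the efficiency" point (the 4x ratio is a spec-sheet assumption, and the 60% figure is the claim being questioned):

```python
# If the H100's peak INT8/FP16 throughput is ~4x the 910C's (spec-sheet assumption),
# then delivering 60% of the H100's real-world performance implies the 910C would
# have to run at roughly 2.4x the H100's utilization of its own peak.
peak_ratio_h100_vs_910c = 4.0       # assumed from spec sheets
delivered_ratio_910c_vs_h100 = 0.6  # the claim under discussion

required_utilization_advantage = delivered_ratio_910c_vs_h100 * peak_ratio_h100_vs_910c
print(f"910C would need ~{required_utilization_advantage:.1f}x the H100's utilization")
```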

Seems to me like marketing built on some best-case scenario.
 
EUV-made chips are needed in portable, battery-efficient applications like mobile phones, but you can certainly make do with older, more power-hungry nodes like 7nm and still remain competitive in AI.

A 40% difference in light of US sanctions is pretty damn impressive.
 
Making a PowerPoint like that will earn you some big money from the Chinese government, if you have some relationship with the governor.
This, and you only have to make the claim first. Once the money is in the bank, they'll be too ashamed to announce anything even if the real performance is found to be abysmal, in the name of "face saving". But you will see them ban foreign usage in order to "protect national secrets".
 
you can certainly make do with older, more power-hungry nodes like 7nm and still remain competitive in AI.
Not really. AI performance tends to be highly proportional to transistor count. You can't really compensate for that by going nuts with clock speed. Scaling out the number of chips also hits limits, beyond which it becomes very costly and inefficient to continue.

If the process node really didn't matter, then people wouldn't be trying so hard to use the latest and greatest node for AI accelerators. They'd just take the cheap & easy route of using N7 or similar.
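
As a toy illustration of why clocking harder is a poor substitute for more transistors, here's a simplified dynamic-power model (power roughly proportional to V²·f, with voltage tracking frequency). The constants are purely illustrative, not measurements of any particular chip:

```python
# Toy dynamic-power model: P ~ C * V^2 * f, with V scaling roughly with f.
# Illustrative only -- ignores leakage, voltage floors, memory power, etc.

def relative_power(throughput_gain: float, via: str) -> float:
    if via == "clock":
        return throughput_gain ** 3   # f up, V up with it -> roughly cubic power growth
    if via == "transistors":
        return throughput_gain        # more parallel units at the same clock -> ~linear
    raise ValueError(via)

for gain in (1.5, 2.0):
    print(f"{gain}x throughput: clock ~{relative_power(gain, 'clock'):.1f}x power, "
          f"more silicon ~{relative_power(gain, 'transistors'):.1f}x power")
```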

A 40% difference in light of US sanctions is pretty damn impressive.
We don't know how that translates to other models. If inferencing is mostly limited by memory bandwidth, then an architecture with much less compute horsepower wouldn't be so disadvantaged. Hence why AMD's RX 7900 XTX managed to hang with the RTX 4090 on some of DeepSeek's models, in spite of having only 37% as many raw TOPS. Both have about the same memory bandwidth, so maybe that's what it really needs?
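
Here's a rough roofline-style sketch of that argument, with ballpark spec figures. The TOPS values are just chosen to reflect the ~37% ratio mentioned above, and the model size is illustrative:

```python
# Roofline-style sanity check: if token generation is bandwidth-bound, the ceiling
# is (memory bandwidth) / (bytes of weights streamed per token), not TOPS.
gpus = {
    #               (AI TOPS approx, memory bandwidth GB/s)
    "RTX 4090":     (660, 1008),
    "RX 7900 XTX":  (245,  960),   # ~37% of the 4090's TOPS, per the post above
}

model_bytes = 8e9   # e.g. an ~8 GB quantized model fully resident in VRAM (illustrative)

for name, (tops, bw_gbs) in gpus.items():
    ceiling_tok_s = (bw_gbs * 1e9) / model_bytes   # bandwidth-bound ceiling
    print(f"{name}: ~{ceiling_tok_s:.0f} tok/s ceiling "
          f"({tops} peak TOPS barely enters into it)")
```

Both cards land within a few percent of each other on that ceiling, which would explain the benchmark result regardless of the TOPS gap.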
 
Not really. AI performance tends to be highly proportional to transistor count. You can't really compensate for that by going nuts with clock speed. Scaling out the number of chips also hits limits, beyond which it becomes very costly and inefficient to continue.

I think that depends on the per-unit price of the Ascend 910C vs. that of the H100.

The H100 is sold at $25,000-$30,000 per unit, while it reportedly costs Nvidia only about $3,320 to manufacture (a markup of roughly 650-800%).

If the Ascend 910C is sold at even a fraction of that price, then its price-per-performance would be quite competitive with the H100's, setting aside the larger datacenter footprint, additional cooling, and electricity use.
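
Quick break-even sketch using the thread's own (unverified) numbers, i.e. the 60% relative-performance claim and the quoted H100 street price:

```python
# Break-even price for the 910C on pure price-per-performance.
# Ignores power, cooling, and rack-space differences.
h100_price_usd = 27_500      # midpoint of the $25k-$30k range quoted above
relative_perf = 0.60         # claimed 910C performance vs. H100

breakeven_910c_price = h100_price_usd * relative_perf
print(f"910C matches H100 price/perf at ~${breakeven_910c_price:,.0f}")  # ~$16,500
```

Anything meaningfully below that figure wins on raw price/perf, before the power and cooling differences are factored in.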