News DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead

People should be concerned about rampant AI proliferation without adequate safeguards, because it is very prone to hallucinations. How can you be certain the output is reliable?
People have reason to be concerned where AI failure can harm people: for example, driving a semi truck at 70 MPH, automating air traffic control, flying airplanes, or writing code for applications where failure can hurt people.
I agree. We are entering a life-changing transition right now with the same ignorant glee as we did with computers, except now “the computers” will autonomously control physical objects in real life, be relied upon to maintain infrastructure, or make important decisions in roles that were classically too important or too risky to let a computer handle (for example, an AI dedicated to ensuring Mutually Assured Destruction by targeting and launching a retaliatory nuclear strike if all government leaders were to die in a first strike… aka Skynet), making them a threat to human life if they were to malfunction or be logic-poisoned.
Do y’all remember the Stuxnet worm that spread across Windows computers worldwide just to find a way into a classified Iranian underground facility, so it could slowly and methodically sabotage the physical equipment used in the uranium enrichment plant within, while going completely unnoticed for months before its discovery around 2010? Well, take all of that capability and imagine someone, or some other AI, creates an AI virus that slowly spreads across the Earth, infecting all AI invisibly until it is activated. Who knows what the end goal would be, but if our existing capabilities to defend mature and well-understood tech like Windows computers from zero-day breaches are anything to go by, AI will be F’ed with, and the repercussions will be physical and real.
 
From an interview with Anduril founder Palmer Luckey
DeepSeek did not release the full costs of both models it developed, and the media is ignoring that a significant portion of the AI startup's infrastructure costs is still unknown.

"I think the problem is they put out that number specifically to harm U.S. companies," Luckey said. "You had a lot of useful idiots in U.S. media kind of just mindlessly reporting that that's the case, and neither China nor the media nor DeepSeek has any kind of incentive to correct the record as a lot of U.S. companies like Nvidia crashed to the tune of hundreds of billions of dollars.

"I don't think that people should take what they're saying at face value, and they should realize that there are a lot of people cheering for the United States to fail," he said. "There's people who are clearly cheering for our tech companies to fail and, obviously, President Trump to fail. It's a shame that so many of them are in the United States.

"There's a reason they put out the news that way, and if the stock market is any indication, it's accomplishing exactly what they hoped to," Luckey added. "So, look, we can recognize that Chinese AI is a real competitive threat without losing our minds over it and falling for CCP [Chinese Communist Party] propaganda."
 
From an interview with Anduril founder Palmer Luckey
They still proved that they could do pretty much as well for much less. And, as they open-sourced a lot of what they did, the cat is out of the bag. While most US tech firms claimed that China was still 5-10 years behind, that was disproven - and that's, to my knowledge, the main reason why US tech stocks crumbled: a big speculative bubble just got busted.
Tony Stark, meet Ivan Vanko (except it's the Mandarin).
Still, I am VERY surprised that they managed a 10-times improvement by going "assembly vs. compiled": from my experience, doing so will typically yield 3-5 times faster performance at best (on the project's overall performance), not ten. Looks like the CUDA compiler isn't very good at optimizing code, and/or doesn't allow hinting.
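For anyone wondering what "dropping down to PTX" even looks like in practice, here is a trivial, hypothetical sketch (nothing to do with DeepSeek's actual kernels): CUDA C++ lets you embed PTX instructions directly with asm volatile, so you only hand-write the few spots where you don't trust the compiler.

```
// Toy CUDA kernel with one hand-written PTX instruction inside it.
// Purely illustrative -- not DeepSeek's code, just the general technique.
__global__ void scale_add(const float* a, const float* b, float* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // Inline PTX: fused multiply-add, r = a[i] * s + b[i].
        // The compiler would normally emit this anyway; hand-writing PTX only
        // pays off for things like scheduling, cache hints, or warp shuffles.
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(r)
                     : "f"(a[i]), "f"(s), "f"(b[i]));
        out[i] = r;
    }
}
```

The bigger wins usually come from controlling scheduling, register pressure and warp-level communication rather than from rewriting straightforward arithmetic, which is why blanket "10x from assembly" readings deserve some skepticism.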
 
Well, you better tell DeepSeek that, to stop them from shooting themselves in the foot.

Because it's them who did it, not me.

And the fact that a 671B model can run with only 37B parameters active is at the very core of their appeal.

The fine-tuning of the various other models, Qwen 7B in this case, is also of their own making, so any "foul" should be theirs.

There is one extra actor in the chain, which is whoever ran the INT4 quantization, which certainly can have a negative impact on quality, but is also expected standard practice.

That's why I tried to find a way to run the model at original precision, which perhaps isn't just BF16, but perhaps a mix of precisions. I don't know if LM Studio can't handle BF16 or what's going on. In any case, I couldn't just download it and was limited to the INTx quantized GGUF variants available for download from Hugging Face.

I did try the DeepSeek-R1-Distill-Qwen-32B and the DeepSeek-R1-Distill-Llama-70B variants, too, but I only have an RTX 4090, which struggles with their size. Playing with the various quantizations took time and so far results aren't promising: I mostly get garbage, and when layers wind up in CPU RAM, token rates drop off the typical bandwidth cliff, killing what little time I can dedicate to these experiments.
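To put rough numbers on why running at original precision is out of reach locally, here's a back-of-the-envelope sketch (my own arithmetic, not anything from DeepSeek's documentation), using the publicly cited ~671B total / ~37B active parameter counts:

```
#include <cstdio>

// Back-of-the-envelope weight-memory estimates for a ~671B-parameter MoE
// with ~37B parameters active per token. Rough numbers, weights only.
int main()
{
    const double total_params  = 671e9;
    const double active_params = 37e9;
    const double gib = 1024.0 * 1024.0 * 1024.0;

    // bytes per weight: BF16/FP16 = 2, FP8/INT8 = 1, INT4 = 0.5
    const double bf16 = 2.0, int8 = 1.0, int4 = 0.5;

    printf("All weights, BF16:      %6.0f GiB\n", total_params * bf16 / gib);   // ~1250 GiB
    printf("All weights, INT8:      %6.0f GiB\n", total_params * int8 / gib);   // ~625 GiB
    printf("All weights, INT4:      %6.0f GiB\n", total_params * int4 / gib);   // ~312 GiB
    printf("Active per token, BF16: %6.0f GiB\n", active_params * bf16 / gib);  // ~69 GiB
    return 0;
}
```

Even though only ~37B parameters fire per token, all of the experts still have to sit somewhere in memory, which is why a 24 GB RTX 4090 only fits the heavily quantized distill variants.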
Do you believe the full precision, full parameter, unmodified R1 would have provided a better answer to your Marie Antoinette question? Where I cry foul is in the idea the answer you got was a fair representation of R1's quality. It is also possible to "dumb" down llama3 far enough to make it useless, but that doesn't mean llama3 is a bad model.
 
Even if it's difficult to maintain and implement, it's clearly worth it when talking about a 10x efficiency gain; imagine a $10 Bn datacenter only costing, let's say, $2 Bn (still accounting for non-GPU-related costs) at the same AI training performance level.
They didn't get 10x only from using PTX!

I'd say this might also drive some changes to CUDA as NVIDIA obviously isn't going to like these headlines and what, $500B of market cap erased in a matter of hours?
Why do you think Nvidia doesn't have plenty of PTX and even lower-level code? That said, I do think Nvidia and others are probably looking at ways to integrate more optimizations into their frameworks and libraries, after this news hit the fan.

BTW, Anton already wrote a more detailed article summarizing DeepSeek's optimizations, a month ago:

As I said then, some of these optimizations sound so obvious that I'd be surprised if the other big players aren't doing comparable things. Others, like their techniques for reducing the precision and total amount of communication, seem more like where the unique IP might be.

Finally, as @Dr3ams pointed out, we're just taking their word for it that they used as few GPU hours as they claim. Several others have called these assertions into question.
 
the problem is Nvidia has no interest in optimizing CUDA... a more efficient CUDA equals fewer chips sold
This isn't true at all!! They routinely tout substantial AI performance optimizations in pretty much each new release of CUDA (usually focusing on the latest generation of hardware).

The error you're making is an assumption that the price of Nvidia's AI hardware is fixed. In practice, it's not. The greater the performance disparity vs. the competition, the more they can charge per unit. That matters quite a lot, when the rate of hardware production is limited by things like how fast their partners can produce HBM and do the packaging of the resulting dies.

Nvidia's goal is to maintain a perf/$ lead over the competition. So, the more perf they can squeeze out of their hardware, the higher prices they can justify charging for it. That works for now, at least.
 
Because I am almost 56 years old and everything was great without AI. I don't need AI on my devices, since I have no idea what it actually does, not to mention it's not intelligent, so it cannot really do much for me anyway.
In general, AI is good at solving many hard problems that have defied classical algorithmic approaches. This enables automation of problems where it's previously not been feasible and opens the door to greater optimization in areas that already were automated or semi-automated (e.g. chip placement & routing).

That's not speaking specifically of chatbots, but just AI in general. Chatbots & other generative AI are a whole other subject, due to the significant downsides they pose. AI is a little bit less of a "pure win" in that area, at least right now. If you had a chatbot that was at least as good as a human expert and could do additional research to further bolster its answers in much less time than a human could, that's definitely a game changer.

Think about this: before the internet and internet search engines came along, you probably thought "everything was great". However, after having the internet and the benefit of search engines (esp. after Google came onto the scene), I doubt you'd want to go back. That's the potential chatbots hold. How long it'll take for them to reach that level of maturity is another question.
 
Do you believe the full precision, full parameter, unmodified R1 would have provided a better answer to your Marie Antoinette question? Where I cry foul is in the idea the answer you got was a fair representation of R1's quality. It is also possible to "dumb" down llama3 far enough to make it useless, but that doesn't mean llama3 is a bad model.

Actually, I prefer knowledge over belief any day, but without the full ability to validate, that's hard.

I'm very skeptical, however; let me explain why:

The first question to answer is what DeepSeek themselves are running for inference via their API and app interface.

Unless I am mistaken, even if they trained 600+ billion parameters, for inference (and that includes refining) only 37B are active per token; the other ~630B remain unused for any given forward pass and its "full potential", as you call it.

That's great, because it means single-GPU inference is possible, which is not true for any of those cloud-only giant US models, which offer local variants mostly as appetizers or for propaganda.

Now, they can't run the base model; anyone who tried that recently quickly had to take it down again, because base models tend to be really "free thinkers".

They have to use the ones that are refined or tuned, and the common approach is to simply use other, competitor models to do that, to avoid duplicating cost. That's what they did with all of these refined models.

Now, I don't know exactly how they do that, but if refinement is some type of cross-product intersect, then the result can't have a larger amount of meaningful data than any of its sources. Whether you mix a 670B or a 37B model with a 7B one shouldn't matter; only 7B's worth can be meaningful in any way.

If you know any better, I'd love to know.

Next, of these models they can't really use anything in China that doesn't have government approval, so that pretty much narrows it down to one of the four Qwen variants for online use (that may soon be the same for the US and the EU, and is one of the motivations why the UAE started doing JAIS).

Perhaps the one they crossed with Llama 3 70B is good for 67B of "precision", but I could only run that with either 2-bit quantization on the GPU (not tested yet) or with 4/8-bit on the CPU, which I'll reserve for some very basic testing, given the "inhuman" token rates this would yield. I tried that a year ago with Llama on a 768GB machine and it was so little fun I stopped quickly.

I still wouldn't call this "belief", but mostly the hypothesis with the highest likelihood, and that not only puts DeepSeek-R1-Distill-Qwen-32B in the front spot for what they are running themselves, but also makes it the fullest currently attainable potential of DeepSeek and the yardstick by which to measure it.

And the main attraction there is that, unlike those giant OpenAI or Google cloud models, which are not only closed but also practically out of reach for anyone not a giant, it's attainable and locally usable by nearly anyone with a professional budget, including individuals.

That includes DeepSeek themselves, who clearly can't afford Microsoft- or Google-class infrastructure quite yet, which for me is another clear hint (still a hypothesis, not a belief) that they don't run anything with more potential in publicly accessible inference.

Which leads me to your other question: would that, or any other variant of DeepSeek, answer the question on Marie Antoinette better?

Well, I only discovered the "thought process log" rather late, and that has basically confirmed my error hypothesis.

I don't have the log that supported the "biological delusion" from the smaller model and the 4070 machine, nor have I tried to replicate it exactly. I'm very sure something similar will crop up soon, because DeepSeek doesn't do a basic human-biology fact check on every answer... unless perhaps you ask for it: that's one of its (so far) unique advantages.

I've since run quite a few queries with that 32B variant on my bigger 4090, and while its thought logs make mistakes much easier to track, so far that mostly just helps to evaporate its value proposition: the uncanny valley only grows wider.

LLMs are basically all associative in their original design, so their knowledge can't differentiate properly between hard facts and likelihoods. And they will hallucinate when their training data gets too thin.

Even worse, a knowledge tidbit might actually be answered correctly in one moment, but if you approach it from a different angle or context, it's no longer used. Or the model suddenly hallucinates things...

I've just tested using the Qwen 2.5-refined 32B model, and things aren't improving:

With a longer chat traversing the family tree, the fourth child of Louis XVI and Marie Antoinette, a daughter called Sophie who died as an infant, suddenly became "Charlotte Luise Philippine de France (1786–1839): The youngest child, she survived the Revolution and later married Prince Ferdinand of Saxe-Coburg-Gotha".

No such person ever existed; Prince Ferdinand did, but in fact married a Hungarian princess.

When asked directly from a clean chat ("Please give me a full list of Marie Antoinette's children"), Maria-Theresa Charlotte, their first child and the longest-lived, suddenly married the Duke of Parma, while in fact she was moved around as a marriage pawn, eventually married her cousin the Duke of Angoulême, and died childless: the female line of Marie Antoinette ended with her daughters.

...

Now, all of that is basically fringe knowledge with next to zero significance to today's problems. But the main issue is that these LLMs have very little understanding of what they "know" or how they "learned" it. They are essentially instantiated into inference like a well-schooled amnesiac, which severely limits their capacity for self-reflection.

DeepSeek may improve that significantly with its open "thought process", but again, it mostly widens the uncanny valley, and it will still go straight off a cliff in its conclusions when the data gets thin.

I haven't yet checked what happens if you put corrections into the prompt, or whether there is the ability to add facts and probabilities easily, but all of these actions significantly raise the effort of using these models, while their value-creation potential (beyond bubbles) remains very unclear.

And one of the few things even Elon Musk gets right is that "model fusion" via experts is likely to be as difficult as "sensor fusion".

Where he went delusional is pretty much everything else, including the idea that self-driving could be solved (at all, let alone more economically) using only AI vision data.
 
DeepSeek used the PTX ISA to fine-tune performance optimizations of its AI model training.

DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead: Read more
Perhaps one thing to mention here, because it might be a valuable discussion on its own:

PTX was used for training, and significant parts of that code were about the ability to logically maintain scale-up-coordinated wavefronts across small local NVLink clusters (logically a bit of scale-up, physically more scale-out) and larger InfiniBand- or Ethernet-based fabrics (scale-out), where that either requires big NVLink switches or plain wasn't done, probably because it was thought too much work.
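To make the general idea a bit more concrete, here is a deliberately generic CUDA-streams sketch of overlapping data movement with compute so the GPUs don't stall; this is not DeepSeek's DualPipe code, and the buffer and function names are made up for illustration:

```
#include <cuda_runtime.h>

// Generic sketch: overlap a host<->device copy (standing in for a fabric
// transfer) with compute on a second stream, so the GPU doesn't idle while
// waiting on data. DualPipe-style schedulers apply the same principle at
// much larger scale, across NVLink domains and InfiniBand/Ethernet fabrics.
__global__ void dummy_compute(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void pipelined_step(float* d_buf[2], float* h_chunks[2], int chunk_elems)
{
    // Note: h_chunks should be pinned (cudaMallocHost) for the copies to
    // actually overlap with kernel execution.
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Stage chunk 0, then compute on it while chunk 1 is already in flight.
    cudaMemcpyAsync(d_buf[0], h_chunks[0], chunk_elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);
    cudaStreamSynchronize(copy_stream);

    dummy_compute<<<(chunk_elems + 255) / 256, 256, 0, compute_stream>>>(d_buf[0], chunk_elems);
    cudaMemcpyAsync(d_buf[1], h_chunks[1], chunk_elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);   // overlaps with the kernel above

    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(copy_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}
```

Reportedly, DeepSeek went further and dedicated a slice of each GPU's SMs purely to communication, but the payoff is the same idea: keep the compute units fed instead of waiting on the fabric.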

It's never a freebie, and things could be very different for inference, where, thanks to the small active model size, this simulated scale-up seems much less of a requirement.

Load-steering might still be done using their "software wavefronts", but I'd guess it's too painful to custom-build where there is less pressure and smaller gains.

And I rate the probability of DS itself coming up with both the idea and the code for this as extremely low.

Even if all this is essentially a limited artificial domain, which seems to make that more possible, there are only singletons to train on. There can be no such thing as free "AI recursion".

Perhaps TH could make that a new thread?
 
They still proved that they could do pretty much as well for much less. And, as they open-sourced a lot of what they did, the cat is out of the bag. While most US tech firms claimed that China was still 5-10 years behind, that was disproven - and that's, to my knowledge, the main reason why US tech stocks crumbled: a big speculative bubble just got busted.
Tony Stark, meet Ivan Vanko (except it's the Mandarin).
Still, I am VERY surprised that they managed a 10-times improvement by going "assembly vs. compiled": from my experience, doing so will typically yield 3-5 times faster performance at best (on the project's overall performance), not ten. Looks like the CUDA compiler isn't very good at optimizing code, and/or doesn't allow hinting.
That's why I just tried to suggest a new thread/article on this software wavefront approach they had to invent to simulate scale-up on scale-out hardware.

This [mostly] isn't about the language abstraction layer on the inner loops of the code, which is what Mr. Palmer has in mind here and which might have given the gains he mentions.

As Timothy Prickett Morgan describes it in The Register, their DualPipe is about scaling out with something like a CICS or Tuxedo layer for training data, instead of doing layers of batches with disk or tape buffers.

You guys are probably all too young to even understand what I'm talking about, but discussions of workload scaling date back to Gene Amdahl's paper in the 1960s.
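For those who never ran into it, Amdahl's observation boils down to one formula; the parallel fractions below are made-up numbers, purely to show how quickly the serial remainder dominates:

```
#include <cstdio>

// Amdahl's law: if a fraction p of the work parallelizes across N workers,
// overall speedup = 1 / ((1 - p) + p / N). The serial part always wins eventually.
double amdahl_speedup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main()
{
    // Illustrative only: even at 95% parallel, 2048 GPUs give ~20x, not 2048x.
    printf("p=0.95,  N=8:    %.1fx\n", amdahl_speedup(0.95, 8));      // ~5.9x
    printf("p=0.95,  N=2048: %.1fx\n", amdahl_speedup(0.95, 2048));   // ~19.8x
    printf("p=0.999, N=2048: %.0fx\n", amdahl_speedup(0.999, 2048));  // ~672x
    return 0;
}
```

That's why the communication and scheduling layer matters so much more than shaving cycles in inner loops once you're spread across thousands of GPUs.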

In IT, very few things are ever really new; they just recycle at different scales and layers.
 
That's why I just tried to suggest a new thread/article on this software wavefront approach they had to invent to simulate scale-up on scale-out hardware.
You're free to start a new thread in one of the non-News forums. Maybe in one of these areas:

You can even link to it from here. I'm not sure how many will follow you there. FWIW, I mostly understand what you're saying, but just barely, in a couple spots.
 
I just checked LM Studio and the full, undistilled R1 671B is out there for download. It's split into 163 pieces of about 4.3GB each. By default, LM Studio only shows models that can run on your local computer, so on a 4090 the biggest is probably the Qwen 32B you referenced.
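Quick sanity check on what that download implies, just my own arithmetic on the numbers above:

```
#include <cstdio>

// 163 shards of ~4.3 GB each: roughly how much disk (and memory) that implies.
int main()
{
    const double shards = 163, shard_gb = 4.3;
    const double total_gb = shards * shard_gb;                                  // ~700 GB on disk
    printf("Total download: ~%.0f GB\n", total_gb);
    printf("RTX 4090 VRAM:   24 GB (~%.0fx too small)\n", total_gb / 24.0);     // ~29x
    return 0;
}
```

So "available for download" still means roughly 0.7 TB of weights, i.e. system-RAM or multi-GPU territory, long before you worry about token rates.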