News DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead

People should be concerned about rampant AI proliferation without adequate safeguards, because it is very prone to hallucinations. How can you be certain the output is reliable?
People have reason to be concerned where AI failure can harm people: for example, driving a semi truck at 70 MPH, automating air traffic control, flying airplanes, or writing code for applications where failure can hurt people.
I agree. We are entering a life-changing transition right now with the same ignorant glee as we did with computers, except now “the computers” will autonomously control physical objects in real life, be relied upon to maintain infrastructure, or make important decisions in roles that were classically too important or too risky to let a computer handle (for example, an AI dedicated to ensuring Mutually Assured Destruction by targeting and launching a retaliatory nuclear strike if all government leaders were to die in a first strike… aka Skynet), making them a threat to human life if they were to malfunction or be logic-poisoned.
Do y’all remember the Stuxnet worm that spread across Windows computers worldwide just to find a way into a classified Iranian underground facility, so it could slowly and methodically sabotage the physical equipment used in the uranium enrichment plant within, while going completely unnoticed for months before its discovery around 2010? Well, take all of that capability and imagine someone, or some other AI, creates an AI virus that slowly spreads across the Earth, infecting all AI invisibly until it is activated. Who knows what the end goal would be, but if our existing capabilities to defend mature and well-understood tech like Windows computers from zero-day breaches are anything to go by, AI will be F’ed with, and the repercussions will be physical and real.
 
From an interview with Anduril founder Palmer Luckey
DeepSeek did not release the full costs of both models it developed, and the media is ignoring that a significant portion of the AI startup's infrastructure costs is still unknown.

"I think the problem is they put out that number specifically to harm U.S. companies," Luckey said. "You had a lot of useful idiots in U.S. media kind of just mindlessly reporting that that's the case, and neither China nor the media nor DeepSeek has any kind of incentive to correct the record as a lot of U.S. companies like Nvidia crashed to the tune of hundreds of billions of dollars.

"I don't think that people should take what they're saying at face value, and they should realize that there are a lot of people cheering for the United States to fail," he said. "There's people who are clearly cheering for our tech companies to fail and, obviously, President Trump to fail. It's a shame that so many of them are in the United States.

"There's a reason they put out the news that way, and if the stock market is any indication, it's accomplishing exactly what they hoped to," Luckey added. "So, look, we can recognize that Chinese AI is a real competitive threat without losing our minds over it and falling for CCP [Chinese Communist Party] propaganda."
 
From an interview with Anduril founder Palmer Luckey
They still proved that they could do pretty much as well for much less. And, as they open-sourced a lot of what they did, the cat is out of the bag. While most US tech firms claimed that China was still 5-10 years behind, that was disproven - and that's, to my knowledge, the main reason why US tech stocks crumbled: a big speculative bubble just got busted.
Tony Stark, meet Ivan Vanko (except it's the Mandarin).
Still, I am VERY surprised that they managed a 10-times improvement by going "assembly vs. compiled": from my experience, doing so will typically yield 3-5 times faster performance at best (on the project's overall performance), not ten. Looks like the CUDA compiler isn't very good at optimizing code, and/or doesn't allow hinting.
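For anyone wondering what "dropping down to PTX" even looks like in practice, here is a trivial, hypothetical sketch (nothing to do with DeepSeek's actual kernels): CUDA C++ lets you embed PTX instructions directly with asm volatile, so you only hand-write the few spots where you don't trust the compiler.

```
// Toy CUDA kernel with one hand-written PTX instruction inside it.
// Purely illustrative -- not DeepSeek's code, just the general technique.
__global__ void scale_add(const float* a, const float* b, float* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // Inline PTX: fused multiply-add, r = a[i] * s + b[i].
        // The compiler would normally emit this anyway; hand-writing PTX only
        // pays off for things like scheduling, cache hints, or warp shuffles.
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(r)
                     : "f"(a[i]), "f"(s), "f"(b[i]));
        out[i] = r;
    }
}
```

The bigger wins usually come from controlling scheduling, register pressure and warp-level communication rather than from rewriting straightforward arithmetic, which is why blanket "10x from assembly" readings deserve some skepticism.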
 
Well, you better tell DeepSeek that, to stop them from shooting themselves in the foot.

Because it's them who did it, not me.

And the fact that a 671B model can run with only 37B parameters active is at the very core of their appeal.

The fine-tuning of the various other models, Qwen 7B in this case, is also of their own making, so any "foul" should be theirs.

There is one extra actor in the chain, which is whoever ran the INT4 quantization, which certainly can have a negative impact on quality, but is also expected standard practice.

That's why I tried to find a way to run the model at original precision, which perhaps isn't just BF16, but perhaps a mix of precisions. I don't know if LM Studio can't handle BF16 or what's going on. In any case, I couldn't just download it and was limited to the INTx quantized GGUF variants available for download from Hugging Face.

I did try the DeepSeek-R1-Distill-Qwen-32B and the DeepSeek-R1-Distill-Llama-70B variants, too, but I only have an RTX 4090, which struggles with their size. Playing with the various quantizations took time and so far results aren't promising: I mostly get garbage, and when layers wind up in CPU RAM, token rates drop off the typical bandwidth cliff, killing what little time I can dedicate to these experiments.
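To put rough numbers on why running at original precision is out of reach locally, here's a back-of-the-envelope sketch (my own arithmetic, not anything from DeepSeek's documentation), using the publicly cited ~671B total / ~37B active parameter counts:

```
#include <cstdio>

// Back-of-the-envelope weight-memory estimates for a ~671B-parameter MoE
// with ~37B parameters active per token. Rough numbers, weights only.
int main()
{
    const double total_params  = 671e9;
    const double active_params = 37e9;
    const double gib = 1024.0 * 1024.0 * 1024.0;

    // bytes per weight: BF16/FP16 = 2, FP8/INT8 = 1, INT4 = 0.5
    const double bf16 = 2.0, int8 = 1.0, int4 = 0.5;

    printf("All weights, BF16:      %6.0f GiB\n", total_params * bf16 / gib);   // ~1250 GiB
    printf("All weights, INT8:      %6.0f GiB\n", total_params * int8 / gib);   // ~625 GiB
    printf("All weights, INT4:      %6.0f GiB\n", total_params * int4 / gib);   // ~312 GiB
    printf("Active per token, BF16: %6.0f GiB\n", active_params * bf16 / gib);  // ~69 GiB
    return 0;
}
```

Even though only ~37B parameters fire per token, all of the experts still have to sit somewhere in memory, which is why a 24 GB RTX 4090 only fits the heavily quantized distill variants.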
Do you believe the full precision, full parameter, unmodified R1 would have provided a better answer to your Marie Antoinette question? Where I cry foul is in the idea the answer you got was a fair representation of R1's quality. It is also possible to "dumb" down llama3 far enough to make it useless, but that doesn't mean llama3 is a bad model.
 
Even if it's difficult to maintain and implement, it's clearly worth it when talking about a 10x efficiency gain; imagine a $10 Bn datacenter only costing, let's say, $2 Bn (still accounting for non-GPU-related costs) at the same AI training performance level.
They didn't get 10x only from using PTX!

I'd say this might also drive some changes to CUDA as NVIDIA obviously isn't going to like these headlines and what, $500B of market cap erased in a matter of hours?
Why do you think Nvidia doesn't have plenty of PTX and even lower-level code? That said, I do think Nvidia and others are probably looking at ways to integrate more optimizations into their frameworks and libraries, after this news hit the fan.

BTW, Anton already wrote a more detailed article summarizing DeepSeek's optimizations, a month ago:

As I said then, some of these optimizations sound so obvious that I'd be surprised if the other big players aren't doing comparable things. Others, like their techniques for reducing the precision and total amount of communication, seem more like where the unique IP might be.

Finally, as @Dr3ams pointed out, we're just taking their word for it that they used as few GPU hours as they claim. Several others have called these assertions into question.
 
the problem is Nvidia has no interest in optimizing CUDA... a more efficient CUDA equals fewer chips sold
This isn't true at all!! They routinely tout substantial AI performance optimizations in pretty much each new release of CUDA (usually focusing on the latest generation of hardware).

The error you're making is an assumption that the price of Nvidia's AI hardware is fixed. In practice, it's not. The greater the performance disparity vs. the competition, the more they can charge per unit. That matters quite a lot, when the rate of hardware production is limited by things like how fast their partners can produce HBM and do the packaging of the resulting dies.

Nvidia's goal is to maintain a perf/$ lead over the competition. So, the more perf they can squeeze out of their hardware, the higher prices they can justify charging for it. That works for now, at least.
 
Because I am almost 56 years old and everything was great without AI. I don't need AI on my devices, since I have no idea what it actually does, not to mention it's not intelligent, so it cannot really do much for me anyway.
In general, AI is good at solving many hard problems that have defied classical algorithmic approaches. This enables automation of problems where it's previously not been feasible and opens the door to greater optimization in areas that already were automated or semi-automated (e.g. chip placement & routing).

That's not speaking specifically of chatbots, but just AI in general. Chatbots & other generative AI are a whole other subject, due to the significant downsides they pose. AI is a little bit less of a "pure win" in that area, at least right now. If you had a chatbot that was at least as good as a human expert and could do additional research to further bolster its answers in much less time than a human could, that's definitely a game changer.

Think about this: before the internet and internet search engines came along, you probably thought "everything was great". However, after having the internet and the benefit of search engines (esp. after Google came onto the scene), I doubt you'd want to go back. That's the potential chatbots hold. How long it'll take for them to reach that level of maturity is another question.
 
Do you believe the full precision, full parameter, unmodified R1 would have provided a better answer to your Marie Antoinette question? Where I cry foul is in the idea the answer you got was a fair representation of R1's quality. It is also possible to "dumb" down llama3 far enough to make it useless, but that doesn't mean llama3 is a bad model.

Actually, I prefer knowledge over belief any day, but without the full ability to validate, that's hard.

I'm very skeptical, however; let me explain why:

The first question to answer is what DeepSeek themselves are running for inference via their API and app interface.

Unless I am mistaken, even if they trained 600+ billion parameters, for inference (and that includes refining) only 37B are active per token; the other ~630B remain unused for any given forward pass and its "full potential", as you call it.

That's great, because it means single-GPU inference is possible, which is not true for any of those cloud-only giant US models, which offer local variants mostly as appetizers or for propaganda.

Now, they can't run the base model; anyone who tried that recently quickly had to take it down again, because base models tend to be really "free thinkers".

They have to use the ones that are refined or tuned, and the common approach is to simply use other, competitor models to do that, to avoid duplicating cost. That's what they did with all of these refined models.

Now, I don't know exactly how they do that, but if refinement is some type of cross-product intersect, then the result can't have a larger amount of meaningful data than any of its sources. Whether you mix a 670B or a 37B model with a 7B one shouldn't matter; only 7B's worth can be meaningful in any way.

If you know any better, I'd love to know.

Next, of these models they can't really use anything in China that doesn't have government approval, so that pretty much narrows it down to one of the four Qwen variants for online use (that may soon be the same for the US and the EU, and is one of the motivations why the UAE started doing JAIS).

Perhaps the one they crossed with Llama 3 70B is good for 67B of "precision", but I could only run that with either 2-bit quantization on the GPU (not tested yet) or with 4/8-bit on the CPU, which I'll reserve for some very basic testing, given the "inhuman" token rates this would yield. I tried that a year ago with Llama on a 768GB machine and it was so little fun I stopped quickly.

I still wouldn't call this "belief", but mostly the hypothesis with the highest likelihood, and that not only puts DeepSeek-R1-Distill-Qwen-32B in the front spot for what they are running themselves, but also makes it the fullest currently attainable potential of DeepSeek and the yardstick by which to measure it.

And the main attraction there is that, unlike those giant OpenAI or Google cloud models, which are not only closed but also practically out of reach for anyone not a giant, it's attainable and locally usable by nearly anyone with a professional budget, including individuals.

That includes DeepSeek themselves, who clearly can't afford Microsoft- or Google-class infrastructure quite yet, which for me is another clear hint (still a hypothesis, not a belief) that they don't run anything with more potential in publicly accessible inference.

Which leads me to your other question: would that, or any other variant of DeepSeek, answer the question on Marie Antoinette better?

Well, I only discovered the "thought process log" rather late, and that has basically confirmed my error hypothesis.

I don't have the log that supported the "biological delusion" from the smaller model and the 4070 machine, nor have I tried to replicate it exactly. I'm very sure something similar will crop up soon, because DeepSeek doesn't do a basic human-biology fact check on every answer... unless perhaps you ask for it: that's one of its (so far) unique advantages.

I've since run quite a few queries with that 32B variant on my bigger 4090, and while its thought logs make mistakes much easier to track, so far that mostly just helps to evaporate its value proposition: the uncanny valley only grows wider.

LLMs are basically all associative in their original design, so their knowledge can't differentiate properly between hard facts and likelihoods. And they will hallucinate when their training data gets too thin.

Even worse, a knowledge tidbit might actually be answered correctly in one moment, but if you approach it from a different angle or context, it's no longer used. Or the model suddenly hallucinates things...

I've just tested using the Qwen 2.5-refined 32B model, and things aren't improving:

With a longer chat traversing the family tree, the fourth child of Louis XVI and Marie Antoinette, a daughter called Sophie who died as an infant, suddenly became "Charlotte Luise Philippine de France (1786–1839): The youngest child, she survived the Revolution and later married Prince Ferdinand of Saxe-Coburg-Gotha".

No such person ever existed; Prince Ferdinand did, but in fact married a Hungarian princess.

When asked directly from a clean chat ("Please give me a full list of Marie Antoinette's children"), Maria-Theresa Charlotte, their first child and the longest-lived, suddenly married the Duke of Parma, while in fact she was moved around as a marriage pawn, eventually married her cousin the Duke of Angoulême, and died childless: the female line of Marie Antoinette ended with her daughters.

...

Now, all of that is basically fringe knowledge with next to zero significance to today's problems. But the main issue is that these LLMs have very little understanding of what they "know" or how they "learned" it. They are essentially instantiated into inference like a well-schooled amnesiac, which severely limits their capacity for self-reflection.

DeepSeek may improve that significantly with its open "thought process", but again, it mostly widens the uncanny valley, and it will still go straight off a cliff in its conclusions when the data gets thin.

I haven't yet checked what happens if you put corrections into the prompt, or whether there is the ability to add facts and probabilities easily, but all of these actions significantly raise the effort of using these models, while their value-creation potential (beyond bubbles) remains very unclear.

And one of the few things even Elon Musk gets right is that "model fusion" via experts is likely to be as difficult as "sensor fusion".

Where he went delusional is pretty much everything else, including the idea that self-driving could be solved (at all, let alone more economically) using only AI vision data.
 
DeepSeek used the PTX ISA to fine-tune performance optimizations of its AI model training.

DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead: Read more
Perhaps one thing to mention here, because it might be a valuable discussion on its own:

PTX was used for training, and significant parts of that code were about the ability to logically maintain scale-up-coordinated wavefronts across small local NVLink clusters (logically a bit of scale-up, physically more scale-out) and larger InfiniBand- or Ethernet-based fabrics (scale-out), where that either requires big NVLink switches or plain wasn't done, probably because it was thought too much work.
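To make the general idea a bit more concrete, here is a deliberately generic CUDA-streams sketch of overlapping data movement with compute so the GPUs don't stall; this is not DeepSeek's DualPipe code, and the buffer and function names are made up for illustration:

```
#include <cuda_runtime.h>

// Generic sketch: overlap a host<->device copy (standing in for a fabric
// transfer) with compute on a second stream, so the GPU doesn't idle while
// waiting on data. DualPipe-style schedulers apply the same principle at
// much larger scale, across NVLink domains and InfiniBand/Ethernet fabrics.
__global__ void dummy_compute(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void pipelined_step(float* d_buf[2], float* h_chunks[2], int chunk_elems)
{
    // Note: h_chunks should be pinned (cudaMallocHost) for the copies to
    // actually overlap with kernel execution.
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // Stage chunk 0, then compute on it while chunk 1 is already in flight.
    cudaMemcpyAsync(d_buf[0], h_chunks[0], chunk_elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);
    cudaStreamSynchronize(copy_stream);

    dummy_compute<<<(chunk_elems + 255) / 256, 256, 0, compute_stream>>>(d_buf[0], chunk_elems);
    cudaMemcpyAsync(d_buf[1], h_chunks[1], chunk_elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);   // overlaps with the kernel above

    cudaStreamSynchronize(compute_stream);
    cudaStreamSynchronize(copy_stream);
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}
```

Reportedly, DeepSeek went further and dedicated a slice of each GPU's SMs purely to communication, but the payoff is the same idea: keep the compute units fed instead of waiting on the fabric.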

It's never a freebie, and things could be very different for inference, where, thanks to the small active model size, this simulated scale-up seems much less of a requirement.

Load-steering might still be done using their "software wavefronts", but I'd guess it's too painful to custom-build where there is less pressure and smaller gains.

And I rate the probability of DS itself coming up with both the idea and the code for this as extremely low.

Even if all this is essentially a limited artificial domain, which seems to make that more possible, there are only singletons to train on. There can be no such thing as free "AI recursion".

Perhaps TH could make that a new thread?
 
They still proved that they could do pretty much as well for much less. And, as they open-sourced a lot of what they did, the cat is out of the bag. While most US tech firms claimed that China was still 5-10 years behind, that was disproven - and that's, to my knowledge, the main reason why US tech stocks crumbled: a big speculative bubble just got busted.
Tony Stark, meet Ivan Vanko (except it's the Mandarin).
Still, I am VERY surprised that they managed a 10-times improvement by going "assembly vs. compiled": from my experience, doing so will typically yield 3-5 times faster performance at best (on the project's overall performance), not ten. Looks like the CUDA compiler isn't very good at optimizing code, and/or doesn't allow hinting.
That's why I just tried to suggest a new thread/article on this software wavefront approach they had to invent to simulate scale-up on scale-out hardware.

This [mostly] isn't about the language abstraction layer on the inner loops of the code, which is what Mr. Palmer has in mind here and which might have given the gains he mentions.

As Timothy Prickett Morgan describes it in The Register, their DualPipe is about scaling out with something like a CICS or Tuxedo layer for training data, instead of doing layers of batches with disk or tape buffers.

You guys are probably all too young to even understand what I'm talking about, but discussions of workload scaling date back to Gene Amdahl's paper in the 1960s.
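For those who never ran into it, Amdahl's observation boils down to one formula; the parallel fractions below are made-up numbers, purely to show how quickly the serial remainder dominates:

```
#include <cstdio>

// Amdahl's law: if a fraction p of the work parallelizes across N workers,
// overall speedup = 1 / ((1 - p) + p / N). The serial part always wins eventually.
double amdahl_speedup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main()
{
    // Illustrative only: even at 95% parallel, 2048 GPUs give ~20x, not 2048x.
    printf("p=0.95,  N=8:    %.1fx\n", amdahl_speedup(0.95, 8));      // ~5.9x
    printf("p=0.95,  N=2048: %.1fx\n", amdahl_speedup(0.95, 2048));   // ~19.8x
    printf("p=0.999, N=2048: %.0fx\n", amdahl_speedup(0.999, 2048));  // ~672x
    return 0;
}
```

That's why the communication and scheduling layer matters so much more than shaving cycles in inner loops once you're spread across thousands of GPUs.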

In IT, very few things are ever really new; they just recycle at different scales and layers.
 
That's why I just tried to suggest a new thread/article on this software wavefront approach they had to invent to simulate scale-up on scale-out hardware.
You're free to start a new thread in one of the non-News forums. Maybe in one of these areas:

You can even link to it from here. I'm not sure how many will follow you there. FWIW, I mostly understand what you're saying, but just barely, in a couple spots.
 
I just checked LM Studio and the full, undistilled R1 671B is out there for download. It's split into 163 pieces of about 4.3GB each. By default, LM Studio only shows models that can run on your local computer, so on a 4090 the biggest is probably the Qwen 32B you referenced.
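Quick sanity check on what that download implies, just my own arithmetic on the numbers above:

```
#include <cstdio>

// 163 shards of ~4.3 GB each: roughly how much disk (and memory) that implies.
int main()
{
    const double shards = 163, shard_gb = 4.3;
    const double total_gb = shards * shard_gb;                                  // ~700 GB on disk
    printf("Total download: ~%.0f GB\n", total_gb);
    printf("RTX 4090 VRAM:   24 GB (~%.0fx too small)\n", total_gb / 24.0);     // ~29x
    return 0;
}
```

So "available for download" still means roughly 0.7 TB of weights, i.e. system-RAM or multi-GPU territory, long before you worry about token rates.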