Discussion CPU instruction set explanation thread

Could you use a translation layer, or would it just be simpler to rewrite it from the beginning?

It's a radically different language and while theoretically a translation layer could be used, it would be very inefficient.

So quick class on scalar vs vector computing.

Scalar instructions operate on a single value at a time. They consist of things like "load X data into memory", "store Y data into memory", "compare value X against value Y", "if the previous comparison was true, jump to this memory location", "add value X to memory address Y", "subtract value X from memory address Y", and so forth. These are the logical operations that define almost all computing: a single continuous line of logic. If you want to add a value to 4,000 elements, you do it one at a time in a loop.
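
A rough sketch of that in plain C++ (function and array names are made up for illustration) - literally one element per trip through the loop:

```cpp
#include <cstddef>

// Scalar approach: the CPU walks the array one element at a time.
// 'data' and 'count' are hypothetical names for this sketch.
void add_one_scalar(float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        data[i] += 1.0f;   // load, add, store -- one value per pass through the loop
    }
}
```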

Vector instructions operate on multiple values at a time. They consist of things like "add value X to registers A, B, C, D" and "multiply A and B, then add the result to memory locations W, X, Y, Z", along with a whole host of floating point and calculus operations. This is useful for doing math on large datasets.
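
For contrast, here is a sketch of the same "add one to everything" job written with x86 AVX intrinsics (one of those secondary vector extensions mentioned below). Each instruction touches eight floats at once; the function name is made up and the sketch assumes the count is a multiple of 8 to stay short:

```cpp
#include <immintrin.h>
#include <cstddef>

// Vector approach: each AVX instruction operates on 8 floats at once.
// Assumes count is a multiple of 8 to keep the sketch simple.
void add_one_vector(float* data, std::size_t count) {
    const __m256 ones = _mm256_set1_ps(1.0f);      // broadcast 1.0 into all 8 lanes
    for (std::size_t i = 0; i < count; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);      // load 8 values
        v = _mm256_add_ps(v, ones);                // add to all 8 in one instruction
        _mm256_storeu_ps(data + i, v);             // store 8 values
    }
}
```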

CPUs are designed to be extremely good at scalar instructions. They have secondary extensions to assist with vector instructions when necessary, but their bread and butter is scalar operations, which represent over 90% of computing workloads. GPUs are designed to be extremely good at vector instructions; they have scalar instructions too, but those are comparatively slow and mostly there to help figure out which datasets to execute on. So view CPUs as having 6-14 cores that are very good at long streams of scalar instructions, and GPUs as having a thousand-plus cores that are good at doing massive amounts of vector instructions.

With this in mind, code written for CPUs does not work very well on GPUs. Instead you need entirely separate code written specifically to take advantage of the GPU's vector nature.
 
It's theoretically possible (shader model 3 is technically Turing complete), but whichever way you did it (CPU code on GPU or GPU code on CPU) it would be painfully slow. That's not just because of the translation, but because of the fundamentally different approaches each takes: CPUs are good at doing lots of different things one after another, while GPUs are good at doing a lot of the same thing all at once. So regardless of whether you can translate the instructions, the process the CPU instructions describe will be ill-suited to running on a GPU and vice versa.
 
OK, I understand why something meant to run on a CPU can't really run on a GPU and vice versa, but my line of thinking goes like this: if something is meant to run on the CPU, but it would be more efficient (even though it wasn't designed) to run on the GPU, why can't the GPU "just figure it out", so to speak? For example, given a very long but otherwise simple math problem (10+5+10+5), the CPU would do it like 10+5=15, 15+10=25, 25+5=30, whereas the GPU might do (10+5) and (10+5) at the same time, then 15+15=30 (I think).
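
Something like this toy C++ sketch is what I have in mind (purely illustrative, not how real hardware actually schedules work):

```cpp
#include <cstddef>
#include <vector>

// Sequential: 10+5=15, 15+10=25, 25+5=30 -- each step depends on the previous one.
int sum_sequential(const std::vector<int>& v) {
    int total = 0;
    for (int x : v) total += x;
    return total;
}

// Tree-style: (10+5) and (10+5) are independent, then 15+15=30.
// On parallel hardware the independent pairs could run at the same time.
int sum_tree(std::vector<int> v) {
    while (v.size() > 1) {
        std::vector<int> next;
        for (std::size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(v[i] + v[i + 1]);        // independent pairwise adds
        if (v.size() % 2) next.push_back(v.back()); // carry an odd element forward
        v = next;
    }
    return v.empty() ? 0 : v[0];
}
```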
 
Ok, here's an example. Let's say you wanted to add 3 to a list of numbers (in reality you'd be doing something more complicated, but the general principle is the same).

If you were writing it for a CPU it would be something like this:
  1. Are there any numbers left? If not, stop.
  2. Get the next number.
  3. Add 3 to it.
  4. Store the result somewhere.
  5. Go back to 1.
If you were writing it for a GPU it would be more like this:
  1. Get the whole list of numbers.
  2. Add three to them.
  3. Give back the new list.
If you just translated the instructions from the CPU to the GPU a) you're not taking advantage of the fact it could do the whole thing in one go, and b) sending data to and from the GPU is waaaay slower than sending data to and from the CPU so doing it over and over is extremely inefficient.
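
If you sketched that GPU version in something like CUDA, it might look roughly like this (kernel and variable names are made up, error checking omitted):

```cuda
#include <cuda_runtime.h>

// One thread per element: every thread adds 3 to "its" number, all at the same time.
__global__ void add_three(float* numbers, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) numbers[i] += 3.0f;
}

void add_three_on_gpu(float* host_numbers, int count) {
    float* dev_numbers = nullptr;
    size_t bytes = count * sizeof(float);

    cudaMalloc(&dev_numbers, bytes);                                      // 1. get the whole list onto the GPU
    cudaMemcpy(dev_numbers, host_numbers, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (count + threads - 1) / threads;
    add_three<<<blocks, threads>>>(dev_numbers, count);                   // 2. add three to all of them at once

    cudaMemcpy(host_numbers, dev_numbers, bytes, cudaMemcpyDeviceToHost); // 3. give back the new list
    cudaFree(dev_numbers);
}
```

Notice the copies to and from the GPU happen once for the whole list, which is exactly what the naive line-by-line translation of the CPU loop would lose.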
 
That makes sense. Is there no way that a program could automatically adjust its own code to be optimized to run on either a CPU or a GPU? I know that there is not with our current technology, but is it theoretically possible?
 
I mean sure, in theory, but bear in mind that was a ridiculously oversimplified example - you'd potentially be looking at thousands of lines of code.
 
Yeah, that's fair. I don't know how you would go about doing it, since there would be so much work involved, and what would be the point? Why translate something when you can just design the program not to need it in the first place?
 
That makes sense. Is there no way that a program could automatically adjust its own code to be optimized to run on either a CPU or a GPU? I know that there is not with our current technology, but is it theoretically possible?

What you are talking about is called dynamic recompilation. It can be very powerful going from one scalar instruction set to another, or from one vector set to another, but trying to cross between them is just asking for pain. Both are such radically different approaches to a problem that the entire design and logic flow is tailored to the strengths and weaknesses of that approach.

So no, you can't really run Crysis on a GPU.
 
So no, you can't really run Crysis on a GPU.
You mean on the CPU? Or what do you mean by run on a GPU? How exactly do GPUs that don't support RT run it? I also seem to remember hearing that Roblox does rendering on the CPU. How does that work without integrated graphics? I know that you could code it so that it runs, but I don't understand how it doesn't run extremely poorly.
 
In the olden days all games did their rendering on the CPU - there was no other option. Even when 3D accelerators started to appear, CPU rendering was still an option, and the accelerators only handled the final "draw triangles" stage anyway; the CPU still did everything else. (I think Wipeout was the first mainstream 3D PC game released that required 3D acceleration, but it's hard to pin down exactly.) A game like Quake managed to run perfectly well on a CPU that's several thousand times slower than anything from the last few years!
 
Yes, but how? I don't understand how a 3D scene can be rendered on a CPU without integrated graphics.
 
Right, but as far as instruction sets go, the languages of CPUs and GPUs are very different (as others have mentioned here), so my question is: how exactly does the CPU even know what to do with a 3D workload? I know there are translation layers, but they tend to be horribly inefficient. Side note: I hate vague answers; they drive me nuts.
 
A 3D workload is not a "3D workload" to the CPU (or the GPU); it's just floating point calculations. Floating point numbers are basically numbers that are not integers but have decimal points. The GPU simply has more units that can handle floating point, and its instruction set is more tailor-made for that.

Doing it on the CPU would just use different instructions and take a lot longer, because CPUs are not custom-made for this job.
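
For example, rotating a single 3D point is nothing special to the processor - it's a handful of floating point multiplies and adds. A rough sketch (type and function names made up for illustration):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Rotating a point around the Z axis: to the CPU (or GPU) this is just
// a few floating point multiplications and additions, nothing "3D" about it.
Vec3 rotate_z(Vec3 p, float angle_radians) {
    float c = std::cos(angle_radians);
    float s = std::sin(angle_radians);
    return { p.x * c - p.y * s,
             p.x * s + p.y * c,
             p.z };
}
```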
 
My other question is about GPUs that don't support RT: how do they run it, presumably in software? I realize that it's a lot of math, but why is running RT so incredibly taxing on GPUs that don't support it? Lack of specialized instructions, or something else?
 
RT cores do one thing, and they do it fast: check whether a ray intersects a box or triangle. That takes a couple of dozen floating point calculations to do "normally", but because RT cores don't (and can't) do anything else, they can be engineered to do it really, really quickly. Without RT cores the GPU has to perform all those calculations one at a time, which a) is slower, and b) takes resources that could be used for something else.
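
For a rough sense of what those "couple of dozen floating point calculations" look like, here's a sketch of a standard ray vs. axis-aligned box (slab) test in plain C++ - an RT core does the equivalent of this in fixed-function hardware (struct and function names are made up, and edge cases like zero direction components are ignored):

```cpp
#include <algorithm>

struct Ray { float ox, oy, oz;  float dx, dy, dz; };          // origin + direction
struct Box { float minx, miny, minz, maxx, maxy, maxz; };

// Classic "slab" test: in software this is a pile of divides, multiplies,
// mins and maxes per ray, per box.
bool ray_hits_box(const Ray& r, const Box& b) {
    float tx1 = (b.minx - r.ox) / r.dx, tx2 = (b.maxx - r.ox) / r.dx;
    float ty1 = (b.miny - r.oy) / r.dy, ty2 = (b.maxy - r.oy) / r.dy;
    float tz1 = (b.minz - r.oz) / r.dz, tz2 = (b.maxz - r.oz) / r.dz;

    float tmin = std::max({std::min(tx1, tx2), std::min(ty1, ty2), std::min(tz1, tz2)});
    float tmax = std::min({std::max(tx1, tx2), std::max(ty1, ty2), std::max(tz1, tz2)});

    return tmax >= tmin && tmax >= 0.0f;   // the three slab intervals overlap = a hit
}
```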
 
When it comes to software rendering, another way to look at things is that, at the end of the day, all that calculation just produces a bitmap: pixel 1's RGB value, pixel 2's RGB value, and so on. You're doing it for millions of pixels, dozens of times a second, while accounting for all the things that make a 3D scene a 3D scene.

The programming to achieve that behind the scenes is just different for CPUs and GPUs. It's also why every time you upgraded your computer in the 90s you doubled or tripled (or more) your performance: CPUs and GPUs were getting vastly more dense and gaining more features and instruction sets. Today we are pushing the limits of silicon, so GPUs are having to get bigger and bigger, and now multi-chip, and CPUs have had to go multi-core.
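
To make the "it's just a bitmap" point concrete, here's a toy sketch that fills a frame buffer with RGB values - a real software renderer would compute each pixel's colour from geometry, textures, and lighting instead of a simple gradient (names are made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The end product of all rendering: a width*height grid of RGB values.
std::vector<std::uint8_t> render_frame(int width, int height) {
    std::vector<std::uint8_t> framebuffer(width * height * 3);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::size_t i = static_cast<std::size_t>(y * width + x) * 3;
            framebuffer[i + 0] = static_cast<std::uint8_t>(x * 255 / width);  // R
            framebuffer[i + 1] = static_cast<std::uint8_t>(y * 255 / height); // G
            framebuffer[i + 2] = 64;                                          // B
        }
    }
    return framebuffer;
}
```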
 
Not to mention, older GPUs that don't support RT have less than the ideal amount of VRAM needed to brute-force it.
 
Today we are pushing the limits of silicon, so GPUs are having to get bigger and bigger, and now multi-chip, and CPUs have had to go multi-core.
I mentioned in another thread that I suspect that future generations of cards are going to be less about adding more raw processing power, and more about improving the AI side of things - why add a few thousand more CUDA cores to scrape out an extra 5% performance when a few hundred Tensor cores can effectively double it? I know some people complain that they're not "real" pixels or frames and that it's a cheat, but pretty much every aspect of a modern rasteriser is a cheat already - parallax occlusion mapping, tessellation, screen space effects, they're all just as "fake" as DLSS!
 
I know GPUs have the option to do compute, but I'm not sure exactly how that works.
The way GPUs perform computational tasks is pretty similar to the way they perform interactive rendering. You need to copy the assets (models, textures, etc.) over the PCIe bus (if dGPU) and into their memory.

Then instruct them what to do with it. These instructions include commands relating to how they should interpret the data you just transferred. The commands are either calls into an existing API, custom program snippets, or usually a mix of both.

Ultimately, even calls into an existing API will turn into instructions for the GPU's execution units to execute. The instructions are similar to CPU machine instructions, but also fairly specific to the GPU in question. AMD has kindly published open documentation of their GPUs' instruction sets, so you can compare and contrast them with CPUs.

So, what it boils down to is that it's a lot of work to make a program harness the power of a GPU. There are some tools, frameworks, and libraries which try to streamline the process by automating some portion of that stuff, but suffice to say that GPU acceleration usually needs to be designed and implemented into a program and it tends to be a non-trivial effort.

Wikipedia has an example of using OpenCL to compute an FFT, which includes all the elements I mentioned above.
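
As a bare-bones sketch of that sequence in CUDA (hypothetical buffer and function names, no error checking), here's the same multiply-and-add done two ways: once through an existing API (cuBLAS) and once with a custom kernel.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Custom snippet: our own kernel, compiled into instructions for the GPU's execution units.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void run_saxpy(int n, float a, const float* host_x, float* host_y, bool use_cublas) {
    float *dev_x = nullptr, *dev_y = nullptr;
    cudaMalloc(&dev_x, n * sizeof(float));            // data has to cross the PCIe bus first
    cudaMalloc(&dev_y, n * sizeof(float));
    cudaMemcpy(dev_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, host_y, n * sizeof(float), cudaMemcpyHostToDevice);

    if (use_cublas) {
        // Option 1: call into an existing API -- cuBLAS issues the GPU work for us.
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &a, dev_x, 1, dev_y, 1);
        cublasDestroy(handle);
    } else {
        // Option 2: launch our own kernel -- ultimately the same kind of GPU instructions.
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, dev_x, dev_y);
    }

    cudaMemcpy(host_y, dev_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
    cudaFree(dev_y);
}
```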
 
You mean on the CPU? Or what do you mean by run on a GPU? How exactly do GPUs that don't support RT run it? I also seem to remember hearing that Roblox does rendering on the CPU. How does that work without integrated graphics? I know that you could code it so that it runs, but I don't understand how it doesn't run extremely poorly.

It's a joke from the earlier days, when people would ask "but can it run Crysis?" whenever some new technology came up.

Using a CUDA / OpenCL translation layer it would be possible to run scalar x86 code on something like an nVidia GPU, but wow, the performance penalty would be massive. Translating-a-French-essay-into-interpretive-dance-and-then-into-English massive.

And you most certainly can render 3D scenes on a pure CPU; it's just really slow. There are two main ways to render a three-dimensional scene onto a two-dimensional plane (your screen): rasterization and ray tracing. Rasterization uses geometry to calculate the angles at which every object would be visible from a given point of view, then rotates and stretches texture data, applies lighting values to that data, and finally renders it all into per-pixel RGB values. This would be very, very slow on a CPU; SSE2 / AVX would speed it up, but it would still be slow.
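
The geometry part of that is just arithmetic. For instance, a bare-bones perspective projection of a single vertex onto the screen looks roughly like this (a sketch that ignores clipping and matrices; names are made up for illustration):

```cpp
struct Vertex   { float x, y, z; };        // position in camera space, z > 0 in front of the camera
struct PixelPos { int x, y; };

// Perspective projection: divide by depth, then map to pixel coordinates.
// Real rasterizers use matrices and handle clipping, but the core idea is
// this handful of multiplies and divides per vertex.
PixelPos project(Vertex v, float focal_length, int screen_w, int screen_h) {
    float sx = (v.x * focal_length) / v.z;                 // perspective divide
    float sy = (v.y * focal_length) / v.z;
    return { static_cast<int>(sx + screen_w / 2.0f),       // shift origin to screen centre
             static_cast<int>(screen_h / 2.0f - sy) };     // flip Y so +y is "up"
}
```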

I present Doom, the super popular 3D First Person Shooter done entirely in software and running on DOS.

[Screenshot: Doom running with its software renderer under DOS]


All that geometry and texture / lighting data crunching is a perfect use case for dedicated vector coprocessors, which were sold throughout the '70s, '80s, and '90s.

https://en.wikipedia.org/wiki/Vector_processor

Eventually someone made special-purpose vector coprocessors that were extremely good at doing the math related to rasterization - over a hundred times more efficient than a general-purpose scalar CPU. nVidia was one such company with the Riva 128 / TNT / TNT2, along with ATI, 3Dlabs (Permedia), and some others. Later nVidia added hardware Transform and Lighting (T&L) support and called the product a GeForce Graphics Processing Unit (GPU). It was a big jump, going from vector coprocessor to a pure vector processor.
 