Discussion CPU instruction set explanation thread

Order 66

I was looking at this thread, https://forums.tomshardware.com/thr...ench-6-benchmark.3835713/page-2#post-23197213 and while I enjoy watching the debate progress, I don't really understand anything about instruction sets (in this case AVX), or really anything beyond the hardware aspects of PCs in general. I know people will say to google it, but I am really looking for it to be dumbed down a little bit, so that I can grasp the idea in the first place. I think I would enjoy the debate between @TerryLaze and @bit_user if I understood a bit more of what they were talking about.
 
Instructions are basically little programs inside the CPU that do commonly used things, so that coders don't have to write these things in ones and zeroes; they just tell the CPU to execute whatever "program" they want.
For the first CPUs these were simple things like holding several numbers in memory, adding them up, moving them around, and so on; now it's crazy things like AVX and AI.
Taken together, all of these instructions are the instruction set: all of these little programs that you don't need to write from the ground up every time.
Instead of this
@ 1110|00|0|0100|0|0001|0000|000000000010
you only write this
ADD R0, R1, R2
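
And to connect it up one more level: in a language like C you don't even write the ADD yourself. A minimal sketch (variable names are just illustrative); the compiler turns the addition into an ADD instruction much like the one above:

#include <stdio.h>

int main(void) {
    int r1 = 40, r2 = 2;
    int r0 = r1 + r2;   /* the compiler emits an ADD for this line (at least when it doesn't constant-fold it) */
    printf("%d\n", r0); /* prints 42 */
    return 0;
}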

 
Instructions are basically little programs inside the CPU that do commonly used things, so that coders don't have to write these things in ones and zeroes; they just tell the CPU to execute whatever "program" they want.
For the first CPUs these were simple things like holding several numbers in memory, adding them up, moving them around, and so on; now it's crazy things like AVX and AI.
Taken together, all of these instructions are the instruction set: all of these little programs that you don't need to write from the ground up every time.
Instead of this
@ 1110|00|0|0100|0|0001|0000|000000000010
you only write this
ADD R0, R1, R2

OK, but what exactly is AVX and what is it used for? I seem to remember hearing something about how Cinebench used to struggle on AMD CPUs because of something to do with AVX. I don't remember exactly.
 
I'm going to suggest you try diving in, head first, and look at the Intel reference docs. The one we've been referencing is as good a place as any, to start. Each instruction has a description which says what it does:

For instance, ADDPD says:

"Adds two, four or eight packed double precision floating-point values from the first source operand to the second
source operand, and stores the packed double precision floating-point result in the destination operand."
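
To see what that looks like from C rather than raw assembly, compilers expose the instruction through intrinsics. Here's a minimal sketch of my own (not from the Intel docs; it assumes an AVX-capable CPU and a compiler flag like -mavx):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* four packed double-precision values per 256-bit register */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set_pd(40.0, 30.0, 20.0, 10.0);
    __m256d c = _mm256_add_pd(a, b);  /* compiles to VADDPD: four adds in one instruction */

    double out[4];
    _mm256_storeu_pd(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}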

A more convenient reference of x86 vector instructions is here:

If you need a more gentle introduction to assembly language, I guess the Wikipedia page is a good place to start?

You can find some websites that let you type assembly language into your browser window and then execute it, so you can see what it does. I wish they had stuff like that back when I learned assembly language!
: )
 
AVX instructions are "Single Instruction, Multiple Data" instructions that can perform vector operations on a bunch of numbers in one go. They're useful for 3D geometry calculations, cryptography, video encoding, or anything else that requires crunching a lot of numbers in specific ways. In essence they're similar to how GPUs work, and in many instances using a GPU is now preferred because it can handle far more calculations at once; but if the quantity of numbers you need to deal with is relatively small, then using AVX saves the extra overhead of shuffling the data back and forth between CPU and GPU.
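
To make "a bunch of numbers in one go" concrete, here's a rough C sketch (my own illustration, assuming AVX support and an array length that's a multiple of 8 so there's no leftover handling):

#include <immintrin.h>

/* c[i] = a[i] + b[i], processing 8 floats per AVX instruction; n must be a multiple of 8 */
void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}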
 
AVX instructions are "Single Instruction, Multiple Data" instructions that can perform vector operations on a bunch of numbers in one go.
Although they're frequently referred to as "SIMD", that's only one of the programming models they support. Strict SIMD doesn't involve horizontal (i.e. cross-lane) operations, although every x86 vector extension (even the original MMX) had some of these.

they're similar to how GPUs work,
It's perhaps notable that GPUs generally don't support horizontal operations.
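
For a concrete (self-made) example of a horizontal operation, here's a small C sketch that sums the four lanes of one SSE register; the adds happen within a register rather than between two, which is what "cross-lane" means here:

#include <immintrin.h>

/* horizontal sum of the 4 floats in v, e.g. [1, 2, 3, 4] -> 10 (needs SSE3) */
float hsum4(__m128 v) {
    __m128 t = _mm_hadd_ps(v, v);  /* HADDPS adds adjacent lanes: [a+b, c+d, a+b, c+d] */
    t = _mm_hadd_ps(t, t);         /* now every lane holds a+b+c+d */
    return _mm_cvtss_f32(t);       /* extract the low lane */
}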
 
Although they're frequently referred to as "SIMD", that's only one of the programming models they support. Strict SIMD doesn't involve horizontal (i.e. cross-lane) operations, although every x86 vector extension (even the original MMX) had some of these.


It's perhaps notable that GPUs generally don't support horizontal operations.
Wasn't there a thing with Cinebench where Intel had a better AVX implementation, and thus was performing better than similar AMD CPUs? I seem to remember hearing about that. Also, I know that Cinebench r23 doesn't have a GPU test, but could a GPU render that scene better than a CPU, thus getting a higher score than any CPUs? I might be way off base with that.
 
could a GPU render that scene better than a CPU, thus getting a higher score than any CPUs
Based on my experience with the Cycles renderer in Blender, yes. It uses AVX for CPU rendering, but if I render on the GPU it's roughly 10-20x faster (R7 5800X vs. 4070 Ti). (There is a GPU version of Cinebench, but I've never tried it.)
 
I was looking at this thread, https://forums.tomshardware.com/threads/intels-latest-lower-powered-cpus-give-ryzen-rivals-a-run-for-their-money-—-core-i9-14900t-beats-ryzen-9-7900-in-geekbench-6-benchmark.3835713/page-2#post-23197213 and while I enjoy watching the debate progress, I don't really understand anything about instruction sets (in this case AVX), or really anything beyond the hardware aspects of PCs in general. I know people will say to google it, but I am really looking for it to be dumbed down a little bit, so that I can grasp the idea in the first place. I think I would enjoy the debate between @TerryLaze and @bit_user if I understood a bit more of what they were talking about.


The instruction set is the language of the CPU, in this case x86 with 64-bit extensions. Everything else is special instructions that have dedicated hardware for executing them. Most of these are mathematical in nature, like adding value X to four other values simultaneously. FMA is a really big one, as it lets you simultaneously multiply two values, then add a third, and have this happen on several elements all at once.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Taken all together, these kinds of instructions let you do complex vector math quickly instead of having to do everything one step at a time on general purpose hardware. Try to imagine GPUs as massive thousand-plus-core CPUs that specialize in vector math. Like adding a lighting bitmask to 2,073,600 picture elements (pixels). That would take forever doing one pixel at a time, but a GPU can do all of them in a single cycle using vector instructions.
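
To make the FMA part above concrete, here's a tiny C sketch of my own (needs an FMA-capable CPU and something like -mfma): a single instruction does eight multiply-adds.

#include <immintrin.h>

/* returns acc + (x * y) for 8 floats at once */
__m256 fma8(__m256 acc, __m256 x, __m256 y) {
    return _mm256_fmadd_ps(x, y, acc);  /* one VFMADD instruction: multiply then add, 8 lanes */
}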
 
The instruction set is the language of the CPU, in this case x86 with 64-bit extensions. Everything else is special instructions that have dedicated hardware for executing them. Most of these are mathematical in nature, like adding value X to four other values simultaneously. FMA is a really big one, as it lets you simultaneously multiply two values, then add a third, and have this happen on several elements all at once.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Taken all together, these kinds of instructions let you do complex vector math quickly instead of having to do everything one step at a time on general purpose hardware.
Thanks for the response. I've always wondered if Cinebench r23 would be faster on a GPU, but of course there is no way to test a GPU with Cinebench r23. What exactly is stopping a GPU from being able to do the benchmark? It's just rendering a 3D scene, so it should be able to run on a GPU, or am I missing something?
 
FMA is a really big one as it lets you simultaneously multiply two values then add a third and have this function happen on several elements all at once.
Dot product is another big one. It computes the sum-of-products from two vector operands. It's key in convolutions and matrix multiplication (both matrix * vector and matrix * matrix) and a big deal in machine learning.

It's an example of a horizontal operation, since it involves cross-lane communication. Here, vector instructions are being used to treat vectors as first class data types in classical control flow, as opposed to using them as SIMD engines.
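
(For illustration only, not tied to any particular benchmark: SSE4.1 even added a dedicated dot product instruction for small vectors, reachable from C like this.)

#include <immintrin.h>

/* dot product of two 4-float vectors using DPPS (SSE4.1) */
float dot4(__m128 a, __m128 b) {
    /* 0xF1: multiply and sum all four lanes, write the result to lane 0 */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}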

Try to imagine GPU's are massive thousand core+ CPU's that specialize in vector math.
Whoah, let's not get carried away, here! GPU "cores" are vaguely like scalar CPU cores. Or, another way of looking at it is that each Zen 4 or Golden Cove CPU core is like about 48 GPU "cores", since each GPU "core" is essentially a 32-bit SIMD lane and both of those CPU cores have 6x 256-bit issue ports capable of issuing various AVX instructions. IIRC, Intel has an edge, since 4 of those ports are multiply-capable, while only two of Zen 4's are.

Like adding a lighting bitmask to 2,073,600 picture elements (pixels). That would take forever doing one pixel at a time, but a GPU can do all of them in a single cycle using vector instructions.
Just to fact-check that, GPUs do most of their graphics using fp32. IIRC, their integer support is relegated to scalars. If we consider how many fp32 operations they can issue per cycle, the RTX 4090 has 16384 "CUDA cores", so there's your answer - not 2 million! Similarly, AMD's RX 7900 XTX has 6144 "shaders", which are essentially the same thing. So, these GPUs would have the raw, per-cycle fp32 compute power somewhere equivalent to 341 or 128 CPU cores, respectively. Since CPUs can support all-core clocks about double what GPUs can, you could actually halve those numbers to get an approximate relation in absolute terms.

One aspect I find interesting is to consider the ratio of compute-to-bandwidth, in these GPUs. Even with a massive ~1 TB/s of memory bandwidth, that still means the Nvidia flagship can theoretically perform 72.5 fp32 ops per byte read or written from/to memory. Similarly, the AMD GPU can perform about 48.6 fp32 ops per byte. And keep in mind that a fp32 number is 4 bytes.
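
(If you want to sanity-check that figure: assuming the 4090's ~2.23 GHz base clock and counting each FMA as two ops, 16384 lanes × 2 ops × 2.23 GHz ≈ 73 TFLOP/s, and dividing by ~1008 GB/s gives roughly 72.5 ops per byte. The exact number moves around depending on which clock you assume.)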

So, that means that, for all the bandwidth they have, these GPUs are very lopsided towards compute. Now, consider all the math involved in lighting a pixel: interpolating surface normal vectors, computing dot products, distance calculations, and applying transfer functions to account for material specularity and light source directionality. Then, use those terms to scale light source & surface color.

So, it's not like the GPU is always just twiddling its thumbs while waiting for memory. However, that lopsided compute/bandwidth bias also illustrates why big AI & HPC GPUs pack HBM stacks, to support several times that amount of bandwidth. AI is even more bandwidth-hungry than graphics!
 
Thanks, but what exactly is AVX used for? I seem to remember that Cinebench uses it, and that AMD CPUs were worse in Cinebench because of it. (I think, I don't remember exactly)
That was a bit before AVX; back then it was SSE/SSE2. Intel's compiler, which Cinebench is compiled with, used Intel CPU IDs to determine the SSE capabilities of the CPU, so any CPU that didn't return a CPUID with known SSE capabilities (meaning any CPU not made by Intel, but possibly also future Intel CPUs) would get the normal code path and not the Intel-optimized one.
https://www.agner.org/optimize/blog/read.php?i=49
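
To illustrate the difference in C (just a sketch using GCC's __builtin_cpu_supports / __builtin_cpu_is helpers, not what Intel's compiler actually generates):

#include <stdio.h>

void sum_sse2(void)    { puts("fast SSE2 code path"); }   /* illustrative stubs */
void sum_generic(void) { puts("plain scalar code path"); }

int main(void) {
    /* feature-based dispatch: any vendor that reports SSE2 gets the fast path */
    if (__builtin_cpu_supports("sse2"))
        sum_sse2();
    else
        sum_generic();

    /* vendor-gated dispatch (the pattern described in the link above):
       a non-Intel CPU falls through to the slow path even if it supports SSE2 */
    if (__builtin_cpu_is("intel") && __builtin_cpu_supports("sse2"))
        sum_sse2();
    else
        sum_generic();

    return 0;
}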
 
Thanks for the response. I've always wondered if Cinebench r23 would be faster on a GPU, but of course there is no way to test a GPU with Cinebench r23. What exactly is stopping a GPU from being able to do the benchmark? It's just rendering a 3D scene, so it should be able to run on a GPU, or am I missing something?

The reason Cinebench doesn't run on GPUs is that it's compiled to run on x86, and GPUs do not speak x86. Instruction Set Architectures (ISAs) are the fundamental languages that processors speak. In order for code to execute, it needs to either be compiled for that processor's target language or there needs to be an interpreter that translates it. The languages that GPUs speak are CUDA / OpenCL / HLSL. Well, technically each GPU speaks its own language, but the hardware manufacturers provide an API interface that speaks the aforementioned standard languages.
 
The reason Cinebench doesn't run on GPUs is that it's compiled to run on x86, and GPUs do not speak x86. Instruction Set Architectures (ISAs) are the fundamental languages that processors speak. In order for code to execute, it needs to either be compiled for that processor's target language or there needs to be an interpreter that translates it. The languages that GPUs speak are CUDA / OpenCL / HLSL. Well, technically each GPU speaks its own language, but the hardware manufacturers provide an API interface that speaks the aforementioned standard languages.
I know Cinebench 2024 has a GPU test, but I would love to see a 4090 complete the r23 render and destroy most CPUs in terms of score. (theoretically, if such a translation layer existed, and it wouldn't completely kill performance) I don't know how you would even go about making it run on GPUs though.
 
The reason Cinebench doesn't run on GPUs is that it's compiled to run on x86 ...
That's a slightly odd way of putting it, but also kind of moot as Cinebench 2024 does include a GPU version. Just tried it, the scene takes a shade under 20 seconds to complete on my GPU and about six and a half minutes on my CPU (which is around the same speed increase I get in Blender).
 
That's a slightly odd way of putting it, but also kind of moot as Cinebench 2024 does include a GPU version. Just tried it, the scene takes a shade under 20 seconds to complete on my GPU and about six and a half minutes on my CPU (which is around the same speed increase I get in Blender).
I was referring to Cinebench r23. I would love to see the speed increase between CPU and GPU if it was possible to run r23 on a GPU.
 
I was referring to Cinebench r23. I would love to see the speed increase between CPU and GPU if it was possible to run r23 on a GPU.
If there were a GPU-based version, then with a bit of a back-of-the-envelope calculation (in Maya and 3DS Max the 4090 is about 5-10% faster than my 4070 Ti, and based on my r23 score and assuming it would show the same 21x increase as in 2024) a 4090 would get a score of around 350,000...
 
I know Cinebench 2024 has a GPU test, but I would love to see a 4090 complete the r23 render and destroy most CPUs in terms of score.
Blender has both CPU and GPU backends. I believe they should be equivalent in quality (i.e. the Nvidia GPU backend uses CUDA, not Direct3D, OpenGL, or Vulkan). Lately, Intel and AMD also have the backend running on their GPUs. So, you could compare those benchmark scores (i.e. CPU vs. up to 3 different types of GPU compute implementations).
 
That's a slightly odd way of putting it, but also kind of moot as Cinebench 2024 does include a GPU version. Just tried it, the scene takes a shade under 20 seconds to complete on my GPU and about six and a half minutes on my CPU (which is around the same speed increase I get in Blender).

The version he was using does not have GPU native compute code in it. Since the question was about instruction set architecture, it is important to highlight that all these things are very different languages.
 
The version he was using does not have GPU native compute code in it. Since the question was about instruction set architecture, it is important to highlight that all these things are very different languages.
Yes, we know that; this is purely a "what if" scenario: if we lived in an alternate universe where r23 had a GPU compute option, what sort of score would an RTX 4090 achieve? The figures are extrapolated very loosely from similar direct CPU-to-GPU comparisons, my own r23 and 2024 results, and the speed differentials between my GPU and the 4090 in similar workloads. (Based on the theoretical score, it would take 2-3 seconds for a 4090 to render the r23 benchmark.) It is essentially meaningless, but nonetheless still a reasonably valid calculation, given that the r23 workload is something that can be written to run on a GPU. Obviously it would be no use to then take this extrapolation and claim a GPU would be X times faster for any workload, though; it's only (hypothetically) valid for this highly specific case.
 
The version he was using does not have GPU native compute code in it. Since the question was about instruction set architecture, it is important to highlight that all these things are very different languages.
Would it be possible to make Cinebench r23 run on a GPU? I know GPUs have the option to do compute, but I'm not sure how exactly that works. By compute, I'm referring more to the non-graphical compute aspect (if that's even the right way of putting it).
 
Would it be possible to make Cinebench r23 run on a GPU? I know GPUs have the option to do compute, but I'm not sure how exactly that works. By compute, I'm referring more to the non-graphical compute aspect (if that's even the right way of putting it).
Not "as is", no. The GPU version produces the same end result, but the underlying code is entirely different. Imagine it's like CDs and vinyl - they're both round and make music, but the way the music is stored and the inner workings of the machine that plays it are not even remotely similar.
 
Not "as is", no. The GPU version produces the same end result, but the underlying code is entirely different. Imagine it's like CDs and vinyl - they're both round and make music, but the way the music is stored and the inner workings of the machine that plays it are not even remotely similar.
Could you use a translation layer, or would it just be simpler to rewrite it from the beginning?