Discussion: CPU instruction set explanation thread

I mentioned in another thread that I suspect that future generations of cards are going to be less about adding more raw processing power, and more about improving the AI side of things - why add a few thousand more CUDA cores to scrape out an extra 5% performance when a few hundred Tensor cores can effectively double it? I know some people complain that they're not "real" pixels or frames and that it's a cheat, but pretty much every aspect of a modern rasteriser is a cheat already - parallax occlusion mapping, tessellation, screen space effects, they're all just as "fake" as DLSS!

Eh, it depends. Using advanced pattern detection (that's all "AI" is) to interpolate additional rendering data is fine. Using that same technique to artificially advance a frame counter to market "performance" is definitely a cheat. My qualm has been marketing leaning on hype and the general ignorance of the public to sell products. Frame generation will never be more than a gimmick because you are trying to render the future and that's not possible (with modern quantum physics), so you will always have weird artifacts and latency issues.

Pattern detection upscaling, on the other hand, is an extremely useful tool, especially since display resolutions are going up much faster than graphics processing power. Doubling the screen resolution quadruples the processing requirements. If x is the required performance for 1920x1080 (2,073,600 pixels), you need 4x for 3840x2160 (8,294,400 pixels) and 16x for 7680x4320 (33,177,600 pixels). This means maintaining decent frame rates is going to become an absolute nightmare if not outright impossible without some sort of upscaling technology. An advanced pattern based upscaling algorithm isn't trying to guess the future, it's trying to guess what a 1080p rendered image would look like at 2160p, or a 2160p rendered image at 4320p.
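To put rough numbers on that scaling (just a back-of-the-envelope sketch, using 1080p as the baseline x):

Code:
# Relative shading cost of common resolutions, taking 1920x1080 as the baseline "x"
resolutions = {
    "1080p": (1920, 1080),
    "2160p (4K)": (3840, 2160),
    "4320p (8K)": (7680, 4320),
}
base = 1920 * 1080
for name, (w, h) in resolutions.items():
    pixels = w * h
    print(f"{name}: {pixels:,} pixels -> {pixels // base}x the work of 1080p")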
 
  • Like
Reactions: Order 66
I would agree, I think it's funny, but I really do hate vague answers, because they lead me to ask a million more questions.
Okay, but I mean if you really want to dig into how 3D graphics rendering works, the math is a fundamental part of it. Not only are there entire books on it, but every intro book on game development and 3D graphics APIs will tend to devote at least a chapter or appendix to reviewing it. Heck, a quick web search just turned up this application note from Freescale - a semiconductor company - so engineers trying to use its chips would have the refresher they need:

Keep in mind that people started doing computer graphics before specialized hardware existed for it. The first example of a ray-traced image was done on a VAX, like 45 years ago.
 
  • Like
Reactions: Order 66
Using advanced pattern detection (that's all "AI" is) to interpolate additional rendering data is fine. Using that same technique to artificially advance a frame counter to market "performance" is definitely a cheat.
I don't see how you can say spatial interpolation is okay but temporal interpolation is a "cheat". I agree with @NorbertPlays - the only thing that matters is the end result. In that regard, the frames only need to look good enough, sustain a high enough rate, and arrive at sufficiently low latency to qualify. Since 3D graphics is already a giant approximation, I don't buy into the idea of trying to draw a line in the sand at one form of interpolation and not others.

Now, someone might say the latency of frame gen is too high, when the framerate of the input stream drops too low, and I would consider that a legitimate complaint. IMO, the main problem of frame gen is that it works best when you need it the least (and vice versa)!

Frame generation will never be more than a gimmick because you are trying to render the future and that's not possible
My understanding of DLSS3 and FSR3 is that they're temporal interpolation - not extrapolation. Hence, the latency penalty.

I can imagine future iterations of frame gen technology trying to feed an extrapolator with cheap-to-compute hints, so that it can do more accurate extrapolation and thereby avoid any latency penalty. Or, even if you still use frame gen for interpolated frames, if you have early hints, maybe you can start generating them before the subsequent frame is finished rendering.
 
We're still really in the baby stages of frame generation, but moving forward I can maybe see a hybrid approach being used - "important" parts (characters, enemies, etc.) get rendered normally each frame, and motion estimation and optical flow are used for everything else. Kind of like VRS, but instead of lowering the spatial resolution of parts of a scene you lower the temporal resolution. As for visual artifacts, we've already seen a massive improvement in upscaling quality since the PS4 Pro checkerboard days, and the pace of development of assorted image (re)generation techniques is astonishing!
 
I don't see how you can say spatial interpolation is okay but temporal interpolation is a "cheat". I agree with @NorbertPlays - the only thing that matters is the end result. In that regard, the frames only need to look good enough, sustain a high enough rate, and arrive at sufficiently low latency to qualify. Since 3D graphics is already a giant approximation, I don't buy into the idea of trying to draw a line in the sand at one form of interpolation and not others.

Now, someone might say the latency of frame gen is too high, when the framerate of the input stream drops too low, and I would consider that a legitimate complaint. IMO, the main problem of frame gen is that it works best when you need it the least (and vice versa)!


My understanding of DLSS3 and FSR3 is that they're temporal interpolation - not extrapolation. Hence, the latency penalty.

I can imagine future iterations of frame gen technology trying to feed an extrapolator with cheap-to-compute hints, so that it can do more accurate extrapolation and thereby avoid any latency penalty. Or, even if you still use frame gen for interpolated frames, if you have early hints, maybe you can start generating them before the subsequent frame is finished rendering.
I feel like the latency penalty for frame gen isn't that bad, especially if you're playing games with a controller.
 
The first example of a ray-traced image was done on a VAX, like 45 years ago
My mind is still boggled by the CGI in TRON - not only did they not have specialised hardware for 3D, they didn't even have a graphical display - the entire thing was done by laying things out on graph paper, entering a bunch of coordinates into some custom written software as raw numbers, and hoping it looked like what they wanted when it was eventually rendered incredibly slowly directly onto a negative!
 
My mind is still boggled by the CGI in TRON - not only did they not have specialised hardware for 3D, they didn't even have a graphical display - the entire thing was done by laying things out on graph paper, entering a bunch of coordinates into some custom written software as raw numbers, and hoping it looked like what they wanted when it was eventually rendered incredibly slowly directly onto a negative!
What?! I never knew that.
 
My mind is still boggled by the CGI in TRON
Yes, agree 100%.

Another interesting factoid about Tron is that it did poorly at the box office. It was probably the first example in history of a film where cutting-edge CGI couldn't compensate for its other flaws.

I remember being somewhat in awe of its visual effects, before I had the slightest clue how they were made. I had never seen anything remotely like it.

- not only did they not have specialised hardware for 3D, they didn't even have a graphical display
I'm not sure I heard about the lack of a display. Any idea how they generated the film prints?

the entire thing was done by laying things out on graph paper, entering a bunch of coordinates into some custom written software as raw numbers, and hoping it looked like what they wanted
I actually did something similar the first time I ever used POV-Ray. I drew out the scene on graph paper and entered the geometry into the text files it used as input. I'd usually draw the scenes during the day, at school, then make the text files when I got home and let the renders run overnight.

One of the first things I tried was to put a light source in front of the camera, as I was really curious to know what they looked like! Imagine my surprise when I saw nothing!
:D
 
  • Like
Reactions: Order 66
So we can view CPUs as having 6~14 cores that are good at doing lots of scalar instructions, and GPUs as having a thousand cores good at doing massive amounts of vector instructions.
You're being inconsistent about the notion of what constitutes a GPU "core". If you take Nvidia's view, what they talk about as a "core" is each SIMD lane (i.e. scalar processor). As I previously said, if you use this definition, then each Golden Cove or Zen 4 CPU core would be equivalent to about 48 of Nvidia's "cores".

Using a more classical CPU definition of a core, a GPU like the RTX 4090 only has about 512 blocks that are comparable to a CPU core. That's because Nvidia uses a construct called a Streaming Multiprocessor (SM), each of which contains 4 partitions, and the RTX 4090 has 128 of these. Each partition has the full contingent of execution units and logic needed to execute independently of the others:

[Image: Nvidia Ada Lovelace GPU architecture - Streaming Multiprocessor block diagram]


Source: https://www.nvidia.com/en-us/geforce/news/rtx-40-series-vram-video-memory-explained/

512 is still an awful lot of cores, but these are much simpler, in-order cores designed not only to be area-efficient but also energy-efficient. That's the only way they can pack so many onto a single die and find enough power to crank them & their SIMD units all up to 2.2 GHz.
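Just to put numbers on the two ways of counting (a quick sketch using Nvidia's published Ada figures; the variable names are mine):

Code:
sms = 128                  # Streaming Multiprocessors on an RTX 4090
partitions_per_sm = 4      # independently scheduled partitions per SM
lanes_per_partition = 32   # FP32 SIMD lanes per partition

cpu_style_cores = sms * partitions_per_sm                   # 512 "cores" by a classical CPU definition
nvidia_style_cores = cpu_style_cores * lanes_per_partition  # 16384 "CUDA cores" (one per SIMD lane)
print(cpu_style_cores, nvidia_style_cores)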
 
  • Like
Reactions: Order 66
A game like Quake managed to run perfectly well on a CPU that's several thousand times slower than anything from the last few years!
Quake needed a 486DX2-66 to be remotely playable, but really wanted a Pentium. So, a 66 MHz CPU with a single pipeline vs. modern CPUs like an i3-12100, which boosts to 4.3 GHz and has a 6-way decoder (but real IPC is a bit lower). If we take an average IPC of about 4 and assume the IPC of the i486 is about 0.333, that gives you another 12x speedup. So, before we consider the increased core count or any SIMD extensions, we're at a performance ratio of about 782:1.

So, the only way I think you get to "several thousand" is by factoring in multi-core and SIMD extensions. Also, while my IPC figure for the i486 might've been low for integer performance, it was probably too high for basic FPU instructions.
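If anyone wants to check that figure, the arithmetic is just this (a quick sketch using the estimates above):

Code:
clock_ratio = 4300 / 66      # 4.3 GHz boost vs. 66 MHz -> ~65x
ipc_ratio = 4 / (1 / 3)      # average IPC of ~4 vs. ~0.333 on the i486 -> ~12x
print(clock_ratio * ipc_ratio)   # ~782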

BTW, Descent was the first game I saw which had perspective-correct texture mapping and was playable on a 486 (at 320x200 resolution). It ran a bit faster than Quake, but then it had simpler models and lighting.

[Image: Descent (1995 video game) screenshot]


One thing that made Quake so neat is it had pre-baked ambient lighting, which they computed using radiosity, on a big workstation. That also meant that the GPU-accelerated version required GPUs & API implementations capable of multi-texturing. Quake also used Z-buffering for character rendering, which I'm pretty sure Descent did not.
 
  • Like
Reactions: Order 66
Quake needed a 486DX2-66 to be remotely playable, but really wanted a Pentium. So, a 66 MHz CPU with a single pipeline vs. modern CPUs like an i3-12100, which boosts to 4.3 GHz and has a 6-way decoder (but real IPC is a bit lower). If we take an average IPC of about 4 and assume the IPC of the i486 is about 0.333, that gives you another 12x speedup. So, before we consider the increased core count or any SIMD extensions, we're at a performance ratio of about 782:1.

So, the only way I think you get to "several thousand" is by factoring in multi-core and SIMD extensions. Also, while my IPC figure for the i486 might've been low for integer performance, it was probably too high for basic FPU instructions.
FPU, floating point unit?
 
  • Like
Reactions: bit_user
FPU, floating point unit?
Floating point is a computer number format that works like scientific notation. For instance, the way you might write 4.396 x 10^-3 instead of 0.004396.

The benefits of floating-point over integers are that it has a greater range and can represent non-integral values. The downside is that it's inexact and has less precision, except around zero. It also has weird properties like addition and multiplication not being associative!
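Here's a quick way to see the non-associativity for yourself (ordinary IEEE 754 doubles in Python, nothing exotic):

Code:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False - the grouping changed the result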

Because it's more complex, it requires more circuitry to execute. That means it has a larger silicon footprint and uses more energy. In fact, these scale roughly as the square of the mantissa width in modern implementations. This helps explain why AI-optimized processors tend to prefer low-precision floating-point number formats.

Sometimes really weird maths!
LOL, I've been there & done that!

About 20 years ago, I wrote lots of routines where I bit-hacked the IEEE754 fp32 format. These days, I honestly wonder how it compares with optimized CPU implementations.

BTW, I think GPUs tend to implement transcendental operations in a way that allows you to tradeoff execution time vs. accuracy. IIRC, AMD's 3DNow had some instructions like that.
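I can't say what those routines were, but for anyone wondering what "bit-hacking the IEEE754 fp32 format" even looks like, the classic published example is the fast inverse square root trick (sketched here in Python with struct doing the reinterpretation; the original was C with pointer casts):

Code:
import struct

def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) by poking at the raw IEEE754 fp32 bits."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]   # reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1)                          # the famous magic-constant hack
    y = struct.unpack('<f', struct.pack('<I', i))[0]   # reinterpret back to a float
    return y * (1.5 - 0.5 * x * y * y)                 # one Newton-Raphson step to refine the guess

print(fast_inv_sqrt(4.0))   # ~0.499, vs. the exact answer of 0.5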
 
  • Like
Reactions: Order 66
RT cores do one thing, and they do it fast: check if a line intersects a box or triangle. It takes a couple of dozen floating point calculations to do "normally", but because RT cores don't (and can't) do anything else they can be engineered to do it really really quickly. Without the RT cores the GPU has to perform all those calculations one at a time which a) is slower, and b) takes resources that could be used for something else.
RT cores do something else, too: BVH traversal. BVH is short for Bounding Volume Hierarchy. This involves testing lots of bounding volumes (boxes, like you said, or sometimes spheres) and then conditionally fetching & testing the next set. This probably involves enough conditional control-flow that it's not easy to do efficiently via SIMD.

[Image: Example of a bounding volume hierarchy]


The point of BVH is to drastically cut down on the number of intersection tests you need to do, in order to determine which object a given ray intersects first.

Another thing I believe newer GPU hardware does is to accelerate building or modifying a BVH. Because, if a scene is dynamic, then the BVH needs to be updated or rebuilt, as well. And most games don't occur in a static world!
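To make the traversal idea concrete, here's a toy software sketch (illustrative Python only, axis-aligned boxes; the node layout and names are made up for the example, not how any real GPU stores its BVH):

Code:
import math
from dataclasses import dataclass, field

@dataclass
class BVHNode:
    bmin: tuple                    # min corner of the axis-aligned bounding box
    bmax: tuple                    # max corner
    left: "BVHNode" = None
    right: "BVHNode" = None
    triangles: list = field(default_factory=list)   # only populated at leaf nodes

    def is_leaf(self):
        return self.left is None and self.right is None

def ray_hits_box(origin, inv_dir, bmin, bmax):
    """Slab test: does the ray origin + t*dir (t >= 0) pass through the box?"""
    tmin, tmax = 0.0, math.inf
    for axis in range(3):
        t1 = (bmin[axis] - origin[axis]) * inv_dir[axis]
        t2 = (bmax[axis] - origin[axis]) * inv_dir[axis]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def candidate_triangles(node, origin, inv_dir):
    """Walk the hierarchy, skipping every subtree whose box the ray misses."""
    if node is None or not ray_hits_box(origin, inv_dir, node.bmin, node.bmax):
        return []
    if node.is_leaf():
        return node.triangles      # exact ray-triangle tests only happen on these
    return (candidate_triangles(node.left, origin, inv_dir) +
            candidate_triangles(node.right, origin, inv_dir))

# Tiny made-up scene: one leaf box sitting in front of the ray
direction = (0.0, 0.0, 1.0)
inv_dir = tuple(1.0 / d if abs(d) > 1e-12 else 1e12 for d in direction)  # avoid divide-by-zero
leaf = BVHNode(bmin=(-1, -1, 5), bmax=(1, 1, 6), triangles=["tri0"])
root = BVHNode(bmin=(-1, -1, 5), bmax=(1, 1, 6), left=leaf)
print(candidate_triangles(root, (0.0, 0.0, 0.0), inv_dir))   # ['tri0']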
 
  • Like
Reactions: Order 66
Remind me what associative means in this context? I really should know this considering I’m still in school, but it’s been a few years since I’ve learned it.
The associative property is the rule you learned in algebra that says:

(x + y) + z = x + (y + z)

In other words, how you group the operations doesn't affect the result. Algebra depends on that.

I’m so confused by this.
The mantissa is the "4.396" part in the number 4.396 x 10^-3. The point is that the more precision you have in that part of a floating-point number, the larger the circuitry gets. It doesn't just increase linearly, but quadratically.

It's basically a way of saying that higher-precision arithmetic is disproportionately more expensive. You might otherwise think a 64-bit number should be twice as expensive to multiply as 32-bit, but it actually takes about four times as much.
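As a rough worked example (assuming a schoolbook-style multiplier, whose partial-product array grows with the square of the significand width):

Code:
# IEEE 754 significand widths, counting the implicit leading 1 bit
fp32_significand = 24
fp64_significand = 53
print((fp64_significand / fp32_significand) ** 2)   # ~4.9, i.e. roughly 5x the multiplier cost, not 2x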
 
  • Like
Reactions: Order 66
So, the only way I think you get to "several thousand" is by factoring in multi-core and SIMD extensions.
Given the focus of this thread on multicore workloads, multicore performance was definitely included in my accounting!

It's fun to learn stuff and gratifying to actually build something of your own!

I think it was all a lot simpler to learn when I started back in the 80s; once I'd figured out how to make a cube spin I had basically covered everything there was to know - how to move points in 3D space, and how to convert 3D space to screen space - and everything else was just an extension or refinement. These days everything is hidden behind API layers and gets a lot more involved, and all the tutorials I can find seem to gloss over the basic "how 3D actually works" and jump straight into OpenGL or DirectX which are a lot to take in if you're a complete noob...
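For what it's worth, that "move points in 3D space, convert 3D space to screen space" core really is still just a few lines once you strip the APIs away. Here's a bare-bones sketch (illustrative Python; the focal length, screen centre, and camera distance are made-up numbers):

Code:
import math

def rotate_y(point, angle):
    """Spin a point around the vertical axis - the 'make the cube spin' bit."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, focal=300.0, centre=(160.0, 100.0), cam_dist=4.0):
    """Perspective projection: 3D space to 2D screen space by dividing by depth."""
    x, y, z = point
    z += cam_dist                  # push the object out in front of the camera
    return (centre[0] + focal * x / z, centre[1] - focal * y / z)

# The 8 corners of a unit cube, rotated and projected for one frame of animation
cube = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
for corner in cube:
    print(project(rotate_y(corner, math.radians(30))))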
 
  • Like
Reactions: Order 66
This means maintaining decent frame rates is going to become an absolute nightmare if not outright impossible without some sort of upscaling technology. An advanced pattern based upscaling algorithm isn't trying to guess the future, it's trying to guess what a 1080p rendered image would look like at 2160p, or a 2160p rendered image at 4320p.
The fault in that is that 4K is already too much resolution for a comfortable desktop experience, so going above that will not be met with a lot of acceptance. At least, that's what I think; I have a hard enough time looking even at 1080p without reading glasses.
1440p is the sweet spot for most people, I believe, and GPUs are already capable of pushing enough pixels at that resolution without AI. Sure, it's a good thing for cheaper cards and people who want to save money, but for the high end it's not going to add anything.

Now, couch gaming is a different thing but that's way more console territory than it is PC.
 
  • Like
Reactions: Order 66
BTW, @Order 66 if you're that interested in this stuff, maybe consider trying to find a good "Intro to 3D Game Programming" course, video series, or (gasp!) book.
Thanks, I am fascinated by anything and everything with technology, which is part of the reason why I have started so many discussion threads.
 
@bit_user, I've kinda always been in awe of the amount of knowledge that you have. How long did it take you to learn all of this? I've been into computers (mainly strictly hardware) for about 5 years, and while I've learned a lot, I realize just how much I still have to learn.
 
BTW, @Order 66 if you're that interested in this stuff, maybe consider trying to find a good "Intro to 3D Game Programming" course, video series, or (gasp!) book.

It's fun to learn stuff and gratifying to actually build something of your own!
I've thought about it, and actually tried to build simple games (mainly in Roblox), but my problem is that I get frustrated when I can't figure out how to do something, and when I try researching the problem, if I don't find anything, I kinda just give up. I would love to do it, but I have issues with staying with it. I realize that it's a personal problem, but I still get very frustrated with it. Also, I realize that you're talking about something different, but I just thought I would highlight my limited experience with game development so far.
 
The fault in that is that 4K is already too much resolution for a comfortable desktop experience, so going above that will not be met with a lot of acceptance. At least, that's what I think; I have a hard enough time looking even at 1080p without reading glasses.
1440p is the sweet spot for most people, I believe, and GPUs are already capable of pushing enough pixels at that resolution without AI. Sure, it's a good thing for cheaper cards and people who want to save money, but for the high end it's not going to add anything.

Now, couch gaming is a different thing but that's way more console territory than it is PC.
I realize that 4K is too much for most users, but I have terrible eyesight, so as a result I have to sit closer to the screen or zoom things in, but the problem I have is that seeing pixels drives me nuts. So I kinda need higher resolutions than most. I'm currently using a 1080p 22-inch monitor; the pixel density is decent, but it's only 1080p and 22 inches.