FMA (fused multiply-add) is a really big one, as it multiplies two values and adds a third in a single operation, and does that across several vector elements at once.
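For the curious, here's a minimal sketch of what that looks like with x86 intrinsics (my example, assuming a CPU with AVX and FMA3; compile with -mfma):

```c
/* One _mm256_fmadd_ps computes a*b + c across eight fp32 lanes at once. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256 a = _mm256_set1_ps(2.0f);     /* eight copies of 2.0 */
    __m256 b = _mm256_set1_ps(3.0f);     /* eight copies of 3.0 */
    __m256 c = _mm256_set1_ps(1.0f);     /* eight copies of 1.0 */
    __m256 r = _mm256_fmadd_ps(a, b, c); /* r[i] = a[i]*b[i] + c[i] = 7.0 */

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%f\n", out[0]);              /* prints 7.000000 */
    return 0;
}
```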
Dot product is another big one. It computes the sum of products of two vector operands. It's key in convolutions and matrix multiplication (both matrix * vector and matrix * matrix), and a big deal in machine learning.
It's an example of a horizontal operation, since it involves cross-lane communication. Here, vector instructions are being used to treat vectors as first-class data types in classical control flow, as opposed to using them as SIMD engines.
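To make that distinction concrete, here's an illustrative AVX dot product (my own sketch, assuming FMA support and n being a multiple of 8): the multiply-accumulate part runs vertically, lane by lane, but the final sum is a horizontal reduction that has to shuffle data across lanes.

```c
#include <immintrin.h>
#include <stddef.h>

float dot(const float *x, const float *y, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)   /* vertical: 8 FMAs per iteration */
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),
                              _mm256_loadu_ps(y + i), acc);

    /* Horizontal: fold the eight partial sums across lanes down to one. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1)); /* 8 -> 4 */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));               /* 4 -> 2 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));           /* 2 -> 1 */
    return _mm_cvtss_f32(s);
}
```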
Try to imagine GPUs as massive thousand-plus-core CPUs that specialize in vector math.
Whoa, let's not get carried away here! GPU "cores" are only vaguely like scalar CPU cores. Another way of looking at it is that each Zen 4 or Golden Cove CPU core is like about 48 GPU "cores", since each GPU "core" is essentially a 32-bit SIMD lane, and both of those CPU cores have 6x 256-bit issue ports capable of issuing various AVX instructions. IIRC, Intel has an edge, since 4 of those ports are multiply-capable, while only two of Zen 4's are.
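The arithmetic behind that "about 48", taking the port counts above at face value:

```c
#include <stdio.h>

int main(void) {
    int lanes_per_port = 256 / 32; /* fp32 lanes per 256-bit issue port: 8 */
    int avx_ports      = 6;        /* AVX-capable issue ports per core, per the post */
    printf("%d GPU-\"core\" equivalents per CPU core\n",
           avx_ports * lanes_per_port); /* 48 */
    return 0;
}
```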
Like applying a lighting bitmask to 2,073,600 picture elements (pixels), i.e. a full 1920x1080 frame. That would take forever doing one pixel at a time, but a GPU can do all of them in a single cycle using vector instructions.
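For scale, here's a toy sketch of that kind of per-pixel mask pass in CPU SIMD terms (AVX2 assumed; the mask value and frame contents are made up): eight 32-bit pixels get masked per 256-bit operation, so a full frame takes 259,200 iterations of the loop.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define W 1920
#define H 1080 /* 1920 * 1080 = 2,073,600 pixels */

int main(void) {
    uint32_t *fb = aligned_alloc(32, (size_t)W * H * sizeof *fb);
    if (!fb) return 1;
    memset(fb, 0xFF, (size_t)W * H * sizeof *fb);  /* stand-in frame data */

    __m256i mask = _mm256_set1_epi32(0x00FF00FF);  /* made-up lighting mask */
    for (size_t i = 0; i < (size_t)W * H; i += 8) {
        __m256i px = _mm256_load_si256((__m256i *)(fb + i));
        _mm256_store_si256((__m256i *)(fb + i), _mm256_and_si256(px, mask));
    }
    free(fb);
    return 0;
}
```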
Just to fact-check that: GPUs do most of their graphics using fp32. IIRC, their integer support is relegated to scalars. If we consider how many fp32 operations they can issue per cycle, the RTX 4090 has 16384 "CUDA cores", so there's your answer: not 2 million! Similarly, AMD's RX 7900 XTX has 6144 "shaders", which are essentially the same thing. So, these GPUs have raw per-cycle fp32 compute power roughly equivalent to 341 or 128 CPU cores, respectively. And since CPUs can sustain all-core clocks about double what GPUs can, you could halve those numbers to get an approximate relation in absolute terms.
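Spelling out that arithmetic (shader counts as above, divided by the 48-lane-per-core estimate):

```c
#include <stdio.h>

int main(void) {
    int lanes_per_cpu_core = 48; /* the 6-port x 8-lane estimate from earlier */
    printf("RTX 4090:    ~%d CPU cores\n", 16384 / lanes_per_cpu_core); /* 341 */
    printf("RX 7900 XTX:  %d CPU cores\n",  6144 / lanes_per_cpu_core); /* 128 */
    return 0;
}
```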
One aspect I find interesting is the ratio of compute to bandwidth in these GPUs. Even with a massive ~1 TB/s of memory bandwidth, the Nvidia flagship can theoretically perform 72.5 fp32 ops per byte read or written from/to memory. Similarly, the AMD GPU can perform about 48.6 fp32 ops per byte. And keep in mind that an fp32 number is 4 bytes, so that's roughly 290 or 194 ops per fp32 value actually moved.
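For anyone wanting to check the Nvidia figure, it works out if each FMA counts as two ops; the 2.23 GHz base clock and 1008 GB/s bandwidth below are my assumed RTX 4090 specs, not numbers from the post:

```c
#include <stdio.h>

int main(void) {
    double lanes       = 16384;   /* "CUDA cores", i.e. fp32 lanes */
    double ops_per_fma = 2;       /* one fused multiply-add = 2 ops */
    double clock_hz    = 2.23e9;  /* assumed base clock */
    double bw_bytes    = 1008e9;  /* assumed memory bandwidth, bytes/s */
    printf("%.1f fp32 ops per byte\n",
           lanes * ops_per_fma * clock_hz / bw_bytes); /* ~72.5 */
    return 0;
}
```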
So, for all the bandwidth they have, these GPUs are very lopsided towards compute. Now, consider all the math involved in lighting a pixel: interpolating surface normal vectors, computing dot products, distance calculations, and applying transfer functions to account for material specularity and light source directionality. Then, using those terms to scale light source & surface color.
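To give a feel for how that adds up, here's a toy Blinn-Phong-style version of that per-pixel math (my own sketch, not any particular engine's shader). Even this minimal form burns a few dozen fp32 ops per pixel, per light:

```c
#include <math.h>

typedef struct { float x, y, z; } vec3;

static float dot3(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

static vec3 normalize3(vec3 v) {
    float inv = 1.0f / sqrtf(dot3(v, v));       /* the distance calculation */
    return (vec3){ v.x*inv, v.y*inv, v.z*inv };
}

/* n: interpolated surface normal, l: direction to light, v: direction to eye */
float shade(vec3 n, vec3 l, vec3 v, float shininess) {
    n = normalize3(n);                        /* fix up the interpolated normal */
    float diffuse = fmaxf(dot3(n, l), 0.0f);  /* Lambertian dot product */
    vec3 h = normalize3((vec3){ l.x + v.x, l.y + v.y, l.z + v.z });
    float spec = powf(fmaxf(dot3(n, h), 0.0f), shininess); /* specular transfer fn */
    return diffuse + spec; /* these terms then scale light & surface color */
}
```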
So, it's not like the GPU is always just twiddling its thumbs while waiting for memory. However, that lopsided compute/bandwidth bias also illustrates why big AI & HPC GPUs pack HBM stacks to support several times that amount of bandwidth. AI is even more bandwidth-hungry than graphics!