digitalgriffin
So low level data race conditions are now the responsibility of the driver team. That makes me shudder. I see a lot of crashes in the initial driver releases.
Not quite. Let's review the quote:

> probably the most invasive change is the removal of the register scoreboard logic from the hardware, which means that the EU will no longer guarantee data coherency between register reads and writes, and will require the compiler to synchronize dependent instructions anytime there is a potential data hazard.

They're talking about the ISA - these are registers read & written by code that's executing on the GPU, itself. The compiler in question is then your shader compiler, OpenCL, etc., that generates that code.
What they're saying is that if instruction A takes a certain number of clock cycles to write its result back to the register file, then instruction B must be delayed at compile time by that number of cycles before reading the result. Before this change, if instruction B read the result of instruction A, the Execution Unit would stall the thread until instruction A finished. After this change, if instruction B follows too soon after instruction A, it will see the old value of that register, prior to what A is about to write there.
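As an illustration, here's a toy Python model (not any real GPU's ISA; the 3-cycle latency is an assumption) of a statically-scheduled machine without scoreboard interlocks: a read issued before the write latency has elapsed sees the stale value, so the compiler has to cover the gap.

```python
# Toy model of an EU without scoreboard interlocks (illustrative only).
# A write takes WRITE_LATENCY cycles to land in the register file; a read
# issued before that sees the old value.

WRITE_LATENCY = 3  # assumed latency, in cycles

class RegisterFile:
    def __init__(self):
        self.regs = {}       # committed register values
        self.in_flight = []  # (commit_cycle, reg, value)
        self.cycle = 0

    def tick(self, n=1):
        for _ in range(n):
            self.cycle += 1
            # commit any writes whose latency has elapsed
            for w in [w for w in self.in_flight if w[0] <= self.cycle]:
                self.regs[w[1]] = w[2]
                self.in_flight.remove(w)

    def write(self, reg, value):
        # result only becomes visible WRITE_LATENCY cycles from now
        self.in_flight.append((self.cycle + WRITE_LATENCY, reg, value))

    def read(self, reg):
        return self.regs.get(reg, 0)

rf = RegisterFile()
rf.write("r0", 42)       # instruction A: r0 <- 42
rf.tick()                # only 1 cycle later...
early = rf.read("r0")    # instruction B reads too soon: stale value (0)

rf.tick(WRITE_LATENCY)   # compiler-inserted delay covers the latency
late = rf.read("r0")     # now B sees 42

print(early, late)       # 0 42
```

With the scoreboard, the hardware would have stalled instead of returning the stale 0; without it, the early read silently succeeds with the wrong value.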
This is nothing special. A lot of modern CPUs work this way (well, mostly embedded/DSP/etc.). It has two main advantages:
- It simplifies the CPU design by removing the scoreboard logic (as mentioned).
- It effectively increases the size of the register file, since you don't "tie up" destination registers of instructions that are still in flight.
It's just another headache for hand-written assembly language, but compilers have been managing this sort of thing for ages.
Smart move, IMO. If anything, I'm surprised it wasn't like that from the beginning. GPUs have always sacrificed ease of programmability for the sake of speed, efficiency, and scalability.
> Yes, but not-ready resources is what OOE was supposed to solve, along with cache coherency.

GPUs are in-order, for the sake of energy-efficiency and simplicity - both of which enable them to scale much larger. Also, CPUs have a major problem that GPUs don't: they execute relatively few threads and are therefore dependent on finding and exploiting concurrency within each thread. GPU code is intrinsically divided into thousands of threads (as dictated by their programming model), saving GPUs from that extra overhead and book-keeping.
> If step 1 ("write A") takes 16 clocks, and step 2 ("read from A and put it in B") was the next instruction, rearrange the step 2 instructions so B executes 15 clocks later.

In modern chips, a more realistic instruction latency would be like 1 cycle for integer instructions and 3 cycles for single-precision floating point. The outliers are usually division and the transcendental functions.
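To make the rearranging concrete, here's a sketch of greedy static scheduling under an assumed 3-cycle FP latency (toy instruction list, not a real compiler): instead of padding with NOPs, the scheduler hoists independent instructions into the delay slots.

```python
# Hypothetical 3-cycle FP latency; a static scheduler fills the delay
# slots after a dependent pair with independent instructions.

FP_LATENCY = 3  # assumed latency, in cycles

# (dest, sources) tuples for a toy instruction stream
prog = [
    ("f0", ["a", "b"]),   # f0 = a op b
    ("f1", ["f0", "c"]),  # depends on f0: must wait FP_LATENCY cycles
    ("f2", ["d", "e"]),   # independent: can fill a delay slot
    ("f3", ["g", "h"]),   # independent: can fill another
]

def schedule(prog, latency):
    """Greedy list scheduling: each cycle, issue the first instruction
    whose sources are all available; otherwise emit a stall (None)."""
    avail = {}  # register -> cycle its value becomes readable
    pending = list(prog)
    out, cycle = [], 0
    while pending:
        for ins in pending:
            dest, srcs = ins
            if all(avail.get(s, 0) <= cycle for s in srcs):
                out.append(dest)
                avail[dest] = cycle + latency
                pending.remove(ins)
                break
        else:
            out.append(None)  # no ready instruction: a wasted cycle
        cycle += 1
    return out

print(schedule(prog, FP_LATENCY))      # ['f0', 'f2', 'f3', 'f1'] - no stalls
print(schedule(prog[:2], FP_LATENCY))  # ['f0', None, None, 'f1'] - two wasted cycles
```

With independent work available, the dependent read lands exactly when its operand becomes readable; without it, the same latency shows up as dead cycles.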
> Well, the tick count kind of wasn't the point. The point was to illustrate stalls from data race conditions (either internal or external).

It's not a race condition, since the latencies are deterministic and the instructions are statically-scheduled.
> I honestly thought the scheduler handled simple OOE before shipping off to the CU.

There's a scheduler that assigns threads to the EUs (equivalent to AMD's CUs), but that's scheduling threads to Execution Units (akin to CPU cores), not scheduling instructions within a thread.
> When I was writing out block diagrams for a GPU chiplet design, I put an OOE in the scheduler to account for data that might be on a separate chiplet's cache.

Since GPU programs have so many threads, their solution for latency-hiding is just to suspend the thread and execute a different one.
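That latency-hiding strategy can be sketched as a toy simulation (assumed 4-cycle memory latency, purely illustrative - not modeled on any real scheduler): with one thread the unit mostly idles waiting on loads; with several, the other threads cover the wait.

```python
# Minimal sketch of latency hiding by thread switching: when a thread
# issues a long-latency load, the scheduler just runs other ready threads
# instead of stalling.

from collections import deque

MEM_LATENCY = 4  # assumed memory latency, in cycles

def run(num_threads, ops_per_thread):
    """Each op is a load taking MEM_LATENCY cycles. Returns (cycles with
    useful work issued, total cycles)."""
    ready = deque(range(num_threads))
    waiting = {}  # thread id -> cycle when its load completes
    remaining = {t: ops_per_thread for t in range(num_threads)}
    cycle = busy = 0
    while ready or waiting:
        # wake threads whose loads have completed
        for t, done in list(waiting.items()):
            if done <= cycle:
                del waiting[t]
                ready.append(t)
        if ready:
            t = ready.popleft()  # issue one load from a ready thread
            busy += 1
            remaining[t] -= 1
            if remaining[t]:
                waiting[t] = cycle + MEM_LATENCY
        cycle += 1               # with nothing ready, the unit idles
    return busy, cycle

print(run(1, 4))  # one thread: mostly idle, utilization near 1/MEM_LATENCY
print(run(8, 4))  # eight threads: other threads cover most of the latency
```

No out-of-order machinery is needed: the per-thread instruction order stays fixed, and concurrency across threads soaks up the memory latency.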
> AMD decided not to put PCIe 4 on their new Renoir APUs.

That's unfortunate, because their APUs typically have only x8 lanes exposed for an external GPU. So, that means you're limited to a GPU running at PCIe 3.0 x8, which is slow enough to have a measurable (if small) performance impact, instead of PCIe 4.0 x8, which is roughly equivalent to PCIe 3.0 x16.
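The rough bandwidth arithmetic behind that comparison: PCIe 3.0 runs 8 GT/s per lane and PCIe 4.0 doubles that to 16 GT/s, both with 128b/130b encoding, so 4.0 x8 and 3.0 x16 come out the same.

```python
# Per-direction PCIe bandwidth from transfer rate, lane count, and the
# 128b/130b line coding shared by PCIe 3.0 and 4.0.

def pcie_gbps(gt_per_s, lanes):
    # 128 payload bits per 130 transferred bits; divide by 8 for bytes
    return gt_per_s * lanes * (128 / 130) / 8  # GB/s, per direction

print(round(pcie_gbps(8, 8), 2))    # PCIe 3.0 x8  -> 7.88 GB/s
print(round(pcie_gbps(16, 8), 2))   # PCIe 4.0 x8  -> 15.75 GB/s
print(round(pcie_gbps(8, 16), 2))   # PCIe 3.0 x16 -> 15.75 GB/s
```

So dropping from 3.0 x16 to 3.0 x8 halves the link to roughly 7.9 GB/s, which is where the measurable impact comes from.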