Intel Gen12 Graphics Linux Patches Reveal New Display Feature for Tiger Lake


bit_user

So low level data race conditions are now the responsibility of the driver team.
Not quite. Let's review the quote:
probably the most invasive change is the removal of the register scoreboard logic from the hardware, which means that the EU will no longer guarantee data coherency between register reads and writes, and will require the compiler to synchronize dependent instructions anytime there is a potential data hazard.
They're talking about the ISA - these are registers read & written by code that's executing on the GPU itself. The compiler in question is then the shader compiler, OpenCL compiler, etc. that generates that code.

What they're saying is that if instruction A takes a certain number of clock cycles to write its result back to the register file, then instruction B must be delayed at compile-time to wait that number of cycles before reading the result. Before this change, if instruction B read the result of instruction A too soon, the Execution Unit would stall the thread until instruction A finished. After this change, if instruction B follows too soon after instruction A, it will see the old value of that register, prior to what A is about to write there.
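To make that concrete, here's a tiny toy simulator of the "after" behavior. It's only a sketch: the 3-cycle add latency and the little four-register machine are invented for illustration, not Gen12's real numbers. Reading r2 on the very next instruction picks up the stale value; spacing the dependent read out by the latency, as the compiler now must, gives the right answer.

```python
ADD_LATENCY = 3   # invented write-back latency, in cycles (not a real Gen12 number)

def run(program, regs):
    """One instruction issues per cycle; an add's result only lands ADD_LATENCY cycles later."""
    regs = dict(regs)
    pending = []                                  # (cycle the write lands, dest reg, value)
    for cycle, (op, dst, a, b) in enumerate(program):
        for item in [p for p in pending if p[0] <= cycle]:
            regs[item[1]] = item[2]               # the write-back finally reaches the register file
            pending.remove(item)
        if op == "add":
            # No scoreboard: sources are read as they are *right now*, ready or not.
            pending.append((cycle + ADD_LATENCY, dst, regs[a] + regs[b]))
    for _, dst, value in pending:                 # drain outstanding writes
        regs[dst] = value
    return regs

start = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
back_to_back = [("add", "r2", "r0", "r1"),        # r2 = 3, but not visible for 3 cycles
                ("add", "r3", "r2", "r0")]        # issued too soon: reads the old r2
spaced = [back_to_back[0],
          ("nop", None, None, None),              # compiler-inserted padding
          ("nop", None, None, None),
          back_to_back[1]]

print(run(back_to_back, start)["r3"])             # 1  -- stale r2 (0) + r0 (1)
print(run(spaced, start)["r3"])                   # 4  -- correct: 3 + 1
```

In real code, the compiler would fill that gap with independent instructions rather than NOPs, so the padding doesn't normally cost anything.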

This is nothing special. A lot of modern CPUs work this way (well, mostly embedded/DSP/etc.). It has two main advantages:
  1. It simplifies the CPU design by removing the scoreboard logic (as mentioned).
  2. It effectively increases the size of the register file, since you don't "tie up" destination registers of instructions that are still in flight.
It's just another headache for hand-written assembly language, but compilers have been managing this sort of thing for ages.

Smart move, IMO. If anything, I'm surprised it wasn't like that from the beginning. GPUs have always sacrificed ease of programmability for the sake of speed, efficiency, and scalability.
 

Yes, but not-ready resources are what OoO execution was supposed to solve, along with cache coherency. If step 1 (write A) takes 16 clocks and step 2 (read from A and put it in B) is the next instruction, rearrange things so that step 2 executes 15 clocks later.
 

bit_user

Yes, but not-ready resources are what OoO execution was supposed to solve, along with cache coherency.
First off, GPUs are in-order, for the sake of energy efficiency and simplicity - both of which enable them to scale much larger. Also, CPUs have two major problems that GPUs don't. The first is that CPUs execute relatively few threads and are therefore dependent on finding and exploiting concurrency within each thread. GPU code is intrinsically divided into thousands of threads (as dictated by their programming model), saving GPUs from that extra overhead and book-keeping.

The other big problem GPUs avoid is the need for backwards compatibility. Basically, all of their code is just-in-time compiled for the specific uArch of the GPU, whereas a CPU needs to be able to take legacy code and schedule it to effectively utilize its resources. Otherwise, you have a "Pentium Pro" situation, where users were unhappy that all their legacy 16-bit code ran slower on the newer chip and couldn't be expected to recompile it all in 32-bit mode.

So, with OoO out of the way, let's take another look at register scoreboarding. In pre-Gen12 hardware, an instruction's result is bound to its destination register at the time the instruction issues. Starting with Gen12, the result isn't bound until the instruction retires. This creates a scheduling hazard that the compiler must deal with. However, the compilers probably already knew about it, since trying to use a result before it was ready would previously have resulted in a stall, yielding slower code. So, they already had an incentive to avoid doing that. And, like I said, it's nothing compilers haven't already been managing for decades.
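As a rough picture of what "require the compiler to synchronize dependent instructions" means in practice, here's a minimal read-after-write check over a pretend instruction stream. The latency table and the 3-operand format are made up, and a real scheduler would fix the hazard by moving independent instructions into the gap (or inserting a sync), not just report it.

```python
# Sketch of the kind of static check the compiler now owns (made-up latencies
# and a toy 3-operand instruction format, purely for illustration).
LATENCY = {"add": 3, "mul": 4}

def raw_hazards(instructions):
    """Yield (producer, consumer) index pairs scheduled too close together."""
    last_write = {}                               # register -> index of the instruction writing it
    for i, (op, dst, *srcs) in enumerate(instructions):
        for s in srcs:
            j = last_write.get(s)
            if j is not None and i - j < LATENCY[instructions[j][0]]:
                yield (j, i)                      # consumer would issue before the write lands
        last_write[dst] = i

prog = [("mul", "r2", "r0", "r1"),
        ("add", "r3", "r2", "r0"),                # only 1 slot after the mul -> hazard
        ("add", "r4", "r0", "r1")]                # independent work a scheduler could hoist up
print(list(raw_hazards(prog)))                    # [(0, 1)]
```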

Now, the reason why I say this is independent of out-of-order execution is that an architecture with deterministic, non-data-dependent instruction latencies can be scheduled at compile-time at least as effectively as at run-time. The only wild card is memory accesses, which quite likely just stall the entire thread. And while stalling a thread would be bad on a CPU, the execution units of recent Intel GPUs have 7-way SMT and only 2-way dispatch. So, they only need a couple of those 7 threads not to be stalled on reads, at any given time, in order to keep the EU busy. In contrast, CPUs are much worse off, because they're usually only 2- or 4-way SMT (if at all), and have much wider dispatch (i.e. more pipelines to feed).
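To put some (made-up) numbers on that, here's a back-of-the-envelope model of an EU-like core: 7 resident threads, 2 dispatch slots per cycle, and an invented chance that any given instruction is a long memory read. The knobs are assumptions, but the shape of the result is the point: with 7 threads resident, the two dispatch slots stay almost fully busy even though several of those threads are stalled at any given moment.

```python
import random

# Back-of-the-envelope model of latency hiding on an EU-like core with up to
# 7 resident threads and 2 dispatch slots per cycle. The stall probability and
# stall length are invented knobs for illustration, not measured Gen11 numbers.
random.seed(0)
DISPATCH = 2
STALL_PROB, STALL_CYCLES = 0.05, 20    # say 1 in 20 instructions is a slow memory read

def utilization(n_threads, cycles=50_000):
    ready_at = [0] * n_threads         # cycle at which each thread can next issue
    issued = 0
    for now in range(cycles):
        ready = [t for t in range(n_threads) if ready_at[t] <= now]
        for t in ready[:DISPATCH]:     # fill up to 2 dispatch slots from ready threads
            issued += 1
            ready_at[t] = now + (STALL_CYCLES if random.random() < STALL_PROB else 1)
    return issued / (cycles * DISPATCH)

for n in (1, 2, 4, 7):
    print(f"{n} thread(s): {utilization(n):5.1%} of dispatch slots used")
# Utilization roughly doubles from 1 to 2 threads and approaches 100% by 7 --
# the EU only needs a couple of its resident threads to be unstalled at once.
```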

If step 1 (write A) takes 16 clocks and step 2 (read from A and put it in B) is the next instruction, rearrange things so that step 2 executes 15 clocks later.
In modern chips, a more realistic instruction latency would be something like 1 cycle for integer instructions and 3 cycles for single-precision floating point. The outliers are usually division and transcendental functions.

Since you're interested, I'd encourage you to give these a quick read:

IMO, the Gen 9 (Skylake) doc is better, since it spends about 3 pages describing the execution unit (section 5.3), while the Gen11 doc cut it down to just about one (section 4.3.2).
 
In modern chips, a more realistic instruction latency would be something like 1 cycle for integer instructions and 3 cycles for single-precision floating point. The outliers are usually division and transcendental functions.

Well, the tick count kind of wasn't the point. The point was to illustrate stalls from data race conditions (either internal or external).

Thank you for the papers. I'll be sure to look into them.

I honestly thought the scheduler handled simple OoO execution before shipping off to the CU. When I was writing out block diagrams for a GPU chiplet design, I put OoO logic in the scheduler to account for data that might be on a separate chiplet's cache. Based on the instruction set, I figured it wouldn't have to be that deep of a look-ahead/branch predictor. So I guessed wrong there.


And yes, I knew shader programs/compute programs were compiled dynamically at runtime. It's the same reason IL languages (Java/.NET) do it: to ensure maximum speed and compatibility across machines with different capabilities. And this is why drivers often cache the compiled shader in advance and keep it in video card memory to improve performance.
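A toy sketch of that caching idea (compile_for_gpu() and the key fields here are just stand-ins for illustration, not any driver's real API): the cache key has to change whenever the shader source, the target GPU, or the compiler itself changes, and then the JIT only ever runs once per unique shader.

```python
import hashlib

# Toy sketch of a shader cache; compile_for_gpu() stands in for the driver's
# real JIT back end, and the key fields are illustrative only.
_cache = {}

def compile_for_gpu(source, gpu_id):
    return f"native-isa-for-{gpu_id} ({len(source)} bytes of source)"   # placeholder

def get_shader(source, gpu_id, driver_version):
    key = hashlib.sha256(f"{gpu_id}|{driver_version}|{source}".encode()).hexdigest()
    if key not in _cache:              # compile once; later lookups skip the JIT entirely
        _cache[key] = compile_for_gpu(source, gpu_id)
    return _cache[key]

binary = get_shader("float4 main() { ... }", gpu_id="Gen11", driver_version="example-1.0")
```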
 

bit_user

Well, the tick count kind of wasn't the point. The point was to illustrate stalls from data race conditions (either internal or external).
It's not a race condition, since the latencies are deterministic and the instructions are statically scheduled.

I honestly thought the scheduler handled simple OoO execution before shipping off to the CU.
There's a scheduler that assigns threads to the EUs (roughly equivalent to AMD's CUs), but that's scheduling whole threads onto Execution Units (akin to CPU cores), not scheduling instructions within a thread.

When I was writing out block diagrams for a GPU chiplet design, I put OoO logic in the scheduler to account for data that might be on a separate chiplet's cache.
Since GPU programs have so many threads, their solution for latency-hiding is just to suspend the thread and execute a different one.
 

bit_user

AMD decided not to put pcie4 on their new Renoir APUs.
That's unfortunate, because their APUs typically expose only x8 lanes for a discrete GPU. So, that means you're limited to a GPU running at PCIe 3.0 x8, which is slow enough to have a measurable (if small) performance impact, instead of PCIe 4.0 x8, which is roughly equivalent to PCIe 3.0 x16.
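The "roughly equivalent" part is easy to check from the published per-lane rates (8 GT/s for PCIe 3.0, 16 GT/s for 4.0, both with 128b/130b encoding):

```python
# Quick sanity check on "PCIe 4.0 x8 is roughly PCIe 3.0 x16", using the
# published per-lane transfer rates and 128b/130b encoding overhead.
def pcie_gb_per_s(gt_per_s, lanes):
    return gt_per_s * lanes * (128 / 130) / 8     # GT/s per lane -> GB/s per direction

print(f"PCIe 3.0 x8 : {pcie_gb_per_s( 8,  8):5.2f} GB/s")   # ~7.9 GB/s
print(f"PCIe 3.0 x16: {pcie_gb_per_s( 8, 16):5.2f} GB/s")   # ~15.8 GB/s
print(f"PCIe 4.0 x8 : {pcie_gb_per_s(16,  8):5.2f} GB/s")   # ~15.8 GB/s
```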

So, PCIe 4.0 could have cancelled out the downside of having just x8 GPU lanes, but I think AMD was looking at the class of motherboards most APU buyers would pair these chips with. Since most of those boards will use PCIe 3.0 for cost reasons, that's probably why they decided to leave 4.0 out of the APU.

Hopefully, they'll reverse that decision in the 5000-series APUs.