The researchers achieved this result by creating a quality-aware work-stealing (QAWS) scheduler.
The article skips a fundamental piece of background that's needed to appreciate what is actually being scheduled.
The way this works is that the authors propose that programs be refactored into a set of high-level Virtual Operations (VOPs). Each VOP is subdivided into one or more High-Level Ops (HLOPs), which are the basic scheduling atoms in this system. Each HLOP would have an equivalent implementation compiled for each type of hardware engine (CPU, GPU, AI engine, etc.) that you want it to be able to run on. Each engine has its own work queue, and the runtime scheduler decides which engine should execute a given HLOP based on the state of those queues.
For further details, see Section 3 of the paper.
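To make the idea concrete, here's a minimal sketch under my own naming (Engine, HLOP, ToyScheduler are not the paper's API): an operation carries one implementation per engine type, and a toy dispatcher picks the engine with the shortest queue.

```cpp
// Minimal sketch, not the paper's actual design: an HLOP carries one
// implementation per engine, and a toy scheduler dispatches each HLOP to
// whichever engine currently has the shortest queue.
#include <array>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <string>

constexpr std::size_t kNumEngines = 3;  // CPU, GPU, AI engine

struct HLOP {
    std::string name;
    // One callable per engine; in a real system these would be separately
    // compiled kernels rather than std::functions.
    std::array<std::function<void()>, kNumEngines> impls;
};

class ToyScheduler {
public:
    // Enqueue the HLOP on whichever engine has the fewest pending HLOPs.
    void submit(const HLOP& op) {
        std::size_t best = 0;
        for (std::size_t e = 1; e < kNumEngines; ++e)
            if (queues_[e].size() < queues_[best].size()) best = e;
        queues_[best].push(op);
        std::printf("queued %s on engine %zu\n", op.name.c_str(), best);
    }

    // Drain all queues; a real runtime would have one worker per engine.
    void run() {
        for (std::size_t e = 0; e < kNumEngines; ++e)
            while (!queues_[e].empty()) {
                queues_[e].front().impls[e]();  // run the engine-specific variant
                queues_[e].pop();
            }
    }

private:
    std::array<std::queue<HLOP>, kNumEngines> queues_;
};

int main() {
    ToyScheduler sched;
    HLOP gemm{"gemm", {[] { std::puts("gemm on CPU"); },
                       [] { std::puts("gemm on GPU"); },
                       [] { std::puts("gemm on AI engine"); }}};
    sched.submit(gemm);  // all queues empty, goes to engine 0
    sched.submit(gemm);  // engine 0 now has one entry, so this goes to engine 1
    sched.run();
}
```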
As the article suggests, this is very invasive to how software is written. The paper argues that application code might not have to change, so long as major runtime libraries incorporate the new model. Some examples of places where you might see benefits:
- programs using AI, where work can be flexibly divided up between CPU, GPU, and AI accelerator.
- portions of video game code, such as physics, where you have similar flexibility over which engine handles parts of the computation.
Key limitations are:
- The main control still happens on the CPU.
- Some code must be restructured into HLOPs, and each HLOP must be compiled for the various accelerators.
- Code which is mostly serial won't benefit from being able to run on the GPU, so these operations must be things that are already good candidates for (and in some cases already have) GPU implementations.
Finally, the paper prototypes this on an old Nvidia SoC, similar to the one powering the Nintendo Switch. This is a shared-memory architecture, which is probably why the article makes almost no mention of data movement or data locality. In a system like the one they describe, you could account for data movement overhead when making scheduling decisions, but that would put a lot more burden on the scheduler and would ideally require applications to queue up operations far enough in advance that the scheduler has adequate visibility.
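For illustration, here's roughly what such a locality-aware decision could look like. This is my own extension of the idea, not something in the paper, and all the names and numbers are invented.

```cpp
// Hedged sketch of a locality-aware cost model: pick the engine that
// minimizes (current backlog + data transfer + compute). Not from the paper.
#include <array>
#include <cstdio>

constexpr int kNumEngines = 3;  // CPU, GPU, AI engine

struct OpEstimate {
    std::array<double, kNumEngines> compute_ms;   // per-engine compute estimate
    std::array<double, kNumEngines> transfer_ms;  // cost to move inputs to that engine
};

// Return the index of the engine with the lowest estimated finish time.
int choose_engine(const OpEstimate& op,
                  const std::array<double, kNumEngines>& backlog_ms) {
    int best = 0;
    double best_cost = 1e300;
    for (int e = 0; e < kNumEngines; ++e) {
        double cost = backlog_ms[e] + op.transfer_ms[e] + op.compute_ms[e];
        if (cost < best_cost) { best_cost = cost; best = e; }
    }
    return best;
}

int main() {
    // The GPU computes fastest, but its queue is deep and the inputs sit in
    // CPU-warmed memory, so the CPU wins this particular decision.
    OpEstimate op{{4.0, 1.0, 2.0}, {0.0, 0.5, 0.5}};
    std::array<double, kNumEngines> backlog{1.0, 6.0, 3.0};
    std::printf("chosen engine: %d\n", choose_engine(op, backlog));  // prints 0 (CPU)
}
```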
In the end, I'd say I prefer a pipeline-oriented approach, where you define an entire processing graph and let the runtime system assign portions of it to the various execution engines (either statically or dynamically). This gives the scheduler greater visibility into both processing requirements and data locality. It's also a good fit for how leading AI frameworks decompose the problem. In fact, I know OpenVINO has support for hybrid computation, and probably does something like this.
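To be concrete about what I mean, here's a toy sketch of that kind of graph partitioning. It isn't OpenVINO's actual API, just an illustration of how whole-graph visibility lets placement consider both compute fit and data locality.

```cpp
// Toy static partitioner for a pipeline/graph approach (my own code, not
// OpenVINO's API): the whole graph is declared up front, so each node can be
// placed while looking at where its producers already live.
#include <cstdio>
#include <string>
#include <vector>

enum class Engine { CPU, GPU };

struct Node {
    std::string name;
    bool gpu_friendly;        // e.g. dense, data-parallel math
    std::vector<int> inputs;  // indices of upstream nodes
    Engine placement = Engine::CPU;
};

// GPU-friendly nodes go to the GPU; other nodes stay with their producers so
// intermediate data doesn't bounce between engines. Assumes the vector is in
// topological order (producers appear before consumers).
void assign(std::vector<Node>& graph) {
    for (Node& n : graph) {
        if (n.gpu_friendly) {
            n.placement = Engine::GPU;
            continue;
        }
        bool all_producers_on_gpu = !n.inputs.empty();
        for (int p : n.inputs)
            if (graph[p].placement != Engine::GPU) all_producers_on_gpu = false;
        n.placement = all_producers_on_gpu ? Engine::GPU : Engine::CPU;
    }
}

int main() {
    std::vector<Node> g = {
        {"decode",   false, {}},   // source node, lands on the CPU
        {"preproc",  false, {0}},  // stays on the CPU next to decode
        {"conv_net", true,  {1}},  // GPU-friendly, goes to the GPU
        {"argmax",   false, {2}},  // small op, kept on the GPU near conv_net
    };
    assign(g);
    for (const Node& n : g)
        std::printf("%-8s -> %s\n", n.name.c_str(),
                    n.placement == Engine::GPU ? "GPU" : "CPU");
}
```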
It's an interesting paper, but I don't know if it's terribly ground-breaking or if it will be very consequential. Still, it's always possible that Direct3D or some other framework decides to utilize this approach to provide the benefits of hybrid computation. That could matter as we move into a world where AI (inferencing) workloads are expected to run on a variety of PCs, including some with AI accelerators and many without.