@JarredWaltonGPU , I fully agree with your take. I also share your trepidation around highly proprietary rendering technologies.
IMO, it's not as if there can be no alternative. What I'd like to see is an open source 3D engine independently implement something conceptually similar to Ray Reconstruction that can (theoretically, at least) run on any hardware - although in practice it will probably require some kind of deep learning accelerator, which both Alchemist and RDNA 3 have.
I assume Nvidia has built a patent wall around this family of techniques, but perhaps there's enough room for innovation that someone can find a way around it.
Yeah, I've wondered whether DirectML just isn't robust enough, or what the holdup is. I mean, in the AI space, so much is built around PyTorch and Nvidia tech. Now there's a port of Automatic1111 Stable Diffusion to DirectML, but even that isn't as straightforward as you'd think. Like, I've been going back and forth with AMD trying to figure out how to make their instructions work with a standardized testing approach.
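For context on what that port builds on: the PyTorch-on-DirectML path goes through the torch-directml package, which hands you a device object that you move tensors and models to, much like a CUDA device. A minimal sketch of the idea, with the caveat that exact API details may differ between package versions:

```python
# Rough sketch of the PyTorch-on-DirectML wiring.
# Assumes the torch-directml package is installed (pip install torch-directml);
# treat the exact calls as illustrative, since the API has shifted between releases.
import torch
import torch_directml

print("DirectML devices:", torch_directml.device_count())
dml = torch_directml.device()  # default DirectML adapter

# Anything moved to this device runs through DirectML instead of CUDA/ROCm.
x = torch.randn(1, 3, 512, 512).to(dml)
conv = torch.nn.Conv2d(3, 16, 3, padding=1).to(dml)
y = conv(x)
print(y.shape, y.device)
```

Anyway, this is the relevant update from AMD's community blog: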
[UPDATE]: The Automatic1111-directML branch now supports Microsoft Olive under the Automatic1111 WebUI interface, which allows for generating optimized models and running them all under the Automatic1111 WebUI, without a separate branch needed to optimize for AMD platforms. The original blog...
community.amd.com
That's supposed to get Olive and ONNX running with AMD GPUs. And it works... sort of. Here's what I've passed along to AMD:
----------------
First, runwayml/stable-diffusion-v1-5 works for single image generation. However, if I try to change the batch size or batch count, it fails with an error. The Olive models get optimized for 512x512 images, so doing two or more images at a time breaks that. But the batch count failure just seems like a bug in the code; it should simply repeat the generation X number of times. (I'm trying to do 24 total images, which lets me do either 24 x 1, 12 x 2, 8 x 3, 6 x 4, 4 x 6, or 3 x 8 to optimize GPU utilization.) With AMD right now, that would mean 24 x 1 is my only option, which likely isn't optimal, but I had to do the same with Nod.ai.
Second, I normally use v2-1_768-ema-pruned.safetensors, which should just mean pointing at "stabilityai/stable-diffusion-2-1" instead of "runwayml/stable-diffusion-v1-5" or "stabilityai/stable-diffusion-xl-base-1.0". But it doesn't seem to work right: I get brown images at both 512x512 and 768x768. Also, I normally generate 512x512 and 768x768 from the same model, but the Olive/ONNX stuff doesn't seem to allow that?
Finally, while individual 512x512 generation times seem better than what I got with Nod.ai, any attempt at 768x768 images has been questionable at best. I can do 768x768 generation (with a warning that it's a different size than the model was trained on), but the time to generate a single image on the 7700 XT was 36–37 seconds. That's 1.62–1.67 images per minute. Nod.ai got 2.81 images per minute on the 7700 XT, using SDv2-1. At 512x512, Nod.ai got 8.71 images per minute, while A1111 with Olive/ONNX gives 10.53 images per minute. But again, it might be apples and oranges, since I'm using different model versions and only a single image rather than a batch of 24.
----------------
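For what it's worth, the batch count behavior I'd expect is nothing more than an outer loop around single-image generation, even if the Olive-optimized graph is locked to 512x512 at batch size 1. A rough sketch of that intent, using the diffusers ONNX pipeline on the DirectML execution provider as a stand-in for whatever the A1111 branch actually does (model ID, arguments, and file names are placeholders):

```python
# Sketch: "batch count" should just mean repeating the generation N times,
# even when the optimized model only accepts a single 512x512 image per call.
# This uses the documented diffusers/ONNX Runtime path, not the A1111 code itself.
from diffusers import OnnxStableDiffusionPipeline

pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="onnx",
    provider="DmlExecutionProvider",  # ONNX Runtime's DirectML backend
)

prompt = "a photo of an astronaut riding a horse"
batch_count = 24  # 24 x 1: repeat single-image generation 24 times
for i in range(batch_count):
    result = pipe(prompt, height=512, width=512, num_inference_steps=50)
    result.images[0].save(f"output_{i:02d}.png")
```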
The point is that, while this is all supposed to be DirectML-based, all sorts of things keep breaking in various ways. A lot of the work right now is basically hand-tuning AMD GPU performance for very specific workloads. In theory, if all the bugs get sorted out, maybe that stops being necessary and DirectML works just as well as ROCm and CUDA. In practice? Yeah, I'll believe it when I see it. (It took Intel about nine months to get its OpenVINO variant of Automatic1111 working more or less properly across different models and output settings.)
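In principle, that's the whole promise of ONNX Runtime's execution providers: the same exported model runs on whichever backend is present, with DirectML being just one option next to CUDA and ROCm. A short sketch of that idea (standard ONNX Runtime provider names; the model path is a placeholder):

```python
# Sketch: pick the best available ONNX Runtime execution provider at runtime.
# "model.onnx" is a placeholder for whatever exported/optimized model you have.
import onnxruntime as ort

available = ort.get_available_providers()
print("Available providers:", available)

# Preference order: CUDA, ROCm, DirectML, then CPU as the fallback.
preferred = ["CUDAExecutionProvider", "ROCMExecutionProvider",
             "DmlExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Session is using:", session.get_providers())
```

Whether the performance is actually comparable across those providers is exactly the part I'll believe when I see it.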
So while there are supposedly universal frameworks for deep learning, we're absolutely missing the standards, drivers, and whatever else is needed to make them usable, especially in real-time gaming, because something like DLSS probably has to run its whole sequence in a matter of milliseconds on every frame. If I had to guess, we're years away from DirectML being usable for that sort of work. Even Intel, with basically no market share, did its own proprietary upscaling implementation with XeSS, presumably because there wasn't a good API to build around.
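To put the real-time constraint in numbers: at 60 fps the whole frame gets about 16.7 ms, so an upscaling or denoising network realistically has to fit into a couple of milliseconds of that. A back-of-the-envelope timing harness for checking where a given model lands, assuming ONNX Runtime on DirectML and a placeholder model and input shape:

```python
# Sketch: measure per-inference latency against a per-frame budget.
# "upscaler.onnx", the input shape, and the 2 ms budget are all placeholders.
import time
import numpy as np
import onnxruntime as ort

FRAME_TIME_MS = 1000.0 / 60.0   # ~16.7 ms total per frame at 60 fps
BUDGET_MS = 2.0                 # rough share an upscaler could claim

session = ort.InferenceSession("upscaler.onnx",
                               providers=["DmlExecutionProvider"])
feed = {session.get_inputs()[0].name:
        np.random.rand(1, 3, 1080, 1920).astype(np.float32)}

session.run(None, feed)         # warm-up
times = []
for _ in range(50):
    start = time.perf_counter()
    session.run(None, feed)
    times.append((time.perf_counter() - start) * 1000.0)

avg_ms = sum(times) / len(times)
print(f"avg {avg_ms:.2f} ms per inference, "
      f"{avg_ms / FRAME_TIME_MS * 100:.0f}% of a 60 fps frame "
      f"(budget ~{BUDGET_MS} ms)")
```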