News: Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

If it's possible to generate two images at the same time, then two 2080s could match the performance of a 4080. Are you able to test the 20xx series with dual GPUs, as well as the 30xx and 40xx series? Updating this article with the 4060 might be an option, too. Thanks.
I don't think I have duplicate GPUs for most of the cards, and only the RTX 3090/3090 Ti and some RTX 20-series cards support NVLink. I would think there's a way to have a project like Stable Diffusion use multiple GPUs without SLI, NVLink, or other connectors, but it would have to be programmed into the repository. By default I think it just selects the first GPU? To be honest, I haven't even tried running two GPUs in recent history, not since Nvidia basically declared SLI dead with the RTX 30-series.
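For what it's worth, here's a minimal sketch (assuming PyTorch and the Hugging Face diffusers library; the model ID is illustrative, not from the article) of the no-NVLink approach: each process simply claims a different GPU, so two cards generate two images in parallel.

```python
# Minimal sketch, assuming PyTorch + Hugging Face diffusers.
import torch
from diffusers import StableDiffusionPipeline

def generate_on(gpu_index: int, prompt: str):
    # cuda:0 is what most repos grab by default; pass 1 for the second card
    device = torch.device(f"cuda:{gpu_index}")
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # illustrative model ID
        torch_dtype=torch.float16,
    ).to(device)
    return pipe(prompt, num_inference_steps=50).images[0]

# Run two copies of this script (or pin each one with CUDA_VISIBLE_DEVICES=0
# and CUDA_VISIBLE_DEVICES=1) and throughput on independent images is simply
# additive -- which is all the dual-2080 idea amounts to.
```

No SLI bridge is involved; the two pipelines never talk to each other.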
 
Hi @JarredWaltonGPU, thanks a lot for the article! I just tried on my rig here (6800 XT) and got two distinct results:
  • using the Nod.ai Shark build, which seems to be Vulkan based, I got around 9.2 it/s, as per the screenshot here.
  • using Automatic1111, which is ROCm based (running on Ubuntu), I got a score similar to yours, around 3.5-4 it/s.
So I'm wondering why the Vulkan backend is faster than the ROCm one 🤔
 
Likely the ROCm path is missing optimizations. I haven't retested lately, but AMD sent out a guide with the RX 7600 launch in which it suggested using the Nod.ai version. Which is funny, because I started using Nod.ai's release months before AMD did. 🙃
 
Hello.

I think the boost promised by AMD is here.
7.734 it/s on the Radeon 6950 XT (AMD ROCm / Linux).
I'm running a bunch of updated numbers. The latest Nod.ai stable release now does quite a bit better (the daily automatic builds are still slower, based on one I checked yesterday). I just need to retest a bunch of GPUs. I'm about halfway there (whoa, livin' on a prayer).
 
I need to run more tests, but I think A1111 + AMD GPUs under Linux is much faster than Nod.ai/Windows.
Entirely possible, though these days I suspect the difference has shrunk quite a bit. That's partly because I'm also quite sure that Nod.ai is pretty heavily invested in making AMD GPUs look as good as possible. I say that because when I last tried testing Nod.ai with Nvidia GPUs, the results were universally poor, and when I tried the latest A1111 instructions for running on AMD under Windows, performance was also worse than with Nod.ai.

Anyway, I'm also sure there are more optimized variants of SD than A1111 for Nvidia GPUs, especially if we started looking at tuned versions that use the FP8 mode on the tensor cores, which could potentially double performance with little effort. But the good thing is that Nod.ai, OpenVINO, and the base Automatic1111 instructions can all get up and running with a minimum of hassle under Windows.
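As a rough illustration of how much precision alone matters, here's a sketch (assuming diffusers and a CUDA build of PyTorch; the model ID is an assumption) that times the same pipeline at FP32 and FP16. FP8 would additionally need specialized kernels, which is why it isn't shown.

```python
# Hedged sketch: time one SD pipeline at two precisions to show the
# headroom precision alone buys. Model ID is illustrative.
import time
import torch
from diffusers import StableDiffusionPipeline

def rough_its(dtype, steps=50):
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
    ).to("cuda")
    pipe("warmup", num_inference_steps=5)         # exclude one-time setup cost
    start = time.perf_counter()
    pipe("an astronaut riding a horse", num_inference_steps=steps)
    return steps / (time.perf_counter() - start)  # approximate it/s

for dtype in (torch.float32, torch.float16):
    print(dtype, f"~{rough_its(dtype):.1f} it/s")
```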
 
A1111/Windows/AMD is not using ROCm but DirectML. It works, but it's roughly 1/10 the speed of ROCm.
Nod.ai/Windows/AMD seems to work better, but with no SDXL model support you miss out on a lot of new models.
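For context, a tiny sketch of what the DirectML path looks like on the PyTorch side (assuming the torch-directml package, which is what A1111's Windows/AMD route builds on):

```python
# Hedged sketch: tensors routed through a DirectML adapter instead of a
# ROCm/CUDA device -- the layer the A1111 Windows/AMD path sits on.
import torch
import torch_directml

dml = torch_directml.device()                # default DirectML adapter (the GPU)
latents = torch.randn(1, 4, 64, 64).to(dml)  # an SD-sized latent tensor
print(latents.device)                        # DirectML devices show as privateuseone:0
```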
 
For anyone still following this thread, I have (finally) updated all the testing for all the pertinent GPUs and published a new article, redirecting the old one (because that's what our SEO team tells us is best). So, here's the new article:

 
AMD results are totally flawed; Windows was used instead of Linux :-/
DirectML is much slower than ROCm.
ROCm doesn't work with all AMD GPUs. Also, images per minute is not directly comparable to iterations per second, as the latter omits a lot of the per-image time spent outside the sampling loop. I tried to get ROCm working with the 7900 XTX recently under Linux and eventually gave up in frustration… again.
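To make that concrete, here's a toy calculation (all numbers hypothetical) showing why a GPU's iterations per second overstate its images per minute:

```python
# Toy numbers, purely illustrative: it/s covers only the denoising loop,
# while images/min also pays per-image overhead (VAE decode, setup, etc.).
steps = 50            # sampling steps per image
its = 10.0            # reported sampler speed, it/s
overhead = 2.0        # assumed seconds per image outside the sampler

denoise = steps / its                 # 5.0 s of pure sampling
naive_ipm = 60 / denoise              # 12.0 images/min if it/s told the whole story
real_ipm = 60 / (denoise + overhead)  # ~8.6 images/min once overhead is counted
print(f"{naive_ipm:.1f} vs {real_ipm:.1f} images/min")
```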
 
According to this, their ROCm support for RDNA3 is still ongoing, with a suggestion that ROCm 6.0 will be the one to watch for:
I have asked AMD about this directly, though I haven't really received a straight answer yet. DirectML seems to be the "preferred" way to do SD on AMD GPUs, at least for now, though performance on RDNA 2 doesn't look great. I think ROCm under Linux is supposed to do much better than DirectML on RDNA 2.

But again, I note that getting things running under Linux is more difficult in general. At one point I had Automatic1111 working with ROCm on RDNA 2 (this was early this year). Something changed along the way, and when I recently tried running Ubuntu I could not get things working. Linux kernels had changed, ROCm didn't install properly, whatever. I lost a day of work trying to make it function.
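For anyone attempting the same, a quick sanity check (assuming a ROCm wheel of PyTorch) helps separate "ROCm didn't install" from "the GPU isn't being picked up":

```python
# Hedged sketch: on ROCm builds of PyTorch the CUDA API maps onto HIP,
# so these calls reveal which layer of the stack is broken.
import torch

print(torch.__version__)           # ROCm wheels report e.g. "2.x.x+rocm5.x"
print(torch.version.hip)           # None on CUDA/CPU-only builds
print(torch.cuda.is_available())   # False usually means a driver/kernel mismatch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "AMD Radeon RX 7900 XTX"
```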
 

Firedrops
For anyone still following this thread, I have (finally) updated all the testing ... and published a new article ...

I love you, Jarred.

I tried to get ROCm working ... and eventually gave up in frustration… again.
A tale as old as time. I'm certain that millions of dev hours over the past decade have been lost chasing this mirage. AMD has probably made more self-congratulatory and blatantly false ROCm press announcements than the total number of systems/users that have ever gotten it working.

Somehow Intel got their OpenVINO unconditionally working within a year of their dGPU release.

AMD results are totally flawed, Windows was used instead of Linux :-/
DirectML is much slower than ROCm.
Theoretically this makes sense, but practically I agree with using Windows for benchmarking. It is (currently) what the overwhelming majority of people use, and where all GPU companies are focusing their driver support, which recently includes some SD/LLM optimizations.
 

bit_user

Polypheme
Ambassador
Somehow Intel got their OpenVINO unconditionally working within a year of their dGPU release.
OpenVINO existed before that. I first used it on a Skylake iGPU, back in early 2021. The main reason they could get it working so quickly is that Intel had an open source GPU software stack for at least as long as AMD, IIRC. Intel has also done a better job maintaining OpenCL support, which OpenVINO benefited from.
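For what it's worth, enumerating devices in OpenVINO has stayed simple across iGPUs and Arc dGPUs alike; a minimal sketch (assuming the openvino Python package):

```python
# Hedged sketch: list whatever OpenVINO can target on this machine.
from openvino.runtime import Core

core = Core()
for dev in core.available_devices:   # e.g. ['CPU', 'GPU']
    print(dev, core.get_property(dev, "FULL_DEVICE_NAME"))
```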

Theoretically this makes sense, but practically I agree with using Windows for benchmarking. It is (currently) what the overwhelming majority of people use, and where all GPU companies are focusing their driver support,
Not if you're talking about AI. For that, the OS of choice is and always has been Linux. Windows only enjoys better driver support if you're gaming.
 