Stable Diffusion Benchmarks: 45 Nvidia, AMD, and Intel GPUs Compared

Bikki

Thanks so much for this. Generative models truly are the next big thing for consumer GPUs besides gaming.
Meta's Llama 2 should be next in the pipeline.
 
I've poked at LLaMa stuff previously with text generation, but what I need is a good UI and benchmarking method that can run on AMD, Intel, and Nvidia GPUs and leverage the appropriate hardware. Last I looked, most (if not all) of the related projects were focused on Nvidia, but there are probably some alternatives I haven't seen.

What I really need is an equivalent project that works across GPU vendors and uses LLaMa, not the model itself. Running under Windows 11 would be ideal. If you have any suggestions there, let me know.
 
We've tested all the modern graphics cards in Stable Diffusion, using the latest updates and optimizations, to show which GPUs are the fastest at AI and machine learning inference.


As an SD user stuck with an AMD 6-series card and hoping to switch to an Nvidia card, I think:

1. It's November 2023 already. If people buy a new card with SD in mind now, they absolutely should consider SDXL and even plan somewhat for "the version after SDXL," so omitting it from a benchmark report is like wasting your own time and effort. It's like doing detailed benchmarking of Counter-Strike but not Cyberpunk in 2023.
Of course the old cards can't run it, so maybe do a separate SDXL report on just the latest-gen cards with 12GB or more? The tests should include a basic 1024x1024 run, plus another at a larger resolution that simulates a potential future SD version. Other tests elsewhere show the 4060 Ti 16GB coming out faster than the 4070 in such VRAM-heavy operations, and I have been hoping to see more tests like that to confirm it.

2. I can understand not mentioning AMD with Olive, which is a quick optimization but one with many limitations that requires extra prep on the models. However, AMD on Linux with ROCm supports most of the stuff now with few limitations, and it runs far faster than AMD on Windows with DirectML, so it should be worth a mention. (I'd prefer to switch to Nvidia soon, though.)
 
Too many things are "broken" with SDXL right now to reliably test it on all of the different GPUs, as noted in the text. TensorRT support isn't available yet, and the DirectML and OpenVINO forks may also be iffy. I do plan on testing it, but it's easy enough to use regular SD plus a better upscaler (SwinIR_4x is a good example) if all you want is higher resolutions. SDXL will hopefully produce better results as well. Anyway, just because some people have switched to SDXL doesn't make regular SD irrelevant, as part of the reason for all these benchmarks is to give a reasonable look at general AI inference performance. SD has been around long enough that it has been heavily tuned on all architectures; SDXL is relatively new by comparison.
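If you want to try that "generate small, then upscale" workflow outside of A1111's extras tab, here's a rough sketch of the idea using the Hugging Face diffusers library and Stability's x4 upscaler in place of SwinIR_4x (a different upscaler, picked just because it's easy to script; treat the model IDs and settings as assumptions, not what we test with):

# Rough sketch of "regular SD at 512x512, then a 4x upscale" using diffusers.
# This swaps in stabilityai's x4 upscaler rather than the SwinIR_4x upscaler
# mentioned above; model IDs and settings are assumptions.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalePipeline

prompt = "a watercolor painting of a lighthouse at sunset"

base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
low_res = base(prompt, height=512, width=512).images[0]

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
high_res = upscaler(prompt=prompt, image=low_res).images[0]  # 2048x2048 output
high_res.save("upscaled.png")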

Regarding AMD with Olive, you do realize that this is precisely what the linked DirectML instructions use, right? I didn't explicitly explain that, as interested parties following the link will have the necessary details. AMD's latest instructions are to use the DirectML fork, and I'd be surprised if ROCm is actually much faster at this point. If you look at the theoretical FP16 performance, I'm reasonably confident the DirectML version gets most of what is available. ROCm also has limitations in which GPUs are supported, at least last I checked (which has been a while).
 
How did you get 24 images per minute on a 2080 Super?
Maybe read the article?

"Getting things to run on Nvidia GPUs is as simple as downloading, extracting, and running the contents of a single Zip file. But there are still additional steps required to extract improved performance, using the latest TensorRT extensions. Instructions are at that link, and we've previously tested Stable Diffusion TensorRT performance against the base model without tuning, if you want to see how things have improved over time."

So you have to do the extra steps to get the TensorRT extension installed and configured in the UI, then pre-compile engines for static sizes, normally a batch size of 8 at 512x512 resolution.
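For anyone curious how the images-per-minute numbers work in principle, here's a minimal timing sketch using the Hugging Face diffusers library instead of the Automatic1111 + TensorRT path used for the article, so don't expect the numbers to match; the model ID, prompt, and step count are assumptions:

# Minimal images-per-minute sketch using diffusers, not the article's
# Automatic1111 + TensorRT setup; model ID, prompt, and steps are assumptions.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
batch_size = 8   # mirrors the pre-compiled TensorRT batch of 8 at 512x512
steps = 50       # sampling steps; match whatever your own testing uses

# Warm-up pass so one-time setup cost doesn't skew the measurement.
pipe(prompt, num_images_per_prompt=batch_size, height=512, width=512,
     num_inference_steps=steps)

start = time.time()
pipe(prompt, num_images_per_prompt=batch_size, height=512, width=512,
     num_inference_steps=steps)
elapsed = time.time() - start
print(f"{batch_size * 60 / elapsed:.1f} images per minute")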
 
Hello, I want to ask: do Intel Arc GPUs work well using the Automatic1111 OpenVINO version? And are Intel Arc GPUs still able to run SD using DirectML the way AMD GPUs do? Thank you.
 
Thanks for the update on this and the massive testing! I am considering a purchase of a new GPU, since I am still using a GTX 1070. You can imagine that using A1111 is a pain... I am thinking about an RTX 4070 Ti or an RX 7900 XT. Since I am not a professional user relying on high throughput (10.9 pics per minute at 768x768 compared to my current 1 pic per minute still sounds pretty fast to me), I wonder whether the AMD card, with its 20GB of VRAM versus 12GB on the Nvidia card, would be the better choice, since VRAM, especially for upcoming models, could be the limiting factor.

Would love to hear your thoughts on this.
 
The VRAM situation for AI workloads and Stable Diffusion in particular is a bit interesting. Part of the whole TensorRT optimization path seems to change the data in such a way that less VRAM is required. Or perhaps it's just optimizing the code paths so that not as much VRAM is needed.

Even without TensorRT, I know that Automatic1111 with xformers generally runs much better on Nvidia than the DirectML stuff does on AMD — in terms of both speed as well as VRAM use. The RX 7900 XT as an example got 9.5 images per minute for 768x768 generation with my most recent testing. Using just xformers, that's basically tied with the RTX 4070 Ti (9.8 img/min), while TensorRT could do 15.9 img/min.

What that doesn't tell you is how easy or difficult it was to get certain things to run at all. Any Nvidia RTX GPU, whether with the base configuration, xformers, or TensorRT, was merely a case of letting the models pre-compile and then it worked. Even an RTX 2060 6GB card could do 768x768 generation with no trouble. In fact, you could run multiple concurrent 768x768 generations on that card; I found that four 768x768 images at the same time still provided better overall throughput. AMD's 8GB RX 6600-class GPUs, meanwhile, simply failed to run at all (though they worked previously with Nod.ai's slightly different "studio" download).
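If you want to see that batching behavior on your own card, here's a small sketch that sweeps the batch size and records peak VRAM, again with diffusers rather than Automatic1111; the model, prompt, and resolution are placeholders:

# Sketch of the throughput-vs-VRAM tradeoff when generating several images at
# once. Uses diffusers, not Automatic1111; model and settings are assumptions.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = "a cozy cabin in a snowy forest"

for batch in (1, 2, 4):
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    pipe(prompt, num_images_per_prompt=batch, height=768, width=768,
         num_inference_steps=50)
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"batch {batch}: {batch * 60 / elapsed:.1f} img/min, "
          f"{peak_gb:.1f} GB peak VRAM")

On cards with plenty of VRAM, the larger batches usually win on throughput; on 8GB cards, you'd expect the bigger batches to be the first place you hit out-of-memory errors.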

I know AMD (and Nvidia, and Intel) is working to improve performance across a variety of AI workloads. ROCm is part of that, and there was a bunch of stuff discussed yesterday about running AI on AMD. But much of that was focused on data center CDNA 2/3 stuff, so I'm not certain how things will pan out for Windows users of consumer hardware. There was talk about the XDNA stuff as well (NPUs in Phoenix/Hawk/Strix Point APUs), but not once did AMD mention Stable Diffusion, which would have been a big selling point IMO.

My assumption is that a lot of the AMD AI stuff is still a work in progress. At some point, I strongly suspect we'll see some code that will allow Stable Diffusion to use the NPUs in AMD's latest laptop chips, and that could be really cool. Of course, those are INT8 processors, not FP16, so I don't know if that will make a big difference in quality of output.

TLDR: The "safe" bet for AI stuff right now is still Nvidia. For just Stable Diffusion, yes, you can get decent performance from an RX 7900 XT. But there are lots of other LLMs out there, used for other tasks, and it feels like 99% of the public projects use Nvidia. Theoretically, porting those to use AMD GPUs should be possible, but in my experience you won't just be able to download and run them, at least not right now. Maybe ROCm 6 and the AMD Ryzen AI software will change all that. I have not had the time to do much more than poke at the surface.
 

froggx

Your graph for "GPU Theoretical Max FP16 TFlops" for GPU shader cores is using straight FP32 values. You need to double the performance for Intel, AMD, and Turing GPUs.

You have different numbers for the 2080 Ti in the graph vs. what you say in the article. The "interesting behaviour" you mention isn't reflected in the graph, as it shows the RTX 3070 Ti trouncing the RTX 2080 Ti by 61%.

The second graph in the gallery, "FP16 TFLOPS using Tensor/Matrix/Shader Cores (No Sparsity)," has the wrong title (Stable Diffusion Images per Minute). It's also rounding to 2 decimal places rather than 1 like the other graph in the gallery. I'm gonna guess you accidentally loaded the Stable Diffusion graph, which reused the settings, then changed the numbers but not the title.

As people have mentioned, they learn from your articles and use them to make buying decisions. I'm one of those people and I enjoy reading the stuff you write as I've always been into GPUs. I apologize if I'm coming off as overbearing here, but you kinda need to get these details right or it gets harder to trust ANY of your data.
 
Ugh… now I need to go and look at that. I had some old charts and I may have simply inadvertently used FP32 results for the GPU Shader compute chart. Funny that it took this long for anyone to notice! Anyway, it was supposed to provide a look at theoretical vs real-world compute, but clearly I got some of the math wrong.
 

froggx


BTW, great article. I'm still running a GTX 1070, and while my primary use case is gaming, I do run SD every now and then. I'm forced to use FP32 shaders in SD, so it doesn't take much imagination to get an idea of just how badly it performs. I plan to upgrade my GPU at some point in 2024, and even though gaming performance has been my priority, I've been wanting to get an idea of how AI performs with the acceleration provided by modern hardware, and this article was exactly what I was looking for. Nvidia has had a huge lead in AI compatibility for the longest time thanks to CUDA, so it's good to know that AMD and Intel are getting into the game with somewhat competitive solutions.
 
Okay, everything should be "fixed" now. Obviously, the GPU shader results are merely interesting to look at, as outside of the RDNA 2 chips none of the GPUs tested are running the computations on GPU shaders — they're all using the matrix / tensor operations.

Turing, RDNA2/3, and Arc all have double the throughput for FP16 shader operations, while Ampere and Ada have the same throughput as their FP32 shader computations. I've also dropped the grey "no sparsity" bar from the one chart and just have two options: with sparsity and without. It should make things cleaner. And the extra decimal point of precision on the one chart has been dropped (I have some chart generation code that decides based on the chart number how many decimals to show, and it wasn't updated to work properly with the non-sparsity chart).
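For anyone who wants to sanity check the corrected chart, the theoretical shader FP16 number is just shader count times two FLOPS per clock times the boost clock, multiplied by that FP16:FP32 ratio. A quick sketch, using approximate reference core counts and clocks (treat the exact figures as illustrative):

# Theoretical shader FP16 TFLOPS = shaders * 2 FLOPS per clock * boost GHz * ratio.
# The FP16:FP32 ratio is 2 for Turing, RDNA 2/3, and Arc shader FP16, and 1 for
# Ampere/Ada shader FP16; tensor/matrix cores are a separate, higher figure.
def shader_fp16_tflops(shaders: int, boost_ghz: float, fp16_ratio: int) -> float:
    return shaders * 2 * boost_ghz * fp16_ratio / 1000

# Approximate reference specs, purely for illustration.
print(f"RTX 2080 Ti (Turing, 2:1): {shader_fp16_tflops(4352, 1.545, 2):.1f} TFLOPS")
print(f"RTX 3070 Ti (Ampere, 1:1): {shader_fp16_tflops(6144, 1.77, 1):.1f} TFLOPS")

That's also why fixing the ratio puts the 2080 Ti ahead of the 3070 Ti for shader FP16, even though the 3070 Ti has the higher FP32 number.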

Cheers for the heads up!
 
Hey, maybe I'm a little late to the party, but I would like to request that you benchmark using a real UI, rather than that txt2waifu generator Automatic1111.

As a professional organization, you should have switched off of that godforsaken platform more than two months ago.

Also, please don't lie about there being no TensorRT support for SDXL models. Automatic1111 is NOT Stable Diffusion. Its lack of support and awfully slow pace of development is not representative of the Stable Diffusion scene as a whole.

The sooner people realize that the name "stable diffusion webui" is something the developers chose themselves, rather than an official adoption by any of the creators of Stable Diffusion, the better.

Because the officially adopted webui is ComfyUI, and frankly I'm already getting one image per second without using TensorRT or anything specialized besides xformers.

So your benchmarking results are about as useful to real users as Automatic1111 itself. Not very.
 
Okay, maybe take a chill pill and step it down a notch on the rhetoric.

First, Automatic1111 does have a "real" UI, which is ComfyUI as you point out. That is not really an important part of the testing. Prior testing used different front-ends, but the results weren't radically changed. Second, the last testing was done in early November, and at the time, SDXL was not working with TensorRT — at least not without some extra effort and code that I didn't have access to. Is it working now? I don't know, as I haven't checked, but probably — Nvidia (and AMD and Intel) are routinely submitting source code updates.

It's clear you have an axe to grind with Automatic1111, and frankly I couldn't care less about that aspect. Automatic1111 has been forked by a lot of projects, and it's generally considered the default choice for many people. More importantly, Nvidia, AMD, and even Intel have all offered up specific instructions and tuning to get it working on every modern GPU. That's a critical element of my testing: being able to run the tests on non-Nvidia GPUs. I am not doing any coding whatsoever for these tests, at least in regard to Stable Diffusion, but I've regularly touched base with the GPU companies regarding their "optimal test paths," and that's what was used here.

Is A1111 perfect? I'm sure it's not. But the "txt2waifu" insult you toss in there is meaningless invective. The models and prompts used are what ultimately determine the output, not the front-end and web interface. In this case, using SD 1.5 means the results depend on the training of the model from Hugging Face, not A1111.

The rest of what you say is basically shouting into a vacuum. You got one image per second? Okay... on what exactly? Because you didn't provide any details whatsoever about your test settings, hardware used, or even the software you used, and that makes your statements completely useless.

Which is ironic, because you assert that my benchmarking results are useless, even though I've provided detailed instructions on what was tested and how it was tested. So, if you really want to have a dialog with me about testing, drop the attitude and provide real suggestions and advice, not ad hominem attacks — against either me or the creator of Automatic1111.

This is the "nice response" from me. Again, tone it down, and if you have recommendations, give concrete examples and links rather than invective and thinly veiled insults.
 
Just for fun I tried it, and you should know that you can render an AI image with ComfyUI on an RX 480 8GB using DirectML.
It works surprisingly well for such an old card.
 
Interesting article, but I'm a little confused as to how the 4060 Ti 16GB is performing worse than the 4060 Ti (I assume 8GB). I read an article on the MSI website and they benchmark the 4060 higher than the 3080, which is much higher up on your list: https://www.msi.com/blog/stable-diffusion-xl-best-value-rtx-graphics-card

Any thoughts on why this would be? I was seriously considering a 4060, but looking at your tests it doesn't seem worth it...
The 16GB card is a Gigabyte model that we purchased for review. The 4060 Ti 8GB is the Founders Edition. In our testing, we found that the 16GB card consistently runs at slightly lower clocks. Probably that's because with the same power limit, the 16GB card has to also provide power to twice as much VRAM. I don't know how much power that amounts to (maybe ~10W?), but that would also help account for the lower average clocks.

Beyond that, the testing done here optimizes for throughput and is not generally VRAM limited. So having twice as much VRAM means the 4060 Ti 16GB can run larger models and may perform better with workloads that use more than 8GB of memory, but Stable Diffusion (at least the non-XL variants) doesn't fall into that range.
 

purpleduggy

AMD hasn't put any optimization into these AI apps. I bet if they focused on it, they would be far more competitive; the hardware can do it, but there's no support from developers. Nvidia puts a lot of money out there making sure devs optimize for consumer CUDA, whereas consumer OpenCL compute has almost zero budget apart from proprietary private use and some open source Linux devs. This is ironic, as OpenCL has far more datacentre usage than CUDA. CUDA is only really for consumers to demo tech. Most of the real big projects are on OpenCL, and they are not public. Even in datacentres with Nvidia H100 cards, big datasets are run with OpenCL and proprietary code. The exception was crypto mining, which was big on CUDA, but only because it used consumer cards; apart from that, almost no one uses CUDA in the datacentre because of the risk of proprietary lock-in and closed source security considerations. AMD Instinct MI300X are the de facto datacentre chips right now because of Epyc integration, they are solely OpenCL, and they are sold out just like the Nvidia H100. If consumers really knew how little CUDA is used in the datacentre, and instead regarded Nvidia cards as the fastest professionally on OpenCL, they would treat OpenCL with far more importance.
 
My understanding is that OpenCL has been lagging; Otoy's Octane renderer dropped support for OpenCL, as an example. But AMD has absolutely put optimizations into play with the DirectML branch used for testing.

Like, literally! AMD had developers working with the community to implement features that are present in RDNA 3 GPUs, specifically the AI accelerator stuff. It's basically a different method of accessing the compute resources that are already present, but clearly it works because without those extra steps, AMD's performance drops about 30-40 percent.
 

purpleduggy

Only publicly. The majority of Nvidia H100 clusters use OpenCL, but they never make the projects public for anyone to see, so everyone assumes CUDA. It's actually OpenCL that Nvidia dominates in.