News Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

Status
Not open for further replies.
And this is the reason why people are happily buying the 4090, even if right now it's not top dog in all AI metrics. Semi-professionals and even university labs make good use of heavy compute for robotics projects and other general-purpose AI work.

As for AMD, I was actually surprised to see how decently the 7900 XTX performs, considering it's not a compute uArch, which means CDNA must be truly good at it. Even with early drivers and such.

Thanks a lot for this quick glance at SD performance across the board 😀

EDIT: After the update, it is top dog indeed, haha. Thanks again!

Regards.
 
We test all the modern graphics cards in Stable Diffusion and show which ones are fastest, along with a discussion of potential issues and other requirements.

Stable Diffusion Benchmarked: Which GPU Runs AI Fastest : Read more
I've been looking for this, but things went wrong here. It would have been good to ask the user forum for advice.

The 2048x1152 test is more representative. Diffusing at 512x512 pixels suffers from poor shader occupancy. Increasing the batch size improves shader occupancy and scales up the performance, which is why we use a batch size of 8 or more. Cards with large numbers of shaders, like the 3090/4090, show their power at large batch sizes. You could report iterations per second * batch size on the charts. Testing 512x512 with a batch size of 8 would be interesting. Alternatively, the batch size could vary across cards.

In practice we generate 1000s of 512x512 images to find a good seed, so the large batch size is normal procedure.
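As a rough illustration of that metric, here is a minimal sketch using the Hugging Face diffusers library (an assumption on my part; it is not one of the packages used in the article) that times a batch of 8 at 512x512 and reports it/s scaled by batch size:

```python
import time

import torch
from diffusers import StableDiffusionPipeline

# Assumptions: a CUDA GPU with enough VRAM for a batch of 8 at 512x512 in FP16,
# and the public "runwayml/stable-diffusion-v1-5" weights (an illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "postapocalyptic steampunk city, cinematic, hyper detailed"
batch_size = 8
steps = 50

start = time.time()
pipe(prompt, num_images_per_prompt=batch_size,
     num_inference_steps=steps, height=512, width=512)
elapsed = time.time() - start

# Report iterations per second multiplied by batch size, as suggested above.
print(f"{steps / elapsed * batch_size:.2f} effective it/s (it/s x batch size)")
```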
 
  • Like
Reactions: bit_user
I am having a heck of a time trying to see those graphs without a major magnifying glass. How about a zoom option??

For Firefox, the workaround is to right-click on the image and select 'open image in new tab'. Chrome/Edge should be similar.
 
  • Like
Reactions: bit_user
I think a large contributor to the 4080 and 4090 underperformance is compatibility-mode operation in PyTorch 1.13 + CUDA 11.7 (Lovelace gains support in CUDA 11.8 and is fully supported in CUDA 12). On my machine I have compiled the PyTorch pre-release version 2.0.0a0+gitd41b5d7 with CUDA 12 (along with builds of torchvision and xformers).

The results I got for the same code and prompts as in the article, on the 4090, were:
25.91 it/s (without xformers)
35.75 it/s (with xformers)
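For anyone trying to reproduce something like this, a small sketch (assuming a diffusers-based script rather than the exact webui build used in the article) shows how to confirm which CUDA toolkit a PyTorch build targets and how to switch on xformers attention:

```python
import torch
from diffusers import StableDiffusionPipeline

# Confirm the PyTorch build and the CUDA toolkit it was compiled against
# (e.g. a 2.0 pre-release built with CUDA 12, as described above).
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))

# Illustrative model choice; not necessarily the checkpoint used in the article.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# With the xformers package installed, diffusers can use its memory-efficient
# attention; Automatic1111's webui exposes the same thing via its --xformers flag.
pipe.enable_xformers_memory_efficient_attention()
```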
 
  • Like
Reactions: KyaraM
Thanks for the article Jarred, it's unexpected content and it's really nice to see it!

We ended up using three different Stable Diffusion projects for our testing, mostly because no single package worked on every GPU. For Nvidia, we opted for Automatic 1111's webui version. AMD GPUs were tested using Nod.ai's Shark version, while for Intel's Arc GPUs we used Stable Diffusion OpenVINO. Disclaimers are in order. We didn't code any of these tools, but we did look for stuff that was easy to get running (under Windows) that also seemed to be reasonably optimized.
Thanks for the disclaimers; they are rather important. Just looking at the charts, one would assume (like I did when I missed that paragraph!) that it's all the same code base.

As for AMD, I was actually surprised to see how decently the 7900 XTX performs
Same. Nvidia has had the data science and AI landscape basically locked into CUDA (see On the state of Deep Learning outside of CUDA's walled garden | by Nikolay Dimolarov | Towards Data Science - https://towardsdatascience.com/on-the-state-of-deep-learning-outside-of-cudas-walled-garden-d88c8bbb4342 ), even in so-called "open source" projects that actively refuse anything that isn't CUDA code. Unfortunately, that article from mid-2019 still largely applies, with the exception of Intel pushing oneAPI.

I'm rather interested to see how the Arc architecture and Intel's Xeon accelerators perform with properly optimised drivers and code that uses their hardware.
 
Do you have one on how to use Intel's Python distribution with their optimised oneAPI Math Kernel and Data Analytics libraries instead of the standard Python ones, please?

No, I don't use them. I am using the best community-developed interface, the Automatic1111 web UI. This project has over 3367 commits. I also own an RTX 3060 12GB, just purchased for AI.

This stuff is unfortunately extremely complicated. You really need to be very good at Python programming, and you also need a great deal of knowledge and experience with machine-learning libraries.

Also, I have a Discord channel, and people feel very bad about having purchased an 8GB RTX 3060 Ti or a 10GB RTX 3080 instead of a 12GB RTX 3060 :)
 
  • Like
Reactions: AndrewJacksonZA
Thank you.

Thank you so much for the article. I think you could add information about how important VRAM really is for training and batch generation, because batch generation (processing multiple images in parallel) makes a huge difference.

Also, when you want to use upscaling algorithms, you again need more VRAM.
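As a rough way to see this, here is a hedged sketch (assuming a PyTorch/diffusers stack rather than any specific tool from the article) that measures peak VRAM for a small batch:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative model choice; any SD 1.x checkpoint behaves similarly here.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Track peak VRAM for a batch of 4 images at 512x512.
torch.cuda.reset_peak_memory_stats()
pipe("a castle on a hill", num_images_per_prompt=4, num_inference_steps=30)

# Peak usage grows with batch size (and with output resolution, e.g. for
# upscaling passes), which is why 8GB and 10GB cards hit limits first.
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```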
 
AI is just way overrated. There was a good article the other day about Google kicking its AI employees to the curb.

Nvidia, AMD, Intel, Microsoft, and Google all based their growth models on the idea that companies would need datacenters and computing power for AI.

We're now down to blending images together and spambots as the target audience for AI. The promised billion-dollar AI industry is turning into a nothingburger.
This will age like milk
 
AI is just way overrated. There was a good article the other day about Google kicking its AI employees to the curb.

Nvidia, AMD, Intel, Microsoft, and Google all based their growth models on the idea that companies would need datacenters and computing power for AI.

We're now down to blending images together and spambots as the target audience for AI. The promised billion-dollar AI industry is turning into a nothingburger.
Take a look at the "Two Minute Papers" YouTube channel.
Copilot is magic, and ChatGPT already gives answers that Google can't.
Online AI does stuff that Photoshop can't do or that is too cumbersome.
Nvidia DLSS multiplies the power of existing hardware.

AI is already jaw-dropping, and this is barely starting.
 
  • Like
Reactions: KyaraM and bit_user
AI is just way overrated. There was a good article the other day about Google kicking its AI employees to the curb.

Nvidia, AMD, Intel, Microsoft, and Google all based their growth models on the idea that companies would need datacenters and computing power for AI.

We're now down to blending images together and spambots as the target audience for AI. The promised billion-dollar AI industry is turning into a nothingburger.
This might be the single worst take I've ever seen, regardless of the topic.
 
Take a look at the "Two Minute Papers" YouTube channel.
Copilot is magic, and ChatGPT already gives answers that Google can't.
Online AI does stuff that Photoshop can't do or that is too cumbersome.
Nvidia DLSS multiplies the power of existing hardware.

AI is already jaw-dropping, and this is barely starting.

Yeah, Google canned their AI unit not because they aren't going to do AI, but because their AI team was failing compared to the competition. They know ChatGPT is a serious threat, and they are going to spend massively on AI to defend their turf.

AI is indeed doing amazing stuff. The business-services potential is just dumbfounding; it may well eliminate 5 million jobs in the next 10 years.
 
  • Like
Reactions: KyaraM and bit_user
Hi Jarred, did you use FP32 or FP16 for your benchmarks, especially for AMD?

The results you got show only about half of the it/s you can get on AMD 6000 GPUs in my experience, which suggests you used FP32. Is this consistent among all benchmarks or are some using FP32 and some FP16?
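For context, in a generic PyTorch/diffusers script (an assumption on my part; the projects tested in the article each handle precision differently) the FP32-versus-FP16 choice comes down to the dtype the pipeline is loaded with. A rough sketch:

```python
import time

import torch
from diffusers import StableDiffusionPipeline


def time_run(dtype):
    # Everything except the dtype (prompt, steps, resolution) stays fixed,
    # so the two numbers isolate the FP32-vs-FP16 difference.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=dtype  # illustrative checkpoint
    ).to("cuda")
    steps = 50
    start = time.time()
    pipe("a lighthouse at dusk", num_inference_steps=steps, height=512, width=512)
    its = steps / (time.time() - start)
    del pipe
    torch.cuda.empty_cache()
    return its


print("FP32:", time_run(torch.float32), "it/s")
print("FP16:", time_run(torch.float16), "it/s")
```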
 
  • Like
Reactions: Nikolay Mihaylov
Looks like an AI version of Dr. Seuss...
To me, images like these seem like a good source of inspiration for artists who draw this sort of stuff by hand.

However, one thing I don't like about them is that I can't really immerse myself in the world shown by the image, as there are certain things that just don't make sense. When the image is drawn by a human, there's usually thought behind everything in the image and you're rewarded by scrutinizing all the various details. I'm sure there will be more sophisticated AI image & world generators that address some of these issues, but I think there's still a ways yet to go.
 
@jarred, can you add the 'zoom in' option for the benchmark graphs? TIA.
I was away for the weekend, but I added this now. Also, I will be retesting some things in light of emails I have received suggesting some ways to 'fix' the 40-series numbers.
Let me make a benchmark that may get me money from a corp, to keep it skewed!
Are you suggesting that I ran these tests and intentionally skewed the results, to get more money from... AMD? Or maybe it's Nvidia paying me? Because if so, you're wrong in all cases.
 
I've been looking for this, but things went wrong here. It would have been good to ask the user forum for advice.

The 2048x1152 test is more representative. Diffusing at 512x512 pixels suffers from poor shader occupancy. Increasing the batch size improves shader occupancy and scales up the performance, which is why we use a batch size of 8 or more. Cards with large numbers of shaders, like the 3090/4090, show their power at large batch sizes. You could report iterations per second * batch size on the charts. Testing 512x512 with a batch size of 8 would be interesting. Alternatively, the batch size could vary across cards.

In practice we generate 1000s of 512x512 images to find a good seed, so the large batch size is normal procedure.
There are multiple factors at play, which I already described in the article, that make your suggestions effectively impossible to implement at present.
  1. Getting something that will run on all the GPUs was a key consideration — Intel GPUs in particular are a sticking point.
  2. Large batch sizes (generating multiple images concurrently) require a lot more VRAM, which means you can't do size-8 batches on many of the cards.
  3. Even if you wanted to do larger batch sizes, it's not supported on many of the SD projects (see point one).
  4. Only Automatic 1111 (possibly some others that I haven't found) supports the higher resolution outputs.
  5. Stable Diffusion is trained on 512x512 images (1.x) or 768x768 images (2.x), so asking for outputs in a different resolution causes a lot of odd rendering issues (two-heads problem, mutant limbs, etc.)
I included the high-res chart at the end to show exactly what you mention, that the lower resolution means the fastest GPUs may not be fully utilized. But since 512x512 is the general standard, it's good to start there as a baseline. I hope that in the coming months articles like this will encourage Intel, AMD, and Nvidia to improve performance, and the developers to improve the user interfaces and GPU support as well.
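To make the resolution point (point 5 above) concrete, here is a minimal sketch using Hugging Face diffusers as an illustrative stand-in (not one of the projects actually tested): a non-native output size is just a height/width request, but it's exactly where the duplication artifacts and the extra VRAM demand show up.

```python
import torch
from diffusers import StableDiffusionPipeline

# A 1.x model, trained at 512x512; illustrative checkpoint choice.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "post-apocalyptic city landscape"

# At the training resolution the model tends to produce one coherent scene.
native = pipe(prompt, height=512, width=512).images[0]

# Far above the training resolution it will still run (given enough VRAM),
# but it tends to repeat subjects, the kind of "two-heads" duplication
# mentioned in point 5 above.
wide = pipe(prompt, height=1152, width=2048).images[0]
```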
 
Hi Jarred, did you use FP32 or FP16 for your benchmarks, especially for AMD?

The results you got show only about half of the it/s you can get on AMD 6000 GPUs in my experience, which suggests you used FP32. Is this consistent among all benchmarks or are some using FP32 and some FP16?
As noted in the text, the RX 6000 results are very low, and the same goes for the RTX 40 results. I received an email from Nod.ai stating, "SHARK is currently running tuned models on RDNA3 and untuned models on RDNA2 and we plan to offer similar tuned models for RDNA2 in the future." Whether that's moving from FP32 to FP16 on RDNA 2, or just tuning the algorithm to extract more performance, the net result will be the same: Much better performance.

Also note what is required to get AMD's GPUs working right now. You have to use a less popular project (or run Linux), and you have to use a specific beta driver intended for AI/ML (which has known bugs). If you want to try Automatic 1111's project via Linux, you end up with the ROCm stuff, and that only supports Navi 21 (possibly Navi 31 now, though I haven't checked), which eliminates a bunch of GPUs from the list as well. So AMD's support is lagging right now, as is Intel's, but hopefully things will improve and I'll be revisiting this in the coming months.
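For anyone going the Linux route, a tiny check (stock PyTorch APIs only, nothing specific to any of these projects) confirms whether a given PyTorch install is a ROCm or a CUDA build:

```python
import torch

# On a ROCm build of PyTorch, torch.version.hip is set and the GPU still shows
# up through the torch.cuda API; on a regular CUDA build torch.version.hip is None.
print("HIP:", torch.version.hip)
print("CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```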
 
  • Like
Reactions: AndrewJacksonZA
Getting something that will run on all the GPUs was a key consideration — Intel GPUs in particular are a sticking point.

You should also include a high-end task and just have the benchmark read 'FAIL' for GPUs that can't complete it. That's extremely useful information when considering what to buy.
 
To me, images like these seem like a good source of inspiration for artists who draw this sort of stuff by hand.

However, one thing I don't like about them is that I can't really immerse myself in the world shown by the image, as there are certain things that just don't make sense. When the image is drawn by a human, there's usually thought behind everything in the image and you're rewarded by scrutinizing all the various details. I'm sure there will be more sophisticated AI image & world generators that address some of these issues, but I think there's still a ways yet to go.
A lot of the output for images depends on the input. The description I used was given, and I by no means claim to have created an 'ideal' input. Obviously, it got inspiration from Borderlands and some other stuff, because I said "steampunk." I had originally tried "cyberpunk" but got very lame results. Basically, I wanted to generate images of post-apocalyptic cities. But if you just ask for "post-apocalyptic city landscape", you get garbage like this, which, incidentally, also illustrates the problem with targeting higher resolutions (the "two-head anomaly" has become the "multi-city anomaly").

[Attached image: example "post-apocalyptic city landscape" output showing the repeated-city artifact]

So then you add a bunch of other descriptors to try and limit/improve the output, and some things work while others don't. But for benchmarking purposes, the content of the images is largely superfluous. You just need to keep the parameters consistent to make sure the results are meaningful.
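As a sketch of what keeping parameters consistent looks like in practice (assuming a diffusers-style script rather than the exact setup used for the article), fixing the prompt, seed, step count, and resolution is enough to make timings comparable across runs:

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; the point is the fixed parameters, not the model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A fixed seed, prompt, step count, and resolution make runs repeatable;
# the image content itself doesn't matter for throughput comparisons.
generator = torch.Generator(device="cuda").manual_seed(1234)
image = pipe(
    "postapocalyptic steampunk city, cinematic, hyper detailed",
    num_inference_steps=50,
    height=512,
    width=512,
    generator=generator,
).images[0]
image.save("benchmark_sample.png")
```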
 
  • Like
Reactions: bit_user