News Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

I honestly think the actual outcome of an image should be irrelevant (many people don't even use SD for making people; you could just use a prompt like "generic plain background filled with 100 different objects in random colors" and it wouldn't matter), since I thought you were testing performance, not whether the results look great. It's also quite obvious that some lower-end cards can't create any image of 1024 or higher due to a lack of VRAM. That being said, I still vouch for 768x768, 1024x1024, and even 1280x1280 performance tests, with the "low memory" card option enabled in auto1111's settings (meant for cards with 8-11 GB of VRAM), and for the test itself use only batch size = 1 and batch count = 40.
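For example, a rough sketch of that kind of timing loop with the Hugging Face diffusers library (the model ID, prompt, and step count here are just placeholders, not the article's harness; auto1111 already reports it/s in its console):

Code:
# Rough sketch: timing 768x768 generation, batch size 1, batch count 40.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps, batch_count = 20, 40
start = time.time()
for _ in range(batch_count):  # batch size = 1, batch count = 40
    pipe("generic plain background filled with 100 random objects",
         width=768, height=768, num_inference_steps=steps)
print(f"~{steps * batch_count / (time.time() - start):.2f} it/s average")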

Furthermore, since this is my favourite HARDWARE TEST site, it would also be interesting to mention which cards actually failed the test. I know quite a BIG group of people who are shopping around right now and want to see, for example, how far the 4070 Ti can go in terms of maximum resolution for a single batch of images. It is mighty fast (your test shows it faster than the 3090 and even the 3090 Ti), but those two powerful cards, with double the amount of VRAM, might easily handle, say, 1280x1280, while the 4070 Ti simply gets a memory error.

Anyways, I love the post and your results, thanks for sharing, and I hope there will be an update soon with additional test results :)
But the problem then goes back to which version of SD I'm running. Again, Automatic 1111 on Windows only works with Nvidia cards. Period. I tested all of those, as you can see, down to the GTX 1660 Super that was so slow I didn't feel the need to go further. With --medvram (something like that), even 6GB Nvidia cards can do the 2048x1152 testing... though I have to admit that with the new version, I'm not entirely sure what's being done — is it using IMG2IMG to fill in the details at higher resolutions, or is it doing some form of upscaling? Anyway, until I can get Automatic 1111's version running on AMD and Intel cards, testing at more customized settings is the biggest problem.
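For reference, the rough equivalent of --medvram when working in the diffusers library directly is attention slicing plus VAE tiling; this is just a sketch, not what the article's testing used:

Code:
# Hedged sketch: memory-saving options in diffusers for large renders such as
# 2048x1152 on cards with limited VRAM. Savings vary by card and version.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()  # lower peak VRAM, slightly slower
pipe.enable_vae_tiling()         # decode the large image in tiles

image = pipe("test prompt", width=2048, height=1152,
             num_inference_steps=20).images[0]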

But I do have the above VoltaML to try on AMD. We'll see how that goes. Maybe it will be faster than the SHARK version I've been running, but I suspect not on the 7000-series.
 
Jan 26, 2023
Yesterday I got VoltaML up and running. It is using TensorRT but this is still in an experimental branch and not yet released.
I got 88.7 it/s on my 4090! 512x512 in less than 1/3 of a second.
 
  • Like
Reactions: bit_user
Dec 31, 2022
But the problem then goes back to which version of SD I'm running. Again, Automatic 1111 on Windows only works with Nvidia cards. Period. I tested all of those, as you can see, down to the GTX 1660 Super that was so slow I didn't feel the need to go further. With --medvram (something like that), even 6GB Nvidia cards can do the 2048x1152 testing...

2048x1152 would be a bit overkill, as it requires "expert" prompting to still get a large image with only one head, for the reasons you already mentioned in your news post.
However, 768x768 and 1024x1024 are very realistic. Not many people I know on the official Stable Diffusion Discord actually use 512x512 when doing realistic-looking (photography-like) people, since upscaling those afterwards gives poor results, so they start with a higher resolution straight from txt2img.

Anyways, I still hope for 768x768 benchmarks, because I'm sure the list of results will be in a totally different order :)


Yesterday I got VoltaML up and running. It is using TensorRT but this is still in an experimental branch and not yet released.
I got 88.7 it/s on my 4090! 512x512 in less than 1/3 of a second.

I acknowledge and can vouch for hexexpert that VoltaML is a huge performance upgrade (I'd guess around 25%) compared to xformers, but I can only speak for myself using auto1111, a 4000-series card, and updated CUDA. For higher-resolution images together with a batch of >100, it will definitely make everyone happy :)
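For anyone on diffusers rather than auto1111, xformers is a one-line switch (a sketch, assuming the xformers package is installed; in auto1111 it's just the --xformers launch flag):

Code:
# Hedged sketch: enabling xformers memory-efficient attention in diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # commonly cited as ~20-25% faster on RTX cards

image = pipe("test prompt", num_inference_steps=20).images[0]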
 
Dec 31, 2022
Oops, forgot to ask: please DO consider the following cards; there is quite a run on all of these on eBay ;)

V100, P100, A100, P6000, A40, A6000, A5000, A4000
 

bit_user

Polypheme
Ambassador
Oops, forgot to ask: please DO consider the following cards; there is quite a run on all of these on eBay ;)

V100, P100, A100, P6000, A40, A6000, A5000, A4000
Aren't those all server GPUs and crazy expensive, not to mention passively-cooled? How is that remotely relevant to almost any reader of this site?

That said, if Jarred happened to have access to one or more of them, they would be an interesting basis for comparison, if only to show the impact of memory scaling.
 
  • Like
Reactions: JarredWaltonGPU
Dec 31, 2022
Aren't those all server GPUs and crazy expensive, not to mention passively-cooled? How is that remotely relevant to almost any reader of this site?

I can get a dozen RTX A4000s for $200 each on several marketplaces for used GPUs.
I think this is ideal for people who want similar performance at much lower power consumption, so pardon me, but this IS relevant. The topic isn't gaming here, the topic is Stable Diffusion. Furthermore, they deliver a LOT more it/s than you might think; it's not even about the memory alone.
 

bit_user

Polypheme
Ambassador
I can get a dozen RTX A4000s for $200 each on several marketplaces for used GPUs.
Link, please?

That's an RTX 3070-class GPU that sells new for about $1000. Also, it's an actively cooled workstation card (I thought it might be, but was too lazy to check before). Used ones seem to go for about $500 on eBay. For everyone's benefit, some here might like to know where to buy one for a mere $200.
 
Link, please?

That's an RTX 3070-class GPU that sells new for about $1000. Also, it's an actively cooled workstation card (I thought it might be, but was too lazy to check before). Used ones seem to go for about $500 on eBay. For everyone's benefit, some here might like to know where to buy one for a mere $200.
Mostly for the OP, but:

I don't have access to any of the server cards — they're not something Nvidia samples, and we definitely don't have a budget to buy them. From what I've seen so far, Stable Diffusion cares more about compute than memory bandwidth, so performance will mostly match what you see on the GeForce side of things where applicable. But A100/H100 and the like, I have no idea. Also, with more VRAM you can do generation of larger batch sizes and improve throughput, but again that's beyond the scope of what I can test.
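As a rough illustration of the batch-size point (a sketch with diffusers, not how the article's numbers were produced):

Code:
# Hedged sketch: comparing throughput at different batch sizes. More VRAM lets
# you raise num_images_per_prompt, which usually improves images/second even
# though the reported it/s per step drops.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for batch in (1, 4, 8):
    start = time.time()
    pipe("test prompt", num_inference_steps=20, num_images_per_prompt=batch)
    print(f"batch {batch}: {batch / (time.time() - start):.2f} images/s")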
 
@JarredWaltonGPU I've actually had good results on AMD gpus using https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs - My stock 6650xt hits around 4-5 iterations per second on the standard 512x512 prompt. I think performance on AMD cards is heavily driver stack dependent at the moment, and it seems like AMD are making improvements to ROCm.
You're running under Linux I assume? My initial attempts to get that working (on the RX 6000) had issues, but the Docker version worked. I'll have to see about retesting this at some point, though I'm still hoping to get Windows working better instead. :)
 
Feb 3, 2023
This is true, using stock Arch Linux. Oddly enough, the 3060 and the 6650 XT are almost neck and neck for me on Linux. I get fewer iterations on the 3060 (it maxes out at around 4.5 or so) with SD on Linux than you do with tuned SD on Windows. It seems very sensitive to driver stacks and runners.
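For anyone trying to reproduce this on a 6650 XT: that card (gfx1032) isn't on ROCm's official support list, so the commonly suggested workaround is to override the GFX version before anything initializes. A hedged sketch:

Code:
# Hedged sketch: the usual unofficial workaround for RDNA2 cards that ROCm
# doesn't officially support. Set the variable before importing torch.
import os
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

import torch
print(torch.version.hip)          # a version string on ROCm builds of PyTorch
print(torch.cuda.is_available())  # ROCm reuses the torch.cuda API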
 
Feb 8, 2023
Aren't those all server GPUs and crazy expensive, not to mention passively-cooled? How is that remotely relevant to almost any reader of this site?

That said, if Jarred happened to have access to one or more of them, they would be an interesting basis for comparison, if only to show the impact of memory scaling.
Plenty of us are running SD on cloud GPUs (RunPod, Gradient, Colab, etc.), so benchmark information will really help in deciding which GPU to use or rent.
 
  • Like
Reactions: bit_user
Feb 13, 2023
This is great information! I currently have a Titan RTX, which is pretty good because it has 24GB of VRAM, but it seems to be lacking in it/s.

Is there any possibility of benchmarking some of the workstation-grade cards? I would be interested to see how well an RTX 6000 Ada stacks up against a 4090.
 

Fulgurant

Distinguished
Nov 29, 2012
Very interesting article.

For what it's worth, I ran Mr. Walton's test on two different machines running Fedora 36, and using Automatic1111's software. My result for the GTX 1060 (6 GB) was an average of 1.03 iterations per second. This is considerably faster than the article's result for the 1660 Super, which is a stronger card.

My result for the RX 6800 was an average of 6.98 iterations per second, after ten runs. This is more than twice as fast as the article's charted result, with Nod.ai on Windows. And I didn't even bother installing the MIOpen kernel stuff, the lack of which apparently slowed down my first run in the 10-batch series.

(Mr. Walton is 100% correct that installing this sort of thing is a pain in the ass. I already had Automatic1111 up and running; it just wasn't worth my time to figure out MIOpen on Fedora instead of Ubuntu, or to start the whole process over again in a Docker container. Naturally the documentation is severely lacking, and as with many Linux issues, Google is no help at all.)

Anyway, in terms of performance, it looks like Linux is the way to go.
 
Mar 13, 2023
You're running under Linux I assume? My initial attempts to get that working (on the RX 6000) had issues, but the Docker version worked. I'll have to see about retesting this at some point, though I'm still hoping to get Windows working better instead. :)

When you retest, please use the appropriate libraries for the Arc cards (https://www.intel.com/content/www/u...tensorflow-stable-diffusion-on-intel-arc.html) if you didn't before. Unfortunately, I think it will require some code changes, but without them my hope is that the GPU simply wasn't being utilized much, if at all. If what you measured is the true performance of the Arc GPUs, then that is really surprising (and unfortunate for Intel).
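As a rough sketch of what "using the appropriate libraries" can look like on Arc (assuming the Intel Extension for PyTorch XPU build is installed; Intel's guide linked above uses its own setup):

Code:
# Hedged sketch: pointing a diffusers pipeline at an Arc GPU via Intel
# Extension for PyTorch (IPEX), which registers the "xpu" device.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (import registers xpu)
from diffusers import StableDiffusionPipeline

assert torch.xpu.is_available(), "Arc GPU not visible to PyTorch"

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("xpu")
image = pipe("test prompt", num_inference_steps=20).images[0]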
 

igor002020

Commendable
May 22, 2020
For the Arc A380 I'm now getting 2.4 it/s for SD 1.4 and 2.7 it/s for SD 2.1 with the latest version of OpenVINO. I'm using the OpenVINO notebooks (225 and 236). For 236, the notebook has to be slightly modified to set width/height to 512x512 (it defaults to 768x768).
 

jahu00

Reputable
Nov 22, 2019
I've done some tests on my RX 6700 non-XT. When generating an image using the SD 1.5 model with Euler A, 20 steps, CFG 7, 512x512, I get the following results with different stacks:

- ROCm 5.33it/s
- SHARK 3.08it/s
- DirectML 1.75it/s

ROCm was on Ubuntu (and it was a pain to get working). SHARK (Vulkan) and DirectML were on Windows. I used the DirectML version of AUTOMATIC1111 for DirectML (obviously), which appears to rely on the DirectML build of Torch. I've heard there is also an older DirectML path called ONNX, but I haven't tested it yet.

Setting batch size to 2 reduces the reported it/s, but actually appears to marginally increase total throughput (probably due to concurrency). Estimating from render times, for batch size 2 I get about:

- ROCm 6it/s
- SHARK doesn't work for some reason
- DirectML 2it/s

I'm probably not comparing apples to apples here, as I have to run each stack with different settings. ROCm works without extra settings. SHARK needs the low-VRAM setting to work with Euler A (other sampling methods appear to work without it, but require more steps). DirectML required "--opt-sub-quad-attention --medvram --disable-nan-check --no-half --precision full". Using models other than base SD leads to different it/s numbers; models that produce better results appear to take more time to render (for example, I only get 5 it/s on ROCm when using DreamShaper). Then there are also different versions of models; some come in half or full precision, which may affect render time (and to a lesser degree output quality). Also, I used ROCm 5.4.3, which might (or might not) have affected performance one way or another.
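To give a feel for how different the DirectML path is, a minimal sketch with the torch-directml package (this is my rough understanding of what the DirectML fork wires up, not its actual code):

Code:
# Hedged sketch: running diffusers on a DirectML device on Windows. FP32 is
# used because half precision is flaky on this path (hence --no-half
# --precision full above).
import torch
import torch_directml
from diffusers import StableDiffusionPipeline

dml = torch_directml.device()  # first DirectML-capable adapter

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to(dml)
image = pipe("test prompt", num_inference_steps=20).images[0]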

Officially, ROCm is only available on Linux and works with only a handful of AMD cards, but it can be made to work with many more. There might even be a way to make ROCm work on Windows using Antares, but I have no clue how. The 7000 series doesn't appear to be supported on ROCm yet, but the SHARK numbers look amazing so far (plus the card has plenty of VRAM, which is useful for AI). Maybe ROCm could increase them by another 30%?

Edit: Apparently it might be possible to use xformers with ROCm if this bug gets resolved (https://github.com/ROCmSoftwarePlatform/hipify_torch/issues/39). xformers appears to increase performance on Nvidia by 20-25%, so I would expect something similar on AMD.
 
  • Like
Reactions: Fulgurant

jahu00

Reputable
Nov 22, 2019
Having done some research, I found that:

- The RX 7000 series is likely to be supported by ROCm 5.5
- A ROCm 5.5 release candidate was spotted on the internet, and if you happen to hunt it down and recompile PyTorch with it, you could in theory use it to run Stable Diffusion on RX 7000 series GPUs (quick sanity check sketched below)
- AMD is planning to bring ROCm to Windows, and some rumors suggest they want to do it even before releasing ROCm 5.5
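A quick sanity check for such a hand-built PyTorch would be something like this (a hedged sketch, assuming the build succeeded):

Code:
# Hedged sketch: verifying that a PyTorch built against a ROCm 5.5 candidate
# actually sees an RX 7000-series card before trying Stable Diffusion on it.
import torch

print(torch.__version__)
print(torch.version.hip)                  # HIP/ROCm version the build was made against
if torch.cuda.is_available():             # ROCm builds expose the torch.cuda API
    print(torch.cuda.get_device_name(0))  # should report the Radeon card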

ROCm (and Windows support) would make the 7000 series very attractive in price/performance for AI applications (not to mention the massive VRAM). SHARK has impressive results, but ROCm should be faster, and other applications (like LLMs) are not yet implemented on the Vulkan stack. Hopefully someone will figure out how to fix xformers on AMD (it appears to be a minor software issue).

Edit: The numbers people got from the leaked ROCm 5.5 on the RX 7900 XTX were worse than in SHARK, namely around 15 it/s. It's quite possible 5.5 is not optimized for the AI cores on the card and that will come in 5.6, so it's a bit of a bummer TBH.
 
  • Like
Reactions: Fulgurant

jahu00

Reputable
Nov 22, 2019
12GB of VRAM on the 4070 Ti should be enough to generate more complex images using SD (higher resolution with ControlNet). I was running out of VRAM on my RX 6700 non-XT with 10GB, and I think the model was already running in FP16 mode, so there was no easy way around it.

In general, it might be possible to reduce model size further by using FP8 or even FP4 (not sure how much that would degrade the results), but I'm under the impression that current hardware can't run models efficiently at precision limited to that extent. There is also no guarantee that all new models will work well with FP16. Apparently, there are parts of SD 2.1 that visibly degrade when not run in FP32. Then again, it might be possible to have mixed-precision models (maybe SD 2.1 already does that).
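To put rough numbers on that, a quick back-of-the-envelope (the ~860M figure is the commonly quoted parameter count for the SD 1.x UNet; an assumption on my part):

Code:
# Hedged back-of-the-envelope: weight memory at different precisions.
# Activations, the VAE, and the text encoder come on top of this.
unet_params = 860e6  # approximate SD 1.x UNet parameter count (assumption)
for bits, name in ((32, "FP32"), (16, "FP16"), (8, "FP8"), (4, "FP4")):
    print(f"{name}: ~{unet_params * bits / 8 / 2**30:.2f} GiB for weights alone")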

I'm currently in the market for a GPU for AI tasks, and I'm not even considering cards with less than 16GB of VRAM. You might be able to do basic image generation using SD on 8GB (or even less) of VRAM, but LLMs appear to be much bigger (16GB, for example). Additionally, running LLMs takes extra VRAM as you go (each new "message" in the conversation eats up more memory), so even a 16GB card will be useless for running such a model. Those models can be loaded in FP8, but they run dog slow with that setting (at least on my card). I have no idea what effect FP8 has on the quality of LLMs.
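For reference, on the Nvidia side 8-bit loading typically looks roughly like this with the transformers library (this is the bitsandbytes int8 path rather than true FP8, and the model name is only an example):

Code:
# Hedged sketch: loading an LLM with 8-bit weights via bitsandbytes.
# Requires the bitsandbytes and accelerate packages; model name is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # example only
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)
inputs = tok("Hello", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))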

Apparently an impressive new image generation network (DeepFloyd IF) has just dropped, and in its current state it needs as much as 24GB of VRAM to be fully functional.

I was considering the 7900 XTX with its 24GB of VRAM and (for now) limited support for AI, but if models continue to grow in size I might seriously consider buying a W6800 with its 32GB of VRAM instead. Even in its current state the XTX is more powerful than the W6800. The XTX's performance might increase in the future (with software updates), but the W6800 will likely stay where it is. At least the W6800 already has a software stack for AI (and hopefully AMD won't drop support for the 6000 series yet).

ROCm support for the 7000 series will likely come around the time the W7900 comes out. If rumors are true, we will initially only see about 75% of what is currently possible in SHARK (comparable to what a 3080 does). However, ROCm is the go-to stack for AI on AMD cards. ROCm (via Torch) allows an easy transition from CUDA, but SHARK is a different stack altogether. All the cool AI apps can run on ROCm (or most of them, anyway); on the other hand, existing apps would have to be modified to use SHARK. Only the future will tell whether sticking with tried-and-tested CUDA/ROCm is the way to go, or whether the technology behind SHARK (or something else) is the future.

P.S. There are so many stacks and hardware configurations. Some people have reported up to 3 it/s for a 512x512 image on some M1 machines (while others reported terrible performance), and some have even reported a not-too-shabby 8 it/s on a 16GB A770 (no idea which stack).
 
  • Like
Reactions: bit_user
May 5, 2023
Regarding the Intel card, I tested it on this fork,
with Euler a, 100 steps, CFG 15, 512x512, batch count 10, and sub-quadratic cross-attention optimization:
[4.08it/s, 5.86it/s, 5.99it/s, 6.05it/s, 6.05it/s, 6.05it/s, 6.04it/s, 6.04it/s, 6.03it/s, 6.02it/s]

on an A750
 
May 19, 2023
Thanks for the info. There are not many AI benchmarks out there yet, so I'm happy to have this reference. Now I do have a question: what is the performance with NVLink and dual video cards? Lambda did NVLink tests with the 20-series cards on deep learning, but not on any other AI test. I have seen rumors that Stable Diffusion might be able to run with multiple video cards, like running two image generations at the same time, although I'm unsure if this is true. I don't have access to the extra video cards to test it, so I was wondering if you could run a test with the AIs in this article and see whether multiple video cards increase performance, or whether you just get multiple images generated at the same time.

If it is possible to generate 2x images at the same time, then two 2080s could match the performance of a 4080. Are you able to test the 20-series with dual GPUs, as well as the 30- and 40-series? Updating this article with the 4060 might be an option, too. Thanks.
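On the "two generations at once" idea, the usual approach is plain data parallelism rather than NVLink: one pipeline per GPU, each rendering its own image. A hedged sketch (not something the article tested):

Code:
# Hedged sketch: naive multi-GPU Stable Diffusion, one pipeline per card in
# separate processes. Two cards roughly double images/second, but a single
# image does not render any faster, and no NVLink is involved.
import torch
import torch.multiprocessing as mp
from diffusers import StableDiffusionPipeline

def worker(device: str, prompt: str) -> None:
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to(device)
    pipe(prompt, num_inference_steps=20).images[0].save(
        f"out_{device.replace(':', '_')}.png")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(f"cuda:{i}", "test prompt"))
             for i in range(torch.cuda.device_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()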
 