News Stable Diffusion Benchmarked: Which GPU Runs AI Fastest

You should include also a high-end task and just have the benchmark read 'FAIL' for GPUs that fail it. That's extremely useful information when considering what to buy.
It's almost like I should just say what was tested and why in the text. 🤔 I mean, people accuse us of bias all the time, but just putting "FAIL" on a bunch of cards in the charts is potentially the most misleading sort of thing to do.
 
It's almost like I should just say what was tested and why in the text. 🤔 I mean, people accuse us of bias all the time, but just putting "FAIL" on a bunch of cards in the charts is potentially the most misleading sort of thing to do.

I think the article text understates it, or maybe I just misunderstand it. If Intel GPUs can't run most of the software out there, that's useful information and not made obvious enough in the article text. The way it reads currently it just looks like they are a bit on the slow side. Which if they were cheap enough might still tempt you.
 
I think the article text understates it, or maybe I just misunderstand it. If Intel GPUs can't run most of the software out there, that's useful information and not made obvious enough in the article text. The way it reads currently it just looks like they are a bit on the slow side. Which if they were cheap enough might still tempt you.
Well, the Intel GPUs had to use a specific SD project made to use OpenVINO (Intel's library), and even then they underperform. AMD's GPUs have to use projects that either run in Linux via ROCm (and are limited to Navi 21 last I tried), or use a project that requires a specific AI/ML beta driver for Windows testing. So in that sense, both AMD and Intel GPUs aren't all that dissimilar, but Intel's GPU market share (for Arc) is so small that it's hardly surprising it hasn't had much uptake. It's the chicken and egg problem for both companies:

Which GPU should you buy for AI/ML stuff? Nvidia, the undisputed market leader right now, or one of the other two? Most people doing AI/ML opt for Nvidia, which means they code for Nvidia, which means no one makes projects for AMD/Intel, which means people looking at what to buy go back to Nvidia yet again. It takes a lot of effort on the part of the hardware companies to break that cycle.
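To make the fragmentation concrete, here's a rough Python sketch (my own illustration, nothing from the article's test setup) that checks which of the three vendors' PyTorch paths are even available on a given machine. torch-directml and intel_extension_for_pytorch are optional add-on packages that may not be installed:

```python
import torch

# Stock PyTorch: torch.cuda covers both Nvidia (CUDA) and AMD (ROCm/HIP) builds.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"torch.cuda sees: {torch.cuda.get_device_name(0)} via {backend}")
else:
    print("No CUDA/ROCm device visible to stock PyTorch")

# AMD on Windows currently goes through DirectML via the optional torch-directml package.
try:
    import torch_directml
    print(f"DirectML device available: {torch_directml.device()}")
except ImportError:
    print("torch-directml not installed (the AMD-on-Windows path)")

# Intel Arc needs Intel's extension (XPU backend) or an OpenVINO-based SD project.
try:
    import intel_extension_for_pytorch  # noqa: F401
    print(f"Intel XPU available: {torch.xpu.is_available()}")
except ImportError:
    print("intel_extension_for_pytorch not installed (the Arc path)")
```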
 
Please note: Updates are in progress. Or rather, retesting is in progress and updated charts will follow. The main change will be retesting Nvidia GPUs with new PyTorch libraries, which boosts performance on the RTX 40-series quite a bit. I'll also be doing tests with the batch size set to 4, so basically generating four 512x512 images concurrently. This increases memory use and GPU shader utilization and improves overall throughput by perhaps 25–30%, but it's only supported (AFAIK) in Automatic 1111's version of Stable Diffusion, meaning I can't test this way on AMD or Intel GPUs. I'm also curious about how much VRAM is required for this to work. I know it works on 12GB cards, but it may not run on 10GB or 8GB. We'll see...
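For anyone who wants to approximate that batch-of-four workload outside of Automatic 1111, a minimal sketch with Hugging Face's diffusers library looks roughly like this (the model ID, prompt, and step count are placeholders, not the article's actual test configuration):

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder model; the article's testing used its own builds and settings.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Four 512x512 images generated as one batch; VRAM use grows with batch size,
# which is why 8GB or 10GB cards may not manage it.
result = pipe(
    "a photo of an astronaut riding a horse",
    height=512,
    width=512,
    num_inference_steps=50,
    num_images_per_prompt=4,
)

for i, image in enumerate(result.images):
    image.save(f"batch_{i}.png")
```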

I also want to run some tests with RTX 20-series, and maybe check a GTX card as well just for kicks. I'm not sure if the GTX 16-series will handle SD well at all, but it may be fun to check. :)
 
We're now down to blending images together and spambots being the target audience for AI. The promised billion $ AI industry is turning into a nothingburger.

That's too generalized IMO. The success depends on the application.

AI is a great tool when used correctly. There are several things we use ML/Deep Learning for in real world applications that are much easier to maintain and much more flexible than old school algorithms.

The issue is, people are expecting waaaaay too much from current AI tech. Can it drive a car? Yeah, it can. Can it drive like a human does? Heeeellll no, but it's not going to stop people from trying and hoping it all works out.
 
Which GPU should you buy for AI/ML stuff? Nvidia, the undisputed market leader right now, or one of the other two? Most people doing AI/ML opt for Nvidia, which means they code for Nvidia, which means no one makes projects for AMD/Intel, which means people looking at what to buy go back to Nvidia yet again. It takes a lot of effort on the part of the hardware companies to break that cycle.
Yeah, hence the linked article in my post up top about even the "open source" projects actively rejecting code contributions that aren't CUDA. Nvidia bought the market early on. AMD tried with ROCm a while back, but it's a disaster on Windows. Intel has put, and is still putting, A LOT of money and effort into oneAPI, so I'm pretty hopeful about them.
 
Nvidia bought the market early on.
They indeed funded a lot of development in this space. Perhaps more importantly, they had good Linux driver & compute support across their entire hardware range, including consumer GPUs. A lot of the initial work on deep learning happened back in the days when AMD was flirting with bankruptcy, so it's a little hard to blame them for not doing more at the time.

AMD tried with ROCm a while back,
ROCm, or Radeon Open Compute, is what they call their software stack on Linux. ROCk is the kernel driver, ROCm provides the device libraries, and ROCr is the userspace runtime. I think the Linux focus is due to how development is funded, which is apparently by sales of their workstation, server, and HPC accelerators. Those are the only cards they officially support, on Linux.

Anyway, they went to the trouble of creating a CUDA-compatible API layer, called HIP (Heterogeneous-Compute Interface for Portability). I guess they were worried about copyright infringement if they directly implemented their own version of CUDA.
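At the framework level you can see the effect of that layer: a ROCm build of PyTorch exposes HIP through the familiar torch.cuda namespace, so CUDA-targeted Python code frequently runs unmodified. A tiny sketch (assuming a single AMD GPU at device index 0):

```python
import torch

# On a ROCm build, torch.version.cuda is None and torch.version.hip is set,
# but the "cuda" device string still works because HIP mirrors the CUDA API.
print("torch.version.cuda:", torch.version.cuda)
print("torch.version.hip: ", torch.version.hip)

x = torch.randn(4096, 4096, device="cuda")  # maps to HIP on ROCm, CUDA on Nvidia
y = x @ x                                    # same code path either way
print(y.device, y.shape)
```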

Intel has put, and is still putting, A LOT of money and effort into oneAPI, so I'm pretty hopeful about them.
Both Intel and AMD have tools to help convert CUDA code to their respective APIs. I've also heard about an independent 3rd party implementation of CUDA that runs atop oneAPI, but I'm not sure how far they've gotten or if its developer is still working on it.
 
ROCm, or Radeon Open Compute, is what they call their software stack on Linux. ROCk is the kernel driver, ROCm provides the device libraries, and ROCr is the userspace runtime. I think the Linux focus is due to how development is funded, which is apparently by sales of their workstation, server, and HPC accelerators. Those are the only cards they officially support, on Linux.
ROCm on Linux does have support for Navi 21 (not sure on Navi 31 yet), which was how I did the higher resolution testing with the Automatic 1111 build. I haven't retried that with the 7900 series yet, because it involves swapping drives and then figuring out what I need to do with AMD's drivers. Hopefully it "just works" with the latest 23.1.1 drivers, but I'll give it a shot tomorrow... assuming I can find whatever SSD I used for the Linux installation last time! LOL

CUDA was basically bootstrapped by the supercomputing space, or in other words the US government. Now they're trying to bootstrap both AMD (ROC) and Intel (OneAPI), probably because they're sick of dealing with Nvidia proprietary stuff. Except, all the existing supercomputers that use Nvidia are still doing CUDA, and it will take years to change that. Frontier, El Capitan, and (eventually) Aurora are paving the way, though.
 
With the frenzy that ChatGPT has generated, companies (and presumably individuals) are scrambling to get up to speed on AI development. I'm wondering if that would cause a surge in demand for Nvidia GPUs, especially the 4000 series. Are we going to see a jump in Nvidia GPU pricing as we did with crypto?

Nvidia made products dedicated to crypto. Will there be a market for AI-dedicated PC add-in cards?
 
With the frenzy that ChatGPT has generated, companies (and presumably individuals) are scrambling to get up to speed on AI development. I'm wondering if that would cause a surge in demand for Nvidia GPUs, especially the 4000 series. Are we going to see a jump in Nvidia GPU pricing as we did with crypto?

Nvidia made products dedicated to crypto. Will there be a market for AI-dedicated PC add-in cards?
I would not be remotely surprised if 75% of RTX 4090 cards have been purchased by companies and research groups for AI work. I have no data to back that up, other than the higher than MSRP prices of the past three months, but there have definitely been rumors to that effect. I had Ed Crisler (PR for Sapphire) on the TH Show a few weeks ago and he felt it was more like 90% were sold to AI professionals.
 
With the frenzy that ChatGPT has generated, companies (and presumably individuals) are scrambling to get up to speed on AI development. I'm wondering if that would cause a surge in demand for Nvidia GPUs, especially the 4000 series. Are we going to see a jump in Nvidia GPU pricing as we did with crypto?
I've long figured the AI-fueled demand for GPUs has been a somewhat consistent undercurrent that maybe hasn't gotten quite the press of the crypto boom. You do have a point that ChatGPT might indeed supercharge the sector; however, there are probably enough purpose-built AI accelerators on the market that not all of that demand will go towards GPUs.

Luckily for the gaming community, it seems ChatGPT is far too big for either training or inference on consumer GPUs.


According to someone in that thread, it's about 500 GB. So, you'd need at least an 8-GPU HGX H100 system to train it (or probably even inference it at any kind of decent speed).
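Back-of-the-envelope, assuming a GPT-3-class model of roughly 175B parameters (my numbers, not anything confirmed about ChatGPT itself), the sizing works out something like this:

```python
# Rough sizing sketch for a ~175B-parameter model.
params = 175e9

weights_fp16_gb = params * 2 / 1e9      # 2 bytes/param in fp16 -> ~350 GB
hgx_h100_gb = 8 * 80                    # 8x H100 SXM at 80 GB each = 640 GB

# Mixed-precision training with Adam is commonly estimated at ~16 bytes/param
# (weights + gradients + optimizer states), before counting activations.
training_state_gb = params * 16 / 1e9   # ~2800 GB

print(f"fp16 weights: ~{weights_fp16_gb:.0f} GB vs {hgx_h100_gb} GB on one HGX node")
print(f"training state: ~{training_state_gb:.0f} GB, i.e. multiple nodes to train")
```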


Of course, Nvidia is currently making both H100 and their RTX 4000 models on the same process node. That means they can shift wafer allocation between the two. Late last year, they announced they would be doing just that (i.e. shifting wafers over to H100 or maybe Grace CPU production).

Nvidia made products dedicated to crypto. Will there be a market for AI-dedicated PC add-in cards?
Tensor cores came from the desire to accelerate AI workloads. Their gaming GPUs have quite a bit of AI horsepower, but it's their 100-series products that are really suited towards working with such huge models - not only due to compute horsepower and bandwidth, but also multiple NVLink ports for scaling to multi-GPU.

The only change they could really make is to cut down on the HPC-oriented compute (i.e. the fp64 stuff), to make room for even more tensor cores. As it stands, I'm not sure how much die space that stuff occupies, but I once compared the transistor counts of Vega 20 with its predecessor and found less than 6% additional transistors were needed to add half-rate fp64 support. That number seems small to me, but it suggests we'll probably see both HPC and AI continue to be served by a single product line.
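For what it's worth, the arithmetic behind that "less than 6%" figure, using the commonly quoted transistor counts (12.5B for Vega 10, 13.2B for Vega 20; those are my recollection, not from the article):

```python
# Commonly quoted transistor counts; Vega 20 adds half-rate fp64 (among other changes).
vega10 = 12.5e9
vega20 = 13.2e9

extra = (vega20 - vega10) / vega10
print(f"additional transistors: {extra:.1%}")  # ~5.6%, i.e. "less than 6%"
```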

For reference, here's a table I recently made to compare the AI compute power of high-end consumer vs. datacenter GPU products.
Make      Model                       Tensor TFLOPS (f16)    Memory Bandwidth (TB/s)
Intel     Xeon 8480+                  115                    0.31
Intel     Xeon Max 9480               109                    1.14
Nvidia    H100                        1979*                  3.35
Nvidia    RTX 4090                    661*                   1.01
Nvidia    RTX 3090                    142                    0.94
Intel     Data Center GPU Max 1550    839                    3.20
Intel     A770                        138                    0.56
AMD       Instinct MI250X             383                    3.28
AMD       RX 7900 XTX                 123                    0.96
AMD       RX 6950 XT                  47                     0.58

* Nvidia has taken to rating their tensor performance assuming sparsity, which gives up to 2x their performance on dense matrices. To estimate their performance on dense matrices, simply divide these figures by 2. That said, sparsity is quite common, which is why they bothered to accelerate it. So, it's not entirely unfair for them to quote the sparse number.
 
I would not be remotely surprised if 75% of RTX 4090 cards have been purchased by companies and research groups for AI work.
The genius of Nvidia's commitment to supporting CUDA across their entire product range is that you can be sure a bunch of university students are buying Nvidia gaming GPUs for dual-use on both their coursework + gaming.

AMD has been missing out on this market segment, although I know they're trying hard to catch up. I've even seen a forum post by a professor who was simply exasperated from trying to support students using AMD GPUs and had taken to requiring specifically Nvidia GPUs for their courses.
 
We test all the modern graphics cards in Stable Diffusion and show which ones are fastest, along with a discussion of potential issues and other requirements.

Stable Diffusion Benchmarked: Which GPU Runs AI Fastest : Read more
I have a 4090 and get 39.5 it/s with A1111 with the SD 2.1 model at 512x512, euler_a, batchsize=1.
There were 2 huge factors other than the obvious use of xformers that everyone probably knows about.
  1. PyTorch is bundled with cuDNN 8.5 or older. With this I only got 13.5 it/s at batchsize=1 and could only get high throughput with batchsize=16. When I upgraded to cuDNN 8.7 I was able to get over 39 it/s. I've convinced the PyTorch folks to upgrade the cuDNN, and a PR is in progress.
  2. Not everyone with a 4090 could hit 39.5 it/s with my workaround. That's when I discovered that CPU speed makes a surprising difference for something that is primarily a GPU process. I have an i9-13900K with DDR5-6400 CL32 memory. If I run A1111 on my E-cores, which have 75% the speed of my P-cores, I get only 75% of the it/s, and nvtop shows only 75% GPU utilization.
Finally, it appears that most people on Windows can't get the performance I see on Ubuntu using the manual upgrade of cuDNN. It does indeed improve performance, but only up to 25 to 30 it/s depending on their setup.
I do notice that on Ubuntu there is 0% system time used for A1111. This seems to indicate that it's using kernel-bypass access to the hardware, perhaps similar to what NVMe 2 SSDs can do.
On Windows, one user reported to me seeing about 13% kernel time, but I haven't been on Windows for a while to verify that myself.
If you have an AMD processor, even the 128-thread beast, you can't push a 4090 to the max doing serialized one-image-at-a-time generations at 512x512.
 
I have a 4090 and get 39.5 it/s with A1111 with the SD 2.1 model at 512x512, euler_a, batchsize=1.
There were 2 huge factors other than the obvious use of xformers that everyone probably knows about.
  1. PyTorch is bundled with cuDNN 8.5 or older. With this I only got 13.5 it/s at batchsize=1 and could only get high throughput with batchsize=16. When I upgraded to cuDNN 8.7 I was able to get over 39 it/s. I've convinced the PyTorch folks to upgrade the cuDNN, and a PR is in progress.
  2. Not everyone with a 4090 could hit 39.5 it/s with my workaround. That's when I discovered that CPU speed makes a surprising difference for something that is primarily a GPU process. I have an i9-13900K with DDR5-6400 CL32 memory. If I run A1111 on my E-cores, which have 75% the speed of my P-cores, I get only 75% of the it/s, and nvtop shows only 75% GPU utilization.
Finally, it appears that most people on Windows can't get the performance I see on Ubuntu using the manual upgrade of cuDNN. It does indeed improve performance, but only up to 25 to 30 it/s depending on their setup.
I do notice that on Ubuntu there is 0% system time used for A1111. This seems to indicate that it's using kernel-bypass access to the hardware, perhaps similar to what NVMe 2 SSDs can do.
On Windows, one user reported to me seeing about 13% kernel time, but I haven't been on Windows for a while to verify that myself.
If you have an AMD processor, even the 128-thread beast, you can't push a 4090 to the max doing serialized one-image-at-a-time generations at 512x512.
Curious. I may need to try updating to SD 2.1 and see if that affects performance much, as it definitely could. I may also need to give Ubuntu a shot with Nvidia GPUs to see if that changes things much. But long-term, I don't want to be forced into using Linux for testing AI performance. It involves swapping SSDs, and installing drivers and updates on Linux is never quite as straightforward as on Windows in my experience (i.e., the RX 7900 cards not working right now after wasting hours on the effort is a prime example).
 
long-term, I don't want to be forced into using Linux for testing AI performance. It involves swapping SSDs, and installing drivers and updates on Linux is never quite as straightforward as on Windows in my experience (i.e., the RX 7900 cards not working right now after wasting hours on the effort is a prime example).
I hear you and can't argue with that, but it's unfortunate because I think the deep learning community is overwhelmingly Linux-based.

For users of software like Stable Diffusion and readers of this article, I expect the bias is tilted more towards Windows.
 
Curious. I may need to try updating to SD 2.1 and see if that affects performance much, as it definitely could. I may also need to give Ubuntu a shot with Nvidia GPUs to see if that changes things much. But long-term, I don't want to be forced into using Linux for testing AI performance. It involves swapping SSDs, and installing drivers and updates on Linux is never quite as straightforward as on Windows in my experience (i.e., the RX 7900 cards not working right now after wasting hours on the effort is a prime example).
I have a dual boot setup sharing a single Samsung 990 pro SSD.
I believe the SD 2.1 model made only about a 2 it/s difference.
Removing the cuDNN v8.old(?) drivers and installing v8.7 makes a huge difference, although not as much on Windows as I and others see on Linux.
There is a thread on the A1111 GitHub and on Reddit about how to upgrade to v8.7.
I hope the PyTorch folks get the changes merged soon. However, they are only doing it in the nightly builds for PyTorch 2.0.0, which isn't GA, but I've used it without problems.
All of this is a changing situation, as there are folks on Windows trying to figure out why they can't get to the 39.5 it/s I see. They are not maxing out their 4090, while I hit 100% busy on mine.
Let me know if you need any help on the Linux setup.
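If anyone wants to check what their own install is actually running before and after the upgrade, a quick sketch (the version numbers in the comments are just examples):

```python
import torch

# cuDNN reports as an integer, e.g. 8500 for 8.5.x or 8700 for 8.7.x.
print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# xformers is installed separately and matters a lot for A1111 throughput too.
try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers not installed")
```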
 
Are you saying your Raptor Lake can?

In general, you want to pipeline the work for a GPU, to minimize the amount of time it spends waiting on the CPU.
Yes, my Raptor Lake at 5.7 to 5.8 GHz can push a 4090 to the max at 100% busy with single image generation.
Anything slower, including using my 4.3 GHz E-cores, can only keep the GPU 75% busy, and timing shows it is only 75% as fast.
Single image serialized generation is important for video generation.
Obviously, if you do batchsize=2 or higher, or run multiple A1111 instances, you will get all the benefits of the 4090.
As far as pipelining goes: yes, I know that, but A1111 does not have a "pipeline" button.
But as an experiment I did modify A1111 to decouple the GPU work from the post-generation work on the CPU and move that into a second Python thread.
This allows the next image generation to start ASAP without waiting on image saving or other post-generation work.
But I only improved throughput by about 7%, and I suspended work on that when I discovered these much larger improvements.
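Conceptually, that decoupling looks something like the rough sketch below (a generic illustration, not my actual A1111 changes; generate_image() is just a stand-in for the real sampler call): a background thread drains a queue of finished images so saving never blocks the next generation.

```python
import queue
import threading
from PIL import Image

def generate_image(i: int) -> Image.Image:
    # Stand-in for the GPU-bound sampler call in the real pipeline.
    return Image.new("RGB", (512, 512), color=(i % 256, 0, 0))

save_queue: queue.Queue = queue.Queue()

def save_worker():
    # Background CPU thread: image saving never stalls the generation loop.
    while True:
        item = save_queue.get()
        if item is None:          # sentinel: shut down
            save_queue.task_done()
            break
        image, path = item
        image.save(path)
        save_queue.task_done()

threading.Thread(target=save_worker, daemon=True).start()

for i in range(40):
    img = generate_image(i)                     # GPU-bound work in the real case
    save_queue.put((img, f"out_{i:03d}.png"))   # hand off post-generation work

save_queue.put(None)
save_queue.join()                               # wait for pending saves to finish
```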
 
I would just like to say that I am pleasantly surprised at how civil and calm this conversation has been compared to pretty much anywhere else on the internet when it's been a conversation regarding Intel, Nvidia, and AMD.

12/10, four thumbs up, would comment again. ;-)
 
AI is just way overrated. There was a good article the other day about Google kicking its AI employees to the curb.

Nvidia, AMD, Intel, Microsoft, and Google: their whole growth model was based on this idea that companies would need datacenters and computing power for AI.

We're now down to blending images together and spambots being the target audience for AI. The promised billion $ AI industry is turning into a nothingburger.


Funny, other articles I read said they cut jobs to refocus on AI. Their DeepMind AI lab is one of the best-funded in the world, and they are on the cusp of releasing Sparrow AI to completely redefine Internet search. Hold tightly to your skepticism, because 2023 is going to be the breakout year where AI starts to redefine the world for the average person.
 
...Stable Diffusion is trained on 512x512 images (1.x) or 768x768 images (2.x), so asking for outputs in a different resolution causes a lot of odd rendering issues (two-heads problem, mutant limbs, etc.)

I honestly think that the actual outcome of an image should be irrelevant (many people don't even use SD for making people; you can also just use a prompt like "generic plain background filled with 100 different objects with random colors", it doesn't really matter), as I thought you were testing performance, not whether the results look great or not. Furthermore, it's quite obvious that some lower-end cards can't create any image of 1024 or higher due to a lack of VRAM. That being said, I still vouch for 768x768, 1024x1024, and even 1280x1280 performance tests, with the "low memory" card option enabled in the auto1111 settings (which is for cards with 8, 9, 10, or 11 GB of VRAM), and for the test itself do only batch size = 1 and batch count = 40.

Furthermore, it's also interesting (this being my favourite HARDWARE TEST site) to mention which cards actually failed the test. I know quite a BIG group of people who are looking around now and want to see, for example, how far the 4070 Ti can go in terms of maximum resolution for a single batch of images. It is mighty fast (your test shows it faster than the 3090 and even faster than the 3090 Ti), but those two powerful cards, with double the amount of VRAM, might easily do, let's say, 1280x1280, while the 4070 Ti simply gets a memory error.

Anyway, I love the post and your results. Thanks for sharing, and I hope there will be an update soon with additional test results :)
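In that spirit, a rough sketch of the kind of max-resolution sweep being requested, using the diffusers library rather than the Automatic 1111 UI (so it only approximates the article's setup; the model ID, prompt, and step count are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Step the resolution up until the card runs out of VRAM, and record where it fails.
for size in (512, 768, 1024, 1280):
    try:
        pipe("generic background filled with 100 random objects",
             height=size, width=size, num_inference_steps=20)
        print(f"{size}x{size}: OK")
    except torch.cuda.OutOfMemoryError:
        print(f"{size}x{size}: FAIL (out of VRAM)")
        break
```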
 
I honestly think that the actual outcome of an image should be irrelevant (many people don't even use SD for making people; you can also just use a prompt like "generic plain background filled with 100 different objects with random colors", it doesn't really matter), as I thought you were testing performance, not whether the results look great or not.
I think it's important to pick a realistic example, since that affects the size and complexity of the workload. It's no use benchmarking on some trivial case that doesn't represent what actual users of the program would do with it!
 