Stable Diffusion Benchmarks: 45 Nvidia, AMD, and Intel GPUs Compared

Feb 23, 2024
This is incorrect. The Intel Arc A770 does not do well with Stable Diffusion; there are so many issues with it. Mine takes 45 minutes to produce one image at high resolution, which apparently isn't normal, even after following all the instructions.
 
I haven't seen issues with up to 768x768 on Intel Arc and SD1.5/SD2.1, but I have had problems getting SDXL running. I haven't had time to revisit the subject recently, however, so I don't know if anything has changed in the past few months.
 
Mar 13, 2024
Hello Jarred, and thank you for this work!

You should definitely benchmark the AMD GPUs on a Linux distribution with the optimized PyTorch build for Linux. On my side, the RX 6950 XT FE version computes around 10 it/s, whereas yours only gets 6.6 it/s... That's 50% more performance on my side, so you should try it before making updates :)

By the way, I am using an openSUSE Tumbleweed distro.
 
You can't compare it/s versus images per minute, FYI, and you really shouldn't use it/s results, as they aren't comparable across different architectures and Stable Diffusion projects. It/s and images/min are not the same, because it/s ignores some of the stages of image generation.

I do 24 images, with 50 steps per image, in whatever batch size results in maximum throughput. The setup time is not insignificant: even if the main loop reports something like 10 it/s, which works out to ~5 seconds per image at 50 steps, in reality there might be 2–3 seconds of setup and finalization time, meaning 7–8 seconds per image.

That's potentially the difference between 12 images/min and 7.5 images/min.
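To make that concrete, here's a minimal sketch of the conversion (assuming a fixed per-image setup/finalize overhead; the numbers are illustrative, matching the example above, not benchmark results):

```python
# Convert a reported it/s figure into effective images per minute,
# accounting for per-image setup/finalize overhead.
def images_per_minute(it_per_sec, steps_per_image=50, overhead_sec=0.0):
    denoise_sec = steps_per_image / it_per_sec  # what the progress bar measures
    total_sec = denoise_sec + overhead_sec      # what the wall clock actually sees
    return 60.0 / total_sec

print(images_per_minute(10.0))                  # 12.0 img/min, loop time only
print(images_per_minute(10.0, overhead_sec=3))  # 7.5 img/min with 3 s of overhead
```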

Now I'm not saying Linux isn't potentially faster. However, Linux and ROCm only support a smaller subset of AMD's GPUs (Navi 21, 31, 32, and 33 — no Navi 22, 23, 24 IIRC, though perhaps I'm wrong and Navi 22 is supported). And I also suspect Nvidia and Intel may be faster under Linux as well, especially if I sort through multiple SD projects to look for one that gives higher performance. But in general, I don't expect the margins to change too much.

Also, as noted in the text, the latest DirectML project used for testing AMD GPUs does not work optimally on RDNA 2. But likewise, finding the 'best' version of Nod.ai's Shark Studio app for optimal performance takes time, and there have been various issues with testing different resolutions, batch sizes, SD versions, etc.

The bottom line is that RTX 40- and 30-series parts are much faster in general for AI workloads, followed by RDNA 3. RDNA 2 tends to be so far behind the others that it's more of an academic exercise: yes, you can run SD on RDNA 2 GPUs (and even RDNA, GTX, etc. if you're willing to wait a long time), but relative performance will be quite low.
 
Mar 17, 2024
1
0
10
Thanks for your article, Jarred. I can attest to how slow the RX 6600 XT is, and I'm thinking of buying an RTX 40-series GPU.
 
Mar 13, 2024
Hello Jarred, and thank you for your answer. I am sorry for the confusion between images/min and it/s. Indeed, I have done my tests using the same parameters as you, and I get 24 512x512 images in 127 seconds, which means around 5.3 seconds per image (roughly 11.3 images per minute), and that's even lower than your results... o_O
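For what it's worth, here's the quick conversion of those numbers (a back-of-the-envelope sketch, assuming the 127 seconds covers all 24 images end to end):

```python
images, total_sec = 24, 127
print(total_sec / images)       # ~5.3 seconds per image
print(images / total_sec * 60)  # ~11.3 images per minute
```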

Thanks again for your answer, and have a good day! :)
 
Which is better for working on AI: the RTX 3060 12GB or the RTX 3060 Ti 8GB?
That depends on what you're doing. Some AI workloads need lots of VRAM, in which case the 3060 12GB will be able to run some tasks that fail on the 3060 Ti. However, the 3060 Ti has more raw compute and bandwidth, and for many inference workloads it will easily beat the 3060. If you really only care about having sufficient VRAM to run specific AI workloads, that's one of the few areas where the 4060 Ti 16GB can actually make sense (understanding that, yes, it costs a lot more than a used 3060 or 3060 Ti).

Stable Diffusion doesn't typically need more than 8GB on Nvidia GPUs, though Stable Diffusion XL might benefit from added VRAM. I haven't tried doing a bunch of SDXL testing yet, as last time I checked there were some "gotchas" with getting it to work on all the various GPU hardware (meaning Nvidia, AMD, and Intel GPUs).
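If it's useful, here's a minimal sketch (assuming a PyTorch + CUDA setup; the thresholds are illustrative rules of thumb from this discussion, not hard limits) for checking how much VRAM a card exposes before deciding what to run:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    # Illustrative rule of thumb: SD 1.5/2.1 is generally comfortable within
    # 8GB, while SDXL and larger batch sizes appreciate more memory.
    if vram_gb >= 12:
        print("Should handle SDXL and larger batch sizes more comfortably.")
    else:
        print("Fine for SD 1.5/2.1; SDXL may need smaller batches or offloading.")
else:
    print("No CUDA device detected.")
```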
 
Jun 5, 2024
Hello.

@JarredWaltonGPU Thank you very much for taking the time to test this.

There is one thing I don't understand. For the AMD RX 7000-series cards, the values in the "Theoretical GPU Shader Compute FP16 using Shader Cores only" graph and the values in the "Theoretical GPU Shader Compute FP16 using AI accelerators (no sparsity)" graph are the same. For example, the RX 7600 is 43.0 in both graphs.

This seems odd. The accelerators, in AMD's case, are what enable the WMMA instructions.
Is the first graph wrong, is the second graph wrong, or is there something I am missing?

Thanks
 
The WMMA instructions improve throughput, but the theoretical compute doesn't change. So for example, using GPU shaders, the RX 7900 XTX has 6144 shaders at 2.5 GHz, and each of those can do four FP16 FMA (fused multiply add) instructions per clock, which counts as eight operations. That gives 122.9 teraflops (6144 * 8 * 2.5 GHz = 122,880 gigaflops).

If you use the AI accelerator WMMA instructions, you still get the same theoretical throughput. It's 512 FP16 operations per CU per clock, and with 96 CUs that's 96 * 512 * 2.5 GHz = 122,880 gigaflops. But from AMD's documentation:

"The WMMA instruction optimizes the scheduling of data movement and peak math operations with minimal VGPR access by providing source data reuse and intermediate destination data forwarding operations without interruption. The regular patterns experienced in matrix operations enable WMMA instructions to reduce the required power while providing optimal operations that enable sustained operations at or very near peak rates."

So even though the theoretical throughput is the same, the real-world throughput with WMMA ends up being higher.
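Here's that arithmetic laid out as a quick sketch (the shader count, CU count, and 2.5 GHz clock are the figures quoted above):

```python
# Theoretical FP16 throughput for the RX 7900 XTX, computed two ways.
shaders, cus, clock_ghz = 6144, 96, 2.5

# Via the shader cores: four FP16 FMAs per shader per clock = 8 FP16 ops.
shader_gflops = shaders * 8 * clock_ghz
print(shader_gflops)  # 122880.0 GFLOPS, i.e. ~122.9 TFLOPS

# Via the AI accelerators (WMMA): 512 FP16 ops per CU per clock.
wmma_gflops = cus * 512 * clock_ghz
print(wmma_gflops)    # 122880.0 GFLOPS, the same theoretical peak
```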
 
Jun 5, 2024
I understand. Thank you very much for taking the time to respond.
It has been hard so far to get a grip on the real potential of each card.
As you pointed out, there is a disconnect between peak theoretical throughput and attainable throughput.
Another point of friction is how much effort framework developers put into squeezing all the possible performance out of the hardware.
It seems that right now, Nvidia is on top because it is ahead in all three categories. Thanks again.
 
Nov 13, 2024
I would love to see an updated version of this chart running newer models, like SDXL, Flux, SD 3.5, and more. Local video generators and local language models would also be really interesting to see.