This is incorrect. The Intel Arc A770 does not do well with Stable Diffusion. So many issues with it. Mine takes 45 minutes to produce one image at high res, which apparently isn't normal, even after following all the instructions.
I haven't seen issues with up to 768x768 on Intel Arc and SD1.5/SD2.1, but I have had problems getting SDXL running. I haven't had time to revisit the subject recently, however, so I don't know if anything has changed in the past few months.
You cannot compare it/s versus images per minute, FYI, and you really shouldn't use the it/s results, as they are not comparable across different architectures and Stable Diffusion projects. It/s and img/min are not the same, because it/s ignores some of the stages of image generation.
Hello Jarred, and thank you for this work! You should definitely benchmark the AMD GPUs on a Linux distribution with the optimized PyTorch build for Linux. On my side, the RX 6950 XT FE version does around 10 it/s, whereas yours only gets 6.6 it/s... That's 50% more performance on my side, so you should try it before making updates.
By the way, I am using an openSUSE Tumbleweed distro.
Hello Jarred, and thank you for your answer. I am sorry for the confusion between im/s and it/s. Indeed, I have done my tests using the same parameters as you, and I get 24 512x512 images in 127 seconds, which means around 5.3 seconds per image (roughly 11 images per minute), and that's even lower than your results...
I do 24 images, with 50 steps per image, in whatever batch size results in maximum throughput. The setup time is not insignificant, so even if the main loop reports something like 10 it/s and thus generates an image in ~5 seconds, in reality there might be 2~3 seconds of setup and finalize time, meaning 7~8 seconds per image.
That's potentially the difference between 12 images/min and 7.5 images/min.
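To make that conversion concrete, here's a minimal sketch of the math; the 3 seconds of per-image overhead is just the illustrative figure from above, and real setup/finalize time varies by project and GPU:

```python
# Minimal sketch of why it/s and images/min diverge.
# The overhead value is illustrative; real setup/finalize time varies by project and GPU.

def images_per_minute(iters_per_sec: float, steps_per_image: int, overhead_sec: float) -> float:
    """Convert a reported sampling rate (it/s) into end-to-end images per minute."""
    sampling_time = steps_per_image / iters_per_sec  # time spent in the denoising loop
    total_time = sampling_time + overhead_sec        # plus per-image setup/finalize
    return 60.0 / total_time

print(images_per_minute(10.0, 50, 0.0))  # 12.0 images/min from the loop alone
print(images_per_minute(10.0, 50, 3.0))  # 7.5 images/min with ~3 s of overhead
```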
Now I'm not saying Linux isn't potentially faster. However, Linux and ROCm only support a smaller subset of AMD's GPUs (Navi 21, 31, 32, and 33 — no Navi 22, 23, 24 IIRC, though perhaps I'm wrong and Navi 22 is supported). And I also suspect Nvidia and Intel may be faster under Linux as well, especially if I sort through multiple SD projects to look for one that gives higher performance. But in general, I don't expect the margins to change too much.
Also, as noted in the text, the latest DirectML project used for testing AMD GPUs does not work optimally on RDNA 2. But likewise, finding the 'best' version of Nod.ai's Shark Studio app for optimal performance takes time, and there have been various issues with testing different resolutions, batch sizes, SD versions, etc.
The bottom line is that RTX 40 and 30 series parts are much faster in general for AI workloads, followed by RDNA 3. RDNA 2 tends to be so far behind the others that it's more of an academic thing. Yes, you can run SD on RDNA 2 GPUs (and even RDNA, GTX, etc. if you're willing to wait a long time), but relative performance will be quite low.
Which is better between the RTX 3060 12GB and the RTX 3060 Ti 8GB for working on AI?

That depends on what you're doing. Some AI workloads need lots of VRAM, in which case the 3060 12GB will be able to run tasks that fail on the 3060 Ti. However, the 3060 Ti has more raw compute and bandwidth, and for many inference workloads it will easily beat the 3060. If you really only care about having sufficient VRAM to run specific AI workloads, that's one of the few areas where the 4060 Ti 16GB can actually make sense (understanding that, yes, it costs a lot more than a used 3060 or 3060 Ti).
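If you want to see what a given card reports before committing to a workload, here's a quick PyTorch sketch; the 10 GB threshold is purely an illustrative assumption, not a requirement from any particular model:

```python
# Quick check of how much VRAM a card reports; the 10 GB threshold below is
# a hypothetical example requirement, not tied to any particular model.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    # A 3060 12GB clears this example threshold; a 3060 Ti 8GB would not.
    print("Should fit" if vram_gb >= 10 else "Likely to hit out-of-memory errors")
else:
    print("No CUDA device detected")
```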
The WMMA instructions improve throughput, but the theoretical compute doesn't change. So, for example, using GPU shaders, the RX 7900 XTX has 6144 shaders at 2.5 GHz, and each of those can do four FP16 FMA (fused multiply-add) instructions per clock, which is eight operations. That gives 122.9 teraflops (6144 * 8 * 2.5 = 122,880 gigaflops).
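For reference, here's that same calculation as a tiny sketch, using the figures above (8 FP16 ops per shader per clock is the assumption behind the 122.9 number):

```python
# Worked version of the theoretical FP16 shader-compute math.
# Assumes 4 FP16 FMAs per shader per clock, i.e. 8 FLOPs per shader per clock.

def theoretical_fp16_tflops(shaders: int, clock_ghz: float, ops_per_clock: int = 8) -> float:
    """Peak FP16 shader throughput in teraflops."""
    return shaders * ops_per_clock * clock_ghz / 1000.0

# RX 7900 XTX: 6144 shaders at 2.5 GHz
print(theoretical_fp16_tflops(6144, 2.5))  # 122.88 (~122.9 TFLOPS)
```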
Hello. @JarredWaltonGPU Thank you very much for taking the time to test this.
There is one thing I don't understand. For the AMD 7000-series cards, the values in the graph "Theoretical GPU Shader Compute FP16 using Shader Cores only" and the values in the graph "Theoretical GPU Shader Compute FP16 using AI accelerators (no sparsity)" are the same. For example, the RX 7600 is 43.0 in both graphs.
This seems odd. The accelerators in AMD's case are the ones that enable the WMMA instructions.
Is the first graph wrong or is the second graph wrong, or is there something I am missing?
Thanks