Question Desktop Build Works for 8 Months, then Graphics Crash under Heavy Loads

Sep 29, 2023
I built a desktop for myself for the first time in December 2022/January 2023. Parts I used were the following:
  • CPU: AMD Ryzen 7 5800X3D 3.4 GHz 8-Core Processor
  • Mobo: Gigabyte B550 AORUS ELITE AX V2 ATX AM4 Rev. 1.1
  • RAM: Corsair Vengeance LPX 64 GB (2 x 32 GB) DDR4-3600 CL18
  • GPU: MSI GAMING Z TRIO GeForce RTX 3080 10GB LHR 10 GB
  • PSU: Corsair RM850x (2021) 850 W 80+ Gold Certified Fully Modular ATX
  • OS: Linux Mint 21.1/Windows 11 Dual Boot, Linux primary
I use this for a mixture of gaming and AI/ML tasks which range from computationally intensive but just for fun (e.g. Stable Diffusion) to specialized uses of ML and CUDA acceleration for the research field I’m in. All this to say I definitely push the GPU to its limits and its functioning is very important to my getting my money’s worth out of the PC.

This worked like a charm for all my purposes until late August, when the graphics card began to fail under or immediately after heavy loads. I first noticed this using Stable Diffusion, but after the first occurrence it started happening with gaming on anything remotely graphics-heavy (it even happened running the Dolphin Emulator and Honkai) and even with my specialized ML tasks. Once this happens I need to reset the computer by holding down the power button. Importantly, the computer will power on and even display; these issues only occur after the PC has been under heavy load for some time. Also, if I’m playing a game or listening to music when it happens, the sound continues as normal, which tells me the problem probably has to do primarily with the graphics card.

On at least one occasion when this happened the case and GPU fans stopped spinning but I couldn’t tell if the CPU cooler was still working. The MSI GPU monitors also occasionally showed it jumping to max performance even without any loads, but the Windows and open-source monitors didn’t and I read this may be normal.

I updated all drivers and the BIOS and the problem still occurred on both OSes, so I think it is probably not a software or driver problem. I also physically removed the graphics card and put it back and in the process did not observe any damage to the PCIe slot, although it could be there without being visible to the untrained eye.

After some more searching online I learned this may be a power supply problem and replaced all the daisy-chained PCIe power cables with new, non-daisy-chained ones from Corsair. This seemed to make the problem less frequent but did not stop it completely. I then tested forcing a GPU power consumption limit on Windows with MSI Afterburner and found the crashes wouldn’t occur with the limit enforced, at least until this week.
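
For anyone wanting to repeat the power-limit test on the Linux side, here is a minimal sketch of a logger that polls `nvidia-smi` and appends power draw, temperature, SM clock and utilization to a CSV file, so the last readings before a crash survive the hard reset. It assumes the NVIDIA driver's `nvidia-smi` tool is on the PATH and only reads the first GPU; the function names, the log path and the one-second interval are made up for illustration. As far as I know the Linux counterpart of the Afterburner power limit is `sudo nvidia-smi -pl <watts>`, with the wattage chosen by you.

```python
import csv
import os
import subprocess
import time

# Properties requested from `nvidia-smi --query-gpu`.
FIELDS = ["timestamp", "power.draw", "temperature.gpu", "clocks.sm", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict.

    Numeric fields become floats; values nvidia-smi cannot report
    (for example "[N/A]") become None.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
    sample = {"timestamp": values[0]}
    for name, raw in zip(FIELDS[1:], values[1:]):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None
    return sample


def read_sample():
    """Query nvidia-smi once and return the reading for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


def log_samples(path, interval_s=1.0, count=None):
    """Append readings to `path` until `count` samples are written (forever if None)."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        written = 0
        while count is None or written < count:
            writer.writerow(read_sample())
            # Force the row to disk so it is still there after a hard reset.
            f.flush()
            os.fsync(f.fileno())
            written += 1
            time.sleep(interval_s)
```

Starting `log_samples("gpu_log.csv")` in a terminal before a heavy load and reading the last few rows after the reset should show whether the crash lines up with a power or temperature spike, which might help when arguing for a PSU versus a GPU RMA.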

Beginning this week the GPU has been crashing under load again and if anything it’s gotten worse, e.g. booting into Windows after the crash sometimes just shows a distorted green bar in the lower left of my monitor. Limiting power draw through software doesn’t seem to help anymore.

Because the PC is still new I want to start invoking warranties/RMAs but am not sure what is most likely to be the problem. I’m most confused by this because the PC was working excellently for 8 whole months and there’s no visible damage. Should I begin by asking Corsair for a replacement PSU or even upgrade to one with more wattage? Or is the problem more likely to be the GPU? If anyone has experience with this, I would greatly appreciate the help.

(also, I was unsure whether to post this to the graphics card or PSU subforum — I focused on the most immediate manifestation of the problem, but would be happy to move or repost)
 
I'd look towards a GPU replacement since, by the looks of it, you've burnt out your GPU by using it on heavy computational workloads. That, and you have the LHR version of the GPU, which will fare even worse on heavy computational workloads, including cryptocurrency mining.

RTX 20-/30-/40-series cards, especially the LHR versions, are designed for gaming use. Sure, you can use the GPU for number crunching too, but it wouldn't do it very well. Also, it would burn out the GPU fast, in a year or so.
Now, if you had an RTX Quadro or Radeon Pro, which are designed for number crunching and the like, the story would be completely different.

You could try to RMA your GPU under warranty, but I'm not sure the warranty would be upheld, since it's quite easy to find out that the GPU wasn't used within acceptable parameters and was instead abused.