Question: Black screen under load on a previously working system?

Aug 27, 2023
Hi!

First time posting here! Nice to talk to you guys, I hope I’m not breaking any rules with this post!

I’ve been experiencing what looks like GPU crashes for a few weeks now. I’ve been building my own computers for years, and this is the first time I’ve seen something like this happen. I’d like to get your opinion to see if maybe I missed something.


The problem:

After some time under load, all my displays turn black and the GPU fans start spinning at full speed (without a BSOD). It started when I was playing Diablo 4 and now happens after a few minutes in other games as well.

When it happens, the computer is able to shut down normally with a single press of the power button. It does not shut down by itself.
The only relevant error I found in the Windows Reliability History is a “Hardware error”, with nothing more of interest in the Event Viewer.
Everything looks normal after a reboot, until I try gaming and the computer crashes again.

I’m able to reproduce the crash by running FurMark for a few minutes, and it seems to happen when the GPU stays at its thermal limit for a while. I recorded CSV logs of two crashes today using HWiNFO and will add them to this post. (If you think I missed important sensors, I can add them to the logging!)

This system worked perfectly for months before the issue appeared. I’ve had the 3080 for a bit longer than a year now. I bought it when GPUs were super expensive, and it should still be under warranty.

HWiNFO sensor data of a crash

I sent the GPU back to the Amazon third-party vendor who sold it to me after explaining my problem. They kept the card for a month and all they told me was “we tried it on a bench, it works fine”. While they had the card, I used the computer with the CPU’s integrated graphics and saw no issues or relevant errors in the Event Viewer.


My specs:
  • Ryzen 5 7600X cooled by an NH-D15
  • Asus B650E motherboard
  • 32 GB of DDR5 at 6000 MT/s
  • An LHR Gainward RTX 3080 “Phoenix”
  • An 850 W Corsair RMx power supply

What I tried:
  • Updating Windows
  • Reinstalling my graphics driver after removing the old one with DDU
  • Updating my BIOS
  • Maxing my case fan speeds and leaving the side panels open
  • Switching the GPU to another PCIe slot
  • Verifying/repairing my Windows install with SFC and checking my drive with CHKDSK
  • Running MemTest on my RAM overnight to check for errors

Additional info:
  • I'm running an up-to-date version of Windows 11.
  • The system is not overclocked (the RAM is running at its EXPO specs, if that counts?)
  • I’m not running any third-party programs like MSI Afterburner
  • I have not seen any graphical artifacts, as is sometimes the case with dying GPUs
  • In the logs, after crashing, the GPU temp sensors all jump to 0°C…
  • Again, the system worked perfectly for months before the problem started occurring seemingly out of nowhere
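For anyone who wants to dig through the CSV logs, here is a rough Python sketch that finds the last valid sample before the GPU temperature drops to 0°C, which is how the crash shows up in my logs. The column name `GPU Temperature [°C]`, the file name, and the encoding in the usage comment are assumptions; check the header row of your own log and adjust.

```python
import csv

# Assumed column name; HWiNFO labels depend on the sensors you log.
TEMP_COL = "GPU Temperature [°C]"


def last_sample_before_dropout(lines, temp_col=TEMP_COL):
    """Return the last row with a non-zero GPU temperature that is
    directly followed by a 0 reading, or None if no dropout is found."""
    previous = None
    for row in csv.DictReader(lines):
        try:
            temp = float(row[temp_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable reading
        if temp == 0 and previous is not None:
            return previous
        if temp != 0:
            previous = row
    return None


# Usage (hypothetical file name, encoding is a guess):
# with open("crash_log.csv", newline="", encoding="latin-1") as f:
#     print(last_sample_before_dropout(f))
```

The returned row holds every logged sensor at the moment just before the dropout, so it is a quick way to compare temperatures, power draw, and clocks between the two crashes.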

I’m going to test the card on a friend’s computer soon to see if I can reproduce the problem.

How can I diagnose this? Does this look like a GPU hardware failure to you? Is there anything I can do to fix it?
I’d be happy to provide more information if needed!
Thank you for your help!
 
Define "thermal limit", because no "mfg. specifications" override the universal laws of physics.
Set up a fan curve in your drivers so the fans reach 100% at 65°C, no higher. Don't worry: your GPU will surpass that at times, but not by more than 10-15°C and not for long.
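To make the fan curve idea concrete, here is a minimal Python sketch of a piecewise-linear curve that reaches 100% at 65°C. Only that last point comes from the advice above; the lower points are illustrative assumptions, and this only models the curve, it does not control any real fan.

```python
# Hypothetical curve points: (temperature in °C, fan duty in %).
# Only the final point (100% at 65°C) comes from the advice above.
CURVE = [(30, 30), (50, 60), (65, 100)]


def fan_duty(temp_c, curve=CURVE):
    """Return the fan duty (%) for a temperature by linear interpolation."""
    if temp_c <= curve[0][0]:
        return curve[0][1]  # below the first point: hold the minimum duty
    if temp_c >= curve[-1][0]:
        return curve[-1][1]  # at or above the last point: full speed
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
```

With these points the fans are already at full speed well before the card gets anywhere near an 83°C limit.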
 
By thermal limit I mean 83°C on the main temp sensor :)
This thread is a prequel to this one, from when I was still wondering what caused my computer to black screen.

I figured out the problem does come from the GPU after testing it in another computer and experiencing the same issue.

You are probably right: setting a fan curve (or even better, repasting the thing) would probably help.
What bothers me is that it did not shut down like that a year ago, when I used it in worse conditions (hotter room and slower case fans).
 