Question RTX 3090 black screen crash requiring a hard reboot?

glennek

Distinguished
Feb 11, 2017
4
0
18,510
PC Part list
Nvidia Geforce 3090
Palit Game Rock OC
Corsair RM1000x Shift (Brand new)
Ryzen 7 5800x
32gb Trident Z 3200 ddr4
MSI x470 Gaming Plus

Problem description
I got this card yesterday from a friend. I knew it had issues, but it had been RMA'd and the vendor claimed it was working perfectly.

When the card is under heavy load all screens turn black, fans spike up to what sounds like 120% and everything is stuck until I hard reboot. I can still hear sounds from the desktop etc. No BSOD. The crash seems inevitable when load is at 100%, and mostly happens within 1 minute of it. Under desktop use and light gaming it runs perfectly fine.

The GPU is powered by 3x 4+4pin, straight cables -- no splitting ones. All cables are brand new, so is the PSU.

GPU was RMA'd to where it was bought, but they sent it back claiming it passed the PCmark stress test or something. Can't remember the benchmark used.
Before this it was tested in another set up where it didn't even provide an image at all.

I've noticed a couple of things that look concerning (all tests done running Kombustor):
  • GPU Core temp is at 82c, but hot spot temperature is hovering at 105c (maybe spiking higher, hard to tell)
  • The Windows event log spews hundreds of nvlddmkm errors every second when the crash occurs. Lots of different ID 13 errors and even more ID 0 errors. Some of these include:
    • \Device\Video3 Graphics SM Global Exception on (GPC 4, TPC 3, SM 0): Multiple Warp Errors
    • \Device\Video3 Variable String too Large
    • \Device\Video3 Graphics Exception: ESR 0x525e14=0xffffffff 0x525e10=0xffffffff
    • \Device\Video3 Graphics Exception on GPC 0 ZROP 0: Graphics is hung, FATAL!!
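
For anyone wanting to tally errors like these rather than scroll through hundreds of them, here is a minimal Python sketch. It assumes the event messages have been exported to plain text, one per line, in the form quoted above. The function name `summarize_nvlddmkm` and the category labels are made up for illustration; they are not part of any Nvidia or Windows tool.

```python
import re
from collections import Counter

# Matches the unit locations some messages carry, e.g. "GPC 4", "TPC 3", "SM 0".
UNIT_RE = re.compile(r"\b(GPC|TPC|SM|ZROP)\s+(\d+)")

def summarize_nvlddmkm(lines):
    """Count messages per rough category and per GPC index.

    The categories are ad-hoc labels chosen for this sketch.
    """
    categories = Counter()
    gpcs = Counter()
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if "SM Global Exception" in text:
            categories["sm_exception"] += 1
        elif "Graphics is hung" in text:
            categories["graphics_hung"] += 1
        elif "ESR" in text:
            categories["esr"] += 1
        else:
            categories["other"] += 1
        # Record which GPC (if any) the message points at.
        units = dict(UNIT_RE.findall(text))
        if "GPC" in units:
            gpcs[int(units["GPC"])] += 1
    return categories, gpcs

# The four messages quoted in this post.
sample = [
    r"\Device\Video3 Graphics SM Global Exception on (GPC 4, TPC 3, SM 0): Multiple Warp Errors",
    r"\Device\Video3 Variable String too Large",
    r"\Device\Video3 Graphics Exception: ESR 0x525e14=0xffffffff 0x525e10=0xffffffff",
    r"\Device\Video3 Graphics Exception on GPC 0 ZROP 0: Graphics is hung, FATAL!!",
]
categories, gpcs = summarize_nvlddmkm(sample)
print(categories)
print(gpcs)
```

If the errors cluster on one GPC index across a full export, that might hint at a localized fault, though this sketch only counts and cannot diagnose anything by itself.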

What "works" so far
I've managed to make the card run stable at a 65% power limit with 100% fan speed through MSI Afterburner. This leaves the core hovering at 76c and the hotspot at 97c. The core then runs at about 1300 MHz. The reported maximum has been as high as 1800 MHz, but I never saw it hover there.

The GPU is then using about 250 W. It also works when I OC the core by +100 MHz and the memory by +500. I haven't bothered going higher since this seems like too much of a band-aid fix.

If I increase it to 70% power the core spikes to 82c and hotspot to 105c. It still "survives" for a minute or two, compared to 100% where it goes down within seconds. This feels like it's just above what the card can sustain, though.
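
As a rough sanity check on those numbers: assuming the power slider is a simple percentage of the card's default power limit, and that the card is actually sitting at the cap, about 250 W at 65% implies a 100% limit in the region of 385 W. A small Python sketch of that arithmetic, with a helper name invented for illustration:

```python
def implied_full_power_limit(observed_watts, slider_percent):
    """Estimate the 100% power limit from a draw observed at a given slider setting.

    Assumes the card is sitting at its power cap and that the slider
    scales the limit linearly; neither is confirmed for this card.
    """
    return observed_watts / (slider_percent / 100.0)

full = implied_full_power_limit(250, 65)
print(round(full))         # estimated 100% limit in watts
print(round(full * 0.70))  # estimated cap at the 70% setting
```

Under those assumptions the 100% limit comes out at about 385 W and the 70% setting at about 269 W, so the jump from stable to crashing would correspond to only about 20 W more.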

I feel like it's obvious it's a temperature issue, but how can it be this bad? Does the cooling paste have to be reapplied? Is there some other issue I might not be aware of? BIOS settings, some kind of firmware?

I see this has been a widespread issue in the past and I've tried several other fixes including:
  • Reseating GPU
  • Reseating and switching up power cables (also in the PSU outlet)
  • BIOS update
  • DDU fresh driver
  • Various power settings in Windows Power Plan
 

Phaaze88

Titan
Ambassador
GPU Core temp is at 82c, but hot spot temperature is hovering at 105c (maybe spiking higher, hard to tell)
A ~20C gap between core and hot spot is typically reported, though the maximum of the latter is getting very hot; I believe the hot spot limit is 110C.
Is the case airflow poor, or not set up well? 82C is borderline the default throttle limit for the gpu core, but it's not the max (that's 91C). The core still has headroom; the hot spot, not so much.
The memory junction temperature was omitted. Please include that.
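
To put numbers on that headroom, here is a small Python sketch. The limits are the figures given in this post (91C core max, and a hot spot limit believed to be 110C), not values read from the card, and the names are made up for illustration.

```python
# Limits as quoted in this thread; the hot spot figure is a belief, not a confirmed spec.
CORE_MAX_C = 91
HOTSPOT_LIMIT_C = 110

def thermal_summary(core_c, hotspot_c):
    """Return the core-to-hot-spot gap and the remaining headroom on each sensor."""
    return {
        "gap": hotspot_c - core_c,
        "core_headroom": CORE_MAX_C - core_c,
        "hotspot_headroom": HOTSPOT_LIMIT_C - hotspot_c,
    }

print(thermal_summary(82, 105))  # readings reported at 100% power
print(thermal_summary(76, 97))   # readings reported at 65% power, fans at 100%
```

With the readings from the first post, the gap is 23C at full power and 21C at 65% power, so capping power lowers both temperatures without narrowing the gap much.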

When the card is under heavy load all screens turn black, fans spike up to what sounds like 120% and everything is stuck until I hard reboot. I can still hear sounds from the desktop etc.
Gpu memory error if you can still hear sounds over the speakers during a black screen crash.

\Device\Video3 Graphics Exception: ESR 0x525e14=0xffffffff 0x525e10=0xffffffff
Memory again.


Due to the size of a 3090, I'd guess weakened solder connections. The vendor might not be finding any problems (or they're lying out their butt) if their testing was done on a test bench, in which case the card would've been installed vertically.
Try laying your case on its side and running the PC like that?
 

glennek

While typing this I just had a random ID 41 error restart my PC, so there might be more to it than just thermal issues. I'll have to keep an eye out to see if this was a one-time occurrence.

The memory junction temperature was omitted. Please include that.
Memory junction seems stable below 65c

Gpu memory error if you can still hear sounds over the speakers during a black screen crash.

Memory again.
Could this just be memory failing because something else is failing first?
I can't recall if I always hear sound, but I'm certain I do most of the time...

Try laying your case on its side and running the PC like that?
That's an interesting idea, I'll give it a shot!

Running some tests now to produce a proper log file.
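
One way to produce such a log, sketched in Python: poll `nvidia-smi` with its `--query-gpu` option and parse the CSV it prints. The field list, function names, and the sample row below are choices made for this sketch, and the sample values are made up. As far as is known, `nvidia-smi` does not report hot spot or memory junction temperature on GeForce cards, so a separate sensor tool is still needed for those two readings.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; this selection is just an example.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "clocks.sm", "fan.speed"]

def query_gpu():
    """Run nvidia-smi once and return its raw CSV output (needs the Nvidia driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def parse_rows(raw):
    """Turn nvidia-smi CSV text into a list of dicts keyed by field name."""
    rows = []
    for values in csv.reader(io.StringIO(raw), skipinitialspace=True):
        if values:
            rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows

# Made-up row in the shape nvidia-smi prints, only to demonstrate the parsing.
example = "2024/01/01 12:00:00.000, 76, 250.00, 1300, 100\n"
rows = parse_rows(example)
print(rows[0]["temperature.gpu"], rows[0]["power.draw"])
```

To log during a stress test, call `parse_rows(query_gpu())` in a loop with a short sleep and append each row to a file, so the last lines before a black screen survive the hard reboot.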
 

Phaaze88

Error 41 on Windows: "Derp! I don't know what caused me to suddenly close... err, I'm just letting you know I did!"
It's pretty useless.
Trying to search ID 13 gets: "The description for Event ID 13 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer."
Can you run DDU again, but do not install the Nvidia driver? See if the card crashes without it.


Memory junction is REALLY good... did your friend do any repaste or pad changes to the card before giving it to you? I see that once hot spot got into the 100s, the gpu started throttling. Voltage, power and core clock dropped off, but no dice; hot spot sustained high temperature with a final read of 107C.

What's the case?

Could this just be memory failing because something else is failing first?
I can't recall if I always hear sound, but I'm certain I do most of the time...
Well... you already ran the DDU. Power supply and gpu VRM can black screen, but no sound through the speakers. You tried the card in another PC as well, with no real change.
I've watched gpu repair videos that conclude the weight of the coolers is playing a part in damaging the PCBs, or breaking/loosening solder joints.
 

glennek

Memory junction is REALLY good... did your friend do any repaste or pad changes to the card before giving it to you?
Not that I know of, but there are two aftermarket heat sinks underneath the card. Wouldn't the high difference in core and hotspot temperature imply that there is something wrong with the paste or pads?

Also it looks like the GPU is slanted, but it's mounted straight.

[Attached image: unHWNIu.jpeg]


What's the case?
It's a Fractal Meshify 2. Right now I'm running it with both panels off in a room at normal room temperature. It also crashed with both panels on. I took them off to keep the PSU cables from being bent out of place by the tight fit.

Power supply and gpu VRM can black screen, but no sound through the speakers.
The PSU runs my old 1070 fine, not that it draws the same power.

Can you run DDU again, but do not install the Nvidia driver? See if the card crashes without it.
I'll try! Do you think that could affect the horrible temperatures though? Like you said, it seems like there is a throttling issue, and maybe the card goes into panic mode and shuts down when the hot spot approaches 110c?

I have no idea myself, never dealt with cards of this caliber before.
 

Phaaze88

Wouldn't the high difference in core and hotspot temperature imply that there is something wrong with the paste or pads?
It depends on all the projects your buddy did with the card. The good memory temperature, but high gap between core and hot spot, can be caused by:
- the paste used. Bare die applications are picky about paste. With cpus and their IHSs, you can use just about any paste.
- thermal pads that were too thick or too firm. If they don't cause the PCB to crack, they'll warp it, causing poor mating between the cooler and gpu die.
So, are those 2 heatsinks in the picture all they did to the card?


I'll try! Do you think that could affect the horrible temperatures though?
No, the idea of running DDU and going without the Nvidia driver was another means of checking whether the crashes were hardware or software related. Without the Nvidia driver, you can't make full use of the card's features. Take that away, and it's a plain ol' gpu, if you can call it that.