Question: Is my RTX 4090 dead?

Oct 20, 2022
34
3
35
Hello,

I think I am having issues with my GPU (Asus ROG Strix RTX 4090 OC edition) crashing. What usually happens is that after playing a game for about 3-5 minutes my display stops responding (HDMI no signal), my GPU fans go to 100% and I’m unable to do anything except reboot. Sometimes it won’t happen until after I game for hours and then it will occur within 3-5 minutes of doing nothing. It seems to be an issue with my gfx card entering/exiting a certain mode? Strangely, twice now this issue has randomly gone away for over a month and then returned.

Event Viewer usually shows the following two nvlddmkm errors together:

“The description for Event ID 14 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.”
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\Video3
0000(0000) 00000000 00000000
Element not found

A separate, identical error message almost always occurs simultaneously for Event ID 0.

Note: I have also seen this error with the following information included in the event:

\Device\000000ff
Variable String too Large

\Device\000000ff
Error occurred on GPUID: 100

\Device\Video3
CMDre 00000002 00003ffc ffffffff 00000007 00ffffff

\Device\00000107
Row Remapper: New row marked for remapping, reset gpu to activate.

\Device\Video3
UCodeReset TDR occurred on GPUID:100
Element not found
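For anyone sorting through a pile of these entries, here is a rough Python sketch that tallies which of the payload strings quoted above show up in text exported from Event Viewer. The category labels and the `tally_payloads` helper are made up for illustration; only the matched substrings come from the entries above, and a tally like this says nothing about the root cause by itself.

```python
from collections import Counter

# Substrings copied from the event payloads quoted above. The labels on the
# left are hypothetical names chosen for this sketch, not NVIDIA terminology.
PATTERNS = {
    "tdr_reset": "UCodeReset TDR occurred",
    "row_remap": "Row Remapper: New row marked for remapping",
    "gpu_error": "Error occurred on GPUID",
    "string_too_large": "Variable String too Large",
}


def tally_payloads(log_text: str) -> Counter:
    """Count how many lines of log_text contain each known payload substring."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Sample text assembled from the payloads listed in the post.
sample = r"""\Device\Video3
UCodeReset TDR occurred on GPUID:100
\Device\00000107
Row Remapper: New row marked for remapping, reset gpu to activate.
\Device\000000ff
Error occurred on GPUID: 100
"""

print(tally_payloads(sample))
```

If one category dominates across many crashes, that is at least a more specific search term than the generic nvlddmkm message.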

I have also seen this error in EventViewer:

Display driver nvlddmkm stopped responding and has successfully recovered.
*Note: For a while when I was getting this error code my game would just freeze, but my GPU fans didn’t ramp to 100% and I didn’t have to reboot. The video driver did seem to recover successfully, but lately I get this error even though it doesn’t actually recover.

A couple of times a .dmp file has been created. Someone on another forum analyzed it and said: “It's the GPU driver, but the reason for the crash sounds more software related. The crash error is ‘An attempt was made to release a semaphore such that its maximum count would have been exceeded.’”
After researching these error codes, it appears that nvlddmkm errors are very generic and could indicate a variety of different problems.
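Since these errors are so generic, it can help to pull all of them out of the System log at once and compare them side by side. Below is a minimal Python sketch that builds a `wevtutil` query for the nvlddmkm provider. The function names and the default count of 20 are assumptions for illustration, and `fetch_events` only works on Windows.

```python
import subprocess


def build_query_command(provider: str = "nvlddmkm", count: int = 20) -> list[str]:
    """Build a wevtutil command listing the newest System-log events from one provider."""
    xpath = f"*[System[Provider[@Name='{provider}']]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",   # XPath filter on the event provider name
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # reverse direction: newest events first
        "/f:text",       # plain-text output instead of XML
    ]


def fetch_events(provider: str = "nvlddmkm", count: int = 20) -> str:
    """Run the query (Windows only) and return the raw text output."""
    result = subprocess.run(
        build_query_command(provider, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Show the command without running it, so this also works on non-Windows machines.
print(" ".join(build_query_command()))
```

Calling `fetch_events()` from an ordinary command prompt should be enough for the System log, though that has not been verified on this particular setup.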

Here are the troubleshooting steps I have already tried based on researching this issue:

• Removed the drivers with DDU in safe mode, then tried both the latest driver and an older driver that previously worked for me.
• Clean reinstall of Windows 11
• Updated GPU Bios
• Disabled XMP
• Completely Uninstalled all 3rd party lighting controllers (Asus Armoury Crate, Corsair iCUE, Razer Synapse)
• Checked PSU power consumption. I’ve never seen it hit over 700w and it’s a 1000w PSU.
• Changed power management mode to “Prefer maximum performance.”
• Changed Hardware accelerated GPU scheduling (HAGS) to off.
• Tried using a DisplayPort cable instead of HDMI and different HDMI ports.
• Scanned SSDs with Samsung magician. Both are in good health.
• Checked CPU and GPU temperature while gaming and both are normal (under 70C).
• Changed Link State Power Management to OFF
• Changed ECC state to On
• Power limited GPU to 80%
• Manually flipped the switch on the graphics card to the Quiet vBIOS profile instead of Performance
• Changed user permission of nvlddmkm.sys to full control
• Disabled "Fast startup" for Windows 11
• Disabled G-sync

While 90% of the time this crash occurs within the first 3-5 minutes of gaming, I have also had this occur a few times while away from my computer (my computer not running anything). Another couple times I was able to game for hours and within 5 minutes of stopping, the same crash occurred. A few times it occurred right after booting into Windows after previously crashing.

Interestingly, I’m able to run low GPU usage games without issue.

I have had this PC for almost 1 year and to the best of my knowledge my issues started only after updating to the latest version of ASUS Armoury Crate back in June.

I would really appreciate your thoughts or suggestions.

Thanks in advance!

Link to full PC specs:
https://pcpartpicker.com/b/DYNPxr
 
I had the same problem with my 4090: the screen would go blank and the fans would shoot up to high speed. My specs are nearly the same apart from the motherboard and GPU brand. I tried everything (DDU, lowering settings, turning off XMP and such), but after a couple of days it would do the same thing. I changed my PSU from 1000W to 1300W and have not had the problem since, at least for now. Luckily Amazon refunded me, because ASUS won't deal with my PSU since I didn't buy it directly from them!

I had a small thought that my system was too powerful for my old ASUS 1000W PSU and that the 4090 was drawing too much power, which caused the blank screen and the fans to ramp up! Every person's computer is different and it may not be the same issue, but here's a post

 
Thanks for your response. I have monitored my PSU wattage, since I have a ROG Thor 1000w Platinum with an OLED screen that displays the wattage. I've never seen it get much over 700w, so I don't think insufficient wattage is the issue. There are plenty of people who run a similar setup on 850w too. It also wouldn't explain why the problem went away for a month and is now back. I'll keep it in mind, though.
 
if you are running a high quality reliable power supply with enough wattage that doesn't cause the same or similar issue with other GPUs;
contact ASUS support and request a replacement.
Thanks, it's the ROG Thor 1000w Platinum II power supply that was recommended by Asus themselves. I've not heard of any issues. I'm definitely getting to the point where I might just request an RMA for the gfx card because I've tried just about everything.
 
the ROG Thor 1000w Platinum II
the Thor series has usually been rated pretty high quality.
power supply that was recommended by Asus themselves
of course ASUS is going to recommend their own product vs another manufacturer.
this just happened to be a fairly nice one.

if even @ 80% power the same exact issue arises then i would say it's likely an issue with the card.
and if drives or other components were causing it, it normally wouldn't cause the card's fans to go haywire and run 100% and the reports wouldn't include GPU data.
 
Thanks, it's the ROG Thor 1000w Platinum II power supply that was recommended by Asus themselves.
ASUS recommended one of their own PSUs? Imagine that! ;)
I've not heard of any issues. I'm definitely getting to the point where I might just request an RMA for the gfx card because I've tried just about everything.
Yeah, you have tried pretty much everything. A card that expensive should just work. ASUS cards cost way too much for what they offer which is why I've never owned one.
 
I'll try again at 80% power but it definitely seems like the GPU is the issue.
 
I assume you've verified that the 16-pin adapter isn't getting hot / melted? (Just dotting my I's and crossing my T's.) Beyond that, the Strix is a massive card and can use quite a lot of power. While average power use might be around 485W in the worst-case workloads (running stock), transients can definitely spike higher. I wouldn't think that a 1000W Thor would be "too little," but if you can try a 1300-1600 watt PSU, that would be something to check.

Basically, you have a lot of fans, a high-power CPU, extreme graphics, etc. If any of the rails spike too high on power for even a fraction of a second, it could trip something and cause the system to crash. Most testing equipment that doesn't cost a ton of money won't really show those spikes, so wall power of 700W probably means the system is drawing up to 630W (90% efficiency estimate). It's possible a spike could go much higher than that, on a single rail, and cause the crash. That's unlikely, but not out of the question.
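To spell out the arithmetic behind that 630W figure, here is a small Python sketch. The 90% efficiency is the same rough estimate used above, not a measured value for the Thor, and the function names are just illustrative.

```python
def dc_load_from_wall(wall_watts: float, efficiency: float = 0.90) -> float:
    """Estimate power delivered to the components from power measured at the wall."""
    return wall_watts * efficiency


def average_headroom(psu_rating_watts: float, wall_watts: float,
                     efficiency: float = 0.90) -> float:
    """Rated PSU output minus the estimated average component load."""
    return psu_rating_watts - dc_load_from_wall(wall_watts, efficiency)


print(round(dc_load_from_wall(700)))       # 630
print(round(average_headroom(1000, 700)))  # 370
```

Keep in mind this only describes the average load. It says nothing about millisecond-scale transients, which are exactly what inexpensive meters tend to miss.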

Cooling of the CPU is another possibility. I don't know if you've checked that, but the 13900K can run quite hot, and it's possible to have mobo auto-overclocking (or at least, aggressive tuning) push things too far. Case in point: my test PC for GPUs. Something is wacky (probably the MSI mobo BIOS), but certain games do a shader compilation that really stresses the CPU and can cause a game or system crash. I'm aware of at least several: The Last of Us Part 1, Metro Exodus Enhanced, Hogwarts Legacy, and Horizon Zero Dawn. If I set CPU affinity to just the P-cores for the shader compilation, CPU temps stay under 90C and things are fine. Once the initial shader compile is done, I can set affinity back to all CPU cores. (I have made batch files to do exactly that, because many of those are in my test suite.) Anyway, check the CPU temps and make sure they're not hitting the high 90s or more. HWiNFO64 with 0.5-second polling and data logging would help determine if anything is getting too hot or using too much power.
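Once there is a log to look at, something like the following Python sketch could flag the samples worth a closer look. It assumes the log is a plain CSV with a header row. The column names used here ("CPU Package [C]", "GPU Power [W]") are placeholders, so substitute whatever headers your own log actually contains.

```python
import csv
import io


def rows_over_threshold(csv_text: str, column: str, limit: float) -> list[tuple[int, float]]:
    """Return (row_index, value) pairs where the given column exceeds the limit."""
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        raw = (row.get(column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # skip blank or non-numeric cells
        if value > limit:
            hits.append((i, value))
    return hits


# Made-up sample data at 0.5-second intervals; the column names are hypothetical.
sample = (
    "Time,CPU Package [C],GPU Power [W]\n"
    "12:00:00.0,71,310\n"
    "12:00:00.5,96,455\n"
    "12:00:01.0,,450\n"
    "12:00:01.5,88,460\n"
)

print(rows_over_threshold(sample, "CPU Package [C]", 95))
```

The rows logged just before a crash are the interesting ones, so comparing the last few samples of several crash logs is probably more telling than any single threshold.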

One final possibility before we get to "bad hardware" is some BIOS setting. You said you turned off XMP, but have you done a full BIOS reset? (This can be required even after flashing the BIOS, and if you didn't apply defaults, you might have some "undefined" state that's to blame.) Sometimes Asus can do tuning outside of XMP that might push some parameter too high / too far. This isn't super likely, but you basically just need to go into the BIOS, load the defaults, check your UEFI setting (so Windows can boot), and then see if anything changes.

Beyond the above, the only other likely explanation is faulty hardware somewhere in the system. The GPU is the most likely culprit, but RAM, CPU cooling, and motherboard aren't out of the question. The only way to narrow it down is to start swapping hardware. If you have a good friend nearby who also has a high-end PC, you can start by trying your GPU in their PC for a day or so and see if the problem crops up. If it does, the good and bad news is that you've found your culprit and you should RMA the GPU.

If that doesn't work, narrowing things down becomes more difficult, depending on whom you know and what you have access to. You'd basically need to swap out RAM, mobo, PSU... case and storage are highly unlikely to be the problem, so just those four core components are the ones you'd need to check. The difficulty

TL;DR: Most likely cause, in order of probability:
1) GPU
2) PSU
3) RAM
4) Mobo
5) CPU

If you lived in NoCo (Northern Colorado), I'd tell you to swing by. But I suspect you're nowhere near where I live. Good luck!
 
Thanks so much for your response. I really appreciate it.

1. I actually went with a 1000w PSU after consulting with you on whether it would be enough for this specific build. You can see our discussion here in the comment section (I'm Mr.Stranger on Youtube): https://youtu.be/O50Rm-QCZf8?si=mLxgnwgSwTn0fy0g

I have tried power limiting my GPU to 80% (setting power target to 80% with GPU Tweak III) and I have still had this issue. It does seem to be an issue only with games that fully utilize the GPU, but I never had this issue for the first 6 months I had this PC so that to me seems like it would indicate that it's not a problem with not having enough power?

Also, about 1 out of 5 times I reboot my computer I'm able to game for hours with no problem, and like I mentioned sometimes this issue won't happen until 3-5 minutes after I'm done gaming. This seems like some kind of issue with my gfx card entering/exiting a certain mode? Could it be software related as the person who analyzed my .dmp files suggested?

2. I've monitored my CPU temps and they are normal (usually around 70-75c when gaming).

3. I haven't done a full BIOS reset yet, but I will look into that.

Unfortunately, I live in rural Oregon and don't have the ability to test different hardware configurations. I'll probably need to contact Asus to see if I can get an RMA going for the graphics card.
 

Order 66

Grand Moff
Apr 13, 2023
2,164
909
2,570
I would start the RMA process immediately because you don't want to be in a situation where your warranty runs out.
 
It's a 3 year warranty from Asus, but I probably will look into it soon because who knows how long it will take for them to ship me a replacement.
 

osv

Distinguished
Jan 4, 2010
2
2
18,515
What usually happens is that after playing a game for about 3-5 minutes my display stops responding (HDMI no signal), my GPU fans go to 100% and I’m unable to do anything except reboot.
those symptoms are a well-documented problem, with the sense wires on your old cablemod 12vhpwr power cable, and indeed i think that it's also an issue with some factory cables as well... you could confirm it by swapping out the cm cable for the factory cable.

cm has support on reddit, where they will apparently replace cables for free? just pay shipping? they have just released new power cables that don't use the same sense wire setup, for example: https://www.reddit.com/r/cablemod/comments/150m4v5/is_the_new_revision_of_the_cablemod_12vhpwr_out/

i don't have personal experience with this, but i did a bunch of research on 4090 issues, because i just ordered one.

keep us posted how it goes.
 
Thanks for bringing this to my attention. I had no idea about the issue with the cablemod 12vhpwr power cables. I'm going to contact them and see if they can ship me the updated version.
 
You were right, it was the cablemod cable! I switched to the cable that came with my PSU and this problem has completely gone away. I truly can't thank you enough for helping me figure this out. I had spent countless hours over the last few months trying to get to the bottom of this and was just about to give up and RMA.

I've reached out to cablemod and they have agreed to send me the updated version of their 12vhpwr power cable free of charge.

Thanks again!
 