Question Graphics card crashing

Jan 4, 2025
Hi all,

I’m experiencing a frustrating issue with my graphics card and could use some advice. Despite trying several troubleshooting steps, I’m running out of ideas. Here’s a detailed rundown of the situation:

System Specifications:
I’m using a pre-built HP Omen PC with the following specs:
  • CPU: AMD Ryzen 7 5800X
  • RAM: 32 GB HyperX XMP DDR4-3200MHz
  • GPU: NVIDIA GeForce RTX 3070 LHR (8 GB)
  • PSU: Cooler Master 600W (original), replaced with Be Quiet! Pure Power 12 M 750W
  • OS: Dual-boot Windows 11 and Ubuntu 24.04
More detailed specs can be found here: https://support.hp.com/hk-en/document/ish_5155531-5155693-16

The Problem:
Whenever I put a heavy load on the GPU, such as running a demanding game (e.g., Red Dead Redemption 2) or a benchmark tool (FurMark), my monitors go blank and display "No video input." The PC itself continues to run, but the GPU seems to crash.
Interestingly, this happens immediately upon starting the benchmark or loading into the game—there’s no gradual overheating or delay.

What I’ve Tried So Far:
  1. Driver Updates:
    • Updated the NVIDIA driver to version 560 on both Windows and Linux.
    • On Ubuntu, the card became even less stable with driver version 560, crashing spontaneously even when idle. Downgrading to version 555 stabilized normal operations, but high GPU loads (e.g., ray tracing simulations) still cause crashes.
  2. Testing on a Different PC:
    • I installed the GPU on my brother’s PC, where it worked flawlessly under the same driver version. This suggests the GPU itself is not faulty.
  3. Power Supply Replacement:
    • Replaced the original 600W PSU with a 750W Be Quiet! Pure Power 12 M PSU. Unfortunately, this didn’t resolve the issue—the GPU still crashes during benchmarks or games.
The only hardware I haven’t tested is the motherboard. However, replacing it would be complicated, and if it comes to that, I might consider upgrading the entire PC instead.

Does anyone have any other ideas or suggestions I could try before resorting to a motherboard replacement? I’d greatly appreciate any insights or advice!

Thanks in advance!
 
Idle:
CPU: 30-40°C
GPU: ~30°C

Under load:
CPU: ~80°C
GPU: hard to test at the moment, since I can't really stress it without it crashing immediately


I have not touched any overclocking settings, so everything is still running at factory defaults, which I'm guessing means no overclock at all.
 
When I stress the CPU, the fans are definitely spinning as expected.

However, the GPU fans remain at idle speed (33%). I haven’t been able to increase the fan speed or GPU temperature because the card crashes immediately under load. When I tested the GPU in a different PC, it worked perfectly, though I didn’t specifically monitor the fan behavior during that test.

The PSU in my system is brand new, so it’s unlikely to be the problem. Additionally, the GPU functions normally in another PC, which seems to rule out heat or power-draw issues.
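
One thing I plan to try next is logging the GPU temperature and fan speed once a second in the background, so the last readings before a crash survive in a file. This is only a rough sketch and assumes the nvidia-ml-py (pynvml) Python bindings are installed:

# gpu_log.py - rough sketch: log temperature and fan speed once a second
# so the last readings before a crash are preserved on disk.
# Assumes the nvidia-ml-py package is installed (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (and only) GPU

with open("gpu_log.csv", "a", buffering=1) as f:  # line-buffered so rows hit disk immediately
    f.write("timestamp,temp_c,fan_pct\n")
    while True:
        try:
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            fan = pynvml.nvmlDeviceGetFanSpeed(handle)
            f.write(f"{time.time():.1f},{temp},{fan}\n")
        except pynvml.NVMLError as err:
            # The driver stops responding once the card drops off the bus.
            f.write(f"{time.time():.1f},GPU lost: {err}\n")
            break
        time.sleep(1)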
 
Reinstall the AMD chipset drivers from AMD directly. Get them here:

https://www.amd.com/en/support/downloads/drivers.html/chipsets/am4/x570.html

All the AM4 chipset drivers are the same, by the way.

Since you seem to have issues with both Windows and Linux, this could very well be a hardware issue. Given that you already upgraded your PSU, the GPU or the motherboard seems the likely culprit.

Is this system still under warranty?
 
I would not rule out your PSU just because it is new; even if the GPU works fine in another system, you might simply have a faulty unit. Usually, when a GPU crashes immediately on starting a demanding game, it means it was not getting enough power. Try another PSU if you can.
 
Sounds like PSU.

It does not matter if it's new; it might not be able to handle the power spikes from the 3070. As a side note, nowadays buying anything less than a 1000 W PSU is simply asking for trouble the moment you start putting in better GPUs, at least in my opinion.

Yes, technically even 650 W should be enough for that card, but modern GPUs can boost wildly and momentarily draw far more than their TDP or the official recommendation implies, and the scenario you described might be exactly that case.

What PSU does the other system have?
 
Yes, I agree—it does sound like a PSU issue.

However, I’m still puzzled by what’s happening here for a couple of reasons:
  1. I’ve been using this PC with the same graphics card and 600W PSU for over two years without any issues, and this problem appeared out of nowhere. Either some hardware component has failed, or the newer drivers are causing the card to draw higher power spikes than before.
  2. I tested the card in another PC equipped with a Corsair 650W PSU (unfortunately, I don’t recall the exact model), and it worked perfectly fine in that setup.
Given this, I’m trying to understand how the PSU could still be the culprit.
 
It turns out this is my area of expertise, as I am a power electronics engineer who works on EVs and high-power chargers. I agree with the initial diagnosis from the folks above that it sounds like the PSU, but here are some more ideas and notes. It really does sound like a hardware problem, but good troubleshooting should help find the root cause. You did a great job swapping the card into another PC to see whether it performed right in a different configuration. Sounds like the card itself is healthy. Maybe. Read on.

In line with the PSU being the problem, I would broaden my field of view and look for anything that is allowing the voltage to drop when you begin the step load / stress test. If the 12V rail(s) going to the card droop too far, the downstream regulators on the video card will have no choice but to trip off. They have to do this to protect the paralleled connector pins these cards use.


1. Video cards come with additional power supply connectors that have been troublesome for many users. If your card works fine on your friend's PC and not on yours, check the additional 8-pin power cables on your PSU and make sure they are visibly healthy using a bright light and a magnifier lens. Look for bent or burned pins or damaged wires going into the connector housing. Also check for debris lodged inside the pins and sockets on both the PSU cable and the video card. I would also check for rubbed-off plating on the pins and sockets.

2. If you are using the 8-pin adapters for the card, consider that they may be the culprit. Cheaper off-brand adapters can skimp on wire and get the crimps wrong. Smaller-gauge wire and poor crimping can cause tremendous voltage drops as the current goes up with a load step.

3. When you lose video, does the PC seem to power down properly after you tap the power button? Or does Windows want to check the drive for errors when you start it again? The idea is to see whether the CPU is still awake and happy after the video loss. If the system and CPU are not crashing, that makes me think the PSU is not dropping all of its rails, but it might still be dropping the 12V peripheral rails going through the 8-pin card connectors.

4. I saw COLgeek chime in that PSUs degrade over time. This is absolutely true and one caution for reusing an old PSU. The main problem is the bulk electrolytic caps drying out and leaking. You can have other component issues (e.g., transformer insulation degradation in high-power transient environments), but the cap dry-out problem tends to overshadow most others. If reusing an old PSU, try to use it at a much lower power level than its rated power.

5. You said that this setup was working fine and then the problem started all of a sudden. This makes me think of 'system drift' in the engineering world, which can easily be overlooked. Drift can be caused by things like high operational temperatures, strong temperature swings, high humidity, and connector fretting failure. The first one, temperature, is well known. The effect of temperature swings is less well known and can trigger issues that depend on the design and materials; for PCBs, temperature swings can cause cracks in the board and solder joints. I don't think this one is your problem because your card works in your friend's PC. Humidity causes many problems, mostly due to corrosion, and it could be your problem. As the connector and plug metal-to-metal contact gets exposed to moisture, the plating corrodes, which creates high resistance. Simply reseating the connectors over and over will scrape off the oxidation and give a usable system. Fretting failure is when the connector pin and socket do not seat well and the two metal parts rub the plating off, exposing the underlying copper, which corrodes when it meets air and moisture. Once the plating is off, the connector will not carry rated current anymore - ever. So to help prevent this, make sure high-current connectors are seated very well. BTW: this is why connectors in cars are difficult to plug together - the mating force is intentionally high to stop fretting in a high-vibration environment.

Last note: I find as an engineer that many failures are caused by ESD, so I advise everyone to be ESD safe. Use a wrist strap, and don't place a PCB / card onto an office chair, couch, or blanket, because these are inherently ESD generators. Do use an ESD bag when putting PCBs down, even if just temporarily. The last thing you want is to kill your video card while troubleshooting the problem. Tom's could literally make a large article on just proper ESD care and explain why; there is so much FUD over this one topic. I freak out when I look on eBay and see people selling used motherboards with the picture showing the board on an office chair or a bed blanket. D'oh!

Hope the ideas help.
 
I'll just add to the above post about PSU degradation: depending on the load and the PSU quality, a unit can degrade anywhere between 1% and 5% per year.

So, after two years of service a 600 W PSU is not really a 600 W PSU anymore, and it's not even clear how long you had it before that. For all you know it is effectively a 500 W PSU by now, and/or some specific components in it have degraded to the point where they can no longer handle the load. That may be why the issues originally appeared with the old PSU.

Budget PSUs degrade faster, as they use cheaper components.

Then, as the post above says, it could be the cables too. You didn't by any chance reuse cables from the previous PSU, or feed both 8-pin plugs from a single daisy-chained cable? Yes, I know it's a silly question, but we have already had someone do this in the past.
 
Thanks, @chaz_music, for the detailed answer—I really appreciate your input!

Regarding the cables: My RTX 3070 uses an unusual 12-pin connector (not a 12VHPWR but a standard 12-pin). The PC came with an adapter that converts two 8-pin connectors to this 12-pin, and I’ve continued using it with the new 750W PSU as there’s no alternative available. However, I also had to use this same adapter in the other PC where the card worked without any issues, so I am not sure the adapter is at fault.

I investigated the system’s behavior during the GPU crash by setting up an OpenSSH server on Ubuntu and logging in remotely from my laptop. It turns out the CPU continues running normally during the crash—I was able to access the system and perform other tasks without issues. However, I couldn’t establish a connection to the GPU, and monitoring tools failed to detect the device.
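
For reference, a minimal check like the sketch below is enough to tell, from the remote session, whether the card is still enumerated in sysfs. The PCI address 0000:06:00.0 is taken from the log further down; adjust it for other systems.

# bus_check.py - minimal sketch: is the GPU still enumerated on the PCIe bus?
# The address 0000:06:00.0 is taken from the syslog below; adjust as needed.
import os

GPU_PCI_ADDR = "0000:06:00.0"
dev_path = f"/sys/bus/pci/devices/{GPU_PCI_ADDR}"

if os.path.isdir(dev_path):
    print(f"{GPU_PCI_ADDR} is still visible on the bus")
else:
    print(f"{GPU_PCI_ADDR} is gone (matches the 'Card not present' kernel message)")
    # A rescan can be attempted as root, with no guarantee the card comes back:
    # with open("/sys/bus/pci/rescan", "w") as rescan:
    #     rescan.write("1")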

I also checked the syslog during the crash, and it seems like the GPU simply shuts down without triggering an explicit driver failure. Here’s the relevant part of the log from when the GPU crashes—maybe someone here can spot something useful:
2025-01-05T23:14:35.115560+01:00 tobias-OMEN-25L-Desktop kernel: pcieport 0000:00:03.1: pciehp: Slot(0): Link Down
2025-01-05T23:14:35.115571+01:00 tobias-OMEN-25L-Desktop kernel: pcieport 0000:00:03.1: pciehp: Slot(0): Card not present
2025-01-05T23:14:35.115572+01:00 tobias-OMEN-25L-Desktop kernel: snd_hda_intel 0000:06:00.1: Unable to change power state from D3hot to D0, device inaccessible
2025-01-05T23:14:35.115577+01:00 tobias-OMEN-25L-Desktop kernel: NVRM: GPU at PCI:0000:06:00: GPU-2456a89e-206c-e729-a405-3dffb4739aee
2025-01-05T23:14:35.115577+01:00 tobias-OMEN-25L-Desktop kernel: NVRM: Xid (PCI:0000:06:00): 45, pid='<unknown>', name=<unknown>, Ch 00000000
2025-01-05T23:14:35.177548+01:00 tobias-OMEN-25L-Desktop kernel: snd_hda_intel 0000:06:00.1: Unable to change power state from D3cold to D0, device inaccessible
2025-01-05T23:14:35.195481+01:00 tobias-OMEN-25L-Desktop snapd[1275]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data
2025-01-05T23:14:35.328563+01:00 tobias-OMEN-25L-Desktop kernel: NVRM: Xid (PCI:0000:06:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
2025-01-05T23:14:35.328568+01:00 tobias-OMEN-25L-Desktop kernel: NVRM: GPU 0000:06:00.0: GPU has fallen off the bus.

Unfortunately, I don’t believe there’s a way to monitor the GPU’s supply voltage or its fast power transients on a Linux system, which would have been incredibly helpful for diagnosing the issue. If anyone knows of a method or tool that can achieve this, please let me know!
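
The closest thing I am aware of is the board power reported through NVML, but as far as I understand it is an averaged reading that won't catch millisecond spikes. Still, a rough trend might be better than nothing; a sketch, again assuming the nvidia-ml-py (pynvml) bindings are installed:

# power_log.py - sketch: poll the board power NVML reports, roughly 10 times a second.
# Note: this is an averaged reading, so fast transient spikes (the kind only a scope
# would catch) will not show up; it only gives a coarse trend before a crash.
# Assumes the nvidia-ml-py (pynvml) package is installed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
        print(f"{time.time():.3f}  {mw / 1000.0:6.1f} W", flush=True)
        time.sleep(0.1)  # redirect stdout to a file when running over SSH
except pynvml.NVMLError as err:
    print(f"GPU no longer responding: {err}")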
 
Talking to the system using SSH was a good idea. You could also play some music and hear if the music stops during the events. You can use the commands from the keyboard to pause and jump to the next song.

From your log data, it sounds like the system or driver thinks the video card has gone into a low-power state and is trying to wake it up. That is my interpretation anyway. The last log entry sounds like the system decided the video card is off the bus (no longer visible to the system).

I am not sure what measurement tools you have (scope, meter with fast min/max capture, etc.), but in my home lab my next step would be to see if the video card's 12V is dropping way down. Some nicer meters can capture a voltage drop if it lasts longer than some minimum time (1 ms, for instance, with some Fluke meters), and you might see how low the 12V rail is going. A scope with a negative-edge trigger would be awesome for this: as the rail drops, grab a capture and see what is up. Setting up the trigger is always fun, though.

Note: make sure that the 12-pin adapter is connected to 12V 8-pin outputs that are truly on the exact same rail. If they come from different internal PSU switching outputs, one will almost always sit at a slightly higher voltage and will try to support the paralleled 12V all by itself (hogging). One way to check this is to read the 12V on each connector with no load on it (not plugged into the card). If they read within +/- 10 mV of each other, you are probably OK, since the cable voltage drop will act as droop resistance and help with sharing. If the two outputs are more than 25-30 mV apart with no load, I would be very suspicious.

Another test idea is to use a software tool that can load your GPU in adjustable increments (see the rough sketch below). You might also see what power is going into your system with a plug-in power meter (Rosewill used to make one; there is also the Kill-A-Watt). Those are very slow, though, and might not capture a fast failure. You want to slowly increase the stress level until it trips. If you can find a load level at which it does not trip, you can check whether anything in the cable path is getting warm or hot.
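
If you cannot find a ready-made tool that ramps the load, a rough sketch like the one below would do the job. This is purely illustrative and assumes a CUDA-enabled PyTorch install on the Ubuntu side; any GPU compute library could be used the same way.

# ramp_load.py - rough sketch: step the GPU load up gradually instead of hitting it
# with a full FurMark-style load at once, to find the level where it trips.
# Purely illustrative; assumes PyTorch with CUDA support is installed.
import time
import torch

assert torch.cuda.is_available(), "No CUDA device visible"

N = 4096
a = torch.randn(N, N, device="cuda")
b = torch.randn(N, N, device="cuda")

# Increase the duty cycle in 10% steps, holding each level for about 30 seconds.
for duty in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    print(f"duty cycle {duty:.0%}")
    t_end = time.time() + 30
    while time.time() < t_end:
        t0 = time.time()
        _ = a @ b                  # busy period: large matrix multiply on the GPU
        torch.cuda.synchronize()   # wait until the GPU has actually finished
        busy = time.time() - t0
        time.sleep(busy * (1 - duty) / duty)  # idle period sets the average load

Run a temperature or power log alongside it so you can see roughly what load level trips the card.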

Do you have an IR gun temp probe or a true IR camera? A poor connection will exhibit a higher temperature due to the voltage drop across it. The IR probe should be pointed at the contact pins. An IR camera is ideal, but those are expensive unless you can find one of the models that connect to smartphones.

Also, I did not see if you commented on reseating the power connectors several times to cause oxidation to scrape off. I would reseat the 12 pin and 8 pin connectors, as well as any connectors for those rails on the PSU.

Or you could simply find a higher-wattage PSU (maybe one with a native 12-pin power cable?). Can you duplicate the PSU configuration your friend is using that seems to work? Note that a larger PSU will usually have poorer efficiency at idle, especially if the system idles at less than 10% of the PSU's rating, but you would be giving yourself some power overhead margin.

- Charles