Question: How is this situation even possible? Can anyone help? GPU demon manifestation. Motherboard issue?

Jan 23, 2025
Background: I have enough hardware knowledge to be 'dangerous'. Claim to fame: I built a 200+ GPU ETH mine many years back on different M/Bs, all GPUs second-hand from eBay, and ran it at a profit for some investors. So I can do things like repasting and basic soldering, but really I'm a software geek.

Problem background: Some years ago I bought a good machine (G/Byte M/B, Core i9 and two Z790 GPUs), which was really my 'desktop for general use', but I ran it overnight to mine a little extra ETH. Later I purchased a newer machine (again G/Byte M/B, Core i9, etc). You'll see later why the details of these machines are probably NOT the problem. We can call the original machine #1 and the new machine #2. When I bought #2, I took one of the GPUs (they are identical in every way, except serial number) and put it in #2. For a while, everything was fine...

Problem manifestation: Machine #1 will sometimes do what I call 'airplane'. EVERYTHING goes black, the GPU spins up to maximum RPM (both fans), and the machine takes no input whatsoever. I just have to switch off the power / hard reset.

Problem timing: When #1 did 'airplane' the first time, it had been running for 6 months. Then it did it again a month later (ish), then again a week later, then several days later. So I'm thinking there is something wrong with the GPU. I take it out (nothing else touched), check it over (it's like perfect, not a scratch, no brown marks, nothing) and put it back, and it works fine. So I'm thinking it's a weird O/S (Windows 11) glitch, no worries. About 6 months later (not exact, maybe 4 or 5), it airplanes again. I reboot, it does it again 2 weeks later, then a few days later. This time I persist, and the intervals get shorter and shorter, until after a week or so it only runs 5 mins before it airplanes.

Attempted resolution: So I check everything: memory tests, motherboard tests, GPU tests, you name it, I test it, software and hardware, and all seems fine. So I declare the GPU 'dubious' and swap the GPUs between machines #1 and #2. So now I have the other GPU in #1. It runs perfectly for a few months, and then the machine airplanes. Same pattern: again a few weeks later, then a few days later, then a few hours later. I pop the GPU out of the slot [I'm wiser now], put it back in the same slot, fire it up, and it runs for many months again.

Further context: This has been happening for about 4.5 years now. Both machines are perfect, great machines, and both GPUs seem indistinguishable, except that in machine #1 this 'airplane' still happens, and then happens more frequently, until I pop the GPU out and put it back. I updated the firmware on the motherboard to the latest (stable) version recommended by G/Byte (about 2 years ago) and I still get the same problem. WHERE do I start looking for the issue?

Some specific statements:
- I feel it cannot be the GPUs, because switching them gives the same problem in #1 and no problem in #2
- I feel it cannot be the GPU slot on the motherboard, because having the same problem on 2 slots on one M/B seems unlikely
- It's not the memory (DDR4), because I've switched all the memory between the machines, same issue
- It's not the PSU. I've switched those too. They are different, but both are Gold-rated > 1000 W units, and at idle each machine draws < 100 W
- It *COULD* be the CPU on #1, as I don't have the means to switch them (the particular version in #1 does not work in #2), but surely this is unlikely given the situation
- It's NOT the Windows 11 installation. About a year back the SSD died on #1, and I now have a completely new SSD and a completely new Win 11 installation, and it still does exactly the same. So I'm ruling out the SSD and I'm ruling out the operating system
- It's not the M/B BIOS/firmware. I updated it 2 years ago, and the same problem persists.

- The MOST likely culprit (in my mind) is the Gigabyte M/B on #1. Now I LOVE G/Byte M/Bs. I've had issues with almost EVERY other manufacturer, and I refuse to use anything except G/Byte now. In running 30-40 machines (sometimes for YEARS with nothing more than a quick restart) I've never had a G/Byte motherboard give me an issue. Is my faith misplaced? I don't REALLY want to go and buy a replacement motherboard (£100?), so I'm just living with the issue. Popping the GPU out twice a year takes me about 2 mins now, so it's hardly a problem. HOWEVER, I would love to know WHY?

I would LOVE people with more experience to ask me questions to help narrow down the possible problem space, or to come up with any suggestions you can.

Thanks for taking the time to read my long rambling...

Final PS: I re-read the above and fixed a few typos between #1 and #2. For the avoidance of doubt, the issue occurs in #1; #2 has worked perfectly for 5 years. Just in case I made typos that I missed...
 
My recommendation is to take a very close look at the Reliability History/Monitor and the Event Viewer logs of both Machine 1 and Machine 2.

Compare the logs as best you can with respect to more recent events and all the swaps etc. that have been done.

Look for some common errors, warnings, or even informational events that follow the described problems.

Start with Reliability History/Monitor. Much more end user friendly and the timeline format may reveal some patterns.

Event Viewer will require more time and effort to navigate and understand. However, it is likely to have more information and details.

To help with Event Viewer:

How To - How to use Windows 10 Event Viewer | Tom's Hardware Forum (tomshardware.com)

In both tools you can click/select any given entry to get more details about what happened. The details may or may not be helpful. Error codes, OS and software references are all important.

Specifically you are looking for logged entries just before or at the times of the described airplane/black screen events.
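If you export the logs (or just copy the interesting rows by hand), the "what was logged just before each airplane" comparison can be sketched in a few lines of Python. This is only a rough illustration: the event rows, source names, IDs, and crash times below are all made up, and the two helper functions are hypothetical names, not part of any Windows tool. The crash times are ones you note yourself when the screen goes black, since the "unexpected shutdown" entry is only written at the next boot.

```python
from datetime import datetime, timedelta

def events_near(events, crash_times, before=timedelta(minutes=10),
                after=timedelta(minutes=1)):
    """Map each crash time to the events logged in a window around it.

    events: iterable of (timestamp, source, event_id, message) tuples.
    """
    ordered = sorted(events, key=lambda e: e[0])
    hits = {}
    for crash in crash_times:
        lo, hi = crash - before, crash + after
        hits[crash] = [e for e in ordered if lo <= e[0] <= hi]
    return hits

def recurring_sources(hits):
    """Count in how many crash windows each (source, event_id) pair shows up."""
    counts = {}
    for window in hits.values():
        # Use a set so a pair logged twice in one window is counted once.
        for key in {(src, eid) for _, src, eid, _ in window}:
            counts[key] = counts.get(key, 0) + 1
    return counts

# Made-up example data, NOT real log entries.
EVENTS = [
    (datetime(2025, 1, 10, 20, 0), "SourceB", 7, "unrelated, hours earlier"),
    (datetime(2025, 1, 10, 21, 55), "SourceA", 1001, "logged before crash 1"),
    (datetime(2025, 1, 17, 9, 28), "SourceA", 1001, "logged before crash 2"),
    (datetime(2025, 1, 17, 9, 29), "SourceC", 3, "logged before crash 2 only"),
]
CRASHES = [datetime(2025, 1, 10, 22, 0), datetime(2025, 1, 17, 9, 30)]

HITS = events_near(EVENTS, CRASHES)
COUNTS = recurring_sources(HITS)
for (source, event_id), n in sorted(COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{source} / {event_id}: seen before {n} of {len(CRASHES)} crashes")
```

Anything that shows up in most or all of the crash windows on Machine 1, and never on Machine 2, is worth a closer look. The 10-minute window is an arbitrary starting point; widen or narrow it as needed.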

Bear in mind that whatever is happening may not have a single cause. Could be some "perfect storm" of causes which is why the behavior is so intermittent. "Causes" includes software as well.
 
Thank you Ralston18. I'm an old Linux sysadmin and I've looked at the EventVwr logs many times; I forgot to mention that. Nothing in the App, Security, or System logs (apart from info entries) seems in any way relevant, just the error upon start-up noting that the system was shut down unexpectedly.

I have NEVER come across the Reliability and Problem History before, very interesting. There is only 1 consistent item, which is a DWM crash. My current thinking is that the DWM may crash when I power the machine down, but it IS possible, of course, that the DWM crashes and causes the problem. Seems *kind* of weird across multiple GPUs and multiple operating system installs, but it's certainly given me an angle of attack in an attempt to bisect the problem. Thank you for your help / input, much appreciated.
 
Well Mr Ralston18 sir, I don't KNOW you've put your finger on it, but I CAN tell you that #2 *does* have a GPU support, and that #1 *does not* and that both machines run vertically, so this *genuinely* could be the issue. I'm going to put in a GPU support and if the problem has not recurred in a year (say) then you've nailed it !

Thank you again

P.S. I looked for you on 'buymeacoffee.com' and if you ever sign-up, please reply to this thread, and I'll be your first donation 😎