Background: I have enough hardware knowledge to be 'dangerous'. My claim to fame: many years back I built a 200+ GPU ETH mine on a mix of different M/Bs, with all the GPUs bought second-hand from eBay, and ran it at a profit for some investors. So I can do things like repasting and basic soldering, but really I'm a software geek.
Problem background: Some years ago I bought a good machine (G/Byte M/B, Core i9 and two Z790 GPUs), which was really my 'desktop for general use', but I ran it overnight to mine a little extra ETH. Over time I purchased a newer machine (again G/Byte M/B, Core i9, etc.). You'll see later why the details of these machines are probably NOT the problem. We can call the original machine #1 and the new machine #2. When I bought #2, I took one of the GPUs (they are identical in every way, except serial number) and put it in #2. For a while, everything was fine...
Problem manifestation: Machine #1 will sometimes do what I call 'airplane'. EVERYTHING goes black, the GPU spins up to maximum RPM (both fans), and the machine takes no input whatsoever; I just have to switch off the power / hard reset.
Problem timing: When #1 did 'airplane' the first time, it had been running for 6 months. Then it did it again a month later (ish), then again a week later, then several days later. So I'm thinking there is something wrong with the GPU. I take it out (nothing else touched), check it over (it's like perfect: not a scratch, no brown marks, nothing) and put it back, and it works fine. So I'm thinking it's a weird O/S (Windows 11) glitch, no worries. About 6 months later (not exact, maybe 4 or 5), it airplanes again. I reboot, it does it again 2 weeks later, then a few days later. This time I persist, and the gaps get shorter and shorter, until after a week or so it only runs 5 mins and then airplanes.
Attempted resolution: So I check everything: memory tests, motherboard tests, GPU tests, you name it, I test it, software and hardware, and all seems fine. So I declare the GPU 'dubious' and swap the GPUs between machines #1 and #2. So now I have the other GPU in #1. It runs perfectly for a few months, and then the machine airplanes. Same situation: again a few weeks later, then a few days later, then a few hours later. I pop the GPU out of the slot [I'm wiser now] and put it back in the same slot, fire it up, and it runs for many months again.
Further context: This has been happening for about 4.5 years now. Both machines are perfect, great machines, and both GPUs seem indistinguishable, except that in machine #1 this 'airplane' still happens, and then happens more and more frequently, until I pop the GPU out and put it back. I updated the firmware on the motherboard to the latest (stable) version recommended by G/Byte (about 2 years ago) and I still get the same problem. WHERE do I start looking for the issue?
Some specific statements:
- I feel it cannot be the GPUs, because switching them gives the same problem in #1 and no problem in #2
- I feel it cannot be the GPU slot on the motherboard, because having the same problem on 2 slots on an M/B seems unlikely
- It's not the memory (DDR4), because I've switched all the memory between the machines, same issue
- It's not the PSU, I've switched those too; they are different, but both are Gold-rated > 1000 W units, and at idle each machine draws < 100 W
- It *COULD* be the CPU on #1, as I don't have the means to switch them (the particular version in #1 does not work in #2), but surely this is unlikely given the situation
- It's NOT the Windows 11 installation. About a year back the SSD died on #1, and I now have a completely new SSD and a completely new Win 11 installation, and it still does exactly the same, so I'm ruling out the SSD and I'm ruling out the operating system
- It's not the M/B BIOS/firmware; I updated it 2 years ago and the same problem persists.
- The MOST likely (in my mind) is the Gigabyte M/B on #1. Now I LOVE G/Byte M/Bs. I've had issues with almost EVERY other manufacturer, and I refuse to use anything except G/Byte now, and in running 30-40 machines (sometimes for YEARS with nothing more than a quick restart) I've never had a G/Byte motherboard give me an issue. Is my faith misplaced? I don't REALLY want to go and buy a replacement motherboard (£100?), so I'm just living with the issue. Popping the GPU out twice a year takes me about 2 mins now, so it's hardly a problem. HOWEVER, I would love to know WHY?
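Being a software geek, here is the elimination above written out as a rough Python sketch, in case it helps anyone spot a hole in my logic. The dictionary keys and the `remaining_suspects` helper are just names I made up for illustration, and the rule (anything I swapped or replaced while the fault stayed on #1 is cleared) is my own simplification. It ignores the fact that reseating the GPU buys me months of uptime, and it treats a firmware update the same as a swap.

```python
# Each entry records whether a component on machine #1 was swapped/replaced,
# and whether the 'airplane' fault stayed with machine #1 afterwards.
# Values come from the statements above; the structure is only illustrative.
observations = {
    "GPU":                {"swapped": True,  "fault_stayed_on_1": True},
    "RAM (DDR4)":         {"swapped": True,  "fault_stayed_on_1": True},
    "PSU":                {"swapped": True,  "fault_stayed_on_1": True},
    "SSD":                {"swapped": True,  "fault_stayed_on_1": True},
    "Windows 11 install": {"swapped": True,  "fault_stayed_on_1": True},
    "BIOS/firmware":      {"swapped": True,  "fault_stayed_on_1": True},
    "CPU":                {"swapped": False, "fault_stayed_on_1": None},
    "Motherboard":        {"swapped": False, "fault_stayed_on_1": None},
}


def remaining_suspects(obs):
    """Return components not cleared by a swap test (simplified rule)."""
    suspects = []
    for name, result in obs.items():
        if not result["swapped"]:
            # Never swapped, so it cannot be ruled out.
            suspects.append(name)
        elif not result["fault_stayed_on_1"]:
            # The fault moved with the component, so it is implicated.
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    print(remaining_suspects(observations))
```

By this (admittedly crude) logic only the CPU and the motherboard on #1 are left, which matches my gut feeling above.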
I would LOVE people with more experience to ask me questions to help narrow down the possible problem space, or to come up with any suggestions they can.
Thanks for taking the time to read my long rambling...
Final PS: I re-read the above and fixed a few typos between #1 and #2. For the avoidance of doubt, the issue occurs in #1; #2 has worked perfectly for 5 years. Just in case I made typos that I missed...