PSU or GPU - what is dying over here? Gigabyte GTX 970 black screen issue

ydoumus

Distinguished
Mar 24, 2014
47
0
18,530
A while back I moved into a new place; my buddies helped me transport my rig.

After I settled in and connected everything up, I started getting hard black-screen freezes. Sometimes the audio would keep playing in the background and I could still control the media player (foobar2000) with keyboard shortcuts; other times I would just end up getting a buzzing sound. I have tried 2 PSUs with the same sad results, so I assumed it had to be my GPU.

I've taken my GPU to a local IT service store and they reported that the VRAM is broken. However, I can do everything apart from playing games, and even those take about 5-20 minutes to freeze up my PC. This led me to believe it might not be the GPU after all.

Current build:
AMD FX 8350 @ 4 GHz
Gigabyte 970A-D3SP
16 GB KS HyperX @ 1600 MHz
GeForce GTX 970 (@ Gigabyte stock clock)
ASUS Xonar D1 temporarily removed
SSD + HDD, OS installed on SSD.
750W Corsair PSU / 600W Corsair PSU

I did not install any driver updates or software prior to moving out, with the possible exception of ninja Win 10 updates that I have had no control over. I have tried re-rolling numerous drivers on the GPU, but to no avail.

GPU temps are definitely not the culprit; I was running liquid cooling on my 970 for nearly a year with temps never exceeding 40*C on the core at peak performance.

Why am I starting to doubt the diagnosis I've got from the local store? Coil whine on the PSU, which stops the exact moment the screen goes black.

Anyone? Any ideas? Could an unresolved issue from this thread http://www.tomshardware.co.uk/answers/id-3258752/performance-issue-sound-crackling-fps-drops-keyboard-mouse.html have anything to do with this?

PS. I have run numerous GPU memtests with 10 passes each with no system crashes
PPS. I have also run numerous SSD/HDD and RAM tests with flawless results
 

aielthor

Commendable
Jan 27, 2017
5
0
1,510
If you have onboard graphics, maybe unplug the video card and run your monitor off the mobo to see if that helps? And if your PSU is making weird noises, it's probably a good idea to change that out anyway.
 

ydoumus

Thanks for quick replies. @aielthor:
This motherboard has no integrated graphics. However, I have tested the build with an older HD 6870 I had lying around, which consumes roughly the same amount of power as the 970. The system ran flawlessly without any crashes, though gaming performance, obviously, left much to be desired.

@TMTOWTSAC:
I did that. In fact, I have also swapped entire PSUs as well, so all power connections were reconnected.

 

aielthor

Yeah, I just googled your mobo. And I'd have to agree, it sounds like a dead GPU. Test it in another rig if you can, as stated above, to know for sure.
 

ydoumus



But how can it be dead if I am writing this message with it running inside my rig right now? How can it be dead if it passes multiple benchmarks and GPU core/VRAM stress tests? That's what I don't understand.
 
I was reading the article on GPU cooling Tom's posted a few days back. It mentioned something I'd never considered before. They said some GPU components (VRAM, VRMs, etc.) rely on indirect airflow from the card fans. Normally that's never an issue: any time the card is working, the fans are spinning to cool the GPU and, as a side effect, moving some air past the rest of the card. But with a water cooling setup, that's no longer the case. Only the GPU has a temperature sensor, so the other components could be overheating and taking damage over time.

http://www.tomshardware.com/reviews/optimizing-graphics-cooling,4838-3.html
 

aielthor

^this if you are running water cooling
Also, your VRAM likely won't get much usage outside of gaming either, so it won't heat up as quickly or get as hot.
 

ydoumus

I did install my stock cooler back just to see if it makes any difference. Sadly, it did not. So the damage might have already been done.

You are right to say what you did about VRAM usage, but then again: how come the VRAM tests came out clean and the system remained stable?
 
If the VRM's are flaky there might be power fluctuations. That can cause RAM failures and if they only occur when the power draw is high they wouldn't be triggered by a memtest. I admit though, this is all speculation. Without testing the card separately there's no way to be sure.

I did see another thread about a 970 this week; it would stop outputting mid-game while the programs kept running, causing his monitor to start scanning for source inputs. That led me to this thread:

http://forums.evga.com/GTX-970-Black-Screen-Crash-during-game-SOLVED-RMA-m2248453.aspx

Different model, but it's enough to make me think this type of failure might be common to 970 boards.

Edit: Some people were able to reach stability by undervolting and underclocking their cards, but it varied wildly.
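
One way to check the power-draw theory is to log the card's telemetry while gaming and look at the last samples before a black screen. Here is a minimal sketch, assuming the log was captured with something like `nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.gr,clocks.mem --format=csv,noheader,nounits -l 1` redirected to a file. The column order, the helper names and the sample rows are assumptions for illustration, not measurements from this thread. Keep in mind that only the GPU core has a temperature sensor, so VRM or VRAM heat will not show up in such a log.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    temp_c: float    # GPU core temperature
    power_w: float   # board power draw
    core_mhz: float  # graphics clock
    mem_mhz: float   # memory clock


def parse_line(line: str) -> Optional[Sample]:
    """Parse one 'temp, power, core clock, mem clock' CSV row; None if unusable."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        return None
    try:
        return Sample(*(float(p) for p in parts))
    except ValueError:
        # e.g. "[Not Supported]" when the driver cannot report a field
        return None


def last_samples(lines, n=3):
    """Return the last n valid samples, i.e. the state right before logging stopped."""
    samples = [s for s in map(parse_line, lines) if s is not None]
    return samples[-n:]


# Made-up log excerpt (hypothetical values).
example_log = [
    "33, 14.20, 135, 405",
    "58, 151.75, 1329, 3505",
    "60, [Not Supported], 1329, 3505",
    "60, 163.10, 1342, 3505",
]

for s in last_samples(example_log, n=2):
    print(s)
```

If the final samples show the power draw spiking while the core temperature stays modest, that would be consistent with (though not proof of) a power delivery problem rather than a core overheating problem.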
 
Solution

ydoumus



Thank you very much for this, it gave me a lot of insight into the issue. It appears that the voltage regulators might be the issue here, and not the VRAM itself.

I downclocked my core and VRAM by 100 MHz for now, rolled back my drivers to the Gigabyte ones from 2015 and am trying a custom fan curve; loud, but it seems to work for now. I also set my TDP to 90% max so the GPU doesn't try to draw too much power.

I didn't do much testing, but I've spent a good 15 minutes in game at extremely high and demanding settings (Path of Exile really does stress all your components) without a black screen. I also manually selected the CPU to do PhysX calculations. Will get back tomorrow evening with more info. I will keep this unresolved for now, as I fear I might have just had some luck tonight.
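
For anyone wondering what a 90% power limit and a 100 MHz downclock amount to in absolute terms, it is simple arithmetic. A rough sketch follows; the 150 W board power and the 1200 MHz stock core clock are placeholder numbers, not the actual specs of this Gigabyte card, and the function names are made up for the example.

```python
def power_cap_watts(board_power_w: float, limit_percent: float) -> float:
    """Power the card may draw when the limit slider is set to limit_percent."""
    return board_power_w * limit_percent / 100.0


def offset_clock_mhz(stock_mhz: float, offset_mhz: float) -> float:
    """Resulting clock after applying an offset (negative means downclock)."""
    return stock_mhz + offset_mhz


# Placeholder figures, not the real specs of the card in this thread.
BOARD_POWER_W = 150.0
STOCK_CORE_MHZ = 1200.0

cap = power_cap_watts(BOARD_POWER_W, 90)       # 90% limit, as in the post
core = offset_clock_mhz(STOCK_CORE_MHZ, -100)  # 100 MHz downclock, as in the post
print(f"power cap: {cap:.1f} W, core clock: {core:.0f} MHz")
```

Plug in your own card's figures to see how much headroom a given slider setting actually takes away.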
 

ydoumus

I have done a bit more testing tonight, with no crashes whatsoever so far. I am very curious as to what is going on. I have done 2 intense 15-minute gameplay sessions maxing out my GPU @ 80% TDP, but downclocked by 150 MHz on the core and memory. Given that I am running on the stock cooler now (to be honest with you, I was preparing to RMA the hell out of the card, hoping that nobody would notice that the stock cover was ever taken off :p ), I am getting stable 60*C peak temps at the cost of loud fan noise (60% RPM).

My conclusions so far are:

1. Intense 15-minute sessions, with the game minimized for 5 minutes after each one, allow for continued gameplay with no video output crash
2. The GPU [core], CPU, RAM, mobo, SSD and HDD were not damaged before, during or after the move
3. The troubleshooting so far suggests that the issue lies in one of the following:
a) the PSU having hiccups and not being able to deliver high enough wattage to the card at 100% or higher TDP with the factory Gigabyte OC
b) power fluctuations in the new place that are big enough to cause too much or too little power being delivered to the PSU and, through it, the GPU
c) VRAM/voltage regulators that were previously damaged by insufficient cooling under my liquid cooling loop and now require much lower temps to run smoothly, which the stock cooler achieves with a, I must say, good heat pipe system [well done Gigabyte], though an extremely loud fan [crappy work, Gigabyte]

In the thread that TMTOWTSAC posted, most people experienced video crashes either right away or 1-2 minutes into the game. Me? It never happened that quickly.
 
I saw this:

http://www.overclock.net/t/1517026/msi-gtx-970-vrm-temperatures#post_22958490

They were concerned about a cluster of 4 VRAM chips on the back of the card. Those chips have no cooling, and the posters recommended slapping some cheap MOSFET heat sinks on them. This wouldn't rule out any existing damage of course, but if it lets you run the fans quieter it may be worth looking at. 60C is a pretty aggressive target, and might only be tangentially helping through increased airflow and less waste heat going through the PCB. It might be as easy to test as running with the case open and pointing a desk fan at the back of the card. Do you have any case fans to help circulate air through the case?
 

ydoumus



I have a fully set up Z11 Neo case with 2x 120mm + 2x 80mm intake fans, and 3x 120mm exhaust fans. Quiet due to low RPM, but extremely reliable:

Idle CPU temp @ 20*C / peak 40*C (liquid-cooled)
Idle MoBo temp @ 29*C / peak 38*C (air-cooled)
Idle GPU temp @ 33*C / peak 60*C (air-cooled); it's 60*C because I force my GPU fan to keep it at that at all cost for now. It used to be 34*C (!) with my liquid cooling setup at peak performance

@ the MSI GTX 970: my Gigabyte card does not have any VRAM chips on the backside. They are all mounted on the inside right below the radiator plate. :) MSI sure did F up their board design here, haha.
EDIT: Okay, my bad. It was showing a Gigabyte card, but it's not my model. Mine has a completely different piping and VRAM setup.
 

ydoumus

I have now bumped the TDP back to 100% as the GPU performance started to get insanely bad in game. So far no crashes, which may suggest that it actually WAS a cooling issue on either the VRAM or the voltage regulators. More info to come.

I am updating this thread constantly so that "future generations" that still use 970s and experience similar issues have a decent amount of info on the subject. [insert famous rainbow star and cue the jingle]
 

ydoumus

Whatever it was, it is now fixed. :)

Thank you so much for all your insightful feedback, especially TMTOWTSAC with his amazing find on the EVGA 970 issue. I must say that I am extremely drained trying to find out what exactly was causing the issue, but I believe I have narrowed it down to the following two reasons:

1. Overheating of the VRAM/voltage regulators on the video card; I put the stock cooler back in place of my liquid cooling
2. Extremely poor Nvidia drivers for the 970; I have rolled back to version 355.82 from Gigabyte

I hope this thread will help people out in the future. Please DO note, everyone: it may not be necessary to RMA your 970 if you start experiencing issues similar to mine. Check your temps and make sure your fan curve is set up correctly. I have found that forcing the temps to stay at 60*C tops works great.

OC is back at factory settings, and TDP at 118%.
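
To illustrate what a fan curve does: it is just a set of (temperature, fan speed) points with linear interpolation in between. A minimal sketch follows; the points are hypothetical and not my exact settings, and this only computes a target speed, so you still need whatever fan-control utility you use to apply it.

```python
from bisect import bisect_right

# Hypothetical (temperature in °C, fan speed in %) points; tune for your own card.
CURVE = [(30, 30), (50, 45), (60, 60), (70, 100)]


def fan_percent(temp_c: float, curve=CURVE) -> float:
    """Linearly interpolate fan speed between curve points, clamped at both ends."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])
    if temp_c >= temps[-1]:
        return float(curve[-1][1])
    i = bisect_right(temps, temp_c)  # index of the first point above temp_c
    (t0, f0), (t1, f1) = curve[i - 1], curve[i]
    return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)


for t in (25, 55, 60, 80):
    print(t, fan_percent(t))
```

The steep ramp above 60°C is what keeps the core from climbing much past that point, at the price of noise.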
 

ydoumus

Update: the card still works and performs perfectly fine. I have now increased the overclock on both core and VRAM by an additional 150 MHz. True, I have a jet plane in my case, but it feels sooooooo good to be back to proper gaming with proper FPS.

Once again, thanks everyone for helping me sort this out. I will have to pay a visit to my local repair shop to ask them how exactly they deduced that the VRAM was faulty on the card... and ask for my money back.
 

ydoumus

Quick update: GPU officially dead.

Despite my best efforts to keep the core at 60*C with a jet engine inside my case, this morning one of my card's voltage regulators got fried really badly. The GPU and RAM units are in perfect shape, though. Gigabyte really screwed up the VRM cooling on this card.

Proceeding with RMA.