Question [Nvidia] I need help tracking down a hardware fault: some games grey screen and cause the driver to crash

Aug 19, 2023
Hi all, I've been trying to track down an issue for over a month and for the life of me I just can't figure out what's going on or what else to try and test for faults.

A while back, some games randomly started causing driver crashes: my right monitor greys out while my main monitor with the game carries on for a few seconds, until everything black screens and the driver recovers. The games causing the issue had not been a problem before.

Things I've Tried:
DDU Drivers in Safe Mode
Turned XMP off
Ran Memtest - Passed
Ran Furmark Stress Test for 1hr+ - Passed
Ran Prime95 Stress Test for 1hr+ - Passed
Ran both tests simultaneously - Passed
Ran 3DMark - Passed
Ran HDD Checks - No bad sectors or warnings found
Reinstalled problem games to the same and alternative HDDs - Issue persists

Event Log just shows the usual "driver stopped responding and reset" TDR errors from a driver crash and nothing else. I also can't seem to reproduce the error outside of playing those specific games. Some games still work just fine. I'm honestly at my wits' end at this point trying to reproduce and track down the problem, and while I'd love a new rig and am well overdue one, it's just not on the cards right now, so any help would be greatly appreciated.
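In case it helps anyone pull those TDR entries out of the System log in one go, here is a minimal Python sketch that builds a `wevtutil` query. It assumes the crashes are logged as Event ID 4101 (the usual "display driver stopped responding and has successfully recovered" entry); that ID is my assumption and was not copied from this machine's log. `build_tdr_query` is a made-up helper name.

```python
from typing import List

# Assumption: TDR recoveries show up as Event ID 4101 in the System log.
TDR_EVENT_ID = 4101


def build_tdr_query(max_events: int = 20) -> List[str]:
    """Return a wevtutil command that lists the newest TDR events as text."""
    xpath = f"*[System[(EventID={TDR_EVENT_ID})]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on the event ID
        "/f:text",                   # human-readable output
        f"/c:{max_events}",          # cap the number of events returned
        "/rd:true",                  # newest events first
    ]


print(" ".join(build_tdr_query()))
```

On Windows the returned list can be passed to `subprocess.run(..., capture_output=True, text=True)`; comparing the timestamps against when each game crashed shows whether every grey screen lines up with a TDR entry.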

The Build:
Win 10
Corsair H80i
Intel i7-4790
Gigabyte Z97X-Gaming 7
32gb G.Skill RipjawsZ DDR3-1600 CL10-11-10 1.50V (F3-14900CL10Q-32GBZL)
EVGA 12G-P4-2990-KR GeForce GTX Titan X
EVGA SuperNOVA 650 G1
 
Ran Memtest - Passed
How many passes?

I also can't seem to reproduce the error outside of playing those specific games.
Those games are?

Also, try running Unigine Superposition,
link: https://benchmark.unigine.com/superposition

Use 1080p Extreme preset. It will stress test your GPU.

Corsair H80i
CPU and GPU temps are? Both idle and under load.

You have an ATX MoBo and most likely a proper PC case, so there is no reason to use that poor CPU cooler with a Core i7.
The only justification for a single-slot rad is a mini-ITX build where the PC case doesn't have enough clearance for a mid-sized CPU air cooler. But you don't have a mini-ITX build.

EVGA 12G-P4-2990-KR GeForce GTX Titan X
Titan X ... quite obscure GPU to use.

Two things to try:
#1 Take out your Titan X and plug your monitor into the MoBo. See if you get the same issues. Do note that the iGPU may not be powerful enough to run those specific games.
#2 Try a second, known-working GPU. This will tell you whether the issue is with your Titan X or not.
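The reasoning behind those two tests can be written down as a tiny decision helper. This is only a sketch of the elimination logic, with made-up function and parameter names, and it cannot tell a failing card apart from a driver problem specific to that card.

```python
from typing import Optional


def likely_suspect(crashes_on_igpu: bool,
                   crashes_on_spare_gpu: Optional[bool] = None) -> str:
    """Map the outcome of the two swap tests to the most likely suspect.

    crashes_on_spare_gpu may be None when no second GPU was available.
    """
    if crashes_on_igpu or crashes_on_spare_gpu:
        # The crash follows the rest of the system, not the Titan X.
        return "not the Titan X alone: look at RAM, PSU, motherboard, OS or the games"
    # No crash without the Titan X installed points at the card or its driver.
    return "the Titan X or its driver"


print(likely_suspect(crashes_on_igpu=False))
```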
 
How many passes?
3 or 4 if I'm remembering right, I let it run for a few hours
Those games are?

Also, try running Unigine Superposition,
link: https://benchmark.unigine.com/superposition

Use 1080p Extreme preset. It will stress test your GPU.
At the moment Destiny 2 and Gundam Evolution are basically guaranteed to crash at some point. Rocksmith has crashed once or twice too.

Passed the Extreme test just fine with no crashes. Not spectacular fps results but about what I expected.
CPU and GPU temps are? Both idle and under load.

You have an ATX MoBo and most likely a proper PC case, so there is no reason to use that poor CPU cooler with a Core i7.
The only justification for a single-slot rad is a mini-ITX build where the PC case doesn't have enough clearance for a mid-sized CPU air cooler. But you don't have a mini-ITX build.

CPU usually idles around 30-40°C and maxes out at 70°C under full load; GPU idles around 40°C and maxes at 85°C. Thermal throttling doesn't get triggered on either, going by HW monitoring.

As for the cooler itself, I bought a less than stellar budget Corsair case about 8 years ago. It doesn't have a lot of room and can't really fit anything bigger, and it's been doing the job just fine for nearly a decade.
Titan X ... quite obscure GPU to use.

Two things to try:
#1 Take out your Titan X and plug your monitor into the MoBo. See if you get the same issues. Do note that the iGPU may not be powerful enough to run those specific games.
#2 Try a second, known-working GPU. This will tell you whether the issue is with your Titan X or not.

The Titan X was a new release when I bought it. I went big with it, money that in hindsight I should have spent on other components, but oh well. Overall the build has lasted nearly a decade without needing anything replaced other than a HDD, and it is only just starting to struggle to get brand new games playable, so it's served me well. If the timing wasn't abysmal I'd just build a new one.

I'll try running off the integrated when I get a chance and see if I can borrow a GPU off someone for a few hours to test.
 
I second @Aeacus's advice, sounds like it could be the graphics card.

Just to add: have you tried troubleshooting with only one monitor connected? That would show whether the card has lately developed a power issue where it can't handle the increased resolution or refresh rate anymore, or is flaky about it. Which leads me to another question about the power plan in use for the graphics card: running in performance mode rather than optimal could make it more stable. You could also try giving it a little more voltage with MSI Afterburner.
 
3 or 4 if I'm remembering right, I let it run for a few hours
I don't think you even ran one full pass (all 13 tests), let alone 3-4 passes.

For each 8GB of RAM, one full pass of memtest86 takes ~1h, and with 16GB in total it takes ~2.5h. So for the 32GB (4x 8GB) you have, one full pass takes ~5 hours or so. 2 full passes would be ~10 hours and 4 full passes would be ~20 hours.

2 full passes would be minimum, while 4 full passes is considered acceptable when testing out the RAM.
With this, RAM is still suspect since you didn't test it properly.
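The arithmetic above, as a small sketch. The ~5 hours per pass for 32GB is the rule-of-thumb figure from this post and not a measurement; real pass times vary with memory speed, CPU and memtest version. The function names are made up for illustration.

```python
# Rule-of-thumb figure from the post: ~5 hours per full pass with 32 GB.
HOURS_PER_PASS_32GB = 5.0


def estimated_hours(passes: int,
                    hours_per_pass: float = HOURS_PER_PASS_32GB) -> float:
    """Rough wall-clock time needed for the given number of full passes."""
    return passes * hours_per_pass


def full_passes_completed(hours_run: float,
                          hours_per_pass: float = HOURS_PER_PASS_32GB) -> int:
    """How many full passes fit into a run of the given length."""
    return int(hours_run // hours_per_pass)


for passes in (1, 2, 4):
    print(f"{passes} pass(es): ~{estimated_hours(passes):.0f} h")
print("Full passes in a 3 h run:", full_passes_completed(3))
```

By this estimate a run of "a few hours" would not finish even one full pass, which is why the RAM stays on the suspect list.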

At the moment Destiny 2 and Gundam Evolution are basically guaranteed to crash at some point.
Either both games use specific code that doesn't like your Titan X, or Nvidia hasn't covered all issues in their drivers for the Titan X.

2nd GPU, modern one, would most likely fix your issues.

The Titan X was a new release when I bought it. I went big with it, money that in hindsight I should have spent on other components, but oh well.
Titan X, while once the best money could buy, never sold in large numbers and thus, most likely, isn't kept up driver-wise. Then again, back then the Titan X was a hard sell, costing double what the GTX 980 cost while offering only ~36% better performance. The follow-up GPU, GTX 980 Ti, performed roughly on par with the Titan X while still costing less.

In that sense, yes, getting a Titan X (or any Titan, for that matter) would be a huge waste of money. A good gimmick GPU though.

and maxes at 85
GTX Titan X has max temp of 83C,
review: https://www.tomshardware.com/reviews/nvidia-geforce-gtx-titan-x-gm200-maxwell,4091-6.html

So, how yours goes to 85C is strange. But in any event, above 80C is bad for any GPU.
I'd look into reducing GPU temps, though I doubt it would fix your issue, which looks more like a driver issue than a hardware (thermal) one. Still, it would be worth a try.
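To see how long the card really sits at or above that 83C figure while a problem game runs, the temperature can be logged with `nvidia-smi` and the log filtered afterwards. A minimal sketch, assuming the log was captured with the command in `QUERY_CMD`; the sample values are made up and `samples_over_limit` is a hypothetical helper.

```python
import csv
import io
from typing import List, Tuple

# Command assumed to have produced the log (one sample per second):
QUERY_CMD = ("nvidia-smi --query-gpu=timestamp,temperature.gpu "
             "--format=csv,noheader,nounits -l 1")

LIMIT_C = 83  # max temperature figure quoted from the linked review


def samples_over_limit(log_text: str,
                       limit_c: int = LIMIT_C) -> List[Tuple[str, int]]:
    """Return (timestamp, temperature) pairs at or above the limit."""
    hot = []
    for row in csv.reader(io.StringIO(log_text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        timestamp, temp = row[0].strip(), row[1].strip()
        if temp.isdigit() and int(temp) >= limit_c:
            hot.append((timestamp, int(temp)))
    return hot


# Made-up sample data, not readings from the machine in this thread.
sample_log = """2023/08/19 21:04:05.123, 78
2023/08/19 21:04:06.125, 83
2023/08/19 21:04:07.127, 85
"""
print(samples_over_limit(sample_log))
```

If the crash timestamps never coincide with the hot samples, that supports the view that temperature is not the trigger.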

I bought a less than stellar budget Corsair case
I wonder, model of the PC case is?
 
I don't think you even ran one full pass (all 13 tests), let alone 3-4 passes.

For each 8GB of RAM, one full pass of memtest86 takes ~1h, and with 16GB in total it takes ~2.5h. So for the 32GB (4x 8GB) you have, one full pass takes ~5 hours or so. 2 full passes would be ~10 hours and 4 full passes would be ~20 hours.

2 full passes would be minimum, while 4 full passes is considered acceptable when testing out the RAM.
With this, RAM is still suspect since you didn't test it properly.

It definitely ran a full pass, I know that much, and I can run another overnight if needed, but I think I've found the culprit in the GPU. I let one of the problem games run off the iGPU for the last two hours and it didn't crash. Normally it crashes within 10-40 minutes, if not earlier.
Either both games use specific code that doesn't like your Titan X, or Nvidia hasn't covered all issues in their drivers for the Titan X.

2nd GPU, modern one, would most likely fix your issues.

I actually get pretty frequent driver updates, with pretty much each new Game Ready driver. However, I agree there might be an issue there. It's driving me crazy that I can't seem to reproduce the problem outside of a couple of titles.

GTX Titan X has max temp of 83C,
review: https://www.tomshardware.com/reviews/nvidia-geforce-gtx-titan-x-gm200-maxwell,4091-6.html

So, how yours goes to 85C is strange. But in any event, above 80C is bad for any GPU.
I'd look into reducing GPU temps, though I doubt it would fix your issue, which looks more like a driver issue than a hardware (thermal) one. Still, it would be worth a try.

85 was at peak, it usually only stays that high for a minute max before dropping into the 78-83 range instead. Unfortunately at the moment aside from just trying to jam an extra fan in I can't think of a way to get those temps to drop.
I wonder, model of the PC case is?
Corsair Graphite Series 230T. The case itself could probably fit a dual rad on the top but clearance would be a problem. It'd impact the top of my RAM and a bit of heat shielding on my mobo.
I second @Aeacus's advice, sounds like it could be the graphics card.

Just to add: have you tried troubleshooting with only one monitor connected? That would show whether the card has lately developed a power issue where it can't handle the increased resolution or refresh rate anymore, or is flaky about it. Which leads me to another question about the power plan in use for the graphics card: running in performance mode rather than optimal could make it more stable. You could also try giving it a little more voltage with MSI Afterburner.

Already running in performance mode with no change, sadly. I can try giving it a little more power, but it doesn't even waver at max power draw when running Furmark.
 
Corsair Graphite Series 230T. The case itself could probably fit a dual rad on the top but clearance would be a problem. It'd impact the top of my RAM and a bit of heat shielding on my mobo.
How come the AIO is only cooling option for you? 🤔

Graphite 230T has CPU cooler clearance of 165mm, meaning that you can easily put a big boy air cooler in it. E.g king of air coolers Noctua NH-D15 at 165mm or Noctua NH-D15S at 160mm.

Unfortunately at the moment aside from just trying to jam an extra fan in I can't think of a way to get those temps to drop.
Plenty of ways;
* downclock the GPU
* cap the FPS
* play at lower resolution and/or lower graphical settings
* increase case fan and/or GPU fan speeds

The idea is not to let the GPU run full bore, so it doesn't max out temperature-wise.
Also, when was the last time you cleaned the dust out of the PC's innards?

I actually get pretty frequent driver updates, with pretty much each new Game Ready driver.
Have you ever read what those new updates contain?

For the most part, Nvidia releases so called "umbrella" drivers, that encompass several GPUs at once, rather than specific GPU.
E.g 536.67 update notes: https://www.nvidia.com/download/driverResults.aspx/209266/en-us/

Those drivers are for several generations of GPUs, including your GTX Titan X and my GTX 1660 Ti, but the update itself did 0 for both of our GPUs. Instead, it added support for RTX 4060 Ti and fixed one issue with Ampere architecture GPUs (RTX 30-series).

I'm running 536.23 myself and haven't updated it for a while, since "if it ain't broke - don't fix it", all the more so because the new "update" doesn't do any good for my GPU. But there can be "bad drivers", where the latest update messes with the system, and I've had to revert to older, stable drivers. Twice now. After the 2nd time, I called "F-it!" and haven't updated drivers since. No issues for the past few years.

but I think I've found the culprit in the GPU.
One option would be reverting to GPU drivers a version or two older. It might help if the latest drivers don't want to play ball with your old GPU.
 
How come the AIO is only cooling option for you? 🤔

Graphite 230T has CPU cooler clearance of 165mm, meaning that you can easily put a big boy air cooler in it. E.g king of air coolers Noctua NH-D15 at 165mm or Noctua NH-D15S at 160mm.


Plenty of ways;
* downclock the GPU
* cap the FPS
* play at lower resolution and/or lower graphical settings
* increase case fan and/or GPU fan speeds

The idea is not to let the GPU run full bore, so it doesn't max out temperature-wise.
Also, when was the last time you cleaned the dust out of the PC's innards?
Case gets cleaned all the time and was cleaned about 2 weeks ago. I went with an AIO at the time based on advice and reviews from friends, and it's been working fine for nearly a decade, so I didn't do too badly. As I mentioned earlier, thermal throttling hasn't been much of an issue.
Have you ever read what those new updates contain?

For the most part, Nvidia releases so called "umbrella" drivers, that encompass several GPUs at once, rather than specific GPU.
E.g 536.67 update notes: https://www.nvidia.com/download/driverResults.aspx/209266/en-us/

Those drivers are for several generations of GPUs, including your GTX Titan X and my GTX 1660 Ti, but the update itself did 0 for both of our GPUs. Instead, it added support for RTX 4060 Ti and fixed one issue with Ampere architecture GPUs (RTX 30-series).

I'm running 536.23 myself and haven't updated it for a while, since "if it ain't broke - don't fix it", all the more so because the new "update" doesn't do any good for my GPU. But there can be "bad drivers", where the latest update messes with the system, and I've had to revert to older, stable drivers. Twice now. After the 2nd time, I called "F-it!" and haven't updated drivers since. No issues for the past few years.


One option would be reverting to GPU drivers a version or two older. It might help if the latest drivers don't want to play ball with your old GPU.

I tried drivers from before I encountered the issue, 532.03 to be precise, and no dice. So I'm betting on a hardware issue somewhere, most likely GPU related as suspected. It's just driving me nuts that I can't seem to reproduce the issue outside of some games.