Dreaded TDR Error Crashing PC Worse Than A Budget Airline

BallisticChickn

Aug 23, 2013
Oh goody, another TDR thread right? Yeah. So I've tried to troubleshoot this and follow as much advice as I could read, but I'm having trouble coming up with anything conclusive. I'm not super experienced troubleshooting hardware other than an HD, so I apologize in advance for any foolish oversights.

Setup: Fully stable machine (specs below) for about a year. I downloaded Saints Row IV and after about ten minutes of play the system hard locked with audio stutter and suddenly my poor graphics card shifted into 100% fan speed. I powered off at the switch and restarted. I updated Nvidia drivers and tried again - same deal. Taking the hint I uninstalled the game and tried playing SR 3 - which has always been stable - and THAT caused the same crash. The crash ONLY happens with games and seems to only happen with Steam games. The only other non-steam game I really have is ARMA 2, on full settings it hasn't crashed at all but I'm not sure if that's a meaningful comparison.

System Specs: Intel Core i5-3570K @ 3.40 GHz - Win 7 64bit - 8 GB RAM (2 sticks of 4GB GSKILL) - EVGA Nvidia GTX 660 - PSU Rosewill Hive 750 - MB ASUS P8Z77-V LK.

Log Info from Who Crashed:

crash dump file: C:\Windows\Minidump\082213-26145-01.dmp
This was probably caused by the following module: nvlddmkm.sys (nvlddmkm+0x8F06BC)
Bugcheck code: 0x116 (0xFFFFFA800AA18010, 0xFFFFF8800530B6BC, 0xFFFFFFFFC000009A, 0x4)
Error: VIDEO_TDR_ERROR
file path: C:\Windows\system32\drivers\nvlddmkm.sys
product: NVIDIA Windows Kernel Mode Driver, Version 320.49
company: NVIDIA Corporation
description: NVIDIA Windows Kernel Mode Driver, Version 320.49

crash dump file: C:\Windows\Minidump\082213-30139-01.dmp
This was probably caused by the following module: nvlddmkm.sys (nvlddmkm+0x8E9ED0)
Bugcheck code: 0x116 (0xFFFFFA8006793010, 0xFFFFF880061C6ED0, 0xFFFFFFFFC000009A, 0x4)
Error: VIDEO_TDR_ERROR
file path: C:\Windows\system32\drivers\nvlddmkm.sys
product: NVIDIA Windows Kernel Mode Driver, Version 320.49
company: NVIDIA Corporation
description: NVIDIA Windows Kernel Mode Driver, Version 320.49

They all basically look the same- obviously I can furnish more info if needed.
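
For anyone picking these dumps apart later, here is a minimal Python sketch of how the bugcheck line breaks down. The parameter meanings in the comments follow Microsoft's bug check reference for 0x116 (listed there as VIDEO_TDR_FAILURE; WhoCrashed labels it VIDEO_TDR_ERROR), and 0xC000009A is the NTSTATUS value STATUS_INSUFFICIENT_RESOURCES. The function names are made up for illustration and are not part of any tool.

```python
import re

# Meaning of the four 0x116 parameters, per Microsoft's bug check reference.
PARAM_MEANINGS = (
    "pointer to the internal TDR recovery context",
    "pointer into the responsible device driver module",
    "error code (NTSTATUS) of the last failed operation, if available",
    "internal context-dependent data",
)

# Only the one status code seen in these dumps; not a complete table.
KNOWN_NTSTATUS = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}


def parse_bugcheck(line):
    """Return (code, [param1..param4]) from a WhoCrashed 'Bugcheck code' line."""
    match = re.search(r"Bugcheck code:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)", line)
    if match is None:
        raise ValueError("not a bugcheck line")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def driver_base(param2, module_offset):
    """Param 2 minus the reported module offset gives the driver's load address."""
    return param2 - module_offset


line = ("Bugcheck code: 0x116 (0xFFFFFA800AA18010, 0xFFFFF8800530B6BC, "
        "0xFFFFFFFFC000009A, 0x4)")
code, params = parse_bugcheck(line)
status = params[2] & 0xFFFFFFFF  # drop the sign extension to get the 32-bit NTSTATUS

for meaning, value in zip(PARAM_MEANINGS, params):
    print(f"{value:#018x}  {meaning}")
print("status:", KNOWN_NTSTATUS.get(status, hex(status)))
# nvlddmkm+0x8F06BC is the offset WhoCrashed reported for the first dump
print("nvlddmkm base:", hex(driver_base(params[1], 0x8F06BC)))
```

The subtraction lands on a page-aligned address, which is consistent with parameter 2 pointing inside nvlddmkm.sys as WhoCrashed says. That only identifies where the timeout was detected; it does not by itself say whether the driver, the card, or something else is at fault.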

What I've Already Done:

1) System restore back to a happier time = Didn't work. Not so happy.

2) GPU stress test (OCCT) (run 2x because why not?) = No Errors Found

3) Hard disk check = No Errors Found

4) Driver Clean & Change - Here I uninstalled drivers, cleaned the registry with Driver Sweeper, installed the new/old driver and tested the games again. I should note that I did NOT wipe the chipset drivers during this process since I was under the impression that would be a "bad thing." Versions tried: 305.27, 310.90, 320.18, 320.49, 326.80. All of them failed miserably to fix things.

5) Windows mem test = No Errors Found

6) Windows file check (sfc /scannow) = No Errors Found

7) Anti-Virus (super long scan) = No Errors Found

8) CPU Linpack Stress Test (OCCT) = No Errors Found

9) Steam uninstall/reinstall = Didn't Fix Anything.

10) DXDiag = No Errors Found, Current Version Installed

11) Update Windows = Didn't Fix Anything.

12) Power Settings Set to High Performance = Didn't Fix Anything.

13) RAM/CPU Stress Test (Prime 95) = No Errors Found.

14) Physical Re-Seating of the Card (Spoke soothingly to it) = Didn't Fix Anything

15) Video Memory Test (Video Memory stress Test v1.7) = No Errors Found

16) Power Supply Voltage Check = No Errors Found

17) Furmark Benchmark test:
2734 points - 45 FPS avg, 44 FPS min, 49 FPS max - Max GPU temp 87 C - Resting temp 40 C - Resting fan 30% - Resolution 1280 x 720.

18) Furmark Burn In: (I've searched all over my computer & can't find the screenshot BUT here's the info I wrote down... sorry):
Temp started to level off at 1:50 around 90 C and slowly rose to 93 C after 8 minutes. At 8 minutes I stopped the test but I think temps would prob keep rising. This is the only thing that seems out of order to me.

Concluding thoughts:
I haven't tried the registry edit to basically kill off the TDR process and I haven't nuked Windows from orbit while sticking pins into a Steve Ballmer doll (Available at the MSFT HR office) at midnight before reinstalling the OS, but those are the only other things I can think of. There might be other drivers to try I suppose but after trying 5 of them I gave up.

I lack experience with GPU hardware so I'm not really sure if those temp results are way out of whack or aren't really worrisome. I'm pretty much totally out of ideas though so I would be extremely grateful for any help you can spare.

Look, if we don't stop this error I won't be able to play games... that might even force me to... go outside. Can you imagine a worse fate? The sun is out there. Water just falls from the sky. And all that fresh air can't possibly REALLY be healthy for you right? 😉 So again, any help you can give me would be awesome.

 


Thanks for the suggestion! In step number 4 I tried both 305.27 and 310.90 (and more) as drivers and neither worked. I can try to find and use the 314.18 though and see what happens.
 
i think it's 314.18, though it might be 314.22... can't remember, it's been a while. whichever one is the whql certified driver. give that a shot, see if the problem is fixed. we need to make sure it's not a driver issue first.

just checked.. it's 314.22, my bad.
 


Oh no worries, I figured that's the one you meant, I'm downloading it now and we'll see if it works. In the meantime, do the temps I'm getting in benchmark tests/idle on the GPU seem ok to you? Seems to be climbing into the 90 c range with heavy use on the Furmark testing in step 18, which is where I stopped the test. But I don't know if that's bad, or good.
 


Well, it was a very good try but it failed just like all the drivers- in about 2 minutes as opposed to 4 actually. I think that actually means I've tested every available driver for the card so we can probably rule those out? I'm still curious about what everyone's opinions of the card temps are.

This was the crash dump (the only thing different from the others is the module being "unknown" as opposed to nvlddmkm.sys):
crash dump file: C:\Windows\memory.dmp
This was probably caused by the following module: Unknown (0xFFFFFA80072E74E0)
Bugcheck code: 0x116 (0xFFFFFA80072E74E0, 0xFFFFF880061906AC, 0xFFFFFFFFC000009A, 0x4)
Error: VIDEO_TDR_ERROR
 


Oh sorry, I should have said above that I'm not overclocking at all. My bad for not including that. But no, no overclocking. Processor temps (note these are in Fahrenheit) range from a low of 93 F to a high of 122 F. The monitoring software is telling me 221 F is the redline. Make of that what you will.
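
Since every GPU figure in this thread is in Celsius, a quick conversion makes the CPU numbers comparable. This is only the standard formula applied to the three readings quoted above; the function name is arbitrary.

```python
def fahrenheit_to_celsius(temp_f):
    """Standard conversion: subtract 32, then scale by 5/9."""
    return (temp_f - 32) * 5 / 9


# The three CPU readings quoted in the post above
for label, temp_f in [("low", 93), ("high", 122), ("redline", 221)]:
    print(f"{label}: {temp_f} F = {fahrenheit_to_celsius(temp_f):.1f} C")
```

That works out to roughly 34 C at the low end, 50 C at the high end, and a 105 C redline, so the CPU is running far below its limit.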
 
Very odd.
Just a few ideas:
Try updating DirectX to the current runtime, a lot of games install DX components, perhaps-despite the Dxdiag report-the install has been corrupted.
Check and verify the Steam cache files.
Run a malware scan (you may have forgotten to list that).
 


Great suggestions! I forgot to list Malware scans along with the anti-virus but both came up clean.

Possibly a dumb question, but how do you remove/alter a DirectX 11 installation? Microsoft's position, at least on their support site, is that any problem with DirectX 11 is really a problem with your display drivers, or the program you're running and never, ever, ever DirectX 11. They go on to say that it's part of Windows 7 itself and can't be removed. Obviously I tried anyway, but the only version I could find was previous to mine so it declined the honor of installing itself.

EDIT: Oh and I checked the Steam cache and it was clean too. Thanks for that- I didn't even know they had a verify cache option.
 
You're quite correct, DX can't be uninstalled, only updated although I think there is a repair option but I can't be sure-anyone else out there care to comment?
Either you didn't say or I missed it, but try uninstalling SR4, restart then run Ccleaner (include the registry sweep option) and manually delete any folders related to SR4 before a full, 'power down and wait a bit' restart.
Only other thing I can suggest is you contact Steam technical, like you I find it very strange that the problem seems to be limited to Steam games so perhaps the issue is at their end rather than yours.
 
Solution
Here's an update- looking at the Furmark benchmarks again it looks like this card is heating up really really fast. After 8 minutes the card temp is at 93C and climbing. That's interesting because the games crash and the fans shift to 100% at around 10 minutes of play, so they crash right about when the card would reach 100C. I updated my card's fan-curve and the last benchmark reached a max temp of 82C.

Question: If the card reaches 100C can that be causing a TDR? Basically is the card shutting down and then Windows is barfing out a TDR because suddenly it can't talk to the card anymore?

And bonus question: Since this just started happening after almost a year of stable use, is that a result of a bad GPU that needs to be RMA'd back to EVGA? The airflow in the case is great, I've got five fans in there, the PSU is 750Watts and is delivering good voltage, so I'm wondering if this was just a bad card that broke. I'm not sure a revised fan profile is going to keep it ok- I'm thinking really it's just going to make it less-bad.

Thoughts?
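
A rough way to sanity-check the "it hits 100C at around ten minutes" idea is to extrapolate from the two Furmark burn-in readings in the first post (about 90 C at 1:50 and 93 C at 8:00). The sketch below assumes the temperature keeps rising in a straight line, which is my simplification: a real card's curve usually flattens, and a game loads the card differently than Furmark does.

```python
def minutes_until(target_c, t1_s, temp1_c, t2_s, temp2_c):
    """Linear extrapolation from two (time in seconds, temp in C) readings."""
    rate = (temp2_c - temp1_c) / (t2_s - t1_s)  # degrees C per second
    if rate <= 0:
        raise ValueError("temperature is not rising between the two readings")
    return (t2_s + (target_c - temp2_c) / rate) / 60


# 90 C at 1:50 (110 s) and 93 C at 8:00 (480 s), from the burn-in notes
eta = minutes_until(100, 110, 90, 480, 93)
print(f"100 C reached after about {eta:.1f} minutes")
```

Under that straight-line assumption the card would need a bit over 22 minutes of Furmark to reach 100 C, so these two readings on their own don't line up neatly with a crash at the ten minute mark.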
 
Hmmmm...
Good point on the temps, when was the last time the card heatsink was blown clean?
Is it working properly with the higher fan speed and lower temperatures?
Q1: Yes, it's possible if the card gets too hot, and they get more sensitive to high temperatures as they get older.
Keep trying, mate, I'd hate to think of you outside in all that horrible sunshine, possibly even (gulp) doing healthy physical things 😉.
 


AHA HA ok this is beginning to drive me mad. Totally bonkers. So it was a great theory, but to my inexperienced eyes the data isn't exactly cooperating here.

I adjusted the fan curves, (used compressed air to clean the card last night) and tried the game again. This time I got perhaps 15 minutes out of it before it TDR hard crashed. AHA but I was running logs on the card performance and y'know what they found?

Card Data:
Right before the crash my card was reading:
GPU Power: 120.000 Clock: 953.709 Memory Clock: 3004.679 GPU Temp: 89.000 C GPU Usage: 90.000 %... Memory Usage: 651.113 GPU Voltage: 1.025 Fan Speed: 74%

During the crash the card reported:
GPU Power: 107.000 Clock: 1110.483 Memory Clock: 3004.679 GPU Temp: 88C GPU Usage: 99.000% ... Memory Usage: 651.113 GPU Voltage: 1.125 Fan Speed: 74%
TL;DR The Fan speed at crash was a non-fatal 74% and the temp at crash was 88C... A far far cry from 100C. SO... Um... that would seem to discount the "card is overheating" theory yeah?

I'm going to follow coozie7's suggestion and basically nuke Steam from orbit- but due to my internet connection that process is gonna literally take about a day and half to finish (re-installing the game etc). And dammit, I'm prob going to end up being forced *outside* during that time... So it's critical indeed. 😉

But wow am I confused right now- if anyone else can see anything amiss in that log data feel free to beat me over the head with it.
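
To make the two log lines above easier to compare, here is a small Python sketch that parses them and lists only the fields that changed. The field names are copied from the log text; the parsing patterns and function names are my own and assume the log keeps this exact "Name: value" layout.

```python
import re

NUMBER = r"\s*(\d+(?:\.\d+)?)"

# One pattern per field; "Clock" must not match inside "Memory Clock".
FIELD_PATTERNS = {
    "GPU Power": r"GPU Power:" + NUMBER,
    "Clock": r"(?<!Memory )Clock:" + NUMBER,
    "Memory Clock": r"Memory Clock:" + NUMBER,
    "GPU Temp": r"GPU Temp:" + NUMBER,
    "GPU Usage": r"GPU Usage:" + NUMBER,
    "Memory Usage": r"Memory Usage:" + NUMBER,
    "GPU Voltage": r"GPU Voltage:" + NUMBER,
    "Fan Speed": r"Fan Speed:" + NUMBER,
}


def parse_reading(line):
    """Turn one log line into a {field name: float} dictionary."""
    reading = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, line)
        if match:
            reading[name] = float(match.group(1))
    return reading


def changed_fields(before, during):
    """Return {field: (before, during)} for fields whose values differ."""
    return {
        name: (before[name], during[name])
        for name in before
        if name in during and before[name] != during[name]
    }


before = parse_reading(
    "GPU Power: 120.000 Clock: 953.709 Memory Clock: 3004.679 GPU Temp: 89.000 C "
    "GPU Usage: 90.000 %... Memory Usage: 651.113 GPU Voltage: 1.025 Fan Speed: 74%"
)
during = parse_reading(
    "GPU Power: 107.000 Clock: 1110.483 Memory Clock: 3004.679 GPU Temp: 88C "
    "GPU Usage: 99.000% ... Memory Usage: 651.113 GPU Voltage: 1.125 Fan Speed: 74%"
)

for name, (old, new) in changed_fields(before, during).items():
    print(f"{name}: {old} -> {new}")
```

Memory clock, memory usage, and fan speed are identical in both readings. What moves is the core clock, voltage, and usage going up while temperature and power tick down slightly. Whether that clock and voltage step has anything to do with the crash is not something these two samples can settle.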


 
Yet another update:
1) I took a look at the log files for the card at the moment it has been failing and then used Furmark to stress the card up to and then WAY past those values- the computer remained stable. Am I right in thinking that pretty much eliminates the card itself from the causes?

2) I nuked steam, cleaned the registry, re-downloaded one of the games giving me problems, and re-verified the steam cache- but the thing still crashed after only 3 minutes of play time.

3) I reinstalled DirectX11- I'm a little suspicious of this install though since it doesn't allow me to totally remove DirectX11 first. But still. Didn't do anything for me, games still hang and crash.

This problem is getting old. Very very old.
 
Come round to my place, I've a nice strong wall, we can both bang our heads against it 🙁.
Have to admit I'm stumped as well, the only things you've not done are:
Nuke the entire HDD and start from scratch in the hope THAT will teach the little devil a lesson.
RMA the card.
Maybe someone else has more ideas, but I'm dry, mate...Sorry.
 
ok. lets eliminate temps once and for all.

pop the side of your case off and stick a room fan into the opening, make sure it's on high. lets see if you get that error again, and what happens to your temps. I still think it's a gpu temp issue. around 90C is where gpus start to fail.
 
Any more luck on this? I started having TDR issues recently as well with my video card (about 3 months ago probably) and my rig ran fine for quite some time. My system is an Intel i7-3770k 3.5 GHz, 8GB RAM, Win7 64bit, Radeon 7870 and the Asus P8Z77-V LK. I went and set my TDR timeout to 8 seconds in the registry, and that eliminated most of the pauses, but it's still happening from time to time. I have noticed one thing -- they will tend to chain together when they happen.

The only major overlap we have is the same motherboard -- I'm wondering if there's possibly some BIOS setting that is causing this grief for us? Something like a power saver kicking in for the onboard card or something of the sort that's interfering.
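
For reference, the registry change mentioned above is normally made through the documented `TdrDelay` value (a REG_DWORD, in seconds, default 2) under the `GraphicsDrivers` key. The Python sketch below only builds the text of a `.reg` file for a chosen delay and does not touch the registry; the function name is hypothetical, and the 8 second figure is simply the one quoted in this post.

```python
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg_file(delay_seconds):
    """Build .reg file text that sets TdrDelay to the given number of seconds."""
    if not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay must fit in a positive 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{GRAPHICS_DRIVERS_KEY}]\n"
        f'"TdrDelay"=dword:{delay_seconds:08x}\n'
    )


# 8 seconds, as described in the post above
print(tdr_delay_reg_file(8))
```

A longer delay only gives the GPU more time to respond before Windows resets the driver. It can hide short stalls, which fits the "eliminated most of the pauses, but it's still happening" result, but it does not address whatever is causing the stall.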

 
Just tried the window fan fix and it crashed exactly like before on SR3 at exactly the same amount of time. Monitoring logs show the temps are consistent with the other tests (cooler than 90C after I ramped up my GPU fan profile).

DATA:
GPU temp reads 89C right before the crash and 86C at the crash itself. GPU usage spikes from 97% up to 100% and then sits at 99% when it crashes. Memory usage is constant at 593.852. GPU voltage stays pretty solid at around 1.050, spiking to 1.150 and then 1.075 at time of crash. Fan speed stays constant at 74%. And that's about all that's significant.

I've used Furmark to bring the system under similar and then higher levels of stress for 15 minutes and the system remains stable.

This might be pretty important though: SR3 run under DX9 lasts about 10 minutes longer before crashing and both DX11 and 9 modes produce graphics that are of lower quality than they should be. No artifacts as such- no blocks or massive lines or anything- just a lower level of resolution than this card used to put out. Also, ARMA3 (in single player of course) does not experience TDR hard crashes in that time frame, but it DOES experience graphics degradation of the type I just described and is really really laggy. There is a massive difference between input on the controls and what, say, your vehicle does on screen (crash into a tree in this case).

So here's my current thought/question: Could this still be a graphics card issue- but just not be the type/dramatic enough to be caught by DX tools I've tried? Something wonky with the chip etc? Obviously, I've run diagnostics on pretty much every bit of software and hardware- all have passed- which means SOMEONE is lying here.

If that's what's most likely wrong- how good is EVGA about RMA'ing cards on warranty? Especially over an error that might not be instantly observable when plugged into a DX computer somewhere?

My only other thought is that maybe- since this nightmare zombie TDR began only after downloading Saints Row IV- this is a corrupted DirectX 11 install and for some reason the Microsoft re-install under Windows 7 doesn't overwrite everything it should. But would that be capable of causing the performance loss/lags on ARMA 3?

@Glucose Man, I'm trying. But like I said, I'm coming up empty on explanations here. I'm at the point where the only thing I haven't tried is nuking the drive and reinstalling Win 7. My biggest fears are a) that doesn't fix it. and related to that b) that this is something corrupted in Win 7 somewhere that may or may not be fixable. I'm still working on it though, I didn't try the registry fix.. I guess that's something. But I'm a little worried that the TDR itself isn't the core problem, as evidenced perhaps by the loss of fidelity in the graphics. I've NO idea.

@coozie7 lol I might take you up on that. This is getting really really frustrating.
 
well your temps are still silly high, which means either your gpu cooler isn't working right, or it's reporting the temps wrong. either way your temps are too high. Obviously it's not a case airflow issue though, so fixing it will be harder. I'm not familiar with Arma3's hardware issues. is anyone else playing arma 3 having the same troubles? cause if they are that's a pretty clear indication that there is a driver problem at fault.

Setting that aside. at this point i'd try a RMA with evga. describe the problem see what their customer service suggests. see if they'll do an RMA. It might just be a bad card.

 
@ingtar33 Thanks for that info- I've no idea what a normal temp range for that card is, other than 100C being really really bad, so that information is very helpful.

At this point I'm going to contact EVGA about the RMA process, hopefully they'll be pretty cool about it since I really don't want to buy another card this soon. I'm also upgrading my hard drive, I've been meaning to get something with more space for a while now, and that will involve a fairly clean install of Windows 7. So hopefully, between the card being replaced and the new Windows install, this problem will take a hint and go away. I hope. This has been harder to kill than any virus I've ever dealt with, even the ones back in the 90's that you HAD to hunt through the registry to kill off for sure, were better than this.

@Glucose Not odd at all! I'm just using the sound off my MOBO if that clears anything up.
 
@Ballistic Hmm ok, I am using a PCI-E sound card, and was wondering if that was another common overlap between our systems.

The only thing that I find weird is that I also have run FurMark and blasted the GPU up to hot temps and have never made it TDR. In my case it appears to be MechWarrior Online that is causing my issues.

I have a friend who worked on drivers at AMD, and he said that these TDRs are usually the result of a long display call being sent to the card and timing out for whatever reason, and that the errors can typically be memory related. I'm trying to think what these games are doing that a benchmark is not (hence the sound card question).

 


In my case, aside from persistent TDRs in SR3 (&4 but I uninstalled that one very quickly) I have laggy, reduced graphics in Arma 3 and reduced graphics in Tropico 4. Have you noticed anything off in any other games or videos/web browsing? In my case the graphics are just... crappier. No dramatic lines or anything, they're just at a lower resolution.

I don't know the precise faulting, but I'm thinking it might be an error severe enough to TDR under SR3 & 4 but not under less demanding games, so it could be a hardware fault that Furmark wouldn't detect- that's currently my theory. I've had lots of theories so far and... uh... well, none of them have been right. But hey, it's gotta happen sometime yeah?

I did blast my CPU and RAM with a several hour long Prime 95 stress test and nothing went wrong, but I'm not sure if that tested the card memory as well or not. I've been told that Prime 95 will usually stress CPU/RAM sufficiently to find errors within a few hours unlike Windows internal memory diagnostic. Hopefully.