Question Hardware issue disguised as a software issue?

Skiff

Honorable
Jan 18, 2014
64
0
10,640
I'm not sure if this is the right place to post this, since I don't know exactly what the problem is, only that it is being reported as an Nvidia problem. Since mid-March I have been having an issue where 9 out of 10 boots are unstable. By unstable I mean that the sound is completely garbled, similar to what happens when you are listening to music and the system crashes, leaving the audio stuck on the current note. The other symptom is that the monitor randomly loses signal. There is no pattern to what causes it: switching tabs in Chrome, opening a program, playing a game, literally anything. Originally the audio was not garbled, just completely absent in all games, but that no longer happens and it has been replaced by the garbled audio.

The problems gradually got worse, leading to system crashes and BSODs, then freezes during the motherboard splash, and at one point I would not get any video output until the Windows login screen. It's not 100% constant; like I said, I can get a stable boot, but it takes a while. As an example, I may switch on the PC and it will freeze at the motherboard splash. I hit reset, do this a couple of times, get the Windows repair screen, reboot, and then I may get to Windows. Once in Windows I play some music or watch a video and check whether the audio garbles or the video signal cuts out (if it is going to happen, it happens almost immediately). If it does not happen, the system will be fine with no issues until I shut it down; if it does, I rinse and repeat until it is stable or give up.

The BSOD Error I keep getting is:

On Thu 02/04/2020 13:24:00 your computer crashed or a problem was reported
crash dump file: C:\WINDOWS\Minidump\040220-10359-01.dmp
This was probably caused by the following module: nvlddmkm.sys (0xFFFFF802651D67F4)
Bugcheck code: 0x116 (0xFFFF8082C1A97010, 0xFFFFF802651D67F4, 0xFFFFFFFFC000009A, 0x4)
Error: VIDEO_TDR_ERROR
file path: C:\WINDOWS\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_e0a5a1b06de180e3\nvlddmkm.sys
product: NVIDIA Windows Kernel Mode Driver, Version 445.75
company: NVIDIA Corporation
description: NVIDIA Windows Kernel Mode Driver, Version 445.75
Bug check description: This indicates that an attempt to reset the display driver and recover from a timeout failed.
A third party driver was identified as the probable root cause of this system error. It is suggested you look for an update for the following driver: nvlddmkm.sys (NVIDIA Windows Kernel Mode Driver, Version 445.75 , NVIDIA Corporation).
Google query: nvlddmkm.sys NVIDIA Corporation VIDEO_TDR_ERROR
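
For anyone reading the report above, the bugcheck line can be unpacked a little further. The sketch below is not part of the original post. It assumes Microsoft's documented layout for bug check 0x116 (which Microsoft names VIDEO_TDR_FAILURE; the VIDEO_TDR_ERROR label appears to come from the analysis tool), where parameter 2 points into the responsible driver module and parameter 3 is the error code of the last failed operation. The function names and the two small lookup tables are made up for illustration and only cover the values seen here.

```python
import re

# Bugcheck line copied from the crash report in this post.
REPORT_LINE = ("Bugcheck code: 0x116 (0xFFFF8082C1A97010, 0xFFFFF802651D67F4, "
               "0xFFFFFFFFC000009A, 0x4)")

# Illustrative lookup tables: only the entries needed here, not complete lists.
BUGCHECK_NAMES = {0x116: "VIDEO_TDR_FAILURE"}
NTSTATUS_NAMES = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}


def parse_bugcheck(line):
    """Return (code, [parameters]) as integers from a 'Bugcheck code:' line."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9A-Fa-f]+", line)]
    return values[0], values[1:]


def describe(line):
    """Summarise the fields that matter for a 0x116 bug check."""
    code, params = parse_bugcheck(line)
    # Parameter 3 is sign-extended to 64 bits; the NTSTATUS is the low 32 bits.
    status = params[2] & 0xFFFFFFFF
    return {
        "bugcheck": BUGCHECK_NAMES.get(code, hex(code)),
        "driver_address": hex(params[1]),
        "ntstatus": NTSTATUS_NAMES.get(status, hex(status)),
    }


if __name__ == "__main__":
    print(describe(REPORT_LINE))
```

The driver address matches the nvlddmkm.sys address in the report. That only shows where the timeout was detected; it does not by itself say whether the driver, the card, or something else in the path caused it.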

I will list below everything I can think of that I have tried so far:

I have reseated the graphics card.
I have used DDU to uninstall the drivers in safe mode, disconnected from the internet and reinstalled the latest drivers.
I have run FurMark for an hour with no apparent issues.
I have tried an old driver.
I have completely formatted my HDD and reinstalled Windows, in case a corrupted Windows file was causing a conflict, and completely reinstalled the drivers.
I have also tried a GTX 970 card in the system and quickly got the same issues.

The fact that I tried a different card with different drivers and a clean install of Windows makes me think that, despite the BSOD error pointing to the Nvidia drivers, maybe they aren't the cause. How can they be, if I have installed Windows cleanly and tried multiple clean driver installs? This all leaves me thinking that if this were a software issue, then after all the changes I have tried, more people would be experiencing it, but I can't find any.

I have been in contact with Nvidia several times regarding it, with no success. Sometimes something I do makes the system stable for a few days, but then it just starts again. I'm starting to wonder if it could be a motherboard or PSU problem that is causing the system to think there is a problem with the GPU, but I have no idea how I would diagnose this.

My system is as follows:

GPU: ROG Strix GTX 1080Ti
CPU: i7 8700k
PSU: EVGA 850 G2 80+ Gold
MOB: ROG Maximus X Hero
RAM: G.SKILL Trident Z RGB 2 x 8GB DDR4 4000
 

Ralston18

Titan
Moderator
Look in Reliability History and Event Viewer for related error codes and warnings.

Reliability History is much more user friendly, so start there. You can right-click any given entry for more information/details.

Manually reinstall the GPU drivers via the manufacturer's (Nvidia) website. No third party tools or utilities.

Download directly, reinstall, and reconfigure for your system. Start with a basic stable, working configuration and proceed step by step towards the working configuration you wish to have.

Idea being to find some threshold configuration where the problem returns.
 

Nemesia

Are you running the RAM at 4000MHz? RAM instability can cause a lot of issues. Your CPU is only guaranteed to work with RAM up to 2666MHz, so I would test the RAM at a lower speed to see if the issues persist. You would be surprised how many people with RAM instability I see on this forum in a week.
 

Skiff

Never seen reliability history before, I have checked it and it seems to correlate with the times my system was unstable (since my fresh install anyway), and there seem to be a lot of critical events, mostly listed as Hardware errors. I have looked through them but I have no idea what I should be looking for, they all seem to be Live Kernel Events but again, I have no idea what that means.

I haven't changed my RAM speeds, so I assume they are running at 4000MHz. I checked CPU-Z, which gives a DRAM Frequency of 1071.4MHz, which I'm supposed to double because it's DDR, right? But even that seems low?
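
On the doubling question: CPU-Z shows the actual memory clock, and DDR memory transfers data twice per clock, so the effective rate is twice the reported figure. A minimal sketch of the arithmetic, using the 1071.4MHz reading and the 4000 rating from this thread (the function names are illustrative only):

```python
def effective_rate(dram_clock_mhz):
    """DDR transfers twice per clock, so the effective rate is double the clock."""
    return 2 * dram_clock_mhz


def expected_clock(rated_speed):
    """Clock CPU-Z would be expected to show for a given rated DDR speed."""
    return rated_speed / 2


if __name__ == "__main__":
    reported_clock = 1071.4  # DRAM Frequency shown by CPU-Z in this thread
    rated_speed = 4000       # rated speed of the kit (DDR4 4000)

    print(f"Effective rate: {effective_rate(reported_clock):.1f}")
    print(f"Clock expected at rated speed: {expected_clock(rated_speed):.1f}")
```

Doubling 1071.4 gives roughly 2143, far below 4000, which would suggest the kit is running at a default speed rather than its rated profile. That is an inference from the numbers, not something confirmed in the thread.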
 

Skiff

So I had 6 days with no issues at all: no freezes, no crashes, no loss of video input and no audio scrambling. Then, out of nowhere, it all started again yesterday. I dropped the RAM down to 2333MHz and the problems still happened, so I went lower to 1500, still with problems, and then as low as I can go, to 800MHz. Still the problems persist.

Every log in the reliability monitor says it's a hardware error and every minidump says it's an Nvidia driver issue, but I know it's not the GPU because when I tried a different GPU I got the exact same error. Is there any way to narrow this down? Something that may or may not be related is that occasionally when booting up I get a long beep followed by 3 short beeps, which I know means a hardware failure. When I say occasionally, I mean once every 2 months or so, and the last time it happened was a few weeks back.

Edit: Just a thought, but as everything points towards the GPU, and I have tried another GPU and got the same error, could it be that the PCIe slot on the motherboard has a loose connection or something?
 

Skiff

Start by rolling back on the RAM speeds as suggested by @Nemesia.

That is something you can directly and immediately do. If the system stabilizes then work upwards in incremental steps.

Determine if there is some threshold value that causes a change from stable to unstable.

So after changing the RAM speeds and getting no improvement, I disabled the GPU and ran everything off the onboard graphics, and everything runs perfectly. I tried a few reboots, playing music while opening applications, watching videos and playing some low-requirement games, and I have had no issues at all. Could this be a problem with the slot in the motherboard? I know it's not the GPU itself because when I put a different card in I got the same problems.
 

aakarshan

Distinguished
Nov 29, 2013
542
3
19,015
There could be a problem with your PSU. It is possible that it is not able to handle the GPU.
 

Skiff

So I switched in the PSU from my old system. The computer ran fine until I enabled the GPU. I then got the message "Windows has stopped this device" with code 43 in the device status window. I deactivated the GPU and did a clean reinstall of the drivers; as soon as I activated the GPU, the screen lost input for a few seconds and the sound became garbled. I don't think the problem was the PSU.


Check the motherboard's User Guide/Manual or manufacturer's website to decipher the beep codes (1 long, 3 short).

1 long beep followed by 3 short beeps means no GPU detected. As I have tested the drivers and tried another card and PSU and got the same problems, is it safe to assume that the only thing that remains is a problem with the board itself?
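
The elimination reasoning so far can be written down as a small table. This is only a sketch: the suspect labels are shorthand invented here, and a fault that persists after a swap makes that part unlikely rather than impossible.

```python
# Swap tests reported in this thread: (what was changed, suspect it clears if
# the fault persists, whether the fault persisted afterwards).
SWAP_TESTS = [
    ("clean Windows install and driver reinstall", "software/drivers", True),
    ("different graphics card (GTX 970)", "graphics card", True),
    ("lower RAM speeds", "RAM speed", True),
    ("PSU from another system", "PSU", True),
]

# Suspect labels are shorthand for this sketch, not output from any tool.
SUSPECTS = ["software/drivers", "graphics card", "RAM speed", "PSU",
            "motherboard/PCIe slot"]


def remaining_suspects(suspects, tests):
    """Drop every suspect whose replacement did not make the fault go away."""
    cleared = {suspect for _, suspect, persisted in tests if persisted}
    return [s for s in suspects if s not in cleared]


if __name__ == "__main__":
    print(remaining_suspects(SUSPECTS, SWAP_TESTS))
```

The run on onboard graphics with the GPU disabled fits the same picture, since it avoids the PCIe slot entirely.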
 

Ralston18

Titan
Moderator
It does appear that the symptoms and troubleshooting are narrowing down to the motherboard.

And if other known working GPUs fail (Post #6), then the motherboard/slot is the likely suspect.

My suggestions:

1) Do you have a multimeter and know how to use it? Or do you have a family member or friend who does? Test the PSU output voltages per the following link:

https://www.lifewire.com/how-to-manually-test-a-power-supply-with-a-multimeter-2626158

The PSU is not under load during this test, so it is not necessarily a final diagnosis. However, any voltages out of spec may be revealing.

2) Make the additional effort of one more round of the preceding troubleshooting tests. It should not take much time to do and will verify the previous results. Sort of a "second opinion" round. Doing so will confirm the results to date or, if something changes, then you may find other options to pursue.
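
For the multimeter check in step 1, readings are usually compared against the ATX tolerance bands. A small sketch, assuming the commonly cited tolerances of ±5% for the +3.3V, +5V and +12V rails and ±10% for -12V; the sample readings are hypothetical, not measurements from this system:

```python
# Nominal voltage and allowed fractional deviation per rail (assumed values,
# based on the commonly cited ATX tolerances).
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
}


def in_spec(rail, measured):
    """True if the measured voltage is within the tolerance band for the rail."""
    nominal, tolerance = ATX_RAILS[rail]
    return abs(measured - nominal) <= abs(nominal) * tolerance


def check(readings):
    """Map each measured rail to True (in spec) or False (out of spec)."""
    return {rail: in_spec(rail, volts) for rail, volts in readings.items()}


if __name__ == "__main__":
    # Hypothetical multimeter readings, for illustration only.
    sample = {"+3.3V": 3.31, "+5V": 5.08, "+12V": 11.30}
    for rail, ok in check(sample).items():
        print(rail, "in spec" if ok else "OUT OF SPEC")
```

As noted in step 1, an unloaded PSU can pass this check and still sag under load, so a pass here does not clear the PSU completely.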
 

Skiff

I have now resolved this; it was indeed the motherboard. To test it, I moved the GPU into the second PCIe slot. Once I had done this and rebooted, everything was perfect, with no trace of any errors at all, so I must have damaged the slot during my move.

I decided to upgrade the motherboard because it did not feel practical to use the GPU in that bottom slot, and since the upgrade none of the previous issues have occurred.