Question: What are the odds of receiving two faulty cards in a row?

nasch007

Distinguished
May 30, 2016
44
0
18,530
This is a long story, but you need the history and I need your great minds to collaborate with me here. Please bear with me. I'm posting in the graphics card section because that was the issue beforehand, and like the title says, what are the odds?

I have a custom build that runs excellently - when it runs. UserBenchmarks at 98% and 97%...

I'm seeking your help because I'm having some deja vu issues and I don't know if it's me. Am I going crazy, do I have the worst luck in the world, did we replace the wrong parts? Or is it something else?

To top it off, I think I am up against a deadline, timewise (June 30th).

Here is my build:
Case- Phanteks Enthoo Pro M SE
PS- Thermaltake Smart Pro RGB 750W Zero Fan
MOBO- MSI X99 Gaming Pro Carbon LGA 2011-3 (refurb)
RAM- Corsair Vengeance LPX 32GB (8x4GB) Quad-Channel DDR4 2400MHz C14
CPU- Intel Xeon E5-1620 v4 Broadwell-EP Quad Core CPU @ 3.5GHz
Cooler- Enermax Liqmax II 240 "Front/Pull" config
NVMe- MyDigitalSSD 240GB 80mm BPX Pro m.2 PCIE
GPU- MSI GeForce GTX 1070 Armor OC 8GB
SSDs- 1x240GB Patriot, 1x240GB HyperX Cloud 9, 1x120GB Kingston SSDNow! v300
Storage - WD Passport USB 3.0

So I set everything up and converted the internal drives to GPT. I clean installed Windows 10 and I believe everything is in GPT/UEFI mode. I installed all of MSI's (somewhat old) drivers, configured BIOS settings, enabled the XMP profile, got all my software running, and things were great.

Then things started going downhill.

It would, rarely, not POST.
Sometimes it would POST, but not boot (freeze during the "circling circles" loading screen).
Often it would boot, but crash with a white or pink screen after a random amount of time, regardless of activity: gaming, YouTube, Microsoft Word, etc.

Eventually it became predictable: every morning I would boot it up, wait for it to get to the desktop, watch it crash, then hard power off (just tap the power button once) and boot it up again.

Oddly, it would run perfectly fine after that initial crash... but that's no way to live... not after $1100 of blood, sweat, and tears.

The Pagefile was enabled, size was large enough, but crash dumps weren't being written out.

Event Viewer showed lots of critical events in the past hour, then Kernel-Power "the computer rebooted but was not shut down cleanly" or something to that effect. The only other events close to that were Event ID 14 from nvlddmkm... I read a lot about this and tried some troubleshooting. I do use Steam Link hardware to stream my PC from the bedroom to the living room. Crashes seem to happen regardless of whether I'm streaming or using the computer "locally".
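One way to check whether the nvlddmkm errors really line up with the unclean reboots is to pair them by timestamp. This is only a sketch: the record layout (a list of dicts with `time`, `source`, and `id`) is a hypothetical stand-in for however the log gets exported, the sample rows are made up, and the 10-minute window is an arbitrary assumption. Kernel-Power ID 41 is the usual ID for the "rebooted without cleanly shutting down" entry, and it is logged on the next boot, which is why the window looks backwards from it.

```python
from datetime import datetime, timedelta

# Hypothetical record layout: one dict per exported event log row.
def gpu_errors_before_reboots(events, window=timedelta(minutes=10)):
    """Map each unclean-reboot entry to the nvlddmkm ID 14 entries logged shortly before it."""
    reboots = [e for e in events if e["source"] == "Kernel-Power" and e["id"] == 41]
    gpu_errors = [e for e in events if e["source"] == "nvlddmkm" and e["id"] == 14]
    return {
        r["time"]: [g["time"] for g in gpu_errors
                    if timedelta(0) <= r["time"] - g["time"] <= window]
        for r in reboots
    }

# Made-up sample rows, for illustration only.
sample = [
    {"time": datetime(2019, 6, 1, 8, 0, 5), "source": "nvlddmkm", "id": 14},
    {"time": datetime(2019, 6, 1, 8, 2, 30), "source": "Kernel-Power", "id": 41},
    {"time": datetime(2019, 6, 2, 21, 15, 0), "source": "Kernel-Power", "id": 41},
]
for reboot_time, errors in gpu_errors_before_reboots(sample).items():
    print(reboot_time, "preceded by", len(errors), "nvlddmkm event(s)")
```

Reboots with no GPU driver error logged before them would suggest looking beyond the video card, for example at power delivery or the motherboard.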

Desperate for answers, I reverted to some (age old) drivers. They definitely reduced performance in games and benchmarks by about 20%... but they seemed to relieve, at least temporarily, the crashing and the nvlddmkm error. Then the issue started back up again. I tried updating drivers; it only made the issue more prevalent. I tried updating the card's video BIOS; same issue. I tried checking and tweaking all sorts of settings, from power profiles to hardware encoding/decoding settings in browsers to modifying the TDR values... sfc, DISM, etc. I finally decided to RMA the card.

I sent it in to MSI and in its place popped in an old R7 370 4GB. While the card was much slower, (GTA V now playing at 15 fps hahaha), it worked with absolutely no issues. No reboots, no freezes, nothing.

So the GTX 1070 card was definitely bad, right? Or is it the software (drivers)?

I noticed, though, that with the R7 in, my NVMe drive was idling at temps around 70°C. It must be thermal throttling under load, I thought. So I tried to update the firmware, and it was unsuccessful. I contacted the company and they RMA'd it for me. They sent me back the Pro version, and it had a firmware update, so I went ahead and did that. Much better temperatures, so I have peace of mind with that.

By the time I was done with that, I had received a replacement GTX 1070 Armor OC 8GB (Grade B) from MSI. It was a refurb.

Now I'm taking no chances, right? A new NVMe, up to date firmware... I did a clean install of the latest Windows 10 the media creation tool would give me. I used SDI to install only the latest, best matching drivers, with no overlap or redundancy. Although I did let Windows 10 auto-detect and install the video card drivers (by mistake, I forgot to unplug the network cable).

So... history repeats itself...

Get all my settings tweaked, programs installed, backups created... life is grand! Like I said, benchmarked at 98% and 97%... but now things are happening again. The same sorts of things as before.

I tried to update the video card drivers, but they wouldn't install. I did some research and figured out that Windows Update had installed the DCH version of the drivers. No biggie, I will leave them on the DCH ones and set all my settings as before: maximum power, no power management for PCIe, fan curves, etc.



Now problems are starting to arise.

It always POSTs, and maybe once or twice it has not booted to the desktop. But now, once at the desktop, I noticed my mouse lags a little sometimes (4K @ 60Hz), and it looks like the video card is working harder than the previous one I had (shows 6 or 7% GPU use idling at desktop; the RMA'd card was 0-3%). Sometimes it looks like the CPU is in use, then it will clear up and not be an issue.

I can game in 4K, GTA V at a steady 45fps (up to 70-plus depending on what's happening). But now, like before, I will get a random reboot. I could be watching YouTube, or streaming content to the Steam Link (set up in my living room so we can watch TV shows, movies, etc.) and it just reboots. A crash dump isn't written out, and I see a couple of Event ID 14s in the event log (not as frequently as before, though).

I loaded BIOS optimized defaults and, fearing the video card again, downloaded Superposition https://benchmark.unigine.com/superposition to stress test it. I also downloaded MSI Afterburner and GPU-Z. I let Afterburner do its smart overclock and it created a nice curve, topping out around 2000MHz core clock. I then loaded up GPU-Z and ran Superposition. It ran through multiple passes successfully. The sensors in GPU-Z all seemed perfectly normal, except for PerfCap Reason. From what I understand, this is the thing that's holding you back, what's capping your performance. Mine said something like PwrRel or VolRel... the tooltip said something to the effect of "reliable power". I'm not sure what that means or if it is even significant.

So here I am... confused as to what to do next. That's why I'm asking you all. What do you think? I can do any troubleshooting you want and provide logs and screens and videos... any help or guidance would be appreciated.

I don't want a system that has to daily crash and reboot, in order to be "stable" especially after spending hard earned money on it. I don't want to keep RMAing random parts (having to pay to ship it is very annoying), and I don't know if there's something I could be missing... is my logic bad? The issues went away with an AMD card... they resumed on a clean install... is it NVIDIA's terrible drivers? This event ID 14 nvlddmkm issue has a long history... could MSI have given me two faulty cards in a row? Could it be my power supply isn't up to it? My motherboard is a refurb from MSI also, could that be it?
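For a rough sense of the odds in the title: if the two cards failed independently, the chance of getting two bad ones in a row is just the product of their individual failure chances. The sketch below uses made-up placeholder rates (they are not real figures for MSI or any other vendor), and the independence assumption is exactly the thing in question here.

```python
def two_in_a_row(p_first: float, p_second: float) -> float:
    """Probability that both cards are faulty, assuming the failures are independent."""
    return p_first * p_second

# Hypothetical failure rates for illustration only: (label, new card, refurb card).
scenarios = [("optimistic", 0.02, 0.05), ("pessimistic", 0.05, 0.15)]
for label, p_new, p_refurb in scenarios:
    print(f"{label}: {two_in_a_row(p_new, p_refurb):.4%}")
```

Under any plausible placeholder rates the product is small, which is one reason a shared cause (power supply, motherboard, drivers) is worth ruling out before blaming two cards.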

Please let me know what information you need and what steps to take. "Help me Obi-Wan Kenobi, you're my only hope..."
 

nasch007
Sounds like it's not the graphics card. The GTX 1070 might be placing more load on a possibly defective power supply than the R7 370. How are your processor temps during gaming?
I have the Enermax liquid cooler, so my CPU temps are very low. We're talking 25-26C at idle. I will look at processor temps when I benchmark or play tonight. EDIT: Benchmarked and my CPU temps were around 35C. Max was 43C. And I have all fan profiles at defaults from the BIOS.

I have a power supply from my last build, it's an old 530W... should I try and power just the motherboard/cpu and video card? I think the CPU's TDP is 140W and when I stress tested the video card, GPU-Z showed 200W usage. Do you think that will pan out?
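A back-of-the-envelope power budget for that swap, as a sketch only. The 140W CPU TDP and the 200W GPU reading come from the numbers above; the 50W allowance for the board, RAM, drives, and fans and the 80% derating of the old unit's label rating are assumptions, not measurements.

```python
def psu_headroom(psu_watts: float, loads: dict, derate: float = 0.8) -> float:
    """Return spare watts after derating the PSU's label rating (derate is an assumption)."""
    usable = psu_watts * derate
    return usable - sum(loads.values())

loads = {
    "cpu_tdp": 140.0,                 # from the post
    "gpu_peak": 200.0,                # GPU-Z reading from the post
    "board_ram_drives_fans": 50.0,    # assumed allowance, not measured
}
print(f"Headroom on the old 530W unit: {psu_headroom(530.0, loads):.0f} W")
```

On these assumptions the old unit has only a little headroom, and short GPU power spikes are not captured by an average reading, so a crash on the 530W would not by itself clear the 750W.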
 
A few things: MSI X99 Gaming Pro Carbon LGA 2011-3 (refurb) <--- refurb = used or repaired = possibly not stable.

Reusing an older and already average-quality power supply can also be an issue. This is from a Tom's review of the PSU: "We do, however, believe that much of this unit's production cost was devoted to bells and whistles rather than the best performance and reliability possible."

You also have a bunch of things in the system. You would want to test things a bit at a time, meaning one RAM stick, one boot drive, no USB drives connected.
 

nasch007
Regarding the power supply, I just want to make sure you understand that I bought the 750W brand new. It's new in the system; I am not reusing it. What I was saying is that I only have an older one to swap it out with, in order to test. Is your review quote about the Thermaltake Smart Pro RGB 750W? That is kind of disappointing if it is :/

I've purchased refurbed items before with no troubles. From what I understand, a refurb means there was an initial issue and the unit was sent back to the factory to be repaired. From what I've read, these repaired items then have to undergo additional testing and scrutiny above the initial QA all the other units get, to make sure they aren't re-selling an unfixed product. From the problems I described, and the fact that the first NVIDIA card was like 99.9% bad, and the AMD card worked just fine while it was in, do you really suspect it was the motherboard? Is there a specific way I can test?

I can try and RMA the motherboard, but I really want to be sure... because that will put me out of commission for quite a while.
 
I thought you re-used the old PSU you had in this system. The one you bought is OK, not a top tier model. Average for a gaming setup.

Really, the only way to see if the issue is the video card or not is to test the card in another system to see if it runs there. Refurb or used parts usually are good, but at the same time, they can still fail more often than new ones.

The R7 370 you had to test with should run GTA V pretty well; you should be getting at least 60fps on that, not 15.
 

nasch007
Yeah, your quote inspired me to read Tom's whole review for that power supply, albeit the 850W version. Pretty much they said it's OK, not the greatest, not the most efficient, but a 7 year warranty is legit considering the components used.

As far as the R7 goes, just so we're clear, I didn't optimize in-game settings. The 15fps is at 4K resolution, which is what I would expect tbh. I'm sure 1080p is playable.
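The pixel arithmetic behind that, as a rough sketch. It assumes "4K" means 3840x2160 and that frame rate scales inversely with pixel count, which is only a crude approximation since real games are rarely purely pixel-bound.

```python
def pixels(width: int, height: int) -> int:
    """Total pixels rendered per frame at a given resolution."""
    return width * height

def scaled_fps(fps: float, from_res: tuple, to_res: tuple) -> float:
    """Estimate fps at to_res from fps at from_res, assuming cost is proportional to pixel count."""
    return fps * pixels(*from_res) / pixels(*to_res)

UHD = (3840, 2160)   # assumed meaning of "4K"
FHD = (1920, 1080)

print(pixels(*UHD) / pixels(*FHD))   # how many times more pixels 4K pushes than 1080p
print(scaled_fps(15, UHD, FHD))      # crude 1080p estimate from 15fps at 4K
```

By this estimate, 15fps at 4K and 60fps at 1080p describe roughly the same card performance, so the two numbers in this exchange are not really in conflict.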
 

nasch007
I never marked a solution for this, I apologize. It's actually an ongoing thing. I worked with one of the guys on tenforums and isolated it down to Steam drivers using Driver Verifier within Windows. You can follow that here: https://www.tenforums.com/bsod-cras...ter-lockup-restarts-bad-video-card-again.html Unfortunately, I've discovered a BIOS/motherboard issue, so I think I have to handle that before proceeding with this and further developments. Perhaps I'll swap the power supply first, as I will have to break down the whole system soon anyway.

For anyone out there with a similar issue, stay tuned. And I'll try to mark this as resolved soon!
 

nasch007
So I tried to update my BIOS when another version was released... didn't work properly. Got MSI to send me an RMA for an out-of-warranty repair. Cost me 45 bucks. They flashed an old version of the BIOS on, and one was corrupt/stuck in a weird sort of boot loop. I complained and they sent me a prepaid box to ship it back to them. I did. Then they said they had flashed the BIOS but discovered an issue with the m.2 slot.

Oh brother.

They offered an X399 board to replace my X99 board... um. Fully incompatible?

They offered to refund my purchase price of the motherboard... but not the tax, shipping costs, or 45 bucks on the repair... finally they said they have absolutely no other X99 boards but they would fix my m.2 slot if I give them time.

So I'm waiting.

Tech Support thread with MSI has been open and somewhat entertaining, from 7/30/19 until now...

I will keep everyone posted so we might be able to mark this thread as having a best answer at some point!
 