Question RMA all possibly faulty components even if I am not sure which are really broken?

robert_de

Jun 8, 2020
Full context (see below for TLDR):
I built a custom PC about one and a half years ago, so all the components should still be under warranty (I can provide a full list of components if needed). The PC runs Windows 10 and I use it mostly for gaming and some programming. The CPU is an i7-8700, so there is no overclocking involved.
Yesterday I started getting bluescreens. The first one happened while I ran a build in Android Studio (i.e., a high CPU load situation). The subsequent ones also seemed to be mostly related to high CPU load. I can boot to the Windows login screen most of the time, but when I log in (after entering the password), it bluescreens about 50% of the time. If I am able to log in, the system runs fine if left idle, but doing something CPU-intensive also causes a bluescreen most of the time (e.g., I tried running Prime95, which resulted in an instant crash). Most of the bluescreens were not associated with any specific driver (I checked some of them with NirSoft BlueScreenView and WinDbg), except for one that was caused by netio.sys. Also, the bluescreens do not have one consistent stop code; it differs from crash to crash. Here's a list of the ones I can remember:
  • IRQL_NOT_LESS_OR_EQUAL
  • KMODE_EXCEPTION_NOT_HANDLED
  • UNEXPECTED_KERNEL_MODE_TRAP
  • ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY
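When searching dumps or documentation, it can help to have the numeric bug check codes next to the names. Here is a minimal Python sketch, assuming the standard Microsoft bug check values for the four stop codes listed above; the dictionary and helper names are made up for illustration:

```python
# Bug check names from the list above with their numeric codes
# (values as given in Microsoft's bug check code reference).
STOP_CODES = {
    "IRQL_NOT_LESS_OR_EQUAL": 0x0A,
    "KMODE_EXCEPTION_NOT_HANDLED": 0x1E,
    "UNEXPECTED_KERNEL_MODE_TRAP": 0x7F,
    "ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY": 0xFC,
}


def format_stop_code(name: str) -> str:
    """Return the name followed by its code in the 8-digit hex form WinDbg shows."""
    return f"{name} (0x{STOP_CODES[name]:08X})"


if __name__ == "__main__":
    for stop_name in STOP_CODES:
        print(format_stop_code(stop_name))
```

Four unrelated codes within a day is what the post means by a diffuse error pattern.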
As this is a very diffuse error pattern, I started suspecting a hardware issue, so I took out the graphics card and disconnected the secondary hard drive from the system. This left a pretty minimal setup of motherboard, CPU, system drive (an NVMe SSD), RAM (2 sticks), and PSU, as well as 2 case fans and the CPU fan.
With this minimal setup, the system was as unstable as before, bluescreening in the same situations as described above. To make sure that this is not a Windows or hard drive issue, I also booted into an Ubuntu live system from a USB stick, which was also unstable (i.e., the screen froze and did not recover) when the CPU was put under load. I also ran the Windows memory check once, which did not find any errors, so I believe a RAM issue can be ruled out as well.
The system is also not overly dusty, as I clean it regularly, and there is no visible damage to any of the components. I have also double-checked that all connectors are properly connected and even re-seated the CPU.
Based on the fact that the system is mostly unstable under high CPU load, I believe that there is some power delivery issue, which could be due to the motherboard or PSU, or that the CPU itself is broken (but I believe that to be more unlikely as it then probably wouldn't work at all).
Unfortunately, I do not have any spare components to run definitive tests and determine which component is really causing the issue, and I won't be able to get any quickly short of buying them new. Would the best option here be to just RMA the motherboard, CPU, and PSU all at the same time, even if I am not sure which one is really broken? Or is there a better way to go about fixing this?


TLDR:
I suspect that my CPU, motherboard, and/or PSU are faulty. I do not have any spare components to swap in and determine for sure which of them is faulty. Should I just RMA all three components, or is there a better way to go about this (e.g., buying new ones to swap in and sending them back once I have determined which one is broken)? I'm asking because I've read some horror stories about RMA processes and would like to have it fixed as soon as possible. Also, do you agree with my assessment, or is there anything left that I could try?

Note: a similar version of this question was also posted at https://superuser.com/questions/155...even-if-i-am-not-sure-which-are-really-broken.
 
Please post your list of parts. Also, to eliminate the RAM, try with only one stick, then with the other stick. You also said you reseated the CPU. Did you apply new thermal paste? Have you been able to see if overheating is an issue?
 
Hi,

here's the list of parts:
Intel Core i7-8700 (with a Thermalright 100700726 Macho Rev. B cooler)
16 GB (2x8GB) G.Skill RipJaws V DDR4-3200
Corsair TX650M PSU
Gigabyte Z370 D3 Motherboard
Gigabyte Geforce RTX 2070 Windforce 8G
System SSD: Crucial P1 CT500P1SSD8 500GB
Secondary HDD: Seagate ST2000DMZ08 BarraCuda

Trying the RAM sticks separately, the problem persists.
I did not apply new thermal paste when I reseated the CPU, as I don't have any handy and the existing one still looked fine.
Overheating has not been an issue with the system previously, and when the BSODs occur, HWMonitor reports CPU temps below 40 °C (I didn't have it open every time, but whenever I did, the temps were as described).

I also just updated the BIOS to the latest version, but it's as unstable as before.
 
What might have changed since all was well?
Perhaps some sort of a maintenance update?

Run memtest86.
It boots from a USB stick and does not use Windows.
It is a basic test of PC functioning.
It takes Windows issues off the table.
You can download the free edition here:
https://www.memtest86.com/download.htm

If you can run a full pass with NO errors, your RAM should be OK.

Try booting in safe mode.
It runs with a minimal set of essential drivers.
If it runs OK, you may have a driver issue.

Your symptoms sound like a PSU problem.
Your issue happens under load.
If you can, test with a known good PSU.

It is very unlikely that the processor is defective.
 
There was no software change that took place shortly before this.

I ran memtest86 already, should have mentioned that in the original post, sorry. 4 passes showed no errors, so the RAM should be fine.

I also just tried disabling turbo boost (by setting the maximum processor state to 99% in Windows' power options). With it disabled, I was actually able to run Prime95 for 20 minutes without any issues and no apparent instability. The maximum CPU temp after that was 45 °C. But this could still mean anything from a broken PSU to a broken CPU, couldn't it?
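For reference, the same power-option change can be scripted. This is a minimal sketch, assuming the stock powercfg aliases SCHEME_CURRENT, SUB_PROCESSOR and PROCTHROTTLEMAX are available on the machine. The function names are made up for illustration. By default it only prints the commands; actually applying them requires Windows and an elevated prompt.

```python
import subprocess
from typing import List


def max_processor_state_commands(percent: int) -> List[List[str]]:
    """Build powercfg calls that cap 'Maximum processor state' at `percent`."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    commands = []
    # Set the value for both plugged-in (AC) and battery (DC) operation.
    for setter in ("/setacvalueindex", "/setdcvalueindex"):
        commands.append(
            ["powercfg", setter, "SCHEME_CURRENT", "SUB_PROCESSOR",
             "PROCTHROTTLEMAX", str(percent)]
        )
    # Re-activate the current scheme so the new value takes effect.
    commands.append(["powercfg", "/setactive", "SCHEME_CURRENT"])
    return commands


def apply_max_processor_state(percent: int, dry_run: bool = True) -> None:
    """Print the commands; run them only when dry_run is False."""
    for command in max_processor_state_commands(percent):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    apply_max_processor_state(99)  # 99% is the value used in the test above
```

Calling it again with 100 would restore the default maximum.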

Unfortunately I don't have another PSU to test (just moved to a new city, so also no friends nearby to ask for one). I'd have to order a new one or RMA the existing one. That's why I wrote the original post, looking for a better way :/
 
Unfortunately, the only way to diagnose hardware issues is by inspired replacement of parts.
That you can run OK at 99% is a bit baffling.

You could open an incident with Corsair.
See if they can cross-ship you a replacement PSU.
 
Running Prime95 for 20 minutes on any system should result in high temps, not temps in the 40s. Something seems wrong with the temp reporting if that's as high as you get. Even a $500 water loop would probably go higher than 40 after 20 minutes of Prime95.

It's hard to speculate. As said, testing with different parts is what I would do. It helps to have like 10 different PCs with all different sockets and chips in the house, but that's not realistic for someone who doesn't. Possibly the motherboard. Seeing as you tried Linux, we can rule out Windows, so there's no point telling you to do a reinstall.

Myself, anytime I remove the cooler, I redo the thermal paste, but in your case the problem was there both before and after you reseated the CPU.

Possibly a cold solder joint on the motherboard. I'm grasping, lol.
 
Thanks for all your answers. I will try to get a replacement PSU and motherboard and hopefully one of those will solve it.

A note on the Prime95 test: as it ran with Turbo Boost disabled, the CPU only ran at 3.1 GHz, so maybe that's why it didn't heat up as much. AFAIK my CPU cooler is also pretty overkill for this CPU. I also did another, longer run of about an hour. There it got up to 55 °C.

One last data point that I collected was to try installing the GPU again. With it, the system would still POST (I have a beeper installed), but I couldn't get an image on the screen (neither from Windows nor from the BIOS). I guess this confirms that it is not the CPU that is damaged, but the motherboard or the PSU (unless I somehow managed to damage the GPU in all of this testing, which I really hope is not the case 😳).
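The elimination reasoning in this thread can be summarized as a small Python sketch. It only encodes the tests reported above and the suspects the posters took them to clear as the cause of the crashes. The names and structure are made up for illustration, and the result is a list of remaining suspects, not a diagnosis.

```python
# Everything that was considered as a possible cause of the crashes.
SUSPECTS = {
    "Windows/drivers", "system SSD", "RAM", "GPU", "secondary HDD",
    "overheating", "CPU", "motherboard", "PSU",
}

# (test reported in the thread, suspects it was taken to clear)
TESTS = [
    ("Ubuntu live USB also freezes under load", {"Windows/drivers", "system SSD"}),
    ("memtest86: 4 passes, no errors; single sticks still crash", {"RAM"}),
    ("Still unstable with GPU and secondary HDD removed", {"GPU", "secondary HDD"}),
    ("CPU reported below 40 °C when the BSODs occur", {"overheating"}),
]


def remaining_suspects(suspects, tests):
    """Remove every suspect cleared by a test and return what is left, sorted."""
    remaining = set(suspects)
    for _description, cleared in tests:
        remaining -= cleared
    return sorted(remaining)


if __name__ == "__main__":
    print(remaining_suspects(SUSPECTS, TESTS))
```

What is left is the CPU, motherboard and PSU, which matches the three components the original question asks about; separating those further still needs a swap with known-good parts.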