Question PC Randomly Reboots (Advanced Issue)

Status
Not open for further replies.

Ender156

Distinguished
Jul 11, 2015
5
0
18,510
Hello Tom’s Hardware,

This is possibly my first ever internet post for computer trouble. I apologize if I’m not following proper standards or rules. I’m also on mobile right now, so sorry for the lack of formatting.

Background:
Just to establish some background, I’m a long-time tech person who’s been building PCs (both privately and for hire) for well over a decade. I also have a couple of degrees in Computer/Electrical engineering and I’m a full-time software developer. This is just to set the tone of the question, as this will be an advanced-level discussion.

My current build from 2020 is as follows:
CPU: AMD Ryzen 9 5950x
RAM: Corsair Vengeance RGB Pro 32GB (4x8GB) DDR4 3600MHz
Motherboard: MSI X570 MPG Gaming
GPU: PNY Nvidia RTX 3080
PSU: Corsair RM750 750W 80+ Gold Fully Modular (with Cablemod cables)
Drives: Samsung 970 M.2 (1TB) (boot drive), Samsung 860 Evo SATA (1TB), WD Black 2TB HDD SATA, an old Intel 256GB SSD SATA (former boot drive)
Case: Fractal Design Vector RS Blackout with RGB connected to a JRAINBOW header, and a front panel that includes USB-C, USB 3, power and reset buttons, audio in/out, and an HDD LED
Fans: 3x 140mm standard Fractal fans connected to a Fractal PWM fan controller (powered by PSU SATA power and connected to a system fan header), 3x Cooler Master RGB fans (2x on CPU fan headers, 1x on the fan controller) with RGB connected to JRAINBOW headers
CPU Cooler: Cooler Master Hyper 212 Evo (with a new AM4 mounting bracket) (the two fans mentioned above)
Monitors: 2x Acer 1440p 144Hz G-Sync HDR via DisplayPort
Peripherals: Corsair K70 RGB keyboard (two USB connectors, for keyboard and power), Logitech G502 USB mouse, SteelSeries Arctis Pro headset (with USB set atop box), Blue Yeti USB mic. All connected to the back panel.
OS: Windows 10 Home

Primary Use (in order of importance): Development/work, gaming, communication/web browsing

It is used for many hours a day every day without fail, so I would consider it heavily used. I do transport it via vehicle a few times a year.

I did a small amount of overclocking. The RAM runs on its XMP profile: although the kit is rated for 3600MHz, it runs at 2133MHz by default, so enabling XMP brings it up to 3600MHz. That also raises the Ryzen’s fabric clock from 1067MHz to 1800MHz, which is within the normal operating range, so it’s not a problem. I also overclocked the 5950X: the base clock is raised to 4GHz, and it boosts to 4.4GHz. These overclocks have been stable for the lifetime of the build.
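For anyone following the clock numbers: DDR4 ratings like "3600MHz" are really transfer rates (MT/s), and the actual memory clock is half that. A minimal Python sketch of the arithmetic, assuming the fabric clock (FCLK) runs 1:1 with the memory clock, which is what the 1067MHz to 1800MHz figures above imply; the function names are illustrative only.

```python
def memory_clock_mhz(ddr_rating: float) -> float:
    """DDR moves data twice per clock, so the real memory clock (MCLK) is half the rating."""
    return ddr_rating / 2


def fclk_1to1_mhz(ddr_rating: float) -> float:
    """Assumes FCLK is coupled 1:1 with MCLK, so FCLK simply equals MCLK."""
    return memory_clock_mhz(ddr_rating)


if __name__ == "__main__":
    for rating in (2133, 3600):
        print(f"DDR4-{rating}: MCLK/FCLK = {fclk_1to1_mhz(rating):.1f} MHz")
```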

Problem:
My PC has started rebooting at random. Without warning, the screens go black and the computer reboots. It never powers off; the RGB lights and fans keep running. It reboots exactly the way it does when you press the reset button. Windows logs a Kernel-Power event ID 41 (the system restarted without a clean shutdown), which points toward a hardware-level reset.

The reboots happen at random intervals; I could not find any identifiable pattern. It takes anywhere from 5 minutes to 2 hours before a reboot. Sometimes it reboots again in the middle of booting back up.

My first thought was a short in the front panel connectors, so I disconnected the reset headers. It still happened at random.

I thought it might have something to do with power draw. I ran many stress tests (Cinebench, FurMark, and the AMD Ryzen Master stress test) and monitored temperature, clock speed, voltage, power delivery, etc. throughout. Everything seemed fine. CPU temperature only ever peaked around 60C, and everything else ran cooler, so temperature is ruled out. Interestingly, a reboot never occurred during these tests.

I also worried about my memory. A previous server box of mine had a short across motherboard memory pins that caused tons of trouble before I realized a nut had fallen under the board. I ran TechPowerUp’s memory test, the Windows 10 memory test, and Memtest. All reported the memory functioning normally.

My next thought was that my PSU was failing. Admittedly, 750W cuts it close for this system, so I thought the constant heavy use might have worn it down. I bought a new EVGA 1000W 80+ Gold and installed it (to my disappointment, the CableMod cables that worked with the Corsair did not work with the EVGA). I replaced every power cable in the PC with the new PSU’s cables, and I also replaced the external power cord and changed outlets. The issue persisted.
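To put rough numbers on "cuts it close", here is a back-of-the-envelope Python sketch. The wattages are assumptions for illustration, not measurements: 142W is AMD’s stock package power limit (PPT) for a 105W-TDP Ryzen such as the 5950X, 320W is NVIDIA’s reference board power for the RTX 3080, and the rest-of-system figure is a rough guess. The overclocked CPU will draw more than stock, and the sketch only covers sustained load, not short transient spikes.

```python
def psu_loading(psu_watts: float, loads_watts: dict[str, float]) -> tuple[float, float]:
    """Return (total estimated sustained draw in W, fraction of the PSU's rating)."""
    total = sum(loads_watts.values())
    return total, total / psu_watts


# Illustrative estimates only (see assumptions above), not measured values.
estimated_loads = {
    "CPU (stock PPT, before overclock)": 142,
    "GPU (reference board power)": 320,
    "rest of system (rough guess)": 75,
}

if __name__ == "__main__":
    for psu in (750, 1000):
        total, fraction = psu_loading(psu, estimated_loads)
        print(f"{psu} W PSU: ~{total:.0f} W estimated, {fraction:.0%} of rating")
```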

Then, despite my reluctance, I decided to replace the motherboard. I bought an MSI X570 MEG Unify, a higher-quality board than my original: it has better VRM cooling (an actual heat pipe in the heatsink covering the VRMs), more VRM phases, and four additional pins of CPU power. Truth be told, my previous board was probably a bit low-end for my CPU, so this was a fitting upgrade anyway. After moving everything over, I booted and all seemed fine. It was late at night, so I only used it for a couple of hours, without issue. I woke up today, resumed using it, and about 30 minutes to an hour in, it rebooted again.

I believe this is everything, and now I’m at a loss. Like I said, I’ve never posted on a forum for help like this. I’m generally a DIY problem solver, and I have solved some very complex and strange issues in my time. But work is busy right now, and I don’t have time to spend several more weeks working through this.

If anyone has any advice, it would be much appreciated! I will edit with any additional information or information I have forgotten.

EDIT:
To clarify: the only parts of my computer that I have not replaced are CPU, GPU, and memory. All of which I have tested to the best of my ability. Could they be causing this restart?
 

Ender156

Did you attempt to use the Cablemod cables with the new power supply, even once?
As I said, the CableMod cables were bought for the Corsair PSU, and (as I found out) the connectors on the PSU end are different for the peripheral cables (GPU). So I used the stock cables that came with the EVGA and took the CableMod cables out.
 
I understand, but my question is did you attempt to use them with the new power supply before you found out that the pinouts were different?

Also, after changing boards, did you go to the product support page for your X570 MEG Unify and download/install ALL of the latest drivers for chipset (can be obtained on the MSI or AMD websites), LAN, WiFi, Bluetooth and Realtek audio? Make sure you download the drivers specifically intended for Windows 10, not Windows 11, as some of them may be the same but some will absolutely be different.
 

Ralston18

Titan
Moderator
And I will add the suggestion to look in Reliability History/Monitor and Event Viewer.

Either one or both tools may be capturing some error codes, warnings, or even informational events just before or at the time of the cited random reboots.

Reliability History/Monitor may reveal some pattern to what appears "random".
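For reference, the same Kernel-Power 41 entries can be pulled from the command line with Windows’ built-in `wevtutil`, which makes it easier to line up exact reboot times and the gaps between them. A minimal Python sketch, assuming Windows and the standard `wevtutil qe` options (`/q:` XPath filter, `/c:` event count, `/rd:true` newest first, `/f:text` text output). The date parsing assumes the text output contains `Date: YYYY-MM-DDTHH:MM:SS...` lines and may need adjusting.

```python
import re
import subprocess
import sys
from datetime import datetime, timedelta

# XPath filter: event ID 41 from the Kernel-Power provider in the System log.
QUERY = "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]"


def build_wevtutil_command(max_events: int = 50) -> list[str]:
    """qe = query events; /rd:true returns newest first; /f:text gives readable output."""
    return ["wevtutil", "qe", "System", f"/q:{QUERY}", f"/c:{max_events}", "/rd:true", "/f:text"]


def parse_event_times(text: str) -> list[datetime]:
    """Collect 'Date: ...' stamps (assumed format), dropping fractional seconds and any
    time zone suffix, and return them oldest first."""
    stamps = re.findall(r"Date:\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})", text)
    return sorted(datetime.fromisoformat(s) for s in stamps)


def gaps_between(times: list[datetime]) -> list[timedelta]:
    """Time elapsed between consecutive unexpected reboots."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__" and sys.platform == "win32":
    out = subprocess.run(build_wevtutil_command(), capture_output=True, text=True, check=True).stdout
    for gap in gaps_between(parse_event_times(out)):
        print(gap)
```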
 

Ender156

I understand, but my question is did you attempt to use them with the new power supply before you found out that the pinouts were different?

Also, after changing boards, did you go to the product support page for your X570 MEG Unify and download/install ALL of the latest drivers for chipset (can be obtained on the MSI or AMD websites), LAN, WiFi, Bluetooth and Realtek audio? Make sure you download the drivers specifically intended for Windows 10, not Windows 11, as some of them may be the same but some will absolutely be different.
Ah, no. Once I realized even one of the connectors wasn’t the same, I decided to rewire everything with the EVGA cables. No CableMod cable was ever connected to any PC part while the EVGA was installed.

Yes, I should have mentioned that. I downloaded and installed all drivers from MSI’s page for the MEG Unify (for Win10, of course). Everything is up to date (motherboard drivers, GPU drivers, Windows), but I have not updated the BIOS on either board. I will check the BIOS version on the new board. I will probably update it, but I doubt it matters, since both boards show the same issue even though the old one had an older BIOS and this one has a newer one.

To your other message, yes my PC ran in this exact configuration for 3 years before this problem randomly appeared.
 

Ender156

And I will add the suggestion to look in Reliability History/Monitor and Event Viewer.

Either one or both tools may be capturing some error codes, warnings, or even informational events just before or at the time of the cited random reboots.

Reliability History/Monitor may reveal some pattern to what appears "random".
I will look into Reliability History/Monitor. Event viewer is where I am seeing the Kernel-Power event ID 41. It’s the only critical error and it appears each time the PC resets. Otherwise, there are no other serious errors being reported.

I am familiar with driver-level error reporting there. I once had an issue with a GPU failing and saw power errors reported in event viewer. None of that this time, though.

I’ll also add for clarity: I checked, but since there is no BSOD, there are unfortunately no Windows dumps (minidumps) to check.
 
OK, so you ran this build for 3 years without this problem, and it started happening out of the blue. SO, the question now is, WHAT, if anything, did you DO or CHANGE between when it was working fine and when this started?

Do you have an approximate date when this began, so you can check Windows Update history for any updates installed around that time that might have been the trigger?
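If it helps, the date-matching part can be scripted. A minimal Python sketch, assuming you have noted the date of the first reboot and copied install dates out of Windows Update history (or PowerShell’s `Get-HotFix`, which lists installed hotfixes with an InstalledOn date). The update names and dates below are hypothetical placeholders, and the 14-day window is an arbitrary choice.

```python
from datetime import date, timedelta


def updates_before(first_reboot: date, installs: dict[str, date], window_days: int = 14) -> list[str]:
    """Names of updates installed within `window_days` before (or on) the first reboot,
    oldest first."""
    start = first_reboot - timedelta(days=window_days)
    candidates = [name for name, installed in installs.items() if start <= installed <= first_reboot]
    return sorted(candidates, key=lambda name: installs[name])


# Hypothetical example data: replace with your own notes from Update history / Get-HotFix.
installs = {
    "example-update-A": date(2023, 5, 1),
    "example-update-B": date(2023, 5, 20),
    "example-driver-C": date(2023, 5, 28),
    "example-update-D": date(2023, 6, 10),
}

if __name__ == "__main__":
    print(updates_before(date(2023, 6, 1), installs))
```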

Knowing for sure what the BIOS version is would be helpful too, BECAUSE Microshaft sometimes releases updates with the ASSUMPTION that users have already applied recent BIOS updates. When they have not, it CAN sometimes trigger problems that did not exist before but that a firmware update might resolve.

A lot of people think updating the BIOS is like it used to be, something you should only do IF you have a problem that cannot be resolved by other means. I mean to tell you, that is not how things work these days. BIOS updates are now about as common as driver updates, and if you don't keep the BIOS up to date you MIGHT experience problems that you otherwise wouldn't. Not always, to be sure, but consider that there are MILLIONS of possible hardware combinations once you count all the different onboard options and peripherals, not to mention all the potential software issues with drivers, applications and security considerations.

So, IMO, BIOS updates (aside from beta versions) should be checked for and applied regularly, even if nothing in the release notes seems to apply, because the notes do NOT always list everything that changed in a given update. This cannot be overstated, if we're being honest. It is a VERY common fix for a LOT of problems on modern platforms. But it is not always the fix, so I'm not trying to mislead anybody when I say "maybe".

Updating is still a very good idea these days, especially if you DO have problems. Usually, though, problems don't just start out of the blue on a system that was working fine UNLESS something changed. Could be a Windows update. Could be a driver update. Could be an installed application that wasn't on there before.

Could also totally be something else. Your 860 EVO is getting long in the tooth, so to speak. I think it would be a good idea to look at your storage devices first.

Do us both a favor and post a screenshot of the Disk Management window, making sure all drives are shown. I think you know how to do this, since you clearly have a fair understanding of the basics and are moderately experienced in relevant troubleshooting. Then we can go from there.

The reason is, I want to see what the deal is with the older drive that USED to have Windows installed on it. It MIGHT be causing issues, and believe me, it can, even if you think it can't, especially if you did not specifically remove its old hidden boot partitions. This happens OFTEN when users install Windows on a new drive without first disconnecting ALL other drives. Disconnecting them keeps Windows from seeing an existing EFI partition and deciding it does not need to create a new one.

Also, testing EACH drive's health with multiple utilities like Hard Disk Sentinel, SeaTools for Windows, Western Digital's Data Lifeguard tools, or any number of other health and physical testing utilities would be a really good place to start.
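If you want a scripted first pass alongside those GUI tools, smartmontools' `smartctl` (a different free utility, swapped in here as an example) can list drives with `--scan` and print each drive's overall SMART verdict with `-H`. A minimal Python sketch, assuming smartctl is installed and on PATH, and run from an elevated prompt on Windows. The output parsing assumes the usual "self-assessment test result" / "SMART Health Status" lines and may need adjusting. A PASSED verdict is only a coarse check and does not replace the surface and physical tests those utilities run.

```python
from __future__ import annotations

import re
import shutil
import subprocess


def parse_scan(output: str) -> list[list[str]]:
    """Turn `smartctl --scan` lines like '/dev/sda -d sat # /dev/sda [SAT], ATA device'
    into argument lists: the device plus its '-d TYPE' option (text before '#')."""
    devices = []
    for line in output.splitlines():
        args = line.split("#", 1)[0].split()
        if args:
            devices.append(args)
    return devices


def parse_health(output: str) -> str | None:
    """Extract the overall verdict from `smartctl -H` output (ATA/NVMe or SCSI wording)."""
    match = re.search(r"self-assessment test result:\s*(\w+)|SMART Health Status:\s*(\w+)", output)
    if match is None:
        return None
    return match.group(1) or match.group(2)


def check_all_drives() -> dict[str, str | None]:
    scan = subprocess.run(["smartctl", "--scan"], capture_output=True, text=True).stdout
    results: dict[str, str | None] = {}
    for device, *options in parse_scan(scan):
        # No check=True: smartctl sets non-zero exit bits to flag problems, and we still want the text.
        out = subprocess.run(["smartctl", "-H", *options, device], capture_output=True, text=True).stdout
        results[device] = parse_health(out)
    return results


if __name__ == "__main__" and shutil.which("smartctl"):
    for device, verdict in check_all_drives().items():
        print(device, verdict or "no health line found")
```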

If all of that is good, I'd suggest this to be the next step, just so we can eliminate the obvious, if possible.

Memtest86


Go to the Passmark software website and download the USB Memtest86 free version. You can do the optical disk version too if for some reason you cannot use a bootable USB flash drive.


Create bootable media using the downloaded Memtest86. Once you have done that, go into your BIOS and configure the system to boot to the USB drive that contains the Memtest86 USB media or the optical drive if using that option.


You CAN use Memtest86+, as it was recently updated after MANY years without updates, but for the purpose of this guide I recommend the PassMark version, since it is a tried and true utility and I've not had the opportunity to investigate the reliability of the latest 86+ release compared to Memtest86. Consider using Memtest86+ simply as a secondary test to Memtest86, much as the Windows Memory Diagnostic and Prime95 Blend or custom modes can be used for a second opinion.


Create a bootable USB Flash drive:

1. Download the Windows MemTest86 USB image.

2. Right click on the downloaded file and select the "Extract to Here" option. This places the USB image and imaging tool into the current folder.

3. Run the included imageUSB tool. It should already have the image file selected; you just need to choose which connected USB drive to turn into a bootable drive. Note that this will erase all data on the drive.



No memory should ever fail Memtest86 at its default configuration, meaning the settings the system applies out of the box or after you clear CMOS by removing the CMOS battery for five minutes.

The best method for testing memory is to first run four passes of Memtest86, all 11 tests, WITH the memory at the default configuration. Do this BEFORE setting the memory to the XMP profile settings. The paid version has 13 tests, but the free version only has tests 1-10 and test 13, so run full passes of all 11 tests. Be sure to download the latest version of Memtest86. As noted above, Memtest86+ went MANY years without updates, and I still don't consider it as proven as regular Memtest86 from PassMark Software.

If there are ANY errors, at all, then the memory configuration is not stable. Bumping the DRAM voltage up slightly may resolve that OR you may need to make adjustments to the primary timings. There are very few secondary or tertiary timings that should be altered. I can tell you about those if you are trying to tighten your memory timings.

If you cannot pass Memtest86 with the memory at the XMP configuration settings, then I would recommend restoring the memory to the default JEDEC SPD of 1333/2133MHz (depending on your platform and memory type) with everything left on auto/default, and running Memtest86 again. If it completes the four full passes without error, you can try XMP again, but first try bumping the DRAM voltage up once again by whatever small increment the motherboard allows. If it passes, great, move on to the Prime95 testing.

If it still fails, try once again bumping the voltage if you are still within the maximum allowable voltage for your memory type and test again. If it still fails, you are likely going to need more advanced help with configuring your primary timings and should return the memory to the default configuration until you can sort it out.

If the memory will not pass Memtest86 for four passes when it IS at the stock default non-XMP configuration, even after a minor bump in voltage, then there is likely something physically wrong with one or more of the modules. I'd recommend running Memtest86 on each module individually to determine which one is causing the issue. If you find a single faulty module, contact the seller or the memory manufacturer and have them replace the memory as a SET. Memory kits come matched for a reason, and if you let them replace only one module rather than the entire set, you are back to using unmatched memory, which is an open door for compatibility problems.

Be aware that you SHOULD run Memtest86 to test the memory at the default, non-XMP, non-custom profile settings BEFORE ever making any changes to the memory configuration, so that you will know whether the problem is a setting or a physical problem with the memory.
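The test-and-adjust sequence above can be summarized as a small decision function. This is a hypothetical Python sketch of the guide's logic only; the function, its parameters and the returned wording are illustrative and not part of Memtest86 or any other tool. Each call represents the outcome of one four-pass Memtest86 run.

```python
from enum import Enum

# Free edition runs tests 1-10 plus test 13 (11 tests per pass), per the guide above.
FREE_VERSION_TESTS = list(range(1, 11)) + [13]
PASSES_PER_RUN = 4


class Profile(Enum):
    DEFAULT = "default JEDEC/SPD settings"
    XMP = "XMP profile"


def next_step(profile: Profile, passed: bool, voltage_bumps: int, at_voltage_limit: bool) -> str:
    """Suggest the next action after one four-pass Memtest86 run.

    voltage_bumps: small DRAM-voltage increments already applied at this profile.
    at_voltage_limit: True if another bump would exceed the memory type's maximum.
    """
    if passed:
        if profile is Profile.DEFAULT:
            return "Default passed: enable XMP and rerun Memtest86"
        return "XMP passed: move on to Prime95 testing"
    if profile is Profile.DEFAULT:
        if voltage_bumps == 0 and not at_voltage_limit:
            return "Bump DRAM voltage slightly and rerun at default"
        return "Likely a faulty module: test each stick alone, then RMA the whole kit"
    # XMP run failed.
    if voltage_bumps == 0:
        return "Confirm default passes, then retry XMP with one small DRAM voltage bump"
    if voltage_bumps == 1 and not at_voltage_limit:
        return "Bump DRAM voltage once more (staying within spec) and retest XMP"
    return "Return to default settings and get help tuning primary timings"


if __name__ == "__main__":
    print(f"{len(FREE_VERSION_TESTS)} tests x {PASSES_PER_RUN} passes per run")
    print(next_step(Profile.XMP, passed=False, voltage_bumps=1, at_voltage_limit=False))
```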
 