Question Computer sometimes fails to boot or wake, [usually] citing a RAM issue, but happens independently of what RAM sticks are in

Sep 15, 2020
5
0
10
0
This issue's been coming up for a week or so. I've tried some troubleshooting, to no avail, and the problem might be getting worse or be a sign of a bigger problem on the way. All parts were bought and installed back in about October, save for the GPU and one SSD from ~September 2019. The short version is that it looks like a problem with RAM being initialized at boot, but not with the actual sticks I'm using. I'll have to break this down into a few sections because I've alternated between having the problem, trying some troubleshooting, and having more of the problem.

Initial Problem

Over the last week, my computer had 3 instances of failing to wake from sleep after being left asleep overnight. My motherboard (Strix X570-F Gaming) has lights on it to indicate problems. On the first and third instances, I got an orange light, indicating a RAM issue (2x 16GB Corsair Vengeance Pro RGB rated at 3600MHz, running DOCP). On the second, I got a red light, indicating a CPU issue (Ryzen 3900X). In all three instances, I held the power button to force a shutdown, and upon switching back on, everything was fine: RAM still clocking 3600MHz, 32GB, temperatures normal. These 'failed to wake' instances didn't happen every day, just about 3 out of 5 days or so. I also tried, as a test, putting the system to sleep and waking it up right after, and everything was fine, but a sample size of 2 isn't much good.

Initial Troubleshooting

To try some firmware troubleshooting in case this was some problem with my BIOS, I updated to the latest non-beta BIOS for my motherboard (version 3001), ensured firmware-based TPM was turned off, and updated my GPU (GTX 1660) drivers - most likely irrelevant, but they needed updating anyway.

Problem Continues / Worsens, Hardware Troubleshooting

Yesterday morning, I try to wake from sleep, get the same issue, orange light, says RAM issue (all lights from this point onward are orange). Figure it's time to reseat the RAM. Shut off, I do that, taking them out and putting them back in, and now I don't even get to POST from off state. Nothing sent to monitor. Orange light, RAM issue. To try something out I take them out and put in my old RAM, 2 8GB Corsair Vengeance sticks. Same light. So it's a RAM issue, but seemingly not bound to these 16GB sticks. I switch off, switch off at kettle, unplug kettle for ~10 min, come back, plug back in and it's turning on fine. Again, I check the RAM, it's running 2x16GB at 3600MHz just fine. I turn off secure boot in the BIOS, just on the off chance that it's related to it struggling to wake or POST. System is fine for the rest of the day.

Today: problem persists, struggling to POST or wake from shorter sleep

Switched on today from sleep, orange light is back. Hold off, shuts down. Turn on again with no unplugging, no POST. Turn off, unplug, go away, come back, plug in, turn on, orange light. Turn off again, unplug, wait a while, plug back in, boots just fine.

I start to wonder if it's related to the length of sleep, so I run another test, putting it to sleep, going for a shower (~30m), come back, try to wake, orange light. Hold shutdown, off, turn back on with no unplugging, boots fine, here I am typing this.
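
With this few attempts it's hard to tell a pattern from coincidence, so it may help to tally them per condition. Below is a minimal Python sketch that counts today's six attempts exactly as described above; the condition labels are shorthand made up for this sketch, not anything the board or Windows reports.

```python
from collections import defaultdict

# (condition, booted_ok) for each attempt described above, in order.
# The condition labels are shorthand invented for this sketch.
attempts = [
    ("wake from sleep", False),          # orange light
    ("cold boot, no unplug", False),     # no POST
    ("cold boot, after unplug", False),  # orange light
    ("cold boot, after unplug", True),   # booted fine
    ("wake from sleep", False),          # ~30 min sleep, orange light
    ("cold boot, no unplug", True),      # booted fine
]

def tally(attempts):
    """Return {condition: (successes, total)} for a list of attempts."""
    counts = defaultdict(lambda: [0, 0])
    for condition, ok in attempts:
        counts[condition][1] += 1
        if ok:
            counts[condition][0] += 1
    return {condition: tuple(pair) for condition, pair in counts.items()}

if __name__ == "__main__":
    for condition, (ok, total) in tally(attempts).items():
        print(f"{condition}: {ok}/{total} booted")
```

Two attempts per condition is far too few to conclude anything, but appending each new attempt to the list would show over time whether unplugging really makes a difference.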

Specs, to recap:
Motherboard: ASUS ROG Strix X570-F Gaming
BIOS: Version 3001
CPU: Ryzen 9 3900X
GPU: Zotac GeForce GTX 1660
RAM: 2x Corsair Vengeance Pro RGB DDR4 16GB rated at 3600MHz, DOCP defaults
PSU: Corsair RM850
OS: Win10 Home 20H2 / 19042.789
Antivirus: ESET (in case that mucks with anything booting)
OS install drive: Samsung 2TB 860 QVO

So what gives?

The issue is:
  • not constant
  • not clearly linked to sleep length, or even to waking from sleep vs booting from shutdown; I might have just been getting coincidental results prior to yesterday. Similarly, the 'unplug it and go away for a while' routine might not be a fix, and I might just randomly and coincidentally be getting good boots 2 out of 3 times after unplugging it. Restarts from the OS have only been done 2 or 3 times but have not had issues.
  • indicated as a RAM issue about 6 times, a CPU issue once
  • not specific to the current RAM sticks, so not at all likely to be related to my sticks running at 3600MHz DOCP (the ASUS AMD-board equivalent of Intel's XMP)
  • not correlated, at its first instance, with any recent software changes
  • not linked to firmware TPM or Secure Boot
I'm starting to suspect it's more something with the motherboard not dealing with the RAM in some way, or not getting enough power at boot, but my power supply is an RM850 (modular) with plenty overhead. Later today I'm going to unplug it, open up the side panel and unplug / replug every connection to the motherboard and PSU, in case it's just something loose, but I am really worried that it might actually be getting progressively worse, that at some point it's going to stop entirely or having to hold shutdown is going to ruin something, and I don't know what to look at. Every involved part is barely a few months into its life so I'm definitely warrantied but I struggle to believe that something works for that long just fine then starts having an issue. Can't help but worry that I've done something wrong.

Any particular recommendations?
 

Ralston18

Titan
Moderator
At face value my first suspect is the PSU.

However, do as you plan:

Power down, unplug, open the case.

Clean out dust and debris.

Reseat all cards, connectors, RAM, and jumpers to ensure that all are fully and firmly in place.

Look in Reliability History and Event Viewer for error codes, warnings, and even informational events.

Increasing numbers of varying problems is symptomatic of a faltering PSU.

Reliability History has a timeline format that can readily show such trends.

If you have a multimeter and know how to use it, you can do some testing on the PSU. (Or find a family member or friend who does and can help.)

Reference:

https://www.lifewire.com/how-to-manually-test-a-power-supply-with-a-multimeter-2626158

Not a full test as the PSU is not under load.

Any voltages out of spec would be of concern.
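
As a rough illustration of what "out of spec" means, here is a small Python sketch that checks a multimeter reading against the commonly cited ATX tolerances (±5% on the positive rails, ±10% on -12 V). Those tolerance figures and the rail names are the assumptions here; confirm them against the PSU's own documentation before drawing conclusions.

```python
# Nominal rail voltage and allowed tolerance (as a fraction of nominal).
# Assumption: the commonly cited ATX limits of +/-5% on positive rails
# and +/-10% on the -12 V rail.
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "+5VSB": (5.0, 0.05),
    "-12V": (-12.0, 0.10),
}

def rail_limits(rail):
    """Return (lowest, highest) acceptable voltage for a rail."""
    nominal, tolerance = ATX_RAILS[rail]
    delta = abs(nominal) * tolerance
    return nominal - delta, nominal + delta

def in_spec(rail, measured):
    """True if a measured voltage falls inside the rail's tolerance band."""
    low, high = rail_limits(rail)
    return low <= measured <= high

if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    for rail, reading in [("+12V", 12.1), ("+5V", 5.3)]:
        status = "ok" if in_spec(rail, reading) else "OUT OF SPEC"
        print(f"{rail}: {reading} V -> {status}")
```

A reading that passes here still only reflects an unloaded PSU, so it cannot rule the PSU out.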
 
Sep 15, 2020
Thanks for your response, the big reseat is gonna go ahead tomorrow.

Look in Reliability History and Event Viewer for error codes, warnings, and even informational events. Increasing numbers of varying problems is symptomatic of a faltering PSU. Reliability History has a timeline format that can readily show such trends.
Ok, so, I don't have much of a frame of reference for how to read this, so I'll go through Reliability Monitor first, which seems more or less okay, then Event Viewer, which looks more problematic.

In Reliability Monitor, all I've had over the past week are application failures and information alerts. None of them seem unaccounted for, nor do they correlate 1:1 with the days I had startup issues. It's just things like OBS, Resolve, iCUE and Razer crashes. Outside of the Razer ones, these are all kinda regular and expected in my experience, and again, the issue predates the two days with Razer crashes (they also released a new update today which looks like it fixes that). The informational events are all just successful Windows updates that I can account for.

I did get 1 Windows failure alert on the 20th, predating the issue (three instances of a hardware error), but as far as I can tell that actually correlates with DaVinci Resolve crashes, which happen with Nvidia cards on some then-buggy drivers. No instances since. There was also a warning on the 21st from Asus Update Helper when I was redoing some Aura stuff. That's it. No cascade of errors since the issues started, no failures that I couldn't account for otherwise.

Event Viewer is a bit more vague but does seem a bit more pessimistic, depending on how I'm supposed to interpret this?

0 Critical Events, but
83 errors in the last 24 hours, 1,001 in the last 7 days. Almost all of these are event ID 2002 in EapHost, from an Application (40 in 24 hours, 332 in 7 days), and ID 10010 in DistributedCOM (34 in 24 hours, 471 in 7 days).
175 warnings in the last 24 hours, 1,376 in the last 7 days. The spread's a little more even, but two standouts are Event 642 in ESENT (17 in 24 hours, 457 in 7 days) and Event 10016 in DistributedCOM (118 in 24 hours, 613 in 7 days).
Information seems fairly spread but with the biggest coming from Powershell and AppModel-runtime so it doesn't look problematic?
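
Counting these by hand gets tedious, so here is a minimal Python sketch that tallies events by source and ID from a log saved out of Event Viewer as CSV. The column names `Source` and `Event ID`, the file encoding, and the file name in the usage line are assumptions about the export; adjust them to match the header row of the actual file.

```python
import csv
from collections import Counter

def count_events(rows, source_col="Source", id_col="Event ID"):
    """Count rows per (source, event ID) pair.

    `rows` is any iterable of dicts, such as a csv.DictReader.
    The default column names are assumptions about the CSV export.
    """
    counts = Counter()
    for row in rows:
        counts[(row[source_col], row[id_col])] += 1
    return counts

def top_events(csv_path, n=10):
    """Return the n most frequent (source, event ID) pairs in a CSV export."""
    # utf-8-sig tolerates a byte-order mark if the export includes one.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return count_events(csv.DictReader(f)).most_common(n)

if __name__ == "__main__":
    # "system_errors.csv" is a hypothetical file name.
    for (source, event_id), count in top_events("system_errors.csv"):
        print(f"{count:6d}  {source}  {event_id}")
```

Running it on separate 24-hour and 7-day exports would show whether any single source is actually growing, rather than just being noisy all the time.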
 
...

Any particular recommendations?
With all that's been written it's not at all clear: have you ever tried a CMOS reset? I'd strongly suggest doing it: shut down, disconnect power, remove the CMOS coin cell battery, short the reset pins (look in the manual) for 5 minutes or so, then put it back together. Then run the system on full default settings (including running it in CSM mode) and see if that helps anything. If it's good, then reapply the custom BIOS settings bit by bit.
 
Sep 15, 2020
With all that's been written it's not at all clear: have you ever tried a CMOS reset? I'd strongly suggest doing it: shut down, disconnect power, remove the CMOS coin cell battery, short the reset pins (look in the manual) for 5 minutes or so, then put it back together. Then run the system on full default settings (including running it in CSM mode) and see if that helps anything. If it's good, then reapply the custom BIOS settings bit by bit.
I hadn't thought of that, will give it a shot, remember reading about it in the manual - is there anything I should do beforehand for data safety or whatever or can I just do it more or less straight away?
As an aside, however, this is the first I've heard of 'CSM Mode'; what's that entail?
 
I hadn't thought of that, will give it a shot, remember reading about it in the manual - is there anything I should do beforehand for data safety or whatever or can I just do it more or less straight away?
As an aside, however, this is the first I've heard of 'CSM Mode'; what's that entail?
Resetting CMOS does not affect data stored on your drives, so it is safe. It only resets information in a special memory store that's used when the CPU initializes memory and other parts of the system. It's reset to settings that are safe, and it then re-learns what's right for the hardware. It sometimes gets messed up, especially when changing hardware (like memory) or BIOS versions.

CSM = Compatibility Support Module, and it refers to how the machine boots itself; the alternative is UEFI mode. Different BIOSes call it different things: my MSI board calls it 'Windows Enhanced' when using UEFI mode, so disabling that puts it in CSM mode. You have to enable UEFI mode, which takes it out of CSM mode, in order to enable Secure Boot.

CSM is not desirable, but it's available for when you have legacy hardware or software that's not compatible with modern UEFI firmware.
 