Question: Strange Crash/Reboot, but only when system is left unattended?

Feb 27, 2024
Summary: After a fresh install of Win11 and a BIOS update, the system will crash/reboot, but ONLY while not in use/unattended. The fresh install was done because the OS was 3 years old and to remove vestiges of my school account from the PC. Additional detail is at the bottom of the post. I've done a ton of searching and can't find anything that gives me a direction to go next.

Potentially of note: PC was co-located near my 3D printer and that might have caused the m.2 screw for my OS drive to loosen - unconfirmed, however.

Quick Answers:
  • No new hardware installed; no unusual software packages (O365, Steam, Adobe)
  • Pulled drivers from ASRock for Mobo
  • Pulled chipset driver updater from AMD
  • Do not have any DMP files, mini or other, as they aren't being created
Troubleshooting steps:
  • sfc /scannow
    • Found and repaired corrupted files
  • Ran an app called Heavy Load to 100% my CPU to check temps and such
    • Nothing to report; CPU stays around 76/77 C and no BSoD or crashes
  • Did a simple GPU test using a tool from matthew-x83 online GPU test
    • Ran this with 1000 objects for 10 minutes with no issues
    • Gaming presents without crashes as well
  • Checked for loose/unseated devices (near my 3D printer which is a huge vibration source)
    • Found the screw for my OS m.2 was loose; tightened
  • Confirmed PSU fan was, in fact, working
  • Tested PC with only 3 USB connections active on mainboard and not add-in (K+M, Headset receiver)
    • No change in issue
  • Used the PC for over 8 hours without incident. Within 25 minutes of walking away, crash/reboot
  • Export of Critical/Severe items from Event Viewer: I can provide this through several means, but it's XLSX right now and I know that can be suspect when it comes to file sharing. I have included text from the two Criticals that I see regularly.
    • WHEA-Logger Event Details:

      WHEA-Logger, Event ID 18
      A fatal hardware error has occurred.
      Reported by component: Processor Core
      Error Source: Machine Check Exception
      Error Type: Cache Hierarchy Error
      Processor APIC ID: 10
      The details view of this entry contains further information.
      WHEA-Logger, Event ID 18
      A fatal hardware error has occurred.
      Reported by component: Processor Core
      Error Source: Machine Check Exception
      Error Type: Bus/Interconnect Error
      Processor APIC ID: 0
      The details view of this entry contains further information.
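
For anyone wanting to tally these entries rather than eyeball them, here is a minimal sketch that groups pasted WHEA-Logger text by error type and APIC ID. It assumes the text is laid out exactly as pasted above (a "WHEA-Logger, Event ID N" line followed by "Key: value" lines); that layout is just how this post formatted it, not an Event Viewer export format, and `parse_whea` is a made-up helper name.

```python
import re
from collections import Counter

# The two entries exactly as pasted in the post above.
SAMPLE = """\
WHEA-Logger, Event ID 18
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 10
The details view of this entry contains further information.
WHEA-Logger, Event ID 18
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 0
The details view of this entry contains further information.
"""


def parse_whea(text):
    """Split pasted WHEA-Logger text into one dict per event (hypothetical helper)."""
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"WHEA-Logger, Event ID (\d+)$", line)
        if header:
            # A header line starts a new event record.
            current = {"event_id": int(header.group(1))}
            events.append(current)
        elif current is not None and ": " in line:
            # "Key: value" lines become fields of the current event.
            key, value = line.split(": ", 1)
            current[key] = value
    return events


events = parse_whea(SAMPLE)
summary = Counter((e["Error Type"], e["Processor APIC ID"]) for e in events)
for (error_type, apic_id), count in sorted(summary.items()):
    print(f"{count} x {error_type} on APIC ID {apic_id}")
```

With a longer paste, the counts make it easier to see whether the errors cluster on one or two APIC IDs or are spread across many.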

Some Steps I have not done yet:
  • Pull CMOS/reset BIOS to default settings
  • Test RAM using any tool - not sure what tool to use
  • Test internal components - not sure what the best method is
  • Remove any internal components
System:
Speccy Snapshot: http://speccy.piriform.com/results/yPnIouF1MUrTpHD2xtImPBD

ASRock X570 Phantom Gaming 4 | BIOS v5.6

AMD Ryzen 9 5950X

64GB XPG D50 3200MHz

MSI GeForce RTX™ 3090 GAMING X TRIO 24G | Driver Version: 27.21.14.5671

500GB XPG Spectrix S40G (m.2)

1TB ADATA SX8200PNP (m.2)

1TB TEAMGROUP TM8FP4001T (PCI-E)

PCI-E USB Add-in Card

Thermaltake Toughpower Grand 750

Windows 11, Version 23H2 (Build 22631.3155)


Decided to wipe system because Windows (I had issues cropping up from their crappy updates). I downloaded latest bios, chipset drivers, and stand-alone drivers from ASRock (my mobo) and reinstalled Windows. Everything seemed great, no issues during flashing nor reinstall. Drivers picked up and I began reinstalling my software. Once I had this finished I noticed that my monitors wouldn't go into standby nor the PC itself into sleep - so I began working on this issue. Found that it was something to do with Steam and noticed some janky behavior as I was trying to change Power plan settings. Ran SFC and it found corrupted files - this fixed the standby/sleep issue (I don't use sleep but for testing purposes I was).

This brings me to the current dilemma; my PC will crash and reboot (or attempt to reboot) when I am not actively using it. I have used the PC for 10 hours straight, gaming, school work, browsing, without so much as a new entry being added to my event viewer yet within 20 minutes of me walking away it will crash/reboot. I have no dump files, minidumps, nothing being created when the machine crashes/reboots yet I know it's happening because my peripherals reset and I have to log back in.
 
Solution
Leaving this here for posterity; some other poor soul may find this useful...



Well, I don't know if I would call it "figured out" yet as I haven't hit 24 hours of stability, but, I'm not far off and so far so good.

So, depending on main board and processor your mileage may vary but here is what I posted in another thread on the Reddits:



For anyone still looking at this thread, I am experiencing WHEA-Logger errors and have replaced my mobo, PSU, primary and secondary m.2 drives so far. Previously, I had used the solution located here (https://www.reddit.com/r/Amd/s/kcu0mkyFbH - Change Power Supply Idle Control to Typical) which removed most of them. I just now disabled Core Performance Boost and Global C-State...
Since you mention heavy vibration from your 3D-printer, I would reseat your ram, gpu and check all cable connections for a tight fit.
I have done one pass of the machine checking for loose items, but, I did not remove and reseat the GPU or RAM so I can do that. I will be running memtest86+ tonight as well.

Also, thank you for responding. This is the third place I have posted this and you are the only one to reply.
 
Just pulled everything PCI, m.2, and reseated power and SATA cables.

View: https://imgur.com/a/OCsW8Vn
Potentially new development. System booted fine and I dropped a mouse wiggle program onto it. System froze; like totally unresponsive to anything and kept what was last displayed on the monitors. I unplugged my mouse (it still had RGB on) and moved it to another USB port on the mobo and it’s dead; no power, no input.
 
Two thoughts:

1) Check Reliability Monitor/History for error codes, warnings, and event informational events that occur just before or at the time of the crashes/reboots. Reliability History/Monitor presents a timeline format that can help discover patterns.

2) Case image in Post #7: Counting GPU (removed) there are 8 (eight) fans - correct? What directions are the fan airflows?
 
Reliability Monitor: View: https://imgur.com/a/31P6gBu


Fans at bottom of image on radiator pull into case; fans at top and right of pic exhaust from case. I should have rotated the image 90 degrees counter/anti-clockwise. The pic has the front of my case at the bottom. Apologies if this is stating the obvious.

I have been keeping the system off while I use other methods to research. Today was the first time that I had it 'freeze' totally while I was at the keyboard. As for the reliability monitor all I see is that windows was not shut down properly for the most part (I have the reliability history saved out as an XML file as well). The common things that I see/saw in the event viewer are the WHEA-logger events from the OP and occasionally I see this:
View: https://imgur.com/a/l6wa4kG


It looks like today's "lock-up" was Windows Update being lovely.
 
Fans: my general thought being that the fans slow/stop when idle and something gets hot or perhaps a cable moves slightly (from a change in air flows) and loses connectivity or causes a short.

= = = =

In Reliability History take another look at the dates showing all three types of errors.

Problems started on 2/16 and stopped on 2/26 for 4 days (inclusive). Then reoccurred today 3/1.

Anything common to those days/dates: backups, updates, certain apps being run, etc.?
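
If the error dates get copied out of Reliability History, the quiet stretches can be found mechanically. A minimal sketch follows; the list of error days is an assumption that reconstructs the reading above (errors every day from 2/16 through 2/25, then again on 3/1), not data taken from the actual report, and `quiet_gaps` is a made-up helper name.

```python
from datetime import date, timedelta


def quiet_gaps(error_days):
    """Return (first_quiet_day, last_quiet_day, length) for each run of days with no errors."""
    days = sorted(set(error_days))
    gaps = []
    for prev, nxt in zip(days, days[1:]):
        missing = (nxt - prev).days - 1  # whole days strictly between two error days
        if missing > 0:
            gaps.append((prev + timedelta(days=1), nxt - timedelta(days=1), missing))
    return gaps


# ASSUMPTION: errors on every day 2/16 through 2/25/2024, then again on 3/1/2024.
error_days = [date(2024, 2, 16) + timedelta(days=i) for i in range(10)]
error_days.append(date(2024, 3, 1))

for start, end, length in quiet_gaps(error_days):
    print(f"{start:%m/%d} to {end:%m/%d}: {length} quiet day(s)")
```

Any gap that turns up can then be checked against when the machine was actually powered on, since a powered-down system logs nothing either.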

Check Task Scheduler for anything that may be triggering other processes at those times.

Are the errors or error patterns identical?

Make note of the error codes.

Any known installs or apps that may have been run on those dates?

Run the built in Windows Troubleshooters - just as a "do over".

Then run "dism" and "sfc /scannow" again.

= = = =

Many of the errors are in Misc. Failures - correct?

If there is an overall pattern of increasing numbers of varying errors then the PSU may be suspect.
 
Here is a summary from HWINFO64. I've wiped the machine again and the only things I have manually installed are Chrome and HWINFO64. I did a reset to defaults on my BIOS, wiped my entire OS drive, and did the install.

System is actually more unstable now than it was before. I've got HWINFO logging at the moment but I don't know if that will actually catch anything. I'm about to wave the white flag and take it to a shop because I need this functional for school purposes - it chose now when I am 1.5 terms away from graduating to start this lol
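
On whether the log will catch anything: the rows written just before a crash are usually the interesting ones. Below is a minimal sketch for pulling the last few samples out of a CSV sensor log. The column names and the sample rows are invented placeholders, not values from this machine, and a real HWiNFO log will have many more columns, different names, and possibly a different text encoding.

```python
import csv
import io
from collections import deque


def last_samples(log_file, columns, n=3):
    """Return the last n rows of a CSV log, keeping only the requested columns."""
    reader = csv.DictReader(log_file)
    tail = deque(maxlen=n)  # only the newest n rows are kept in memory
    for row in reader:
        tail.append({name: row.get(name, "") for name in columns})
    return list(tail)


# ASSUMPTION: placeholder log text with made-up column names and values.
SAMPLE_TEXT = (
    "Date,Time,CPU Temp [C],CPU Vcore [V]\n"
    "1.3.2024,21:58:00,41,1.10\n"
    "1.3.2024,21:58:02,40,1.09\n"
    "1.3.2024,21:58:04,40,0.95\n"
    "1.3.2024,21:58:06,39,0.94\n"
)

wanted = ["Time", "CPU Temp [C]", "CPU Vcore [V]"]
for sample in last_samples(io.StringIO(SAMPLE_TEXT), wanted, n=2):
    print(sample)
```

For a real log, pass an open file instead of the `io.StringIO` object and swap in the column names from the header row of that file.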

Thanks for the info and I will continue to update here since I would like to see if I can find the root cause.

https://1drv.ms/u/s!AgTMZS8A00Rj4UmbwWKV6KqGsi7m?e=dGqgPb
 
In Reliability History take another look at the dates showing all three types of errors.

Problems started on 2/16 and stopped on 2/26 for 4 days (inclusive). Then reoccurred today 3/1.

Anything common to those days/dates: backups, updates, certain apps being run, etc.?

The system was powered down during these times where there was nothing.

Are the errors or error patterns identical?
Timing, no pattern. As for the errors that I can see in the Admin events it's WHEA-Logger APIC ID 0 or 10 with the weird PCI error that has no info. There is occasionally an error about something being dropped in transport.

Any known installs or apps that may have been run on those dates?
I have not installed anything in well over a week. I disabled all start-up apps save for OneDrive. Even now on the fresh wipe I only have Nvidia control panel and OneDrive in the systray.

Run the built in Windows Troubleshooters - just as a "do over".

Then run "dism" and "sfc /scannow" again.
Ran sfc /scannow already - it found corrupted files on a brand new install. Just ran DISM and sfc again and they are clean.

Noticed this in HWINFO as I am typing this: View: https://imgur.com/a/ARgreNN


I'm rebooting now after DISM and I'm running logging from HWINFO64 as well.
 
Out (full disclosure) of my comfort zone now.

Based on HWinfo results (Post #12) - when was the thermal paste redone - if ever?
It’s an iBuyPower system that I had built that has been solid until just recently. Never redone thermal paste.

I just disabled the monitors turning off so I can watch it and HWINFO is logging right now. Funnily enough, it hasn’t crashed in over an hour. I can upload these files as well but I don’t see anything that jumps out right now.
 
I have reread your original complaint and remembered that the issue happens when you are not using the computer. Or has this changed? What settings do you have for Windows Update? I set mine so that Windows updates do not install for some time; that way I can be present and choose which update is installed, so that if there is a problem I can address it then.
 
Monitor: some monitors have their own drivers. Make and model monitor(s)?

The system does not necessarily need to be "working" to have a power glitch cause problems.

All of those "Stopped working" and "Windows was not properly shutdown" suggest to me either a loose connection or faltering/failing PSU.

Click and view the Details - look again at the error code numbers. The error codes per se may or may not be helpful.

When the system is idle, and perhaps has been idle for some time, there may be some background app trying to do things: backup, update, or simply phone home.

Look in Task Scheduler for processes that may be being triggered during non "working" or idle times.

You can also use Process Explorer (Microsoft, free) to look at and identify all running processes.

https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer

And I like @Fix_that_Glitch's idea to stop the updates. Certainly another way to simplify things and gain more control and visibility into what the system is doing or trying to do. Or stops doing when being "idle".

Will defer to others regarding the "certification" errors etc..

= = = =

Do be sure that all important data on the system is indeed backed up at least 2 x to locations away from the system itself. Be sure to verify that the data is recoverable and readable.
 
I have reread your original complaint and remembered that the issue happens when you are not using the computer. Or has this changed? What settings do you have for Windows Update? I set mine so that Windows updates do not install for some time; that way I can be present and choose which update is installed, so that if there is a problem I can address it then.
Sadly this has changed. It is now no longer automatically recovering and presents as a freeze (everything stays on screen just as it was and is totally unresponsive) instead of a crash/reboot cycle OR crashes with zero responsiveness and dark screens and I have to force shutdown and reboot.

As I was reviewing the reliability history I noticed an AMD driver failed to install so I grabbed AMD’s chipset installer and got that added and the others updated. I left it alone overnight with the monitors powered off.

I’ve paused Win updates and run the update troubleshooter, which did find an issue and said it fixed it. Started a new HWINFO log file and it’s sitting now.
 
Monitor: some monitors have their own drivers. Make and model monitor(s)?
Monoprice Zero-G Curved Gaming Monitor - 27 Inch - Model 27C1R
These were operating as generic monitors before my first wipe mentioned in post #1

All of those "Stopped working" and "Windows was not properly shutdown" suggest to me either a loose connection or faltering/failing PSU.

Click and view the Details - look again at the error code numbers. The error codes per se may or may not be helpful.
When checking the problem details, it simply states: The previous system shutdown at 10:08:56 PM on ‎3/‎1/‎2024 was unexpected.

Is there a better resource than checking the Administrative Events in Event Viewer? Is there a good way to share this info?
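
One possible way to get this into a shareable plain-text form is the built-in `wevtutil` command-line tool, which can query the System log with an XPath filter. The sketch below only builds and runs that command from Python; the output file name and the event count are arbitrary choices, `export_whea` is a made-up helper name, and it has not been run against this machine.

```python
import platform
import subprocess


def whea_query_command(max_events=50):
    """Build a wevtutil command that lists the newest WHEA-Logger events as text."""
    xpath = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # only events from the WHEA-Logger provider
        "/f:text",                   # plain-text output, easy to paste into a forum
        f"/c:{max_events}",          # cap on the number of events returned
        "/rd:true",                  # newest events first
    ]


def export_whea(path="whea_events.txt", max_events=50):
    """Run the query and save the output (the file name is an arbitrary choice)."""
    result = subprocess.run(
        whea_query_command(max_events), capture_output=True, text=True, check=True
    )
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return path


if __name__ == "__main__":
    if platform.system() == "Windows":
        print("Saved to", export_whea())
    else:
        # Not on Windows: just show the command that would be run.
        print(" ".join(whea_query_command()))
```

The resulting text file can be pasted or attached directly, which sidesteps the XLSX sharing concern from the first post.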

Look in Task Scheduler for processes that may be being triggered during non "working" or idle times.

You can also use Process Explorer (Microsoft, free) to look at and identify all running processes.

https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer
Checked scheduler and only items I saw were disabled.

Downloaded Process Explorer but this is outside of my comfort zone now. I'm not sure how to utilize this.

And I like @Fix_that_Glitch's idea to stop the updates. Certainly another way to simplify things and gain more control and visibility into what the system is doing or trying to do. Or stops doing when being "idle".

Will defer to others regarding the "certification" errors etc..
I have paused updates and run the update troubleshooter so it's a wait and see on that. Still getting a WHEA-Logger Hardware failure event in Event Viewer though.

And, I realized that I didn't put that this is Windows 11 Pro though, in relation to the issues, I don't believe that could make a difference.
 
Reference:

"WHEA-Logger Hardware failure event"

Google the phrase and read a few of the results. Revise the search criteria based on what you read and other things that you may have noticed with respect to the system problems at hand.

Objective being to narrow down and identify potential culprits.

Be prepared to do a lot of reading. However, it is very likely that you will be able to quickly reject some of the search results. Most likely those that offer some software that will fix the problem. Many of those products will show up no matter what problem is being addressed. Avoid registry edit fixes as well.

You asked for "better resources": One tool is PowerShell and the use of simple Get- cmdlets to obtain specific information. I have not noted any immediate "Get" cmdlets that could be helpful. Will read back and see what, if any, may be diagnostically useful.

And you have gone a ways up the proverbial learning curve. Review what all has been said and done.

Likely the number of potential causes can be reduced - hardware becoming more suspect....

= = = =

As for Process Explorer it can be helpful with respect to identifying processes. Those that are running and those that should or should not be running. Along with how much of any given system resource the process is using.

Unfortunately identifying and discovering which processes are which is problematic. Some running processes may be listed multiple times because other processes are using them.
 
I’m beginning to think I should have left my BIOS alone. I’m currently running a FurMark “torture test” (GPU & CPU) and just hit the 15 minute mark. Goal is to recreate a failure but so far not even a flicker. GPU sitting at or just below 80c and CPU doesn’t go above 74.

I should note I’m not a novice but I’m no expert either. I’ve just never encountered something this damned elusive before. I know it’s typically not recommended to play with a ton, but, could it be worthwhile to downgrade BIOS? Latest version was to address a potential exploit but, I don’t even have secure boot enabled.
 
Well, this entire thread sounds familiar…especially this post: https://www.reddit.com/r/Amd/s/kcu0mkyFbH

It took some digging, but, I found the mentioned setting in my BIOS and made the change. System is now sitting idle and will report back in about an hour.
So, I set my monitors to sleep after 5 minutes and allowed the system to sit. It still crashed with dark screens and no recovery reboot. Had to hold power to turn off and then turned it back on. Good news though! There was no WHEA-Logger event this time. Still dealing with the WindowsPackageManagerServer.exe crash.

Located this thread, seems I'm not the only one with this issue and it's recently come into being: https://answers.microsoft.com/en-us...ror-with/68f2d055-9144-45a5-8be5-6498e2a2bc37

Symptoms align with some of the respondents too; system crash with no video and have to hold power to shut down and then turn the PC back on. However, this is beginning to look less like it's related to the crashes as I just witnessed it get logged when I did a restart of the system.

I'm thinking now that whatever is causing the Event ID 56 may be the culprit of the crashes. I've opened a thread over at Microsoft Answers about it.

"WHEA-Logger Hardware failure event"

Google the phrase and read a few of the results. Revise the search criteria based on what you read and other things that you may have noticed with respect to the system problems at hand.

Objective being to narrow down and identify potential culprits.
It goes without saying that I hate Google's search algorithm sometimes....I have searched those damned WHEA events multiple times and never came across the thread I did today.

I'm going to leave my system as is to watch for WHEA-logger events and I've added my $.02 onto the thread over on the Microsoft forums. I really hope this is a sign of progress...
 
So, no help from the Microsoft answers threads. And, according to the local shop that I took my PC to, they are pointing the finger at me updating my BIOS which seems to be a “known” thing for the ASRock board that I have. They stressed I needed to research this more by searching for the part number of the board that I have.

I may get it back and see if I can downgrade the BIOS back to stability. If that doesn’t work I’ll replace it with ASUS or Gigabyte.
 
