I've had 2 fail-to-boots: no POST lights, with all fans going full speed (which probably means the MOBO/fan control failed to start).
Restarting again resolves the fail-to-boot: 10-20s delay with orange DRAM, red CPU POST lights. These eventually resolve and the machine boots slowly (for an SSD) to desktop.
MOBO: Micro-Star International Co. Ltd. PRO B650-P WIFI (MS-7D78) (AM5)
CPU: AMD Ryzen 7 7700X 8-Core Processor
GPU: NVIDIA GeForce RTX 2070
RAM: G.Skill Flare X5 16GB (x2)
Drives:
ST2000VX008-2E3164
WDC WD20EARS-00MVWB0
WDC WDS100T2G0A-00JH30 (SSD)
Samsung SSD 850 EVO 500GB (SSD)
First fail-to-boot was on 10/15/23. The machine worked normally for browsing/streaming, but when I tried to play a variety of titles I would get sudden CTDs 5-15 minutes in. Here's my data for that:
AOE2: DE
Had hiccups, but did not crash
Wingspan & Vermintide 2
Crash w/o error log
Starship Troopers: Extermination crashed the fastest, so I farmed some error messages from it. They were a series of EXCEPTION_ACCESS_VIOLATION errors at a range of addresses, as well as a DEVICE_HUNG error code.
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000024d3bb65944
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000068
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0xffffffffffffffff
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x000001e4c2d25648
[File:C:\buildWork\9295552041ccd50b\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12Util.cpp] [Line: 873] CurrentQueue.Fence.D3DFence->GetCompletedValue() failed at C:\buildWork\9295552041ccd50b\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12Submission.cpp:939 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG
Deep Rock Galactic had a very similar crash log:
Fatal error: [File:Unknown] [Line: 684] pResource->Map(Subresource, pReadRange, reinterpret_cast<void**>(&pData)) failed at C:\BuildAgent\work\DRG-release\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12RHIPrivate.h:1181 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG
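For anyone who wants to sort through the access violations above, here is a minimal Python sketch that tallies the faulting addresses from the Starship Troopers log. The `classify` labels are my own rough heuristic (an assumption, not an official taxonomy): addresses below 64 KB look like null-pointer dereferences, and all-ones is what Windows commonly reports for a non-canonical address.

```python
import re
from collections import Counter

# The six distinct log lines pasted from the crash report above.
LOG = """\
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000024d3bb65944
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000068
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0xffffffffffffffff
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x000001e4c2d25648
"""

ADDR_RE = re.compile(r"EXCEPTION_ACCESS_VIOLATION reading address (0x[0-9a-fA-F]+)")


def tally_addresses(log_text):
    """Count how often each faulting address (as an int) appears in the log."""
    return Counter(int(m.group(1), 16) for m in ADDR_RE.finditer(log_text))


def classify(addr):
    """Rough, hypothetical labels for a faulting address (heuristic only)."""
    if addr < 0x10000:
        return "near-null"   # inside the first 64 KB, typical of a null pointer plus offset
    if addr == 0xFFFFFFFFFFFFFFFF:
        return "all-ones"    # often reported when the real address was non-canonical
    return "heap-like"       # an ordinary-looking user-mode address


counts = tally_addresses(LOG)
for addr, n in sorted(counts.items()):
    print(f"{addr:#018x}  x{n}  {classify(addr)}")
```

The point is only to show how scattered the addresses are. A spread like this does not identify the cause by itself.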
Event Viewer - Applications pointed me at both my GPU (Nvidia) and CPU (AMD) drivers potentially being at fault. So I did clean installs of the latest drivers for both.
The description for Event ID 2 from source NVIDIA OpenGL Driver cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
DrvSetContext failed functionality indeterminant
(pid=22412 radeonsoftware.exe 64bit)
The system cannot find the file specified
The description for Event ID 2 from source AMD_ANR_BG_PROC cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
AMD_ANR_BG_PROC
Function: ret=0 size=6 called.
The message resource is present but the message was not found in the message table
Event Viewer - System supported my suspicion of a GPU culprit, with some nvlddmkm (Nvidia) errors surrounding the crashes:
The description for Event ID 0 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.
If the event originated on another computer, the display information had to be saved with the event.
The following information was included with the event:
\Device\00000103
Error occurred on GPUID: 100
The message resource is present but the message was not found in the message table
Reliability Monitor had a list of hardware errors matching my crashes. For example:
Description
A problem with your hardware caused Windows to stop working correctly.
Problem signature
Problem Event Name: LiveKernelEvent
Code: 141
Parameter 1: ffff918c0a7e6050
Parameter 2: fffff800c10c9080
Parameter 3: 0
Parameter 4: 810
OS version: 10_0_22621
Service Pack: 0_0
Product: 768_1
OS Version: 10.0.22621.2.0.0.768.101
Locale ID: 1033
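To make the problem signature above easier to compare across crashes, here is a small Python sketch that parses the "Key: value" lines and looks up the code. It assumes the Code field is a hexadecimal bug check number. The two names in `KNOWN_CODES` are the ones I believe Microsoft's bug check reference lists for 0x117 and 0x141, so treat that table as an assumption to verify rather than a complete mapping.

```python
# Problem signature fields copied from the Reliability Monitor entry above.
SIGNATURE = """\
Problem Event Name: LiveKernelEvent
Code: 141
Parameter 1: ffff918c0a7e6050
Parameter 2: fffff800c10c9080
Parameter 3: 0
Parameter 4: 810
"""

# Assumption: a tiny hand-made lookup table, not an exhaustive or official list.
KNOWN_CODES = {
    0x117: "VIDEO_TDR_TIMEOUT_DETECTED",
    0x141: "VIDEO_ENGINE_TIMEOUT_DETECTED",
}


def parse_signature(text):
    """Split 'Key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields):
    """Return a one-line summary of the event, treating Code as hexadecimal."""
    code = int(fields["Code"], 16)
    name = KNOWN_CODES.get(code, "unknown")
    return f"{fields['Problem Event Name']} 0x{code:X}: {name}"


fields = parse_signature(SIGNATURE)
print(describe(fields))
```

If that reading of the code is right, it points at a GPU engine timeout, which would line up with the DEVICE_HUNG errors in the game logs.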
XMP Solution
Eventually in my searches I came across a tip that XMP (RAM overclocking) might be the cause. It seemed like a stretch, but I investigated it. I went into UEFI and found that my RAM was set to not overclock (XMP off), so I turned it on. After that, games were able to run without crashing. I feel like I'm getting lower FPS, but I don't have hard data to compare against.
I didn't understand why this would solve the crashing, and I was hoping for an answer anyway, but then I had a second fail-to-boot this morning (10/18/23). The machine started on the 2nd try, but now I am concerned that there is a deeper issue that I need to diagnose/explain, and I would really like some help with it.
What I have tried so far
- Updated/Fresh Installed Nvidia drivers
- Updated/Fresh Installed AMD drivers
- Scanned drives for errors - clean
- sfc /scannow - clean
- Defragged all drives
- Monitored CPU/GPU temps (both under 70 °C when gaming/crashing)
- Virus Scan
- Verified Game files
- Reinstalled games to same/different drives (no change)