Question Boot Fails & CTD across game titles

Pandawaffle

Honorable
Apr 1, 2017
I've had 2 fail-to-boots: no POST lights, with all fans running at full speed (which probably means the MOBO/fan control failed to start).
Restarting again resolves the fail-to-boot: there is a 10-20s delay with the orange DRAM and red CPU POST lights lit. These eventually clear and the machine boots slowly (for an SSD) to desktop.
MOBO: Micro-Star International Co. Ltd. PRO B650-P WIFI (MS-7D78) (AM5)
CPU: AMD Ryzen 7 7700X 8-Core Processor
GPU: NVIDIA GeForce RTX 2070
RAM: G.Skill Flare X5 16GB (x2)
Drives:
ST2000VX008-2E3164
WDC WD20EARS-00MVWB0
WDC WDS100T2G0A-00JH30 (SSD)
Samsung SSD 850 EVO 500GB (SSD)

First fail-to-boot was on 10/15/23. The machine worked normally for browsing/streaming, but when I tried to play a variety of titles I would experience sudden CTDs 5-15 minutes in. Here's my data for that:

AOE2: DE
Had hiccups, but did not crash

Wingspan & Vermintide 2
Crash w/o error log

Starship Troopers: Extermination crashed the fastest, so I farmed some error messages from it. They were a series of EXCEPTION_ACCESS_VIOLATION errors with a range of addresses, as well as a DEVICE_HUNG error code.

Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000024d3bb65944
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000068
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0xffffffffffffffff
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x000001e4c2d25648

[File:C:\buildWork\9295552041ccd50b\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12Util.cpp] [Line: 873] CurrentQueue.Fence.D3DFence->GetCompletedValue() failed at C:\buildWork\9295552041ccd50b\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12Submission.cpp:939 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG
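To make the spread of faulting addresses easier to see, here is a minimal Python sketch that buckets the access-violation lines above. The bucket names and the `classify` helper are illustrative assumptions, not anything the game or Windows reports; the 0x10000 cutoff reflects the first 64 KB of address space that Windows keeps unmapped to catch null-pointer accesses.

```python
import re
from collections import Counter

# The exception lines exactly as pasted from the crash reports above.
LOG = """\
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000024d3bb65944
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000068
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000003df5
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0xffffffffffffffff
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x000001e4c2d25648
"""

ADDRESS_RE = re.compile(
    r"EXCEPTION_ACCESS_VIOLATION reading address (0x[0-9a-fA-F]+)"
)


def classify(address: int) -> str:
    """Put an address into a rough, illustrative bucket (assumed labels)."""
    if address == 0xFFFFFFFFFFFFFFFF:
        return "all-ones (invalid pointer)"
    if address < 0x10000:
        # Inside the unmapped low 64 KB: typically a small offset from null.
        return "near-null (offset from a null pointer)"
    return "user-space pointer"


def summarize(log: str) -> Counter:
    """Count how many faulting addresses fall into each bucket."""
    addresses = [int(match, 16) for match in ADDRESS_RE.findall(log)]
    return Counter(classify(address) for address in addresses)


for bucket, count in summarize(LOG).most_common():
    print(f"{count} x {bucket}")
```

The addresses land in several different buckets rather than repeating one fault. Taken together with crashes across unrelated games, that is at least consistent with a system-level cause (memory, GPU, or drivers) rather than a bug in one title, though it does not prove it.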

Deep Rock Galactic had a very similar crash log:

Fatal error: [File:Unknown] [Line: 684] pResource->Map(Subresource, pReadRange, reinterpret_cast<void**>(&pData)) failed at C:\BuildAgent\work\DRG-release\UnrealEngine\Engine\Source\Runtime\D3D12RHI\Private\D3D12RHIPrivate.h:1181 with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_HUNG

Event Viewer - Applications pointed me at both my GPU (Nvidia) and CPU (AMD) drivers potentially being at fault. So I did clean installs of the latest drivers for both.
The description for Event ID 2 from source NVIDIA OpenGL Driver cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

DrvSetContext failed functionality indeterminant
(pid=22412 radeonsoftware.exe 64bit)

The system cannot find the file specified

The description for Event ID 2 from source AMD_ANR_BG_PROC cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

AMD_ANR_BG_PROC
Function: ret=0 size=6 called.

The message resource is present but the message was not found in the message table

Event Viewer - System supported my suspicion of a GPU culprit with some nvlddmkm (NVIDIA) errors surrounding the crashes:
The description for Event ID 0 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\00000103
Error occurred on GPUID: 100

The message resource is present but the message was not found in the message table

Reliability Monitor had a list of Hardware errors to match my crashes. E.g.
Description
A problem with your hardware caused Windows to stop working correctly.

Problem signature
Problem Event Name: LiveKernelEvent
Code: 141
Parameter 1: ffff918c0a7e6050
Parameter 2: fffff800c10c9080
Parameter 3: 0
Parameter 4: 810
OS version: 10_0_22621
Service Pack: 0_0
Product: 768_1
OS Version: 10.0.22621.2.0.0.768.101
Locale ID: 1033
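For anyone reading a problem signature like the one above, the following Python sketch splits it into fields and looks up the code. It assumes the `Code` value is hexadecimal and corresponds to a Windows bug check code; the `KNOWN_CODES` table is a hand-picked, incomplete assumption covering only two GPU-timeout codes.

```python
# A subset of the problem signature, as copied from Reliability Monitor above.
SIGNATURE = """\
Problem Event Name: LiveKernelEvent
Code: 141
Parameter 1: ffff918c0a7e6050
Parameter 2: fffff800c10c9080
Parameter 3: 0
Parameter 4: 810
OS version: 10_0_22621
"""

# Assumption: the code is shown in hex and matches a bug check code.
# Only two video-related codes are listed here; the table is not exhaustive.
KNOWN_CODES = {
    0x117: "VIDEO_TDR_TIMEOUT_DETECTED",
    0x141: "VIDEO_ENGINE_TIMEOUT_DETECTED",
}


def parse_signature(text: str) -> dict:
    """Turn 'Key: value' lines into a dictionary, skipping malformed lines."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields: dict) -> str:
    """Return a one-line summary of the event and its (assumed) meaning."""
    code = int(fields["Code"], 16)
    name = KNOWN_CODES.get(code, "unknown")
    return f"{fields['Problem Event Name']} 0x{code:X}: {name}"


print(describe(parse_signature(SIGNATURE)))
```

Under that assumption, code 141 points at a GPU engine timeout, which lines up with the DXGI_ERROR_DEVICE_HUNG messages in the game logs.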

XMP Solution
Eventually in my searches I came across a tip that XMP (RAM overclocking) might be the cause. It seemed like a stretch, but I investigated it. I went into UEFI and found that RAM overclocking was disabled, so I turned it on. After that, games were able to run without crashing. I feel like I'm getting lower FPS, but I don't have hard data to compare to.
I didn't understand why this would solve the crashing, and I was hoping for an answer anyway, but then I had a second fail-to-boot this morning (10/18/23). It started on the 2nd try, but now I am concerned that there is a deeper issue that I need to diagnose/explain, and I would really like some help with it.

What I have tried so far
  • Updated/Fresh Installed Nvidia drivers
  • Updated/Fresh Installed AMD drivers
  • Scanned drives for errors - clean
  • sfc /scannow - clean
  • Defragged all drives
  • Monitored CPU/GPU temps (both under 70 °C when gaming/crashing)
  • Virus Scan
  • Verified Game files
  • Reinstalled games to same/different drives (no change)
 
To have a look at what the problem could be:
run userbenchmark.com and post the http link of your result, e.g. https://www.userbenchmark.com/UserRun/28977730

Reset the BIOS by jumper clrCMOS or JBAT or similar (you may have to set the boot priority correctly after that)

check windows integrity
open the command prompt as administrator and type DISM /Online /Cleanup-Image /RestoreHealth
https://www.lifewire.com/how-to-open-an-elevated-command-prompt-2618088
https://answers.microsoft.com/en-us...em-files/bc609315-da1f-4775-812c-695b60477a93
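If you would rather script the integrity checks than type them, a minimal Python sketch could look like the following. It uses only the DISM command given here and the `sfc /scannow` command already mentioned in the thread; running DISM first is a common convention that is assumed, and `run_integrity_checks` with its `runner` parameter is a hypothetical helper name.

```python
import subprocess

# Commands taken from this thread. Order (DISM, then sfc) is an assumption.
COMMANDS = [
    ["DISM", "/Online", "/Cleanup-Image", "/RestoreHealth"],
    ["sfc", "/scannow"],
]


def run_integrity_checks(runner=subprocess.run):
    """Run each command in order and return (command line, exit code) pairs.

    The runner parameter exists so the function can be exercised without
    actually invoking the Windows tools.
    """
    results = []
    for argv in COMMANDS:
        completed = runner(argv)
        results.append((" ".join(argv), completed.returncode))
    return results
```

Calling `run_integrity_checks()` only makes sense on Windows from an elevated (administrator) prompt, since both tools refuse to run without admin rights. A non-zero exit code from either one is worth posting along with the tool's output.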

clean boot

check the memory by running memtest.org usb autoinstaller (bootable USB flash drive)

check the hard drive for errors with its manufacturer's tool and if available, update the firmware

use DDU (Display Driver Uninstaller) to remove the graphics driver, then reinstall the latest one