Question Finding problem for WHEA Uncorrectable Error

blue4fun

Distinguished
Feb 2, 2012
I moved my setup to the other side of my room, and ever since I've been getting WHEA Uncorrectable Error crashes. The first one happened when I loaded into a competitive Overwatch game with Wuthering Waves open while on a Discord call. For a while I could trigger crashes pretty consistently in this situation. Not sure if it matters, but the games are on different drives.

I've updated all my drivers, run Windows Memory Diagnostic, and scanned both NVMe drives with Samsung Magician, neither of which showed any problems. I also reseated the RAM, which didn't help. Temps look fine too. I disconnected and reconnected the 24-pin power connector and the 8-pin CPU power cable, and the problem seemed to subside, though that may have been a coincidence. I then got a new desk and moved my PC around a bit, and the problem came right back. This time it happened while I was downloading a game to my main drive and playing Wuthering Waves while on a Discord call. It happened again later during a competitive Overwatch match, also while on a call.

I only have dump files for the most recent crashes, because none were being written until I changed the paging file setting to 'system managed' on my second drive.

https://files.catbox.moe/wslmhx.dmp

https://files.catbox.moe/n9risn.dmp

https://files.catbox.moe/4swwqr.dmp

https://files.catbox.moe/vq5r9p.dmp

Here is my sysinfo file.

The above was a reddit post from about 2 weeks ago that no one responded to :( The crashes persist.

Since then, I have run all 3 Prime95 tests for an hour or longer with no issue. I also ran MemTest86 for at least one pass on all four sticks of RAM individually, as well as separate tests for each slot. All tests showed no errors. I ran CrystalDiskMark on both drives multiple times with no issues or crashes.

Other info:
Three volmgr errors are listed in Event Viewer before each crash: 161, 45 and 46. These all occur on my C: drive. If I set my D: drive to have a paging file, I get minidumps, except for the past couple of crashes.

Not sure what else to do to figure out which part is the problem. Since the dumps reference a driver error, should I try a Windows repair/reinstall?
 

ubuysa

Distinguished
From the dumps, which are all the same, the issue does appear to be with one of those NVMe drives, though I'm not able to tell which one. Below is a stack trace from one of the dumps. You read these from the bottom up (it's a push-down stack)...
Code:
11: kd> k
 # Child-SP          RetAddr               Call Site
00 ffffd787`eeb70458 fffff804`119bc67f     nt!KeBugCheckEx
01 ffffd787`eeb70460 fffff804`119bd0e9     nt!WheaReportHwError+0x4cf
02 ffffd787`eeb70530 fffff804`119bd205     nt!WheaHwErrorReportSubmitDeviceDriver+0xe9
03 ffffd787`eeb70560 fffff804`14443891     nt!WheaReportFatalHwErrorDeviceDriverEx+0xf5
04 ffffd787`eeb705c0 fffff804`1443cc70     storport!StorpWheaReportError+0x9d
05 ffffd787`eeb70650 fffff804`1440f0cc     storport!StorpMarkDeviceFailed+0x358
06 ffffd787`eeb708e0 fffff804`144cb57d     storport!StorPortNotification+0x91c
07 ffffd787`eeb709b0 fffff804`144ce78e     stornvme!ControllerReset+0x1a1
08 ffffd787`eeb70a30 fffff804`144cd6ef     stornvme!NVMeControllerReset+0x10a
09 ffffd787`eeb70a60 fffff804`1443a245     stornvme!NVMeControllerAsyncResetWorker+0x3f  <=====THIS IS WHERE THE PROBLEM STARTS
0a ffffd787`eeb70a90 fffff804`116ef275     storport!StorPortWorkItemRoutine+0x45
0b ffffd787`eeb70ac0 fffff804`116c90c5     nt!IopProcessWorkItem+0x135
0c ffffd787`eeb70b30 fffff804`11748da5     nt!ExpWorkerThread+0x105
0d ffffd787`eeb70bd0 fffff804`11806b58     nt!PspSystemThreadStartup+0x55
0e ffffd787`eeb70c20 00000000`00000000     nt!KiStartSystemThread+0x28
You see the system thread start, followed by the I/O manager processing a work item. Since the I/O is for a storage device, the Windows storport.sys storage driver runs a work item for that device. Since the device is an NVMe drive, the Windows stornvme.sys driver then becomes involved in order to manage access to it. You can see that as soon as we access the drive controller there is a problem, as seen by the call to the stornvme!NVMeControllerAsyncResetWorker function to reset (and restart) the work item (started by storport.sys). This is followed by a reset of the NVMe controller, which fails. Then we start the notification chain reporting the NVMe controller failure, which ultimately results in a WHEA BSOD.
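If it helps to see the bottom-up reading spelled out, here is a minimal Python sketch. It is not part of any debugger tooling, and the names parse_frames, call_order and first_frame_in are made up purely for illustration. It takes the frames from the trace above as pasted text, reverses them into the order the calls actually happened, and picks out the first stornvme frame. It assumes the plain `k` output layout shown above (frame number, Child-SP, RetAddr, Call Site).

```python
import re

# Frames pasted from the `k` output above (annotation on frame 09 left out).
STACK = """\
00 ffffd787`eeb70458 fffff804`119bc67f     nt!KeBugCheckEx
01 ffffd787`eeb70460 fffff804`119bd0e9     nt!WheaReportHwError+0x4cf
02 ffffd787`eeb70530 fffff804`119bd205     nt!WheaHwErrorReportSubmitDeviceDriver+0xe9
03 ffffd787`eeb70560 fffff804`14443891     nt!WheaReportFatalHwErrorDeviceDriverEx+0xf5
04 ffffd787`eeb705c0 fffff804`1443cc70     storport!StorpWheaReportError+0x9d
05 ffffd787`eeb70650 fffff804`1440f0cc     storport!StorpMarkDeviceFailed+0x358
06 ffffd787`eeb708e0 fffff804`144cb57d     storport!StorPortNotification+0x91c
07 ffffd787`eeb709b0 fffff804`144ce78e     stornvme!ControllerReset+0x1a1
08 ffffd787`eeb70a30 fffff804`144cd6ef     stornvme!NVMeControllerReset+0x10a
09 ffffd787`eeb70a60 fffff804`1443a245     stornvme!NVMeControllerAsyncResetWorker+0x3f
0a ffffd787`eeb70a90 fffff804`116ef275     storport!StorPortWorkItemRoutine+0x45
0b ffffd787`eeb70ac0 fffff804`116c90c5     nt!IopProcessWorkItem+0x135
0c ffffd787`eeb70b30 fffff804`11748da5     nt!ExpWorkerThread+0x105
0d ffffd787`eeb70bd0 fffff804`11806b58     nt!PspSystemThreadStartup+0x55
0e ffffd787`eeb70c20 00000000`00000000     nt!KiStartSystemThread+0x28
"""

# Two hex digits, two address columns, then module!function with an optional +0x offset.
FRAME_RE = re.compile(
    r"^(?P<num>[0-9a-f]{2})\s+\S+\s+\S+\s+(?P<module>[^!\s]+)!(?P<func>[^+\s]+)",
    re.IGNORECASE,
)


def parse_frames(text):
    """Return (frame_number, module, function) for every line that looks like a frame."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m["num"], 16), m["module"], m["func"]))
    return frames


def call_order(frames):
    """Oldest call first, i.e. the reverse of how the debugger prints the stack."""
    return sorted(frames, key=lambda f: f[0], reverse=True)


def first_frame_in(frames, module):
    """Earliest call (in time) that belongs to the given module, or None."""
    for frame in call_order(frames):
        if frame[1].lower() == module.lower():
            return frame
    return None


frames = parse_frames(STACK)
for num, module, func in call_order(frames):
    print(f"{num:02x}  {module}!{func}")
print("first stornvme frame:", first_frame_in(frames, "stornvme"))
```

The listing starts at nt!KiStartSystemThread and ends at nt!KeBugCheckEx, which is the same order the explanation above follows.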

There is clearly something amiss with one of your 980 Pro drives, but sadly I'm not able to identify which one was involved in this failure. From what you describe in your OP I think the C: drive is most likely the one with the problem. Can you download Samsung Magician and use that to run a full diagnostic on the C: drive (both drives actually)? Also use it to check for firmware updates for either drive.

Also, in my experience M.2 ports are less than perfectly reliable. You might try removing the C: drive and reinserting it fully. You may need to replace any heatsink pad that was on the drive.