Question Recurring BSOD: WHEA_Uncorrectable_Error?

Jun 19, 2022
Hello everyone, I’ve had a continuing WHEA_Uncorrectable_Error with my system that I was hoping to get some advice on.

Specs:
Ryzen 5900X (no overclocking/BIOS settings changes beyond XMP)
MSI Ventus 3X 3060 Ti
MSI X570 Gaming Edge WiFi
Corsair CX750 PSU
Corsair Vengeance Pro 3 GHz 4x8 GB RAM
2 Samsung 980 Pros (one is the boot drive)
2 additional SSDs
2 additional HDDs

The problem started about 3-4 months ago. I was in the middle of my busy season and had managed to get hold of a 3060 Ti while the market was still hot. I didn’t get to use it for leisure for a few weeks and was mostly just working in programs through Chrome. Then I started playing some Crusader Kings 3 and Civilization 6. At first all went well: the games ran great without any issues, and I was having a grand ol’ time until the crashes came. I started getting random BSODs. I didn’t think anything of it at first, but then they increased in frequency. They were relatively random, with anywhere from a few hours to a few minutes between them. At one point I could replicate it by running CK3 while moving/copying a large file.

My first round of troubleshooting was the basics: updated Windows, updated drivers, updated the BIOS, ran Check Disk, Windows Memory Diagnostic, memtest, etc. I tried looking for dump files, but no matter the Windows settings, they wouldn’t generate. Each time the issue seemed to go away, it would reemerge a day or two later. I tried reseating the GPU and RAM with no luck, then the CPU, with no difference. Because the crashes started so soon after I upgraded my GPU, I assumed the card was at fault and RMA’ed it. I still had my 1070 Ti, so I swapped it in. And at first, peace returned to the land. I was able to repeat what had been consistently causing a crash without issue, and life was good, until about a week or two later.

The error returned, along with sadness, and more troubleshooting led nowhere. I reinstalled Windows and that helped for a moment, but again it returned. After a recommendation in a tech support group, I found additional error codes indicating that Windows believed the error was tied to the NVMe slot holding my 980 Pro. So after trying a few firmware and software updates to no avail, I RMA’ed the 980, temporarily switched to another SSD, ordered a new 980 in the meantime, and life proceeded apace. For about one to two months it was gone, even under heavy use, until early this past week, when it reappeared while playing Cyberpunk 2077.
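
For anyone trying to pull the same kind of error codes when dump files won't generate, here is a minimal sketch of reading WHEA records out of the System event log. It assumes the errors are logged under the Microsoft-Windows-WHEA-Logger provider and that `wevtutil` is on the PATH (Windows only). The function names and the dictionary layout are my own, purely for illustration, and the fields inside each record will vary by platform.

```python
import subprocess
import xml.etree.ElementTree as ET

# XML namespace used by Windows event records.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}
# XPath filter: only events written by the WHEA logger provider (assumed source).
WHEA_QUERY = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"


def parse_whea_events(xml_text):
    """Parse wevtutil XML output, which is a run of <Event> elements with no root."""
    # Wrap the fragments in a dummy root so the text is well-formed XML.
    root = ET.fromstring("<Events>" + xml_text + "</Events>")
    events = []
    for ev in root.findall("e:Event", NS):
        system = ev.find("e:System", NS)
        # Collect the named <Data> fields; which names appear depends on the error.
        data = {
            d.get("Name"): (d.text or "")
            for d in ev.iterfind("e:EventData/e:Data", NS)
        }
        events.append({
            "event_id": int(system.findtext("e:EventID", namespaces=NS)),
            "time": system.find("e:TimeCreated", NS).get("SystemTime"),
            "data": data,
        })
    return events


def fetch_whea_events(count=20):
    """Query the newest `count` WHEA records from the System log (Windows only)."""
    cmd = [
        "wevtutil", "qe", "System",
        "/q:" + WHEA_QUERY,
        "/f:xml",            # XML output, one <Event> element per record
        "/c:" + str(count),  # limit the number of records
        "/rd:true",          # newest first
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_whea_events(out)
```

Calling `fetch_whea_events()` from an elevated prompt and printing each record's `time`, `event_id` and `data` gives a timeline you can line up against the crashes. The same records are visible in Event Viewer under Windows Logs > System if you filter by the WHEA-Logger source.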

One of the strange parts of the issue is that putting the system under load with typical tools doesn’t trigger it. Running FurMark and Cinebench or other CPU stressors at the same time caused no issues, even after a couple of hours. But going into a game? No such luck.

My main concern is that I’m going to RMA another component only to have the problem re-emerge, or take it to a repair shop for them to “fix” the issue only to have it reappear in another month or two (through no fault of theirs).

My main question is whether there is a component that’s more likely to be the culprit, or additional steps I can take. Alternatively, could this error be related to house power? My rig is on a crowded power strip on a relatively busy circuit, but I don’t know if the symptoms reflect power issues.

Other places have suggested that I may have lost the silicon lottery with my 5900X, and I have submitted an RMA request, but I want to exhaust other possibilities first.

Sorry for the long post, but I hope it provides enough information.

Thank you for your help and time and I am happy to provide any additional information.
 

AMD is not publishing current errata lists for their CPUs (the published ones are 4 years old now). If you got the CPU on the grey market, I would just do an exchange if you can get one. I have debugged many AMD problems down to the point where the only remaining suspect was the CPU. All of the problem chips were X versions shipped out of China. Patches were not available outside of that market, and I only found info about the bugs by translating websites written in Chinese. The patches were not on other websites since the CPU was not shipped out of China (except via grey market vendors).
 

BGillen

Honorable
Jun 17, 2016
What method did you use to get dump files?
I'm having a very similar issue. I built a new system and had no problems. Then I upgraded the GPU, and now it crashes in games after a random amount of time, but it can run FurMark overnight with no problems. I RMA'ed the GPU and the problem still persists.