[SOLVED] How to diagnose failing hardware? WHEA uncorrectable error

Nov 20, 2021
5
0
10
Current pc specs:
Gigabyte 3080 ti

Ryzen 5800x
x5 Noctua NF-12 fans
NH-U12A cpu fan
EVGA 220-g5-0850-x1 850 watt power supply
X570-pro asus tuf mother board
x2 Crucial Ballistix 3200 ram BL2K16G32C16U4B
Samsung NVMe 970 evo plus 2tb
Corsair iCue 4000x mid tower


Original pc specs:
Gigabyte 3080 ti

Ryzen 5800x
x4 Corsair stock fans
Cooler master hyper 212 cpu cooler
Corsair RMx 1000w power supply
X550-plus asus tuf mother board
x2 Crucial Ballistix 3200 ram BL2K16G32C16U4B
Samsung NVMe 970 evo plus 2tb
NZXT H510 mid tower


The following are some of the different things I've tried.
I've swapped out my 3080 ti to my working 1060 6gb.
My CPU was reaching temperatures in the low 90s (degrees C) with the original PC specs. My original CPU cooler was a Cooler Master Hyper 212, but I was advised it might not be enough, so I swapped out the case fans and CPU cooler for Noctua ones.
I've swapped out the power supply from a corsair to an evga.
I've tested the pc with a single ram stick installed at a time by inserting it into the A2 slot. I also ran windows memory diagnostic tool and got no errors.
I've updated to the latest BIOS for the motherboard (4021): https://www.asus.com/Motherboards-C...ming/TUF-GAMING-X570-PRO-WI-FI/HelpDesk_BIOS/
I've done a clean Windows install after the component changes (this is one of the last things I tried).
All the previous changes still resulted in a bsod when gaming.


So far the only parts I haven't swapped out of the system are the NVMe, cpu, and ram sticks. I get the whea uncorrectable error when gaming. I used core temp to log the cpu temp while gaming but the highest it reached was 68 degrees c before crashing.
Since it was first assembled, the PC has been displaying WHEA uncorrectable errors when gaming. My first reassembly was done by a PC repair shop, which moved all the components into a new mid tower with the new X570-Pro motherboard.
Second change was swapping out the cooler master 212 with a NH-u12 noctua cpu fan done by a different repair shop. He also advised me to swap out the stock corsair case fans with new ones since it might not be enough to properly cool all my components so I swapped out those stock fans with noctua case fans.

Any idea as to what other things I should try and look into? Getting pretty desperate.
 
Solution
If your CPU was running "hot", and we have no idea what your idea of "hot" entails (Which would be GOOD to know), then there is a problem with the installation of your CPU cooler because that cooler should not have any problems keeping that CPU within recommended specifications even without terrific case cooling. In fact, a single intake and a single exhaust, with that CPU cooler, on that CPU, should not overheat if you are not overclocking. It might not maintain high boost rates well without decent case cooling, but it should not overheat and certainly should not throw any errors from thermal conditions. I'd double check the backplate and mounting hardware, make certain that if there was protective plastic covering the bottom of the heatsink that you took that off and that you PROPERLY applied the thermal interface material (thermal paste) in an amount and in a way that is in accordance with recommendations or at least that you didn't just put a tiny drop or a giant glob on, but instead, something in between like about 1/4 the size of the exposed part of a #2 pencil eraser, dead center on the top of the CPU heat spreader.

There are also other, more CPU model specific recommendations out there such as those available at gamersnexus for various Ryzen models which have at least minor differences in application processes compared to previous generations of both AMD and Intel processors.

Enough on the cooling for now. I think it likely that you simply have a backplate or cooler mounting hardware that is not correctly installed or is not fully tightened all the way around. It's also possible that your problem COULD result from such a condition. We know that in some cases where a CPU cooler is unevenly tightened in one area or, for example, a pushpin or screw has popped free or worked itself loose, it can break contact between some of the CPU's pins and the motherboard socket, resulting in what seems like memory or other issues, and therefore, obviously, errors. Perhaps that's not the case here, but I'd start by REALLY checking that, even to the point of perhaps reinstalling the cooler just to be sure.

Has this problem existed since the system was assembled or was it working fine and then at some point began to have problems?

What does "tried testing a single RAM stick at a time" and "tried to update BIOS" mean? You have either done those things, or you have not done them, there is no "tried". If you tested each DIMM individually, did you have them installed in the A2 slot, which is the second DIMM slot over from the CPU socket on this board? Did you have both sticks installed in the A2 and B2 slots, which are the second and fourth slots over from the CPU socket, with the fourth slot being the one closest to the edge of the motherboard, when you had both DIMMs installed and were having problems? If not, try that, because that is where they belong.

Did you ACTUALLY update the BIOS, or did you attempt to and were not able to accomplish it? If you did, what version are you currently on?

For the record, 68 degrees is not "running pretty hot", so if that was your peak temperature at any point, you can disregard the above cooling suggestions because it's likely you do not have a cooling problem in the first place.

Have you installed the latest chipset, networking and audio drivers from the X570-Pro product support pages on the ASUS website?

Have you tried a clean install of the graphics drivers using the DDU?

Are you using a Windows installation that was used previously with another system, or is this a CLEAN install that was performed AFTER the assembly of the system which was never in use on any prior system?
 

With the first reassembly a pc tech reinstalled the hyper 212 cpu cooler for me and he did let me know I used too much thermal paste the first time. He didn't mention anything else about improperly installing it besides that.

This problem has persisted since first building the pc.

I tested running the pc with a single ram stick in the A2 slot with both ram sticks separately. I currently have both ram sticks installed in the A2 and B2 slots.

I am not sure about having the latest chipset, networking, and audio drivers installed. I don't think I ever manually installed any drivers related to my mother board for those.

I've tried reinstalling my graphics drivers but not using the DDU. I will try that.

It was a clean Windows install when first assembled. I then wiped and reinstalled Windows a second time after making all the component changes, which was one of the last things I tried.
 
So, I would download and install ALL of these. They are critical, and they (or a newer version, if one exists at the time you do the installation) should usually be installed almost immediately after completing any Windows 10 installation. Note that if you later install some other version than Windows 10, like Windows 11 or whatever it is, you may need to go to the support page for your motherboard and download the drivers meant for that OS version instead.

It's equally possible that after looking you might find that the driver for 10 and 11 (or whatever version it happens to be) is the same driver in some cases. Also, these could at some point be updated with newer Windows 10 versions, or begin having separate versions later down the road, so it always bears checking the page for the latest versions and then downloading and installing them anytime you've just done a clean install of Windows.

Chipset drivers: https://dlcdnets.asus.com/pub/ASUS/...AMD_AM4_SZ-TSD_W11_64_V31022706_20211116R.zip

LAN drivers: https://dlcdnets.asus.com/pub/ASUS/mb/04LAN/DRV_LAN_Intel_I225_SZ-TSD_W10_64_V10214_20211019R.zip

Wireless network adapter: https://dlcdnets.asus.com/pub/ASUS/...Intel_All_SZ-TSD_W11_64_V228011_20211022R.zip

Audio chipset: https://dlcdnets.asus.com/pub/ASUS/...tek_Audio_Driver_V6.0.8971.1_WIN10_64-bit.zip

Bluetooth adapter: https://dlcdnets.asus.com/pub/ASUS/...el_non211_SZ-TSD_W11_64_V228004_20211022R.zip

All of which can be found here currently:

 

I installed all the linked drivers. BSOD persists. I've been able to reliably crash when gaming. Only time pc doesn't crash is when I point a small fan at my tower with the side panel off. I switched my case fan config to have 2 case fans be exhausts on the top but bsod still happens.
I downloaded gpu z to log my cpu and gpu temperatures and pasted a crash log here: https://pastebin.com/pKb2yfjr .
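
For anyone wanting to check a sensor log like this without scrolling through it by hand, here is a minimal sketch that reports the peak and the final reading for chosen columns of a comma-separated log. It assumes a header row followed by numeric samples. The column names and sample values below are made up for illustration and are not taken from the pastebin, so adjust them to whatever header the logging tool actually writes.

```python
import csv
import io


def summarize_sensor_log(text, columns):
    """Return (peaks, last) dicts for the requested numeric columns.

    peaks holds the highest value seen per column; last holds the final
    value logged, i.e. the reading closest to the crash.
    """
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    peaks = {}
    last = {}
    for row in reader:
        # Strip padding around header names and values, and drop the
        # empty column produced by a trailing comma.
        clean = {k.strip(): (v or "").strip()
                 for k, v in row.items() if k and k.strip()}
        for col in columns:
            try:
                value = float(clean[col])
            except (KeyError, ValueError):
                continue  # column missing or not numeric in this row
            peaks[col] = max(peaks.get(col, value), value)
            last[col] = value
    return peaks, last


# Hypothetical sample data, only to show the expected layout.
SAMPLE = """Date , GPU Temperature [C] , CPU Temperature [C] ,
2021-11-20 20:00:01 , 61.0 , 64.0 ,
2021-11-20 20:00:02 , 66.5 , 68.0 ,
2021-11-20 20:00:03 , 65.0 , 67.0 ,
"""

peaks, last = summarize_sensor_log(
    SAMPLE, ["GPU Temperature [C]", "CPU Temperature [C]"])
print("peak:", peaks)
print("last before crash:", last)
```

If the last reading before the crash is well below the peak, that is a hint the crash is probably not a simple overheating shutdown.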
 
EXACTLY where are the case fans installed and in what direction is EACH of them blowing? Using a diagram (even hand drawn is fine) to show the air path is a good idea, but however you want to relate that information is ok. To be clear, just changing a couple of fans in the top to exhaust doesn't really solve anything. For one thing, if you have more than one fan in the top of the case, no matter what direction you have them blowing, you are not doing yourself any favors with an air cooler installed. It's not as big of a deal with an AIO or custom loop, but with air cooling you really don't want any fans in the top of the case except in the top-rear location. Otherwise you are going to either create very unhelpful turbulence (IF you have middle or front fans blowing into the case as intake) or "steal" cool airflow coming from the front intake fans (if you have middle or front fans in the top of the case blowing out, as exhaust).

It's not an "idea". It's been shown through various smoke and fog machine tests, and it's also simply common sense. With that cooler installed, in that case, I'd highly recommend that any front fans be intake fans, and that you have both a rear and a top-rear fan installed, with both of them being exhaust (blowing out). With that kind of configuration there really shouldn't be any thermal problems that are a result of the case fan configuration. I'll assume, hopefully, that your "five fans" are three in front as intake and two as exhaust, located in the rear and top-rear locations. If not, I'd make the change.

You've talked a lot about the CPU temps, and the fact that you've tried two different boards with the same problems says to me that it's not a VRM issue. Coupled with the fact that taking the side panel off and pointing a fan at the case seems to alleviate the problem, that suggests this might be a thermal issue on the graphics card, and you've made no mention of monitoring thermals on the dGPU. Since swapping in the GTX 1060 did no good, I assume you've since put the 3080 Ti back in? If so, let's install HWiNFO, run "Sensors only" (uncheck the "Summary" option), and see what the thermal sensors on the graphics card are saying.
 
I posted the paste bin earlier showing what my temperatures were like before blue screening. My GPU and CPU are definitely not overheating and the temperatures look normal. I then looked at event viewer and noticed that one of the errors said:

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 14

The details view of this entry contains further information.
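
If you end up collecting several of these entries, it can help to pull the fields out and see whether the same APIC ID keeps showing up. Below is a rough sketch that splits the "Name: value" lines of an event like the one above into a dictionary and tallies the APIC IDs. The function names are my own, and it assumes you have copied the event text out of Event Viewer by hand; it does not read the Windows event log itself.

```python
from collections import Counter


def parse_event_text(text):
    """Collect the 'Name: value' lines of a pasted event into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a colon or without anything after it.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def tally_apic_ids(events):
    """Count how often each Processor APIC ID appears across events."""
    counts = Counter()
    for text in events:
        apic_id = parse_event_text(text).get("Processor APIC ID")
        if apic_id is not None:
            counts[apic_id] += 1
    return counts


EVENT = """A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Bus/Interconnect Error
Processor APIC ID: 14

The details view of this entry contains further information."""

fields = parse_event_text(EVENT)
print(fields["Error Type"])
print(tally_apic_ids([EVENT]))
```

One entry on its own doesn't prove much, but if every crash reports the same APIC ID, that would at least be consistent with a single logical processor misbehaving rather than a system-wide problem.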

I always suspected it was the CPU but I hadn't been able to diagnose. Then I read this post by a different user: https://forums.tomshardware.com/threads/new-ryzen-5800x-build-bsod-whea_uncorrectable_error.3663507/


Where he said:
Then I noticed while watching ryzen master software that the clock speed and voltage fluctuate quite a bit, upwards of 1.45 volts and over 4.8ghz on some cores. This is with the default settings. My guess is that the bios is just not always stable at some clock speeds matched with certain voltages as they both fluctuate up and down. AMD might have been a little ambitious when binning my CPU.

So I decided to underclock my cpu speed to 3800 max using ryzen master and set the voltage to be 1.35 and I haven't crashed since! I don't understand much about what I did though and why changing these fixed my crashes!