Question Random BSODs related to memory, yet no errors reported in Memtest86+ ?

Deleted member 2969713

Guest
I recently upgraded my CPU and heatsink with some difficulty, and a number of times accidentally struck the motherboard with the screwdriver when it slipped. I'm worried that this is the cause of the crashes I'm experiencing, but I wonder if perhaps someone who knows more than me would know different. I also changed my NVMe SSD and reinstalled Windows at the same time as the CPU swap.

I've experienced two system crashes in the last couple of days. They're explained in more detail below.

After installing the CPU, I first manually set the memory in the BIOS to 3200, since I read that that's what the Ryzen 5600x, my new processor, officially supports. Then I experienced a BSOD, and the screen went by too fast for me to see any error code. But I learned you could view the error using Event Viewer. "The bugcheck was: 0x0000001a (0x0000000000041790, 0xffffe5000c52ed20, 0x0000000000000000, 0x0000000000000001)." According to Microsoft, it's a memory management error, indicating that a page table was corrupted.

After that happened, I went into the BIOS and disabled my manual clock setting and enabled the first XMP setting, which set the memory to 3600. Then I ran memtest86+ for 6 hours and eight passes, with no errors detected. I played a PC game for a while in the evening with no issues.

This morning I started up my PC, and when I turned on my monitor I was greeted with another BSOD (I had disabled auto-restart on crash). It was another memory-related crash. I went into Event Viewer expecting to find the same bugcheck, but this time it was 0x0000001a (0x0000000000041792, 0xffff97000005cc88, 0x0000000000040000, 0x0000000000000000). This indicates a corrupted PTE (page table entry?).
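For anyone decoding these strings by hand: the stop code comes first, followed by four parameters, and for 0x1A the first parameter selects the subtype. Below is a minimal Python sketch, assuming the Event Viewer message format quoted above. The helper names (parse_bugcheck, describe) are made up for illustration, and the lookup table covers only the two subtypes seen in this thread, not the full list.

```python
import re

# Subtypes of bugcheck 0x1A (MEMORY_MANAGEMENT) seen in this thread,
# keyed by parameter 1. Illustrative only: many other subtypes exist.
MEMORY_MANAGEMENT_SUBTYPES = {
    0x41790: "a page table page has been corrupted",
    0x41792: "a corrupted PTE (page table entry) has been detected",
}

def parse_bugcheck(message):
    """Return (stop_code, [p1, p2, p3, p4]) from an Event Viewer bugcheck string."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9a-fA-F]+", message)]
    if len(values) != 5:
        raise ValueError("expected a stop code followed by four parameters")
    return values[0], values[1:]

def describe(message):
    """Give a one-line description for the bugchecks covered by this sketch."""
    code, params = parse_bugcheck(message)
    if code != 0x1A:
        return f"bugcheck 0x{code:X} (not covered by this sketch)"
    detail = MEMORY_MANAGEMENT_SUBTYPES.get(params[0], "subtype not in this table")
    return f"MEMORY_MANAGEMENT: {detail}"

first = ("The bugcheck was: 0x0000001a (0x0000000000041790, 0xffffe5000c52ed20, "
         "0x0000000000000000, 0x0000000000000001).")
second = ("0x0000001a (0x0000000000041792, 0xffff97000005cc88, "
          "0x0000000000040000, 0x0000000000000000)")

print(describe(first))
print(describe(second))
```

Both crashes share the same stop code and differ only in the first parameter, which is why they read as two flavours of the same page table problem.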

Given that I ran memtest86+ for a long time with a thorough 8 passes and encountered zero issues, I'm skeptical that my RAM sticks are the real source of the problem, especially since this issue only started after swapping my CPU and installing a new cooler. It might be worth mentioning that I initially used the stock cooler, and had removed and reapplied it several times without applying new thermal paste because I didn't have any on hand and I was having trouble installing the heatsink.

I didn't encounter any crashes with the stock cooler installed, but I didn't use my computer much either, because I ran stress tests and watched as the CPU temperature exceeded 90 degrees Celsius and knew I had to at least buy and reapply thermal paste. But I decided to actually buy an aftermarket cooler since I had read they could be much better than the stock one, and I specifically bought the Thermalright Assassin X120 Refined SE tower cooler.

I installed the new aftermarket cooler without any major difficulties, and I also removed my case's front acrylic panel to provide better airflow. I ran a stress test and was pleased to see that the CPU temps were now hovering around 60 degrees after minutes of full load instead of exceeding 90.

But now I'm getting BSODs.

Sorry for the long post, but I'm hoping someone might have insights I do not. I'm currently typing this post on my PC after rebooting after the second BSOD and it hasn't crashed yet...

I suppose I could try running the memory at the stock speeds and see if the BSODs keep happening, but I don't really want to do that since without tweaking the speed is at 2400 or something low like that, and I was previously running it at 2933 with my Ryzen 3200g.
 
Solution
It's unwise to run without a paging file - because you won't be able to write any dumps. Dumps are written initially to the paging file. Enter the command sysdm.cpl at the Run command prompt, click the Advanced tab, click the top Settings button (Performance), click the Advanced tab, click the Change button in Virtual Memory. In there ensure that the top checkbox (Automatically manage paging file size for all drives) IS checked. Windows will then size the paging file appropriately and place it on your fastest drive - which should be the system drive in any case.

That most recent dump (the 0x7A) is very useful because it indicates that the problem was in paging in a paged-out page. That means that the failure was either in RAM or in the...
Not sure quite how to do that, as all BSODs except one occurred without me doing anything except starting the PC and waiting a bit before sitting down to use it. The one that occurred while I was using the PC was the outlier with the different bugcheck info that hasn't occurred again since. Weird to say this, but hopefully I get a BSOD so the cause can be ascertained.
As I mentioned, Driver Verifier can only check drivers as they are loaded, so you need to ensure that every third-party driver gets loaded at some time. Use every device, every third-party app, every game, etc. Driver Verifier will BSOD on its own if it detects a misbehaving driver; your role is to get all drivers loaded at some point so that Driver Verifier can test them.

The name Driver Verifier is a bit of a misnomer because it doesn't verify drivers at all really. It subjects drivers to a set of tests and checks (the ones you selected) and if any of those tests fail then Driver Verifier will immediately BSOD. We can use the resulting minidump to see what driver was running at the time of the BSOD, that will be the suspect one.

Driver Verifier was developed as a tool for driver developers to check that their newly coded drivers behave properly. Microsoft made it available in Windows so that it can be used as a driver troubleshooter - which is what we're doing.
 
Well, yesterday I tried starting up Visual Studio, browsing using Edge, playing a couple of games, and opening random apps, and I didn't get any BSODs. This morning I did my usual "turn on computer and do other stuff for a while" and there was a BSOD when I turned on my monitor. Unfortunately, BlueScreenView still blames the Windows kernel for it.

Dump from this morning's BSOD

And maybe you can sanity check that driver verifier is configured correctly. Here's the output from /query:

Code:
Time Stamp: 01/11/2024 08:04:30.317

Verifier Flags: 0x0012892b

  Standard Flags:

    [X] 0x00000001 Special pool.
    [X] 0x00000002 Force IRQL checking.
    [X] 0x00000008 Pool tracking.
    [ ] 0x00000010 I/O verification.
    [X] 0x00000020 Deadlock detection.
    [ ] 0x00000080 DMA checking.
    [X] 0x00000100 Security checks.
    [X] 0x00000800 Miscellaneous checks.
    [X] 0x00020000 DDI compliance checking.

  Additional Flags:

    [ ] 0x00000004 Randomized low resources simulation.
    [ ] 0x00000200 Force pending I/O requests.
    [ ] 0x00000400 IRP logging.
    [ ] 0x00002000 Invariant MDL checking for stack.
    [ ] 0x00004000 Invariant MDL checking for driver.
    [X] 0x00008000 Power framework delay fuzzing.
    [ ] 0x00010000 Port/miniport interface checking.
    [ ] 0x00040000 Systematic low resources simulation.
    [ ] 0x00080000 DDI compliance checking (additional).
    [ ] 0x00200000 NDIS/WIFI verification.
    [ ] 0x00800000 Kernel synchronization delay fuzzing.
    [ ] 0x01000000 VM switch verification.
    [ ] 0x02000000 Code integrity checks.

  Internal Flags:

    [X] 0x00100000 Extended Verifier flags (internal).

    [X] Indicates flag is enabled.

  Verifier Statistics Summary

    Raise IRQLs:                                     0
    Acquire Spin Locks:                        2584402
    Synchronize Executions:                         25
    Trims:                                       82571

    Pool Allocations Attempted:                2157532
    Pool Allocations Succeeded:                2157532
    Pool Allocations Succeeded SpecialPool:    2157532
    Pool Allocations With No Tag:                    0
    Pool Allocations Not Tracked:                23951
    Pool Allocations Failed:                         0
    Pool Allocations Failed Deliberately:            0

  Driver Verification List

    MODULE: fltmgr.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (      341 /       20 )
        Current Pool Bytes:        (   332774 /     4612 )
        Peak Pool Allocations:     (      562 /       26 )
        Peak Pool Bytes:           (   636174 /     9732 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: wdf01000.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (     4546 /      355 )
        Current Pool Bytes:        (  2692248 /    40448 )
        Peak Pool Allocations:     (     4549 /      357 )
        Peak Pool Bytes:           (  2693008 /    91312 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: storport.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (      235 /        0 )
        Current Pool Bytes:        (   783363 /        0 )
        Peak Pool Allocations:     (      241 /        5 )
        Peak Pool Bytes:           (   791875 /     1946 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: ndis.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (     2227 /      156 )
        Current Pool Bytes:        (  3683268 /    24266 )
        Peak Pool Allocations:     (     2282 /      158 )
        Peak Pool Bytes:           (  3734920 /    24598 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: amdpsp.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        3 /        0 )
        Current Pool Bytes:        (     1280 /        0 )
        Peak Pool Allocations:     (        3 /        0 )
        Peak Pool Bytes:           (     1280 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: rtcx21x64.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        1 /        1 )
        Peak Pool Bytes:           (       16 /       70 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: nvlddmkm.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (    25633 /      577 )
        Current Pool Bytes:        ( 20734262 /  3084452 )
        Peak Pool Allocations:     (    25895 /      580 )
        Peak Pool Bytes:           ( 20806706 /  3543498 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: ucmcxucsinvppc.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: amdpcidev.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: amdgpio2.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: amdgpio3.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: nvhda64v.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (       33 /       12 )
        Current Pool Bytes:        (    28880 /     7536 )
        Peak Pool Allocations:     (       34 /       13 )
        Peak Pool Bytes:           (    29904 /     8064 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: mt7612us.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        7 /        0 )
        Current Pool Bytes:        (    11424 /        0 )
        Peak Pool Allocations:     (        7 /        0 )
        Peak Pool Bytes:           (    11424 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: dump_dumpstorport.sys (load: 0 / unload: 0)

    MODULE: dump_stornvme.sys (load: 2 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: dump_dumpfve.sys (load: 2 / unload: 1)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        1 /        0 )
        Current Pool Bytes:        (    16400 /        0 )
        Peak Pool Allocations:     (        1 /        0 )
        Peak Pool Bytes:           (    16400 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: amdryzenmasterdriver.sys (load: 1 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (       39 /        0 )
        Current Pool Bytes:        (  1777457 /        0 )
        Peak Pool Allocations:     (       40 /        0 )
        Peak Pool Bytes:           (  1813473 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0

    MODULE: ntiolib_x64.sys (load: 2 / unload: 0)

      Pool Allocation Statistics: ( NonPaged / Paged )

        Current Pool Allocations:  (        0 /        0 )
        Current Pool Bytes:        (        0 /        0 )
        Peak Pool Allocations:     (        0 /        0 )
        Peak Pool Bytes:           (        0 /        0 )
        Contiguous Memory Bytes:              0
        Peak Contiguous Memory Bytes:         0
 
Driver Verifier looks to be configured properly. The dump this morning was another PTE corruption, also flagged by the triage analysis as a RAM issue...
Code:
FAILURE_BUCKET_ID:  MEMORY_CORRUPTION_ONE_BIT
Note that it wasn't a BSOD generated by Driver Verifier, it was just another of the BSODs you've been having.
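As a cross-check on the configuration, the "Verifier Flags" value in the /query output is a bitmask, and it should decode to exactly the boxes marked [X]. A small Python sketch, using only the bit values and names printed in that output; the function name decode_flags is made up for illustration.

```python
# Bit values and names copied from the /query output in this thread.
VERIFIER_FLAGS = {
    0x00000001: "Special pool",
    0x00000002: "Force IRQL checking",
    0x00000004: "Randomized low resources simulation",
    0x00000008: "Pool tracking",
    0x00000010: "I/O verification",
    0x00000020: "Deadlock detection",
    0x00000080: "DMA checking",
    0x00000100: "Security checks",
    0x00000200: "Force pending I/O requests",
    0x00000400: "IRP logging",
    0x00000800: "Miscellaneous checks",
    0x00002000: "Invariant MDL checking for stack",
    0x00004000: "Invariant MDL checking for driver",
    0x00008000: "Power framework delay fuzzing",
    0x00010000: "Port/miniport interface checking",
    0x00020000: "DDI compliance checking",
    0x00040000: "Systematic low resources simulation",
    0x00080000: "DDI compliance checking (additional)",
    0x00100000: "Extended Verifier flags (internal)",
    0x00200000: "NDIS/WIFI verification",
    0x00800000: "Kernel synchronization delay fuzzing",
    0x01000000: "VM switch verification",
    0x02000000: "Code integrity checks",
}

def decode_flags(value):
    """Return (names of enabled flags, bits not present in the table)."""
    names = [name for bit, name in sorted(VERIFIER_FLAGS.items()) if value & bit]
    all_known = 0
    for bit in VERIFIER_FLAGS:
        all_known |= bit
    leftover = value & ~all_known
    return names, leftover

names, leftover = decode_flags(0x0012892B)
for name in names:
    print("[X]", name)
print(f"unrecognized bits: 0x{leftover:08x}")
```

For 0x0012892b this lists nine enabled flags with no unrecognized bits, matching the [X] entries above (I/O verification and DMA checking are the two standard checks left off).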
 
I'll keep driver verifier enabled for a while, but I'm not optimistic about its chances of revealing the issue at this point. And if it doesn't, where do I go from there?

What changed prior to the BSOD occurrences:

  • Updated MSI motherboard BIOS to latest version to support 5600X CPU.
  • CPU was upgraded from 3200G to 5600X.
  • New aftermarket CPU heatsink + fan was installed, and front case acrylic panel was removed.
  • Motherboard was struck a few times when my screwdriver slipped during the process of changing out the CPU and heatsinks.
  • NVMe SSD was replaced with a new higher capacity one.
  • Windows 11 was freshly installed on the new SSD.
  • RAM speed was increased via BIOS (first manually, then XMP).
I did also manually update my NVIDIA driver, but I don't remember if I did that before or after I started having BSODs.

Just in case, I ran chkdsk and it didn't discover any problems. But the MSI M371 is a relatively new product without any reviews by major tech publications, so I wonder.

The page table is stored on the system drive, right? So if the new SSD is the problem, then if I installed Windows 11 on my SATA SSD instead and used that as my boot drive, the problem would go away, right? That might be the next thing I try after giving driver verifier more time.

If that doesn't work, then I'll probably open up my system, remove the heatsink and CPU, and photograph the motherboard and check for any signs of damage, and inspect the socket for any obvious issues.

If that doesn't reveal anything, I'll likely just have to give up and live with the intermittent BSODs on startup. As long as they don't occur while I'm actually using the PC, it's not too bad.

If the CPU is the issue, I'm just going to have to live with it, since I can't afford to replace it. The heatsink and case modification are very unlikely to be the cause of the issue, unless somehow this is causing the RAM to overheat when it wasn't before.

Edit: I came across a related post somewhere where a person advised the victim to run "sfc /scannow", and I figured it wouldn't hurt and ran it. It reported back that it had found and fixed corrupted system files. I also ran "DISM /Online /Cleanup-Image /Restorehealth" because that sounds good, right? 🙄 But it seems to have gotten stuck at 62.3% and I don't actually know what it's doing. I killed that process after reading about other users being stuck at that exact same percentage for hours.

I've also updated my motherboard chipset drivers to the latest.
 
If Driver Verifier doesn't flag any drivers, and you're certain you have loaded every third-party driver at some point in the testing, then we will be back looking at RAM. The page table corruptions that you're seeing are going to be caused by either a rogue third-party kernel-mode driver (which Driver Verifier should find) or flaky RAM (that Memtest86 couldn't detect).

In the first instance I would remove and reseat each RAM stick. It's entirely possible that your working inside has moved a RAM stick a little. It happens.

Next, I would definitely run the RAM at stock frequency, any kind of overclocking can make marginal hardware fail. If it's stable at 2400MHz but not when overclocked then you need better RAM.

If running at stock frequencies doesn't help then I would next suggest removing one stick of RAM for a few days. I don't know how many RAM sticks you have, but running with each stick out for long enough to have usually had a BSOD will confirm 100% whether one stick is responsible.

Don't lose heart, we're not done yet. If none of the above works then we'll move on to a clean boot and isolate third-party services to see whether one of those is responsible.
 
If Driver Verifier doesn't flag any drivers, and you're certain you have loaded every third-party driver at some point in the testing, then we will be back looking at RAM.
Definitely can't say I'm certain of that yet, since I don't know when exactly a driver is triggered. I'll give it some more days of regular PC usage before I conclude that third-party drivers are most likely not at fault.

In the first instance I would remove and reseat each RAM stick. It's entirely possible that your working inside has moved a RAM stick a little. It happens.
That's a good idea. I do think I bumped the RAM a bit when I was working in there, so if driver verifier doesn't reveal anything but the blue screens continue, I'll open up my PC, reseat the RAM and the NVMe drive, and inspect the motherboard for damage while I'm at it. I'll leave removing the CPU cooler to inspect for damage underneath it as a last resort, since it's a pain to reapply and I have my CPU thermals where I like them and don't really want to inadvertently mess that up.

Next, I would definitely run the RAM at stock frequency, any kind of overclocking can make marginal hardware fail. If it's stable at 2400MHz but not when overclocked then you need better RAM.
Agreed, this is also worth doing. Though it would be odd if that were the problem since the RAM wasn't running at stock with the previous CPU and there were no issues.

If running at stock frequencies doesn't help then I would next suggest removing one stick of RAM for a few days. I don't know how many RAM sticks you have, but running with each stick out for long enough to have usually had a BSOD will confirm 100% whether one stick is responsible.
I have 2x8GB in there (motherboard only has two slots). I do also have 2x4GB RAM sticks lying around somewhere that I know worked fine. So if I could find those (it would take some digging) and swapped them out and still got the BSODs, that would prove pretty definitively that the current RAM sticks aren't at fault, I think. But I could also try running one stick at a time if it comes to that.

Don't lose heart, we're not done yet. If none of the above works then we'll move on to a clean boot and isolate third-party services to see whether one of those is responsible.
I appreciate your extensive help. It'll be slow going from here on out, since I seem to get a BSOD at a rate of less than one a day, so isolating the problem is going to take quite some time. I wish there were a reliable way for me to trigger the on-startup crash, but it seems to be random and very intermittent, and I have yet to actually witness it live, so I don't even know if Windows manages to get to the log-in screen first or not before crashing when it does happen.
 
I've suddenly started encountering a new, possibly unrelated problem where keyboard input suddenly stops working until the PC is restarted. It's happened twice now, and unplugging and plugging back in the keyboard did nothing. The second time I plugged in a different keyboard I had into a different USB port and the computer did not respond to key presses from that one either. Could this be caused somehow by having driver verifier enabled?
 
No, I very much doubt that the keyboard issue is related to Driver Verifier, but it is an interesting symptom.

Let's see how the earlier suggestions pan out first (reseating RAM, stock frequency, etc.). I would advise against adding the 4GB RAM sticks however, because that changes the hardware platform and may introduce new problems. I appreciate that running on only 8GB is going to introduce bottlenecks and performance issues, but you really don't want to be adding new hardware at this stage.

The key to using Driver Verifier effectively is to use every single app and every single device that you have. Drivers are loaded when the service or device they manage is running. Make a plan to run every app using as much of the app's functionality as you can. Access every device too, even printers and scanners. Using the PC as you normally do when you get BSODs is good, but you probably don't know (yet) exactly what sequence or combination of app and device usage is causing the BSODs. That's why it's better to create a testing plan to be sure you use every app and device.

BTW. It's worth checking that the new CPU and the RAM you have are on the QVL for the motherboard. The QVL indicates which components have been tested and verified as working properly with the motherboard and it's generally best to stick with those.
 
The keyboard thing hasn't happened again, so hopefully it doesn't become a recurring issue, as that would be very annoying.

The key to using Driver Verifier effectively is to use every single app and every single device that you have. Drivers are loaded when the service or device they manage is running. Make a plan to run every app using as much of the app's functionality as you can. Access every device too, even printers and scanners. Using the PC as you normally do when you get BSODs is good, but you probably don't know (yet) exactly what sequence or combination of app and device usage is causing the BSODs. That's why it's better to create a testing plan to be sure you use every app and device.
When I /query the verifier, and it shows a list of the drivers it's monitoring, it states the number of times the driver has been loaded and unloaded. If all of the drivers indicate that they've been loaded at least once, that would mean that if a driver was the problem, a BSOD would have occurred, right? Because when I query it now, it indicates that all drivers except one have been loaded at least once. The one that hasn't been loaded is dump_dumpstorport.sys, and I can't seem to find any information about that one online except that it might be for memory dumps when the system crashes.
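The load-count check described above can be automated. Here is a short Python sketch, assuming the text from verifier /query has been saved or pasted into a string; the function name never_loaded and the three-line sample are illustrative. Note that a load count of 1 or more only shows the driver was loaded while under verification, not that every code path in it was exercised.

```python
import re

# Matches lines such as: "MODULE: fltmgr.sys (load: 1 / unload: 0)"
MODULE_LINE = re.compile(
    r"MODULE:\s+(\S+)\s+\(load:\s*(\d+)\s*/\s*unload:\s*(\d+)\)"
)

def never_loaded(query_output):
    """Return the names of verified modules whose load count is zero."""
    return [
        name
        for name, loads, _unloads in MODULE_LINE.findall(query_output)
        if int(loads) == 0
    ]

# A few lines taken from the /query output posted earlier in this thread.
sample = """
    MODULE: fltmgr.sys (load: 1 / unload: 0)
    MODULE: dump_dumpstorport.sys (load: 0 / unload: 0)
    MODULE: dump_stornvme.sys (load: 2 / unload: 0)
"""

print(never_loaded(sample))
```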

BTW. It's worth checking that the new CPU and the RAM you have are on the QVL for the motherboard. The QVL indicates which components have been tested and verified as working properly with the motherboard and it's generally best to stick with those.
MSI's official website lists the 5600X as supported, as does this website, Pangoly. The memory isn't specifically mentioned that I could find on MSI's website, but the Pangoly one lists it as officially (QVL) compatible for Zen 2 Ryzen processors, but doesn't have any way to check memory compatibility for the 5600X's processor family.

Oh, and I got another startup BSOD this morning that I once again didn't witness. Same bugcheck as usual. Seems to happen at a rate of once every other day, which is going to make the problem isolation process very slow.
 
The dump_dumpstorport.sys driver you can ignore. It's a Microsoft driver involved in the dump writing process.

It may be useful for you to upload the full kernel dump from your most recent BSOD. It's the file C:\Windows\Memory.dmp and it will be large.

Try those earlier suggestions and I'll reach out to a colleague to see whether he has more info on these PTE corruptions you're getting.
 
I tried using all the devices hooked up to my computer, and opened up nearly every app installed on my PC at once (not including portable apps), including a bunch of system tools, for a total of nearly 60 apps. I didn't get any BSODs, though.

It may be useful for you to upload the full kernel dump from your most recent BSOD. It's the file C:\Windows\Memory.dmp and it will be large.
I looked and didn't find a DMP file at that location. I could configure the system to write a full kernel dump on crash; right now it's only set to write the small memory dump.
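The dump type can also be checked without clicking through the Startup and Recovery dialog. As far as I know, Windows stores it in the CrashDumpEnabled value under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. The sketch below is Python, with the value meanings given from memory rather than from this thread, so treat the table as an assumption to verify. The function names are made up for illustration.

```python
import sys

# Assumed meanings of the CrashDumpEnabled registry value (not from this thread).
CRASH_DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (minidump)",
    7: "Automatic memory dump",
}

def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return CRASH_DUMP_TYPES.get(value, f"unrecognized value {value}")

def read_crash_dump_enabled():
    """Read the setting from the registry; returns None when not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value

current = read_crash_dump_enabled()
if current is None:
    print("Not running on Windows; nothing to read.")
else:
    print("Current setting:", describe_dump_setting(current))
```

A result of "Small memory dump (minidump)" would be consistent with there being no Memory.dmp file in C:\Windows.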

Try those earlier suggestions and I'll reach out to a colleague to see whether he has more info on these PTE corruptions you're getting.
Thanks. I'll reseat the RAM (and SSD) and wait a couple of days to see if I get another BSOD.

Edit: Well, prior to reseating the RAM, I tried rebooting my system four or five times to see if I could get it to BSOD, and it wouldn't, regardless of whether I pressed the reset button on my case or shut down properly, turned off and on the surge protector it's plugged in to, then booted it back up. Very annoying that I can't get it to reliably reproduce. It seems the only way to get it to happen is to turn off my PC at night, shut off the surge protector (which I do at night), then in the morning turn on the surge protector, turn on the PC, go do other things for a bit, and then turn on the monitor to be greeted with a BSOD, and even then it's only maybe.

After rebooting a few times with no luck as far as reproducing the issue, I opened up my PC and reseated the RAM and M.2 drive, and took a few photos of the motherboard while I had the PC open. There was one place with a scuff mark of some sort (above the MSI logo), but I'm not sure if it's just dirt or perhaps a scratch on the motherboard. There may be more scuffs or scratches obscured by the CPU cooler, but putting that on was a royal pain so I'm loath to remove it if I can at all avoid it.

I guess I just play the waiting game now.
 
  • Updated MSI motherboard BIOS to latest version to support 5600X CPU.
  • CPU was upgraded from 3200G to 5600X.
  • New aftermarket CPU heatsink + fan was installed, and front case acrylic panel was removed.
  • Motherboard was struck a few times when my screwdriver slipped during the process of changing out the CPU and heatsinks.
  • NVMe SSD was replaced with a new higher capacity one.
  • Windows 11 was freshly installed on the new SSD.
  • RAM speed was increased via BIOS (first manually, then XMP).
So many changes it could have been caused by any one of them.

The scuff mark is unclear; it could just be dirt. It looks to extend to the MSI logo. (I try not to take photos of my board; it always shows dust I can't see myself.)
I would be tempted to install Windows on the old drive and see if you still get the errors. You're correct that the page file is on C:, so it's always possible it's a bad SSD.
 
I would be tempted to install Windows on the old drive and see if you still get the errors. You're correct that the page file is on C:, so it's always possible it's a bad SSD.
Yep, this is also something I want to do if other less interruptive troubleshooting efforts don't reveal the problem first. But I haven't had a blue screen yet since reseating the RAM and SSD. How maddening would it be if this whole problem was simply caused by accidentally nudging a RAM stick or not quite seating the NVMe drive right? At least it wouldn't require replacing any components, but still... I have a hard time imagining how a not properly seated component could 99.999% work instead of it being all or nothing. Maybe it was dust or another obstruction that I ended up removing when reseating the RAM?

Still, I need to give it a few more days before I conclude the BSOD occurrences have stopped. According to the schedule, I should get another one tomorrow morning.

The BSOD seems to only happen when I'm not looking, so if I just always watch my computer start up, the problem is solved! :tongueout: "Doctor, it hurts when I do this." "Then don't do that!"
 
The BSOD seems to only happen when I'm not looking, so if I just always watch my computer start up, the problem is solved!
Well, so much for that. I got a BSOD this morning, right on schedule, and this time I watched it happen. It occurred immediately after start-up completed and before the log-in screen could be shown.

Guess the next test is to stock-clock the RAM and wait.

Edit: Stock speed without XMP is only 2133. Oof.
 
Well, my PC missed its "scheduled" BSOD this morning.

Let's suppose the BSODs don't come back with my RAM at 2133 MT/s. How badly is the performance of my 5600X hamstrung at that speed?
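For a rough sense of scale, theoretical peak memory bandwidth is the transfer rate times the bus width times the number of channels. A Python sketch of that arithmetic, assuming the two sticks run in dual-channel mode with a 64-bit (8-byte) bus per channel; the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s, channels=2, bus_width_bytes=8):
    """Theoretical peak DDR bandwidth in GB/s (decimal gigabytes).

    Assumes dual-channel operation and a 64-bit bus per channel by default.
    """
    bytes_per_second = transfer_rate_mt_s * 1_000_000 * bus_width_bytes * channels
    return bytes_per_second / 1e9

# The speeds mentioned in this thread: stock, the old setting, and both new ones.
for rate in (2133, 2933, 3200, 3600):
    print(f"{rate} MT/s: {peak_bandwidth_gb_s(rate):.1f} GB/s")
```

Under those assumptions 2133 MT/s works out to about 34.1 GB/s against 57.6 GB/s at 3600 MT/s, roughly 59% of the peak. This covers bandwidth only and says nothing about latency or how a particular game or application responds.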

If the BSODs don't show up again, I'll try testing the limits by increasing the RAM. Or I could instead just turn XMP back on and be done with it. I think I can live with having a BSOD every other day on startup.
 
I have had some responses from other skilled BSOD analysts on the private forum we use. The only suggestion they have, other than what I've already suggested and which you've done, is to check the CPU. Prime95 would be the tool to use there; AMD don't have a dedicated processor diagnostic tool.

If you've never run Prime95 before then ask here before you do, there are some precautions you will want to take first. That the frequency of BSODs has changed does however suggest that the issue may well be RAM (assuming it's running at stock now?)

Keep us posted
 
Thanks. It is running at stock now (2133 MT/s).

I have not run Prime95 before, so it would be good to know what precautions to take.

No BSOD today either. Unfortunately, I'll be away from home for the next 5 days, so I won't be able to use my PC to test for blue screens during that time. But it is looking kind of like the RAM speed might have something to do with it. If so, it's really weird that the memory tests ran without encountering errors.

Once I get back, I'll keep using my PC like normal for a few more days, and if I don't encounter any BSODs, I can be pretty sure that running at stock speeds "fixed" the problem. I'll then try setting it at 2933 MT/s like before and see if it's stable at that speed.
 
Prime95 is a stress test for your CPU, and for RAM to some extent, so leave RAM at stock frequency when running Prime95.
  1. Download Prime95 from here.
  2. Run all three tests (small FFTs, large FFTs, and Blend) one at a time for at least an hour each test - longer if you can.
  3. This WILL make your CPU run hot, so also run a temperature monitor (like CoreTemp) to keep an eye on temps.
  4. If Prime95 generates errors, if the PC crashes or BSODs, or if the CPU temp approaches 95°C (Tmax for your 5600X is 95°C), then stop testing and let us know what happened.
  5. If your PC can sustain each test for at least an hour, then your CPU is probably fine.
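The stop conditions in step 4 can be sketched as a small check. This is only an illustration: the function name, its inputs, and the 5 °C margin used to decide what counts as "approaching" Tmax are hypothetical, and the readings would still come from watching Prime95 and your temperature monitor yourself.

```python
TMAX_C = 95.0   # Tmax for the 5600X, as given in the steps above
MARGIN_C = 5.0  # hypothetical margin for what counts as "approaching" Tmax


def should_stop(error_count, crashed, temp_c, tmax_c=TMAX_C, margin_c=MARGIN_C):
    """Return a reason to stop testing, or None if it is fine to continue."""
    if crashed:
        return "PC crashed or blue-screened"
    if error_count > 0:
        return "Prime95 reported errors"
    if temp_c >= tmax_c - margin_c:
        return f"CPU temperature {temp_c:.0f} C is approaching Tmax ({tmax_c:.0f} C)"
    return None


# Example: a clean run peaking at 75 C gives no reason to stop.
print(should_stop(error_count=0, crashed=False, temp_c=75.0))
```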
 
Thanks, I'll try that out when I get back.
 
Okay, I ran the Prime95 stress tests. I used CoreTemp to monitor temperatures. Results:

Small FFTs test: 280-288 tests in 3 hours, 57 minutes - 0 errors, 0 warnings - max temperature 65 C
Large FFTs test: 55-58 tests in 1 hour, 57 minutes - 0 errors, 0 warnings - max temperature 74 C
Blend test: 131-135 tests in 2 hours, 37 minutes - 0 errors, 0 warnings - max temperature 75 C

There were no BSODs during any of the tests. I suspect the higher temperatures in the Large FFTs and Blend tests were because, although CPU utilization wasn't at 100% nearly all the time as it was in the Small FFTs test, the clock speeds were much higher, boosting to 4.5 GHz and above compared with the fairly consistent ~3.9 GHz of the Small FFTs test.

At this point it's looking pretty safe to say that the issue was simply my RAM being unstable above 2933 MT/s despite being rated for 3600 MT/s. I think I'll set the RAM speed to 2933 MT/s since it was stable at that when I was using my 3200G and then just assume the BSOD saga is over unless I encounter another one.

Side note: I encountered some claims on Hacker News that running Prime95 stress tests for prolonged periods of time can damage your CPU, though there was disagreement among the users there about that. Still, it's a bit concerning that I might have just reduced the lifespan of my CPU or otherwise harmed it. Is that claim baseless? I haven't manually changed voltages or overclocked the CPU; the only overclocking is done by the automatic speed boosting.
 
If you're happy running the RAM at 2933 MT/s, and it's stable there, then that's a good workaround. Always assuming that you can't RMA it of course?

IMO there is no concrete answer to the question of whether Prime95 reduces the life of your CPU. Electronic components don't like heat and Prime95 does make your CPU run hot, so there is a good argument not to run Prime95 unless you really need to. In this case we needed to be sure that your CPU was good and Prime95 is the best way to do that - as long as you monitor CPU temps.

I would never advise anyone to run Prime95 unless they have a very good reason. Troubleshooting a potentially flaky CPU is a sufficiently good reason in my book, and any potential risk to the longevity of your CPU is worth it to either eliminate it or to prove it faulty.

There is a tendency these days, and particularly online, for opinions to become polarised one way or the other. In life however you mostly find that the truth is in the grey area in between. I suspect the truth about Prime95 is in that grey area, so use it when you need to, otherwise don't.
 
Yeah, I'm okay with it. I wouldn't consider it worth buying a whole new set of RAM over or trying to RMA it; it's been a couple of years since it was bought, I wasn't the one who bought it (it was a gift), and I'm not sure I still have the packaging, and more to the point, I can't be bothered sending it in over this even if the option is available. If the RAM were defective even at stock speeds it'd be a different story.

I manually set the clock speed in the BIOS, but didn't mess with any timing values since I don't know what I'm doing there. Is that going to be a problem? They're set to auto. The odd thing is their auto values are lower than the XMP values. I wonder if the memory would work at higher speeds if the timings were set differently. However, I don't really want to play the guess and check game anymore, so if it's stable at 2933 I'll just leave it at that.

Edit: I went back into the BIOS and checked again and the XMP profile timings were the same as the base timings. Weird. I could have sworn I saw lower base timings. Maybe it's because I saw the default timings when my RAM was still at the base 2133 speed, and they automatically changed when I upped it to 2933? Or maybe I simply misread the values.

I probably shouldn't have run it, then. I was 99% sure the issue was RAM speed, but ran the test anyway just in case and out of curiosity about what temperatures the CPU would reach. However, since it only got up to 75 C, maybe the wear was small.
 
Oh, I think running it once for a couple of hours is fine. Any 'wear' is insignificant over the life of the PC. The issue with Prime95 is running it often and/or when you don't need to. IMO it was necessary to run it in your case to eliminate the CPU. It's important to keep things in perspective.

FWIW I run Prime95 for an hour on every new build - to be sure it's stable and that the cooling is effective.
 
Well, in my case it was closer to about eight and a half hours straight since I did the three tests more or less back to back. But from the sound of it I shouldn't be too worried anyway.
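For reference, the three run times reported earlier (3 h 57 min, 1 h 57 min, and 2 h 37 min) add up like this. A minimal sketch; the variable and function names are just for illustration.

```python
# Durations as reported in the results post: (hours, minutes) per test.
runs = {"Small FFTs": (3, 57), "Large FFTs": (1, 57), "Blend": (2, 37)}


def total_minutes(durations):
    """Sum a mapping of name -> (hours, minutes) into total minutes."""
    return sum(h * 60 + m for h, m in durations.values())


total = total_minutes(runs)
hours, minutes = divmod(total, 60)
print(f"Total: {hours} h {minutes} min")  # Total: 8 h 31 min
```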

I did create another thread on this subject in a different sub-forum just to get more viewpoints on the subject.

Thanks for all your help, I'm glad I was able to finally pin down the issue.