Question: Random BSODs related to memory, yet no errors reported in Memtest86+?

Deleted member 2969713

Guest
I recently upgraded my CPU and heatsink with some difficulty, and a number of times accidentally struck the motherboard with the screwdriver when it slipped. I'm worried that this is the cause of the crashes I'm experiencing, but I wonder if perhaps someone who knows more than me would know different. I also changed my NVMe SSD and reinstalled Windows at the same time as the CPU swap.

I've experienced two system crashes in the last couple of days. They're explained in more detail below.

After installing the CPU, I first manually set the memory in the BIOS to 3200, since I read that that's what the Ryzen 5600x, my new processor, officially supports. Then I experienced a BSOD, and the screen went by too fast for me to see any error code. But I learned you could view the error using Event Viewer. "The bugcheck was: 0x0000001a (0x0000000000041790, 0xffffe5000c52ed20, 0x0000000000000000, 0x0000000000000001)." According to Microsoft, it's a memory management error, indicating that a page table was corrupted.

After that happened, I went into the BIOS and disabled my manual clock setting and enabled the first XMP setting, which set the memory to 3600. Then I ran memtest86+ for 6 hours and eight passes, with no errors detected. I played a PC game for a while in the evening with no issues.

This morning I started up my PC, and when I turned on my monitor I was greeted with another BSOD (I had disabled auto-restart on crash). It was another memory-related crash. I went into Event Viewer expecting to find the same bugcheck, but this time it was 0x0000001a (0x0000000000041792, 0xffff97000005cc88, 0x0000000000040000, 0x0000000000000000). This indicates a corrupted PTE (page table entry?).
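For anyone following along, here is a minimal sketch of pulling the bug check code and its four parameters out of an Event Viewer message like the ones quoted above. The function names (`parse_bugcheck`, `describe`) and the lookup table are hypothetical, and the two subtype descriptions are just the readings given in this post, not a full list.

```python
import re

# Parameter 1 subtypes for bug check 0x1A, as described in this thread only.
MEMORY_MANAGEMENT_SUBTYPES = {
    0x41790: "page table page corrupted",
    0x41792: "corrupted PTE detected",
}

# Matches "... bugcheck was: 0x0000001a (p1, p2, p3, p4)".
BUGCHECK_RE = re.compile(r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_bugcheck(message):
    """Return (code, [param1, ..., param4]) parsed from an Event Viewer message."""
    match = BUGCHECK_RE.search(message)
    if match is None:
        raise ValueError("no bugcheck found in message")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(message):
    """Give a short reading of a 0x1A bug check based on its first parameter."""
    code, params = parse_bugcheck(message)
    if code != 0x1A:
        return "not a MEMORY_MANAGEMENT bug check"
    return MEMORY_MANAGEMENT_SUBTYPES.get(params[0], "unknown 0x1A subtype")
```

Feeding it the first message from this post should report the page table corruption, and a message with 0x41792 as the first parameter should report the corrupted PTE.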

Given that I ran memtest86+ for a long time with a thorough 8 passes and encountered zero issues, I'm skeptical that my RAM sticks are the real source of the problem, especially since this issue only started after swapping my CPU and installing a new cooler. It might be worth mentioning that I initially used the stock cooler, and had removed and reapplied it several times without applying new thermal paste because I didn't have any on hand and I was having trouble installing the heatsink.

I didn't encounter any crashes with the stock cooler installed, but I didn't use my computer much either, because I ran stress tests and watched as the CPU temperature exceeded 90 degrees Celsius and knew I had to at least buy and reapply thermal paste. But I decided to actually buy an aftermarket cooler since I had read they could be much better than the stock one, and I specifically bought the Thermalright Assassin X120 Refined SE tower cooler.

I installed the new aftermarket cooler without any major difficulties, and I also removed my case's front acrylic panel to provide better airflow. I ran a stress test and was pleased to see that the CPU temps were now hovering around 60 degrees after minutes of full load instead of exceeding 90.

But now I'm getting BSODs.

Sorry for the long post, but I'm hoping someone might have insights I do not. I'm currently typing this post on my PC after rebooting after the second BSOD and it hasn't crashed yet...

I suppose I could try running the memory at the stock speeds and see if the BSODs keep happening, but I don't really want to do that since without tweaking the speed is at 2400 or something low like that, and I was previously running it at 2933 with my Ryzen 3200g.
 
Well, I just got another BSOD, the first since January 15th. Same old type. So apparently my system isn't 100% stable even at 2933. But if it's happening at a rate of around once every two weeks, I can live with that.
 
Well, so much for "once every two weeks." I just got another BSOD this morning. Guess I'll just up the RAM speed again.

I think you're looking at a hardware issue for sure. If it's not RAM then it's likely to be the CPU or motherboard. How hard did you strike the motherboard with the screwdriver?
Hard to say. The screwdriver slipped when I was attempting to unscrew stubborn and very awkwardly placed screws on the AMD cooler fan shroud (which I didn't need to be unscrewing anyway, but I didn't realize it at first). I believe it slipped a few other times as well. I don't know exactly how hard it hit the motherboard, but it wasn't enough to leave deep gouges or obvious-at-a-glance scratches. I may have also messed something up when I was idiotically trying to screw in the stock cooler with no bracket in place and ended up bending the motherboard a little with the amount of force I applied.

As annoying as it is that I ultimately wasn't able to fix the problem, after a month of trying I think I'm done. The issue certainly doesn't bother me enough to spend money on a new motherboard and/or RAM, especially if there's a chance the CPU is the cause. If I were getting BSODs during use instead of just at startup, though, I'd be looking into replacing the motherboard and RAM.

Next time I'm in the market for a new computer I'm thinking I'll just buy a pre-built or a laptop. :unamused:
 
The potential motherboard bending is very concerning. Also, BSODs on startup might suggest a CPU issue because the CPU is very busy at startup.

I would only ever advise a laptop if portability is a key factor. There are so many compromises in a laptop that I would avoid one unless there was no alternative. I don't know where you are, but you might look for a system integrator company. These companies build the PC to your spec and provide a warranty too. I know of a good one in the UK, but that's no good if you're in the USA.
 
I don't know where you are, but you might look for a system integrator company. These companies build the PC to your spec and provide a warranty too. I know of a good one in the UK, but that's no good if you're in the USA.
Yeah, I'm in the USA. But I probably won't be in the market for a new PC for many years, unless my current one starts to have more severe issues.
 
New info to report. After setting memory speed up to 3200 MT/s, I started getting BSODs about once a day, which was not unexpected. However, I recently tried enabling fast startup to see what effect, if any, it would have.

Today I got the usual BSOD on startup, except this time it was different.

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
0x1000007e
ffffffff`c0000005
fffff800`3168f1cd
fffffb83`11746d58
ffffd780`1b7f2900

Here's the mini dump.

I think I'm going to try installing Windows on my SATA drive instead and see if that makes any difference.
 
That last idea might be wise. In this recent dump there are no third-party drivers named, although both Wdf01000.sys (the Windows Driver Foundation driver) and fltmgr.sys (the Windows filter driver manager) are called. That might indicate that there are some third-party drivers, either using WDF functions or installed as filter drivers. We've already all but eliminated third-party drivers with Driver Verifier however, so we can probably discount this.

The operation in progress during this BSOD was an NVMe drive access. We see the Windows drivers stornvme.sys, storport.sys and ntfs.sys called often. The actual problem occurs in a Windows function (nt!KxWaitForLockOwnerShip) and was an attempt to reference a non-canonical memory address (i.e. an address not in the valid x64 address space)...
Code:
CONTEXT:  ffffd7801b7f2900 -- (.cxr 0xffffd7801b7f2900)
rax=ffffe78fe71c0f80 rbx=fffffb8311747018 rcx=fffffb8311747010
rdx=7470504102080000 rsi=0000000000000001 rdi=0000000000000000
rip=fffff8003168f1cd rsp=fffffb8311746f90 rbp=fffffb8311746fe0
 r8=ffffe78fe71c0f80  r9=0000000000000001 r10=fffff80031665c10
r11=ffffe78fe71c0f20 r12=fffff8003083d060 r13=ffffffffffffffff
r14=ffffc20b7da21100 r15=ffffe781017ec040
iopl=0         nv up ei pl zr na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00050246
nt!KxWaitForLockOwnerShip+0x3d:
fffff800`3168f1cd 48890a          mov     qword ptr [rdx],rcx ds:002b:74705041`02080000=????????????????
Resetting default scope
The address 0x74705041`02080000 is non-canonical; valid memory addresses must start either with 0x0000 (for user-mode code) or with 0xffff (for kernel-mode code). The 0x7470 which this address starts with is invalid.
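As a rough illustration of that rule, here is a small sketch that tests whether a 64-bit address is canonical. It assumes 48-bit virtual addressing (4-level paging), in which bits 63 down to 47 must all be the same; that is why valid addresses show up as starting with 0x0000 or 0xffff. The name `is_canonical` is made up for the example.

```python
CANONICAL_BITS = 48  # assumption: 48-bit virtual addresses (4-level paging)


def is_canonical(address, bits=CANONICAL_BITS):
    """True if bits 63..(bits-1) of a 64-bit address are all 0 or all 1."""
    upper_width = 64 - (bits - 1)       # 17 bits when bits == 48
    upper = address >> (bits - 1)       # keep only the upper bits
    all_ones = (1 << upper_width) - 1
    return upper == 0 or upper == all_ones
```

With these assumptions, the RDX value from the dump (0x7470504102080000) fails the check, while the RIP value (0xfffff8003168f1cd) passes it.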

This address comes from the RDX register and if it's not a third-party driver that placed that invalid pointer in there (and Driver Verifier suggests not) then this may be a hardware bit-flip problem. If we look at the binary value of that address you can see what I mean...
Code:
10: kd> .formats 74705041`02080000
Evaluate expression:
  Binary:  01110100 01110000 01010000 01000001 00000010 00001000 00000000 00000000
You can see that for this to be a bit-flip we need a lot of bits to be set wrongly and I'm not convinced that that's what this is. I wonder whether the operation in progress being an NVMe drive access means that the NVMe drive (or its controller) is responsible for that address somehow?
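To put a number on "a lot of bits", here is a hedged sketch that counts the minimum number of upper bits that would have to flip for an address to become canonical again. It assumes 48-bit virtual addressing, and `flips_to_canonical` is a hypothetical helper. It only looks at the upper bits, so it gives a lower bound; it cannot say what the original pointer was.

```python
def flips_to_canonical(address, bits=48):
    """Minimum number of upper-bit flips needed to make the address canonical.

    Assumes bits 63..(bits-1) must be all 0 or all 1 (48-bit addressing
    by default). Returns 0 for an address that is already canonical.
    """
    upper_width = 64 - (bits - 1)            # 17 bits when bits == 48
    upper = address >> (bits - 1)
    ones = bin(upper).count("1")             # how many upper bits are set
    # Either clear every set bit (all zeros) or set every clear bit (all ones).
    return min(ones, upper_width - ones)
```

Under these assumptions the RDX value needs at least 7 flips, whereas a kernel pointer hit by a single bit-flip in its upper bits would need only 1. That is in line with the doubt above about this being a simple bit-flip.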

I think installing Windows on to a known good SATA drive is an excellent idea. I'd want to be sure the drive was good first though. Also, only install Windows, drivers, and updates. Then test it thoroughly before installing anything else. I've seen many people who reinstall everything and end up reinstalling the problem. Make sure the SATA drive is the only drive in the system too....
 
You can see that for this to be a bit-flip we need a lot of bits to be set wrongly and I'm not convinced that that's what this is. I wonder whether the operation in progress being an NVMe drive access means that the NVMe drive (or its controller) is responsible for that address somehow?
Agreed. I'm thinking it has something to do with the NVMe. Whether that's due to the new NVMe drive being faulty or due to me having messed something up with the motherboard, I'm not sure.

I think installing Windows on to a known good SATA drive is an excellent idea. I'd want to be sure the drive was good first though. Also, only install Windows, drivers, and updates. Then test it thoroughly before installing anything else. I've seen many people who reinstall everything and end up reinstalling the problem. Make sure the SATA drive is the only drive in the system too....
Unfortunately I already completed the installation of Windows on the SATA drive and had installed a number of apps (Visual Studio, 7-Zip, VLC, StartAllBack) before I saw your post. I could refrain from installing further programs, but this morning I didn't get a BSOD on startup (with memory at 3200 MT/s) when I almost certainly would have prior to changing the drive Windows was installed on, so I think that's a good sign. And I do still have the NVMe drive in there. I tried installing Windows on the SATA drive with the NVMe drive still connected, but that resulted in the boot manager remaining on the NVMe drive, so I had to take it out, wipe the SATA drive, and install Windows on it again. Then I put the NVMe drive back in, transferred all my important files from it to the SATA drive (accidentally deleting all of my portable apps in the process <_<), and then used diskpart to completely clean the NVMe drive of all partitions. It now consists of one basic data partition covering the entire drive.
 
Aaaaaaaand I just got a MEMORY_MANAGEMENT blue screen on startup this morning. I also apparently got a BSOD a week ago that was "The disk subsystem returned corrupt data while reading from the hibernation file."

I give up. I don't want to keep trying to solve the issue anymore. I do appreciate all the help, though.
 
Maybe. But I honestly think I might have messed up the motherboard. After installing Windows on my SATA drive, things were fine until I turned off fast startup, at which point I started getting MEMORY_MANAGEMENT BSODs again. So I turned fast startup back on, and this morning I got a new one. Bug Check 0x139: KERNEL_SECURITY_CHECK_FAILURE, parameters: (0x0000000000000003, 0xffff940035f4f150, 0xffff940035f4f0a8, 0x0000000000000000).

Despite the new BSOD type, I still don't want to try putting more time into isolating the cause. If I start getting BSODs when I'm in the middle of doing something on my computer instead of just at start-up, then I'll start trying to isolate the cause again.

As it stands now, it still could be caused by any one (or any combination) of these things:

  • motherboard damage
  • faulty CPU
  • faulty memory sticks (unlikely given memory test results)
  • faulty PSU
  • faulty drive(s)
 
When you installed Windows how did you do that?
First I tried installing Windows on my SATA drive when both drives were still in there, and the NVMe drive still had its own installation of Windows. That resulted in the boot files remaining on the NVMe drive, so I removed that from the system and reinstalled Windows on the SATA drive. I then plugged the NVMe drive back in, copied my files that were on it, and wiped the drive of all files and partitions using diskpart.

To install Windows 11, I created a bootable USB using the installation media creation tool on Microsoft's website.

And did you test it fully before installing anything other than Windows and drivers?
I did not. As mentioned above, I had already installed some programs before you had posted the next day recommending that I don't.
 
When you reinstalled Windows on the SATA drive did you delete all partitions on there (via a custom install)?

I would strongly suggest that you test the system with ONLY Windows/drivers/updates installed. I've seen many systems where the user kept reinstalling the problem......
 
When you reinstalled Windows on the SATA drive did you delete all partitions on there (via a custom install)?
I remember using the installer's tools to wipe the drive of all partitions before initiating the install. However, I don't remember whether I did that on the first install, on the second install, or both.

I would strongly suggest that you test the system with ONLY Windows/drivers/updates installed. I've seen many systems where the user kept reinstalling the problem......
At this point I think doing so would be more trouble than it's worth. I appreciate the advice, though, and like I said, if I start getting crashes in the middle of using my PC I'll go back into troubleshooting mode.
 
Fair enough, I know what it's like to troubleshoot a system that seems unfixable. Let us know if you need any further help (not that I've helped much so far!)...
Thanks! You've been very helpful. I just happen to have a system with a particularly stubborn and hard to pin down issue.
 
Bad news. I just got a BSOD while I was using my computer (browsing on Edge), not just on startup this time. KERNEL_DATA_INPAGE_ERROR (0x0000000000000001, 0xffffffffc0000005, 0xffff918fad96c080, 0xffffd9ecdff58000)

Great.

Minidump file.

Also, a zip of the last four minidumps as well, for what that's worth.

Once again it has something to do with the paging file. At least there's consistency, but if the drive were the issue, wouldn't installing Windows to a different drive have solved it? Windows wouldn't put the page file on the NVMe drive when Windows isn't installed on it, right? And if the memory sticks were the issue, surely multiple memory tests would have turned something up, or I would have been experiencing the issue prior to the "upgrade" at the end of last year?

Should I pop my 3200G in there to rule out the 5600X as the cause? Should I just scrap everything except my GPU and start all over? I don't have the money for that, unfortunately. I guess I could use my laptop that has Linux, but I'd sorely miss Visual Studio (not to mention my Steam games). Also, the laptop is running on an i5-4200M.

I'm starting to think I really did mess up the motherboard. But I'd hate to drop money on a new one only to discover that that's not the problem. I doubt it's the PSU, since the BSODs pretty much always have to do with memory or page files, and if the PSU were failing I should be getting all sorts of seemingly unrelated errors, to my knowledge.

What if I just... disabled the page file? I usually never get close to using up all my memory anyway. I understand that it's usually not a good idea, but in my case maybe it would help.

I'm going to run it with the page file disabled for a while and see how it goes. I know the risks, but if this stops the BSODs, programs potentially crashing due to there not being enough memory is better than the operating system crashing because of page file issues.

But even if that does work, that still means there's some fundamental flaw with my system, whether the CPU, motherboard, memory, or drives.

To isolate the memory as a cause, I'd have to dig out my old 4GBx2 kit from an overstuffed closet and use it for a while. Annoying, but the least interruptive check.

To isolate the CPU as the cause, I'd have to put my old 3200G in, meaning a reapplication of thermal paste and reinstalling of a heatsink, and use it for a while. Very annoying.

To isolate the drives as a cause, I'd have to take out all the drives and use my older, low capacity NVMe drive instead, which was nearly full before it was replaced with the 1 TB NVMe drive. And I'd probably have to reinstall Windows. Again. Very, very annoying.

To isolate the motherboard as a cause, I'd have to either buy a new motherboard and reinstall everything, or do all the preceding steps first. Most annoying.

And all this assumes that it's just one component that is the issue and not multiple, or a particular combination of them.

With any luck, I'll have isolated and solved the problem by the end of the year. 🙄

Could updating the BIOS potentially have screwed something up? If I remember right, I selected the latest BIOS version and not the earliest one that supported the 5600X. Maybe I should try switching to an earlier build?
 
It's unwise to run without a paging file - because you won't be able to write any dumps. Dumps are written initially to the paging file. Enter the command sysdm.cpl at the Run command prompt, click the Advanced tab, click the top Settings button (Performance), click the Advanced tab, click the Change button in Virtual Memory. In there ensure that the top checkbox (Automatically manage paging file size for all drives) IS checked. Windows will then size the paging file appropriately and place it on your fastest drive - which should be the system drive in any case.

That most recent dump (the 0x7A) is very useful because it indicates that the problem was in paging in a paged-out page. That means that the failure was either in RAM or in the storage subsystem. The dump triage analysis indicates that this is most likely due to RAM in your case...
Code:
KERNEL_DATA_INPAGE_ERROR (7a)
The requested page of kernel data could not be read in.  Typically caused by
a bad block in the paging file or disk controller error. Also see
KERNEL_STACK_INPAGE_ERROR.
If the error status is 0xC000000E, 0xC000009C, 0xC000009D or 0xC0000185,
it means the disk subsystem has experienced a failure.
If the error status is 0xC000009A, then it means the request failed because
a filesystem failed to make forward progress.
Arguments:
Arg1: 0000000000000001, lock type that was held (value 1,2,3, or PTE address)
Arg2: ffffffffc0000005, error status (normally i/o status code)
Arg3: ffff918fad96c080, current process (virtual address for lock type 3, or PTE)
Arg4: ffffd9ecdff58000, virtual address that could not be in-paged (or PTE contents if arg1 is a PTE address)
You can see in argument 2 there that the exception code you received was a 0xC0000005, indicating an invalid memory reference. Note that we do not see any of the storage subsystem exception codes listed there, only the 0xC0000005. That says to me that this is much more likely to have been caused by bad RAM (or a bad memory controller or bad slot on the motherboard) than by a paging drive issue.
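The check described above can be sketched as a small lookup. It uses only the status codes listed in the 0x7A description quoted in this post; the function name `classify_inpage_status` and the returned labels are made up for illustration. Arg2 is shown sign-extended to 64 bits in the dump, so the sketch masks it down to the 32-bit status first.

```python
# Status codes named in the bug check 0x7A text quoted above.
DISK_FAILURE_STATUSES = {0xC000000E, 0xC000009C, 0xC000009D, 0xC0000185}
FILESYSTEM_STALL_STATUS = 0xC000009A


def classify_inpage_status(arg2):
    """Classify Arg2 of a KERNEL_DATA_INPAGE_ERROR (0x7A) bug check."""
    status = arg2 & 0xFFFFFFFF  # drop the sign extension, keep the 32-bit status
    if status in DISK_FAILURE_STATUSES:
        return "disk subsystem failure"
    if status == FILESYSTEM_STALL_STATUS:
        return "filesystem failed to make forward progress"
    return "not a listed storage status"
```

For the Arg2 in this dump (ffffffffc0000005) the result is "not a listed storage status", which is the observation the reasoning above rests on. It does not prove that RAM is at fault; it only shows that none of the listed storage codes was reported.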

I very much doubt that this is a CPU issue, so I wouldn't advise swapping CPUs. I would try the 4GBx2 RAM sticks you have stuffed away and see how that goes. To be sure of full compatibility any RAM you use really needs to be on the QVL for the motherboard.

The other four dumps that you uploaded could also point at RAM. There are two PTE error bugchecks (page table corruptions) and two corrupted list structures (structures in memory). Whilst both of these can be caused by flaky drivers, they can also be caused by bad RAM.
 
Solution
It's unwise to run without a paging file - because you won't be able to write any dumps. Dumps are written initially to the paging file. Enter the command sysdm.cpl at the Run command prompt, click the Advanced tab, click the top Settings button (Performance), click the Advanced tab, click the Change button in Virtual Memory. In there ensure that the top checkbox (Automatically manage paging file size for all drives) IS checked. Windows will then size the paging file appropriately and place it on your fastest drive - which should be the system drive in any case.
Indeed. In any case, I still got a BSOD on startup this morning with the page file disabled (MEMORY_MANAGEMENT), so there's no point in keeping it disabled. I turned it back on.

You can see in argument 2 there that the exception code you received was a 0xC0000005, indicating an invalid memory reference. Note that we do not see any of the storage subsystem exception codes listed there, only the 0xC0000005. That says to me that this is much more likely to have been caused by bad RAM (or a bad memory controller or bad slot on the motherboard) than by a paging drive issue.

I very much doubt that this is a CPU issue, so I wouldn't advise swapping CPUs. I would try the 4GBx2 RAM sticks you have stuffed away and see how that goes. To be sure of full compatibility any RAM you use really needs to be on the QVL for the motherboard.
Thanks. I'll give the older, known working memory sticks a try and see how it goes. It sounds like if I still get memory BSODs even with those, then the memory slots or the memory controller are bad, meaning I'd need to replace the motherboard, right? Unless it's the storage subsystem. I really wish it would just be the memory sticks being bad after all, since it would be much easier to replace them (or just stick with 8 GB) than to replace the motherboard, but I'm not very hopeful about it, since they were working just fine until I made all the changes at the end of last year.

Edit: Alright, the 4GBx2 kit is in. Now I wait. I don't think it'll be safe to say the problem is gone until more than two weeks of normal usage have passed without any system crashes.

Oh, and regarding the memory QVL for my motherboard, there unfortunately isn't one provided for 5000 series Ryzen processors that I can find.
 
Still no crashes as of yet, but it's too early to say for certain whether the problem is gone. The 4GBx2 kit is clocked at its rated 3000 MT/s.
 
As I still haven't gotten any crashes (though I was away from my computer for the last few days, so it wasn't quite two full weeks of testing), I think it's safe to say that the problem was with the 8GB x 2 memory sticks, likely some weird instability-causing compatibility issue with the 5600X. It's quite frustrating that the extensive memory tests I ran did not reveal any problems. It would have saved a lot of time if they had.

I'm okay with being back to 8 GB total RAM again. I never felt constrained under that amount of memory and never felt like I took proper advantage of having 16 GB anyway. And I wouldn't want to drop the money on another 8GB x 2 kit only to encounter the same issues. Plus, I can't afford it right now.

@ubuysa Thanks for helping me ultimately track down the problem.
 