Question PC crashing - Memory Management and IRQ/Driver issue ?

thhm42

Reputable
Nov 3, 2019
23
0
4,510
Hi all. I've been having some issues as of late with my PC that I use for my CCTV and storage system. It's normally on 24x7 but has been crashing lately.

I've used WhoCrashed and it looks like I've got a memory issue and an IRQ/driver issue, but I don't know which driver. I'm not sure if the memory is bad or if the IRQ issue is related?

I've got the .dmp files. Would that help someone to provide some guidance to me?
The system is a Ryzen 7 with 8 cores/16 threads and 32GB of memory. Standard graphics, as it's not normally used as a standard PC.

View: https://imgur.com/a/afv9cGg

That is the WhoCrashed info.

Thank you for any guidance
 

ubuysa

Distinguished
WhoCrashed and BlueScreenView are pretty useless tools when it comes to analysing BSODs. They use only the data provided by the bugcheck and that's rarely enough to even get close to the real problem. Please ALWAYS upload the minidump files to a cloud service and post a link to them here, so that we can properly analyse them.
 

ubuysa

Thanks for the dumps. Now that I've had a look at them, I do agree that bad RAM is the most likely cause; I'll explain in detail why that is in a second. I wanted first to answer your question about 'the IRQ issue' because you seem interested.

I think you're referring to the 0xA bugcheck - the IRQL_NOT_LESS_OR_EQUAL BSOD? What this is telling you is that a page fault occurred whilst running at an elevated IRQL. This is not allowed. The IRQL (interrupt request level) is an interrupt prioritisation mechanism used at the processor (hardware) level. It defines the range of interrupts that the processor can accept. The lowest level (all interrupts accepted) is PASSIVE_LEVEL (IRQL 0) and any IRQL above that is known as an 'elevated IRQL'. The most commonly used elevated IRQL is DISPATCH_LEVEL (IRQL 2). One of the things a processor cannot do at IRQL 2 or above is an I/O to the paging file, so a page fault cannot be allowed. Drivers that run at elevated IRQLs allocate storage in non-paged pools to avoid this possibility.

You can see in this 0xA dump that the IRQL is 2 (argument 2) and in the stack trace (which you read from the bottom up) you can see a page fault. That's what caused this BSOD....
Code:
STACK_TEXT:
fffff183`09aca468 fffff800`35a11729     : 00000000`0000000a 00000000`00000d4c 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff183`09aca470 fffff800`35a0d2e3     : ffff9280`00000000 00000000`00000008 00000000`0000ffff 00000000`00000000 : nt!KiBugCheckDispatch+0x69
fffff183`09aca5b0 fffff800`358f6346     : 00000000`00000000 ffffb50d`d2ea1508 00000000`00000000 fffff183`09aca801 : nt!KiPageFault+0x463
fffff183`09aca740 fffff800`35a2fe4b     : fffff183`0000002f fffff800`00000000 00000000`00000001 ffffb50d`d2ea11c0 : nt!KeAbPreWait+0x6
fffff183`09aca770 fffff800`35948262     : ffffb50d`d2ea1508 00000000`00000010 00000000`00000001 fffff800`35cfd700 : nt!KeWaitForSingleObject+0x1eff7b
fffff183`09aca860 fffff800`35cfd738     : ffffffff`ffffffff 00000000`00000001 ffffb50d`d23808f0 00000000`00000000 : nt!AlpcpWaitForSingleObject+0x3e
fffff183`09aca8a0 fffff800`35c18955     : ffffffff`00000001 00000000`00000000 ffffb50d`d23808f0 ffffb50d`d23808f0 : nt!AlpcpCompleteDeferSignalRequestAndWait+0x3c
fffff183`09aca8e0 fffff800`35c1b6af     : 00000183`165eff1c 00000000`00000001 fffff183`09aca990 fffff183`09aca988 : nt!AlpcpReceiveMessagePort+0x265
fffff183`09aca950 fffff800`35c1b4fb     : fffff183`09acaa30 00000183`165eff1c fffff183`09aca990 00000000`00000000 : nt!AlpcpReceiveLegacyMessage+0x11f
fffff183`09aca9f0 fffff800`35a10ef8     : ffffb50d`d2ea1080 00000018`131ff828 fffff183`09acaaa8 00000000`00000000 : nt!NtReplyWaitReceivePortEx+0xcb
fffff183`09acaa90 00007ff8`9942d544     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x28
00000018`131ff808 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ff8`9942d544
A page fault occurs because the page containing the virtual address referenced was not in RAM. That could be because it was never allocated, because it's paged out, because the address referenced is wrong (a bad pointer), or because the RAM is bad.

The first three of those reasons are quite common third-party driver screw-ups; only the fourth is a hardware problem. That's why the cause of this particular BSOD is almost always a third-party driver. Typically we'd see the driver on the call stack leading up to the bugcheck - but here we don't. There are no third-party drivers referenced at all; all the function calls are kernel calls (nt!....). That means that bad RAM is more likely given what we see in this dump.
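To make the argument reading for the 0xA dump concrete, here is a minimal Python sketch that decodes the three argument values visible on the nt!KeBugCheckEx line of the stack trace (0xd4c, 2 and 0). The function name `decode_0xa` and the dictionary layout are made up for illustration, and the bit meanings for argument 3 (bit 0 = write, bit 3 = execute) are taken from the documented layout of this bugcheck, so treat it as a sketch rather than a replacement for WinDbg's own analysis.

```python
# Standard Windows names for the three lowest IRQLs.
IRQL_NAMES = {0: "PASSIVE_LEVEL", 1: "APC_LEVEL", 2: "DISPATCH_LEVEL"}


def decode_0xa(arg1, arg2, arg3):
    """Decode the first three IRQL_NOT_LESS_OR_EQUAL (0xA) arguments.

    Hypothetical helper for illustration only:
      arg1 = virtual address that was referenced
      arg2 = IRQL at the time of the reference
      arg3 = access type bits (bit 0 set = write, bit 3 set = execute)
    """
    if arg3 & 0x8:
        access = "execute"
    elif arg3 & 0x1:
        access = "write"
    else:
        access = "read"
    return {
        "address": hex(arg1),
        "irql": IRQL_NAMES.get(arg2, f"IRQL {arg2}"),
        "access": access,
    }


# Values read off the nt!KeBugCheckEx line in the stack trace above.
info = decode_0xa(0xD4C, 0x2, 0x0)
print(info)
```

With those values the sketch reports a read of address 0xd4c at DISPATCH_LEVEL. An address that low is nowhere near valid kernel memory, which fits the 'bad pointer' or 'bad RAM' explanations.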

The other two dumps are 0x1A - MEMORY_MANAGEMENT and a 0xBE - ATTEMPTED_WRITE_TO_READONLY_MEMORY.

The 0x1A has an argument 1 value of 8886 which indicates that two pages on the standby list that were supposed to have identical page priority values don't have identical page priority values. The differing values are captured in argument 4. The most likely cause of that is a bad RAM page.

The 0xBE dump also has no third-party drivers on the call stack leading up to the bugcheck...
Code:
0: kd> knL
 # Child-SP          RetAddr               Call Site
00 fffff802`48c74448 fffff802`43a7a1e8     nt!KeBugCheckEx
01 fffff802`48c74450 fffff802`4382474f     nt!MiRaisedIrqlFault+0x141740
02 fffff802`48c744a0 fffff802`43a0d1d8     nt!MmAccessFault+0x4ef
03 fffff802`48c74640 fffff802`4383ceb6     nt!KiPageFault+0x358
04 fffff802`48c747d0 fffff802`438c2086     nt!KiTryUnwaitThread+0x186
05 fffff802`48c74830 fffff802`438c1c2c     nt!KiTimerWaitTest+0x1e6
06 fffff802`48c748e0 fffff802`438c0d3d     nt!KiProcessExpiredTimerList+0xdc
07 fffff802`48c749d0 fffff802`43a0202e     nt!KiRetireDpcList+0x5dd
08 fffff802`48c74c60 00000000`00000000     nt!KiIdleLoop+0x9e
You can also see (in the way I've displayed the stack here) that the failing function was frame 4 - because frame 3 (the next call) is a page fault. We can display the details of frame 4 to see what happened...
Code:
0: kd> .frame /r 4
04 fffff802`48c747d0 fffff802`438c2086     nt!KiTryUnwaitThread+0x186
rax=0000000000000001 rbx=ffffd78a945d6080 rcx=0000000000000000
rdx=0000000000000000 rsi=ffffd78a945d6250 rdi=fffff80243841495
rip=fffff8024383ceb6 rsp=fffff80248c747d0 rbp=fffff802408c5180
 r8=0000000000000102  r9=0000000000000000 r10=0000000000000000
r11=fffff80248c748c0 r12=0000000000000102 r13=0000000000000000
r14=fffff802408c5180 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
nt!KiTryUnwaitThread+0x186:
fffff802`4383ceb6 f0480fbaaf107c000000 lock bts qword ptr [rdi+7C10h],0 ds:002b:fffff802`438490a5=c7fffffbe8850f00
You can see at the bottom there that the BSOD happened whilst attempting to execute a locked instruction (lock bts) using the RDI register and an offset as a pointer to a lock in memory, but the resulting pointer address (fffff802`438490a5) contains an invalid memory location (c7fffffbe8850f00). It's invalid because it's non-canonical - it's not in the valid range of allowed virtual addresses. You can check this by displaying the page table entry for it...
Code:
0: kd> !pte c7fffffbe8850f00
                                           VA c7fffffbe8850f00
PXE at FFFF9BCDE6F37FF8    PPE at FFFF9BCDE6FFFF78    PDE at FFFF9BCDFFFEFA20    PTE at FFFF9BFFFDF44280
Unable to get PXE FFFF9BCDE6F37FF8
WARNING: noncanonical VA, accesses will fault !
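The same two checks can be reproduced outside the debugger. The Python sketch below recomputes the pointer from RDI plus the 7C10h offset and then tests values for canonical form. It assumes 48-bit virtual addresses (4-level paging, which matches the PXE/PPE/PDE/PTE layout in the !pte output); the function names are made up for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base, displacement):
    """Address formed by [base + displacement], wrapped to 64 bits."""
    return (base + displacement) & MASK64


def is_canonical(va, va_bits=48):
    """True if bits 63..(va_bits-1) of va are all zeros or all ones.

    va_bits=48 is an assumption (4-level paging); systems using
    5-level paging would need va_bits=57.
    """
    top = (va & MASK64) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top in (0, all_ones)


# rdi and the 7C10h offset are taken from the .frame /r 4 output above.
ea = effective_address(0xFFFFF80243841495, 0x7C10)
print(hex(ea))                           # pointer used by lock bts
print(is_canonical(ea))                  # the pointer itself
print(is_canonical(0xC7FFFFFBE8850F00))  # the value shown by the debugger
```

The computed pointer matches the fffff802`438490a5 shown in the disassembly and is canonical, whereas c7fffffbe8850f00 fails the check, in line with the !pte warning.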
It's pretty clear that the address that was intended was 0xFFFFFFFBE8850F00 but that several bits in the first byte have not been set properly...
Rich (BB code):
0: kd> .formats c7fffffbe8850f00
Evaluate expression:
   Binary:  11000111 11111111 11111111 11111011 11101000 10000101 00001111 00000000

0: kd> .formats fffffffb`e8850f00
Evaluate expression:
  Binary:  11111111 11111111 11111111 11111011 11101000 10000101 00001111 00000000
This can only have been caused by a bad RAM page.
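The comparison of the two .formats outputs can also be done with an XOR, which leaves a 1 in every bit position where the two values differ. This is a minimal Python sketch that takes 0xFFFFFFFBE8850F00 as the intended value, as read above; `differing_bits` is a made-up helper name.

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


observed = 0xC7FFFFFBE8850F00  # value seen in the dump
intended = 0xFFFFFFFBE8850F00  # assumed intended value, per the reading above

bits = differing_bits(observed, intended)
print(hex(observed ^ intended))
print(bits)
```

The XOR comes out as 0x3800000000000000, so the two values differ only in bits 59, 60 and 61, all in the first byte. Those are the three 0 bits in 11000111 that should have been 1.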

I've gone into a lot of detail because you seem to be interested in the detail. The conclusion is that you have bad RAM and I'd suggest you test it with Memtest86...
  1. Download Memtest86 (free), use the imageUSB.exe tool extracted from the download to make a bootable USB drive containing Memtest86 (1GB is plenty big enough). Do this on a different PC if you can, because you can't fully trust yours at the moment.
  2. Then boot that USB drive on your PC, Memtest86 will start running as soon as it boots.
  3. If no errors have been found after the four iterations of the 13 different tests that the free version does, then restart Memtest86 and do another four iterations. Even a single bit error is a failure.
Let us know how that goes.
 

thhm42

You sir, are a legend. Thank you for sharing this information with me. While I don't understand all of it, I do trust your experience.

I guess what's left is to run the memtest and see if I can isolate which stick of memory has an issue, then replace.

How do I mark the correct answer? Thank you again ubuysa!
 

ubuysa

TBH if Memtest shows up a flaky RAM stick then you're better off replacing both. RAM needs to be in matched pairs (or quads). If you can find another stick with exactly the same part number (KHX3200C18D4/8G) then you might be OK, but if not then buy a pack of two matched RAM sticks. It's also worth ensuring that any RAM you do buy is on the QVL of the motherboard - this is RAM that has been tested and certified as compatible.
 

thhm42

Appreciate the update ubuysa. I've got 4x8GB sticks so I'll be doing the test, then seeing if I have to replace 1 or 2 of the sticks.