Question: I'm getting multiple BSODs

Jun 16, 2024
I've been having multiple different blue screens for the last couple of months. A lot were solved after finding out that my old motherboard and most of my old parts were fried during a surge, but some persist, and they mostly point back to ntoskrnl.exe. I have taken the PC to a tech a few times and he cannot get it to blue screen at all. He had it for almost a week last time running stress tests and had no issues.

I bring it home and a few hours later it has another crash. I have only one original part from the fried motherboard era, my CPU, which the tech is certain has no issues. He thinks it might be environmental, but I got an uninterruptible power supply on his recommendation because I'm currently living with a not-so-good electrical system (the landlord ties his electric fence into the same power as the house). As of now I have no idea what is wrong. I'm not really that good at reading minidumps, and nothing I do seems to fix it or find a root cause.

The blue screen codes I get the most of recently are:
DPC_WATCHDOG_VIOLATION and IRQL_NOT_LESS_OR_EQUAL, with a smattering of SYSTEM_SERVICE_EXCEPTION, SYSTEM_THREAD_EXCEPTION_NOT_HANDLED and a single MEMORY_MANAGEMENT.

My system specs are:
MB: MSI MAG Z790 Tomahawk Max Wifi
CPU: Intel Core i5-12600KF
CPU Cooler: Noctua NH-U12S redux with NF-P12 redux-1700 PWM
RAM: Teamgroup TForce Vulkan Alpha DDR5-5600 32GB (dual channel)
GPU: Nvidia GeForce RTX 3060 Ti
Power: Segotep 750W
Drives: Crucial P3 4TB, Crucial T500 2TB (C drive)

Minidumps:
https://drive.google.com/file/d/1Khk0LYhd0vzl0bcjDl7xryZkvDLsOhbR/view?usp=sharing
https://drive.google.com/file/d/1YSxOYtPOWlrPtHJhaKrM_bJq0SmRWy8M/view?usp=sharing
https://drive.google.com/file/d/1dskVfBivoYiY6vxJ0sJI7ZFcNUB_5zkh/view?usp=sharing
https://drive.google.com/file/d/1_S7G7JU-GcTsXxwEhiSREq-DUu1ldZ_R/view?usp=sharing
https://drive.google.com/file/d/1SXV4S6GPh1whzn1_IXjM4rWnL_VgzeQj/view?usp=sharing
https://drive.google.com/file/d/13qVF9RM0Btrnl_D-RjlMfM1jR-TeUO0A/view?usp=sharing
 
Both those 0x133 dumps happened because a collection of DPCs (the back end of device interrupt processing) ran for too long. The only way to debug that particular BSOD is with the kernel dump, which is the file C:\Windows\Memory.dmp. Upload that to a cloud service and I'll take a look.
Will do. Trying to upload now, but I'm not sure when it will be done as my upload speed is not good.
 
Just chiming in with my own observations. Of the 12 mini dump files provided so far, 10 of them happened on logical core 8 - which is possibly interesting.

The MEMORY.DMP callstack is showing more of the functions for the thread on which the bugcheck was processed than the corresponding mini dump shows. That thread is running on logical core 8 and was trying to acquire a spin lock for 2 minutes in the MEMORY.DMP - which is the timeout period for the DPC_WATCHDOG_VIOLATION with Arg1 equal to 0x1.

Attempting to acquire a spin lock happens at DISPATCH_LEVEL, which means any interrupts at DISPATCH_LEVEL or lower would be masked on that core while nt!ExpWaitForSpinLockSharedAndAcquire was running. In other words, no DPCs, APCs, or user code would run on that core for those 2 minutes. Spin locks are not supposed to be held for more than 25 microseconds, so not being able to acquire one for 2 minutes is a big red flag. According to the !dpcs extension there are DPCs in that processor's DPC queue, and they could not have been processed. Higher-level interrupts would still run on that core and then drop back to DISPATCH_LEVEL, which would resume executing nt!ExpWaitForSpinLockSharedAndAcquire until the 2 minute watchdog timeout caused Windows to bugcheck the system.

So the question is why that spin lock is not becoming available to nt!ExpWaitForSpinLockSharedAndAcquire. That I'm not sure about. Does anyone know how to check the status of and/or what owns a spin lock in WinDbg?
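
To make the masking argument concrete, here is a toy Python model of IRQL masking on x64. It is only a sketch of the rule described above, not kernel code: the `Irql` enum, `is_masked` and `blocked_while_spinning` are names made up for illustration, and `DEVICE_LEVEL = 5` is an arbitrary stand-in for a device interrupt level.

```python
from enum import IntEnum


class Irql(IntEnum):
    """Selected x64 IRQL values (illustrative subset)."""
    PASSIVE_LEVEL = 0    # ordinary thread code, including user mode
    APC_LEVEL = 1        # asynchronous procedure calls
    DISPATCH_LEVEL = 2   # DPCs, the scheduler, spin lock acquisition
    DEVICE_LEVEL = 5     # hypothetical device interrupt level (assumption)
    CLOCK_LEVEL = 13     # clock interrupt
    HIGH_LEVEL = 15


def is_masked(current: Irql, incoming: Irql) -> bool:
    """Work at `incoming` is held off while the core runs at an equal or higher IRQL."""
    return incoming <= current


def blocked_while_spinning() -> list[str]:
    """Levels that cannot run on a core spinning for a lock at DISPATCH_LEVEL."""
    current = Irql.DISPATCH_LEVEL
    return [level.name for level in Irql if is_masked(current, level)]


print(blocked_while_spinning())
```

Under this model, everything at DISPATCH_LEVEL and below is starved on the spinning core, while device and clock interrupts can still preempt it, which matches the behaviour described in the post.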
 
The cause of the most recent 0x133 BSOD needs the kernel dump as I mentioned. This is because it happened when a collection of DPCs ran for too long rather than just a single DPC running for too long. We need the kernel dump to be able to access all processors and thus all running DPCs.

Without boring you, the technique for this type of BSOD is to dump the WMI trace records from the kernel dump and then extract and export the DPC trace records in a format that the Windows Performance Analyzer (WPA) can read. We then use WPA to display all the running DPCs graphically and numerically, which lets us see whether one is contributing more to the total run time than the others.

Microsoft recommend that no DPC runs for longer than 100 microseconds (0.1 milliseconds), and in the WPA output we can see the total run time for each DPC in the Duration (Fragmented) (ms) SUM column, which I've expanded in the image below.

sDWi2Mw.jpg


You can see clearly that the problem DPC here is in the nvlddmkm.sys Nvidia graphics driver, which ran for 12.1 ms, or 12100 microseconds. That is roughly 120 times longer than Microsoft recommend.

The next longest running is dxgkrnl.sys, the Windows DirectX kernel, but that of course calls nvlddmkm.sys and so is dependent on it. The tcpip.sys driver runs longer than recommended, but not excessively so. In any case we were probably streaming here, so tcpip.sys may well have been delayed by nvlddmkm.sys.

It's clear that the problem in this long-running DPC group is nvlddmkm.sys. The installed version of this driver appears current; it's dated June 2nd 2024...
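
As a quick check of that arithmetic, here is a small Python sketch. The 100 microsecond guideline and the 12.1 ms figure come from the post; the function and constant names are made up for illustration.

```python
RECOMMENDED_MAX_US = 100.0  # Microsoft's guideline quoted in the post


def ms_to_us(duration_ms: float) -> float:
    """Convert milliseconds to microseconds."""
    return duration_ms * 1000.0


def times_over_guideline(duration_ms: float) -> float:
    """How many multiples of the 100 us guideline a DPC duration represents."""
    return ms_to_us(duration_ms) / RECOMMENDED_MAX_US


ratio = times_over_guideline(12.1)
print(f"nvlddmkm.sys: {ms_to_us(12.1):.0f} us, about {ratio:.0f}x the guideline")
```

The exact ratio works out to 121, so "about 120 times" is a fair rounding.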
Code:
8: kd> lmvm nvlddmkm
Browse full module list
start             end                 module name
fffff804`4c2b0000 fffff804`4fdb4000   nvlddmkm   (deferred)            
    Image path: \SystemRoot\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_cc569e59ca39c5fe\nvlddmkm.sys
    Image name: nvlddmkm.sys
    Browse all global symbols  functions  data
    Timestamp:        Sun Jun  2 02:20:19 2024 (665BACB3)
    CheckSum:         039EC89C
    ImageSize:        03B04000
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4
    Information from resource tables:
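
For anyone wondering where the date comes from: the hex value in brackets after the timestamp can be decoded as seconds since 1970-01-01 UTC. A minimal Python sketch, assuming the value is a plain build time (which it appears to be here); `decode_pe_timestamp` is a made-up helper name.

```python
from datetime import datetime, timezone


def decode_pe_timestamp(hex_value: str) -> datetime:
    """Interpret a module timestamp as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(int(hex_value, 16), tz=timezone.utc)


build_time = decode_pe_timestamp("665BACB3")
print(build_time.isoformat())
```

This gives 2024-06-01 23:20:19 UTC. The lmvm output shows 02:20:19 on June 2nd, three hours later, which is presumably the local time zone of the machine running the debugger.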

BUT (and it's a big but)...

In addition, another potential cause of these DPC BSODs is your RAM. I can see from the dump that you have 32GB in two TeamGroup UD5-5600 DDR5 16GB sticks running at 5200 MT/s...
Rich (BB code):
[Memory Device (Type 17) - Length 92 - Handle 003eh]
  Memory Error Info Handle      [Not Provided]
  Total Width                   64 bits
  Data Width                    64 bits
  Size                          16384MB
  Form Factor                   09h - DIMM
  Device Set                    [None]
  Device Locator                Controller1-DIMMB2
  Bank Locator                  BANK 0
  Memory Type                   22h - DDR5
  Type Detail                   0080h - Synchronous
  Speed                         5200MHz
  Manufacturer                  Team Group Inc
  Serial Number
  Asset Tag Number
  Part Number                   UD5-5600
  Attributes                    1
  Extended Size                 0
  Configured Memory Speed       5200
If you check the spec for your i5-12600KF CPU you'll find that the maximum DDR5 transfer rate that the CPU will support is only 4800 MT/s. You are thus running your RAM faster than the CPU supports and this may well be the root cause of all your problems. I've seen this many times before (I even did it myself in one build!). I would disable the XMP profile you have and run the RAM at stock 4800 MT/s. See whether it's stable then before looking at nvlddmkm.sys.
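
If you want to repeat this check on another dump, here is a rough Python sketch that pulls the configured speed out of SMBIOS text like the block above and compares it with the CPU limit. The excerpt is copied from the dump output in this post and the 4800 MT/s limit is the figure quoted above; the function and constant names are made up for illustration.

```python
import re

CPU_MAX_DDR5_MTS = 4800  # i5-12600KF DDR5 limit as stated in the post

# A few lines copied from the Memory Device (Type 17) output above.
SMBIOS_EXCERPT = """\
  Memory Type                   22h - DDR5
  Speed                         5200MHz
  Part Number                   UD5-5600
  Configured Memory Speed       5200
"""


def configured_speed(smbios_text: str) -> int:
    """Return the 'Configured Memory Speed' value from SMBIOS Type 17 text."""
    match = re.search(
        r"^\s*Configured Memory Speed\s+(\d+)", smbios_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Configured Memory Speed' field found")
    return int(match.group(1))


speed = configured_speed(SMBIOS_EXCERPT)
over_spec = speed > CPU_MAX_DDR5_MTS
print(f"configured {speed} MT/s, CPU limit {CPU_MAX_DDR5_MTS} MT/s, over spec: {over_spec}")
```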
 
Hey ubuysa. I'm a friend of Hawk's, the one who's been helping them troubleshoot every possibility I could before sending them to a PC tech (who was helpful, just not for this issue), and eventually here when the BSODs continued after the faulty parts were replaced.

I just want to let you know that Hawk, as per your earlier recommendation, had already disabled "XMP" before the last couple of crashes they uploaded here.

It's worth noting that I say "XMP" because, as mentioned, their motherboard only shows iEXPO.
They were having these crashes before activating iEXPO, while using it, and after disabling it.

With iEXPO off and no custom profiles selected, the RAM defaults to a stock speed of 5200 with no tampering or changing. To get down to 4800 they'll need to downclock manually. I'll help them with that later.

Also as per your request, Hawk has already uploaded the memory.dmp file for you, it's currently at the top of Page 2 of this thread.

We've a day or so of gaming ahead of us tomorrow (today, technically!) so once I'm done with my daily duties at home, if Hawk hasn't already downclocked to 4800, I'll help him do it over a call and then we'll see what happens.

Thank you to everyone in this thread so far, we've both been tearing our hair out at this since last year.
Much love.

Edit:
Spelling
 
Those three dumps still point very strongly at bad RAM. Here are the failure buckets from each dump...
Code:
FAILURE_BUCKET_ID:  0x1a_8887_ZERO_PAGE_CORRUPTED_IMAGE_hardware_ram
FAILURE_BUCKET_ID:  IP_MISALIGNED_GenuineIntel.sys
FAILURE_BUCKET_ID:  IP_MISALIGNED_GenuineIntel.sys
A misaligned IP means the instruction pointer is out of sequence. This is quite common when bad RAM is involved.

I would test your RAM with Memtest86...
  1. Download Memtest86 (free) and use the imageUSB.exe tool extracted from the download to make a bootable USB drive containing Memtest86 (1GB is plenty big enough). Do this on a different PC if you can, because you can't fully trust yours at the moment.
  2. Then boot that USB drive on your PC, Memtest86 will start running as soon as it boots.
  3. If no errors have been found after the four iterations of the 13 different tests that the free version does, then restart Memtest86 and do another four iterations. Even a single bit error is a failure.
 
Did 12 passes of 13 tests each with no errors.
 
Is the Intel Processor Diagnostic Tool a reliable way to reproduce a crash? The memory.dmp file shows a couple of threads in the READY state on logical core 8 that I think are part of that test. The threads' owning processes were named Math_PrimeNum.exe and Math_FP.exe. All of the other cores were idle, but logical core 8 was stuck trying to acquire a spin lock, which prevented those READY threads from getting any CPU time.

There's that logical core 8 again, and 2 of the 3 new dump files provided happened on the same physical core (which consists of logical cores 8 and 9). That's 13 out of 15 dump files provided so far which crashed on the same physical core.
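
For reference, here is a small Python sketch of how logical cores could map to physical cores on this CPU. It rests on two assumptions not stated in the thread: that the i5-12600KF has 6 hyper-threaded P-cores plus 4 single-threaded E-cores, and that Windows enumerates the P-core threads first, two per core. `physical_core` and its labels are made up for illustration.

```python
P_CORES = 6  # assumption: 6 hyper-threaded performance cores
E_CORES = 4  # assumption: 4 single-threaded efficiency cores


def physical_core(logical: int) -> str:
    """Map a logical processor number to a physical core label.

    Assumes P-core threads come first (logical 0-11), then E-cores (12-15).
    """
    p_threads = P_CORES * 2
    if not 0 <= logical < p_threads + E_CORES:
        raise ValueError(f"logical processor {logical} out of range")
    if logical < p_threads:
        return f"P{logical // 2}"
    return f"E{logical - p_threads}"


share = 13 / 15  # dumps on the same physical core, as counted in the post
print(physical_core(8), physical_core(9), f"{share:.0%}")
```

Under those assumptions, logical cores 8 and 9 both land on the fifth P-core, and roughly 87% of the dumps so far involve it.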
 
So far every time I've run the diagnostic tool I've had a BSOD. I'm currently running another memtest overnight, but I can run the diagnostic tool in the morning before work to see if it causes another BSOD.
 
Assuming it crashes again, please make the new memory.dmp file available for comparison. Compressing it and providing a link to the compressed file is certainly my preference.
 
Does anyone know how to check the status of and/or what owns a spin lock in WinDbg?
A spin lock (KSPIN_LOCK) is just a pointer-sized value with a single bit set to 0x1 or 0x0. On debug builds, apparently, the value is the address of the owner (_KTHREAD). There is no reason to keep that information in a release build, so your best option is to use !running -t and then check for threads which have acquired a spin lock or are running on a processor at >= DISPATCH_LEVEL.
 
Did I miss that the Intel Processor Diagnostic Test has been run, and failed?

That's a really good spot. I was concentrating on the 0x133 data in the kernel dump. A bad CPU would also explain everything we're seeing. The Intel PDT should never crash; if it does, then in my book that is a CPU failure.

The two waiting threads on processor 8 do look like they could be part of a processor stress test, but I have no idea what they actually are.

I may well have disappeared down a rabbit hole here. If the Intel PDT is causing BSODs then the CPU is the problem. I think you (@cwsink) have a better handle on what's happening here than I do, so I'll take a back seat for now.
 
So, this latest memory.dmp is mostly the same - watchdog timeout occurred on logical core 9 (same physical core as logical core 8) and there are two threads in the READY state owned by the same processes as mentioned before.

What's different is that two other cores (logical cores 10 and 15) also seem to be stuck trying to acquire spin locks, though I think they are different spin locks from each other and from the one involved on logical core 9. Something odd is that logical core 15 shows it having done no work for almost 3 minutes, which I would have thought would trigger a watchdog timeout on that core if it was trying to acquire a spin lock that whole time. But maybe something was going on with that core that prevented it from doing work before it tried to acquire a spin lock. The code seems to have terminated a thread before trying to obtain a spin lock, so maybe that has something to do with it.

Three things make me think there's something wrong with that core: the same physical core is involved and doing the same thing as in the first memory.dmp, I'm pretty sure the cores are only executing Microsoft and/or Intel code, and it happened during a CPU stress test.

@HawKitsune - do you know if the tech who worked on your computer ran the Intel Processor Diagnostic Test?
 
Not recently to my knowledge. Up until about 3-4 weeks ago, running the test wouldn't cause a BSOD. He last had it about a month ago now, I believe.
 
do you know if the tech who worked on your computer ran the Intel Processor Diagnostic Test?
Called the tech and got an answer from him. It has been a couple of months since he used it, about 2 1/2 months. He believes that it isn't my CPU and, because of that belief, never really went down that route to check whether it was bad.
 
Does the Intel Processor Diagnostic Tool run in safe mode? Search results suggest it does. The thought is that it almost has to be a hardware issue if the crash is reproducible in safe mode, since that's about as minimal a boot as possible.