Question: I'm getting multiple BSODs?

Jun 16, 2024
I've been having multiple different blue screens for the last couple of months. Many were resolved once I found out that my old motherboard and most of my old parts had been fried during a power surge, but some keep persisting, mostly pointing to ntoskrnl.exe. I have taken the PC to a tech a few times and he cannot get it to blue screen at all. Last time he had it for almost a week running stress tests and had no issues.

I bring it home and a few hours later it has another crash. The only original part left from the fried-motherboard era is my CPU, which the tech is certain has no issues. He thinks the cause might be environmental, but I got an uninterruptible power supply on his recommendation because I currently live with a not-so-good electrical system (the landlord ties his electric fence into the same power as the house). As of now I have no idea what is wrong. I'm not really that good at reading minidumps, and nothing I do seems to fix it or find a root cause.

The blue screen codes I get most often recently are:
DPC_WATCHDOG_VIOLATION and IRQL_NOT_LESS_OR_EQUAL, with a smattering of SYSTEM_SERVICE_EXCEPTION, SYSTEM_THREAD_EXCEPTION_NOT_HANDLED and a single MEMORY_MANAGEMENT.

My system specs are:
MB: MSI MAG Z790 Tomahawk Max Wifi
CPU: Intel Core i5-12600KF
CPU Cooler: Noctua NH-U12S redux with NF-P12 redux - 1700pwm
RAM: Teamgroup TForce Vulkan Alpha DDR5-5600 32GB (dual channel)
GPU: Nvidia GeForce RTX 3060 Ti
Power: Segotep 750W
Drives: Crucial P3 4TB, Crucial T500 2TB (C drive)

Minidumps:
https://drive.google.com/file/d/1Khk0LYhd0vzl0bcjDl7xryZkvDLsOhbR/view?usp=sharing
https://drive.google.com/file/d/1YSxOYtPOWlrPtHJhaKrM_bJq0SmRWy8M/view?usp=sharing
https://drive.google.com/file/d/1dskVfBivoYiY6vxJ0sJI7ZFcNUB_5zkh/view?usp=sharing
https://drive.google.com/file/d/1_S7G7JU-GcTsXxwEhiSREq-DUu1ldZ_R/view?usp=sharing
https://drive.google.com/file/d/1SXV4S6GPh1whzn1_IXjM4rWnL_VgzeQj/view?usp=sharing
https://drive.google.com/file/d/13qVF9RM0Btrnl_D-RjlMfM1jR-TeUO0A/view?usp=sharing
 
I have thought for some time that this is most likely a bad CPU and nothing that's been tried since my last posting here has changed my mind.

I would suggest stressing the CPU with Prime95....
  1. Download Prime95 and a CPU temperature monitor (CoreTemp will do).
  2. Keep the temperature monitor running all the time you run Prime95. Your CPU will get hot!
  3. Run each of the three Prime95 tests (smallFFTs, largeFFTs, and Blend) one after the other for a minimum of 1 hour per test, 2 hours per test would be better.
  4. If Prime95 generates error messages, if the system crashes/freezes/BSODs, or if your CPU temp approaches 100°C (Tmax for your CPU), then stop Prime95 and let us know what happened.
Note that a properly cooled and stable CPU should be able to run all Prime95 tests pretty much indefinitely.
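
The stop conditions in step 4 can be written down as a simple check. This is only a sketch: the function name `should_stop`, its parameters, and the 5°C margin used to decide what "approaches" means are all assumptions for illustration. The temperature reading and error count would have to come from whatever monitoring you are doing by hand or with your own tooling.

```python
def should_stop(temp_c, prime95_errors, system_hung, tmax_c=100.0, margin_c=5.0):
    """Return a reason string if the stress test should be stopped, else None.

    tmax_c is the 100 degC Tmax quoted in step 4; margin_c is an assumed
    safety margin, not an official figure.
    """
    if system_hung:
        return "system crashed or froze"
    if prime95_errors > 0:
        return "Prime95 reported errors"
    if temp_c >= tmax_c - margin_c:
        return "CPU temperature approaching Tmax"
    return None


# Example: a cool, error-free run keeps going; a hot one is stopped.
print(should_stop(72.0, 0, False))
print(should_stop(97.0, 0, False))
```

Whatever the reason returned, the useful part for this thread is reporting which test was running and how long it had been running when it happened.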

FYI: The small FFT test stresses the CPU more than RAM. The large FFT test stresses RAM more than the CPU. The Blend test is a mixture of the two.

Alternatively (or as well) you could try the CPU stress test in OCCT. I used OCCT to check the stability of my recent new build.
 
Although we can never be 100% certain, if that were mine I'd be replacing the CPU. That last BSOD had a 0x80000003 exception, which indicates that either a breakpoint or an ASSERT was encountered. That doesn't sound credible to me.

This BSOD occurred in processor 9 which I believe is the same physical core as processor 8 which we think had the problems earlier. That also suggests the CPU is at fault.
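
As a rough illustration of why processors 8 and 9 would share a physical core: the i5-12600KF has 6 hyper-threaded P-cores and 4 E-cores, giving 16 logical processors. The sketch below assumes the common enumeration where each P-core's two logical processors are numbered as an adjacent pair, with the E-cores following. That ordering is an assumption here, not something read from the dumps, and the function name is hypothetical.

```python
def physical_core(logical_cpu, p_cores=6, e_cores=4):
    """Map a logical processor number to a (kind, index) physical core.

    Assumes P-cores come first with two adjacent logical processors each,
    followed by E-cores with one logical processor each.
    """
    smt_logicals = 2 * p_cores
    if not 0 <= logical_cpu < smt_logicals + e_cores:
        raise ValueError("logical processor out of range")
    if logical_cpu < smt_logicals:
        return ("P", logical_cpu // 2)
    return ("E", logical_cpu - smt_logicals)


print(physical_core(8), physical_core(9))  # ('P', 4) ('P', 4)
```

Under that assumption, processors 8 and 9 both land on the fifth P-core (index 4), which matches the suspicion above.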

Can you confirm whether it also fails to complete the Intel Processor Diagnostic Tool test?
 
I bought a new CPU and will have to get it installed next week as I'm not confident doing it myself. I want to thank y'all for the help in figuring out my PC's issues over the last few weeks. Hopefully my issues will be fixed with the new part.
 
@EggShell - Thank you for your thoughts. Your comment narrowed my focus.

!running -t on the memory.dmp linked in this reply shows only core 8 as not running the core's idle thread at the time of the bugcheck. The only threads that show up when running the !ready extension are:
Code:
8: kd> !ready
KSHARED_READY_QUEUE fffff80425b4b040: (00) ********----------------------------------
SharedReadyQueue fffff80425b4b040: No threads in READY state
Processor 0: No threads in READY state
Processor 1: No threads in READY state
Processor 2: No threads in READY state
Processor 3: No threads in READY state
Processor 4: No threads in READY state
Processor 5: No threads in READY state
Processor 6: No threads in READY state
Processor 7: No threads in READY state
KSHARED_READY_QUEUE ffff8005627ae040: (00) --------********--------------------------
SharedReadyQueue ffff8005627ae040: No threads in READY state
Processor 8: Ready Threads at priority 9
    THREAD ffff800573cd9080  Cid 07f4.3f20  Teb: 0000003b5071a000 Win32Thread: ffff800571d3e710 READY on processor 8
    THREAD ffff800575960080  Cid 16e0.3224  Teb: 000000050ac75000 Win32Thread: ffff8005789ce4b0 READY on processor 8
Processor 9: No threads in READY state
Processor 10: No threads in READY state
Processor 11: No threads in READY state
Processor 12: No threads in READY state
Processor 13: No threads in READY state
Processor 14: No threads in READY state
Processor 15: No threads in READY state

So, there are two threads in the READY state on processor 8, but the thread on core 8 has been waiting on a spin lock for two minutes. If I'm understanding correctly, this is a deadlock, as that thread will never be able to acquire the spin lock since no thread is running anywhere that could make it available. Is that correct? If so, the only scenarios I can think of that might cause that are the thread that held the spin lock exited without releasing the spin lock (perhaps crashed) or a bit flip is causing the spin lock to be incorrectly interpreted as unavailable. I guess either could be caused by a faulty CPU core but the former seems more likely.
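
For longer !ready listings it can help to pull out just the processors that have queued threads. The following is a minimal sketch that parses pasted text in the format shown above. It does not talk to the debugger; the function name is hypothetical and the regular expressions only assume the line shapes visible in this post.

```python
import re


def processors_with_ready_threads(ready_output):
    """Return {processor_number: [thread_address, ...]} for non-empty queues."""
    result = {}
    current = None  # processor whose THREAD lines we are currently collecting
    for line in ready_output.splitlines():
        line = line.strip()
        proc = re.match(r"Processor (\d+): (.*)", line)
        if proc:
            if proc.group(2).startswith("Ready Threads"):
                current = int(proc.group(1))
                result[current] = []
            else:
                current = None
            continue
        thread = re.match(r"THREAD ([0-9a-fA-F]+)", line)
        if thread and current is not None:
            result[current].append(thread.group(1))
    return result


sample = """\
Processor 7: No threads in READY state
Processor 8: Ready Threads at priority 9
    THREAD ffff800573cd9080  Cid 07f4.3f20  Teb: 0000003b5071a000 Win32Thread: ffff800571d3e710 READY on processor 8
    THREAD ffff800575960080  Cid 16e0.3224  Teb: 000000050ac75000 Win32Thread: ffff8005789ce4b0 READY on processor 8
Processor 9: No threads in READY state
"""
print(processors_with_ready_threads(sample))
```

On the excerpt above this reports only processor 8, with the two thread addresses from the listing.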

After extracting the Circular Kernel Context Logger as an etl and opening it in WPA with the graphs I think might be relevant, I see this when I select core 8 in the middle graph. It looks like core 8 is actually doing work for a short period at the beginning of the graph. If I select the item at the end of that period and then drag a selection to the end of the trace, the duration of the selection is two minutes, which is the watchdog timeout period.

The !thread extension shows the thread running on core 8 as having done 2:00.375 minutes of KernelTime work at some point (which I suppose isn't necessarily the same two minutes in the etl trace) while also showing no work being done for the last two minutes (according to the Ticks value.) My interpretation of the etl graph would be that it wasn't doing any work.

Do you know whether or not a thread waiting on a spin lock is registered as "doing work" in such a trace? Meaning, does it make sense for core 8 to look like it's not doing anything while it waits for a spin lock?
 
Got my PC to the tech today. He will hopefully have the part changed before the weekend as he currently has a backlog of things to do. After that hopefully my issues are solved.
 
!running -t on the memory.dmp linked in this reply shows only core 8 as not running the core's idle thread at the time of the bugcheck.
Unfortunately, it looks like the OP has deleted the dump file from their Google Drive now.

If so, the only scenarios I can think of that might cause that are the thread that held the spin lock exited without releasing the spin lock (perhaps crashed)
I should imagine that is the most likely scenario. Some thread has raised processor 8 to DISPATCH_LEVEL, so no other threads can be scheduled on that processor, probably through a spinlock, and then just exited or crashed without ever releasing it. The DPC watchdog would then come along, notice that a processor has been stuck at DISPATCH_LEVEL for over 2 minutes, and issue the bugcheck.
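
The scenario described here can be pictured with a toy model. This is not how the Windows kernel implements spin locks or the DPC watchdog; it is only a sketch, with hypothetical names, that uses the two-minute figure from this thread to show why a lock that is never released ends in a bugcheck rather than a recovery.

```python
DPC_WATCHDOG_LIMIT_S = 120.0  # "over 2 minutes", taken from the post above


def simulate(lock_released_at_s, step_s=1.0, limit_s=DPC_WATCHDOG_LIMIT_S):
    """Toy model: a waiter spins at DISPATCH_LEVEL until the lock is released.

    lock_released_at_s is None when the owner never releases the lock.
    Returns ("acquired", t) or ("bugcheck", t), with t in simulated seconds.
    """
    t = 0.0
    while True:
        if lock_released_at_s is not None and t >= lock_released_at_s:
            return ("acquired", t)
        if t > limit_s:
            return ("bugcheck", t)  # the watchdog notices the stuck processor
        t += step_s


print(simulate(5.0))   # owner releases the lock after 5 simulated seconds
print(simulate(None))  # owner exited or crashed without releasing it
```

In the first case the waiter gets the lock and carries on. In the second nothing can ever change the outcome, so the only exit is the watchdog firing once the limit is passed.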
 
Unfortunately, it looks like the OP has deleted the dump file from their Google Drive now.
It got overwritten by the other memory dump I uploaded to Drive. It seems I forgot to rename the files to prevent that, my bad.

On another note I got my PC back today with the new CPU installed. I'll be testing it after I get off work later tonight.
 
Thank you, @EggShell. Do you happen to know if waiting on a spin lock is considered "work" when it comes to event tracing data? The graphs suggest not but I haven't been able to find anything that says whether it does or not. I think part of the confusion when looking at DPC_WATCHDOG_VIOLATION with Arg1 = 1 crashes for some of us is whether or not data is erroneously not being collected for some reason or the core is actually stuck waiting for something and not doing anything that gets recorded as a DPC or ISR doing actual processing. I can provide a link to the extracted etl trace if that would help.

@HawKitsune - would it be okay for me to link the memory.dmp file and/or extracted etl if EggShell wants to have a look?

Hoping the new processor takes care of it, either way.
 
So the new CPU is installed and a lot of the issues I was having seem to be fixed. A lot of apps and things that caused blue screens for me no longer do so.
That said, I have now gotten this: https://drive.google.com/file/d/1yQ6hjz3VAHlZxlEAiYbPHHP0W4U6MB8I/view?usp=sharing

I got it while trying to install a game on Steam. It happened when Steam was making room on my SSD for the game. I'm going to go and update any drivers that need it, as it's an IRQL_NOT_LESS_OR_EQUAL BSOD.
 
Did an SFC scan and it found some corrupt files.
 