Need help: Recurring BSOD/hang on new-ish system after CPU replacement

hlygrail

Prominent
Apr 19, 2017
8
0
510
This one has me stymied and losing sleep, so I'm putting a request for help out there...

New high-end rig built in August 2016 performed flawlessly and then in January didn't come back after a shutdown to move some external cabling. Mobo went back to Gigabyte, they reported no issues and returned. Short version, CPU was actually dead, new one from Intel came back. Now, after about a week of new CPU being installed, I'm getting random freezes while playing certain games (severe lag for a second or two followed by the audio glitching and then a BSOD), but seems perfectly fine in others -- I played the latest DOOM for 2 hours last night with no issues. Worse, while just idle overnight, system will hang and I have to reset it in the morning. Event logs confirm previous bugcheck/BSOD is the same. I've got minidumps enabled for now, and the bugchecks are always one of these two:

The bugcheck was: 0x00000139 (0x0000000000000003, 0xffff9300bdc22940, 0xffff9300bdc22898, 0x0000000000000000)

The bugcheck was: 0x00000096 (0xffffd789c73f8ce0, 0xfffff800f75d4550, 0xfffff800f75d4280, 0xfffff800f72c0bd0).
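For anyone following along, the two event-log lines above can be split into a stop code and its four parameters with a few lines of Python. This is only a rough sketch: the function name `parse_bugcheck` is made up for illustration, and the name table covers just the two codes seen here, with names as listed in Microsoft's bug check code reference.

```python
import re

# Names as listed in Microsoft's bug check code reference (only the two seen here).
BUGCHECK_NAMES = {
    0x139: "KERNEL_SECURITY_CHECK_FAILURE",
    0x96: "INVALID_WORK_QUEUE_ITEM",
}

def parse_bugcheck(line):
    """Return (code, [param1..param4]) from a 'The bugcheck was:' log line."""
    values = [int(h, 16) for h in re.findall(r"0x[0-9a-fA-F]+", line)]
    return values[0], values[1:]

line = ("The bugcheck was: 0x00000139 (0x0000000000000003, 0xffff9300bdc22940, "
        "0xffff9300bdc22898, 0x0000000000000000)")
code, params = parse_bugcheck(line)
print(hex(code), BUGCHECK_NAMES.get(code, "unknown"), [hex(p) for p in params])
```

For 0x139, Microsoft's documentation describes a first parameter of 3 as a corrupted LIST_ENTRY (for example a double remove), which points at a driver rather than at a specific piece of hardware.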

I have already stripped video drivers away in Safe Mode and reinstalled. Memtest reports no memory issues (I run >24h memtest before deploying any new box, and did so again after the replacement CPU came back). Prime95 reported no issues when it ran for 24h as well. CPU cores are ~86°F at idle (expensive Noctua cooler on here) and even under full load with 3 other Win10 VMs and DOOM running at the same time, I've never seen anything higher than ~124°F. GPU maxes at ~50°C -- nothing I've thrown at this shiny nVidia 1060 can get it higher, and the fans almost never come on. Disk drive is a 512GB NVMe drive (M.2, Samsung) and reports temps in the ~105°F range, which is fine for stick memory.
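Since the readings above are quoted in a mix of °F and °C, a quick conversion makes them easier to compare. A minimal sketch; the labels are just shorthand for the figures in the paragraph above.

```python
def f_to_c(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9

readings_f = {"CPU idle": 86, "CPU full load": 124, "NVMe drive": 105}
for label, temp_f in readings_f.items():
    print(f"{label}: {temp_f}°F is about {f_to_c(temp_f):.1f}°C")
```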

Operating System: Windows 10 Professional Edition build 14393 (64-bit)
CPU Type: Intel Core i7-5930K @ 3.50GHz
Number of CPUs: 1
Cores per CPU: 6
Hyperthreading: Enabled
Motherboard: Gigabyte X99-Phoenix SLI-CF
Memory: 32GB Corsair Vengeance LPX 32GB (4 x 8GB) 288-Pin DDR4 SDRAM DDR4 3000 (CMK32GX4M4B3000C15), currently running at 2133 speed
Videocard: NVIDIA GeForce GTX 1060 6GB (MSI)
Hard Drive: NVMe SAMSUNG MZVLV512 (512GB)
nVidia Drivers: Latest WHQL v381.65
Monitor1: Acer 27" connected via DisplayPort, 2560x1440
Monitor2: PoS off-brand 24" connected via DVI, 1680x1050

Sleep mode is disabled in Win10 -- screen saver kicks in, but that's it, and I usually just turn the monitors off.

Is there someone that can crack open these two latest minidumps and conclusively point to an offending driver or piece of hardware that would cause both of these crashes/hangs? I'm out of ideas, and have already spent half the cost of the motherboard on shipping RMAs and cross-ships around.

https://drive.google.com/open?id=0B5CNt1yPGF8JY2hMZFhOVjRLYTQ (bugcheck 0x00000096)
https://drive.google.com/open?id=0B5CNt1yPGF8JR1A4SzBleGk0LUk (bugcheck 0x00000139)

Any help is appreciated. I need my sleep back!
 
the first bugcheck (0x96) was the system trying to process something that was invalid.
the second (0x139) was the system thinking a driver overran a stack, or maybe attempted to release an object it did not own.
It is a pretty common driver mistake: it releases an object and it works, then it releases the same object a second time by mistake and the system calls a bugcheck. (verifier.exe, see below, should find this type of error)

in both cases the systems were up for hours (6 hours and 9 hours) before the system bugchecked.
I would run the verifier and change the memory dump type to kernel so the problem can be isolated to a driver.


both of the bugchecks would require a kernel memory dump in order to figure out the actual cause. You should change your memory dump type to kernel, reboot, and see if you can get another bugcheck.
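To check which dump type is currently configured, the setting can be read from the registry. The sketch below assumes the standard `CrashDumpEnabled` value under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`; the function name `current_dump_type` is made up for illustration, and on anything other than Windows it simply returns `None`.

```python
import sys

# Meaning of the CrashDumpEnabled registry value.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}

def current_dump_type():
    """Return the CrashDumpEnabled value, or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value

value = current_dump_type()
if value is None:
    print("Not on Windows; nothing to read.")
else:
    print("Dump type:", DUMP_TYPES.get(value, f"unrecognized ({value})"))
```

A value of 2 is the kernel memory dump asked for here. The same setting is available in the GUI under System Properties > Advanced > Startup and Recovery.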

you might also start cmd.exe as an admin and run
verifier.exe /all /standard
and reboot. it will make windows check for common device driver errors and will call a bugcheck when the driver makes the error, rather than later when the system attempts to process bad driver handles.

Note: be sure to know how to get into safe mode in case your system bugchecks during the next boot. this is so you can turn off verifier.exe via
verifier.exe /reset

you have to turn it off after you are done testing or your machine will run slowly until you do.
(it does a lot of error checking on drivers)

what is this driver for:
C:\Program Files\MEMU\MEmuHyperv\MEmuDrv.sys Mon Nov 2 05:11:35 2015


you also have NTIOLib_X64 running, but I do not see where the driver is being loaded from.
(overclocking driver)
machine info:
BIOS Version F4
BIOS Starting Address Segment f000
BIOS Release Date 07/13/2016
Manufacturer Gigabyte Technology Co., Ltd.
Product X99-Phoenix SLI-CF
Processor Version Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Processor Voltage 80h - 0.0V
External Clock 100MHz
Max Speed 8000MHz
Current Speed 3500MHz



 

hlygrail

Prominent
Apr 19, 2017
8
0
510
Super helpful -- THANK YOU for taking the time to poke at those. That gives me some pointers. I will enable the driver verifier and kernel mode dumps and see what the next crash brings.

To your specific questions:

MEmu is an Android emulator and I guess it has its own hypervisor layer. I can uninstall that, but it's been installed since around September, but latent/unused from the application layer since at least October. No major loss if I nuke it to peel off a possible onion layer. I'll uninstall that just on principle for now.

The NTIOLib_x64 is a driver from the MSI Gaming App that comes with the MSI nVidia 1060 6GB video card -- this is what enables some of the fan-control (and LED controls for the physical card itself), and overclocking (which I'm not doing). I can uninstall that, but I think I'll wait for some more incriminating evidence. That has been installed since the beginning as well, but I did update it recently -- this was AFTER the bugcheck/BSODs started, though, so I don't think this is the/the only root cause. It would not surprise me to learn that this stuff from MSI is sub-par on the coding front, but it didn't cause any issues until after the replacement CPU.

I will note that I see this as an Informational entry in the System logs, but these entries exist for every reboot prior to January as well -- they're just much more frequent now due to the crashes/hangs and me hitting the reset button:

A service was installed in the system.

Log Name: System
Source: Service Control Manager
Date: 4/19/2017 2:39:21 PM
Event ID: 7045
Task Category: None
Level: Information
Keywords: Classic
User: SYSTEM
Computer: xxxxxxxxxxxxxx
Description:
A service was installed in the system.

Service Name: NTIOLib_ACTIVE_X
Service File Name: C:\Program Files (x86)\MSI\MSI OC Kit\ActiveX_Service\NTIOLib_X64.sys
Service Type: kernel mode driver
Service Start Type: demand start
Service Account:
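The path in that event answers the earlier question about where NTIOLib_X64 is loaded from. If there are many such entries to go through, the "Key: Value" lines can be turned into a dictionary. A minimal sketch; `parse_fields` is a hypothetical helper name and the text is copied from the event above.

```python
event_text = r"""Service Name: NTIOLib_ACTIVE_X
Service File Name: C:\Program Files (x86)\MSI\MSI OC Kit\ActiveX_Service\NTIOLib_X64.sys
Service Type: kernel mode driver
Service Start Type: demand start"""

def parse_fields(text):
    """Split 'Key: Value' lines on the first colon only, so paths stay intact."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

fields = parse_fields(event_text)
if fields["Service Type"] == "kernel mode driver":
    print(fields["Service Name"], "loads from", fields["Service File Name"])
```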


I'm somewhat certain I'll get another crash, so once I have that I'll post again. I was playing Elder Scrolls Online during the free weekend last weekend... it would make the system crash almost every time after ~30min, but sometimes even less. While it only takes ~10s for this system to boot, I still need a stable system that can stay up 24x7 to keep other things going.
 
some of the overclocking drivers still tweak voltages even when you don't overclock. I would remove it until you figure out the problem.



 

hlygrail

Prominent
Apr 19, 2017
8
0
510
Well that didn't take long. Enabled verifier, crashed immediately on reboot at the screen before the login prompt. Memory_management error, though, which is a different BSOD than I've seen before. It also panicked without any mouse or keyboard input, so I imagine this is a driver being loaded and doing something bad.

Kernel dump is below -- appreciate any insight you can give me. FYI, MEmu was uninstalled prior to this reboot, so should have been out of the picture. No other changes other than enabling the standard verifier options. Had to safe-mode to turn the verifier off to boot (repeated panic 2nd time just to be sure).

https://drive.google.com/open?id=0B5CNt1yPGF8JZWRjd0ptTlY1Rms << kernel dump (~125MB compressed)

Help MUCHLY appreciated... I'm almost to the point of pulling hair out.

 
looks like your wireless driver is also disabled.
---
I would install most of the driver updates; you have some motherboard drivers from 2013. One of them (applecharger.sys) is used to exceed the USB power specification so you can charge Apple devices faster.
--------
I would expect that the motherboard power driver just does not match your updated BIOS. I would remove the driver or look for an update on the motherboard vendor's website. (some motherboard drivers have to match the BIOS version)

something called EasyTuneEngineService.exe
talking to a driver called
C:\Windows\gdrv.sys Wed Jul 3 21:27:55 2013
Gigabyte Easy Saver - mobo power utility driver

machine info:
BIOS Version F4
BIOS Starting Address Segment f000
BIOS Release Date 07/13/2016
Product X99-Phoenix SLI-CF
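One rough way to spot drivers that predate the BIOS is to compare the timestamps the debugger prints against the BIOS release date. A small sketch using only the two driver entries and the BIOS date quoted in this thread; the helper name `older_than_bios` is made up for illustration.

```python
from datetime import datetime

DRIVER_FMT = "%a %b %d %H:%M:%S %Y"   # e.g. "Wed Jul 3 21:27:55 2013"

drivers = {
    r"C:\Windows\gdrv.sys": "Wed Jul 3 21:27:55 2013",
    r"C:\Program Files\MEMU\MEmuHyperv\MEmuDrv.sys": "Mon Nov 2 05:11:35 2015",
}
bios_date = datetime.strptime("07/13/2016", "%m/%d/%Y")

def older_than_bios(drivers, bios_date):
    """Return (path, build time) pairs built before the BIOS, oldest first."""
    stamped = [(path, datetime.strptime(stamp, DRIVER_FMT))
               for path, stamp in drivers.items()]
    return sorted((item for item in stamped if item[1] < bios_date),
                  key=lambda item: item[1])

for path, built in older_than_bios(drivers, bios_date):
    print(f"{path}: built {built:%Y-%m-%d}, {(bios_date - built).days} days before the BIOS")
```

An old build date does not prove a driver is faulty; it only helps decide which ones to check for updates first.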

 

hlygrail

Prominent
Apr 19, 2017
8
0
510
Drivers should all be the latest -- I went to the website on building (didn't use the CD included). Don't see anything newer out there, either (see http://www.gigabyte.us/Motherboard/GA-X99-Phoenix-SLI-rev-10#support-dl). Which driver, specifically, is a 2013 driver? (And just because the mobo mfr posts a driver with a 2016 date doesn't guarantee it's not a repackaged 2013 driver...) Is it possible that Windows Update may have rolled something back?

I removed the Gigabyte EasyTune Service and EasyTune itself. The .sys driver is still there, but could just be that it's locked and gets removed on next reboot. Will kick the pig over and see if it stays stable -- so far only the Verifier-induced panics today, but it doesn't seem to get unhappy unless I'm doing a lot of graphics-card/gaming stuff (and even then, 2 hours of Doom last night was fine).
 

hlygrail

Prominent
Apr 19, 2017
8
0
510
I disabled the Wireless adapter yesterday while sniffing around. System is hard-wired and has two physical GigE ethernet ports. I had accidentally plugged into the "other" one when I moved it back into my office after the CPU swap so it was getting a DHCP addr instead of the static one I had assigned. Panics/BSODs were happening well before that change, and after, so that shouldn't be a factor.

I did re-install the chipset drivers, just because... and rebooted. gdrv.sys is still sitting in C:\Windows dir, but shouldn't be loaded by anything now.

Did you find anything specific in the previous kernel dump? Or am I just waiting for the next crash to see what the next onion layer is?
 
it was specific to the power driver (gdrv.sys) and the service that was calling it (EasyTune).
the debugger said it was corrupting windows kernel memory.
here is the error that verifier trapped:

A driver tried to map a physical memory page that was not locked. This is illegal because the contents or attributes of the page can change at any time. This is a bug in the code that made the mapping call. Parameter 2 is the page frame number of the physical page that the driver attempted to map.

basically, it would corrupt memory over time




 
Solution

hlygrail

Prominent
Apr 19, 2017
8
0
510
Any chance you can give the specific gory details of this corruption? I'd like to report it -- this software gets used by pretty much everyone who buys a Gigabyte motherboard, so it ought to work properly.

That said, there is at least a small, non-zero chance that some kind of file corruption happened between the original CPU dying and later. I haven't seen any more BSODs today, so maybe that was the only monkey in the system. I'll probably re-install it if things are good for a week or two and see if it reoccurs.

Thanks a ton for the help!!!
 
you would just need to tell them how to reproduce the bug. it is just good practice to have any device driver pass the simple windows verifier.exe test.

lots of third-party drivers get lazy and pass user mode windows handles as kernel mode handles.
it works for a while, until the memory manager needs memory and pages the data out to disk. when it gets paged back into physical memory it will be at a new memory location, but the service just uses the old memory handle and corrupts the data that belongs to whichever driver windows gave that memory to.
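The failure mode described above can be illustrated with a toy model. This is not how the Windows memory manager works internally; the class `ToyMemory` and the owner names are invented purely to show how a cached, stale location ends up overwriting someone else's data.

```python
class ToyMemory:
    """Toy pool of pages; a stand-in for illustration, not the real memory manager."""

    def __init__(self, n_pages):
        self.pages = [None] * n_pages   # page contents
        self.owner = [None] * n_pages   # current owner of each page

    def allocate(self, owner, data):
        idx = self.owner.index(None)    # first free page
        self.owner[idx] = owner
        self.pages[idx] = data
        return idx

    def release(self, idx):
        self.owner[idx] = None
        self.pages[idx] = None

mem = ToyMemory(4)
addr = mem.allocate("tuning_service", "service buffer")
stale_addr = addr                        # the service caches the raw location
mem.release(addr)                        # stand-in for the page being reclaimed
victim = mem.allocate("other_driver", "driver state")   # page is handed out again
mem.pages[stale_addr] = "voltage table"  # service writes through the stale location
print(mem.owner[victim], "now sees:", mem.pages[victim])
```

The write succeeds without any error at the time it happens; the damage only shows up later, when the second owner reads its data. That matches the pattern of crashes appearing hours after boot.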




 

hlygrail

Prominent
Apr 19, 2017
8
0
510
Fair enough. I was thinking, though, that I had no issues previously when this system was built. But when you called out that there were some "old" drivers around, I took a closer look. Device Manager showed a proper driver installed for everything, but I discovered that the directory that was supposed to be there with all of the Intel management stuff (e.g. C:\Program Files (x86)\Intel\Intel(R) Management Engine Components) was GONE. So I reinstalled the chipset drivers.

Theory here... but if those chipset drivers went missing, then it would certainly be possible that EasyTune was trying to call something that wasn't there (or not loaded/running in memory), which caused the brokenness opportunity to exist. Definitely still a bug w/ EasyTune that I'll call out to them, which rendered an otherwise perfectly stable system into a completely unreliable system.

I've had no issues after 2 days, so will let it run for another week and then reinstall EasyTune now that the underlying stuff is back in place and see if that causes any issues.

In any case, thank you TONS for helping me ID the problem. I knew it couldn't be hardware, because everything's been tested or replaced that needed to be, and all the remaining tests were good. If you send me a PayPal address, I'll send you $10 for your time -- my sanity and sleep preservation is worth far more than that!
 
most likely easy tune called something that was there when the system first booted, then the memory manager moved it to the page file because it was not being used. Later the memory manager gave that memory to another driver, and easy tune then used the old memory address to change the data, causing the crash in the second driver.