
Question WIN7 won't boot without Hyper-threading Enabled

tfontaine1

I have a custom computer that I'm having an issue with (by custom I mean fully custom, designed from an Intel reference board). I have several of these boards, and most of them work perfectly fine. However, a couple of them freeze when booting into Windows, and I'm looking for a hand narrowing down what the issue might be.

Originally I noticed that both systems would freeze while loading WIN10, then reboot and fail again with the BSOD error DPC Watchdog Violation. So I put WIN7 on both systems and they booted fine, which I thought was a little weird. I ran a stress test on the systems and both failed the CPU and memory tests but still functioned. I figured the issue could be thermal, so I cleaned up and redid all the thermals and got the same failure. I then went into the BIOS and started throttling things back, one item at a time. When I disabled hyper-threading, both systems froze booting into WIN7, just like they did when trying to boot WIN10. This pretty much threw out my thermal theory, since enabling hyper-threading lets the systems boot and disabling it causes them to freeze.

What am I missing that could cause this? What's different between WIN10 and WIN7 that would allow one to boot and not the other, yet make WIN7 fail as well once hyper-threading is disabled? I'm a hardware guy, so I don't really know OSes in much depth.

  • I checked all the power rails on the motherboard with my o-scope and they are all running fine, with no drops or spikes.
  • The dump files from Windows aren't giving me any good info (on the WIN7 failure it says there could be an issue with tixhci.sys, but that driver is up to date).
  • I swapped the processor and got the same issue.
  • I swapped the RAM with no change.
  • I disconnected all USB devices to see if maybe the USB lines were going down, and nothing changed.

What should my next steps be?
 
Here is the MEMORY.DMP analysis from the WIN7 boot failure, if that helps at all.

I checked tixhci.sys and it's up to date and everything, so I'm not really sure what to do with this info.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Microsoft (R) Windows Debugger Version 10.0.17763.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Users\tfontaine\Desktop\MEMORY.DMP]
Kernel Summary Dump File: Kernel address space is available, User address space may not be available.

Symbol search path is: srv*
Executable search path is:
Windows 7 Kernel Version 7601 (Service Pack 1) MP (4 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7601.24150.amd64fre.win7sp1_ldr_escrow.180528-1700
Machine Name:
Kernel base = 0xfffff80003a15000 PsLoadedModuleList = 0xfffff80003c54c90
Debug session time: Wed Jun 27 13:23:33.840 2018 (UTC - 5:00)
System Uptime: 0 days 0:00:03.229
Loading Kernel Symbols
...............................................................
......................................................
Loading User Symbols

Loading unloaded module list
.
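*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************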

Use !analyze -v to get detailed debugging information.

BugCheck 50, {fffff88006d3affc, 0, fffff880050b2814, 0}

*** ERROR: Module load completed but symbols could not be loaded for tixhci.sys
Probably caused by : tixhci.sys ( tixhci+3d814 )

Followup: MachineOwner
---------

2: kd> !analyze -v
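*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************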

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: fffff88006d3affc, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff880050b2814, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000000, (reserved)

Debugging Details:
------------------


KEY_VALUES_STRING: 1


STACKHASH_ANALYSIS: 1

TIMELINE_ANALYSIS: 1


DUMP_CLASS: 1

DUMP_QUALIFIER: 401

BUILD_VERSION_STRING: 7601.24150.amd64fre.win7sp1_ldr_escrow.180528-1700

SYSTEM_MANUFACTURER: Patrol PC

SYSTEM_PRODUCT_NAME: Rhino

SYSTEM_SKU: RHM1

SYSTEM_VERSION: 1

BIOS_VENDOR: Patrol PC

BIOS_VERSION: 4.2

BIOS_DATE: 11/15/2017

BASEBOARD_MANUFACTURER: Patrol PC

BASEBOARD_PRODUCT: Rhino

BASEBOARD_VERSION: 1

DUMP_TYPE: 1

BUGCHECK_P1: fffff88006d3affc

BUGCHECK_P2: 0

BUGCHECK_P3: fffff880050b2814

BUGCHECK_P4: 0

READ_ADDRESS: fffff88006d3affc

FAULTING_IP:
tixhci+3d814
fffff880050b2814 418b0c00 mov ecx,dword ptr [r8+rax]

MM_INTERNAL_CODE: 0

CPU_COUNT: 4

CPU_MHZ: 8f6

CPU_VENDOR: GenuineIntel

CPU_FAMILY: 6

CPU_MODEL: 3d

CPU_STEPPING: 4

CPU_MICROCODE: 6,3d,4,0 (F,M,S,R) SIG: 22'00000000 (cache) 22'00000000 (init)

DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT

BUGCHECK_STR: 0x50

PROCESS_NAME: System

CURRENT_IRQL: 0

ANALYSIS_SESSION_HOST: TYLER-PC

ANALYSIS_SESSION_TIME: 02-20-2019 12:56:55.0213

ANALYSIS_VERSION: 10.0.17763.1 amd64fre

TRAP_FRAME: fffff88006d136e0 -- (.trap 0xfffff88006d136e0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff88006cfb000 rbx=0000000000000000 rcx=0000000000000010
rdx=fffffa80052768d0 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880050b2814 rsp=fffff88006d13878 rbp=fffffa8005273510
r8=000000000003fffc r9=00000000000000ff r10=000000000000ffff
r11=0000000000000001 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl nz na po nc
tixhci+0x3d814:
fffff880050b2814 418b0c00 mov ecx,dword ptr [r8+rax] ds:fffff88006d3affc=????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80003b8d080 to fffff80003ab98a0

STACK_TEXT:
fffff88006d13588 fffff80003b8d080 : 0000000000000050 fffff88006d3affc 0000000000000000 fffff88006d136e0 : nt!KeBugCheckEx
fffff88006d13590 fffff80003ac5a96 : 0000000000000000 fffff88006d3affc 0000000000000000 0000000000000000 : nt!MmAccessFault+0x2330
fffff88006d136e0 fffff880050b2814 : fffff8800508565a 0000000000000000 fffffa80052736e8 fffffa800524df60 : nt!KiPageFault+0x356
fffff88006d13878 fffff8800508565a : 0000000000000000 fffffa80052736e8 fffffa800524df60 fffffa80052736a0 : tixhci+0x3d814
fffff88006d13880 fffff88005096c4c : fffff880050d01b0 fffffa80052736e8 8000000000000003 8000003f00000000 : tixhci+0x1065a
fffff88006d139f0 fffff880050c4301 : 0000000000000000 fffff880050d0218 0000000000000000 fffffa80052736e8 : tixhci+0x21c4c
fffff88006d13b00 fffff880050c405e : 0000000000000001 fffff8800000a003 0003e8ff00000008 fffe4e000001fa00 : tixhci+0x4f301
fffff88006d13bb0 fffff80003d5dbe0 : fffff88002f78180 0000000000000001 0000000000000000 fffffa80052756f0 : tixhci+0x4f05e
fffff88006d13c00 fffff80003abf8c6 : fffff88002f78180 fffffa80052756f0 fffff88002f87140 0000000000000000 : nt!PspSystemThreadStartup+0x194
fffff88006d13c40 0000000000000000 : fffff88006d14000 fffff88006d0e000 fffff88006d13c00 0000000000000000 : nt!KxStartSystemThread+0x16

THREAD_SHA1_HASH_MOD_FUNC: 48d15fea1e32b3c26a029d383a96270fdc096c17

THREAD_SHA1_HASH_MOD_FUNC_OFFSET: c31d3b424383db163e687b59317a1646a0bc8c5b

THREAD_SHA1_HASH_MOD: 6e501dd4ddb2deb5337f06ecb72cc4e3e6494f6e

FOLLOWUP_IP:
tixhci+3d814
fffff880050b2814 418b0c00 mov ecx,dword ptr [r8+rax]

FAULT_INSTR_CODE: c8b41

SYMBOL_STACK_INDEX: 3

SYMBOL_NAME: tixhci+3d814

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: tixhci

IMAGE_NAME: tixhci.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 570fc25b

STACK_COMMAND: .thread ; .cxr ; kb

FAILURE_BUCKET_ID: X64_0x50_tixhci+3d814

BUCKET_ID: X64_0x50_tixhci+3d814

PRIMARY_PROBLEM_CLASS: X64_0x50_tixhci+3d814

TARGET_TIME: 2018-06-27T18:23:33.000Z

OSBUILD: 7601

OSSERVICEPACK: 1000

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK: 272

PRODUCT_TYPE: 1

OSPLATFORM_TYPE: x64

OSNAME: Windows 7

OSEDITION: Windows 7 WinNt (Service Pack 1) TerminalServer SingleUserTS

OS_LOCALE:

USER_LCID: 0

OSBUILD_TIMESTAMP: 2018-05-28 21:56:37

BUILDDATESTAMP_STR: 180528-1700

BUILDLAB_STR: win7sp1_ldr_escrow

BUILDOSVER_STR: 6.1.7601.24150.amd64fre.win7sp1_ldr_escrow.180528-1700

ANALYSIS_SESSION_ELAPSED_TIME: 4fd

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:x64_0x50_tixhci+3d814

FAILURE_ID_HASH: {cae28625-72be-17b1-ac61-70a05cbea340}

Followup: MachineOwner
---------
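
For anyone reading the dump above, the key numbers can be cross-checked with a few lines of Python. This is only a rough sketch: every value is copied from the WinDbg output, the variable names are purely illustrative, and treating DEBUG_FLR_IMAGE_TIMESTAMP as seconds since the Unix epoch (the usual convention for a driver image's link time) is an assumption.

```python
from datetime import datetime, timezone

# Values copied from the trap frame and analysis fields in the dump.
rax = 0xFFFFF88006CFB000            # base register at the time of the fault
r8 = 0x000000000003FFFC             # offset register at the time of the fault
faulting_ip = 0xFFFFF880050B2814    # Arg3 / FAULTING_IP
module_offset = 0x3D814             # from "tixhci+3d814"
read_address = 0xFFFFF88006D3AFFC   # Arg1 / READ_ADDRESS
image_timestamp = 0x570FC25B        # DEBUG_FLR_IMAGE_TIMESTAMP (assumed Unix time)
cpu_mhz = 0x8F6                     # CPU_MHZ is printed in hexadecimal

# The faulting instruction is "mov ecx, dword ptr [r8+rax]",
# so the address it reads is simply r8 + rax (kept to 64 bits).
effective_address = (r8 + rax) & 0xFFFFFFFFFFFFFFFF

# Subtracting the module offset from the faulting IP gives the driver's load base.
tixhci_base = faulting_ip - module_offset

# Assumption: the image timestamp counts seconds since 1970-01-01 UTC.
driver_built = datetime.fromtimestamp(image_timestamp, tz=timezone.utc)

print(f"effective address : {effective_address:#x}")
print(f"matches Arg1      : {effective_address == read_address}")
print(f"tixhci load base  : {tixhci_base:#x}")
print(f"driver link time  : {driver_built:%Y-%m-%d %H:%M:%S} UTC")
print(f"CPU clock         : {cpu_mhz} MHz")
```

If those assumptions hold, the bad read lands exactly at rax plus the offset in r8 (0x3fffc, four bytes short of 0x40000), and the tixhci.sys binary in this dump appears to have been built in April 2016, which may be worth comparing against the newest driver the vendor offers.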
 
If your CPU is "dirty" (check for dust in the fan) or not mounted correctly (missing thermal paste), it will run too hot and you'll start experiencing software failures like these. Open your PC and use a brush and a vacuum cleaner to clean it. Also clean the fan blades with the brush (you may need some cleaning solvent, such as alcohol, on the brush to loosen the accumulated dust).

Note that hyper-threading can make your CPU heat up faster, but generally it should not cause such hangs.

You may need to disable some hardware tuning in your BIOS: if you have different types of DRAM modules, the "extreme" modes may not work. Make sure that DRAM modules of the same type are paired on the same bank, because different modules have different timing requirements and the BIOS can only apply one set of timings to all modules on a bank.
You should also use the default BIOS settings for RAM. Any other option is "overclocking" and may be unreliable: the parameters are hard to test for a stable configuration, and they demand more care with cooling, mounting, and power cabling.

When you boot into the BIOS, you should find a health monitor that displays temperatures. Make sure they do not rise too fast: that is a clear sign of dust in the fans, badly applied thermal paste, or a CPU fan that is no longer properly locked in place. If the CPU has been remounted too often, the thermal paste may no longer be effective and you may need to clean it off and replace it.

Finally, test your RAM modules. Windows comes with a boot-time memory test that you can try (running it for about 10 minutes is enough, even though some stress tests take hours), because the basic tests the BIOS performs at boot are very superficial. This test can also check the level 3 cache on your motherboard, since it can run with the cache enabled or disabled. If it succeeds without the cache but fails with the cache enabled, the cache is defective, which means replacing the cache chips or sometimes the whole motherboard. If the defect is in the L1 or L2 cache, then the fault is in the CPU itself, or the CPU was damaged by an earlier extreme overclock or a cooling failure.

Also use "GSmartControl" (a front end to smartmontools) to check your disks, notably your SSD, and above all the SSD that holds your memory paging files. The SMART data may already report wear: too many remapped sectors, many CRC errors, or too many bytes written. If so, it may be high time to replace the SSD before it fails completely. Note that small SSDs of 64GB or less will fail sooner under Windows than 128GB or 256GB models. Never run your SSD near full capacity: either reserve an empty partition of about 15% of the available size, or make sure the SSD is never more than 80% full, so that physical sectors are recycled and rewritten much less often. To minimize paging to disk, consider upgrading your RAM: 4GB is no longer enough, and 8GB will be much more comfortable and will protect your SSD from early aging and uncorrected faults.

If you use a 64GB SSD in a PC you use every day, it will typically reach its wear limit in about 3 years, and SMART tools will start reporting remapped sectors. If there are too many remapped sectors (an SSD remaps them in groups of about 4096 sectors) and no spare physical sectors remain, the SSD becomes very unstable and you may lose all your data at any time with no possible recovery. This happens more abruptly than on hard disks, which also remap failing sectors but give early warning signs while most of the surface remains intact and usable.

Also use disk cleaners: a typical unmanaged Windows installation accumulates about 15 to 20 GB of backups and old versions (sometimes much more). Once your OS is stable after an installation or update, run Dism++ to remove old installation logs, backups, and previous OS versions. Your internet cache may also be much larger than necessary. Freeing 15-20GB will significantly improve the performance of later writes and will reduce how often trimming is needed (remember that trimming freed sectors is much slower than writing to them).

A good installation should use no more than 2/3 of the SSD's total capacity and should keep about 50GB free. Windows, including its paging files, takes up about 16GB of storage but writes about 100 GB between boots, and more if you download frequently or watch a lot of web video, which is partially cached on disk. These caches constantly recycle the sectors of your free space.
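
To make the rule of thumb above easy to check, here is a minimal Python sketch. It assumes the 2/3 and 50GB figures from this post as defaults; the function name `check_fill` and its parameters are made up for illustration, and the thresholds are this thread's advice rather than any official limit.

```python
import os
import shutil

GB = 10**9  # decimal gigabytes, as drive vendors count them


def check_fill(total, used, free, max_used_fraction=2 / 3, min_free_gb=50):
    """Return a list of warnings for a volume, given its sizes in bytes."""
    warnings = []
    used_fraction = used / total
    if used_fraction > max_used_fraction:
        warnings.append(
            f"{used_fraction:.0%} used, above the {max_used_fraction:.0%} target"
        )
    if free < min_free_gb * GB:
        warnings.append(
            f"only {free / GB:.1f} GB free, below the {min_free_gb} GB target"
        )
    return warnings


if __name__ == "__main__":
    # Root of the current drive (e.g. "C:\\" on Windows, "/" elsewhere).
    root = os.path.abspath(os.sep)
    usage = shutil.disk_usage(root)
    problems = check_fill(usage.total, usage.used, usage.free)
    print(root, "looks fine" if not problems else "; ".join(problems))
```

On a 64GB drive the 50GB target can never be met, so for small SSDs it probably makes more sense to pass a lower `min_free_gb` and rely on the fraction check.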

Also consider keeping your user files off the disk you use for the OS, or make regular backups of your documents, audio, videos, and profile data. You can keep only current working documents on the SSD, but take the time to save and organize them for the longer term on another disk (not a separate partition of the same SSD!) that sees much less use. If your main disk fails, it will then be simple to replace without worrying about what was on it, since everything can be reinstalled cleanly from downloads.

There's no point in keeping your office documents, PDFs, worksheets, development source files, or email archives on your SSD: placing them on a hard disk has an insignificant performance impact. Likewise, most games use a lot of storage and can be installed on a hard disk. Consider placing your "Steam" library outside the main SSD: the performance impact is small and only felt when loading a game, and a hard disk is still much faster than a DVD. You don't need SSD performance for music, photos, and video files, which are read sequentially at low speed. You may need it for video editing, but there you are most often limited by the encoding speed of your CPU or GPU, and there's no significant difference between writing the results to a hard disk or an SSD.

If your SSD is small, consider placing the Internet cache for videos on a secondary drive. If you add more DRAM, you may want to place your paging file on a secondary hard drive (it will be cached in RAM, so paging will not be dramatically slow). Likewise, consider placing the hibernation file on a hard disk: hibernating to an SSD is much faster, but you generally don't need the best performance when waking from hibernation. In any case, hibernation is always faster than a complete system reboot, and Windows 10 now requires full reboots much less often. If you don't want to hibernate, consider using the power-saving options so the PC is still nearly instantly on when you need it. Make sure the "fast boot" options are enabled, as they also minimize the number of disk writes at each reboot.
 
Note also that Windows 10 and Windows 7 boot differently: for secure boot in Windows 10 you need GPT/EFI partitioning, but with Windows 7 you have to enable the CSM compatibility mode: EFI alone will not boot Windows 7.

But if you enable the CSM to boot Windows 7, TPM security in Windows 10 will be temporarily disabled. The next time you boot Windows 10 with the CSM enabled, the TPM will enter "security recovery" mode, as it will each time you change the boot order in the BIOS (including when booting first from a removable drive or DVD).

If you have disabled CSM in your EFI boot configuration, Windows 7 will no longer boot.

Both Windows 7 and Windows 10 can boot in legacy BIOS/PCAT boot mode, but that mode cannot boot from a GPT-partitioned disk. It requires the CSM to be enabled in the BIOS, and it sets some limits on the placement of the "active" boot partition. For EFI boot, the boot partition must be FAT32-formatted. This EFI system partition is small, typically 100 or 350 MB, and is not mounted as the "C:" drive. It contains just the EFI boot loader, the small BCD store, any BitLocker boot-time support (which cannot be loaded from the encrypted "C:" drive), and support for booting from the network or from a virtual drive in a VHDX file, which can itself contain BitLocker-encrypted partitions.

Windows 10 also adds an "MSR" partition (which should be at least 16MB) that should come before the main "C:" partition. This partition contains no files: it is a special raw partition used temporarily to perform system maintenance safely (including repartitioning). Its content is always temporary, but the partition should remain present, and it can also be used during system upgrades, for example. (It allows safe rollback in case of installation problems and keeps transactional records of data that is pending some change; it also lets BitLocker safely encrypt an initially unencrypted volume while tracking its progress.) Once you have successfully rebooted your system, the MSR content no longer has any meaning.

Windows 10 also has an optional "Recovery" partition (typically about 500MB) containing the WinPE environment: a minimal Windows kernel image with the tools needed for offline maintenance of the system, also used at boot time as a source when installing Windows onto the C: drive or when rolling the system back. This partition is required if you enable BitLocker on your C: drive, because WinPE cannot boot from an encrypted drive. The recovery environment lives either on the C: drive (if your disk is not encrypted and no recovery partition was created) or in the recovery partition (formatted as NTFS but not encrypted). You can see which by running "ReAgentc /info", which also checks that it can find the proper partition and the WinPE image, as well as the entry created in the BCD (stored in the FAT32 EFI system partition; you can list it with "BCDEDIT /ENUM ALL"). Don't remove the recovery partition if it already contains the WinPE environment: once "ReAgentc /enable" has placed the environment there, it is deleted from your C: drive. If you want to remove the Recovery partition in order to relocate it, first use "Reagentc /disable" to move its content back to C:. If you lose the content of the recovery environment (its own EFI boot loader, the WinPE image, and the reagentc info XML file), you'll need other bootable media (DVD or USB) that contains a copy. You'll also need a backup of your BitLocker encryption key on a USB drive to regain access to the content of your encrypted drives.
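
If you want to capture the output of the two read-only commands mentioned above in one go, something like the following Python sketch could work. The names `QUERIES` and `run_query` are hypothetical helpers invented here; the sketch assumes it is run from an elevated prompt on Windows, and it does nothing on other systems.

```python
import platform
import subprocess

# Read-only queries named in the post; both normally need an elevated prompt.
QUERIES = {
    "recovery environment": ["reagentc", "/info"],
    "boot configuration": ["bcdedit", "/enum", "all"],
}


def run_query(command):
    """Run one command and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout + result.stderr


def main():
    if platform.system() != "Windows":
        print("These tools only exist on Windows; nothing to do.")
        return
    for label, command in QUERIES.items():
        code, output = run_query(command)
        print(f"== {label}: {' '.join(command)} (exit code {code})")
        print(output)


if __name__ == "__main__":
    main()
```

Nothing here changes the configuration: "/disable" and "/enable" are deliberately left out so the script is safe to run before deciding whether to touch the Recovery partition.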

On small drives or SSDs, the Recovery partition is generally at the end of the disk, but on large disks (several terabytes) or on hardware RAID arrays, it is usually located just before the "C:" drive, after the EFI partition and the MSR partition.
 
Anyway, if you say that the CPU and memory stress tests both fail, and swapping the CPU and memory does not change anything, this is likely a problem near the north bridge: memory controllers, L3 cache on the motherboard.
You said this is a "custom board" derived from an Intel reference. Most likely you have forgotten some drivers or BIOS settings needed to configure the memory correctly. It's very likely that your memory is not correctly regulated:
  • incorrect voltage settings
  • incorrect clock computation (possibly incorrect interpretation of the DIMM info, incorrect programming of the frequency regulators)
  • incorrect energy management settings
It is also likely that your experimental BIOS does not have the tables needed to support your modules, and that the data tables built into the BIOS were not designed for your kind of memory device (CL states, and so on).

Possibly also
  • missing hardware tuning on the motherboard itself (no routing defined for some signals, missing pull-ups or pull-downs on configurable pins), missing firmware in programmable chips.
  • missing drivers for the south bridge (did you ever change the vendor/model ID and forget to include the new IDs in your tuned BIOS? Maybe you have the drivers but they don't qualify because they don't match a known ID).
  • missing capacitors on the motherboard to stabilize the power, possibly also missing routing for alternate power or ground paths (overcurrents on some pins or paths)

Custom boards are deliberately not completely routed, because the routing depends on which add-ons you integrate. Power management and clock derivation and stabilization are among the most important tasks you have to do on customizable motherboards. I suppose you got it from a development kit and have not completely read the specifications, requirements, or test procedures to make sure they fit your goals. This requires a lot of physical measurement tools across the board, and test procedures that look at various conditions (idle or busy devices, power-on sequences, carefully timed distribution of power-on). You should not just take local numeric measurements, but also look at waveforms (notably for clocks) and check for undesirable "spikes", which can cause instability. If you find any, you'll need more capacitors (customizable motherboards have various areas where you can add them depending on your needs, and you can choose components with different capabilities depending on the frequency or the kind of cooling you want to implement to control temperature). Likewise, most of these boards include places where you can add temperature sensors, but these are not mounted by default. It's up to you to integrate them and make them monitorable by software/firmware, so that they can regulate the system and conditionally insert wait states or slow down the clocks adaptively. In some cases you'll need to integrate a plugin for these sensors and regulators into your custom BIOS...

For now, all you can do is start from the default basic BIOS settings: discard the proposed "Extreme memory profiles" and don't use them until you've made sure that every circuit is properly regulated in all conditions and well inside the documented tolerances. Tune these parameters only very progressively. This cannot be done instantly. Also check the limited list of configurations for which your custom board was initially designed and tested: maybe that was only done for specific brands of memory chips and specific frequencies/clock configurations, and your RAM modules are not even on this initial list. Motherboard manufacturers frequently ship boards with short lists of compatible devices, then slowly extend their tests and add tunings to their flashable BIOS/firmware to follow what's available on the market. So on their support pages they provide successive patches for about 2 or 3 years, until the board goes off the market and production stops, or until a series has too many failures and they want to sell something better that causes them fewer problems. (Problems are rapidly reported on the Internet, and people can easily and publicly blacklist bad designs that were incorrectly tuned and are unstable.)

Your first goal should always be to improve stability. Overclockers will want access to other parameters because they continue the work themselves on their specific configurations, often using exotic cooling and power options that are hard to find and too costly to produce and sell at reasonable prices. They are a niche market in which manufacturers will not accept responsibility for the resulting faults if users burn out their hardware prematurely.

You should not work on such a design unless you are well equipped with a lot of instruments (including thermal cameras to detect hotspots that must be fixed) and are ready to customize the thermal conditions as well, with good-quality materials and high-precision tools. You'll also need some robotics, some fabrication for specific parts, and a selection of attachments, connectors, insulators, metal tapes, glues, fans, capacitors, and physical tools... This requires patience, and sometimes your experiments will kill the device for good and you'll need replacements. This has a cost, because you'll need to learn from your unavoidable "errors".

Given that you're working on an experimental custom product, I don't think anyone can help you unless they work directly with you on developing it: it's simply not possible to reproduce your setup.