Hey everyone,
Thanks for your replies. I have completed several Mem86 tests on both sticks of RAM (as well as each one individually). The test with both sticks resulted in one error, but the test with each stick individually revealed 0 errors. So that's been great.
@johnbl, new crashes have produced the following code:
FILE_IN_CAB: 031922-5125-01.dmp
BUGCHECK_CODE: 124
BUGCHECK_P1: 0
BUGCHECK_P2: ffffca02a2503028
BUGCHECK_P3: bc000800
BUGCHECK_P4: 1010135
CUSTOMER_CRASH_COUNT: 1
PROCESS_NAME: System
STACK_TEXT:
ffffa181`d1f2c948 fffff807`41b0178b : 00000000`00000124 00000000`00000000 ffffca02`a2503028 00000000`bc000800 : nt!KeBugCheckEx
ffffa181`d1f2c950 fffff807`448710c0 : 00000000`00000000 ffffa181`d1f2ca29 ffffca02`a2503028 ffffca02`a2503028 : nt!HalBugCheckSystem+0xeb
ffffa181`d1f2c990 fffff807`41c3faa3 : 00000000`00000000 ffffa181`d1f2ca29 ffffca02`a2503028 ffffca02`a12faa00 : PSHED!PshedBugCheckSystem+0x10
ffffa181`d1f2c9c0 fffff807`41b0317d : ffffca02`a2dcbb00 ffffca02`a2dcbb00 ffffca02`a12faa50 178bfbff`7ef8320b : nt!WheaReportHwError+0x393
ffffa181`d1f2ca90 fffff807`41b035c8 : 00000000`00000008 ffffca02`00000003 ffffc388`b1177000 00000000`00000008 : nt!HalpMcaReportError+0xb1
ffffa181`d1f2cbf0 fffff807`41b0345c : ffffca02`9fd47600 00000000`00000000 ffffa181`d1f2ce00 00000000`00000000 : nt!HalpMceHandlerCore+0x138
ffffa181`d1f2cc50 fffff807`41b028fb : ffffca02`9fd47600 ffffa181`d1f2cef0 00000000`00000000 00000000`00000000 : nt!HalpMceHandler+0xe0
ffffa181`d1f2cc90 fffff807`41b0527b : ffffca02`9fd47600 00000000`00000000 00000000`00000000 00000000`00000000 : nt!HalpHandleMachineCheck+0x97
ffffa181`d1f2ccc0 fffff807`41b65869 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!HalHandleMcheck+0x3b
ffffa181`d1f2ccf0 fffff807`41a260fe : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiHandleMcheck+0x9
ffffa181`d1f2cd20 fffff807`41a25d28 : 00000000`00000010 00000000`00000000 fffff807`459ef000 00000000`00000000 : nt!KxMcheckAbort+0x7e
ffffa181`d1f2ce60 fffff807`4195eed6 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiMcheckAbort+0x2a8
ffffc388`b117d420 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxFlushSingleTb+0xca
MODULE_NAME: AuthenticAMD
IMAGE_NAME: AuthenticAMD.sys
STACK_COMMAND: .cxr; .ecxr ; kb
FAILURE_BUCKET_ID: 0x124_0_AuthenticAMD_MEMORY__UNKNOWN_FATAL_IMAGE_AuthenticAMD.sys
So I think it might be CPU/motherboard related? CPU temperatures are absolutely fine, so I'm not sure what is causing this issue.
You would need to run some commands in the debugger:
!errrec ffffca02a2503028
This will report why the CPU called the bugcheck.
!sysinfo cpuinfo
Here you would look at the speed the CPU was running at.
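If you can't get to the debugger right away, here is a rough Python sketch of what can be read straight from the bugcheck parameters posted above. It assumes that, for bugcheck 0x124 with P1 = 0, P3 and P4 are the high and low 32 bits of the machine-check status register, and that the flag bits sit at the usual x86 machine-check positions. The names combine_status, decode_flags and error_code are made up for this example. Treat it as a sanity check only; !errrec is the real source.

```python
# Sketch: rebuild the 64-bit machine-check status from the bugcheck
# parameters in the dump and pull out the top-level flag bits.
# ASSUMPTION: P3/P4 are the high/low halves of the status register and
# the flag positions follow the common x86 machine-check layout.

BUGCHECK_P3 = 0xBC000800  # high 32 bits (from the dump)
BUGCHECK_P4 = 0x01010135  # low 32 bits (from the dump)

# Flag name -> bit position in the 64-bit status word (assumed layout).
STATUS_FLAGS = {
    "VAL": 63,    # the record is valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # the error was not corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # extra-information register is valid
    "ADDRV": 58,  # address register is valid
    "PCC": 57,    # processor context may be corrupt
}


def combine_status(high: int, low: int) -> int:
    """Join the two 32-bit halves into one 64-bit value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def decode_flags(status: int) -> dict:
    """Return each assumed flag as True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}


def error_code(status: int) -> int:
    """Low 16 bits: the architectural error code field."""
    return status & 0xFFFF


status = combine_status(BUGCHECK_P3, BUGCHECK_P4)
print(f"status     = {status:#018x}")
print(f"flags      = {decode_flags(status)}")
print(f"error code = {error_code(status):#06x}")
```

Under those assumptions the UC (uncorrected) bit is set, which at least fits the UNKNOWN_FATAL part of the failure bucket.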
For internal CPU errors: the CPU is very sensitive to certain voltages applied to its pins. CPUs have primary and secondary cache memory banks, and they can run at various clock rates. The voltage on the pins tells the connection between the cache banks how fast to run. If the voltage is incorrect for the current CPU frequency, the data in the transfer is locked in and sampled at the wrong time, before the electronics have stabilized (a violation of the electronics' setup and hold time requirements).
Basically, a binary 0 corresponds to the bottom 1/3 of the voltage range,
the middle 1/3 of the voltage range is undefined (often the bit gets locked in at whatever the last setting was),
and the top 1/3 of the voltage range is defined as a binary 1.
The CPU voltage on certain pins sets the time when a snapshot of the values is made. If this voltage is wrong, the snapshot is made at the wrong time, while the signal is still in the undefined range, and you can get either a 1 or a 0.
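To make the thirds rule concrete, here is a toy Python model. It is only an illustration: the 1.2 V range and the function name classify_level are made up, and real receivers have vendor-specific thresholds rather than exact thirds.

```python
def classify_level(voltage: float, v_max: float, last_bit: int) -> int:
    """Return the bit a receiver would lock in under the thirds rule.

    ASSUMPTION: thresholds sit at exactly 1/3 and 2/3 of v_max, and a
    sample in the undefined middle band keeps the previous bit.
    """
    if not 0.0 <= voltage <= v_max:
        raise ValueError("voltage outside the signalling range")
    if voltage < v_max / 3:
        return 0          # bottom third: binary 0
    if voltage > 2 * v_max / 3:
        return 1          # top third: binary 1
    return last_bit       # middle third: undefined, hold the last value


V_MAX = 1.2  # hypothetical signalling range in volts

print(classify_level(0.1, V_MAX, last_bit=1))  # settled low
print(classify_level(1.1, V_MAX, last_bit=0))  # settled high
# Sampled too early, mid-transition: the result depends on history,
# not on the value that was actually being sent.
print(classify_level(0.6, V_MAX, last_bit=0))
print(classify_level(0.6, V_MAX, last_bit=1))
```

The last two calls sample the same voltage and give different bits, which is the snapshot-at-the-wrong-time problem.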
The CPU does a checksum on the values. If it is correct, or if you get two errors that cancel each other out, the CPU continues on. If it detects an error, it calls a bugcheck because it cannot trust the data inside the CPU cache.
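Here is a small sketch of why two errors can cancel each other out. It uses a single even-parity bit as a stand-in for the checksum. That is an assumption made to keep the example short; real caches use stronger parity/ECC schemes.

```python
def parity(bits: int) -> int:
    """Even-parity bit: 1 if the number of set bits is odd, else 0."""
    return bin(bits).count("1") % 2


def looks_valid(stored_bits: int, stored_parity: int) -> bool:
    """True when the data still agrees with its stored parity bit."""
    return parity(stored_bits) == stored_parity


original = 0b10110010            # example data byte (made up)
stored_parity = parity(original)

one_flip = original ^ 0b00000100   # one bit sampled wrong
two_flips = original ^ 0b00010100  # two bits sampled wrong

print(looks_valid(original, stored_parity))   # clean data passes
print(looks_valid(one_flip, stored_parity))   # single error is caught
print(looks_valid(two_flips, stored_parity))  # double error slips through
```

The double flip passes the check even though the data is wrong, so the CPU would carry on with bad data in that case.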
Problems:
- As the temperature of the CPU changes, the timing window moves and more errors are detected.
- The voltage used on the pins depends on the electronics of the motherboard.
The starting voltage is looked up in a table in the BIOS, something like "at this CPU core frequency, the voltage should be X". This is tuned over time for each motherboard version, which is why you have to update the BIOS to get updated tables. It is also why you get problems when Intel releases new CPUs that use lower voltages or run at a higher frequency: until the BIOS is updated, its tables will tell the motherboard to apply too high a voltage to the CPU, which causes data corruption in transfers between the levels of cache inside the CPU.
Overclocking tools add more problems, since they let you tweak the voltages to the CPU pins. The same goes for newer BIOS versions that automatically tweak voltages for you. For debugging, you never want to see overclocking tools, and you want people to set the BIOS to defaults so you have the best chance of working hardware. The debugger cannot detect these slight changes; all it can really look at is the CPU clock rate and maybe the thermal zones.
Most of the time the thermal zones in the debugger are useless. The CPU clock rate is only notable when it shows a rate that is not a multiple of 100 MHz,
or when it is outside the CPU's normal range, e.g. a CPU that should run at 3 GHz but is running at 2.9 GHz. It could be overheated and trying not to burn up, or maybe the BIOS was not updated and does not know about your CPU version that runs at a higher clock rate.
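The two clock-rate checks above can be written down as a short Python sketch. The function name clock_notes and the 3000 to 3400 MHz range are hypothetical; use your own CPU's rated speeds and the value reported by !sysinfo cpuinfo.

```python
def clock_notes(measured_mhz: int, normal_min_mhz: int, normal_max_mhz: int,
                step_mhz: int = 100) -> list:
    """Return a list of reasons the reported clock rate looks odd.

    ASSUMPTION: 'normal' means a multiple of step_mhz that lies inside
    the CPU's rated range; an empty list means nothing stood out.
    """
    notes = []
    if measured_mhz % step_mhz != 0:
        notes.append(f"{measured_mhz} MHz is not a multiple of {step_mhz} MHz")
    if measured_mhz < normal_min_mhz:
        notes.append(f"{measured_mhz} MHz is below the normal range "
                     "(throttling from heat, or outdated BIOS tables?)")
    elif measured_mhz > normal_max_mhz:
        notes.append(f"{measured_mhz} MHz is above the normal range "
                     "(overclocking tool or BIOS auto-tuning?)")
    return notes


# Hypothetical CPU rated for 3000 to 3400 MHz.
for mhz in (3200, 2900, 3050):
    print(mhz, clock_notes(mhz, 3000, 3400) or "nothing notable")
```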
For older systems, dust in the CPU fans causes overheating and CPU cache memory errors, and voltages from a power supply change over time and can cause these errors. The thermal paste on the CPU cooler gets hard and no longer makes a good thermal connection.
I have seen older machines where the CPU cooler became disconnected on one side when the machine was moved. Bad connection = bad cooling.
I have seen water-cooled systems develop vapor bubbles, fail to cool the CPU correctly, and produce this type of error.
Capacitors in the power supply or on the motherboard can start to fail, and voltages change over time.
Anyway, I hope some of this helps.
Just as a side note: the most common error I used to see with RAM (external memory) was an incorrect command rate setting. RAM often requires a 2T (or 2N) command rate for commands to be locked in, but the BIOS defaults are set to 1T (or 1N). The number counts ticks of the main clock. It is often hard to find out what the command rate should be for your actual memory chips.
People often reject memory as bad when this parameter is just not set correctly in the BIOS. There are about 12 parameters that can be set for RAM, and there is no set order to the parameters, so most people only check the first few.
---------
When I googled your RAM, the timings looked like they should be
17-18-18-36-2T timings
The last one is the command rate (2 clock cycles before the electronics' setup and hold times are valid for this RAM to set up its commands to access a memory location).
You should confirm your BIOS is set correctly.
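If it helps when comparing against the BIOS screen, here is a small Python sketch that splits a timing string into named fields and lists any that differ. The field order CL-tRCD-tRP-tRAS is the usual convention for the first four numbers, but treat it as an assumption and check your RAM's spec sheet. The BIOS reading below is invented for the example, and parse_timings and mismatches are made-up names.

```python
import re

# ASSUMPTION: the first four numbers follow the common CL-tRCD-tRP-tRAS order.
FIELD_NAMES = ("CL", "tRCD", "tRP", "tRAS")


def parse_timings(text: str) -> dict:
    """Parse a string like '17-18-18-36-2T' into named timing values."""
    match = re.fullmatch(r"(\d+)-(\d+)-(\d+)-(\d+)-(\d)[TN]",
                         text.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised timing string: {text!r}")
    values = [int(group) for group in match.groups()]
    timings = dict(zip(FIELD_NAMES, values[:4]))
    timings["command_rate"] = values[4]  # the trailing 1T/2T (1N/2N) value
    return timings


def mismatches(expected: dict, actual: dict) -> list:
    """Names of the fields whose values differ between the two sets."""
    return [name for name in expected if expected[name] != actual.get(name)]


spec = parse_timings("17-18-18-36-2T")   # from the RAM's listed timings
bios = parse_timings("17-18-18-36-1T")   # hypothetical BIOS reading
print(spec)
print("differs:", mismatches(spec, bios))
```

In this made-up case only the command rate differs, which is exactly the setting that is easy to miss.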
Also, you might find that it works correctly at 1N with a single stick of RAM, but that you have to slow it down to 2N when the motherboard's RAM slots are fully populated. The timing is affected by the distance of the RAM stick from the CPU. More RAM and a greater distance from the CPU mean you have to use slower timings. Some motherboard vendors put this in the fine print of the manuals for the RAM setup.