Question 3600x BSOD/Random hard reboot investigation. Prime95 consistent core failure. Any tips?

twon069

Looking for any sort of suggestion or guidance on this thus far.

Initial Setup

CPU - Ryzen 5 3600X
Cooler - Noctua U9S push+pull
Motherboard - MSI B450i Gaming Plus AC (latest beta BIOS; the issue occurs on both the latest non-beta and the beta BIOS. When I contacted AMD, they told me to update to the beta...)
RAM - Corsair 32GB 3200 C16 (CMK32GX4M2B3200C16)
PSU - Corsair SF600 Platinum
GFX - Gigabyte 1080 Turbo OC

*never OC'd, only XMP turned on

Issue

Started observing random BSODs and hard reboots (almost always overnight, when I leave it on for low-load work such as downloads and/or automation tasks).
I then turned off XMP and observed the same result.
I then did a clean Windows 10 Pro install and observed the same result.

Diagnostic

Memtest86 pass repeatedly (both XMP on and off)
OCCT pass for free 1 hour test (both XMP on and off)
Linpack Xtreme found no issue 1+ hour (XMP off, haven't tried XMP on)
Prime95. This is where I can get a consistent failure.
Ran the PSU and GFX in another old Intel/DDR3 system on an Ubuntu USB boot, no issue with Prime95 over 8+ hours on blend
Ran the original system on the same Ubuntu USB boot, same failure with Prime95 as observed in Windows

Prime95 Failure

Blend will always produce a failure on Worker #3 or #4. Sometimes, after long enough, both will fail and Ryzen Master will show Core 2 idle.
Large FFT test also produces the same result consistently.
Other FFT settings seem less consistent so far: sometimes they run fine for a long time, sometimes not, but the failure is always on the same core.
A strange thing I noticed: sometimes Worker #3 fails immediately when the stress test starts, and it almost always fails around 1 hour 40 min into the test, when it's doing "Self-test 896k".
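
A quick sanity check on the numbering: Workers #3 and #4 landing on Core 2 is what you'd expect if Prime95 hands out two workers per physical core (SMT on, 12 workers on a 6-core 3600X) and both counts start at 1. That layout is an assumption, not something confirmed in this thread, and `worker_to_core` is just an illustrative helper:

```python
import math

def worker_to_core(worker: int, threads_per_core: int = 2) -> int:
    """Map a 1-based worker number to a 1-based physical core.

    Assumes workers are assigned to cores in order, threads_per_core
    at a time (an assumption about the layout, not a documented rule).
    """
    if worker < 1:
        raise ValueError("worker numbers start at 1")
    return math.ceil(worker / threads_per_core)

# 12 workers on a 6-core / 12-thread CPU under that assumption
mapping = {w: worker_to_core(w) for w in range(1, 13)}
print(mapping[3], mapping[4])  # both workers map to the same core
```

If the mapping holds, two failing workers are really one failing physical core, which is consistent with Ryzen Master showing Core 2 idle.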

Fix Attempt


Bumped SoC voltage offset in 0.0125v increments. Same result every time. I stopped at +0.05v.
Bumped DRAM voltage in 0.01v increments from 1.35v. Same result every time. I stopped at 1.38v.

Contact with AMD/Retailer

I contacted AMD describing my issue. They told me to update the BIOS to the beta and give that a try, which resulted in the same outcome. They then told me it would be easier to contact the retailer for warranty, otherwise I'd have to pay shipping to Singapore... (I'm located in Australia)
I contacted my retailer (Computer Alliance from QLD, Australia) and they told me to ship the CPU/RAM/Mobo to them so they could test it. I did so, and after a week they told me they couldn't reproduce the error and said it might be my PSU or GFX, so they sent it back. (I'm a bit surprised they said it might be the GFX when I described my consistent failure in Prime95... They tested with a generic 550w PSU and a GTX 1660.) I was immediately able to reproduce the issue again...



I've now purchased a new SF750 and will be trying that out when it arrives (although I'm skeptical, since the PSU and GFX are rock solid in an old Intel/DDR3 system).

Not sure what else I can try... I don't have a spare CPU/Ram/Mobo... getting really depressed about this whole thing, especially when retailer came back with no issue...
 
Pretty clear at this point from your testing that either the PSU or the CPU is likely defective.

Since you have a PSU on the way you might as well follow through with installing and testing to see if anything changes.

My money is on the PSU replacement solving the issue. Spontaneous CPU failures are extremely rare without a noticeable outside event like an AIO cooling failure or AC power issue like a lightning strike.

It's always possible some other component like the motherboard is the problem but at this point your testing method is spot on.
 
What's the CPU temp during Prime95?

Stock cooler will be around 80~85deg, and it'll lower its clock depending on temp (3.95GHz~4.1GHz)
Noctua U9S will be around ~70deg, and it maintains around ~4.1GHz
Prime95 failure occurs on both setups, always Worker 3/4 and Core 2

I've been stuck on this problem for so long it's annoying... I'm really hoping it's the PSU at this point... *fingers_crossed
The issue, though, is that the Prime95 large FFT failure is the only repeatable and consistent failing point. Large FFT points to heavier RAM usage, but Memtest always comes back solid... it's really messing with my mind...

Some extra information on voltage
Voltage on default during load/stress:
HWInfo
VCORE: 1.160~1.432v though it maintains around 1.376ish
DRAM: 1.2v solid on default
12V: 11.904~12.000v
5V: 5.040~5.080v
SoC: 1.016~1.024v
3.3V: 3.344~3.360v
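
As a rough check, the 12V, 5V and 3.3V readings above can be compared against the ATX tolerance of +/-5% on the positive rails. This is only a sketch: the numbers are the HWInfo values from this post, `within_spec` is an illustrative helper, and software sensor readings are not as trustworthy as a multimeter or scope on the rails.

```python
# ATX allows +/-5% on the +12V, +5V and +3.3V rails.
TOLERANCE = 0.05

# (nominal, lowest reading, highest reading) from the HWInfo values above
rails = {
    "12V": (12.0, 11.904, 12.000),
    "5V": (5.0, 5.040, 5.080),
    "3.3V": (3.3, 3.344, 3.360),
}

def within_spec(nominal, low, high, tol=TOLERANCE):
    """True if the whole observed range sits inside nominal +/- tol."""
    return nominal * (1 - tol) <= low and high <= nominal * (1 + tol)

report = {name: within_spec(*values) for name, values in rails.items()}
for name, ok in report.items():
    print(f"{name}: {'within' if ok else 'outside'} +/-5%")
```

All three ranges sit inside the +/-5% window, so nothing in these readings points at the PSU, although short transients would not show up in polled sensor values.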

Ryzen Master (they give different readings...)
Peak Core(s) Voltage 1.37v
Average Core Voltage 1.37v
MEM VDDIO 0v
MEM VTT 0v
VDDCR SOC 1.025v
CLDO VDDP 0.9002v
CLDO VDDG 0.9504v

I was recommended to try AIDA64, a few Cinebench runs, and Ryzen DRAM. Going to give those a go and report back...
 
UPDATE:
So it turns out there might've been a bug in the version of Prime95 I'm using, v30.3b6. I was told to grab the latest one on their forum, v30.6b4.

I did so and also changed my PSU to a new SF750 then observed the following:
  • 3 out of 4 times it'll fail immediately on the blend test on one of Core 2's workers, which starts with 480K FFT. (I also saw it once when I was doing a custom 16-17K FFT test)
  • It'll now always fail on FFT 16K, sometimes after Test 3, sometimes a bit further, like after Test 4.
  • Changing Power Supply Control to Typical, bumping vcore by a 0.0125v offset (1 step up) and lowering the RAM clock made no difference. I also tried running each RAM stick by itself in either DIMMA or DIMMB, and none of the combinations made a difference.
 
UPDATE:
Alright, so I've been able to consistently get 16K~21K in-place errors on a Core 2 worker, as per the screenshots below, on various RAM/PSU combinations on version 30.6b4. I also saw the same failure on version 30.3b6 for 16K~21K in-place. The Prime95 forum agreed that it shouldn't be the bug I was told about before: every thread runs the same code, so a software bug shouldn't fail on only 1 particular core.
View: https://imgur.com/a/NzIDKGV

View: https://imgur.com/a/tOlEFiL

View: https://imgur.com/a/jzL2xNI
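
The "every thread runs the same code" argument can be sketched as a script: run one deterministic workload pinned to each logical CPU in turn and flag any CPU whose answer differs. This is only an illustration of the reasoning, assuming Linux (for example the Ubuntu USB boot), since `os.sched_setaffinity` is Linux-only. The hashing workload is a stand-in and is far lighter than Prime95's FFTs, so a clean run here would not clear the CPU.

```python
import hashlib
import os

def workload(rounds: int = 20000) -> str:
    """Deterministic CPU-bound work: hash a buffer over and over."""
    data = b"same code on every core"
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()

def check_each_cpu(rounds: int = 20000) -> dict:
    """Run the workload pinned to each allowed logical CPU (Linux only)."""
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin to one logical CPU
            results[cpu] = workload(rounds)
    finally:
        os.sched_setaffinity(0, original)   # restore the original mask
    return results

def mismatching_cpus(results: dict) -> list:
    """Return CPUs whose result differs from the most common result."""
    values = list(results.values())
    reference = max(set(values), key=values.count)
    return [cpu for cpu, value in results.items() if value != reference]

if __name__ == "__main__":
    print("CPUs with differing results:", mismatching_cpus(check_each_cpu()))
```

Identical code giving a different answer on one CPU only, repeatedly, points at that core (or its power delivery) rather than at the software.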


However, after contacting Computer Alliance again with all the screenshots and recordings, they don't seem convinced, and the reply I got was:
You’re welcome to send it back in for testing.
Note that the problem must be reproducible during testing for this to be considered a warranty fault.
If the problem only exists in your home environment, then the issue evidently lies outside of the CPU/Mainboard.

Sigh... guess I'm screwed with more gambling...