Question: Demanding games are crashing to the desktop on my new build?

saperos · Aug 24, 2023

Hello,
I built a new PC a month ago, and I'm currently experiencing issues with it. Demanding games crash to the desktop, usually accompanied by a "bug report" from the game I was playing.

What's happening is that the game either freezes or the screen turns black for a few milliseconds, and then I'm brought back to the desktop. The Event Viewer sometimes shows an error, but sometimes it doesn't. I don't see any artifacts, and I've been testing with the GPU at stock settings. The problem seems to occur only while gaming. I've run multiple 3DMark stress tests and the Heaven Benchmark for over 40 minutes, and both were fine.
Examples of games I've tested:

Witcher 3 on Ray Tracing Ultra preset with DLSS Quality
Metro Exodus on Extreme (I used its benchmark for testing). It crashes after around the 5th cycle.
- This one sometimes causes a BSOD
- On the other hand I was able to run it once without an error (30 loops...)
Guardians of the Galaxy with max settings and RT (here it crashes less frequently)

What seems to (sometimes) help:

Lowering the in-game settings (such as turning off RT, applying a frame limit)
Undervolting the GPU (although it still crashes sometimes)
Generally, not pushing the GPU to its full potential (not allowing it to get too hot)

Here is the setup:

CPU: Ryzen 5 7600X
Motherboard: ASRock B650E PG RIPTIDE
RAM: G.Skill Trident Z5 Neo 32 GB (6000MHz, CL30)
GPU: EVGA RTX 3070 (used)
PSU: MSI MPG A850G PCIE5 850W

What I have tried:

Tested the GPU in a different PC. Considering the symptoms, I was sure the issue was with the GPU, but it seems it's not. I put the RTX 3070 into my old rig (Ryzen 3600, Corsair 650W PSU, 16 GB DDR4) and could not get it to crash even once.
- It ran smoothly through 10 loops of the Metro benchmark (and then 30 additional loops)
- I also played Witcher for 30 minutes without a single crash.
- After testing, I put it back into the new rig and it failed again after a few loops…
- I used HWInfo to log sensor data in both runs (the GPU hot spot and GPU temperature were sometimes higher in the old rig, and it did not fail there)
Tested the CPU
- I ran Cinebench for 30 minutes with
Tested the RAM
- 4h of Memtest86, 0 errors
Ran stress tests in OCCT
- VRAM - 30 minutes, no errors
- PSU - 2x30 minutes, no errors
  - The Metro Exodus benchmark ran successfully for 30 iterations right after that...
  - ... only to fail after 2-3 iterations 2 hours later, after a restart, with a BSOD and a reboot.
Checked cable connections in the new rig
- Made sure the GPU is connected using two separate cables from the PSU.
- Verified that motherboard and CPU cables are properly inserted.

At this point, I am clueless. I am thinking that maybe the PSU is malfunctioning, but I can’t find any clues in the HWinfo logs. Maybe you guys will find something.
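
For anyone who wants to compare the two sensor logs side by side, here is a minimal Python sketch that reduces a CSV log to the minimum and maximum of a few chosen columns. The column names (`GPU Hot Spot [C]`, `+12V [V]`) and the sample rows are invented placeholders, not values from the logs in this thread; the real column headers in an exported log will likely differ and need to be copied from the file.

```python
import csv
import io


def summarize_log(csv_text, columns):
    """Return {column: (min, max)} for the requested sensor columns."""
    stats = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in columns:
            try:
                value = float(row.get(name) or "")
            except ValueError:
                continue  # skip blanks and any non-numeric rows
            low, high = stats.get(name, (value, value))
            stats[name] = (min(low, value), max(high, value))
    return stats


# Placeholder data only: hypothetical column names and made-up readings.
OLD_RIG = """Time,GPU Hot Spot [C],+12V [V]
12:00:01,80.0,12.10
12:00:02,85.0,12.05
"""
NEW_RIG = """Time,GPU Hot Spot [C],+12V [V]
12:00:01,78.0,12.00
12:00:02,82.0,11.90
"""

COLUMNS = ["GPU Hot Spot [C]", "+12V [V]"]
old_stats = summarize_log(OLD_RIG, COLUMNS)
new_stats = summarize_log(NEW_RIG, COLUMNS)
for name in COLUMNS:
    print(f"{name}: old rig {old_stats[name]}, new rig {new_stats[name]}")
```

Putting the ranges next to each other makes it easier to spot a rail or temperature that behaves differently between the rig that crashes and the one that does not, although a clean summary does not prove the logged component is healthy.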

saperos · Aug 24, 2023

The spam filter stopped me from linking my gdrive, where I have the majority of my test results, including HWInfo logs... Trying here: https://drive.google.com/drive/folders/1FmE0Z3owXsd14SH5G4Nn1Y1EMMEZBYoS

SorryBella · Aug 25, 2023

saperos said:
Tested the RAM

4h of Memtest86, 0 errors

This is fine for checking whether your RAM will run at stock JEDEC speed, but it can't really detect minute issues with XMP settings. If you have tried stock speed then never mind, but if you haven't, try the OCCT RAM stress test. There's also TestMem5, but it is finicky to set up just for a quick RAM check.

MemTestHelper/DDR4 OC Guide.md at oc-guide · integralfx/MemTestHelper (github.com)

saperos · Aug 25, 2023

Hey @SorryBella , thanks for taking the time to reply!

I'm really puzzled by this whole situation. Today, I ran the Metro Exodus benchmark three times in a row for 30 loops each. I was testing to see if the problem might be related to the SSD I use for games. So, I moved Metro to my OS SSD and set the Power Limit to 112% (max) in MSI Afterburner to catch any issues faster. Strangely, the problem didn't occur at all during these tests. It completed all 30 loops without any crashes, and the hot spot temperature reached 93.1 degrees C – this was during the first run.

During the second run, I moved Metro back to my gaming SSD while still keeping the power limit at 112%. Again, the problem never occurred, and the hot spot temperature reached 93.1 degrees C.

For the third run, I kept Metro on the gaming SSD but reset MSI Afterburner to its defaults (setting the Power Limit to 100%). Once again, the problem didn't occur, and the hot spot temperature reached 90.1 degrees C.

One thing that's different from yesterday is that I plugged the PC directly into the wall instead of using the extension cord. I'm not sure if it matters; maybe it's just a coincidence. (My old PC was also connected to this cord, and I had no problems with the tests.)

I still plan to run the OCCT RAM Stress test, as the Memtest86 was run with XMP enabled in the BIOS.

saperos · Aug 25, 2023

@SorryBella
Update: I ran the OCCT Memory test for 1h with 100% memory, Auto Instruction Set, and Auto Threads. I was shocked to come back after an hour and see 7242 errors... Image: https://drive.google.com/file/d/1Y2Yso_LwE-m4gQaRsa3dxl9CO04oT_ot/view?usp=drive_link

I ran it for 10 minutes after that with 80% load: 0 errors.

After that I tried 10 minutes with the AVX2 instruction set. It started to fail almost immediately! Image: https://drive.google.com/file/d/1m3OUp6Y8GsernWUFz_CKz2rdVChT-Fr7/view?usp=drive_link

SorryBella · Aug 25, 2023

saperos said:
Update: I ran the OCCT Memory test for 1h with 100% memory, Auto Instruction Set, and Auto Threads. I was shocked to come back after an hour and see 7242 errors...

Yep, called it. Sounds like your RAM is indeed unstable at the EXPO setting. While the E chipset line has had fewer BIOS updates than the normal chipset line, if you haven't updated to a BIOS with AGESA 1.0.0.7, I'd try that first before thinking of a RAM RMA.

saperos · Aug 25, 2023

SorryBella said:
Yep, called it. Sounds like your RAM is indeed unstable at the EXPO setting. While the E chipset line has had fewer BIOS updates than the normal chipset line, if you haven't updated to a BIOS with AGESA 1.0.0.7, I'd try that first before thinking of a RAM RMA.

Can this truly be the root cause of my game crashes? I turned off EXPO in the BIOS, and the same AVX2 test finished without a single error!

I recently updated the BIOS and am running v1.28; the latest changelog mentions updating AGESA to ComboAM5 1.0.0.7b. More info here

saperos · Aug 25, 2023

@SorryBella looks like my RAM is worse than I thought: with EXPO turned off, it initially passed 30 minutes of the AVX2 test and 30 minutes of the Auto test, but my game crashed again.

I ran the test again with some additional GPU load in the background (Heaven), and the test started to generate lots of errors and crashed OCCT...

OCCT: https://drive.google.com/file/d/1p_NskrVH7ap_MbsnqxLstyVWwvByjNw0/view?usp=sharing
Event Viewer Error: https://drive.google.com/file/d/1miCkyKk5L4YB-EVFUS2H7Hu653_h33Ng/view?usp=sharing

Is there anything else I should test, or should I proceed with the RMA? It looks clear to me that this RAM is broken.

SorryBella · Aug 25, 2023

So it still crashes at dead stock? Yeah, returning it to the shop you bought it from would be a good idea. On all-auto voltages it should function at least at JEDEC speed, but it seems even then it's very random. The only other thing I can think of off the top of my head is turning off RAM retraining.

saperos · Aug 25, 2023

@SorryBella - first, thank you so much for all the invaluable help you are providing!

And yes, now with the SSE instruction set and 95% load it fails even without the additional load in the background.
You can see the timings in the OCCT errors screenshot from my previous message.

saperos · Aug 25, 2023

@SorryBella, apologies for continuing to tag you... I thought I would test the RAM one stick at a time. I was expecting only one of them to cause problems, but both show memory errors after a few minutes of testing. Is it possible that both are malfunctioning, or is the root problem still elsewhere?

One thing I want to mention: my 7600X runs in "eco" mode. These settings are applied in the BIOS:

Attachment: Bios PBO.jpg (drive.google.com)

Eximo · Aug 25, 2023

Potentially a bad CPU IMC, it does happen.

saperos · Aug 25, 2023

Hello @Eximo, thank you for your reply! CPU IMC suggests a CPU problem, do I understand correctly? Can I conduct any other tests to narrow down the list of potential problematic parts?

Eximo · Aug 25, 2023

Well if you can get some different DDR5 to test with you can be more certain. But I was saying the CPU could very well be the problem.

AM4 Ryzen chips also had a habit of losing memory channels and only being able to run single-channel memory. And these CPUs are constructed similarly, with the CPU cores on one chip and the I/O and memory controller on a separate chip.

If running JEDEC memory speeds helps, you can certainly do that temporarily.

It just depends on whether you want to buy another CPU or memory kit to test with, or take it to a shop to do the testing.

Doesn't rule out the motherboard though.

saperos · Aug 25, 2023

Nothing seems to be working. I tested two sticks together with EXPO, without EXPO, then one stick at a time, and RAM errors keep showing up (but only in OCCT; I ran Memtest86 without a problem, as stated in the first post).

saperos · Aug 26, 2023

@Eximo @SorryBella - I performed one final test that I thought of. I moved the single stick to channel 1 from channel 3, but the error flood started in OCCT again in less than 10 minutes from starting. I'm beginning to think there's nothing else to do but to start swapping out hardware... What's your initial guess?

If both sticks are failing when inserted one at a time (and tested in different slots), is it possible that both could be damaged?
Perhaps it's a faulty CPU (like the CPU IMC issue that @Eximo mentioned).
Or could it even be the motherboard?
Is it possible that the PSU can be the problem?

I was initially hopeful when I saw those RAM errors, but it seems like once again, anything might be causing them. 🙁

kerberos_20 · Aug 26, 2023

Have you tried a single stick in all slots? If all slots show errors, that would point to bad RAM rather than the IMC.

saperos · Aug 26, 2023

Hello @kerberos_20, thanks for jumping in!

So far I have tested stick 1 in slot 3 and then stick 1 in slot 1.

Additionally, I tested stick 2 in slot 3.

I will test the remaining combinations.
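
To keep track of what is left, a small Python sketch like the one below can list the stick/slot pairs not yet covered. It assumes four DIMM slots numbered 1 to 4 (an assumption about the board's labeling; adjust `slots` to match the manual) and uses the three pairs already mentioned in this post.

```python
from itertools import product


def remaining_combinations(sticks, slots, tested):
    """List every (stick, slot) pair that has not been tested yet."""
    return [pair for pair in product(sticks, slots) if pair not in tested]


sticks = [1, 2]
slots = [1, 2, 3, 4]  # assumption: four DIMM slots, numbered 1-4
tested = {(1, 3), (1, 1), (2, 3)}  # pairs listed in the post above

for stick, slot in remaining_combinations(sticks, slots, tested):
    print(f"stick {stick} in slot {slot}")
```

If every pair ends up failing, the sticks themselves become less likely as the single cause; if only certain slots fail, that points more toward the board or the memory controller.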

saperos · Aug 26, 2023

@kerberos_20 - I tried, and the PC does not boot (the red CPU and RAM LEDs light up on the motherboard) if:
- 1 stick is in slot 0 (closest to CPU)
- 1 stick in slot 2
- 2 sticks (slots 0 & 2)

kerberos_20 · Aug 26, 2023

saperos said:
@kerberos_20 - I tried, and the PC does not boot (the red CPU and RAM LEDs light up on the motherboard) if:
- 1 stick is in slot 0 (closest to CPU)
- 1 stick in slot 2
- 2 sticks (slots 0 & 2)

try to reseat your CPU

saperos · Aug 26, 2023

kerberos_20 said:
try to reseat your CPU

Just finished reseating the CPU: I cleaned off the paste and applied it again. My hopes were high, as I thought that one of the mounting screws for my Thermalright Peerless Assassin cooler had not been properly tightened. I gave the CPU a spin in Cinebench, and it finished the stability test without issues.

I ran the OCCT memory test after that. With 2 sticks (in slots 2 & 4), no EXPO, and the BIOS on completely stock settings, it ran for 15 minutes (on Auto, 95% load).
After that I started the Heaven Benchmark in the background, and the OCCT test started throwing errors ~10 minutes later....

Is my testing procedure valid? Is it OK to add more load to the GPU using Heaven?

saperos · Aug 26, 2023

@kerberos_20, @Eximo I did one final test: I underclocked the RAM heavily, to 3600 MHz. It ran for almost 40 minutes with Heaven in the background but eventually started to fail... Is broken RAM my best bet for now? I can order new sticks and test whether that helps...

saperos · Aug 27, 2023

@kerberos_20 @Eximo @SorryBella I bought a Kingston FURY 32GB (2x16GB) 6000MHz CL36 Beast Black kit.
I will test it on Tuesday when it arrives. I hope the RAM is to blame.

saperos · Aug 28, 2023

@kerberos_20 @Eximo @SorryBella sorry to keep tagging you folks, but at this point I am really desperate 🙁
While waiting for the new RAM, I started running every possible test in OCCT and spotted that Linpack also throws an error from time to time. What could that mean?

Eximo · Aug 28, 2023

Not sure what the typical error rate is. DDR5 does have a form of ECC built in so it can carry on with minor errors.

Note it down, and see if it still does it when you test with the new memory.
