Question In need of DIRE HELP with seemingly random system crashes

richardrosenman

Reputable
Jan 8, 2019
18
0
4,510
Hi all!

This is my last attempt before altogether ditching this computer as I don't know what else to do at this time.

To explain how we got here... I purchased a pretty crazy computer last summer for my work, which is 3D animation and motion graphics. It was quite the beast and here are the specs:

Threadripper 1950X CPU
4x GTX 1080 Ti
32GB RAM
1600 Watt EVGA PSU
SSD Boot Drive
4TB Secondary Drive
Windows 10 Pro

The apps I use mainly are 3ds Max, Redshift and the Adobe Suite, especially After Effects.

Now for the issue: Every once in a while, seemingly randomly, I'll be working and the GPU signal cuts out. I lose all signal on both monitors (I have a multi-monitor setup). No matter what I do, I can't get a signal back. The only thing I can do is forcibly power off the computer and then start it again.

There is no way to replicate this, which makes it really difficult to solve. It happens mainly when using 3ds Max, so at first I thought it was perhaps software related. Today it happened while using After Effects, so it is system-wide.

I took it back to the store for a full check-up months ago and they tested everything, including a thorough RAM check in case it had bad sectors which could be causing the crash. Everything came back clean.

Here's what I've been looking for when it happens to see if I can get some clues:

    • When this happens, I look at the motherboard and GPUs and they all have their lights on. The fan is still spinning. So I don't think this is a power issue (i.e. not enough of it), or I assume it would shut down everything.


    • I thought perhaps it was a RAM related issue that would occur when the memory gets filled to a certain point and accesses a bad sector but as I said, they did a full thorough check and it came back clean. I also took out the modules and worked with one at a time to see if it happened and it does.

    • I realize the GPU drivers could very well be a culprit. I am using a somewhat older driver so that it's compatible with older software I use. But when I took it to the store, they updated it to a more recent version and the issue still persisted. I am still not using the latest drivers due to software compatibility, but since the store already tried that and it didn't solve anything, I am doubtful it would help.

    • I read that the NVidia sound driver caused problems for some so I disabled those but it still happens.

    • I read I should install ONLY the GPU drivers and nothing else that comes with the NVidia package which I still have to try.


Based on this, can anyone help me figure out what else I could do to resolve this? I've been dealing with the issue for over a year and it's awful losing work every time due to unpredictable crashes. What else could I try to test the issue? What other clues can you gather from this??

I also checked the Windows system log today after it crashed but it doesn't give me any useful clues. It says power shut down unexpectedly which is presumably when I hard-rebooted it and the errors before that don't shed any light on what's happening.

Here are some of the logs, starting with the power shutdown which I assume is me turning it off for the reboot:

3:31pm - Audit events have been dropped by the transport. 0
3:31pm - The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
3:30pm - The description for Event ID 56 from source Application Popup cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.


Somebody... anybody... please help!!

Thanks,
-Richard
 

Ralston18

Titan
Moderator
Drives: make, model, capacity, how full?

PSU: Bronze, Gold...?

Look in Reliability History for error codes, warnings, and even informational events that correspond with the signal losses.

Event viewer may also be capturing errors. However, Event Viewer is not as user friendly as Reliability History.

And do remember that you can click on any given error for more information. That information may or may not prove directly comprehensible or useful.
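The same lookup can also be scripted. The following is a minimal sketch, assuming the built-in Windows `wevtutil` tool and assuming that the relevant entries are the usual unclean-shutdown events (ID 41 from Kernel-Power and ID 6008 from EventLog). The function names are made up for illustration, and the ID list may need adjusting for this machine.

```python
import subprocess

# Assumption: 41 (Kernel-Power) and 6008 (EventLog) are the usual
# Windows event IDs for an unclean shutdown. Adjust as needed.
UNCLEAN_SHUTDOWN_IDS = (41, 6008)


def build_query_command(event_ids=UNCLEAN_SHUTDOWN_IDS, max_events=20):
    """Build a wevtutil command listing the newest matching System events."""
    id_filter = " or ".join(f"EventID={eid}" for eid in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on the event ID
        f"/c:{max_events}",   # maximum number of events to return
        "/rd:true",           # newest events first
        "/f:text",            # human-readable text output
    ]


def recent_unclean_shutdowns():
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_query_command(), capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the affected PC, `print(recent_unclean_shutdowns())` should list the timestamps of the forced reboots, which can then be compared with whatever was logged just before each one.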

You mentioned that the system was in for maintenance "months ago". Could need another check-up. However, you can eliminate some possibilities yourself.

Power down, unplug, open the case.

Clean out dust and debris.

Ensure that all cards, cables, RAM, and jumpers are fully and firmly in place. Look for signs of overheating around components.

With lots of heavy graphics work the PSU was likely working at the high end of its rated wattage. Over time the PSU degrades and starts to falter and fail.

This is especially true as the PSU nears its designed EOL (End of Life).
 

richardrosenman

Thanks for your reply. Some questions for you:

"With lots of heavy graphics work the PSU was likely working at the high end of its rated wattage. Over time the PSU degrades and starts to falter and fail. "

I used a power calculator today and with all my parts, it said the power usage should be approx. 1350 watts. My PSU is an EVGA 1600 G2 that supposedly provides 1600 watts. So it should be well within the PSU's capacity. But even if it wasn't, wouldn't a loss of power be reflected in the rest of the machine also shutting down, i.e. fans, lights, etc. all stop working?
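For reference, the arithmetic behind those two figures can be sketched as follows. This is only an illustration using the numbers quoted above (the calculator's 1350 W estimate and the 1600 W rating); the function name is hypothetical, and the estimate itself is whatever the calculator assumed, not a measurement.

```python
def psu_load(estimated_draw_w, rated_w):
    """Return (load as a fraction of the rating, spare watts)."""
    load = estimated_draw_w / rated_w
    headroom_w = rated_w - estimated_draw_w
    return load, headroom_w


# Figures from the post: calculator estimate vs. PSU rating.
load, headroom = psu_load(1350, 1600)
print(f"Load: {load:.0%} of rating, headroom: {headroom} W")
# prints "Load: 84% of rating, headroom: 250 W"
```

So the estimated draw sits at about 84% of the rating, which is inside the rating but does fit the description of a PSU working at the high end of its rated wattage.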

The OS drive is a Samsung SSD 960 EVO 500GB and it is only 3/4 full.

The secondary drive is a SATA ST4000DM004 drive 4TB with only 1TB used.

I can clean it out but is it likely that's the cause? Do you think the GPU drivers could be at fault here, or the power, or something else? If you were to rate them in level of probability?

I just checked the reliability report but it only shows the most recent power shutdown, the one I listed above. What else could I check for you that might give us some insight?

Remember that the maintenance they did, did not resolve the problem. They could not reproduce it so all they did was update the GPU drivers but the issue persists.

Short of completely replacing the entire machine, I don't know what else to do at this point. Please advise?

Regards,
-Richard
 

richardrosenman

I’m going to add an update to this to see if it sheds any clues for anyone.

I can now induce the crash.

If I set up an After Effects render to go, when reaching the computationally-taxing part, it will crash. Once again, I lose signal to the monitors and if I’m playing music, it stops.
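Since the crash can now be triggered on demand, one way to pin down exactly when the machine freezes is a small heartbeat logger left running during the render. This is a minimal sketch under stated assumptions, not a diagnostic tool: the function name and log path are made up for illustration. After the forced reboot, the last timestamp in the file shows when the system stopped responding, which can be matched against the Windows logs.

```python
import os
import time
from datetime import datetime


def run_heartbeat(path, interval_s=1.0, beats=None):
    """Append one timestamp per interval and push each line to disk.

    beats=None runs until interrupted; a number stops after that many lines.
    Returns the number of lines written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as log:
        while beats is None or written < beats:
            log.write(datetime.now().isoformat(timespec="seconds") + "\n")
            log.flush()
            # Ask the OS to write the line to disk now, so it is not
            # lost in a buffer if the machine is powered off hard.
            os.fsync(log.fileno())
            written += 1
            time.sleep(interval_s)
    return written
```

For example, start `run_heartbeat("heartbeat.log")` in a separate console window before launching the render and stop it with Ctrl+C if the render finishes normally.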

This is consistent with my CAD software, which induces the same crash when working with computationally-taxing high-density models.

These seem to be specific to CPU computations, not GPU computations. I can render with the four GPUs and it’s fine. No crashes with those.

So what does this mean? Does this suggest that because it’s during the computationally-taxing sections it’s overheating the CPU? I know very little about this. Does Win 10 have a heat monitor I should use and, if so, what temps should my Threadripper 1950x and MB be at?

Please can I get some advice? I am really struggling here.

Regards,
Richard
 

DSzymborski

Titan
Moderator
Have you not tried simplifying the rig and pushing it with fewer parts? There's a lot going on with four GPUs at once. The proper thing to do would be to take out three of the GPUs and stress-test the PC with each of the GPUs in by themselves. Also, just use the SSD.

Also, I would try swapping out the PSU, preferably somewhere with a good return policy on the PSU. The EVGA G2 is an excellent PSU, but even excellent PSUs can fail prematurely and your symptoms are consistent with a failing PSU.
 

richardrosenman

"Also, I would try swapping out the PSU, preferably somewhere with a good return policy on the PSU. The EVGA G2 is an excellent PSU, but even excellent PSUs can fail prematurely and your symptoms are consistent with a failing PSU."

Hi, and thank you so much for the reply. I can certainly try a different PSU - I just need a clue as to where to start.

2 questions:

1 - As I explained earlier in the post, when the crash happens, I look at the computer and I can see the lights are still on and the fans are still spinning. Wouldn't a PSU failure cut out power to the entire machine?

2 - When I render with 4 GPUs at once, this presumably uses far more power than when just rendering with the CPU. Yet this doesn't crash the system. So if it was a power issue, wouldn't a higher draw in power with the 4 GPUs rendering trigger a crash?

-Richard
 

DSzymborski

1. Not necessarily.

2. The first step is always to simplify and isolate as much as possible.
 

richardrosenman

"2. The first step is always to simplify and isolate as much as possible."

I have tried to remove some of the GPUs but had trouble, as there is a lock on the MB that I have difficulty getting access to. I thought simply unplugging power to the GPUs would make them unavailable, but the computer won't start with them plugged in, even if unpowered.

With regard to your suspicion that it's power related: could this have something to do with this power-hungry computer not getting sufficient power from my wall outlet? I work in my condo, so perhaps there isn't enough power available for the computer? It still doesn't make sense to me that it doesn't crash with four 1080 Tis rendering at full power but does crash when rendering with the CPU only.

-Richard
 

richardrosenman

So since I can't take out the GPUs, I disabled three of the four in Device Manager and plugged my display into another one instead of the primary one.

I ran the test that usually induces the crash and it didn't crash. This doesn't mean it won't because it usually crashes with that test but sometimes doesn't. I will keep working for a while with just one GPU in use and see if it remains crash free.
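To put a rough number on "usually crashes but sometimes doesn't", here is a small sketch of how many clean runs it would take before luck alone becomes an unlikely explanation. It assumes each run crashes independently with the same probability while the fault is still present; the 0.7 used below is a made-up figure for illustration, not something I have measured, and the function names are hypothetical.

```python
import math


def chance_of_lucky_streak(crash_prob, runs):
    """Probability of `runs` clean runs in a row if the fault is still there."""
    return (1 - crash_prob) ** runs


def runs_needed(crash_prob, confidence=0.95):
    """Clean runs needed before a lucky streak is less likely than 1 - confidence.

    crash_prob: assumed chance that a single run crashes if nothing changed.
    """
    if not 0 < crash_prob < 1:
        raise ValueError("crash_prob must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - crash_prob))


# Assumed figure: the test crashes about 7 times out of 10.
print(runs_needed(0.7))  # prints 3
```

With that assumed rate, a single clean run would still happen by chance about 30% of the time, while three clean runs in a row would happen less than 3% of the time.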

If so, is it safe to assume it could mean one of two things:

1 - One of the GPUs is faulty? Perhaps the one that was being used as the primary?
2 - There is less strain on the PSU since it's only powering one? Is this true though? Because although I disabled the GPUs in the device manager, they are still powered.

Any clues based on this?

Thanks,
-Richard

P.S. I also can't see the BIOS screen anymore on startup as I guess I am no longer using the primary GPU, but once it loads Windows I can see the desktop.