[HELP!] S5520SC Workstation very unstable - Random BSOD/Reboot/Freeze

Mohamed_Hussein

Mar 19, 2016
Hello everyone. It took me months to finally write this message. I'm tearing my hair out trying to find out what is wrong with my build. I have read tons of articles and posts all over the web, but to no avail. I'm not a hardware expert, so I don't mess with my case a lot. I hope to find some help on how exactly to approach this situation without causing even more problems! I will try to write in FULL detail so that anyone out there with the knowledge can hopefully help me. And for those of you who will skim through, I made the key points readable too.

First of all, here are my specs:
Case: SC5600 Chassis
Motherboard: Intel S5520SC Server board
CPU: 2x Xeon E5645 2.4 GHz
CPU Coolers: 1x Antec Kuhler H2O 620 + 1x Antec Kuhler H2O 650
RAM: 12x4GB RAM Chips
GPU: ATI Radeon HD 5770
HDD: WD Caviar Blue 1TB (Primary and System HDD)
OS: Windows 7 Ultimate x64

Here is my issue in a nutshell:
My machine runs fine most of the time, for a few days, weeks, or months in a row. Then suddenly one or more of the following scenarios starts happening and sticks around for a while, repeating time after time until it drives me crazy! The consequences range from once costing me an HDD that held 5 years of my work and personal data (which I miraculously managed to recover) down to the mere annoyance and frustration of a PC randomly shutting down in your face and refusing to tell you why!

So, down to the details! Here are the symptoms detailed as much as I could analyze them:

1- BSODs:

My machine starts showing all kinds of BSODs. Sometimes while I'm browsing in Chrome, sometimes while writing this post. Sometimes when it's idle, and sometimes when it's under a certain load. A few worth mentioning are:


MEMORY_MANAGEMENT
REFERENCE_BY_POINTER
IRQL_GT_ZERO_AT_SYSTEM_SERVICE
STOP: 0X0000001E (This one had no tagline)
APC_INDEX_MISMATCH
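
For anyone cross-referencing the list above, here is a minimal lookup sketch pairing each stop error with its bug check code as listed in Microsoft's bug check code reference. The dictionary and helper names are just illustrative. It also shows that the 0x0000001E stop with no tagline corresponds to KMODE_EXCEPTION_NOT_HANDLED.

```python
# Bug check codes for the stop errors listed above.
# BUGCHECK_CODES and name_for_code are illustrative names, not part of any tool.
BUGCHECK_CODES = {
    "APC_INDEX_MISMATCH": 0x01,
    "REFERENCE_BY_POINTER": 0x18,
    "MEMORY_MANAGEMENT": 0x1A,
    "KMODE_EXCEPTION_NOT_HANDLED": 0x1E,  # the STOP that showed no tagline
    "IRQL_GT_ZERO_AT_SYSTEM_SERVICE": 0x4A,
}


def name_for_code(code):
    """Return the symbolic name for a bug check code, or "UNKNOWN"."""
    for name, value in BUGCHECK_CODES.items():
        if value == code:
            return name
    return "UNKNOWN"


if __name__ == "__main__":
    # Print the table sorted by code, formatted like the STOP screen shows it.
    for name, value in sorted(BUGCHECK_CODES.items(), key=lambda kv: kv[1]):
        print(f"0x{value:08X}  {name}")
```

The codes span memory, object reference counting, and kernel APC state, so the list does not point at a single driver.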



2- Random restarts:

It randomly restarts. Sometimes the screen shuts off completely, as if I had pressed the Reset button on my case; the mobo makes two mild tick sounds, then one additional tick sound after a few seconds, and then it boots up. Normally, when I push the power button to start it up, it makes one tick sound.
Btw, sometimes it shows some garbled colors on the screen for a brief moment before restarting.

3- Freezing:

The screen fills up with garbled colors all over the place and the PC stays frozen until I press the Power button once, then it shuts down.
On other occasions the screen simply freezes without restarting or garbling the colors. It just freezes. Then I must push the power button for it to shut down too.
It's worth mentioning that the grinding sounds coming out of my Xeons stop when any of these incidents occur, which I think means they have stopped processing any data.

Now to the really WEIRD part (at least to me): Last year these incidents were mild, happening only once every few months. Now, once they start happening even once in a day, it's almost impossible to keep the PC running normally for the rest of that day. I must shut it down completely for at least a day or two, and then when I come back it works just fine! Then it keeps working for a random number of hours until it's "loaded" again, and it starts throwing all of these problems at me all over again.

Also, after restarting or freezing it sometimes gives me six long, frightening beeps and refuses to boot. I looked in the manual and there is nothing about 6 beeps, only 3, which means memory issues. Does six mean memory issues but for 2 processors? It's worth mentioning that when I leave it alone for 10 minutes it boots again! EDIT: Also, when I let it beep to the end it boots normally! But of course, it keeps throwing the above-mentioned problems through the work day nonetheless.

Resulting problems from the last incidents:

These happen occasionally and in no particular order. But they started showing up after the machine endured quite a lot of the above-mentioned crashes.

1- Startup Repair and Failing to repair the problem:
Windows was not showing me Startup Repair a few weeks ago, but now it does. And when I let it continue, it always tells me that Windows cannot repair this error. So I simply reboot and start Windows normally. Sometimes it runs, and sometimes it FREEZES on the Windows logo screen.

2- Random problems with my installed applications:
Last time, 3ds Max wouldn't start until I deleted the FlexLM folder and re-entered the license. Other problems are minor but still suspicious.

What I tried to do to solve the problem:

• I replaced my HDD.
• I updated my BIOS (last year)
• I reinstalled Windows before replacing the HDD, then cloned it to the new drive after buying it.
• I took out the RAM chips, cleaned and sanded them, dusted and sprayed my whole PC, and re-applied thermal paste to the CPUs (last year as well)
• I researched online A LOT and read lots of similar problems, only to find people solving it by replacing the motherboard! Which is the LAST thing I can really do right now. I won’t even find an identical board online.


Facts that are worth mentioning:

I'm no expert (obviously lol). I'm just an Architect and enthusiast who is starting up in the PC building world. I know absolutely nothing about my hardware beyond the dozens of articles I read about how it works.

My case has really BAD airflow and I don't know how to fix that. The two CPU cooler fans are blowing in OPPOSITE directions. Plus, my case fan started screaming like hell after the BIOS update, so I unplugged it, which leaves no good ventilation inside.

I am a 3D artist, so I have used this rig quite heavily. It never complained even with weeks of continuous 100% CPU activity in the summer (25C-35C ambient temperature). These issues started only two years ago and escalated drastically in the last month. I think they have nothing to do with temperature.

This machine suffered from lots of electrical power outages and surges. In 2014 an outage could occur every 15 minutes and last for 1 hour, resulting in an average of 10 power losses/Day for 3 months in a row! I was dumb enough to ignore that back then :( I guess I'm paying the bill now.
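
To put that figure in perspective, here is a rough back-of-the-envelope sketch of the total number of hard power losses over that period. The 30-day month is my assumption; the other numbers are the ones stated above.

```python
# Rough estimate of how many abrupt power losses the machine took in 2014.
LOSSES_PER_DAY = 10      # average stated above
MONTHS = 3               # duration stated above
DAYS_PER_MONTH = 30      # assumption: approximate month length


def total_power_losses(per_day, months, days_per_month=DAYS_PER_MONTH):
    """Total unclean power cuts over the whole period."""
    return per_day * months * days_per_month


if __name__ == "__main__":
    print(total_power_losses(LOSSES_PER_DAY, MONTHS))  # prints 900
```

So the board, PSU, and drives went through roughly 900 unclean power cuts.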

Useful info to help troubleshooting the issue:

My graphics card driver is up to date and I double checked that last week (actually AMD discontinued support for my "ancient" card).

My idle stats, shown using HWMonitor, are in this image

I have never updated Windows since I installed it in March 2016.

I only replaced my PSU once during the last 6 years.

I checked the Kernel Log and it showed error 41 with empty parameters.
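
In case anyone wants the raw entries, here is a minimal sketch of how they could be pulled from the System log with the built-in `wevtutil` tool. I'm assuming the "error 41" above is the Kernel-Power event with ID 41, which Windows logs when the system restarts without a clean shutdown. The function name and the default of 5 events are just illustrative.

```python
import subprocess
import sys


def build_event_query(event_id=41, max_events=5):
    """Build a wevtutil command listing the newest System-log events with this ID."""
    return [
        "wevtutil", "qe", "System",
        f"/q:*[System[(EventID={event_id})]]",  # XPath filter on the event ID
        f"/c:{max_events}",                     # limit the number of events returned
        "/rd:true",                             # newest events first
        "/f:text",                              # human-readable output
    ]


if __name__ == "__main__":
    cmd = build_event_query()
    if sys.platform == "win32":
        # Assumption: run from a prompt that is allowed to read the System log.
        subprocess.run(cmd, check=False)
    else:
        print("Run this on the Windows machine:", " ".join(cmd))
```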


Finally, these are a few questions I think would help troubleshoot my issue:

• Do the garbled colors mean it's a GPU fault, or is that not necessarily the case? Note that my GPU idles at an average of 37C in ~16C room temperature.

• Can this be one of the CPUs getting very hot very quickly at some point? Last week, after the PC refused to boot, I pressed on the pump of my 650 cooler and the PC then booted.

• Could this be electrostatic charge that accumulates on the board and dissipates when I leave it off for a few days? Where and how should I clear that charge manually if necessary?

• How can I know for sure whether or not it's my PSU? In fact, I run heavy-duty applications, and 95% of the time this causes no trouble. The best description I can give is that the crashes simply happen when they happen.

To be honest, this baby survived some really tough times with me. Over the last 6 years I mostly left it running for days and maybe weeks with the CPU rendering large animations and operating at 100%. Sometimes the heat was so excessive that it would shut itself down with a long beep, but I would start it up again in 10 minutes and it worked flawlessly. It's also worth mentioning that all of that happened while the two CPU coolers (Antec Kuhler 620 water loops) were not even mounted to the back of the case. They were simply dangling down facing the bottom of the case all these years and I didn't know! (I am not the one who built it initially).
My point is: All of this never caused an issue and the performance was rock solid.

Thank you a lot for reading! Sorry for the LONG post, but I tried to lay it all out to make it easier for you. I truly appreciate your help as this is my main source of income and I use it for professional work.

All help is appreciated, thank you again.