Dell SC440 Server with 3050 Dual Core Xeon Processors...shuts down after about 20 minutes.

ukiltmybrutha

May 4, 2012
Server: PowerEdge SC440 with Xeon 3050 Dual Core 2.13 GHz Processors

Issue: The server shuts down after about 30 minutes with no warning whatsoever when the OS (W2K8R2) is running along with a few other applications, such as Acronis Backup and a Hyper-V instance.

When I power it back up, I get an error indicating that a thermal event caused the system to shut down.

When I go into the BIOS, I get a "CPU0 Temperature is out of range" error message.

I blew out the CPU heat sink and every other area VERY well with compressed air. The processor heat sink was about 30% blocked with dust. I also cleaned the heat sink and CPU, then applied MX-4 heat sink compound using the line method.

I have used Dell Diagnostics to see what the issue is but the Server passes all diagnostic tests.

Using CPUID HW Monitor, I don't see any exceptional overheating prior to shutdown. The shutdown happens as low as 47C. The fan doesn't spin much faster prior to shutdown either.

Using HWINFO64, I turn the fan speed up to 2700 RPM which only delays the shutdown by a few minutes.
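One way to check whether the temperature is really climbing before the shutdown is to log the sensor readings to a file and look at the last samples written before the machine went down. Here is a minimal sketch in Python. It assumes a CSV log with hypothetical column names `time`, `cpu_temp_c`, and `fan_rpm`; a real HWINFO64 or HWMonitor export will use different headers, so adjust those. The sample rows are made up for illustration only.

```python
import csv
import io
from datetime import datetime

# Made-up sample log for illustration; the column names are hypothetical.
SAMPLE_LOG = (
    "time,cpu_temp_c,fan_rpm\n"
    "10:00:00,40,1400\n"
    "10:05:00,41,1400\n"
    "10:10:00,44,1450\n"
    "10:15:00,47,1500\n"
)

def last_samples(csv_text, n=5, time_col="time",
                 temp_col="cpu_temp_c", fan_col="fan_rpm"):
    """Return the last n (timestamp, temp_c, fan_rpm) rows of a sensor log."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append((
            datetime.strptime(row[time_col], "%H:%M:%S"),
            float(row[temp_col]),
            float(row[fan_col]),
        ))
    return rows[-n:]

def temp_slope_c_per_min(samples):
    """Average temperature change in degrees C per minute across the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed_min = (samples[-1][0] - samples[0][0]).total_seconds() / 60.0
    if elapsed_min == 0:
        return 0.0
    return (samples[-1][1] - samples[0][1]) / elapsed_min

tail = last_samples(SAMPLE_LOG, n=3)
slope = temp_slope_c_per_min(tail)
```

If the last readings before the log stops are flat and well below any sensible limit, that would point away from genuine overheating and toward a bad reading or a board-level fault.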

I am wondering if the CPU is permanently damaged?

I reimaged the server, but it still does the same thing.

The server will NOT shut down if I leave it with just the BIOS running.

Any ideas what might be wrong here? This thing has been a nice, quiet, energy efficient, media server and VM sandbox over the years.

I'd rather keep it if possible.

Thanks.
 
ukiltmybrutha,

1. Just to eliminate it as a possibility, remount the CPUs. This time, clean the CPU surface and heatsink interface with denatured alcohol. If you are confident you won't distort the surface, polish the heatsink interface surface using a block held very evenly level and dampened scouring powder. Apply a thin X-shape of thermal paste that extends right to the corners of the CPU, and again on the heatsink. Using a folded-over business card, spread the paste so it makes a thin coating over the entire surface of both parts. Don't use too much; the idea is that it only fills the tiny ridges and valleys in the metal. Screw down the heatsink only a couple of turns at a time in a diagonal pattern: 1, 3, 4, 2. The idea is to keep the pressure even over the surfaces. Don't over-tighten.

2. On a server, the fans should be highly audible all the time. Go to BIOS and, under Thermal (which may be under Advanced Options or Options), select the highest fan idle settings. On a server they will really roar, all the time.

3. If the fans are not speeding up, there's a possibility that the algorithm that converts the sensor data to temperatures is corrupted. When it sends incorrect temperature data, the BIOS triggers a thermal shutdown. Servers are set to shut down at a far lower percentage of the maximum rated CPU temperature to protect the processors.

To check this possibility, download, install, and run Intel Processor Diagnostic.

This will run a stress test on the CPUs and display the results in an unusual way, without showing the absolute temperature: "Expected temperature more than 1 degree below maximum / Actual temperature is 40 degrees below maximum." As the test progresses, it will say 34 degrees below maximum, 30 degrees below maximum, and so on. At the end, if the actual temperature is more than 1 degree below the maximum, it will give the CPU a PASS.

If this test completes, then you know that the sensor conversion to degrees is incorrect and the BIOS is reading this and going into its thermal protection mode based on incorrect information. This is far more likely if the version of Windows or Windows Server is an old one that was recently installed and still needs to be fully updated.

This sequence of events happened to me two days ago. I installed a 2013 copy of Win 7 Professional (HP recovery partition) on an HP z620. HWMonitor indicated the Xeon E5-2690 was running at 82C, 10C over the rated maximum, and idling at 63C. Like your SC440, the fans never audibly sped up and there were no other signs of thermal problems. Passmark Performance showed low performance for the E5-2690. However, Intel Diagnostic, during the stress test, started at 43 degrees below maximum and ended at 23 degrees below maximum. Just at that time, Windows went through the first substantial update. When I restarted, HWMonitor showed the E5-2690 was idling at 34C. I ran Passmark; it idled at 42C and in three to five minutes settled back to 35-36C, actually a bit lower than I expected for an LGA 2011 8-core. The CPU mark went from 14041 to 14791. So, it was a magic update and I think it was the sensor-to-temperature number conversion.
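To make the comparison concrete, the "degrees below maximum" figure can be turned back into an absolute temperature and set against what the monitoring tool reports. This is a rough sketch only. It assumes, purely for illustration, that the diagnostic's "maximum" is the same 72C rated figure implied above (82C being 10C over); the tool may well use a different reference point, so treat `max_c` as something to look up for your own processor. The function names and the 10C tolerance are made up for this example.

```python
def temp_from_margin(max_c, margin_below_max_c):
    """Absolute temperature implied by a 'degrees below maximum' reading."""
    return max_c - margin_below_max_c

def readings_disagree(monitor_c, max_c, margin_below_max_c, tolerance_c=10.0):
    """True if the monitor's absolute reading and the margin-based reading
    differ by more than tolerance_c (an arbitrary, hypothetical threshold)."""
    implied_c = temp_from_margin(max_c, margin_below_max_c)
    return abs(monitor_c - implied_c) > tolerance_c

# Figures from the z620 story; max_c = 72 is an assumption (see lead-in).
implied_c = temp_from_margin(72, 23)
suspect = readings_disagree(82, 72, 23)
```

Under that assumption, the end-of-test margin of 23 degrees implies about 49C while the monitor showed 82C, which is the kind of gap that suggests the absolute reading, not the CPU, is the problem.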

Let us know what happens.

Cheers,

BambiBoom

 
Thank you BambiBoom!

I have some comments. This particular server is notoriously quiet. Many articles substantiate that, even with the fans running full tilt. This is partly why it makes a great media server.

As to point 1: Thank you, awesome information.

As to point 2: The BIOS supports neither thermal settings nor fan speed control. I use HWINFO64 in the OS to speed up the fans. I don't really see any need to mess with the BIOS since I have complete control over fan speed in the OS, do you?

As to point 3: The fans do speed up when I use HWINFO64, but the server still shuts down...it is just delayed a bit longer. Thank you for letting me know about Intel Processor Diagnostic. I am definitely going to try that and let you know what happens.

However a few important updates:

For the price of the subject server, I just got myself another used one off eBay, and things are getting interesting.

a) I noticed 4 blown caps on the original server's motherboard...the 1800 microfarad ones that go bad. I pulled them and ran the system without them. The system shut down even faster. I ordered new caps, but in my experience caps aren't always necessary in every application, especially for such a quick test. FYI, the error that I am getting is supposed to be closely associated with these 4 bad capacitors. It is all over Google.

b) I put the CPU from the old server in the new one without any regard to the old thermal paste...just did a quick and dirty swap to see what would happen. Well, the new server shut down almost immediately. I then followed a reasonable protocol on the new heat sink: new heat sink compound, dust removal, cleanup with denatured alcohol, etc. The new server has now been running fine with the old server's processor under heavy load for 12 hours, e.g. running backups, virtualization, downloads, and 1080p media service. Temps did not exceed around 33C-36C, whereas before they approached 47C-50C under the same circumstances.

I hate ambiguous situations, but the issue seems to be potentially fixed. I'd still like to resolve the issue with the original server though.

Note: The "new" server came with a completely different style heat sink than the old one even though the model number is the same:

Here is a pic I found of the style I received in the "new" server:

http://4.bp.blogspot.com/_oMvVfvjDdrY/SiIK4cq6t8I/AAAAAAAAAHs/kfc3AF_wDA0/s320/P1010531.JPG

Here is a pic of the style that I had in the old server:

http://www.aliexpress.com/item-img/PE-SC440-Server-heat-sinks-JT147/1914000096.html

I don't want to automatically assume that the old server's heat sink is not as good as the "new" server's simply because it is bigger.

I am surprised that the "new" server had a larger heat sink because the "new" server only had an E2180 processor while the old one had a faster Xeon 3050 processor.

Does anyone know which heat sink is "better" and why?

Thanks!





 


ukiltmybrutha,

I'm very pleased to read of progress. It does seem the blown capacitors would be a problem. I should think that puts a special stress on the CPUs, and though those Xeons are very rugged, it's not worth learning the limits.

My current project, the HP z620 (purchased the 2nd E5-2690 last evening) with the instrumentation thermal problem, does have BIOS fan control under Options / Thermal, and one chooses the desired number of asterisks. Full asterisks = continuous 3690 RPM. The z620 is the only one of my workstations or servers that has any kind of fan control; none of the other motherboards support PWM. My tendency has been to consider auxiliary rear panel case air extraction fans, e.g., 90mm PWM fans that run off a Molex 4-pin and are linked to a front panel controller or inline speed controls. This is especially important in a system with DDR2 RAM, which is hot, hot, hot, and I have a Precision T5400 (2X X5460) in which the RAM seems always to run at 65-70C. There are memory fans and DDR2-667 is rated to 100C, but that should never be tested. The aux. fans would have to be set, and possibly fussed with, but I'm "thermally anxious". The aux. fans might be a consideration for your servers.

I call the Aluminum heatsink the "Organ Pipes" version and the Steel / Copper one the "Office Building Model". It's difficult to believe that a server would have the cast Al one and not the uprated Steel / Copper, which was used in workstations such as the Precision T3500 with the 130W CPUs. The Steel / Copper is better because of the substantially increased surface area of the Steel fins and the thermally circulating air in the Copper tubes, which are highly thermally conductive = far superior heat dissipation. I'd suggest replacing any of the Organ Pipes with Office Buildings. The last Office Building I bought was only $12, shipping included. I have four of those: Precision T5500, T3500, and PowerEdge 2600, which is noisier than a Boeing 747. I'm thinking of replacing it with a Precision T7500 with 2X X5670s and PCIe SSD running RAID 5 off a PERC H310. By the way, if the BIOS in the PowerEdge (sold new with PERC 6/i or H200) will recognize a PERC H310, it will convert the disk system from 3 Gb/s to 6 Gb/s. Without other changes, an H310 increased the disk score of a Precision T5500 from 1940 to 2649. The last one I bought on Ebahhh was only $38 with shipping.

Cheers,

BambiBoom

 


Thank you for this info. Best of luck on this build.