Thermal Event Diagnosis and Fix

SpaceGoat

Jan 6, 2016
Dear all,

To begin with, my system specs:

Intel DX58SO Motherboard
Default BIOS
Intel (R) Core (TM) i7 CPU 950 @ 3.07GHz (8 CPUs)
4 GB Corsair DDR3 RAM
Nvidia GeForce GTX 970 GPU (up to date drivers)
Corsair 650 Watt PSU

Now the problem. On the 1st (great start to a new year!), my computer suddenly powered off while I was playing a game (a graphics-intensive one: Helldivers with all settings maxed, to be exact). The crash was more of a freeze: my mouse and keyboard stopped working, and Ctrl-Alt-Del did not work either. After the reboot I was told that my computer had shut down due to a Thermal Event. (I had not recently added any hardware, the computer had not suffered any hard knocks, and I had not recently installed or uninstalled any drivers.)

I looked up "thermal event" and the internet told me that this problem is caused by the CPU overheating, typically because the thermal paste between the heatsink and the processor has worn out, eroded, etc. So I took my machine to a hardware shop that deals with assembly, and I got them to take apart my entire PC and dust and blow out all the components. I also got the heatsink removed and the old thermal paste cleaned off (there was very little left to clean, and I was concerned to note that at some spots there was no gunk at all). Fresh thermal paste was applied, after which they put my machine back together.

Even while at the shop, I noted that the boot problem was persisting (after the thermal paste had been reapplied and the heatsink and fan reseated). The system would boot erratically. Sometimes it would boot to Windows. Other times it would not boot at all (even the display was not coming on - the monitor kept insisting there was no signal). I got the system tested with a spare GPU and the erratic booting persisted. I tested the system with known good RAM sticks at the shop itself and the erratic booting remained.

However, towards the end (with my own RAM stick and my own GPU inserted) the computer booted 4-5 times successfully, which led me to believe the problem was resolved. I brought the computer home, tried booting it, and was confronted with the same problem (erratic booting - at times it boots, at times it does not).

Now, this is what is happening:

Upon turning on the computer one of the following happens:
a. The fans and LEDs power on for a second and then immediately power off. There are no beeps. It happens within a second. This can happen 3-4 times in a row. I have checked that ALL the fans power up initially (though only for less than a second), including the CPU fan, back fan, front fans, GPU fans and PSU fan.
b. At times the computer will power on again after powering off initially as stated in (a) and will try to boot.
c. Most of the time the computer gives me a long two-tone beep, repeated 4 times (since I have no other way to convey the beep, it goes diiiing doooong diiing doooong diing dooong diiing dooong).

I gathered from the online manual for the DX58SO motherboard that the said motherboard apparently only has 2 beep codes. One is three short beeps for memory failure (unless my memory fails me) and the other is stated as a Siren (which I presume refers to the beep I am hearing). The siren is supposed to mean CPU overheating.

The strange thing is that while diagnosing and trying to see if the problem could somehow be related to the GPU (I did not yet know what the beep codes on my mobo meant, and I had just gotten the new thermal paste and reseated heatsink etc.), I read that I should try monitoring temps and forcing fan speeds with MSI Afterburner. I installed it, set my fan speed to 90%, and configured the software to start minimized with Windows. After that, for the next 4 days, I used the computer without any problems. The computer is usually in heavy use for gaming sessions of 3-4 hours minimum and at times more than that. I run games like Helldivers, HOTS, DOTA, Mount and Blade etc. at maxed out settings. I routinely multitask on it, with several tabs in Chrome usually open while a game is running. I mostly run games in Borderless Fullscreen. None of this may be relevant; I am only pointing out that I had 4 days where I used my computer extensively, ran high load applications, and it worked like a charm. During this time there were no failed boot attempts and no freezes, crashes or slowdowns of any kind.

I continuously monitored my GPU and CPU temps by routinely alt-tabbing out of games while playing. My GPU was stable around 34C, and my CPU cores were all around 64-67C on one day and around 40-50C most other days. When I last checked today, my CPU was idling at 31C to 34C.

Last night, again while playing a game (HOTS this time), the freeze occurred and on startup I got the Thermal Event message again. After the problem occurred once last night and the computer finally booted, I played games on it for a good 3 hours without any problem. Today my computer booted normally when I switched it on in the afternoon, I gamed for about 2 hours, and then the computer froze again. (I may also mention that I notice no perceptible drop in framerate or performance before the freeze, and while playing the games there is no tearing or glitching on the screen.) When I could finally get it to Windows again, the computer froze when I was not even playing and was using Chrome to read up on the issue (I had quite a few tabs open - at least 10-12).

This is what I have done since:
1. Reset the BIOS to factory settings. (When the initial crash occurred I had been using the manufacturer-recommended updated BIOS.)
2. Done a clean reinstall of Windows 7.
3. Cleared the BIOS event log, as I had read that clearing the log had resolved the issue for some, though it makes no sense to me. (The computer booted twice normally after this, but the third time it was the same problem.)
4. Vacuumed the inside of my case thoroughly, making sure to vacuum all around the heatsink and the fan (several times).
5. Checked boot with different RAM sticks.
6. Checked all the clamps on the heatsink to see if any clamp was not properly attached, and opened and closed all the clamps to make sure they were tight.
7. Reset the CMOS battery - I took it out, left the system for 5 minutes, plugged it back in, booted, set the date and time etc.
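
To keep track of what the swaps and resets so far do and do not rule out, here is a rough sketch in Python. It rests on a big simplifying assumption (only one part is faulty, so a part that was swapped or reset while the fault persisted is treated as an unlikely cause), and the names COMPONENTS, STEPS and remaining_suspects are made up purely for illustration.

```python
# Simplifying assumption: a single faulty part. If the fault persists after a
# part is swapped or reset, that part is treated as an unlikely cause.
COMPONENTS = ["CPU", "motherboard", "PSU", "GPU", "RAM",
              "BIOS settings", "Windows install"]

# (step taken, part it exercises, did the erratic boot / freeze persist?)
STEPS = [
    ("spare GPU at the shop", "GPU", True),
    ("known good RAM sticks", "RAM", True),
    ("BIOS reset to factory settings", "BIOS settings", True),
    ("clean reinstall of Windows 7", "Windows install", True),
]


def remaining_suspects(components, steps):
    """Return the parts that no swap or reset has spoken for yet."""
    cleared = {part for _step, part, persisted in steps if persisted}
    return [part for part in components if part not in cleared]


if __name__ == "__main__":
    print(remaining_suspects(COMPONENTS, STEPS))
```

Under that assumption the list that comes out is the CPU, the motherboard and the PSU. It does not model heatsink seating, temperature sensors or intermittent faults, so it is only a way of organising the tests, not a diagnosis.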

I had read that one way of checking whether the processor was overheating, and of diagnosing what was causing it, was to let the computer idle in the BIOS while monitoring system temps and fan speeds. I was doing this today and the system powered off while idling in the BIOS!

When it crashed, my BIOS monitoring utility was giving the following read outs:
CPU Fan Speed - 1080
Aux Fan Speed - 0000
Front Fan Speed - 2340
Rear Fan Speed - 0540
Processor Thermal Margin - 46C
IOH Temperature - 47C
Motherboard Ambient Temperature - 35C
Voltage Regulator Temperature - 38C

V12.0 - 12.250V
V5.0 - 5.051V
V3.3 - 3.266V
V1.1 - 1.106V
Vccp (? I am unsure whether it said Vccp, Voop or something else) - 1.172V
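
As a rough sanity check on the three main rails, the readings above can be compared against the usual ATX tolerance of +/-5% for the 12V, 5V and 3.3V lines. This is only a sketch: the 5% figure is the standard ATX allowance and not something from the DX58SO manual, the names READINGS and rail_report are made up for illustration, and the 1.1V and Vccp values are left out because they are generated on the board and follow different limits.

```python
# Nominal rail voltage -> value read from the BIOS monitoring screen.
READINGS = {12.0: 12.250, 5.0: 5.051, 3.3: 3.266}

# Assumption: standard ATX allowance of +/-5% on the positive rails.
TOLERANCE = 0.05


def rail_report(readings, tolerance=TOLERANCE):
    """Map each nominal rail to (deviation in percent, within tolerance?)."""
    report = {}
    for nominal, measured in readings.items():
        deviation = (measured - nominal) / nominal
        report[nominal] = (round(deviation * 100, 2), abs(deviation) <= tolerance)
    return report


if __name__ == "__main__":
    for nominal, (percent, ok) in rail_report(READINGS).items():
        verdict = "within" if ok else "OUTSIDE"
        print(f"{nominal:>4.1f}V rail: {percent:+.2f}% ({verdict} +/-5%)")
```

All three rails come out within tolerance here, but these are idle readings taken in the BIOS. They say nothing about how the PSU behaves under gaming load or in the instant before a shutdown, so they do not clear the PSU on their own.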

Another strange thing is that today for the first time, while booting I heard the Three Short Beeps which is supposed to be for No Memory according to the mobo manual. This happened only once. It did not happen before and has not happened since. Normally the behaviour of the computer is exactly as stated above - the siren beep, if it gets to that at all.

At times, after the siren beep, the computer will boot. Sometimes it tells me there was a thermal event, other times it asks me if I want to boot Windows normally, and twice it started a scandisk check. If I get to Windows, then the computer will typically keep running fine without showing any signs of discomfort, and MSI Afterburner tells me that the temps are stable and in the range stated above. I have checked the inside of the case and there is no perceptible heat. There is no perceptible slowdown before the freeze (at least none that I noticed).

I have no real technical education in computers but have been using them extensively for nearly 2 decades so I have reasonable layman's familiarity with hardware and software and have delved deep within my case on many occasions. Can someone advise me how to isolate the issue any further? Are there any components that from my above steps can be considered to be safe (so I can focus my amateur diagnostic efforts on the other components)?

Finally - I wanted to test the rig after switching it to a new motherboard, but I was told that my processor is not compatible with any of the current gen mobos being marketed right now, and that I would need to upgrade the mobo and CPU together. I really distrust this advice. Could someone here chime in on this issue as well?

I have lurked on these forums before to resolve my hardware issues and have mostly benefited, as even if I don't get a direct answer to my query, the discussions in the threads are useful for further diagnostics. But I am at my wit's end here. I really don't want to start throwing money around to replace components, not only because I can only spend freely next month but also because I really want to figure out what is causing this and replace the right component, if any replacement is required.

My apologies if I have posted this in the wrong forum (the number of categories is quite overwhelming and I tried searching for old similar threads but they were all in different categories) and for the really long post (just wanted to provide whatever input I could initially). I would really appreciate some help here as the computer is pretty much the only stress relief I have (blowing up online people).

Many thanks,

The Goat from Space