[SOLVED] Server shuts down unexpectedly when running simulations

May 17, 2021
Hello,

We are using a server with two Xeon CPUs and one NVIDIA TITAN PASCAL GPU to run simulations.

Recently, the server has been shutting down unexpectedly during simulations. This happens at random: sometimes several times a day, while at other times it stays on for several days.

I am attaching a weird screenshot from HWiNFO, which I believe hints at an overheating issue:

View: https://imgur.com/a/Hxm2vVZ


The temperature readings are extremely high, yet very stable. I suspect they might not be real readings at all (for example, some kind of disabled sensors). Is this possible?

Any help is appreciated. I am trying to narrow the issue down to one of the following:
  1. GPU hardware malfunction/overheating
  2. CPU overheating
  3. Faulty sensors on the motherboard? (I doubt it)
  4. ...?
 
Solution
May 17, 2021
Given the 98C temperature shown for the mainboard, the shutdowns could simply be a response to crossing 100C, or whatever the thermal limit is.

Perhaps the chassis is unable to cope with the heat, if such issues only occur during heavy GPU tasking. Many servers equipped with actual GPUs need adequate airflow and/or internal shrouds to direct air out of the chassis, and many smaller chassis may not be equipped for that much heat dissipation.
Thank you for the swift reply.

I was guessing the same. It just seemed really weird to me that the temperature did not fluctuate at all, but rather stayed fixed at 98C.
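
To put a number on "did not fluctuate at all", a series of readings exported from the monitoring tool could be checked for a suspiciously flat range. This is a minimal sketch, not a definitive test: the function name `looks_stuck` and both thresholds (`min_samples`, `tolerance_c`) are assumptions chosen for illustration, and the readings are expected to cover both idle and loaded periods.

```python
def looks_stuck(readings, min_samples=30, tolerance_c=0.5):
    """Return True if the readings barely move across the whole series.

    readings    -- temperatures in degrees C, sampled across idle and load
    min_samples -- below this many samples, no conclusion is drawn
    tolerance_c -- maximum spread (max - min) still considered "flat"
    """
    if len(readings) < min_samples:
        return False  # not enough data to judge
    return max(readings) - min(readings) <= tolerance_c


# A sensor pinned at 98C for a full minute of one-second samples:
flat = [98.0] * 60
# A sensor that moves with load:
moving = [60 + (i % 10) for i in range(60)]

print(looks_stuck(flat))    # True
print(looks_stuck(moving))  # False
```

A flat result does not prove the sensor is fake, but a value that stays identical while the load changes is worth cross-checking against another source, such as the BIOS hardware monitor or the BMC if the board has one.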