Hello,
We are using a server with two Xeon CPUs and one NVIDIA TITAN PASCAL GPU to run simulations.
Recently, the server has been shutting down unexpectedly during simulations. This happens at random: sometimes several times a day, while at other times it stays on for several days.
I am attaching an odd screenshot from HWiNFO, which I believe hints at an overheating issue:
View: https://imgur.com/a/Hxm2vVZ
The temperature readings are extremely high, yet very stable. I suspect they might not be real readings (for example, values reported by disabled sensors). Is this possible?
Any help is appreciated. I am trying to narrow the issue down to one of the following:
- GPU hardware malfunction/overheating
- CPU overheating
- Faulty sensors on the motherboard? (I doubt it)
- ...?