Question Threadripper 2950x ridiculous temp readings

Mar 22, 2019
10
0
10
I built a machine a while back with a Threadripper 2950X on an ASRock X399 Phantom Gaming 6 motherboard (firmware P1.10) with an Enermax Liqtech TR4 360 cooler, running Ubuntu Server 18.04. It has been very unstable. The temperature in the BIOS reads 70C immediately on boot, even after the machine has been powered off overnight. lm-sensors reads idle temps of 50-70C. I am doing machine learning on my graphics card, which uses about 280% CPU (100% is one fully loaded thread; maxing all threads reads 3200%), and the temperature reading in lm-sensors goes up to ~85C. I swapped the cooler out for a Corsair one and have the same problem. If I run the CPU up to 2700%, the CPU temp reads ~95C. Stopping the load instantly drops it to ~70C. Is it possible I have a faulty processor or motherboard? Has anyone else seen anything like this?
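For anyone puzzling over the load figures: top counts 100% per busy thread, so the numbers above translate to threads as in the rough sketch below. It assumes 32 hardware threads (16 cores with SMT on the 2950X); the function name is made up for illustration.

```python
def load_summary(top_cpu_percent, hardware_threads=32):
    """Convert a top-style %CPU figure into busy threads and overall utilisation.

    hardware_threads=32 is an assumption matching a 16-core/32-thread 2950X.
    """
    busy_threads = top_cpu_percent / 100.0
    utilisation = busy_threads / hardware_threads
    return busy_threads, utilisation


# The three load levels mentioned in the post.
for percent in (280, 2700, 3200):
    busy, util = load_summary(percent)
    print(f"{percent}% -> {busy:.1f} threads busy, {util:.1%} of the CPU")
```

So the machine learning job at 280% keeps fewer than three threads busy, which is under a tenth of the chip.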

Clarification: I am not overclocking
 
Neither Ryzen Master nor HWiNFO appears to run on Linux. It still seems hot under load, even allowing for a 20°C offset in the readings. I am taking the CPU temp from the k10temp-pci-00cb and k10temp-pci-00c3 readings. I'll put the bigger liquid cooler back in on Monday and see if those numbers make more sense.


Make sure you have the latest BIOS and install the AMD chipset drivers.
Download Ryzen Master or HWiNFO and check the temps with them.
Some motherboard utilities, and even the BIOS, could be adding 20°C to the actual CPU temps.
Also make sure you have installed the cooler correctly and applied thermal paste covering the whole lid.
 
Something's wrong. A max voltage at stock of under 1 volt? I've never seen that before.

If you are at 70C in the BIOS, something's majorly wrong. I think you are thermal throttling 24/7. Try re-installing the CPU block on the CPU.

I did that last night and replaced the cooler with a different one this morning. That reading was from lm-sensors. I just rebooted it and the BIOS reports 71C and 1.408V. I will try new thermal paste as well.
 
Also: I'm skeptical of the 70C reading because the bracket that holds the processor down is cool to the touch.
 
So I decided to stay late tonight and redo the cooler. I cleaned both the processor and heatsink surfaces with Windex, wiped them dry, and applied thermal paste, spreading it thin and smooth with a stiff zip tie. Immediately on power-up (straight into UEFI) the temp read 65C and seemed stable there. After booting to Ubuntu Server and running sensors, the temperature has dropped to 52C and is continuing to fall.
Here is the output of sensors:
Code:
$ sensors
k10temp-pci-00cb
Adapter: PCI adapter
temp1:        +51.5°C  (high = +70.0°C)

nct6779-isa-0290
Adapter: ISA adapter
Vcore:                  +0.40 V  (min =  +0.00 V, max =  +1.74 V)
in1:                    +1.07 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:                   +3.33 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:                  +3.31 V  (min =  +2.98 V, max =  +3.63 V)
in4:                    +1.82 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                    +0.82 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                    +1.21 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:                   +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:                   +3.26 V  (min =  +2.70 V, max =  +3.63 V)
in9:                    +0.00 V  (min =  +0.00 V, max =  +0.00 V)
in10:                   +0.83 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                   +0.93 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                   +1.68 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                   +0.93 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                   +0.85 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                  1184 RPM  (min =    0 RPM)
fan2:                  1296 RPM  (min =    0 RPM)
fan3:                  2854 RPM  (min =    0 RPM)
fan4:                     0 RPM  (min =    0 RPM)
fan5:                     0 RPM  (min =    0 RPM)
SYSTIN:                 +25.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:                 -62.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                +15.0°C    sensor = thermistor
AUXTIN1:                +35.0°C    sensor = thermistor
AUXTIN2:                +30.0°C    sensor = thermistor
AUXTIN3:                +34.0°C    sensor = thermistor
PCH_CHIP_CPU_MAX_TEMP:   +0.0°C
PCH_CHIP_TEMP:           +0.0°C
PCH_CPU_TEMP:            +0.0°C
PCH_MCH_TEMP:            +0.0°C
intrusion0:            ALARM
intrusion1:            ALARM
beep_enable:           disabled

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +51.5°C  (high = +70.0°C)
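
To pull just the temperatures out of a dump like the one above, something along these lines should work. It is a minimal sketch that parses the plain-text layout shown here (chip name on its own line, then "label: value°C" lines); the regular expressions and function name are my own and may need adjusting for other lm-sensors versions.

```python
import re

# Matches a chip header such as "k10temp-pci-00cb" (no colon, no spaces).
CHIP_RE = re.compile(r"^(?P<chip>[\w.]+-[\w-]+)$")
# Matches a temperature line such as "temp1:        +51.5°C  (high = +70.0°C)".
TEMP_RE = re.compile(r"^(?P<label>[^:]+):\s+(?P<value>[+-]?\d+(?:\.\d+)?)°C")


def parse_temps(sensors_text):
    """Return {chip: {label: degrees_C}} from plain `sensors` output."""
    temps = {}
    chip = None
    for line in sensors_text.splitlines():
        line = line.strip()
        header = CHIP_RE.match(line)
        if header:
            chip = header.group("chip")
            temps[chip] = {}
            continue
        reading = TEMP_RE.match(line)
        if reading and chip is not None:
            temps[chip][reading.group("label")] = float(reading.group("value"))
    return temps


# A few lines copied from the output in this post.
SAMPLE = """\
k10temp-pci-00cb
Adapter: PCI adapter
temp1:        +51.5°C  (high = +70.0°C)

nct6779-isa-0290
Adapter: ISA adapter
Vcore:                  +0.40 V  (min =  +0.00 V, max =  +1.74 V)
CPUTIN:                 -62.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
"""

print(parse_temps(SAMPLE))
```

Voltage and fan lines are skipped because they do not end in °C. A reading like CPUTIN at -62.5°C is probably an input with nothing wired to it on this board, so it is worth filtering out obviously impossible values before trusting any of the nct6779 temperatures.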
 
I reinstalled the Enermax Liqtech 360 and gave the machine to our IT department. They cleaned the cooler and CPU with isopropyl alcohol (apparently Windex is bad for CPU cleaning!). The idle temp still reads about 50°C, but under heavy load (3200% CPU, or 100% on all threads), the CPU stabilized at ~85°C and the computer did not become unresponsive. We now suspect the cause was the nightly TensorFlow build I was using (at the time of install, mainline TensorFlow did not support CUDA 10). Updating to mainline TensorFlow 1.13.1 seems to have fixed the instability; however, I have been doing less model training lately. I will update this if the instability shows up again.
 
So the instability was not resolved by the update: it just locked up during a training session. I know this is not the place to get help with software stability issues, but I will post the latest system stats I had running, for reference in case anybody else runs into this issue.

Sensors:
Code:
$ watch -n 0.5 -c -d sensors

Every 0.5s: sensors                   machinelearing: Mon Apr  1 17:41:44 2019

k10temp-pci-00cb
Adapter: PCI adapter
temp1:        +66.2°C  (high = +70.0°C)

nct6779-isa-0290
Adapter: ISA adapter
Vcore:                  +0.70 V  (min =  +0.00 V, max =  +1.74 V)
in1:                    +1.07 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:                   +3.31 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:                  +3.31 V  (min =  +2.98 V, max =  +3.63 V)
in4:                    +1.82 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                    +0.82 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                    +1.21 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:                   +3.44 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:                   +3.26 V  (min =  +2.70 V, max =  +3.63 V)
in9:                    +0.00 V  (min =  +0.00 V, max =  +0.00 V)
in10:                   +0.78 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                   +0.80 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                   +1.67 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                   +0.92 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                   +0.73 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                  1173 RPM  (min =    0 RPM)
fan2:                  1290 RPM  (min =    0 RPM)
fan3:                  2848 RPM  (min =    0 RPM)
fan4:                  1264 RPM  (min =    0 RPM)
fan5:                  1259 RPM  (min =    0 RPM)
SYSTIN:                 +27.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:                 -62.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                +15.0°C    sensor = thermistor
AUXTIN1:                +38.0°C    sensor = thermistor
AUXTIN2:                +37.0°C    sensor = thermistor
AUXTIN3:                +41.0°C    sensor = thermistor
PCH_CHIP_CPU_MAX_TEMP:   +0.0°C
PCH_CHIP_TEMP:           +0.0°C
PCH_CPU_TEMP:            +0.0°C
PCH_MCH_TEMP:            +0.0°C
intrusion0:            ALARM
intrusion1:            ALARM
beep_enable:           disabled

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +66.2°C  (high = +70.0°C)

NVIDIA-SMI:
Code:
$ watch -n 0.5 -c -d nvidia-smi
Every 0.5s: nvidia-smi                                                       machinelearing: Mon Apr  1 17:41:44 2019

Mon Apr  1 17:41:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:41:00.0 Off |                  N/A |
| 76%   51C    P2    67W / 300W |  10870MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     96278      C   python3                                    10859MiB |
+-----------------------------------------------------------------------------+
Top:
Code:
$ top -d 0.5
top - 17:41:44 up 3 days,  3:55,  6 users,  load average: 3.17, 2.09, 2.06
Tasks: 406 total,   2 running, 229 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us,  0.2 sy,  0.0 ni, 96.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :   62.713 total,   33.439 free,    4.864 used,   24.411 buff/cache
GiB Swap:    0.000 total,    0.000 free,    0.000 used.   56.620 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 96278 root      20   0 37.445g 4.536g 1.085g R 100.0  7.2 191:25.45 python3
 95106 rcreng    20   0 2576.9m 490.7m  70.4m S   1.9  0.8   0:20.22 tensorboard
  1291 root      20   0 5034.8m  68.0m  38.1m S   0.0  0.1   2:00.91 dockerd
 96225 rcreng    20   0 4693.1m  48.7m  29.4m S   0.0  0.1   0:00.34 docker
  1240 root      20   0 5192.3m  38.9m  24.9m S   0.0  0.1   5:06.30 containerd
 53196 rcreng    20   0  364.0m  23.1m  16.2m S   0.0  0.0   0:27.33 smbd
   694 root      19  -1  100.9m  23.1m  22.1m S   0.0  0.0   0:00.97 systemd-journal
  1760 root      20   0  347.1m  20.1m  17.5m S   0.0  0.0   0:00.94 smbd
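
Screens left running under watch only show whatever was on display at the moment of the lockup. If you want a history leading up to it, one option is to append each sample to a file and force it to disk straight away. The following is only a sketch: the helper name, log format, and sample strings are invented for the example, and in practice the text would come from sensors or nvidia-smi.

```python
import os
import tempfile
import time


def append_sample(path, text):
    """Append one timestamped line and force it to disk before returning."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {text}\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write it to the disk


# Demo: write two samples to a throwaway file and read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "sensors.log")
append_sample(demo_path, "Tctl=66.2C")
append_sample(demo_path, "Tctl=66.4C")
with open(demo_path, encoding="utf-8") as log:
    lines = log.read().splitlines()
print(lines)
```

The fsync call costs some throughput, but at one sample every half second that should not matter, and it makes it more likely that the last line in the file is close to the moment of the hang.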
 
With the continued crashes, I gave in and installed Windows on the computer in the hope of a helpful bluescreen. The temps in Ryzen Master look perfectly normal, idling just below 30C (HWiNFO shows this as Tdie; Ryzen Master shows it as the only temperature). HWiNFO shows Tctl in the lower 50s C at idle, consistent with the temps reported by the k10temp kernel module on Linux. Thank you all for the help!
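
For anyone who lands here with the same readings: Tctl is the control temperature the chip reports for fan control, and on some Ryzen and Threadripper parts it sits a fixed offset above the die temperature (Tdie). A minimal sketch of the correction follows. The 27°C default is an assumption for a 2nd-gen Threadripper, roughly in line with the low-50s versus just-under-30 idle readings above, so verify it with a tool that shows both values.

```python
def tdie_from_tctl(tctl_c, offset_c=27.0):
    """Estimate die temperature from a Tctl reading.

    offset_c=27.0 is an assumed offset for a 2nd-gen Threadripper; check it
    against a tool that shows both Tctl and Tdie on your own system.
    """
    return tctl_c - offset_c


# Tctl readings reported earlier in this thread (idle, training, heavy load).
for tctl in (51.5, 66.2, 85.0, 95.0):
    print(f"Tctl {tctl:.1f} C -> Tdie about {tdie_from_tctl(tctl):.1f} C")
```

With that correction the idle and load numbers from lm-sensors look a lot less alarming, though the corrected figure at 2700% load is still fairly warm.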