3rd Vega 56 causes PC hard shutdown during ETH mining

relo.cart

Dec 12, 2017
Hello all,

I desperately need help with the situation I have right now, since there is no logical explanation for what happened...

Here are my specs:

CPU: Ryzen 5 1600
RAM: Vengeance LED 16GB
Mobo: MSI X370 GAMING PRO CARBON
GPU: 3 x SAPPHIRE Radeon RX Vega 56 (2 in PCIe slots, 1 connected to a PCIe slot via riser cable)
PSU: Toughpower Grand 1200 Watt

I upgraded from my old Alienware X51 R2, so I cloned the OS (Windows 10 Home) onto the new SSD and then updated all drivers; everything worked fine for ETH mining.

Since the OEM Win10 license would expire due to the hardware change, I bought an OEM Win10 Pro license and did a fresh install of the OS with the Fall Creators Update 1709. Everything was running until I started ETH mining again; here are the details:

(1) The system recognizes all 3 GPUs, and driver installation went well.
(2) As soon as I start Claymore miner v10.00, within 3-6 minutes the PC has a hard shutdown. The weird part is that the RAM LED stays on, and I have to turn off the power for at least 5 seconds in order to turn the machine back on.
(3) The PC reboots fine, but sometimes the 3rd GPU goes missing and shows up under Other devices in Device Manager; I have to uninstall it from Other devices and reboot to get the OS to recognize it again.
(4) Another weird part is that once I reload the cloned OS copy (the OS I had from the Alienware, with or without FCU 1709), the problem is gone. But no matter which version of Windows (Home/Pro) I fresh install, mining causes a hard shutdown, without exception.
(5) I loaded the cloned OS again this morning and used the Crimson Balanced setting for mining; within 5 minutes the PC shut down again. However, with Custom mode, lowered voltage and an underclock, the machine seemed to run fine before I left for work.

Here are the mysteries:
(1) I am really not sure whether this is an OS issue or a hardware issue at this point, because the cloned old OS seems to run fine without any problem, while any fresh install of Windows 10 (Home or Pro, FCU 1709 or an older version) reproduces the shutdown issue.
(2) Removing the 3rd GPU seems to fix the problem on the freshly installed OS, but I noticed that after running for over 24 hours it shut down again (I have only tested this once).
(3) Not sure if it is an issue with the MSI motherboard: after installing the 3rd GPU I can't see the BIOS screen at startup anymore under Legacy Boot Mode; I can ONLY access the BIOS by turning on WHQL mode in the BIOS. (Has anybody seen a problem like this?)

Any insight will help, please!

Jonna
 
Actually, it sounds like the rail you have the GPUs on is overloading. There are two rails: the first is 40 amps and the second is 85 amps. If you overburden the 40 A rail, it will shut down. If the 85 A rail is weak, it will shut down. The way you describe it working OK when you underclock and undervolt indicates your problems are PSU related, or at least related to how everything is connected.

"The rail distribution on this unit has all the peripheral connectors on the 85A rail (12V2). Only the CPU connectors are on 12V1. "

Anyhow, it's not unusual for these cheaper units to fall short of their rated amps or to get weak.
The review for this PSU:
http://www.jonnyguru.com/modules.php?name=NDReviews&op=Story&reid=243
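To put numbers on the rail explanation, here is a minimal Python sketch comparing nominal load with the rated current of each 12 V rail, using the limits quoted above (40 A on 12V1 for the CPU, 85 A on 12V2 for everything else). The per-component wattages are assumptions, not measurements from this thread: about 210 W per Vega 56 (AMD's typical board power) and about 65 W for the Ryzen 5 1600 (its TDP), with all GPU power drawn from 12V2.

```python
# Sketch: nominal load vs. rated current on each 12 V rail of the
# Toughpower Grand 1200 W, using the rail limits quoted above.
# ASSUMPTIONS (not measured in the thread): ~210 W per Vega 56,
# ~65 W for the Ryzen 5 1600, and all GPU power drawn from 12V2.

RAIL_VOLTS = 12.0
RAIL_LIMITS_A = {"12V1": 40.0, "12V2": 85.0}


def rail_current(watts: float, volts: float = RAIL_VOLTS) -> float:
    """Current a load pulls from a rail: I = P / V."""
    return watts / volts


def rail_headroom(loads_w: dict[str, float]) -> dict[str, float]:
    """Spare amps per rail; negative means the rail is over its rating in this simple model."""
    return {rail: RAIL_LIMITS_A[rail] - rail_current(w) for rail, w in loads_w.items()}


loads = {"12V1": 65.0, "12V2": 3 * 210.0}  # CPU on 12V1, three GPUs on 12V2
headroom = rail_headroom(loads)
overloaded = [rail for rail, spare in headroom.items() if spare < 0]

for rail, spare in headroom.items():
    print(f"{rail}: {rail_current(loads[rail]):.1f} A used, {spare:.1f} A spare")
print("overloaded rails:", overloaded or "none")
```

With these assumed figures 12V2 carries about 52.5 A of its 85 A, so nominally nothing is overloaded. What this arithmetic cannot capture is a weak or faulty unit that trips well below its label.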

 


But why doesn't the old OS cause the problem? Until this morning, the old OS alone did not reproduce the issue. Does the OS have anything to do with power management for the PSU?

Here is an image of the motherboard I have:
[Image: msi-x370-gaming-pro-carbon.jpg]

The third card is connected via riser cable to the bottom PCIe slot (the black one); not sure if that is an issue here.

Thanks for your reply!! 😉
 
Probably not the old OS per se, but the power management implementation it was using was more frugal.

The key here is that when you under volt, everything is stable.

The part of your description where you mention the RAM LEDs staying on, etc., is perfectly in keeping with one rail (or one entire PSU, in multi-PSU setups) powering down while the motherboard carries on. Having to press the power button for five seconds to force a hard shutdown often happens when the OS or another function is still using the CPU, meaning that rail is still on and active while the other has shut down from being overburdened.

The load here is within spec of that power supply, but it's not a great power supply and could just be plain faulty.
 
With the freshly installed OS (both Home and Pro), even if I put everything at low voltage, the PC still shuts down, without exception...

Maybe my description of the shutdown was not clear, so here it is rephrased:
(1) All fans and motherboard LEDs go off
(2) Only the RAM LED stays on
(3) I have to turn off the switch on the PSU and wait at least 5 seconds; then the PC will turn back on, otherwise it won't.

Here is what I will try tonight:
(1) Swap the riser cable
(2) Swap the PSU connector cables
(3) Try another PSU (I can borrow an HX1000i)

Let me know if the above is okay to try. :)
 
Do you have any displays connected to the GPUs? I messed around with Eth mining on my RX 580 a while back, and I found that it would inevitably crash my computer. From what I remember, the fans would still be running, but there'd be no display and the PC would be completely unresponsive and I'd have to hard shut it down. After doing some research, I found a number of threads saying that AMD cards can end up crashing if they're mining without having anything plugged into their display connectors (in my case it was because I was turning off my monitor when leaving it to mine). I think I read that some miners would just buy these little plugs that fool the card into thinking it has a display connected.

Edit: examples of what I am talking about:
https://bitcointalk.org/index.php?topic=1957323.0
https://www.reddit.com/r/MoneroMining/comments/75ne6g/hdmi_dummy_plugs_for_vega/
 
In my case, everything was running fine for weeks before I reinstalled the OS.

It is not a freeze or BSOD: after mining for about 3-5 minutes, the machine goes off, the fans stop, the motherboard lights go off, only the RAM LED stays on, and I cannot turn the PC back on until I manually switch off the PSU and wait at least 5 seconds.

Thanks for the input!
 
Not very helpful, but I am aware that 1709 caused instability issues for many previously stable rigs.
Does the rig run stable at default GPU settings? You should be undervolting Vega anyway for mining. You can undervolt and still overclock.
 
With a fresh install of 1709, the default settings cause the shutdown within 5 minutes, with very consistent timing.

I bought an AX1200i PSU last night, and for some reason the I/O board of my Cooler Master Trooper case caught fire and burned my 1 TB SSD that was plugged into the XDock... This is a totally desperate situation and I do not know what to do anymore...

I plan to get a new case today, and I hope the PC runs fine then, after wasting more than $500 since last weekend...
 


I finally got the problem fixed, at great sacrifice.

As Mark stated, it was a PSU issue in the end. I returned my old PSU to the store and got a new AX1200i, but the first boot caused the I/O board on the Cooler Master case to catch fire and burnt the 1 TB SSD docked on it. It also shorted out the PSU...

Then I got an EVGA SuperNOVA 1200 P2 last night. It only comes with 6 single 8-pin PCIe cables, which is a disaster for cable management, but it turned out just fine, with no incidents overnight.

The problem before was that I used 2 cables, each with two 8-pin connectors, for GPU0 and GPU1, while using 2 single 8-pin cables for GPU2 (the one causing the shutdown). I guess there is an amperage difference between the 2 sets of cables, which freaks out the motherboard during GPU-intensive use and causes the emergency shutdown.
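The cabling difference described above can be made concrete with a rough per-cable current estimate. This is a minimal Python sketch under assumptions that do not come from the thread: about 210 W board power per Vega 56, two 8-pin plugs per card, about 75 W drawn through the PCIe slot with the rest split evenly across the two plugs, and one cable per card for GPU0 and GPU1. Each 8-pin PCIe connector is rated for 150 W.

```python
# Sketch: current carried by each PCIe power cable in the two layouts above.
# ASSUMPTIONS (not from the thread): ~210 W board power per Vega 56, two 8-pin
# plugs per card, ~75 W through the PCIe slot, remainder split evenly across
# the two plugs. Each 8-pin PCIe connector is rated for 150 W.
BOARD_POWER_W = 210.0
SLOT_POWER_W = 75.0
PLUGS_PER_CARD = 2
VOLTS = 12.0
PLUG_RATING_W = 150.0


def amps_per_plug() -> float:
    """Current through one 8-pin plug: non-slot power split evenly, I = P / V."""
    return (BOARD_POWER_W - SLOT_POWER_W) / PLUGS_PER_CARD / VOLTS


def amps_per_cable(plugs_on_cable: int) -> float:
    """A daisy-chained cable carries the current of every plug hanging off it."""
    return plugs_on_cable * amps_per_plug()


plug_w = amps_per_plug() * VOLTS
print(f"per plug: {plug_w:.1f} W of {PLUG_RATING_W:.0f} W rating")
print(f"GPU0/GPU1, one cable with two plugs: {amps_per_cable(2):.2f} A on that cable")
print(f"GPU2, two single-plug cables: {amps_per_cable(1):.2f} A on each cable")
```

Under these assumptions each plug stays well under its rating in both layouts; the daisy-chained cable simply carries twice the current through one run of wire. This only quantifies the imbalance the post guesses at; it does not show that the imbalance caused the shutdown.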

Thanks, everyone!
 
I gave up on this issue, because it just happened again.

I cannot update the system, because it somehow triggers a BCD error and any repair attempt is denied by the hard drive.

I cannot do a fresh install of Windows 10; the PC shuts down after mining for 1 hour...
 
If you wish to do a fresh install of Windows, do it with only the onboard video attached or get down to one discrete card and attempt that first.

Are you gaming on this unit or only mining? If only mining, then a Linux miner might be an alternative.
 
Finally problem resolved.

The whole problem has NOTHING to do with hardware at all. RIP my SSD after swapping to a bad PSU...

It is a SOFTWARE issue with AMD drivers in Windows 10. Here are the steps after reinstalling Windows or updating to FCU 1709:

1. Roll back or uninstall the following drivers in Device Manager:
-- "AMD SMBus": roll back to 1/18/2015
-- AMD GPIO Controller: roll back to 4/1/2017
2. Install the AMD chipset drivers: https://support.amd.com/en-us/download/chipset?os=Windows+10+-+64
3. Install the latest video card driver
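The rollback in step 1 boils down to comparing each device's driver date, as shown on the Driver tab in Device Manager, with the target date from the post. Here is a minimal checklist sketch in Python. It does not query Windows: the `installed` dates are hypothetical values typed in by hand, and the post's dates are assumed to be month/day/year. Steps 2 and 3 are ordinary installer runs and are not modeled.

```python
from datetime import date

# Target driver dates from step 1, read as month/day/year
# (an assumption: the post does not state the date format).
ROLLBACK_TARGETS = {
    "AMD SMBus": date(2015, 1, 18),
    "AMD GPIO Controller": date(2017, 4, 1),
}


def devices_needing_rollback(installed: dict[str, date]) -> list[str]:
    """Names of listed devices whose installed driver date is not the target date."""
    return sorted(
        name
        for name, target in ROLLBACK_TARGETS.items()
        if name in installed and installed[name] != target
    )


# Hypothetical driver dates, copied by hand from Device Manager.
installed = {
    "AMD SMBus": date(2017, 9, 29),
    "AMD GPIO Controller": date(2017, 4, 1),
}
print(devices_needing_rollback(installed))
```

Devices missing from `installed` are skipped rather than flagged, since the check only covers drivers you actually read off the machine.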

The problem is caused by poor power management in the default Windows chipset driver, which triggers the motherboard's power surge protection.

It has been 72 hours, and the PC has been running fine without any incidents at all. After updating the chipset driver, an RX Vega 56 can mine ETH/DCR at 37 MH/s.
 
FYI, you're probably better off using the AMD Beta Compute driver than the latest drivers. I get ~31 MH/s mining ETH (not dual mining) with that driver and an RX 580; I think I was only getting 27-28 or so with the latest Radeon Adrenalin release drivers (even with the compute optimization mode enabled in Crimson settings).

Edit: It just occurred to me that the beta driver may have come out before Vega was released, so I'm not 100% sure Vega is fully supported by it. Just suggesting it's worth a shot, based on results for Polaris cards.
 


I agree with this; the beta shows better performance on Polaris cards than the release drivers do (in compute mode).
With Vega, the switches that enable faster compute are part of the base driver package. I didn't see a change between driver versions, and the compute mode switch doesn't even show up anyway.

 


Thanks for your help, Mark! I hope this post can help others too!

 


I have tested the beta mining driver under FCU 1709 and did not see any major improvement in terms of hash rate. The major improvement I did see came from installing the chipset driver: with the same overclock settings, the hash rate jumped from 93 MH/s to 113 MH/s. Therefore I think it is due to voltage handling in AMD's motherboard (chipset) support, not so much the cards.
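A quick check of what the chipset-driver change amounts to, as a minimal Python sketch. It assumes that 93 and 113 MH/s are totals for the whole rig of three Vega 56 cards; the post does not say so explicitly.

```python
# Sketch: size of the hash-rate jump after the chipset-driver install.
# ASSUMPTION: 93 and 113 MH/s are rig totals across all three Vega 56 cards.
TOTAL_BEFORE_MHS = 93.0
TOTAL_AFTER_MHS = 113.0
CARDS = 3


def percent_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0


gain = percent_gain(TOTAL_BEFORE_MHS, TOTAL_AFTER_MHS)
per_card_before = TOTAL_BEFORE_MHS / CARDS
per_card_after = TOTAL_AFTER_MHS / CARDS

print(f"gain: {gain:.1f}%")
print(f"per card: {per_card_before:.1f} -> {per_card_after:.1f} MH/s")
```

That works out to roughly a 21.5% gain, or about 31.0 to 37.7 MH/s per card, which lines up with the 37 MH/s per RX Vega 56 reported in the solution post.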