[SOLVED] AMD Vega Frontier: Bad Card or Driver Problem?

aTBkiLT

Commendable
Dec 13, 2016
3
0
1,520
System specs:
Windows 7
Intel 990x 1366 socket
24 GB RAM
1300 watt EVGA PSU (Brand new)
2 Nvidia 1080 TI cards
1 AMD Vega Frontier
Nothing overclocked
CPU and GPU are water cooled, temps well controlled

Nvidia driver 388.0

Windows up to date.

This machine is used primarily for distributed computing (BOINC). I recently upgraded an RX 480 to the Vega Frontier, which I got used. Before upgrading I had a 1080, 1080 Ti and the RX 480. That setup was 100% stable. When I added the Frontier card I also replaced the 1080 with a 1080 Ti. The Nvidia cards are 100% stable.

I have two problems which are probably related. First, I've used either DDU or the AMD driver removal tool before installing new drivers, and I've tried almost all of the drivers AMD has released in the last year. No driver shows Wattman controls, except for the current 18.Q4.1, which shows the fan control but nothing else. Third-party tools like OverdriveNTool and Afterburner show the controls for clocks etc. but don't allow me to change anything. OverdriveNTool gives a message saying the function isn't supported by the driver.

The main problem I'm having is that under compute load, the Vega Frontier locks up the system but usually doesn't give a BSOD. I've let the machine sit after a lockup for 8 - 9 hours without it giving a BSOD. The one time it did, the message said the driver was in an infinite loop. If audio is playing when the lockup happens, it continues to play normally until it's finished. The lockup may happen after a few minutes or a few hours. Once it happens, Windows won't see the card again unless I power the system down.

I have tried almost every driver released in the last year. I've tried swapping power cables from a 1080 Ti, which has been 100% stable. Since I'd have to tear into the loop to do more significant testing, like swapping to another PCIe slot, I was hoping to get an opinion on whether it's a Windows 7 problem, a driver problem specific to my setup (perhaps a conflict with the Nvidia cards), or a faulty card.
 

aTBkiLT



Agreed, it sort of looks like a power issue, but I have not run all three GPUs under load at once. I have run the Vega by itself under load, and although I don't remember exact numbers, the power draw is definitely less than both Nvidia cards running together. The PSU is hard to reach because of the way my case is laid out, so I tried swapping the power cables from a 1080 Ti to rule out a loose connection or bad cable. Same result: the 1080 Ti is stable and the Vega card locks up. A power problem also doesn't explain why Wattman doesn't show correctly.

I have not removed any card; that would be quite a chore since everything is water cooled. I feel pretty confident this isn't a PSU problem. I have considered pulling both Nvidia cards out and trying the Frontier by itself, but that's kind of a nuclear option of last resort. I'd probably put the 480 back in before doing that.

[Edit 1-27-19] I don't see a way to post a new comment without it being the answer, and I don't have an answer yet.

I decided to remove the Frontier and put the RX 480 back in to make sure the motherboard was OK. After several hours of running all GPUs under load, there were no system crashes and, just as importantly for distributed computing, no failed tasks. I'm considering my options. The most likely ones are creating a Linux boot disc and trying the Frontier with Linux drivers, or uninstalling the Nvidia driver to see if that's causing a weird problem. [/Edit]
 

aTBkiLT

Just in case anyone has a similar problem, I was finally able to get this resolved. I took the block off the card and thoroughly cleaned both the block and the PCB. I redid the paste and adjusted the thermal pads. I then tested it in two different systems. It failed in the first one, but I found later that that system did in fact have a PSU problem (a burnt pin in the PSU itself). It's working fine in the third system, so even though I didn't see anything obvious, there must have been a bit of corrosion or something similar causing the problem. As of now the card has been running under 100% load for over 72 hours with no issues.

I also finally figured out the Wattman voltage and clock speed control problem. Those features are only available in the gaming driver, and that's only available in Win10. The Pro drivers only have controls for the fan speed.
 