Question My Awesome Vega 64 Mystery (HELP!!)

Status
Not open for further replies.

lottaphotos4

Distinguished
Aug 27, 2013
NOTE: This post is being written with my DP cable plugged into my MOBO DP output because my GPU keeps shutting down.
Please excuse the length. Been working at this for a while, and I want to get past the initial suggestions, narrow things down.

Guys, been trying to figure this out, and I'm about to lose my mind. Made a couple of posts as I've tried to debug this, but this is the latest and greatest as of 14FEB23.
I would really appreciate it if somebody can help me pinpoint the issue.

I have been seeing this issue for about 6 months, occasionally, as described below. Have had several Driver and Win10 updates in that time. (See below)

My system (built by me, DEC2019):
Intel i7-8700K (default, no overclocking, ever)
GPU Vega 64
MOBO ASUS HERO X Wifi (WIFI Disabled in Windows Settings, using hardline to Router)
WIN10
Corsair 750 HXI PSU
2x 8GB Corsair RAM
Corsair HI 151 Pro CPU Cooler
Multiple Case Fans
AMD Adrenalin Software for fan control

What I am seeing is occasional (random-ish) loss of signal from the GPU when I'm at IDLE, not under load.
System will boot up fine, GPU will work for a while.
If I go into a game (7 Days to Die for example), I can play without ANY signal loss for HOURS. Runs in-temp and at 144FPS.

Today, the moment I exited the game, the GPU lost signal.
Other days, if I boot the computer and just let it sit idle, doing nothing but displaying a static desktop, it will sometimes be fine for hours and other times not as long, but eventually it will lose signal.

When it loses signal there are no warnings, no sounds.
Only way to recover is to force a reboot (button or PSU switch).
System will usually reboot with GPU working, but it will stop giving signal again, usually fairly swiftly once it has happened once for the night.
When it reboots, it boots like nothing bad happened. No messages, no safe mode, etc.

What I have tried/checked/verified, etc:

1) It is not a power issue.
-This is a 750W PSU. I have a wattage meter on my UPS. This system NEVER pulls more than 460W, even with the GPU maxed out.
-I am using two separate PSU cables (not the Y-split thing).
-I have the cable plugs separated (non-adjacent) on the PSU.
-PSU is running in factory/default "multi-rail mode".
-I have experienced this with both the original cable and a new one from the Corsair PSU OEM box.
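
To put the numbers in point 1 in perspective, here is a minimal Python sketch of the headroom arithmetic. It assumes the 460W UPS reading is the peak draw at the wall. The 0.90 efficiency figure is purely an assumption for illustration, not a measured value, and the function name is hypothetical.

```python
def psu_headroom(rated_watts: float, wall_watts: float, efficiency: float = 0.90) -> dict:
    """Estimate PSU load from a wall-side wattage reading.

    efficiency is an ASSUMED conversion efficiency (wall AC -> DC output);
    it is not taken from any datasheet or measurement.
    """
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # The PSU rating refers to DC output, so scale the wall reading down.
    dc_watts = wall_watts * efficiency
    return {
        "dc_watts": dc_watts,
        "load_percent": 100.0 * dc_watts / rated_watts,
        "headroom_watts": rated_watts - dc_watts,
    }


if __name__ == "__main__":
    result = psu_headroom(rated_watts=750, wall_watts=460)
    for key, value in result.items():
        print(f"{key}: {value:.1f}")
```

Even if the efficiency is set to 1.0 (worst case for headroom), the sustained load stays well under the 750W rating. This only reflects what the meter can display; it says nothing about very short spikes the meter cannot capture.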

2) Fairly certain this is not a thermal issue.
-I have the Adrenalin overlay up, GPU temp never gets above 70C
-I have checked temps with HWMonitor also, nothing above the mid-80s on the GPU memory
-Most of the time, GPU temp reads in the mid-60s; the fan curve has the fans at 50% at 50C, 60% at 60C, and 70% at 65C
-Fans happy, nothing running "hot"
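
For reference, the fan curve from point 2 can be written down as a small Python sketch. The three (temperature, fan %) points are the ones listed above. Linear interpolation between points and clamping outside the range are assumptions made for illustration; the actual behavior of the Adrenalin software between points may differ. `FAN_CURVE` and `fan_percent` are hypothetical names.

```python
from bisect import bisect_right

# (GPU temp in C, fan speed in %) points taken from the post.
FAN_CURVE = [(50.0, 50.0), (60.0, 60.0), (65.0, 70.0)]


def fan_percent(temp_c: float, curve=FAN_CURVE) -> float:
    """Return the fan speed (%) for a given GPU temperature.

    ASSUMPTION: linear interpolation between curve points, and the
    first/last point's speed is held outside the defined range.
    """
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return curve[0][1]
    if temp_c >= temps[-1]:
        return curve[-1][1]
    # Find the segment [t0, t1) that contains temp_c.
    i = bisect_right(temps, temp_c)
    (t0, p0), (t1, p1) = curve[i - 1], curve[i]
    return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)


if __name__ == "__main__":
    for temp in (45, 55, 62.5, 70):
        print(f"{temp}C -> {fan_percent(temp):.1f}%")
```

Under these assumptions, a mid-60s reading already puts the fans near the top of the curve, which is consistent with nothing running "hot".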

3) Drivers
-Since last year, I have updated the AMD drivers several times, using the AMD website/Adrenalin Software
-Problems seem to have gotten worse since January update
-Tried reverting to the April 2021 drivers (when things were fine), still seeing the issue now
-Ran DDU (Display Driver Uninstaller) and THEN installed the April 2021 drivers tonight, problem persists
NOTE on drivers: I noticed that the January 2023 driver notes from AMD no longer list the Vega 64 under "units we test this update on", only 6600s, 6700s, etc.
Wonder if AMD just doesn't care about conflicts with Vega in their new drivers maybe?

4) Windows 10
-Yes, there have been several updates, including one in January 2023.
-Yes, there could be a conflict with AMD drivers.
Unfortunately, there is no option or way to remove the January update. When you highlight it, there is no UNINSTALL. Thank you Bill Gates.

I can run Furmark with no problems.
I can run Heaven Benchmark with no problems.

GPU Fan works when gaming.
Because of the issue, I've moved my DP cable from the GPU to the MOBO.
Right now, sounds like GPU fan is trying to start up, then stop over and over.
GPU "Radeon" light on front occasionally shuts off for 1/2 second, then turns back on.
While it's off, the GPU "Tach lights" go from one red light on to NO red lights, just a green one that flickers for 1/2 second just to the left of the Tach lights (in sync with the RADEON logo going off).

CPUID HWMonitor recognizes that the GPU is plugged in, but shows 0% for GPU use and memory (which makes sense since I'm not using it?)

So, here's where I'm at:

I don't think it's the PSU.
I don't think it's thermal.
I don't think it's cables.

I thought it might be drivers, but after re-installing after DDU, I don't think it's drivers...UNLESS the WIN10 update from January royally screwed the pooch.

From all I've been reading, I know that some users report an issue with the HBM, and suggest undervolting or changing the State settings.
Tried raising the State settings so that States 1-5 all use the default value FOR state 5 (1401). Problem persists.

Aside from a general "what the heck is going on???", I'd really like to know, with some certainty, what is going on.
If somebody can clearly explain why this is a GPU hardware issue (HBM or otherwise), fine. I'll go buy a new GPU.
BUT.....I'd rather not feel MORE like an idiot, so if it's drivers, or something else that I CAN fix, before spending a wad on a new GPU, only for the problem to persist, I'd really like to know THAT.

If you're still reading at this point, they should give you a Tom's Award (Maybe Glutton for Punishment 2023?).

I would really appreciate it if somebody can tie everything I just said to some sort of proof, or way to prove what's going on, so I can stop getting these damn drop-outs.

Thank you in advance. Really hope there's a short, simple "Oh, yeah, here's what's going on" kind of explanation so I can make the right move from here.
 

Aeacus

Titan
Ambassador
lottaphotos4 said:
Really hope there's a short, simple "Oh, yeah, here's what's going on" kind of explanation so I can make the right move from here.

Something to try:
  1. Try your GPU in a 2nd PC and see if the same issues appear there.
  2. Try a 2nd, known-working GPU (preferably Nvidia) in your system and see if the issues remain.

These two tests are needed to confirm or rule out the GPU as the source of the issue. Most likely it is the GPU, though there is a slim chance that the issue is with the CPU, MoBo or RAM.
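
The logic behind the two swap tests can be sketched as a small decision table in Python. This is only an illustration of the reasoning, not a diagnostic tool; the function name and the wording of each conclusion are hypothetical, and with an intermittent fault each test has to run long enough for the drop-out to have a chance to appear.

```python
def interpret_swap_tests(gpu_fails_in_other_pc: bool, other_gpu_fails_here: bool) -> str:
    """Map the outcome of the two swap tests to the most likely suspect.

    gpu_fails_in_other_pc: test 1, the suspect GPU loses signal in a 2nd PC.
    other_gpu_fails_here:  test 2, a known-working GPU loses signal in this PC.
    """
    if gpu_fails_in_other_pc and not other_gpu_fails_here:
        # The fault follows the card.
        return "GPU"
    if other_gpu_fails_here and not gpu_fails_in_other_pc:
        # The fault stays with the system.
        return "rest of system (CPU, MoBo, RAM, software)"
    if gpu_fails_in_other_pc and other_gpu_fails_here:
        return "inconclusive: more than one fault, or something shared by both tests"
    return "inconclusive: not reproduced, or an interaction between this GPU and this system"


if __name__ == "__main__":
    for gpu_elsewhere in (True, False):
        for other_here in (True, False):
            verdict = interpret_swap_tests(gpu_elsewhere, other_here)
            print(f"test1 fails={gpu_elsewhere}, test2 fails={other_here}: {verdict}")
```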
 

Aeacus

Titan
Ambassador
lottaphotos4 said:
Don't have another computer.
Don't have another graphics card.

In this case, haul your entire PC to a PC repair shop and pay them to diagnose the issue.

Btw, Radeon GPUs are known for poor drivers, which is why I'm using Nvidia and also suggest going with it. Intel Arc is also quite solid now that they have gotten rid of their driver issues.
 

lottaphotos4

Respectfully, both your replies
a) Missed my question
b) Were not at all helpful

Clearly if I'd HAD spare hardware I would have used it, so telling me to do so was only an attempt by you to show you had "an answer". When you didn't.

And telling me to bring my hardware to somebody is even more worthless, especially in regard to what I was asking.

You don't know, so don't reply. THAT would have been MORE helpful.

Thank you again.
 

Aeacus

Titan
Ambassador
lottaphotos4 said:
Respectfully, both your replies

I don't see any respect here, flinging downvotes like no other and completely opposing the few remaining steps that can be done.

lottaphotos4 said:
Clearly if I'd HAD spare hardware I would have used it

Fact is, you DID NOT tell us that you don't have additional hardware and cannot test it any further.

lottaphotos4 said:
And telling me to bring my hardware to somebody is even more worthless, especially in regard to what I was asking.

If you are not willing to take the necessary steps to fix your hardware, then this is on you. We cannot help if you do not want to be helped.

lottaphotos4 said:
You don't know, so don't reply. THAT would have been MORE helpful.

TH forums is not a place where you can dictate who can and cannot answer in your topic. It is a free forum, open to all. If you do not like the answers given, go somewhere else. End of discussion.
 

Rogue Leader

It's a trap!
Moderator
lottaphotos4 said:
Respectfully, both your replies
a) Missed my question
b) Were not at all helpful

Clearly if I'd HAD spare hardware I would have used it, so telling me to do so was only an attempt by you to show you had "an answer". When you didn't.

And telling me to bring my hardware to somebody is even more worthless, especially in regard to what I was asking.

You don't know, so don't reply. THAT would have been MORE helpful.

Thank you again.

@lottaphotos4 Let me be very clear here. Respect on this forum is compulsory. You are being entirely disrespectful to someone in another part of the world who is trying to help you FOR FREE in his free time, who is not in front of your computer and has no idea of your life situation, only the exact words on the screen. His suggestions were perfectly valid. That they do not work for you is not his fault, nor does it give you the right to disrespect him or anyone else.

This is your warning, be nice.
 

Rogue Leader

It's a trap!
Moderator
lottaphotos4 said:
Sorry Rogue.
He was NOT trying to help. That was trolling.
I've always been respectful.

As I'm being very clear here: HE owes me an apology.
As do you.

Let me be very clear here: you do not make the rules or make such determinations. I do.

Since you lack the self-awareness or humility to see where you are being abusive to those helping you, this thread is closed.
 