NOTE: This post is being written with my DP cable plugged into my MOBO DP output because my GPU keeps shutting down.
Please excuse the length. Been working at this for a while, and I want to get past the initial suggestions, narrow things down.
Guys, been trying to figure this out, and I'm about to lose my mind. Made a couple of posts as I've tried to debug this, but this is the latest and greatest as of 14FEB23.
I would really appreciate it if somebody can help me pinpoint the issue.
I have been seeing this issue for about 6 months, occasionally, as described below. Have had several Driver and Win10 updates in that time. (See below)
My system (built by me, DEC2019):
Intel i7-8700K (default, no overclocking, ever)
GPU Vega 64
MOBO ASUS HERO X Wifi (WIFI Disabled in Windows Settings, using hardline to Router)
WIN10
Corsair HX750i PSU
2x 8GB Corsair RAM
Corsair HI 151 Pro CPU Cooler
Multiple Case Fans
AMD Adrenalin Software for fan control
What I am seeing is occasional (random-ish) loss of signal from the GPU when I'm at IDLE, not under load.
System will boot up fine, GPU will work for a while.
If I go into a game (7 Days to Die for example), I can play without ANY signal loss for HOURS. Runs in-temp and at 144FPS.
Today, the moment I exited the game, the GPU lost signal.
Other days, if I boot the computer and just let it sit idle, displaying nothing but the static desktop, sometimes it will be fine for hours and other times not as long, but eventually it will lose signal.
When it loses signal there are no warnings, no sounds.
Only way to recover is to force a reboot (button or PSU switch).
System will usually reboot with GPU working, but it will stop giving signal again, usually fairly swiftly once it has happened once for the night.
When it reboots, it boots like nothing bad happened. No messages, no safe mode, etc.
What I have tried/checked/verified, etc:
1) It is not a power issue.
-This is a 750W PSU. I have a wattage meter on my UPS. This system NEVER pulls more than 460W, even with the GPU maxed out.
-I am using two separate PSU cables (not the Y-split thing).
-I have the cable plugs separated (non-adjacent) on the PSU.
-PSU is running in factory/default "multi-rail mode".
-I have experienced this with both the original cable and a new one from the Corsair PSU OEM box.
2) Fairly certain this is not a thermal issue.
-I have the Adrenalin overlay up, GPU Temp never gets above 70C
-I have checked temps with HWMonitor also, nothing above the mid-80s on the GPU memory
-Most of the time, GPU Temp reads in the mid-60s; fan curve has the fans at 50% at 50C, 60% at 60C, and 70% at 65C
-Fans happy, nothing running "hot"
3) Drivers
-Since last year, I have updated the AMD drivers several times, using the AMD website/Adrenalin Software
-Problems seem to have gotten worse since the January update
-Tried reverting back to the April 2021 drivers (when things were fine), still seeing the issue now
-Ran DDU (Display Driver Uninstaller) and THEN installed the April 2021 drivers tonight, problem persists
NOTE on drivers: Noticed on the January2023 driver notes from AMD, they no longer list a Vega 64 under "units we test this update on", only 6600s, 6700s, etc.
Wonder if AMD just doesn't care about conflicts with Vega in their new drivers maybe?
4) Windows 10
-Yes, there have been several updates, including one in January 2023.
-Yes, there could be a conflict with AMD drivers.
Unfortunately, there is no option or way to remove the January update. When you highlight it, there is no UNINSTALL. Thank you Bill Gates.
I can run Furmark with no problems.
I can run Heaven Benchmark with no problems.
GPU Fan works when gaming.
Because of the issue, I've moved my DP cable from the GPU to the MOBO.
Right now, it sounds like the GPU fan is trying to start up, then stopping, over and over.
GPU "Radeon" light on front occasionally shuts off for 1/2 second, then turns back on.
While it's off, the GPU "Tach lights" go from one red light on to NO red lights, just a green one, just to the left of the Tach lights, that flickers for 1/2 second (in sync with the RADEON logo going off).
HWMonitor recognizes that the GPU is plugged in, but shows 0% for GPU Use and Memory (which makes sense since I'm not using it?)
So, here's where I'm at:
I don't think it's the PSU.
I don't think it's thermal.
I don't think it's cables.
I thought it might be drivers, but after re-installing after DDU, I don't think it's drivers...UNLESS the WIN10 update from January royally screwed the pooch.
From all I've been reading, I know that some users report an issue with the HBM, and suggest undervolting or changing the State settings.
Tried raising the State settings so that States 1-5 all use the default value FOR state 5 (1401). Problem persists.
Aside from a general "what the heck is going on???", I'd really like to know, with some certainty, what is going on.
If somebody can clearly explain why this is a GPU hardware issue (HBM or otherwise), fine. I'll go buy a new GPU.
BUT I'd rather not feel MORE like an idiot. If it's drivers, or something else that I CAN fix, I'd really like to know THAT before spending a wad on a new GPU, only for the problem to persist.
If you're still reading at this point, they should give you a Tom's Award (Maybe Glutton for Punishment 2023?).
I would really appreciate it if somebody can tie everything I just said to some sort of proof, or way to prove what's going on, so I can stop getting these damn drop-outs.
Thank you in advance. Really hope there's a short, simple "Oh, yeah, here's what's going on" kind of explanation so I can make the right move from here.