[SOLVED] Two different GPUs, same failure on my system. But the failure seems GPU-related... is this even possible?

Apr 11, 2021
4
0
10
So, I really, really need help. Let's tell the tale...

I ordered a pre-built PC last December. The store didn't have a GPU, so I bought one elsewhere. It arrived a little after the rig. All system specs are at the end of this message.
The GPU, an ASUS TUF RTX 3080 OC, seemed to work fine. Under stress, it had no issues. After about a month of use, I started my PC one morning as usual. The boot screen seemed normal. A couple of seconds into Windows, I saw artifacts: a small pattern of black rectangles that moved from one place to another on screen. Having some experience with failing GPUs, I decided to run a rendering test, the Heaven Benchmark. The rendering showed severe artifacts (color aberrations, unrecognizable shapes). I thought the GPU was dead, but after a restart, everything was normal again. Given the nature of the issue, I knew I'd see it again. I contacted the store. As they wouldn't be able to reproduce the issue (I couldn't either), they told me to record a video if it happened again. And sure enough, it happened again after some 40 days. In the meantime, I experienced some BSODs that seemed to be GPU-related. But once the artifacts themselves reappeared, I started having trouble even with simple things like YouTube. The degradation seemed to be real. The GPU was finally sent back to the store and is being tested.

Well... convinced that it was a hardware problem and that the store certainly doesn't have a new one to send, I bought another card from a different store. This time, a ROG Strix 3090 OC. It arrived some days ago, and I installed it last Wednesday. It worked flawlessly, until... this morning. Same story. I started the PC in the morning, and a couple of seconds into Windows, to my dismay, I saw some black rectangles...


Again, I ran the Heaven Benchmark. Again, severe artifacts. The pics don't quite do it justice, but you can get an idea. I have video if needed, though.



Finally the test stopped with these errors:


...

After a reboot, everything is normal again... until the next time this happens. And I'm counting on some sort of degradation in the near future.
So, it really looks like a GPU hardware failure to me. The artifacts closely resemble the ones you get with VRAM issues. But here's the thing: the exact same failure, on two different cards, different models, different GPUs? Naturally, I'm trying to find a cause that isn't the GPU's fault after all. But it's hard.

  1. Drivers: the cards used different driver versions. All driver installations were performed after DDU cleanups.
  2. System RAM: could the RAM produce this kind of artifact? The system worked very, very well during the time without any dedicated GPU. Also, if my RAM was bad, wouldn't I be seeing other issues? General instability etc.
  3. PSU: not sure. All the artifacts were observed with the system almost idle. In fact, in demanding applications, the whole system works very well. If the PSU was at fault, wouldn't it be more prone to fail as the load increases?
  4. PCI-Express: well... maybe. But the look of the artifacts argues against this. They don't look bus-related. Also, if the slot was bad, I think I'd know from other symptoms (CTDs while gaming, black screens, and different artifacts; and perhaps the failures would be more frequent).
  5. Monitors and cables: unlikely, I guess. A failure here wouldn't produce the messages I received in Heaven's error report.
  6. CPU: my best bet after the GPU itself. But if so, this is a bit weird. The system runs fine otherwise (i.e. without a dedicated GPU). Shouldn't I see some other kinds of errors as well?


So that's it. I'm pretty much lost and really need help.
Finally, system specs and important details:

CPU Intel i9-10850K at stock settings (no MCE, fine-tuned voltages because "auto" mode was giving me really high voltages. VCORE is at 1.21 V, VCCIO and System Agent at 1.28 V, a bit high, but it's stable).
MOBO ASUS ROG Strix Z490-F Gaming. Not sure about the BIOS version, but I know it was updated in late November or the first days of December last year.
RAM 4x8GB DDR4 Corsair Vengeance RGB PRO, 3600MHz, the B-Die version. Just default XMP applied in BIOS.
GPU ASUS ROG Strix RTX 3090 OC, white version. Former GPU was an ASUS TUF RTX 3080 OC.
SSD Corsair MP510 480GB for OS and MSFS 2020, HDD Western Digital Black 2TB for everything else.
PSU Corsair RM850x, 850W.
OS Windows 10 64-bit, 20H2, build 19042.867. VGA drivers were 461.09 for the 3080, now 465.89 for the 3090. All driver changes were performed after a DDU cleanup. MSI Afterburner is in use, no overclock, just a custom fan profile. The software was uninstalled and reinstalled when I changed GPUs.

Temperatures are fine. The 3090 tops out at ~68°C under stress, with the case closed. The CPU is cooled by an AIO, and under the AIDA64 stress test it reaches 54-57°C across the cores.


The simplest question: can I be so unlucky that I simply got a second GPU with a very similar hardware fault to the first one? I guess this theory can't be discarded...
But I'd be really afraid to RMA a second card just to see the same failure on a third one...

So please, help.


Thank you, and sorry for the long message.
 
InvalidError

Titan
Moderator
If you get the same problems between two completely different GPUs, then you have to consider the possibility that the problem is caused by something both GPUs had in common: the rest of the system.

Since you have a 10850k, you could try running Heaven on the IGP. If you get issues there too, then your GPU issue may actually be coming from a corrupt OS or flaky CPU/motherboard.

If the system came with Windows pre-installed, I'd do a re-install to rule out OEM bloatware or a corrupt drive image. My mother kept having issues with her Acer laptop until I uninstalled all of the Acer bloat that kept rolling back drivers to Acer's 1-2-year-old junk.
 
Hi, thanks. Exactly. I'm trying to think what in the rest of the system could cause this particular set of issues.
Oh, I ran Heaven on the IGP. Twice: when the PC arrived and a week ago, more or less, to see its performance. No issues. Just slow depending on the settings, as expected.

Can a corrupt OS cause this kind of stuff? Even if the GPU drivers are fresh? I mean, can the OS impair the GPU's ability to render this badly?
 

Karadjgne

Titan
Ambassador
Corrupt OS, yes. More likely it's a driver conflict between motherboard and gpu drivers.

I'd update the BIOS first. Windows just had a major version update, and that can make weird stuff happen with BIOS settings and compatibility that normally wouldn't.

Then I'd update the motherboard chipset drivers from the vendor website. Make sure audio and LAN especially are included, but I'd also make sure that PCIe and USB family are there too.

Then I'd run CCleaner and clean out any junk files, temp files etc., as they can contain all sorts of install files that are waiting to do something.

I'd also use the registry tool (default settings, say yes to backup) and clean out the dead ends, incorrect addressing, orphans and other miscellaneous bs that updates, installs and deletes always leave behind.

Open a CMD (admin) and run
dism /online /cleanup-image /restorehealth
to make sure the OS is not quirked.
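
If you'd rather script that repair step, here is a minimal sketch in Python. The DISM line is the one given above; the `sfc /scannow` follow-up is an added assumption (it is the usual companion check, not something prescribed in this thread), and `run_repairs` is just an illustrative name. It defaults to a dry run and needs an elevated (admin) prompt to do anything real.

```python
import subprocess
import sys

# The DISM line is the one quoted in the post.
# The SFC line is an added assumption, not part of the original advice.
REPAIR_COMMANDS = [
    ["dism", "/online", "/cleanup-image", "/restorehealth"],
    ["sfc", "/scannow"],
]


def run_repairs(dry_run=True):
    """Run each repair command in order and stop at the first failure.

    With dry_run=True the commands are only printed, never executed.
    Returns a list of (command string, exit code) tuples.
    """
    results = []
    for cmd in REPAIR_COMMANDS:
        line = " ".join(cmd)
        if dry_run:
            print("would run:", line)
            results.append((line, 0))
            continue
        code = subprocess.run(cmd).returncode
        results.append((line, code))
        if code != 0:
            # No point running the next check if this one failed.
            break
    return results


if __name__ == "__main__":
    # Pass --run from an elevated prompt on Windows to actually execute.
    run_repairs(dry_run="--run" not in sys.argv)
```

Running it without arguments only lists the commands, which is a safe way to check what it would do.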

Starting out with a clean, healthy PC, if it artifacts again, I'd be looking at hardware, but that could also be something as simple as a half-bust or cheap cable or a bad monitor connection, not necessarily the GPU itself.
 
Solution

InvalidError

Titan
Moderator
Can a corrupt OS cause this kind of stuff? Even if the GPU drivers are fresh? I mean, can the OS impair the GPU's ability to render this badly?
Almost anything is possible. Since you said the IGP was able to run everything fine, then that reduces the number of possibilities from "everything else in the system" to things that are different between using the IGP and GPU.
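
That narrowing-down can be sketched as a set difference. This is only an illustration: the component lists below are assumptions about a typical build, not an inventory of this PC.

```python
# Hypothetical component lists; adjust them to the actual build.
igp_path = {
    "CPU", "system RAM", "motherboard", "OS", "PSU",
    "Intel graphics driver",
}
dgpu_path = {
    "CPU", "system RAM", "motherboard", "OS", "PSU",
    "NVIDIA driver", "PCIe slot", "PCIe power cables", "graphics card",
}

# Anything that is only exercised when the dedicated card is in use.
suspects = dgpu_path - igp_path

for name in sorted(suspects):
    print(name)
```

Keep in mind that a shared component can still behave differently with the dedicated card installed (the PSU sees a very different load, for example), so the result is a starting list rather than a verdict.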
 

ttower

One of the easiest ways to diagnose the issue would be to put that artifacting GPU into an entirely different system. This would isolate the issue to the GPU only, if it followed the GPU to the new system. This is very expensive and impractical if you don't have a second system on hand, or a friend's system to test with.

Aside from this, and what was mentioned above, my vote would be a bad PSU. That specific PSU is a decent model, but that doesn't guarantee anything. In my experience, a bad PSU can cause issues and lasting damage, even when not at full load. I had a bad PSU at one point, and it would work fine under full load; nothing appeared wrong. But my GPU started to have issues. Not artifacts like yours, but crashes, BSODs, freezing, that kind of thing. It only ever happened while I was browsing the internet or using another low-load (non-game) application. The issue turned out to be the GPU being damaged by the PSU. I replaced the GPU under warranty, and after a while had similar but less severe issues with the new GPU as well. Eventually I replaced the PSU for an unrelated reason, and the issues disappeared immediately. In my case, the GPU was not permanently hurt, but that is not outside the realm of possibility.
 
Hi folks,
thanks for replying. More than helping, you are calming me down. It's fair to say I'm a bit distressed by all this.


So, from what I gather from your replies: perhaps the first step would be a fresh install of the OS? It's worth mentioning that the artifacts, on both GPUs, always appeared when I was already in Windows, never while booting or anything like that.
Which makes a case for something wrong within the OS that strangely arises only when I'm on the GPU, and not the IGP.
Considering this, would it be reasonable to hold off on any BIOS modifications for now? Just to work on this one step at a time. Perhaps flashing the BIOS would be the second step, if the first one fails.

Almost anything is possible. Since you said the IGP was able to run everything fine, then that reduces the number of possibilities from "everything else in the system" to things that are different between using the IGP and GPU.

Sorry, wish I knew more. Which things would that leave?
For now I can only think of the OS, the PCI-Express slot and controllers, the motherboard BIOS and perhaps the PSU (more on that below). If it's possible for the OS to produce these kinds of artifacts, that would now be my first bet.

One of the easiest ways to diagnose the issue would be to put that artifacting GPU into an entirely different system. This would isolate the issue to the GPU only, if it followed the GPU to the new system. This is very expensive and impractical if you don't have a second system on hand, or a friend's system to test with. Aside from this, and what was mentioned above, my vote would be a bad PSU.

I wish I could properly test my PSU. For now, I can only check the power delivery with HWMonitor. All voltage lines seem fine.
But, if I may speculate a bit: the artifacts resemble VRAM issues. The messages that appeared at the end of the Heaven Benchmark even mention memory issues and an inability to create textures (i.e. to properly access VRAM, maybe). If the VRAM itself isn't at fault, something could be feeding the GPU corrupt data, which could even be software-related. I can't rule out a faulty PSU, but I'm not sure a PSU problem would cause errors like these.
It's true that I know very little about all the possibilities, anyway, haha. Really didn't see all this mess coming...
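
On the HWMonitor readings: a minimal sketch for checking them against the ±5% tolerance that the ATX specification allows on the positive rails. The sample readings are made up for illustration, and software sensor readings can be off, so a pass here is weak evidence only.

```python
# Nominal ATX rail voltages; the spec allows +/-5% on these positive rails.
RAILS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}
TOLERANCE = 0.05


def check_rails(readings):
    """Return {rail: True/False}, True meaning the reading is within tolerance."""
    report = {}
    for rail, nominal in RAILS.items():
        low = nominal * (1 - TOLERANCE)
        high = nominal * (1 + TOLERANCE)
        report[rail] = low <= readings[rail] <= high
    return report


# Hypothetical readings typed in from a monitoring tool.
sample = {"+12V": 12.10, "+5V": 5.04, "+3.3V": 3.10}

for rail, ok in check_rails(sample).items():
    print(rail, "ok" if ok else "OUT OF RANGE")
```

In the made-up sample the 3.3 V rail falls below its lower limit of about 3.135 V, while the other two pass. A multimeter or a spare PSU is a much better test than software readings.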

Also, to rule out Windows as the cause, you can always run an Ubuntu live USB - https://itsfoss.com/create-live-usb-of-ubuntu-in-windows/

I suspect it's hardware anyway, but it would confirm

Just for the record, while I begin to perform the troubleshooting. If it's hardware, based on the kind of artifact we can see, what in the system do you think could be failing to give those errors? Aside from the GPU itself. Just so I have it in my mind.


Again, really appreciate all the help. I didn't get such help in many places, so... really, thank you all.
 

InvalidError

Titan
Moderator
For now I can only think of the OS, the PCI-Express slot and controllers, the motherboard BIOS and perhaps the PSU (more on that below). If it's possible for the OS to produce these kinds of artifacts, that would now be my first bet.
As I wrote earlier, if you got the OS pre-installed, it may have bloatware or bad config carried over from the OS image it was likely installed from. So re-installing Windows from a fresh up-to-date install image directly from Microsoft would be a reasonable thing to do IMO even if there weren't any initial issues. When I bought my first laptop, re-installing the OS from a clean source was the fourth thing I did after making sure everything appeared to be working out-of-the-box, making sure I had all of the necessary drivers ready to go and backed up the OEM crap I wanted to keep just in case.

As for the list of things that can possibly go wrong, you appear to have a reasonable grasp of the prime suspects. The easiest way to rule all of them out at once is as ttower wrote: try a whole other system if you can borrow one with a suitable PSU or do a temporary PSU swap in. If I had GPU issues in my PC, I'd probably start with swapping my GTX1050 with my Core2's HD5770, see if problems appear to follow the GPU.
 
Did you try another PCIe slot?

Looks to me like a faulty MB (or a faulty CPU; I don't know how PCIe is implemented on the Intel platform).

Haven't tried another slot. I'm afraid... it may not be possible. The GPU takes up too much space inside the case. The card itself would fit, but the power cables... well, I could try to force it, or test another device that uses the slot, but for now there's no point, unfortunately. See below:

The bad thing about these artifacts is that they are sporadic. It can take weeks until they show up again.
This makes troubleshooting a nightmare. Even if I managed to get access to another PC, I wouldn't be able to immediately rule out any cause for the artifacts, either in the GPU or in my own PC. Right now, for instance, everything is fine here, as if nothing ever happened...

That's why I was making some effort to identify places where a failure was more... unlikely to happen.
Unfortunately, the only starting point we have to investigate this is the nature of the artifacts: how and when they appear, what they look like, their general behaviour etc.

Anyway: I will reinstall Windows. Unfortunately, I may not be able to tell if the problem was solved for some time. Hopefully, if it isn't the OS, the artifacts will show up again quickly, so we can proceed to the next steps.


That's it for now. As always, appreciate all the support and inputs. :)
 

InvalidError

Titan
Moderator
The bad thing about these artifacts is that they are sporadic. It can take weeks until they show up again.
Those sorts of issues are definitely the worst. Even when you do fix them, you don't know for sure until you have exceeded your longest time to failure multiple times in a row.
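
To put a rough number on that, here is a minimal sketch. It assumes failures arrive at random with a constant average rate (an exponential model, which is an assumption and not something measured here) and takes the roughly 40-day gap from the first post as the mean time between failures.

```python
import math


def chance_still_broken(days_clean, mean_days_between_failures):
    """Probability of seeing no failure for `days_clean` days even though
    the fault is still present, under an exponential (memoryless) model."""
    return math.exp(-days_clean / mean_days_between_failures)


MEAN_GAP_DAYS = 40  # assumed from the ~40 days mentioned in the first post

for multiple in (1, 2, 3):
    days = MEAN_GAP_DAYS * multiple
    p = chance_still_broken(days, MEAN_GAP_DAYS)
    print(f"{days} clean days: {p:.1%} chance the fault is just hiding")
```

Under those assumptions, one clean 40-day stretch still leaves roughly a 37% chance that nothing was actually fixed, and three stretches in a row bring it down to about 5%.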

BTW, the only reason I'm blaming the software first is because you wrote you had the same issue with two different GPUs. The likelihood of that is pretty low. Otherwise, I'd agree that the artifacts look like some sort of VRAM addressing issue. With GPUs being so hard to come by at the moment, blame the easy stuff first.
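
As a back-of-the-envelope illustration of why two bad cards in a row is unlikely: if each card independently had some chance of carrying this fault, the chance of both having it is that rate squared. The per-card rates below are hypothetical placeholders, not real failure statistics.

```python
def chance_both_faulty(defect_rate):
    """Probability that two independently chosen cards both carry the fault."""
    return defect_rate ** 2


# Hypothetical per-card defect rates, for illustration only.
for rate in (0.01, 0.02, 0.05):
    both = chance_both_faulty(rate)
    print(f"{rate:.0%} per card -> {both:.4%} for two in a row")
```

The independence assumption is the weak point: if something in the system is damaging the cards, the two failures are not independent at all, which is another reason to check the shared parts first.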
 
