Question Radeon RX 5700 XT - endless TDR errors regardless of load

Oct 11, 2020
I recently got a new computer, specs below:
Processor: Intel Core i7-9700K
Motherboard: Asus Prime Z390-P
RAM: 16 GB Thermaltake Toughram DDR4-4000
Graphics: ASRock Radeon RX 5700 XT
PSU: 750 Watt Thermaltake Toughpower
OS: Windows 10 Pro

When I started it up, I almost immediately began getting full system freezes and the occasional heavily corrupted bluescreen. These seemed to happen almost regardless of system load; several times, they happened while it was just idling on the login screen. The crashes didn't always leave traces in the event log, but when they did, the cause was always a Video TDR Error. So, I treated it as a graphics card issue and did troubleshooting accordingly. Among other things, I tested the card in a PCI-E 3 slot (still broken), tested the PC with both a different video card and the onboard video (both worked fine), tested with different monitors and output ports (nope), reseated the card and the 12V cables, and updated Windows and the card's drivers. After doing all of this, I contacted support at the company that sold me the computer. They agreed it sounded like a defective graphics card and RMA'd it.

The replacement graphics card... immediately started having the same issues. It's a little hard to say anything conclusive about it, since its first system crash was during its initial setup, which corrupted both the AMD drivers and some Windows components. The crashes that followed were all over the place, presumably because stuff was real broken. (They included, at one point, a WHEA_UNCORRECTABLE_ERROR, which definitely had not popped up before.) After fixing all of that, I ran DDU (Display Driver Uninstaller) and reinstalled the drivers, which at least got it back to its old, more reliable TDR issues. (And, on one occasion, three minutes straight of attempting to restart amdkmdag.) At this point, I dug in a bit more, both treating every piece of hardware as suspect after that WHEA error and looking into some common complaints about the RX 5700 XT. So, I have since tried:
  • Stress-testing the CPU and RAM with Prime95 for six hours, and testing the RAM with MEMbench. Neither test turned up any issues. Since it will happily crash at idle loads within minutes, this inclines me to believe the CPU/RAM aren't directly responsible.
  • Adjusting the fan curve to force the card to run its fans while idling and allow itself to get up to 100% fan speed.
  • Logging the system's status up until one of the freezes with HWiNFO. In the last reading before the freeze, the GPU temp was 39°C and the 12V rail was at 11.808V. I believe that both of these are within the acceptable ranges, and all of the other temperatures look fine. The only voltage I'm unsure about is the Vcore, which was at 0.852V just before the crash. I really don't know what that should look like. The entire log is here if anybody is that curious about other stats.
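As a quick sanity check on the 12V reading above, here is a minimal Python sketch. It assumes the commonly cited ATX guideline of ±5% on the +12V rail; the function names are made up for illustration and have nothing to do with HWiNFO itself.

```python
# Minimal sketch: check a logged rail voltage against a tolerance band.
# Assumption: the +12 V rail is allowed +/-5% per the ATX design guide.

def allowed_range(nominal, tolerance=0.05):
    """Return the (low, high) voltage limits for a rail."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def within_tolerance(measured, nominal, tolerance=0.05):
    """Return True if the measured voltage is inside the allowed band."""
    low, high = allowed_range(nominal, tolerance)
    return low <= measured <= high


if __name__ == "__main__":
    reading = 11.808  # last 12 V reading logged before the freeze
    low, high = allowed_range(12.0)
    status = "OK" if within_tolerance(reading, 12.0) else "out of range"
    print(f"12V rail: {reading:.3f} V (allowed {low:.2f} to {high:.2f} V): {status}")
```

Under that assumption the band is roughly 11.40V to 12.60V, so 11.808V sits inside it. A single reading taken before the freeze cannot rule out a brief dip between samples, though.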
So, every individual component I can think to test or monitor looks okay. The system runs fine with the few other GPUs I have available. Updating the card's chipset firmware is one option, but as unstable as it's been, I'm afraid it might mess up and brick itself if I try. Apart from that, I'm kinda running out of ideas. Is there anything else I can try here, or am I going to have to just accept that this particular system hates this particular video card model?
 
Oct 11, 2020
Well, I may have resolved this. Several people with this issue had reported that they'd needed to underclock their RAM before the card was happy with its reliability, and this morning I found one who said that it happened even with memory that didn't show any errors in heavy Memtest86 testing. And... sure enough, after underclocking it, I've now run the computer for four and a half hours with no issues, including maxing it out for a while in FurMark and several hours sunk into multiple games. If it stays stable for a few days, I guess I'll have to tinker with the timings and see how much I can minimize the performance impact.
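For anyone weighing the speed-versus-timings trade-off mentioned above, here is a rough Python sketch of the usual first-word latency estimate (CAS latency times the clock period). The CL values and the lower data rate are hypothetical examples, not the settings of this kit or the ones used in this post.

```python
# Rough sketch: estimate DDR first-word latency in nanoseconds.
# DDR transfers data twice per clock, so one clock cycle lasts
# 2000 / data_rate_mts nanoseconds.

def first_word_latency_ns(cas_latency, data_rate_mts):
    """Approximate CAS latency in ns for a given CL and data rate (MT/s)."""
    return cas_latency * 2000.0 / data_rate_mts


if __name__ == "__main__":
    # Hypothetical timing pairs, chosen purely for illustration.
    examples = [(19, 4000), (16, 3200)]
    for cl, rate in examples:
        ns = first_word_latency_ns(cl, rate)
        print(f"DDR4-{rate} CL{cl}: {ns:.2f} ns")
```

The point is only that a lower data rate with tighter timings can land close to the original latency, although bandwidth still drops with the data rate.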