Question Asus RTX3090 EKWB - difficult diag

Jun 6, 2023
5
1
15
OK here’s a head scratcher for the fellow HW geeks.

Experienced system integrator here with a head scratching int fault on my otherwise
rock solid 5y reliable rig. All I changed were the graphics cards.

Rig specs:
ASUS ROG Rampage V Extreme
EVGA Supernova 1600 P2
Corsair Dominator DDR 4, 4*16
Core i7 6850k w/ EK plexi top copper block
2* Samsung Evo 850
3* Seagate 2TB spinners
Dual custom loops.

Outgoing GFX - 3* EVGA 980Ti 6G Hydro Copper

Incoming GFX - 2* ASUS 3090 EKWB 24G, (nVidia brand 3090 SLI bridge)

The Rampage V implements four SMD PCIe device-presence LEDs next to the 4-switch DIP bank for PCIe slot control.

On first boot there was no POST, with a legacy option ROM code on the Q-code display, but the machine had been unplugged from the mains for 4 weeks, so I cleared CMOS 1 and we got a successful POST.

Top card in slot 1 - no presence LED. Cleared CMOS once more, booted to Windows (display on the 2nd card), verified operation of the card in the "4th" (physically 3rd) slot, shut down, fitted the missing SLI bridge, and got the presence LED on slot 1. Happy days - all operational and performing.

Next day, during a game title launch, the machine halted dead and CPU idle activity ceased. A hard reset and the machine booted. I did not note the presence LED state, but the nVidia driver refused to load either automatically or manually. A clean driver reinstall fixed that, but I noticed the SLI options were gone from the nVidia control panel and the second GPU was gone from the ASUS GPU Tweak tool. Prompted by this, I checked the presence LED on slot 1 - it was out again.

Obviously, completely removing and reseating the card, or swapping cards, is difficult without a teardown, and it will also result in a lot of coolant wastage. Today I will partially reseat the card a few times, as much as the coolant pipe flex allows, and see if I can reproduce the intermittency - or even operation at this point. The PCIe slots had the EVGA 980s seated for 5 years, never removed. The slots and edge connectors on the 3090s were visually in pristine, clean condition when checked during install. The machine is fully filtered with marginal dust; serviced about every 6 months.

Any in-situ diag ideas other than the usual PSU volt checks and swaptronics? (PSU volts look good - in spec.)

The cards were second hand but in excellent physical condition. I'm starting to get suspicious here, as the seller listed several of these cards at the same time - six or more. I wonder if they were abused in a mining rig, or if the seller somehow got their hands on a stack of working-but-RMA cards and sold them off, gambling that buyers would not find or recognise an intermittent fault or a cosmetic fault like bad ARGB bus traces. There was coloured coolant residue in the blocks, but they were *surprisingly* clean. The seller is being evasive on the card-history question (asked "where did these come from" twice with no answer, though they replied to the rest of the query) and maintains they should not be faulty. I note 2/3 of the ARGB diodes on the "good" card are dead - which does nothing for my suspicion. If it comes to swaptronics I'll get to that in a week or so. Anyone else seen symptoms like the above? Dodgy water-cooled parts in hard-tube loops are a right pain due to the integration labour and the inability to test them in a rig without a water loop.
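One cheap in-situ idea: log exactly when the second GPU drops off the bus, so the dropout can be correlated with temps, load, or uptime. Below is a minimal polling sketch, assuming `nvidia-smi` is on PATH (the `--query-gpu` CSV flags are standard); the 5-second interval is an arbitrary choice, not anything from the thread.

```python
# Poll nvidia-smi and timestamp any change in the set of visible GPUs.
# Sketch only: assumes `nvidia-smi` is installed and on PATH.
import subprocess
from datetime import datetime


def parse_gpu_list(csv_text: str) -> list[str]:
    """Parse `nvidia-smi --query-gpu=index,name --format=csv,noheader` output
    into one "index, name" string per visible GPU."""
    return [line.strip() for line in csv_text.splitlines() if line.strip()]


def query_gpus() -> list[str]:
    """Ask the driver which GPUs are currently visible."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout
    return parse_gpu_list(out)


def watch(poll_fn=query_gpus, interval_s: float = 5.0) -> None:
    """Print a timestamped line every time the visible GPU set changes."""
    import time
    last = None
    while True:
        gpus = poll_fn()
        if gpus != last:
            print(f"{datetime.now().isoformat()}  visible GPUs: {gpus}")
            last = gpus
        time.sleep(interval_s)
```

Leave `watch()` running in a console; the timestamp of the last "visible GPUs" change before a halt tells you whether the card vanished before the crash or with it.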
 
Last edited:

Lutfij

Titan
Moderator
Welcome to the forums, newcomer!

EVGA Supernova 1600 P2
How old is the PSU in your build?

You might want to pass on a picture of your build/innards. I'm wondering whether heat could cause the GPU to warp and break a connection with the pins in the PCIe slot. It could also be a BIOS issue.

Are these the cards you own?
 
Jun 6, 2023
Thanks for the reply! The PSU is about 5 years old, but it's been a solid performer and the weekly duty averages about 20%. That link is indeed the cards I have. I've checked and monitored the rail voltages - they're in spec for ATX. I've yet to do a detailed ripple check, but I'd expect ripple bad enough to cause one functional symptom to cause more than one.
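For reference when eyeballing multimeter readings: the ATX12V guide allows roughly ±5% regulation on the +12 V, +5 V, and +3.3 V rails. A trivial tolerance check, with hypothetical example readings (not my actual measurements):

```python
# ATX rail tolerance check: the ATX12V design guide specifies roughly
# +/-5% regulation on the +12V, +5V, and +3.3V rails under load.
TOLERANCE = 0.05  # 5%


def in_spec(nominal: float, measured: float, tol: float = TOLERANCE) -> bool:
    """True if a measured rail voltage is within tolerance of its nominal."""
    return abs(measured - nominal) <= nominal * tol


# Hypothetical example readings, not measurements from this rig:
readings = {12.0: 11.87, 5.0: 5.06, 3.3: 3.28}
for nominal, measured in readings.items():
    verdict = "OK" if in_spec(nominal, measured) else "OUT OF SPEC"
    print(f"{nominal:>5} V rail: {measured} V -> {verdict}")
```

A rail can sit in spec at idle and still sag or ripple out of spec under transient load, which is why a static check alone doesn't clear the PSU.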

I'll post a pic of the innards a bit later on, but in-case temps are always reasonable (c. 40°C, up to 50 peak). The climate here is moderate and the machine has always run in air conditioning. The vented plates on the 980s meant there was still airflow over the cards. The backing plate and EK block on the Hydro Copper cards are very rigid; I checked the PCBs with a straight edge - perfect. Same with the new ASUS cards - less rigid, but a perfectly flat plane. The cooling loops are both tuned for low temp over low noise. The EVGA card blocks covered the RAM and regulators, not just the GPU. The ASUS cards appear to have a similar thermal design, albeit with clear signs of value engineering.

I can't see any evidence of deformation of the slot plastic or mobo; however, I did not refit the mobo during this upgrade, so I did not check it on the bench or eyeball the joints on the back of the board. The mobo is "quirky", but that's not unexpected for one this complex - e.g. features like BIOS Flashback don't operate as documented - and I have always found ASUS docs to be less than accurate, with hardcore features not fully bottomed out in development. I updated the BIOS while investigating a historic issue with the ASUS power control driver vs W10. The issue persisted until I deprecated and removed the ASUS power driver. Based on that, I restored the factory BIOS version and there has never been an issue since (the symptom was precisely 5 minutes of OS uptime before a BSOD logging a driver error at every boot).

So next steps:

In place reseating of both cards

Update mobo back to the later bios

Switch to CMOS 2, replicate config and try to reproduce

Post a pic for you all to interrogate

I do have a spare 5-year-old but unused 1200 P2, so I can swap that in if it comes to it. Power requirements will still be within spec for that PSU.
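A rough sanity check on that claim, using typical published board-power figures (RTX 3090 ≈ 350 W board power each, i7-6850K ≈ 140 W TDP; the remaining line items are my estimates, not measurements from this rig):

```python
# Rough worst-case DC power budget vs the spare 1200 W unit.
# Component figures are typical published board-power/TDP numbers,
# not measurements from this machine.
loads_w = {
    "2x RTX 3090 (350 W board power each)": 2 * 350,
    "i7-6850K (140 W TDP, with OC headroom)": 180,
    "mobo/RAM/drives/pumps/fans (estimate)": 120,
}
total = sum(loads_w.values())
psu_w = 1200
print(f"Estimated worst case: {total} W of {psu_w} W ({total / psu_w:.0%} load)")
```

That puts the steady-state worst case around 1000 W, leaving roughly 17% headroom on the 1200 P2 - workable, though Ampere cards are known for large transient power spikes, so steady-state headroom understates peak demand.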

EDIT: Attached is a pic of the machine core. As you can see, we're in a Phanteks Enthoo Pro and it's pretty tight in here, but not overcrowded. There's positive pressure in the case when it's closed up, and things get warm but not hot; temps are consistent across the interior (so no nasty gradients that might drive warpage).
kTLFJtR.jpg
 


Update: I checked a few things out last night. With no mechanical interference between boots, the card went from not detected to detected and working fine. 30 minutes after boot, at the Windows desktop idle (idle temps all round, everything still quite cool), the card dropped out, the slot 1 LED extinguished, and the machine reset.

So I opened up and rechecked volts - all good (I was able to get a needle probe past the jacks and probe the joints on the card). I gave both cards the best "wiggle/reseat tune" I could without breaking into the cooling lines - I was able to extract each card from its slot by about 10 mm.

At the same time, I decided to clear and try booting from BIOS 1, configuring it up to see if there was any difference. Interesting - BIOS 1 is dead: 00 perpetually on the Q-code, no bootstrapping activity at all. Flashback from 2 to 1 results in alternating 1/2 LEDs and the process never starts. Switching back to BIOS 2, we get a POST (cleared, so it lost the boot device parameters and failed to boot, but that's OK - we got a POST). Back on BIOS 2, once I confirmed it was POSTing OK, I noted the slot 1 indicator was back on, and the power had not been off long enough for anything to cool down.

So my suspicion is now split between the first 3090 and the BIOS/slot, or possibly the VRM. ASUS didn't do nice things with the design of the VRM - I feel the extra cooling they integrated was more a mitigation than a nice feature.

The next step can't be anything more than testing the result of the reseat, then draining the loop and swapping the cards around - that will tell me what I need to know at a macro component level. The Rampage V can still be had in a few places, it seems, so I wouldn't need to do a full-scale repower if the conclusion is that the MB is beyond recovery.
 
Jun 6, 2023
Further progress on diags…

The in-place reseat has not changed anything. Post-reseat, both cards were detected and operated in POST and Windows. I even got 30 minutes of gameplay in before the machine halted. So at this stage it's not a seating issue.

BIOS - I verified the version, and it seems at some stage I did flash the latest version on. I found 4 BIOS images in my repository, so I've clearly had some fun with this in the past. So I won't reflash.

The last step is today: drain the loop enough to remove and swap the cards around. Once I've reproduced the fault on slot 1 once more, disproving that the graphics card is at fault, I will replace the Rampage V.

Lutfij - you immediately zeroed in on the EVGA PSU. I'm curious why, and whether you're aware of anything I should be checking more closely. All rail voltages are in spec at the moment of observation - do you think I should be logging over time and looking for drift, particular ripples/noise, etc.? The 1600 P2 was difficult to get when I bought it. In NZ the typical maximum available at the time was 1200 W, and the 1600 disappeared from the market not long after I got it. I have always wondered why, when every other EVGA PSU was still available, this particular one dropped out. The 1600 is a little "odd" - it has a 16 A (at 240 V) IEC inlet, and is rated to draw 16 A from 120 V (8 A from 240 V, as is the case here). EVGA used the same cable for the 240 V market as for the 120 V market, so it's a fat cable trunked down to a standard 240 V plug, which looks not-so-elegantly engineered but isn't an issue in itself. The PSU is also physically longer than any I've seen, but "far too much rather than not enough" to the extreme end is my PSU selection MO. It's rated to deliver 130 A on the +12 V rail, which I was always enamoured by.
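Those ratings are actually self-consistent, which takes a little of the "odd" out of it - 16 A at 120 V and 8 A at 240 V are the same apparent input power, and 130 A on +12 V is nearly the whole 1600 W rating on one rail:

```python
# Back-of-envelope on the 1600 P2's quoted ratings: apparent input
# power at each mains voltage, and +12 V rail capacity at 130 A.
def apparent_power_va(volts: float, amps: float) -> float:
    """Apparent power drawn from the mains (VA = V * A)."""
    return volts * amps


input_120 = apparent_power_va(120, 16)  # 120 V market rating
input_240 = apparent_power_va(240, 8)   # 240 V market rating (NZ case)
rail_12v_w = 12.0 * 130                 # +12 V rail capacity in watts

print(f"Input: {input_120:.0f} VA @ 120 V, {input_240:.0f} VA @ 240 V")
print(f"+12 V rail: {rail_12v_w:.0f} W of the 1600 W total")
```

Both mains ratings work out to 1920 VA of input, and the +12 V rail alone carries 1560 W - a single-rail-heavy design, which is exactly what a multi-GPU water-cooled rig wants.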