News Nvidia explains the missing ROPs — defective silicon in 0.5% of RTX 5090 and 5070 Ti GPUs

I agree that they knew about the defective chips before they were released to AiBs.

The thing I'm not sure about is whether NVIDIA would have to make adjustments to the firmware/BIOS of these cards due to the missing/defective ROPs. Another question is whether they actually had to physically disable those defective ROPs before the chip is ready. If either (or both) of these things are true this shows malicious intent.

I still haven't seen a response from anyone I consider technically savvy enough on the chip fabrication process to know what they're talking about...yet.
When chips are made they are tested in a variety of ways after being cut off the wafer. Any malfunctioning components on the chip are identified and the chip is graded at how high it can clock based upon heating and errors. If parts of a chip don't work the firmware they load will tell the chip not to use those parts of the chip. That is how you get CPUs that have 4, 6, or 8 cores on them even though they are all technically 8 cores - they just deactivated some cores and sold them as a lesser SKU. Chips that can't go as fast are binned down.
The bad 70ti should have more parts of the chip turned off and sold as a 70 class GPU. The bad 5090 could have half of it deactivated and sold as an 80.
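For anyone who wants the binning idea above in concrete terms, here is a minimal sketch. The SKU names, unit counts, and clock thresholds are all made up for illustration and are not NVIDIA's or anyone else's real bins; `bin_die` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sku:
    name: str               # hypothetical SKU label
    min_working_units: int  # functional units (cores, ROP partitions, ...) required
    min_clock_mhz: int      # lowest stable clock the SKU accepts


# Illustrative bins only, ordered from most to least valuable.
SKUS = [
    Sku("top", 8, 3000),
    Sku("mid", 6, 2800),
    Sku("low", 4, 2500),
]


def bin_die(working_units: int, max_stable_clock_mhz: int) -> Optional[str]:
    """Return the best SKU a tested die qualifies for, or None if it is scrap."""
    for sku in SKUS:
        if (working_units >= sku.min_working_units
                and max_stable_clock_mhz >= sku.min_clock_mhz):
            return sku.name
    return None


print(bin_die(8, 3100))  # fully working, fast die
print(bin_die(7, 3100))  # one bad unit: falls to the next SKU down
print(bin_die(8, 2600))  # all units work but it clocks low: binned down
```

The point of the sketch is that a die with a defective unit is not supposed to pass as the top SKU; it drops to whichever lower bin it still satisfies.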
 
nVidia is so fortunate that no one else on the planet is even trying to compete with them at the high end. If not for lack of competition and the AI push keeping them flush with massive cash, they would otherwise be headed for deep trouble. I personally think the AI thing is going to turn out to be scam-level disastrous in the end, but until it collapses, and as long as they keep putting their eggs in that basket, they are safe.

When it collapses, they better have the ability to rededicate their resources enough back to consumer graphics.
When all this AI hype ends they will have something new to sell, just like the crypto/blockchain hype was replaced with AI. If you follow Nvidia you should already know what the next hype after AI is. Gamers will continue to get the scraps, as usual.
 
And they'll get away with the shenanigans because gamers are addicts and at the end of the day won't change their spending habits in response. They're charging the value of an entire computer for a GPU and people can't buy them fast enough. Pff, it could have been filled with liverwurst, it wouldn't matter
 
The cynic in me knows that they knew all along. The majority of consumers will not bother to look, the majority do not read hardware websites either.
When faced with what looks to be a four month wait(?!) to get an RMA, again, what will a lot of people do?
 
When chips are made they are tested in a variety of ways after being cut off the wafer. Any malfunctioning components on the chip are identified and the chip is graded at how high it can clock based upon heating and errors. If parts of a chip don't work the firmware they load will tell the chip not to use those parts of the chip. That is how you get CPUs that have 4, 6, or 8 cores on them even though they are all technically 8 cores - they just deactivated some cores and sold them as a lesser SKU. Chips that can't go as fast are binned down.
The bad 70ti should have more parts of the chip turned off and sold as a 70 class GPU. The bad 5090 could have half of it deactivated and sold as an 80.
Understood.
The questions I have are whether NVIDIA had to go through any extra steps in the manufacturing process (to disable defective ROPs), or use a different firmware (because the standard would try to use defective ROPs), for these cores.
If yes, that's where you get malicious intent.
 
They don't even need to try that hard. The warranty is non-transferrable, so all those scalped cards are insta-denied. One more reason not to feed the scalpers.
Not really. The scalpers would be considered a reseller. The first person to open the packaging and register the product with the manufacturer would be the original owner.
 
Exactly. They made the decision to meet the launch date and hope it went unnoticed, then either get the "oops, do-over, I’m sorry" … or at worst face the inevitable class action lawsuit … which they could easily quash before it got to that point. The fact is they did not intend to waste this silicon, as they diverted some of the more lucrative AI silicon to fulfill this launch. Let’s be real: the chances of them missing such an easy check are about as reasonable as someone on an auto assembly line not noticing that a cylinder is missing. I mean, this is one of the key differences between gaming Blackwells and the AI-only data center chip components … and you don’t notice it’s not complete? These are basic automated checks. They chose to deal with the issue afterward, period.
Indeed, just get them to market, and clean up the mess after the fact. If they are lucky, a few of their customers don't even notice and they keep primo $$$ for subpar product.
 
Not really. The scalpers would be considered a reseller. The first person to open the packaging and register the product with the manufacturer would be the original owner.
I suppose it would be a case by case basis, but every warranty statement I've seen uses language like "authorized reseller". I'm pretty sure if it were tried no judge would side with the scalper buyer, but I'm also equally sure no AIB would want the negative PR.
 
Understood.
The questions I have are whether NVIDIA had to go through any extra steps in the manufacturing process (to disable defective ROPs), or use a different firmware (because the standard would try to use defective ROPs), for these cores.
If yes, that's where you get malicious intent.
It is all part of standard manufacturing at the lithography fab (in this case TSMC). It happens before they even attach the processor to what we think of as a 'chip', the package that provides the connectivity to the GPU board (or to the motherboard, for CPUs). There are always all kinds of imperfections in lithography, and dies with missing parts or inferior processing efficiency get identified immediately and presumptively routed to lower-value SKUs. The sad part here is that fewer ROPs don't impact AI at all, so these chips should all have been sent to their AI product lines. Either someone screwed up and allowed these chips to be completed and sent to board partners when they shouldn't have been, or they decided to take a chance and hope nobody caught the mistake. The fact that Nvidia immediately came back with a percentage (0.5%, which I doubt) suggests they knew they were passing off inferior products deliberately and just hoped nobody would notice.
 
And they'll get away with the shenanigans because gamers are addicts and at the end of the day won't change their spending habits in response. They're charging the value of an entire computer for a GPU and people can't buy them fast enough. Pff, it could have been filled with liverwurst, it wouldn't matter
Gamers aren't really buying. Just look at AMD's revenue for the last four quarters or so. AMD's gaming division usually brings in something like $1.5 billion per quarter, but over the last four quarters gaming revenue dropped to around $500 million on average. Gamers aren't really buying new hardware, be it GPUs or consoles. On Nvidia's side, the ones buying all those GPUs are semi-pros or companies that need them for AI. They are the ones willing to pay the crazy prices Nvidia is charging for those gaming GPUs. AMD saw this, which is why they will soon ditch RDNA and go for UDNA to chase a more lucrative market than gaming. If you want to blame someone for the crazy prices we see right now, blame miners or those who buy gaming GPUs for non-gaming tasks, because they are willing to pay prices that gamers are not. They also buy in big volume, unlike gamers, who usually buy one GPU and keep it for two or three generations.
 
And the reason for that is because everyone else is willing to pay more and don't make it their life mission to complain about anything and everything that these companies do.
I think many gamers are already complaining. Others just buy what is affordable to them and then adjust the way they game, like not chasing high resolutions or high refresh rates. The gamers who buy this expensive stuff are most likely not that many. The ones more willing to pay the higher prices are those who use GPUs for non-gaming work. That's why I sometimes think semi-pro buyers are what is ruining GPU market prices.
 
So, they're going with incompetence instead of deceptive, malicious intent.
I mean, I see no reason to doubt them on that. I see this as indeed plain incompetence, because it makes little sense for it to be anything else. They really don't need those few dozen extra salvaged chips, at the expense of their rep and all the screeching.

Still bad of course, but definitely not malicious intent.
 
nVidia is so fortunate that no one else on the planet is even trying to compete with them at the high end. If not for lack of competition and the AI push keeping them flush with massive cash, they would otherwise be headed for deep trouble. I personally think the AI thing is going to turn out to be scam-level disastrous in the end, but until it collapses, and as long as they keep putting their eggs in that basket, they are safe.

When it collapses, they better have the ability to rededicate their resources enough back to consumer graphics.
I would not hold my breath on AI demand collapsing anytime soon.

Besides, in my opinion, this is barking up the wrong tree. The problem is that we have fudge all for foundries - that is clearly the weakest link in this whole mess.

The real competition we need is a competition to TSMC.
 
You had me scared there for a second. I just ran GPU-Z and my 4090 does have the correct number of ROPS i.e. 176. Not like I would have been able to do anything about it at this point.
Hope you installed the nVidia driver before the check, otherwise GPU-Z showed the number of ROPs from the specification, not from the chip.
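A rough sketch of the comparison being described: take the ROP count your tool reports (with the driver installed, per the caveat above) and compare it against the spec. The 4090 figure comes from the post above; the 5090 and 5070 Ti figures are the commonly published spec numbers and are treated as assumptions here. `check_rops` is a hypothetical helper, not part of GPU-Z or any NVIDIA tool.

```python
# Spec ROP counts. The RTX 4090 value is from the post above; the other
# two are assumed from commonly published specifications.
EXPECTED_ROPS = {
    "RTX 4090": 176,
    "RTX 5090": 176,
    "RTX 5070 Ti": 96,
}


def check_rops(model: str, reported_rops: int) -> str:
    """Compare a reported ROP count against the expected spec for a model."""
    expected = EXPECTED_ROPS.get(model)
    if expected is None:
        return "unknown model"
    if reported_rops == expected:
        return "ok"
    return f"mismatch: expected {expected}, got {reported_rops}"


print(check_rops("RTX 4090", 176))
print(check_rops("RTX 5090", 168))
```

The reported value has to be typed in by hand from whatever your tool shows; the sketch only does the comparison.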
 
Given that nvidia supposedly knows the quantity of incorrect chips (and that it affects the 5070 Ti too) that leads me to believe it's one of two things: someone didn't do their job in QA or someone figured who'd notice and shipped it anyways.
Or option 3: At some point in the runup to release, yields were low enough that the 'low ROP count' variant was decided to be the final variant in order to increase the supply of valid dies. At this point, dies with the lower ROP count would end up passing QC as meeting the spec at the time, and dies that had already passed at the higher ROP count would be released to production anyway to avoid the cost of recalling them and disabling the working ROPs. Later, either the yields rose enough for the forecast supply to be met or the supply forecast was cut, and the official spec was bumped back up to the 'high ROP count', but 'low ROP count' dies had already passed QC with a valid check and were making their way through the supply chain.
The cockup was either that there was no way to retroactively invalidate QC, so board partners had no way to tell that a QC-passed die they had received was no longer valid, or that a secondary validation process that would have caught retroactively invalidated dies was skipped by some or all board partners, or that the low-ROP dies had already made their way onto validated and shipped boards (or even full SKUs) before being marked as invalid. If Nvidia thought it had pulled all low-ROP dies but some had already made their way onto boards (or into finished products), then it may not have been until boards reached end users that the effort to track down and account for every batch, tray and die would even have been made: if you issue a hold note and get a "yup, we'll not use those" in response then you expect it to actually happen.
 
Everyone thinks logistics is a trivial thing. There are so many moving parts in industry it is a miracle things work as well as they do.

And that is factoring in incompetent people.

Therefore, mistakes will happen.

People making stuff up here is silly.

Nvidia is a company, not a person. Someone made a big mistake. And because of that GPUs got out in the wild that weren't supposed to.

The engineers know you can see the specs in a computer so they weren't trying to pull a fast one on everybody. Lol.
 
It is all part of standard manufacturing at the lithography fab (in this case TSMC). It happens before they even attach the processor to what we think of as a 'chip', the package that provides the connectivity to the GPU board (or to the motherboard, for CPUs). There are always all kinds of imperfections in lithography, and dies with missing parts or inferior processing efficiency get identified immediately and presumptively routed to lower-value SKUs. The sad part here is that fewer ROPs don't impact AI at all, so these chips should all have been sent to their AI product lines. Either someone screwed up and allowed these chips to be completed and sent to board partners when they shouldn't have been, or they decided to take a chance and hope nobody caught the mistake. The fact that Nvidia immediately came back with a percentage (0.5%, which I doubt) suggests they knew they were passing off inferior products deliberately and just hoped nobody would notice.
That's along the lines of what I was thinking.
Someone at NVIDIA made a conscious decision to release these chips, knowing that they do not meet the RTX 5090 specification and should be rejected according to NVIDIA's own guidelines. In other words, they gave them a pass as RTX 5090 chips and allowed them to move forward to the interposer process and get shipped out to wherever. This is a conscious decision to deceive, and thus malicious intent.
 
I mean, I see no reason to doubt them on that. I see this as indeed plain incompetence, because it makes little sense for it to be anything else. They really don't need those few dozen extra salvaged chips, at the expense of their rep and all the screeching.

Still bad of course, but definitely not malicious intent.
I disagree. Incompetence would be something like the lack of proper training so a certain individual missed or ignored the red flashing light showing that a core on a particular wafer had a defect. (oversimplification example)

Malicious intent requires a conscious decision to obfuscate, lie, cheat, or do something you know is wrong. I believe that someone made the conscious decision to release these chips when they should have been rejected.