News Unreal Engine supervisor blasts 50% failure rate with Intel chips — company switching to Ryzen 9 9900X, praises AMD's single-threade...

Dec 3, 2023
36
5
35
Weird thing for me is that I have had zero such issues with my i7 13700KF, even @32GB RAM XMP 6200+6GHz all-core CPU OC, and the same goes for my i7 10700KF@32GB XMP 3200+5.129GHz, i9 10920X@32GB XMP 3200, i7 6850K@32GB XMP 3200+4GHz, i7 5930K@32GB XMP 3200+4.7GHz, i7 4770K@XMP 1600MHz, i5 3570K@16GB XMP 1600MHz+4GHz, i5 2500K@16GB XMP 1600MHz+4GHz, i7 870@XMP 1333MHz, i5 750@XMP 1333MHz, i7 980X@XMP 1866MHz+4.5GHz, i7 960@XMP 1333MHz+4.1GHz, i7 930@XMP 1333MHz, i7 920@1444MHz+3.8GHz, C2Q 9650@DDR2 1066MHz+3.8GHz, Q6600@DDR2 667MHz, QX6700@DDR2 1066MHz+3.2GHz, P4 2.8GHz, and P3 1GHz. All still run flawlessly across W98SE/W2000, XP32, XP64, W7, and W10 depending on the rig.
 
My 13900KS started degrading about 2 months after launch. Took months to get a replacement through warranty. This was the first time I was hit with degradation on Intel since the 2600K. Generally I’d been able to push very high voltages and been fine as long as cooling was adequate. But the voltages on this chip are insane if you look at the predicted voltages in the BIOS at stock clocks.

Edit: I should mention with the 13900KS after the first week of playing with the OC, I had gone back to running the chip at stock since I wasn’t able to get a meaningful OC out of the damn thing anyway. These chips are already pushed so hard to the edge that you’ll struggle to get even a 100-200MHz all-core OC out of it.
Interesting. I ran my 2600K at 5.0GHz for the first year I had it, then stock for the next 6 years, and never had an issue. I didn't realize that Sandy Bridge had a degradation issue.
 
  • Like
Reactions: tamalero
Weird thing for me is that I have had zero such issues with my i7 13700KF, even @32GB RAM XMP 6200+6GHz all-core CPU OC, and the same goes for my i7 10700KF@32GB XMP 3200+5.129GHz, i9 10920X@32GB XMP 3200, i7 6850K@32GB XMP 3200+4GHz, i7 5930K@32GB XMP 3200+4.7GHz, i7 4770K@XMP 1600MHz, i5 3570K@16GB XMP 1600MHz+4GHz, i5 2500K@16GB XMP 1600MHz+4GHz, i7 870@XMP 1333MHz, i5 750@XMP 1333MHz, i7 980X@XMP 1866MHz+4.5GHz, i7 960@XMP 1333MHz+4.1GHz, i7 930@XMP 1333MHz, i7 920@1444MHz+3.8GHz, P4 2.8GHz, and P3 1GHz. All still run flawlessly
It might be the specific workloads they're using?
 
  • Like
Reactions: NinoPino
... So guessing it’s simply if you load your chip often even at normal limits, it will start to become unstable kind of thing. The hotter the chip, the higher the rate of failure too. That’s my take from the analysis done.

What a disaster. How did they not stress test their chips for stability? Or worse, did they know and cover it up?
I've been wondering for several years about Intel's power draw. It doesn't SEEM like CPUs would be able to handle that long-term. But I would've thought it would be worst when Intel was still stuck on the 14nm node. But nobody mentioned 11th and 12th Gen having this problem.
 
  • Like
Reactions: NinoPino

valthuer

Prominent
Oct 26, 2023
185
185
760

TheHerald

Respectable
BANNED
Feb 15, 2024
1,630
502
2,060
I believe so. I tried updating my motherboard's BIOS in the past, but there was a slight decrease in my CPU's performance, so I didn't like it and I reverted to the old one.

By the way, this is a list of BIOS updates for my motherboard. Do you think I should choose the latest?

https://pg.asrock.com/mb/Intel/Z790 PG Lightning/index.asp#BIOS
If it works, you don't upgrade your BIOS. The newest BIOS isn't necessarily better. The latest BIOS on my Z690 crashes at default settings on my 12900K.

What you should do is go with Intel's recommended specs (250W PL2, 307A), disable TVB and ST boost, and you should be good to go.
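
For anyone who wants to sanity-check their own settings against the numbers quoted above, here is a minimal sketch. The 250 W PL2 and 307 A figures are taken from this post, not from an Intel document, and the key names and the `board_defaults` values are hypothetical; you would read the real ones from your BIOS setup screen or a monitoring tool.

```python
# Limits quoted in the post above (assumption: not verified against Intel documentation).
RECOMMENDED_LIMITS = {"pl2_watts": 250, "iccmax_amps": 307}


def find_exceeded_limits(configured, recommended=RECOMMENDED_LIMITS):
    """Return {name: (configured, recommended)} for every limit set above the recommendation."""
    exceeded = {}
    for name, limit in recommended.items():
        value = configured.get(name)
        # A missing value is skipped rather than guessed.
        if value is not None and value > limit:
            exceeded[name] = (value, limit)
    return exceeded


# Hypothetical board defaults, as one might read them from the BIOS setup screen.
board_defaults = {"pl2_watts": 4095, "iccmax_amps": 307}
print(find_exceeded_limits(board_defaults))
```

Anything the function reports is a setting that the board has raised above the quoted values and that you may want to lower by hand.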
 

valthuer

Prominent
Oct 26, 2023
185
185
760
Thanks for the help! Much appreciated!
 
Mar 10, 2020
437
396
5,070
Normally, if it works, don’t touch it, but in this situation... Intel claims to be bringing necessary fixes to “prevent” further degradation in August.

Your CPU may not be exhibiting problems today, but this does not mean it is not degraded. It only means that it has not reached the threshold to crash.
 
Interesting. I ran my 2600K at 5.0GHz for the first year I had it, then stock for the next 6 years, and never had an issue. I didn't realize that Sandy Bridge had a degradation issue.
I ran my 2600k into the ground; kept it at stock for nearly four years, until it was clearly starting to hold me back. Did a *very* conservative OC to 4GHz (which is tame for that chip); lasted two months. *shrug*
 

rluker5

Distinguished
Jun 23, 2014
913
594
19,760
To put the article in perspective, the source:
View: https://x.com/DylserX/status/1815688815996281128

Hasn't been around the PCs in question or done anything with them for the last 4 months, so he hasn't tried any recent mitigations. And he seems more focused on the creative production side than the hardware side. Maybe the guy/company he lent them out to has them all tuned and ready to go for him? That would be the polite thing to do.

Maybe Intel needs to more strictly enforce motherboard settings or come out with an intermediate chip that is more locked down. Or if they can't, reduce that stock single/dual core boost on new chips.

Things that are supposed to be limited in a certain way don't seem to be. Like this is on my pc that is supposed to be limited to 6GHz on 2 cores:
kVfEKc7.png

(note that my high voltages are in the safe range, this is from tuning, it doesn't seem like my CPU is exceptional)
The amount of freedom given to motherboard manufacturers in voltage manipulation without the consent of the owner is also excessive. I still think that most of the instability is due to this and not significant degradation, that they are 2 different problems.

And the GN oxidation issue is silly if you think about it: copper junctions get oxidized, sealed in an O2-impermeable internal part of the chip, some pass all checks, then degrade later by more oxygen because the atoms just magically appear there? It wouldn't surprise me if there were chips with O2 processing defects that were rejected because they failed binning; what would surprise me is the spontaneous creation of particular atoms in inconvenient places.
 

TheHerald

Respectable
BANNED
Feb 15, 2024
1,630
502
2,060
I don't think there is such a thing as a 50% failure rate. If the failure rate is at 50%, it just means the other 50% hasn't failed... yet. If the above numbers are true, then every single i9, at least, will experience issues sooner or later. Damn.
 

Taslios

Proper
Jul 11, 2024
54
76
110
If the project is a personal one then it makes sense... he might not want to spend that $ on a CPU.

However, work machines are a company expense, and that's where people spend when they otherwise might not have.
Happens all the time really, someone at AMD or an AMD OEM saw that post and jumped on a really brilliant and easy marketing win....
 
Mar 10, 2020
437
396
5,070
Your CPU is idle in that screenshot, so even though all the clocks show their max limits, they are barely sipping current; look at the package power entry in the screenshot.

When it has work to do, with lots of switching, the clocks will drop to the programmed limits... or the overridden limits as set by “motherboard enhancements” or overclocking values if you have set them. This depends on what has been set in the EFI (or the tuning utility).

There have been 3 problems (please correct me if I have misread the situation):

1. Acknowledged by Intel, claimed to be limited and fixed: oxidation in some 13xxx CPUs due to a manufacturing defect. I don't know how many vulnerable CPUs are in the wild.
2. Motherboard out-of-box defaults pushing “enhanced” operation, in other words overclocking the CPU by default: Intel default values exposed in a BIOS update have been made available to mitigate this. Users have to choose to set the BIOS/EFI to use them.
3. Intel microcode errors that caused excessive voltages to be applied to the CPU. Fix in August.

For 2 and 3, enthusiasts will apply these or choose not to. Less technically competent users will justifiably have a degree of trepidation about applying the BIOS/EFI updates due to the potential result of a failed update.
 
May 21, 2024
15
27
40
some pass all checks, then degrade later by more oxygen because the atoms just magically appear there?
The oxidation exists and is unchanged before and after degradation; it is fixed in the copper once fabricated. The presence of oxidized copper vias accelerates the degradation because it effectively narrows the nm-scale circuit further, increasing the current density, which is the major contributor to electromigration.
 
  • Like
Reactions: Gururu

rluker5

Distinguished
Jun 23, 2014
913
594
19,760
I don't think there is such a thing as a 50% failure rate. If the failure rate is at 50%, it just means the other 50% hasn't failed... yet. If the above numbers are true, then every single i9, at least, will experience issues sooner or later. Damn.
"failure" in this case is specifically defined as instability, at least according to the source.
Your CPU is idle in that screenshot, so even though all the clocks show their max limits, they are barely sipping current; look at the package power entry in the screenshot.

When it has work to do, with lots of switching, the clocks will drop to the programmed limits... or the overridden limits as set by “motherboard enhancements” or overclocking values if you have set them. This depends on what has been set in the EFI (or the tuning utility).

There have been 3 problems (please correct me if I have misread the situation):

1. Acknowledged by Intel, claimed to be limited and fixed: oxidation in some 13xxx CPUs due to a manufacturing defect. I don't know how many vulnerable CPUs are in the wild.
2. Motherboard out-of-box defaults pushing “enhanced” operation, in other words overclocking the CPU by default: Intel default values exposed in a BIOS update have been made available to mitigate this. Users have to choose to set the BIOS/EFI to use them.
3. Intel microcode errors that caused excessive voltages to be applied to the CPU. Fix in August.

For 2 and 3, enthusiasts will apply these or choose not to. Less technically competent users will justifiably have a degree of trepidation about applying the BIOS/EFI updates due to the potential result of a failed update.
You are mostly right, except for the assumption that chips "in the wild" that have an oxidation defect small enough to not be detected in functional testing are vulnerable to anything. As with Steve, that is merely an assumption. But in his case he was fed it by a salesman looking for sales.
The oxidation exists and is unchanged before and after degradation; it is fixed in the copper once fabricated. The presence of oxidized copper vias accelerates the degradation because it effectively narrows the nm-scale circuit further, increasing the current density, which is the major contributor to electromigration.
That is some pretty specific oxidation. Tell me why it would only cover a bit of the copper interconnect and not all of it? Or all of a thousand in a tiny area? Because the latter two options would be detectable in testing and binning. This is nm scale you are talking about.

Edit: As for the high voltages being an issue, this does seem plausible, as voltages over 1.5V have been an issue for CPUs for like a decade now.
 
Mar 10, 2020
437
396
5,070
You are mostly right, except for the assumption that chips "in the wild" that have an oxidation defect small enough to not be detected in functional testing are vulnerable to anything. As with Steve, that is merely an assumption. But in his case he was fed it by a salesman looking for sales.
I’m just saying we don’t know. Not drawing any conclusion, not making any assumption. The problem was acknowledged by Intel, the process corrected by Intel (claimed by Intel).

As for passing tests, yes, it is possible for a new device to pass release tests. If it wasn’t then Intel would have knowingly released faulty hardware and this would be far worse.

I believe that Intel wouldn’t knowingly do that.
 
May 21, 2024
15
27
40
why it would only cover a bit of the copper interconnect and not all of it? Or all of a thousand in a tiny area?
I am not a professional; from what I heard in the video, it was some incomplete purging in the ALD process, which leaves some oxygen radicals that could oxidize the (inter-)layer top copper of the via. Depositing multiple layers but having the layer interface of the copper via oxidized means the interconnection has CuO reducing the connection's cross-sectional area -> higher current density.

imo, if the oxidation is instead happening in any other random position of the copper circuit, it would be harder to detect in freshly fabricated chips, as the electromigration degradation takes time to surface; if oxidation is confined to interconnections, then the failure rate may be higher, but there are still chips that would not immediately degrade and would be passed on to customers.
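
To put rough numbers on the cross-section argument, here is a minimal sketch using Black's equation for electromigration, MTTF = A * J^-n * exp(Ea / (k*T)). The exponent n = 2 and the activation energy Ea = 0.9 eV are illustrative assumptions, not values from Intel or the video, and the function name is made up for this example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(area_fraction, temp_ref_k, temp_new_k, n=2.0, ea_ev=0.9):
    """Lifetime relative to a healthy via, per Black's equation.

    area_fraction: remaining conducting cross-section (1.0 = no oxidation).
    n and ea_ev are illustrative assumptions, not measured values.
    """
    if not 0.0 < area_fraction <= 1.0:
        raise ValueError("area_fraction must be in (0, 1]")
    # Same current through a smaller area: J rises by 1 / area_fraction.
    density_factor = 1.0 / area_fraction
    # Arrhenius term: hotter operation shortens the lifetime.
    thermal_factor = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_new_k - 1.0 / temp_ref_k)
    )
    return density_factor ** (-n) * thermal_factor


# Half the cross-section at the same temperature.
print(relative_mttf(0.5, 350.0, 350.0))
# Full cross-section, but 20 K hotter.
print(relative_mttf(1.0, 350.0, 370.0))
```

Under these assumptions, halving the conducting area at the same temperature cuts the expected lifetime to a quarter, which is the kind of margin that would not show up on day one.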
 
  • Like
Reactions: stuff and nonesense

rluker5

Distinguished
Jun 23, 2014
913
594
19,760
Actually having the layer interface of the copper via oxidized would turn it into a resistor and make it non functional which would be picked up in binning. If a select portion of some of the vias were oxidized then you could have that narrowing of cross sectional conductive area which would result in the behavior you are describing.

What I'm saying is that it is highly improbable to have only just enough of the via's area be oxidized to keep the affected ones functional, yet partially oxidized enough to lead to significantly increased current density through them. The inadequate purging would expose all of the Cu atoms to O2 and there is little reason for any but the edge ones to oxidize nonuniformly. The oxide capping of large numbers of vias would become apparent when they found a lot of failures in testing. You would need some magical goldilocks luck in the O2 atoms positioning themselves just right and not behaving as atoms for oxidization induced increased electromigration propensity to slip by without a ton of failures raising flags in the same batches.
 
Mar 10, 2020
437
396
5,070
From the Reddit statement, published on Tom's Hardware:

Intel statement on via oxidation

Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.

Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.

For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed.
- Intel representative via Reddit.

The oxidation was real, it was fixed. (Intel claim)
 

rluker5

Distinguished
Jun 23, 2014
913
594
19,760
I agree, and "only a small number of instability reports can be connected to the manufacturing issue" kind of backs up that it isn't a widespread thing.

Unlike high volts.

Unlike i9s being in motherboards with Haswell era VRM systems. Some have DrMOS which isn't that big a change for tripling or quadrupling the CPU power draw.
 
Mar 10, 2020
437
396
5,070
It was a PR spokesperson … pinch of salt perhaps.
 
  • Like
Reactions: tamalero

logainofhades

Titan
Moderator

Alpha_Lyrae

Reputable
Nov 13, 2021
28
26
4,560
I've been wondering for several years about Intel's power draw. It doesn't SEEM like CPUs would be able to handle that long-term. But I would've thought it would be worst when Intel was still stuck on the 14nm node. But nobody mentioned 11th and 12th Gen having this problem.
Smaller nodes are naturally more sensitive to higher voltages because resistance in the copper interconnects increases; so 14nm can actually handle higher power draw better than its smaller-node cousins on 10nm or Intel 7. There's also a density increase, so you have more localized hotspotting within a smaller area than before. If Intel willingly wanted to reduce the lifespan of their chips from 8-15 years to 3-5 years, increased heat and increased voltages (at high core load, so very high current draw) for prolonged periods would do that.
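
As a rough illustration of the resistance point (textbook relations, not anything published by Intel): for a wire of resistivity ρ, length L and cross-sectional area A carrying current I,

```latex
R = \rho \frac{L}{A}, \qquad J = \frac{I}{A}, \qquad P = I^2 R
```

so shrinking A raises the resistance, the current density and the resistive heating for the same current. This ignores everything else that changes between nodes, so treat it as a sketch rather than a model of any real process.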

However, it seems to be a calculation error within microcode. I'm sure there will be internal staffing "corrections" associated with this. CPUs are irreversibly damaged and will have to be replaced in a mass recall campaign.
 
  • Like
Reactions: dalauder