News Intel finds root cause of CPU crashing and instability errors, prepares new and final microcode update

That's a BS, over-the-top cynical take that does nobody any good. I'll agree that companies will tend to do whatever they think they can get away with (though there are exceptions). However, it's only by holding them to a higher standard and ensuring they suffer the full consequences they're due that we can realistically expect that they and others will ever do better.
Can you name a single publicly traded company that hasn't done exactly this in the last 20 years?

Expecting it is not the same thing as it being acceptable, which, if you had bothered to finish reading the segment you quoted, you'd know I obviously don't think it is.
I cited another source where a retailer experienced 4x the return rate for Gen 13 CPUs compared to Gen 12. That wasn't filtered by K-series, either, which implies the K-series return rate should be even higher:
"According to data from Les Numeriques, only 1% of AMD processors were returned in 2020, while Intel had a 1.75% return rate then. So, if AMD’s return rate remained stable since then, we can extrapolate that the Raptor Lake chips have a return rate of 4% to 7% while Raptor Lake Refresh processors would have 3% to 5.25%. We should also note that these numbers only reflect return rates that went through the retailer channels, not those that went straight to Intel."​
A single anonymous report from a reseller, so this is no more solid evidence than anything else that has been discussed. That report then drew correlations between the anonymous report and Mindfactory's public return figures to come up with all of these numbers. You can't possibly think these numbers are to be taken seriously.
This is some weird logic. Just because a problem their QA failed to catch is tricky for them to both debug and mitigate doesn't mean the oversight by their QA team is excusable. All the QA team had to do was catch the symptom, which is often a lot easier than finding the root cause of a problem. Were it not so, you'd see some of the best-paid positions and highest job qualifications in QA, yet QA roles tend to be among the lowest-paid in R&D.

What they clearly should've done is some detailed testing of the CPU's internal voltage regulation & management, to make sure it was always staying within safe limits. That seems like it ought to be pretty near the top of the list, maybe just underneath ensuring all the instructions work correctly.
You're still assigning incompetence to the problem without acknowledging the possibility that it isn't. You seem to be consistently working on the assumption that 100% of the CPUs will degrade. What if this doesn't happen to be the case, and they did do exactly that testing, and nothing they tested failed?

This is the problem with making assumptions based on something we cannot be sure of without knowing the internal procedures.
As for why it's taken so long for these mitigations to dribble out, people can & will speculate as they wish. It definitely gives the feeling of Intel trying to run down the clock, even if that's not the reality.
Maybe because it's extremely hard to nail down? I'll let people more knowledgeable than me take this one:
Ryan Smith: With Intel's announcement that they've isolated the Raptor Lake VMin instability issue to a clock tree circuit, the whole saga is making a lot more sense. Getting that far into the weeds (and ruling out everything else) is a ton of work

Put bluntly, I had been wondering why it had taken Intel so long to replicate and start fixing the issue. And while it doesn't excuse the slow response, a reliability issue with a clock tree circuit is definitely one of the more challenging and complex scenarios to nail down
Jon Masters: Not just a clock tree, but *aging* within a clock tree. Silicon needs to be designed to account for aging, and has aging sensors etc. but this looks like it would have been a fun debug exercise. That kind of stuff is tremendous fun! Kudos to their engineers
https://nitter.poast.org/RyanSmithAT/status/1839125650596393290#m
https://nitter.poast.org/jonmasters/status/1839183677168799836#m
 

bit_user
Can you name a single publicly traded company that hasn't done exactly this in the last 20 years?
Yes, I'm sure.

Expecting it is not the same thing as it being acceptable, which, if you had bothered to finish reading the segment you quoted, you'd know I obviously don't think it is.
Why do you think I didn't finish reading what I quoted?

A single anonymous report from a reseller, so this is no more solid evidence than anything else that has been discussed. That report then drew correlations between the anonymous report and Mindfactory's public return figures to come up with all of these numbers. You can't possibly think these numbers are to be taken seriously.
What you seem to want is a perfect dataset that tells us the exact dimensions of the problem. As I said, we don't have that. What we do have is a lot of incomplete data, leaks from OEMs, corner cases (i.e. servers), and reports from game developers, publishers and many individuals that suggest there indeed could've been a swelling wave of failures.

As I've said before, my position is that we can't dismiss the possibility the eventual scale of the problem was truly momentous, without Intel's mitigations. The absence of data does not imply the absence of a problem. At this point, we can only hope that more clarity emerges with time.

You're still assigning incompetence to the problem without acknowledging the possibility that it isn't.
It was a miss, either way. I don't have insight into how staffing of Intel's QA departments has evolved, how well it's scaled with respect to product complexity, or how compressed schedules have been. So, while I think it represents some measure of organizational incompetence or recklessness, I'm not saying the individuals are themselves incompetent.

You seem to be consistently working on the assumption that 100% of the CPUs will degrade.
Nope. Not sure where you got that idea.

What if this doesn't happen to be the case, and they did do exactly that testing, and nothing they tested failed?
As I said, there are multiple levels at which the problem can be observed. One way is to test until failure, which is inefficient, time-consuming, and not guaranteed to turn up the problem. However, a better approach that I think is more consistent with validation of these sorts of designs would be to test voltage management & regulation under a variety of scenarios. Had this been done, it should've revealed that unsafe voltages were occurring. It seems quite feasible to me, and a logical part of the test plan for the corresponding functional units.

Maybe because it's extremely hard to nail down? I'll let people more knowledgeable than me take this one:
https://nitter.poast.org/RyanSmithAT/status/1839125650596393290#m
https://nitter.poast.org/jonmasters/status/1839183677168799836#m
Again, you're conflating two completely different things. Just because a problem is difficult to debug or devise mitigations for does not mean that it necessarily would've been difficult to test for. This is partly due to the fact that their QA engineers aren't constrained to dealing with the CPU as a veritable black box, the way we are.

Also, just because you consider them more knowledgeable than yourself doesn't mean either of those individuals are actually qualified to render an informed opinion on the subject. I wouldn't put much heed into statements by anyone who hasn't worked as a chip designer on a reasonably modern and complex ASIC. Even someone with an advanced degree in EE, like Ian Cutress, wouldn't necessarily have any idea how working chip designers manage, deal with, or regard these sorts of problems.

I think this exchange would go a lot better if you stopped trying to mischaracterize my position and just asked what I'm saying, when there's any doubt or room for uncertainty.
 
Yes, I'm sure.


Why do you think I didn't finish reading what I quoted?
You literally called what I said BS, then went on to say it tends to be correct followed by ranting about accountability. That implies you didn't read what I said or just somehow didn't understand that I don't think what they did (or any other company who pulls this) is in any way acceptable.
It was a miss, either way. I don't have insight into how staffing of Intel's QA departments has evolved, how well it's scaled with respect to product complexity, or how compressed schedules have been. So, while I think it represents some measure of organizational incompetence or recklessness, I'm not saying the individuals are themselves incompetent.
Right so you're assigning incompetence like I've said. Also at no point did I suggest you were calling individuals incompetent.
Nope. Not sure where you got that idea.

As I said, there are multiple levels at which the problem can be observed. One way is to test until failure, which is inefficient, time-consuming, and not guaranteed to turn up the problem. However, a better approach that I think is more consistent with validation of these sorts of designs would be to test voltage management & regulation under a variety of scenarios. Had this been done, it should've revealed that unsafe voltages were occurring. It seems quite feasible to me, and a logical part of the test plan for the corresponding functional units.
You've missed the point again: what if not every CPU fails due to the voltages being observed?

You're assuming that they're not thoroughly testing, or not testing in a manner deemed appropriate. You're making assumptions still and placing blame based on these assumptions. You've ignored any other possibility the entire time and you're still doing so. For example it could be that after failure analysis they determined that not all silicon could handle the voltages and that's how the new configuration came about.

It could be one of us is right, both, neither or some weird combination. Without someone internally releasing information there's no way to be certain so speaking with certainty about it is dishonest.
Again, you're conflating two completely different things. Just because a problem is difficult to debug or devise mitigations for does not mean that it necessarily would've been difficult to test for. This is partly due to the fact that their QA engineers aren't constrained to dealing with the CPU as a veritable black box, the way we are.
I'm pretty sure I'm not conflating anything as I boxed out your quote talking about the problem itself and how long it has taken to resolve. There was nothing involving diagnosis of there being a problem.
Also, just because you consider them more knowledgeable than yourself doesn't mean either of those individuals are actually qualified to render an informed opinion on the subject. I wouldn't put much heed into statements by anyone who hasn't worked as a chip designer on a reasonably modern and complex ASIC. Even someone with an advanced degree in EE, like Ian Cutress, wouldn't necessarily have any idea how working chip designers manage, deal with, or regard these sorts of problems.
That's why I quoted Masters with Smith (as opposed to just Smith) because he's been directly involved with chip design. Maybe you should have checked before assuming qualifications?
I think this exchange would go a lot better if you stopped trying to mischaracterize my position and just asked what I'm saying, when there's any doubt or room for uncertainty.
If you don't want what you say to be "mischaracterized" you should stop writing as though you're certain when you're just guessing like everyone else. This is the first post you've made in this thread that actually uses any qualifiers as far as your assumptions on the failure are concerned. It's not my job to figure out what you really mean as opposed to responding to what you are actually writing. The back and forth would have gone differently had you indicated what you were saying was opinion as opposed to fact.
 

YSCCC
Can you name a single publicly traded company that hasn't done exactly this in the last 20 years?

Expecting it is not the same thing as it being acceptable, which, if you had bothered to finish reading the segment you quoted, you'd know I obviously don't think it is.

A single anonymous report from a reseller, so this is no more solid evidence than anything else that has been discussed. That report then drew correlations between the anonymous report and Mindfactory's public return figures to come up with all of these numbers. You can't possibly think these numbers are to be taken seriously.

You're still assigning incompetence to the problem without acknowledging the possibility that it isn't. You seem to be consistently working on the assumption that 100% of the CPUs will degrade. What if this doesn't happen to be the case, and they did do exactly that testing, and nothing they tested failed?

This is the problem with making assumptions based on something we cannot be sure of without knowing the internal procedures.

Maybe because it's extremely hard to nail down? I'll let people more knowledgeable than me take this one:


https://nitter.poast.org/RyanSmithAT/status/1839125650596393290#m
https://nitter.poast.org/jonmasters/status/1839183677168799836#m
Only replying to the incompetence part.

To me it's either incompetence or, more likely, management pressure to rush the generation through the pipeline and release it with performance competitive with the rival, leading to such a miss. Either way, that is rooted in the culture or structure of the company itself and would be a yellow or red flag for coming generations from them.

Say, looking at the max voltage: it's literally as easy/basic as what Buildzoid did recently, soldering an oscilloscope probe to the power rail of the socket, and it will catch those crazy peaks! Yes, well-binned chips might not have 1.65 V spikes, but spikes 0.2 V higher than what is considered normal for the specific chip should raise a red flag and hold back the release. And as for degradation rate, how hard is it to force one of the highest spikes caught, run chips 24/7 at that voltage and maybe 10% higher, and use the projection to estimate the potential speed to death? Heck, even speaker companies do something similar with the drivers they use.
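To illustrate the kind of projection being described, here's a minimal sketch of a voltage-accelerated time-to-failure extrapolation. It assumes a simple exponential voltage-acceleration model, and every number in it (the exponent, the stress voltage, the measured lifetime) is made up for illustration; none of this is Intel data.

```python
import math

# Minimal sketch: project time-to-failure (TTF) at lower voltages from an
# accelerated stress test, assuming TTF(V) = A * exp(-gamma * V).
# All values here are illustrative assumptions, not measured Intel figures.
GAMMA = 12.0  # assumed voltage-acceleration exponent, in 1/V

def projected_ttf(v_target, v_stress, ttf_stress_hours, gamma=GAMMA):
    """Extrapolate TTF at v_target from a TTF measured at a higher v_stress."""
    return ttf_stress_hours * math.exp(gamma * (v_stress - v_target))

# Suppose worst-binned samples degraded after ~500 hours at a stressed 1.80 V.
for v in (1.65, 1.55, 1.45):
    years = projected_ttf(v, 1.80, 500.0) / (24 * 365)
    print(f"Projected TTF at {v:.2f} V: ~{years:.1f} years")
```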

If that is missed and it's not incompetence, I don't know what a QA department is for.
 

TheHerald
Only replying to the incompetence part.

To me it's either incompetence or, more likely, management pressure to rush the generation through the pipeline and release it with performance competitive with the rival, leading to such a miss. Either way, that is rooted in the culture or structure of the company itself and would be a yellow or red flag for coming generations from them.

Say, looking at the max voltage: it's literally as easy/basic as what Buildzoid did recently, soldering an oscilloscope probe to the power rail of the socket, and it will catch those crazy peaks! Yes, well-binned chips might not have 1.65 V spikes, but spikes 0.2 V higher than what is considered normal for the specific chip should raise a red flag and hold back the release. And as for degradation rate, how hard is it to force one of the highest spikes caught, run chips 24/7 at that voltage and maybe 10% higher, and use the projection to estimate the potential speed to death? Heck, even speaker companies do something similar with the drivers they use.

If that is missed and it's not incompetence, I don't know what a QA department is for.
Do you also consider X3Ds acting as hand grenades to be the result of management pressure to rush out a product with competitive performance? Or do you only feel that way about Intel?
 
Only replying to the incompetence part.

To me it's either incompetence or, more likely, management pressure to rush the generation through the pipeline and release it with performance competitive with the rival, leading to such a miss. Either way, that is rooted in the culture or structure of the company itself and would be a yellow or red flag for coming generations from them.

Say, looking at the max voltage: it's literally as easy/basic as what Buildzoid did recently, soldering an oscilloscope probe to the power rail of the socket, and it will catch those crazy peaks! Yes, well-binned chips might not have 1.65 V spikes, but spikes 0.2 V higher than what is considered normal for the specific chip should raise a red flag and hold back the release. And as for degradation rate, how hard is it to force one of the highest spikes caught, run chips 24/7 at that voltage and maybe 10% higher, and use the projection to estimate the potential speed to death? Heck, even speaker companies do something similar with the drivers they use.

If that is missed and it's not incompetence, I don't know what a QA department is for.
Here's what you don't seem to be grasping: What if a majority of the chips don't die with your hypothetical 1.65v? What if everything they tested lasted the expected duration under their accelerated aging testing?

A lot of people seem to be working on the assumption that the high voltage spikes will kill every chip, when that may not be the case. Without exact disclosures, which Intel would likely never make, the specifics will never be known.
 

bit_user
You literally called what I said BS, then went on to say it tends to be correct followed by ranting about accountability. That implies you didn't read what I said or just somehow didn't understand that I don't think what they did (or any other company who pulls this) is in any way acceptable.
How about, instead of casting broad aspersions and vague allegations, you challenge specific statements, so that I know what the heck you're going on about and actually have a chance to answer them?

This whole time, I can't shake the feeling you're trying to bait me into a strawman argument. It's as if you're really seeking a foil, but I'm not about to be your patsy.

You've missed the point again: what if not every CPU fails due to the voltages being observed?
I went to some great trouble to explain this: you shouldn't need to test the CPU to failure in order to expose the unsafe voltages. The failure is a consequence of the unsafe voltages occurring repeatedly, many many times. So, it's just a matter of creating the right circumstances, which seems like it shouldn't be hard to do if you just designed a stress test to ramp frequencies up and down over different timescales while varying the other independent variables. That seems like it ought to be a part of the test plan for the parts of the chip which are involved in managing voltage.
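As a rough illustration of what such a stress test could look like, here's a sketch of the idea; it is not Intel's actual test plan, and `read_core_voltage` and `apply_load` are hypothetical placeholders for platform-specific telemetry and load control.

```python
import itertools
import random
import time

VCORE_LIMIT = 1.55  # assumed "safe" ceiling in volts, purely for illustration

def read_core_voltage():
    """Hypothetical telemetry hook: on real hardware this would read the
    requested VID / measured Vcore from the platform's sensor interface."""
    return 1.30 + random.random() * 0.35  # fake reading for demonstration

def apply_load(step, duration_s):
    """Hypothetical load hook: pin cores to a given frequency/workload mix."""
    time.sleep(duration_s)

def sweep(max_steps=8, dwell_s=0.01, samples_per_step=25):
    """Ramp load up and down while sampling voltage; flag any excursion."""
    worst = 0.0
    for step in itertools.chain(range(max_steps), reversed(range(max_steps))):
        apply_load(step, dwell_s)
        for _ in range(samples_per_step):
            v = read_core_voltage()
            worst = max(worst, v)
            if v > VCORE_LIMIT:
                print(f"excursion: {v:.3f} V at load step {step}")
    print(f"worst observed: {worst:.3f} V (limit {VCORE_LIMIT} V)")

if __name__ == "__main__":
    sweep()
```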

As an aside, you seem pretty smart, but I'm not sure if you've ever worked as an engineer. If not, then I can at least see why you might not have a clear idea how QA teams do their job. When I worked at a fabless chip company, they actually had a separate QA team for doing system-level testing than the folks who wrote the lower level tests. Again, they're not limited to testing these chips and their functional units like you or I would. Only the SQA team did testing at the level of a chip on a board, running actual workloads.

You're assuming that they're not thoroughly testing, or not testing in a manner deemed appropriate.
Yes. The problem is not caused by silicon defects. It seems mostly workload-specific, as well as being affected by other parameters concerning how the chip is configured. If it's not a consequence of manufacturing defects - and I've not heard evidence that it is - then it should be possible to trigger the unsafe voltages on enough samples that they should've been able to catch it. Keep in mind that they're aware of how much manufacturing variability there is, so they should know how many & what diversity of samples they need to test.
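For what it's worth, here's the back-of-the-envelope version of that sample-count question, under the simplifying assumption that each tested chip independently exposes the behavior with probability p; the p values are placeholders, not measured rates.

```python
import math

# Samples needed to observe the behavior at least once with 95% confidence,
# assuming each chip independently shows it with probability p (illustrative).
CONFIDENCE = 0.95
for p in (0.50, 0.10, 0.01):
    n = math.ceil(math.log(1 - CONFIDENCE) / math.log(1 - p))
    print(f"p = {p:.0%}: ~{n} samples for {CONFIDENCE:.0%} detection confidence")
```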

You're making assumptions still and placing blame based on these assumptions.
I've worked with a lot of QA departments and engineers, in my career. I've seen a lot of bugs. I have a pretty clear idea how things should be tested. These chips are immensely complex and it's not by accident that they work as well as they do. Somebody dropped the ball, here. We lack the insight to say where the failure was, but it was definitely dropped.

Let me ask you this: if a certain model of airplane was just occasionally falling out of the sky, would you take the same stance? Would you just accept that maybe something bad happened and no one is to blame? Or would you look at how the defect could've been detected and hold the manufacturer to account for failing to have the proper processes and resources in place to have found it?

You've ignored any other possibility the entire time and you're still doing so.
What other possibilities absolve Intel from this? Even if it's a manufacturing defect, that's something they're supposed to be able to test for, and it's not like they have anyone else to blame.

Without someone internally releasing information there's no way to be certain so speaking with certainty about it is dishonest.
The only certainty here seems to be your assertion that this whole thing is overblown. You simply don't know that.

That's why I quoted Masters with Smith (as opposed to just Smith) because he's been directly involved with chip design. Maybe you should have checked before assuming qualifications?
First, why even bother quoting Ryan Smith? Second, Jon Masters' quote did not do anything to absolve Intel's chip designers nor their QA team. If anything, it sounded like casting aspersions on the design. The only leeway he cut them was on the debugging side of the problem.

You really need to stop conflating the difficulty of root-causing a problem with the difficulty of testing for it. I've seen many bugs that were easy to trigger & reproduce, but incredibly difficult to actually find and fix! This is actually the norm! Yes, there are some bugs for which the converse is true, or where it's both hard to test for & find them, but it's not clear to me this is necessarily such a case.

you should stop writing as though you're certain when you're just guessing like everyone else.
Ditto.
 

bit_user
Do you also consider x3ds acting as hand grenades
Why don't you just call them nuclear bombs? It would be almost as accurate as labeling the hand grenades.

management pressure to rush and release a product with competitive performance? Or do you only feel that way about Intel?
Classic whataboutism, right there. One problem had thousands of known incidents - maybe more - while the other had only a handful. A more rare problem is easier to understand how they could've missed, but it did indeed highlight a miss in their testing. Anyway, AMD promptly issued a fix and that was that.

BTW, there's only one kind of person who tries to defend Intel by trying to change the subject with an exaggerated attack on AMD. Don't think we don't all see exactly what you're doing, here.
 

TheHerald
Why don't you just call them nuclear bombs? It would be almost as accurate as labeling the hand grenades.


Classic whataboutism, right there. One problem had thousands of known incidents - maybe more - while the other had only a handful. A more rare problem is easier to understand how they could've missed, but it did indeed highlight a miss in their testing. Anyway, AMD promptly issued a fix and that was that.

BTW, there's only one kind of person who tries to defend Intel by trying to change the subject with an exaggerated attack on AMD. Don't think we don't all see exactly what you're doing, here.
The point I replied to was about how Intel let it slip through their QA due to the culture going on in the company. So my question is very valid: is the same culture also present at AMD, letting things slide through their QA, etc.? How severe the problem was (obviously AMD's was much more severe) has nothing to do with the problem slipping past them.

And no, the issue isn't fixed. Increased VSOC when enabling XMP isn't a problem; as I've said before, that has been a thing for decades on both AMD and Intel. The problem with those CPUs was that instead of the CPU just degrading due to the voltage, it basically fried itself and the mobo because it has no failsafes. It still has none; it hasn't been fixed.
 

bit_user
...and the plot sickens.

WCCFTech said:
... (Intel) stated the performance impact will be within the "run-to-run" variation, including some synthetic apps such as Cinebench R23, Speedometer, Crossmark, etc. Run-to-run variation is where you find the performance difference almost unnoticeable and is generally under the margin of error. This means that a +-1% performance difference is expected when you run the same application a few times with the same hardware parameters.

However, this doesn't seem to be the case with the new BIOS patch. As tested by the Chiphell forum user 'twfox', Intel CPUs are seeing a performance loss in synthetic benchmarks.
WCCFTech said:
the Intel Core i9 13900K saw a noticeable 6.5% drop in performance in Cinebench R15.
...
In Cinebench R23, the multi-core score has dropped to 37276 points, ... Even though this is just a 2% decrease, it's hardly under the margin of error.

Source: https://wccftech.com/intel-14th-13th-gen-cpus-0x12b-microcode-bios-patch-performance/
 

YSCCC
Here's what you don't seem to be grasping: What if a majority of the chips don't die with your hypothetical 1.65v? What if everything they tested lasted the expected duration under their accelerated aging testing?

A lot of people seem to be working on the assumption that the high voltage spikes will kill every chip, when that may not be the case. Without exact disclosures, which Intel would likely never make, the specifics will never be known.
They limit the max spike to 1.55 V now, which should mean voltages shouldn't need to go above, say, 1.6 V for real safety. After the announcement of the voltage issues there were tons of people reporting their VID requests, and since Intel actually bins the CPUs and has the V/F curve built in, it's extremely reasonable to test the worst-binned ones for safety, not the best ones, unlike consumers, for whom everything is luck. And that is no excuse for the incompetence.
Do you also consider X3Ds acting as hand grenades to be the result of management pressure to rush out a product with competitive performance? Or do you only feel that way about Intel?
I consider the X3D slip as incompetent as the voltage slip in RPL

But that's only on the part about it slipping past QA.

AMD fixed it in literally 1-2 weeks, right at launch, and a few months later nothing else has popped up, so I call it a bare pass and something buyable.

While RPL? The fix comes right at the EOL of the product, and heck, the admission of their fault comes at EOL as well! That is what I call complete incompetence and toxic company culture.
 

YSCCC
Intel are the driver, still trying to pin some of the blame on Porsche (the motherboard manufacturers) when they don't know how to drive (write correct microcode).
And it's actually simple logic to choose what to purchase next: one admits early and issues a fix/mitigation, the other dodges until their next lineup comes out and the old one is phasing out. Given how much they care about their own reputation and the financial situation they are in…
 
How about, instead of casting broad aspersions and vague allegations, you challenge specific statements, so that I know what the heck you're going on about and actually have a chance to answer them?

This whole time, I can't shake the feeling you're trying to bait me into a strawman argument. It's as if you're really seeking a foil, but I'm not about to be your patsy.
I'm really not sure what you don't understand here so I guess it's best to just drop it because I thought it was extremely clear what this part was about.
You really need to stop conflating the difficulty of root-causing a problem with the difficulty of testing for it. I've seen many bugs that were easy to trigger & reproduce, but incredibly difficult to actually find and fix! This is actually the norm! Yes, there are some bugs for which the converse is true, or where it's both hard to test for & find them, but it's not clear to me this is necessarily such a case.
You keep claiming I'm conflating when I'm literally just talking about the path of resolution not the diagnosis of the problem. You're the one who keeps claiming I'm conflating two things while quoting me speaking of one.
First, why even bother quoting Ryan Smith? Second, Jon Masters' quote did not do anything to absolve Intel's chip designers nor their QA team. If anything, it sounded like casting aspersions on the design. The only leeway he cut them was on the debugging side of the problem.
Because he was posting in response to Smith.
Let me ask you this: if a certain model of airplane was just occasionally falling out of the sky, would you take the same stance? Would you just accept that maybe something bad happened and no one is to blame? Or would you look at how the defect could've been detected and hold the manufacturer to account for failing to have the proper processes and resources in place to have found it?
This is almost as dumb of an analogy attempt as TheHerald's earlier Porsche one. That's an industry which is actually regulated and when accidents happen there's an agency which investigates them and publicly releases findings. That means nobody has to guess how these things happen as it will be publicly exposed. There's also actual accountability there as the government can step in and force the product to be sidelined before finding the cause.
I went to some great trouble to explain this: you shouldn't need to test the CPU to failure in order to expose the unsafe voltages. The failure is a consequence of the unsafe voltages occurring repeatedly, many many times. So, it's just a matter of creating the right circumstances, which seems like it shouldn't be hard to do if you just designed a stress test to ramp frequencies up and down over different timescales while varying the other independent variables. That seems like it ought to be a part of the test plan for the parts of the chip which are involved in managing voltage.

As an aside, you seem pretty smart, but I'm not sure if you've ever worked as an engineer. If not, then I can at least see why you might not have a clear idea how QA teams do their job. When I worked at a fabless chip company, they actually had a separate QA team for doing system-level testing than the folks who wrote the lower level tests. Again, they're not limited to testing these chips and their functional units like you or I would. Only the SQA team did testing at the level of a chip on a board, running actual workloads.


Yes. The problem is not caused by silicon defects. It seems mostly workload-specific, as well as being affected by other parameters concerning how the chip is configured. If it's not a consequence of manufacturing defects - and I've not heard evidence that it is - then it should be possible to trigger the unsafe voltages on enough samples that they should've been able to catch it. Keep in mind that they're aware of how much manufacturing variability there is, so they should know how many & what diversity of samples they need to test.
Except what if it is silicon based? We simply do not know anything beyond what triggers the problem. We do not know if it happens to all of the CPUs if they're exposed to said voltage or if it's some percentage.

Here's a hypothetical with made up numbers to try to make myself more clear:
What if 10% of all B0 dies cannot handle the raised voltages? Of that 10%, there's an unknown fraction which would even end up with profiles that would attempt to apply said voltages. At this point you'd be looking at a small enough number that it could make it through testing without them ever seeing it.
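Running the same kind of back-of-the-envelope math on that hypothetical: the 10% figure is from the made-up example above, and the 1-in-4 "profile actually applies the voltages" fraction is an additional assumption purely for illustration.

```python
# Chance a pre-release test campaign sees zero failures, given the hypothetical
# above: 10% of dies affected, and assume only 1 in 4 of those ever gets a
# profile that applies the problem voltages. All numbers are illustrative.
p_visible = 0.10 * 0.25  # probability a random test sample would show the issue
for n in (20, 100, 400):
    p_miss = (1 - p_visible) ** n
    print(f"{n:>3} test samples: {p_miss:.1%} chance of seeing nothing at all")
```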

This is still obviously an Intel problem, but hardly some damning indictment of their QA or process. One would sure hope diagnosing this problem provided them with ways to add to their QA process no matter what.

If it is as you and YSCCC think then that would be extremely damning and make nothing they put out trustworthy.
 
They limit the max spike to 1.55 V now, which should mean voltages shouldn't need to go above, say, 1.6 V for real safety. After the announcement of the voltage issues there were tons of people reporting their VID requests, and since Intel actually bins the CPUs and has the V/F curve built in, it's extremely reasonable to test the worst-binned ones for safety, not the best ones, unlike consumers, for whom everything is luck. And that is no excuse for the incompetence.
You're still missing the point I'm making:

What if most chips do not die under those circumstances in the first place?

If it's as simple as all of them degrade under high voltage/spikes then that would absolutely be the failure you keep claiming it is.
I consider the X3D slip as incompetent as the voltage slip in RPL

But that's only on the part about it slipping past QA.

AMD fixed it in literally 1-2 weeks, right at launch, and a few months later nothing else has popped up, so I call it a bare pass and something buyable.
It didn't "slip past" AMD unless you categorize them neglecting to forward information to their board partners as such. It was a horrible oversight, but absolutely nothing like the Intel situation otherwise.

I think the problem by itself is worse than Intel's since it was something they already knew, had dealt with on past platforms and just didn't forward the information. Though at the same time this allowed for a swift resolution which means it didn't linger. Asus also ended up helping AMD here by being stupid about the BETA BIOS voiding warranty which shifted the public conversation.

Intel on the other hand fumbled the response to their issue at every turn until they had started narrowing in on a root cause.
 

TheHerald
I consider the X3D slip as incompetent as the voltage slip in RPL

But that's only on the part about it slipping past QA.

AMD fixed it in literally 1-2 weeks, right at launch, and a few months later nothing else has popped up, so I call it a bare pass and something buyable.

While RPL? The fix comes right at the EOL of the product, and heck, the admission of their fault comes at EOL as well! That is what I call complete incompetence and toxic company culture.
Ok then, we are almost in agreement. Although I disagree with the "fixed" part: it's not fixed, they just locked the voltage down so you can't expose the issue. There are actually complaints on Reddit from people not being able to hit their RAM frequencies (6000-6400) because of the reduction in voltage. Because, again, the problem isn't the CPU degrading; that happens to every CPU when you apply enough voltage. The problem is that the CPU literally cooks itself and the socket.
 
Ok then, we are almost in agreement. Although I disagree with the "fixed" part: it's not fixed, they just locked the voltage down so you can't expose the issue. There are actually complaints on Reddit from people not being able to hit their RAM frequencies (6000-6400) because of the reduction in voltage. Because, again, the problem isn't the CPU degrading; that happens to every CPU when you apply enough voltage. The problem is that the CPU literally cooks itself and the socket.
Your stance doesn't really make much sense as it's an upper limit for safe voltage. AM4 boards had similar limits and this isn't really any different than most power related limits when it comes to sensitive parts of CPUs. The only actual "problem" was AMD neglecting to make it part of the specifications which were sent to board partners. While it's embarrassing and a bad oversight it's also easily rectified.

As for memory clocks AMD still doesn't recommend over 6000 which I assume is due to IO die silicon lottery. They've also had a lot more issues with kits that aren't on QVLs than Intel systems, but a lot of this seemed to have been resolved after the AGESA patch which addressed memory and unlocked mismatched ratios.
 

TheHerald
Your stance doesn't really make much sense as it's an upper limit for safe voltage. AM4 boards had similar limits and this isn't really any different than most power related limits when it comes to sensitive parts of CPUs. The only actual "problem" was AMD neglecting to make it part of the specifications which were sent to board partners. While it's embarrassing and a bad oversight it's also easily rectified.

As for memory clocks AMD still doesn't recommend over 6000 which I assume is due to IO die silicon lottery. They've also had a lot more issues with kits that aren't on QVLs than Intel systems, but a lot of this seemed to have been resolved after the AGESA patch which addressed memory and unlocked mismatched ratios.
Exceeding safety voltages should just degrade the chip. That's fine, I have no issues with that. The chip and the socket melting is a completely different thing. That's not normal. Can you imagine what temperatures are needed to melt the metal pins?
 

YSCCC
Ok then, we are almost in agreement. Although I disagree with the "fixed" part: it's not fixed, they just locked the voltage down so you can't expose the issue. There are actually complaints on Reddit from people not being able to hit their RAM frequencies (6000-6400) because of the reduction in voltage. Because, again, the problem isn't the CPU degrading; that happens to every CPU when you apply enough voltage. The problem is that the CPU literally cooks itself and the socket.
No need to be in agreement; in every sense, the Intel issue is far worse.
 

YSCCC
You're still missing the point I'm making:

What if most chips do not die under those circumstances in the first place?

If it's as simple as all of them degrade under high voltage/spikes then that would absolutely be the failure you keep claiming it is.

It didn't "slip past" AMD unless you categorize them neglecting to forward information to their board partners as such. It was a horrible oversight, but absolutely nothing like the Intel situation otherwise.

I think the problem by itself is worse than Intel's since it was something they already knew, had dealt with on past platforms and just didn't forward the information. Though at the same time this allowed for a swift resolution which means it didn't linger. Asus also ended up helping AMD here by being stupid about the BETA BIOS voiding warranty which shifted the public conversation.

Intel on the other hand fumbled the response to their issue at every turn until they had started narrowing in on a root cause.
For the "if they don't die" part: that is beyond goodwill to Intel IMO. Electromigration is physics; it always happens, but it gets worse with high current and voltage. It's pretty normal to stress parts at even much higher voltages so they degrade in minutes, hours, and weeks before launch, and to plot that curve to check whether the "Vmin shift" would happen within 3 years for most of the chips, let alone one year. The better-binned ones only survive longer because they request lower voltages; under extreme voltages they will degrade similarly. They can now identify 1.55 V as the safe limit by whatever method, so they should've been able to do the same before release, or at least before 14th Gen kicked off.
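For reference, the textbook form of the electromigration lifetime model being alluded to is Black's equation; this is standard reliability-engineering material, not anything Intel has published about RPL.

```latex
% Black's equation for electromigration mean time to failure, and the
% acceleration factor between stress and use conditions (textbook form):
\[
  \mathrm{MTTF} \;=\; A\, J^{-n} \exp\!\left(\frac{E_a}{k_B T}\right),
\qquad
  \mathrm{AF} \;=\;
  \left(\frac{J_{\mathrm{stress}}}{J_{\mathrm{use}}}\right)^{\!n}
  \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_{\mathrm{use}}}-\frac{1}{T_{\mathrm{stress}}}\right)\right]
\]
% A: empirical constant, J: current density, n: current-density exponent,
% E_a: activation energy, k_B: Boltzmann constant, T: absolute temperature.
```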

It would be even worse, and more incompetent, if their QA system has no way to catch early degradation at stock settings, because it means the same can slip past for upcoming generations.


I do count failing to inform vendors of the max safe voltage for X3D as a fail and an incompetent mistake on AMD's side; the stacked chip causing heat-transfer issues should have been flagged before the release of the BIOS supporting the new SKU. But since they fixed it soon enough, to me it is a pass for AMD in terms of confidence, and only a mishap in slipping past the initial stage.
 
For the "if they don't die" part: that is beyond goodwill to Intel IMO. Electromigration is physics; it always happens, but it gets worse with high current and voltage. It's pretty normal to stress parts at even much higher voltages so they degrade in minutes, hours, and weeks before launch, and to plot that curve to check whether the "Vmin shift" would happen within 3 years for most of the chips, let alone one year. The better-binned ones only survive longer because they request lower voltages; under extreme voltages they will degrade similarly. They can now identify 1.55 V as the safe limit by whatever method, so they should've been able to do the same before release, or at least before 14th Gen kicked off.
This is called a guess. We have already established your opinion, as you keep repeating yourself. You might be right, but if you are, Intel failed miserably at the most basic part of their validation process. It's perfectly fine for you to believe this, but if you dismiss the possibility that you might be wrong, you're just being willfully ignorant.
 

YSCCC
This is called a guess. We have already established your opinion, as you keep repeating yourself. You might be right, but if you are, Intel failed miserably at the most basic part of their validation process. It's perfectly fine for you to believe this, but if you dismiss the possibility that you might be wrong, you're just being willfully ignorant.
Can't agree with this, as this is literally the first time in PC history that such a degradation issue has slipped past the design phase and reached customers. There may be more complicated issues combining to create this mess, but since every design has a capped safe voltage for long-term usage, and since the last 0x125 and 0x129 updates basically amount to not calling for crazy, stupid voltages, there is no way this is something unthinkable rather than Intel being incompetent. Sure, my guess might not, and should not, be the whole picture, or I'd be the chief engineer at Intel or AMD, but whether or not my guess is the real case doesn't make Intel competent.

And the IF you raised back then is basically logically or statistically impossible. I can't imagine that in their internal testing, at say 1.6 V+, 99.9% of their test samples were fine and stable, but then out in the field there were so many failures that it became such a long, dreaded issue. Unless they aren't testing their worst bins in QA but some random best bins... or, say, their test only checks that the CPU keeps booting rather than running some kind of stress test without error...

But if those are the cases, it's even more worrying or incompetent than if they just forgot to look at basic electromigration issues.

It's Intel, for God's sake, not some new Chinese chip maker; their QA procedures should be able to find degradation-related issues well before launch. It's not even like X3D, where the stacked chip transfers heat so slowly that high voltage will melt it. Plus this is a "tock" gen, so the Alder Lake test results should have hinted at how far they could go.

But of course, you are free to think this is just unfortunate, and that Intel is simply unlucky that this rare, hard-to-find degradation didn't happen to most of their test CPUs but somehow, out in the field, things got so bad that they ran out of replacement CPUs.

P.S. I do admit I am extra cautious about Intel after the via oxidation issue was disclosed. They had a known batch of faulty silicon from manufacturing and caught it themselves, yet they didn't even want to recall those parts, just waited and RMA'd whatever was sent back to them; that alone says a lot about how much they care about user experience. Now, in the gen where they had to go the refresh route, a degradation issue surfaces; it's hard to believe it isn't something rushed, or that management just pushed it through QA while ignoring some fundamental issues.

And edit: if I understand correctly, what you would not call a guess would be some kind of admission from Intel that they failed hard and missed basic QA. If that's the bar, this will always remain a guess and never become truth, because they are a listed giant; it took 300+ lives and criminal investigations for Boeing to admit that management had pushed past the engineering concerns. Since RPL isn't life-threatening at all, no such investigation will ever be held, and we all know what big corporates will say. Everyone else is making their best/educated guess, but to me, it is Intel who has to prove this is just misfortune and not incompetence, not outsiders who have to prove they screwed up big.
 

TheHerald
Can't agree with this, as this is literally the first time in PC history that such a degradation issue has slipped past the design phase and reached customers. There may be more complicated issues combining to create this mess, but since every design has a capped safe voltage for long-term usage, and since the last 0x125 and 0x129 updates basically amount to not calling for crazy, stupid voltages, there is no way this is something unthinkable rather than Intel being incompetent. Sure, my guess might not, and should not, be the whole picture, or I'd be the chief engineer at Intel or AMD, but whether or not my guess is the real case doesn't make Intel competent.

And the IF you raised back then is basically logically or statistically impossible. I can't imagine that in their internal testing, at say 1.6 V+, 99.9% of their test samples were fine and stable, but then out in the field there were so many failures that it became such a long, dreaded issue. Unless they aren't testing their worst bins in QA but some random best bins... or, say, their test only checks that the CPU keeps booting rather than running some kind of stress test without error...

But if those are the cases, it's even more worrying or incompetent than if they just forgot to look at basic electromigration issues.

It's Intel, for God's sake, not some new Chinese chip maker; their QA procedures should be able to find degradation-related issues well before launch. It's not even like X3D, where the stacked chip transfers heat so slowly that high voltage will melt it. Plus this is a "tock" gen, so the Alder Lake test results should have hinted at how far they could go.

But of course, you are free to think this is just unfortunate, and that Intel is simply unlucky that this rare, hard-to-find degradation didn't happen to most of their test CPUs but somehow, out in the field, things got so bad that they ran out of replacement CPUs.

P.S. I do admit I am extra cautious about Intel after the via oxidation issue was disclosed. They had a known batch of faulty silicon from manufacturing and caught it themselves, yet they didn't even want to recall those parts, just waited and RMA'd whatever was sent back to them; that alone says a lot about how much they care about user experience. Now, in the gen where they had to go the refresh route, a degradation issue surfaces; it's hard to believe it isn't something rushed, or that management just pushed it through QA while ignoring some fundamental issues.

And edit: if I understand correctly, what you would not call a guess would be some kind of admission from Intel that they failed hard and missed basic QA. If that's the bar, this will always remain a guess and never become truth, because they are a listed giant; it took 300+ lives and criminal investigations for Boeing to admit that management had pushed past the engineering concerns. Since RPL isn't life-threatening at all, no such investigation will ever be held, and we all know what big corporates will say. Everyone else is making their best/educated guess, but to me, it is Intel who has to prove this is just misfortune and not incompetence, not outsiders who have to prove they screwed up big.
Of what benefit would it be for Intel to push it through despite the QA issues? That makes no sense.
 

bit_user
This is almost as dumb of an analogy attempt as TheHerald's earlier Porsche one. That's an industry which is actually regulated and when accidents happen there's an agency which investigates them and publicly releases findings. That means nobody has to guess how these things happen as it will be publicly exposed. There's also actual accountability there as the government can step in and force the product to be sidelined before finding the cause.
It's a fully relevant analogy, because airplane parts and material have variations. The design should account for those variations and the test plan should involve an adequate sample to ensure the range of possible variations are covered.

Essentially, what you're saying is that some variations could exist in Intel's manufacturing process that Intel lacks the visibility to see or the ability to control. Yet, they're common enough for a significant number of these failures to happen in otherwise well-qualified chips. There is no excuse for this. Intel must know the variabilities that exist in their manufacturing process and account for those in their testing. I'm sure they already do this, in fact, or else it's not plausible their products would work as well as they have.

Except what if it is silicon based? We simply do not know anything beyond what triggers the problem.
Unless you believe in voodoo and witchcraft, there's no real excuse for Intel missing this.

I honestly don't know how you think Intel managed to deliver CPUs that worked so well, prior to this. You can't do that without extensive verification. One estimate I've heard is that chip companies traditionally devote about twice the resources to verification as design.

Here's a hypothetical with made up numbers to try to make myself more clear:
What if 10% of all B0 dies cannot handle the raised voltages? Of that 10%, there's an unknown fraction which would even end up with profiles that would attempt to apply said voltages. At this point you'd be looking at a small enough number that it could make it through testing without them ever seeing it.
The Alderon Games situation suggests it's not so much a question of what the silicon can handle, as much as it is the workload. They were running at a 100% failure rate, with the only major variable being the time-to-failure.
 
