News Intel finds root cause of CPU crashing and instability errors, prepares new and final microcode update


bit_user

Titan
Ambassador
Of what benefit would it be for Intel to push it through the QA issues? That makes no sense.
Cost cutting (by reducing QA headcount) or more likely schedule crunch (could be both!). The launch dates of products are somewhat immovable and engineering is often late. This leads to QA getting squeezed and corners being cut. That's one possible explanation, and definitely something I've seen many times.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
Cost cutting (by reducing QA headcount) or more likely schedule crunch. The launch dates of products are somewhat immovable and engineering is often late. This leads to QA getting squeezed and corners being cut. That's one possible explanation, and definitely something I've seen many times.
He said Intel ignored the issues from QA; that doesn't amount to cost cutting. I'm saying that if the issue had been detected at QA, then obviously the CPUs wouldn't have launched. Especially if it's as simple as "lower vcore from 1.55 to 1.53", then yeah, that's an easy fix.

There isn't even a performance incentive; the difference from 5.7 to 5.6 GHz is under 2%.
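Quick arithmetic check on that figure:

$$\frac{5.7 - 5.6}{5.7} \approx 0.018 \approx 1.8\%$$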
 
There isn't even a performance incentive; the difference from 5.7 to 5.6 GHz is under 2%.
It's pretty obvious that a significant Intel incentive lately has been chasing headline GHz figures. The only reason the 14900KS existed was to say "Look!!! 6.2 GHz!!!"

Claims that Intel wouldn't have done XYZ just to gain a few % or MHz would charitably be described as shaky at best.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
It's pretty obvious that a significant Intel incentive lately has been chasing headline GHz figures. The only reason the 14900KS existed was to say "Look!!! 6.2 GHz!!!"

Claims that Intel wouldn't have done XYZ just to gain a few % or MHz would charitably be described as shaky at best.
Everyone does this? Have you seen a presentation from competing companies? They have their clock speeds front and center. I can provide you with some links if you want.
 

YSCCC

Notable
Dec 10, 2022
444
341
1,060
He said Intel ignored the issues from QA; that doesn't amount to cost cutting. I'm saying that if the issue had been detected at QA, then obviously the CPUs wouldn't have launched. Especially if it's as simple as "lower vcore from 1.55 to 1.53", then yeah, that's an easy fix.

There isn't even a performance incentive; the difference from 5.7 to 5.6 GHz is under 2%.
The pushing scenario is simple enough to guess at; here's one possible makeup.

Arrow Lake on Intel 20A was supposed to arrive in the 14th-gen time frame or thereabouts. It was only at a late stage that the fab team announced they couldn't do it, with technical issues preventing production from ramping up. And to some extent, AMD's Ryzen 7000 series put pressure on RPL early on. That's more than enough reason for Intel management to ask the engineering team to release top SKUs that "beat the competition" and not let the Zen 1 era, when they were losing to Ryzen all along, repeat itself.

And when time is limited and the architecture is being asked to push beyond its original tock-cycle evolution (RPL's performance increase is more than a normal tock generation's), it is easy to rush in dialing up some extra MHz and overlook the top points of the V/F curve, where the "bug" calling for excessive voltage, which used to not be a problem in ADL, now is in RPL.
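Purely as an illustrative sketch of why the top of the V/F curve is the risky part to rush (the coefficients below are invented and are not Intel's actual fused V/F data, which differs per die):

```python
# Toy V/F curve: required voltage grows super-linearly with frequency,
# so the last few hundred MHz cost far more voltage than they gain.
# All coefficients are invented for illustration.
def vf_voltage(freq_ghz: float) -> float:
    """Quadratic voltage demand above a 3.0 GHz / 0.90 V baseline (toy model)."""
    base_f, base_v = 3.0, 0.90
    df = freq_ghz - base_f
    return base_v + 0.035 * df + 0.055 * df * df

for f in (5.0, 5.5, 5.6, 5.7, 6.0):
    print(f"{f:.1f} GHz -> ~{vf_voltage(f):.2f} V")
```

In this toy model, 5.5 to 6.0 GHz is a ~9% frequency gain but a ~0.17 V jump, and it's exactly those top one or two V/F points where a rushed dial-in can end up requesting damaging voltage.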
 
  • Like
Reactions: Saldas and bit_user
It's a fully relevant analogy, because airplane parts and materials have variations. The design should account for those variations, and the test plan should involve an adequate sample to ensure the range of possible variations is covered.
No, it's really not relevant, because the industry itself is built entirely differently. That would be like comparing radio-controlled cars and full-size cars because they both have electronics and wheels with manufacturing tolerances that have to be accounted for, while blatantly ignoring the structure of the relevant industry in the process.
The Alderon Games situation suggests it's not so much a question of what the silicon can handle, as much as it is the workload. They were running at a 100% failure rate, with the only major variable being the time-to-failure.
Yeah, let's pick as an example an indie dev who clearly comes across as having an axe to grind. None of the other reporting on this issue backed up their claims. They're also the ones who said it was happening with laptop chips. Confirmation bias at its finest.
Unless you believe in voodoo and witchcraft, there's no real excuse for Intel missing this.

I honestly don't know how you think Intel managed to deliver CPUs that worked so well prior to this. You can't do that without extensive verification. One estimate I've heard is that chip companies traditionally devote about twice as many resources to verification as to design.
So somehow you think it makes more sense that Intel failed at an extremely basic part of validation than that they didn't properly account for silicon variation that affects a minority of parts? I mean, cool, discussion done there; I will never agree it can possibly be that black and white without evidence.

(and yes, I'm quite serious, I'm done with this; feel free to reply, but I won't again)
 
Just a couple of notes; I don't disagree with what you said overall, because if you're right, it's 100% that bad:
Unless they aren't testing their worst bins in QA but some random best bins...
Bins don't necessarily dictate the overall silicon quality but rather its ability to operate under certain parameters.
but if those are the cases, it's even more worrying or incompetent than if they just forgot to look at the basic electromigration issues.
This almost certainly can't be the case, as it would have required them not doing aging tests, and if that's true, none of their stuff would work right except by luck.
But of course, you are free to think this is just unfortunate and that Intel is just unlucky that this rare, hard-to-find degradation didn't happen to most of their test CPUs, yet somehow out in the field they're now in the situation of having run out of replacement CPUs.
You keep bringing up replacement CPUs as if Intel has warehouses full of them. They plan ahead of time for expected replacements and it sounds like ADL was very reliable. This means they may not have planned to have many replacements (I have no doubt if this is the case it would be due to financial people not engineers), and especially for 13th Gen as soon as they knew they were going to be releasing 14th Gen. I've already mentioned how they loosened up retail replacement requirements and they undoubtedly reserved additional OEM replacements first as well. The point I'm trying to make here is that this doesn't mean what you think it does.
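To give a feel for that planning math (every number below is invented for illustration; I have no idea what Intel's real figures are):

```python
# Toy spares planning: how many replacement CPUs to reserve up front.
# All figures are made up for illustration.
units_shipped = 10_000_000   # hypothetical CPUs in the field
planned_afr   = 0.002        # assumed 0.2% annual failure rate at launch
actual_afr    = 0.02         # what a degradation defect might push it to
warranty_yrs  = 3

planned_spares = int(units_shipped * planned_afr * warranty_yrs)   # 60,000
actual_demand  = int(units_shipped * actual_afr * warranty_yrs)    # 600,000
print(f"planned: {planned_spares:,}  actual demand: {actual_demand:,}")
```

A 10x miss on the assumed failure rate turns a comfortable spares pool into a shortage without any warehouse ever having been full.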

I do find it easier to believe it didn't happen in validation (due to scale) than that they just didn't validate (or ignored the results), which is what it would take for you to be right.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
The pushing scenario is simple enough to guess at; here's one possible makeup.

Arrow Lake on Intel 20A was supposed to arrive in the 14th-gen time frame or thereabouts. It was only at a late stage that the fab team announced they couldn't do it, with technical issues preventing production from ramping up. And to some extent, AMD's Ryzen 7000 series put pressure on RPL early on. That's more than enough reason for Intel management to ask the engineering team to release top SKUs that "beat the competition" and not let the Zen 1 era, when they were losing to Ryzen all along, repeat itself.

And when time is limited and the architecture is being asked to push beyond its original tock-cycle evolution (RPL's performance increase is more than a normal tock generation's), it is easy to rush in dialing up some extra MHz and overlook the top points of the V/F curve, where the "bug" calling for excessive voltage, which used to not be a problem in ADL, now is in RPL.
I'm sorry, but that might explain e.g. the 13900K issues, since you can claim it was pushed to take the absolute crown, but it does not explain the 13600K or 13700K problems at all. Those CPUs were released to go against the R5 7600X and the R7 7700X. They easily walked all over them in performance, so there was absolutely no need to push them to high wattages or voltages. They can walk all over the competition at less than 100 watts. Heck, the old 12900K, which the 13700K replaced (and which supposedly doesn't have issues), was already plenty faster than the 7700X.

I don't think it's as simple as "oh yeah, they ignored QA"
 

bit_user

Titan
Ambassador
Yeah, let's pick as an example an indie dev who clearly comes across as having an axe to grind.
They only had an axe to grind after getting burnt by such poor reliability. That doesn't invalidate their data, with which they were very forthcoming.

Confirmation bias at its finest.
Now I'm pretty sure you don't know what that term means, because you're not using it correctly.

So somehow you think it makes more sense that Intel failed at an extremely basic part of validation than that they didn't properly account for silicon variation that affects a minority of parts?
No, you're the one who seemed to be suggesting that they missed it due to manufacturing variations. I just said that's not a valid excuse, because it should be the norm for them to account for that in their testing.

I mean, cool, discussion done there; I will never agree it can possibly be that black and white without evidence.
You can't deny the simple fact that they failed to detect the problem. That's the one inconvenient fact that's been staring you in the face this whole time, and it's not going away!

(and yes, I'm quite serious, I'm done with this; feel free to reply, but I won't again)
IMO, this is just acknowledging what's obvious to most of us. Intel missed a defect which doesn't seem all that rare, even if we can't say exactly how prevalent it is. If they were truly incapable of detecting such defects during QA, then their products would be completely unusable garbage.

Dynamic V/F scaling is hardly new to their products (or others'). It should be something they have lots of experience testing. Manufacturing variations also aren't new, and they should be extremely knowledgeable about how to account for them in their testing. So, the fact that they missed this defect suggests to me that some systemic problem occurred at Intel.
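There's standard sampling math behind this, too. Assuming independent random sampling (with p and c below picked purely for illustration), the number of parts n you must test to see at least one affected unit with confidence c, when a fraction p is susceptible, is:

$$n \ge \frac{\ln(1 - c)}{\ln(1 - p)}$$

For c = 99%, p = 5% gives n ≈ 90, and even p = 1% needs only n ≈ 460: trivial sample sizes at the scale of a CPU validation program.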

This is obviously just my interpretation. I accept that I could be off the mark, but Intel could probably make that case to us, if there were truly a good excuse for missing this.
 

bit_user

Titan
Ambassador
I'm sorry, but that might explain e.g. the 13900K issues, since you can claim it was pushed to take the absolute crown, but it does not explain the 13600K or 13700K problems at all.
Did they say the 13600K was susceptible?

Those CPUs were released to go against the R5 7600X and the R7 7700X.
No, they weren't. AMD didn't name its CPUs according to Intel's scheme. AMD named them according to which Ryzen 5000-generation part they were supposed to succeed. Intel does the same. We've been through this like 100 times already.

They easily walked all over them in performance,
...which is explained by the fact that you got the matchups wrong.

The old 12900K, which the 13700K replaced (and which supposedly doesn't have issues), was already plenty faster than the 7700X.
Because the two were never intended as rivals.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
Did they say the 13600K was susceptible?


No, they weren't. AMD didn't name its CPUs according to Intel's scheme. AMD named them according to which Ryzen 5000-generation part they were supposed to succeed. Intel does the same. We've been through this like 100 times already.


...which is explained by the fact that you got the matchups wrong.


Because the two were never intended as rivals.
You haven't explained anything; you are just claiming that two CPUs with the exact same name, price, and release date weren't meant to compete. Sorry, I need more than just your word for it. Seems to me like they were intended to butt heads.
 

bit_user

Titan
Ambassador
You haven't explained anything; you are just claiming that two CPUs with the exact same name, price, and release date weren't meant to compete.
Again, we've been through this before. Ryzen 7000 launched nearly a month earlier and was entitled to a price premium. Once AMD could see how their product stack matched up against Raptor Lake and could see its pricing structure, they quickly repriced theirs.

As I said before, the whole name argument is BS. AMD was just sticking with their traditional naming scheme. It's a lot more logical for them to do that, than to try and guess how their products will match up against Intel.

Seems to me like they were intended to butt heads.
Maybe in some kind of video game fantasy world.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
Again, we've been through this before. Ryzen 7000 launched nearly a month earlier and was entitled to a price premium. Once AMD could see how their product stack matched up against Raptor Lake and could see its pricing structure, they quickly repriced theirs.

As I said before, the whole name argument is BS. AMD was just sticking with their traditional naming scheme. It's a lot more logical for them to do that, than to try and guess how their products will match up against Intel.


Maybe in some kind of video game fantasy world.
20 days later, that's practically the same.

It doesn't matter what AMD did. I'm saying that at the time of release, Intel obviously put its 13700K against the 7700X. In order for Intel to win against their intended target, they didn't need to push wattages and voltages. That's the whole point I'm making.

What's a fantasy world is you claiming that two products sharing a name, price, and release date aren't competitors. That's just completely crazy.
 

bit_user

Titan
Ambassador
20 days later, that's practically the same.
It's practically the same in your mind, but in the mind of the person deciding what their launch prices will be, they have no firm data on how Intel's lineup will either perform or be priced. So, they priced a bit high (but still lower than Ryzen 5000) and dropped. That's the safe route and a much better position to be in than if you accidentally price too low and make almost no money in the face of strong demand!

I'm saying that at the time of release, Intel obviously put its 13700K against the 7700X.
Nope. I don't believe that, either. Intel has also long had a product stack where the unlocked i7 part was called the i7-#700K. They were just sticking with their traditional scheme and positioned the i7-13700K as the successor to the i7-12700K. Why is that so hard to understand??

In order for Intel to win against their intended target, they didn't need to push wattages and voltages. That's the whole point I'm making.
Your point falls flat when you don't match up the products correctly.

What's a fantasy world is you claiming that two products sharing a name, price, and release date aren't competitors. That's just completely crazy.
No, I've already explained it quite clearly. You're not being realistic.

I'm sure you know that since Ryzen 3000 series, the #600X had 6 cores and one CCD, the #700X and #800X have had 8 cores and one CCD, #900X had 12 cores and two CCDs, and #950X had 16 cores and two CCDs. AMD was simply keeping with that convention.

Similarly, Intel has long had their i3, i5, i7, and later i9 naming convention. They basically took the S-class dies they had, looked at how they binned, and then decided how many cores they could afford to offer at each price point. They sorted those into their product naming scheme and there you have it!
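A toy sketch of that sorting step (the thresholds are invented and are obviously not Intel's real bin criteria):

```python
# Toy die binning: assign a SKU tier from core yield and max stable frequency.
# Thresholds are invented for illustration.
def bin_die(good_p_cores: int, good_e_cores: int, fmax_ghz: float) -> str:
    if good_p_cores >= 8 and good_e_cores >= 16 and fmax_ghz >= 5.7:
        return "i9 K-class"
    if good_p_cores >= 8 and good_e_cores >= 8 and fmax_ghz >= 5.3:
        return "i7 K-class"
    if good_p_cores >= 6 and good_e_cores >= 8 and fmax_ghz >= 5.1:
        return "i5 K-class"
    return "locked / salvage part"

print(bin_die(8, 16, 5.8))  # i9 K-class
print(bin_die(8, 12, 5.4))  # i7 K-class (fewer good E-cores, lower fmax)
```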

The main way they align their product stacks is by pricing. If they tried to align model numbers, every generation, the product stacks would be a complete mess, especially if both of them were trying to second-guess each other! At least by maintaining consistency from one generation to the next, they can maintain some level of sanity which I'm sure their partners and customers appreciate.
 
  • Like
Reactions: Saldas

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
It's practically the same in your mind, but in the mind of the person deciding what their launch prices will be, they have no firm data on how Intel's lineup will either perform or be priced. So, they priced a bit high (but still lower than Ryzen 5000) and dropped. That's the safe route and a much better position to be in than if you accidentally price too low and make almost no money in the face of strong demand!


Nope. I don't believe that, either. Intel has also long had a product stack where the unlocked i7 part was called the i7-#700K. They were just sticking with their traditional scheme and positioned the i7-13700K as the successor to the i7-12700K. Why is that so hard to understand??


Your point falls flat when you don't match up the products correctly.


No, I've already explained it quite clearly. You're not being realistic.

I'm sure you know that since Ryzen 3000 series, the #600X had 6 cores and one CCD, the #700X and #800X have had 8 cores and one CCD, #900X had 12 cores and two CCDs, and #950X had 16 cores and two CCDs. AMD was simply keeping with that convention.

Similarly, Intel has long had their i3, i5, i7, and later i9 naming convention. They basically took the S-class dies they had, looked at how they binned, and then decided how many cores they could afford to offer at each price point. They sorted those into their product naming scheme and there you have it!
And the reason AMD dropped those prices was that the i7 was much faster. If it had been slower, the prices would have stayed where they were.

Neither Intel nor AMD cares about keeping their product names. Look at the GPU side: AMD changes them frequently and moves products up and down the stack.

Anyway, if matching price, naming scheme, and release date isn't enough for two CPUs to be competitors, I have no idea how I'd ever compare two CPUs with each other. Say the new Arrow Lake Ultra 7 has a $369 MSRP, same as the 9700X. They aren't meant to compete?
 

bit_user

Titan
Ambassador
Neither Intel nor AMD cares about keeping their product names.
This is obviously not true.

Look at the GPU side: AMD changes them frequently and moves products up and down the stack.
GPUs are a different business. The number of cores, CUs, EUs, or SMs they have doesn't affect users in the same way as the number of cores a CPU has, so you don't have quite the same sort of unit-continuity issue.

And, even in the GPU business, AMD retains a similar naming scheme from one generation to the next, a lot more often than they change it. So, yeah, they changed when going from GCN to RDNA, but now it's had the same structure since RDNA 1 and each product with the same last digits matches up favorably against the corresponding one from the previous generation.

Anyway, if matching price, naming scheme, and release date isn't enough for two CPUs to be competitors, I have no idea how I'd ever compare two CPUs with each other.
Match up based on specialization, current pricing, and power.

So, like, it's nuts to compare an R9 7950X against an i9-14900T, because nobody is realistically trying to decide between a 230 W and a 35 W CPU.

As an example of specialization, if someone just cares about gaming performance, you wouldn't have them comparing 7800X3D vs. i7-14700, even though they use similar power and usually have similar pricing.

In other words, just use common sense!
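If you wanted to mechanize that common sense, a rough sketch might look like this (the CPU entries and the 25% tolerance are hypothetical, chosen just to show the idea):

```python
# Hypothetical comparison filter: only match CPUs in the same power and price class.
cpus = [
    {"name": "R9 7950X",  "watts": 230, "price": 550},
    {"name": "i9-14900K", "watts": 253, "price": 545},
    {"name": "i9-14900T", "watts": 35,  "price": 549},
]

def comparable(a: dict, b: dict, tol: float = 0.25) -> bool:
    """True if two parts are within a fractional tolerance on power and price."""
    close = lambda x, y: abs(x - y) <= tol * max(x, y)
    return close(a["watts"], b["watts"]) and close(a["price"], b["price"])

ref = cpus[0]
print([c["name"] for c in cpus[1:] if comparable(ref, c)])  # ['i9-14900K']
```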
 
Product names don't really mean anything to anyone but the company involved. You can certainly compare MSRPs as a point of reference for what the company thinks a product is worth; however, these don't matter to anyone buying unless they match the retail price at the time of purchase.
 

TheHerald

Notable
Feb 15, 2024
1,288
355
1,060
This is obviously not true.


GPUs are a different business. The number of cores, CUs, EUs, or SMs they have doesn't affect users in the same way as the number of cores a CPU has, so you don't have quite the same sort of unit-continuity issue.

And, even in the GPU business, AMD retains a similar naming scheme from one generation to the next, a lot more often than they change it. So, yeah, they changed when going from GCN to RDNA, but now it's had the same structure since RDNA 1 and each product with the same last digits matches up favorably against the corresponding one from the previous generation.


Match up based on specialization, current pricing, and power.

So, like, it's nuts to compare an R9 7950X against an i9-14900T, because nobody is realistically trying to decide between a 230 W and a 35 W CPU.

As an example of specialization, if someone just cares about gaming performance, you wouldn't have them comparing 7800X3D vs. i7-14700, even though they use similar power and usually have similar pricing.

In other words, just use common sense!
No, they literally changed the names with RDNA 3. They moved the whole stack up, resulting in their 7800 XT having the same performance as the 6800 XT, because it was actually meant to replace the 6700 XT.

Using your method you can't really compare anything. You can't compare the 7950X to the 14900K, because nobody interested in a 220 W CPU will also be interested in a 400 W CPU. And I can keep it up for ages, pointing to 150 differences between the two.

What I find funny is that you keep changing your arguments. In the Zen 5 review thread, I said the 9700X replaces the 7700 and is therefore more expensive, yet you insisted that it's replacing the 7700X, even though they are on completely different power levels. According to the argument you just presented, I was correct: the 9700X is competing with the 7700, since they both have a 65 W TDP. Nobody interested in a 65 W TDP CPU will also be interested in a 105 W TDP CPU, right?
 
So, like, it's nuts to compare an R9 7950X against an i9-14900T, because nobody is realistically trying to decide between a 230 W and a 35 W CPU.
I know the point you're making and agree, but the 7950X (and obviously the 9950X) is the only 16-core desktop part from AMD, so it in Eco Mode would be the only logical comparison anyone could make!
 

bit_user

Titan
Ambassador
You picked the only source of information which backs up the point you were making.
What I did was to look for an example where the same workload reliably caused CPUs to fail. If it were true that only a small proportion of them were susceptible to the problem, such an example shouldn't exist.

That is literally the definition of confirmation bias.
No, you still don't seem to understand what it means.

I'm not being hyperbolic when I say that this reminds me of a section from Stephen Hawking's Reith Lecture on black holes. He discusses the issue of whether information is destroyed upon entering them. He's aware that, to a general audience, this might seem like an insane thing to worry about, but he justifies it as follows:

"What began as an explanation of what happens at an event horizon has deepened into an exploration of some of the most important philosophies in science - from the clockwork world of Newton to the laws of Laplace to the uncertainties of Heisenberg - and where they are challenged by the mystery of black holes. Essentially, information entering a black hole should be destroyed, according to Einstein's Theory of General Relativity while quantum theory says it cannot be broken down, and this remains an unresolved question.

If information were lost in black holes, we wouldn't be able to predict the future, because a black hole could emit any collection of particles.

It could emit a working television set, or a leather-bound volume of the complete works of Shakespeare, though the chance of such exotic emissions is very low.

It might seem that it wouldn't matter very much if we couldn't predict what comes out of black holes. There aren't any black holes near us. But it is a matter of principle.

If determinism, the predictability of the universe, breaks down with black holes, it could break down in other situations. Even worse, if determinism breaks down, we can't be sure of our past history either.

The history books and our memories could just be illusions. It is the past that tells us who we are. Without it, we lose our identity.

It was therefore very important to determine whether information really was lost in black holes, or whether in principle, it could be recovered."

Source: https://speakola.com/ideas/stephen-hawking-reith-lectures-black-holes-depression-2016

The relation to this subject is that if your conjecture is true, that only very few Raptor Lake dies are susceptible to the problem, then it shouldn't be the case that even a single real-world workload exists that causes a near-100% failure rate! So, I only need one example of such a workload to show that the problem should be possible to manifest on most or all dies!
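In plain math, assuming the machines fail independently (and picking p and N below purely for illustration): if only a fraction p of dies were susceptible, the chance of a fleet of N machines all failing would be

$$P(\text{all } N \text{ fail}) = p^{N}, \qquad \text{e.g. } 0.1^{12} = 10^{-12}$$

so a near-100% observed failure rate is only plausible if p itself is near 1.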
 

YSCCC

Notable
Dec 10, 2022
444
341
1,060
Just a couple of notes; I don't disagree with what you said overall, because if you're right, it's 100% that bad:

Bins don't necessarily dictate the overall silicon quality but rather its ability to operate under certain parameters.

This almost certainly can't be the case, as it would have required them not doing aging tests, and if that's true, none of their stuff would work right except by luck.

You keep bringing up replacement CPUs as if Intel has warehouses full of them. They plan ahead of time for expected replacements and it sounds like ADL was very reliable. This means they may not have planned to have many replacements (I have no doubt if this is the case it would be due to financial people not engineers), and especially for 13th Gen as soon as they knew they were going to be releasing 14th Gen. I've already mentioned how they loosened up retail replacement requirements and they undoubtedly reserved additional OEM replacements first as well. The point I'm trying to make here is that this doesn't mean what you think it does.

I do find it easier to believe it didn't happen in validation (due to scale) than that they just didn't validate (or ignored the results), which is what it would take for you to be right.
You keep claiming I said this, but I didn't say they didn't validate or ignored the results; maybe that got lost in the passages. What I meant is: they had political pressure from management and rushed, or pushed, the SKUs too far, and the validation wasn't as complete as in previous generations. So either they missed the signs because of that, or the signs weren't definitive before release. Like, with made-up numbers: in ADL, 1% showed minor degradation, and in RPL, 5% showed some sort of degradation within the projected life expectancy, without a definitive cause identified. But launch time was approaching, so management would press: "Is it definite that all will fail, or just likely a higher failure rate?"

As for the engineers, since there was no exact proof, they most likely would say "it's possible, and in the worst case we'd be in big trouble." But without solid evidence (insufficient testing time for decoding the root cause, which, judging by how they're handling it now, needed two months and extensive failed samples), the engineers could only call it a possibility.

As such, you know what the managers and bean counters would order: go forward and take the MT performance crown.

It's not pure incompetence on the engineering team's part; I have as much confidence in the original engineering team as I have in Boeing's. But bosses hate engineers whose negative views dominate, and the incompetence IMO lies, just as at Boeing, in the culture shift: from never risking quality (stability/safety) at any cost, even if it means losing the "most advanced" title, to fighting for stock price and sales figures while sidelining the warning signs and the engineers' opinions. That's why I keep saying I'll be skeptical of new-architecture SKUs from Intel right now: it's an incompetent culture, not the SOPs or the engineers. Not to mention they've now had a lot of big-name engineers leave or get fired. What's left is a big question mark that will need time to prove itself.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
I know the point you're making and agree, but the 7950X (and obviously the 9950X) is the only 16-core desktop part from AMD, so it in Eco Mode would be the only logical comparison anyone could make!
Okay, but then it's not a straight comparison. It'd be a comparison of those CPUs with a modified configuration vs. the i9-14900T.

The configuration is an intrinsic part of how it behaves. In some cases, it's as important as which model number you're using.
 
What I did was to look for an example where the same workload reliably caused CPUs to fail. If it were true that only a small proportion of them were susceptible to the problem, such an example shouldn't exist.
If such a workload even did exist, and if that's why their processors failed (PCWorld had a motherboard which turned out to be killing RPL CPUs, for example), and if they were even telling the truth in the first place. These are the folks who claimed laptops were failing in the same manner. Have you heard any other reports of laptops failing? I haven't, and Intel has certainly denied it at every turn.

It is completely unbelievable that a single developer has some sort of part-killing workload that nobody else has, especially when there are other examples of workloads that were triggering failures, but not at a 100% rate.
 

YSCCC

Notable
Dec 10, 2022
444
341
1,060
Product names don't really mean anything to anyone but the company involved. You can certainly compare MSRPs as a point of reference for what the company thinks a product is worth however these don't matter to anyone buying unless they're the retail price at the time of purchase.
Completely agree on this. Every gen, we'll have Intel competing favorably with AMD at one price point/class and the opposite at another. And it usually gets more extreme when one line has troubling news, or is soon to be replaced and isn't socket-compatible.
 
  • Like
Reactions: bit_user
