News Intel's CPU instability and crashing issues also impact mainstream 65W and higher 'non-K' models — damage is irreversible, no planned recall

Status
Not open for further replies.
Customers who buy new ones and apply the microcode update shortly thereafter will (theoretically) still get a CPU that lasts at least through the warranty period.
Yeah, but that's assuming customers would even do that. I know a few people at work who only know how to turn a computer on and use it, that's it. So this would have to be done via Windows Update, not via a BIOS update... unless the fixes will be delivered via Windows Update?
 
Yeah, but that's assuming customers would even do that. I know a few people at work who only know how to turn a computer on and use it, that's it. So this would have to be done via Windows Update, not via a BIOS update... unless the fixes will be delivered via Windows Update?
Unless they block Automatic Updates on their OS, I'm pretty sure they'll get the new microcode without having to do anything. On Linux, CPU microcode updates come through the OS and I believe Windows is the same.
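For anyone who wants to check which microcode revision is actually loaded on a Linux box, here is a minimal sketch. It assumes an x86 kernel that exposes a `microcode` field in /proc/cpuinfo; the function name is my own invention. It only reports the running revision and says nothing about whether that revision contains Intel's fix.

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Return the distinct microcode revision strings found in /proc/cpuinfo text."""
    # Each logical CPU gets its own "microcode : 0x..." line; collect the unique values.
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        revisions = microcode_revisions(path.read_text())
        print("loaded microcode revision(s):", ", ".join(sorted(revisions)) or "not reported")
    else:
        print("/proc/cpuinfo is not available on this system")
```

Normally every core reports the same revision, so the set should contain a single entry. Compare it against whatever revision your board vendor or distro lists for the fixed microcode.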
 
Unless they block Automatic Updates on their OS, I'm pretty sure they'll get the new microcode without having to do anything. On Linux, CPU microcode updates come through the OS and I believe Windows is the same.
Windows can even update certain OEM firmware as well. I didn't know this until Win11 updated the BIOS on my Dell G15 via Windows Update a few weeks ago (it comes with some very stern warnings about not interrupting the process, etc.).
 
FWIW, there's the potential of Bartlett Lake. New leaks emerged about it, a couple weeks ago. The i9 and i7 will be available in 12P + 0E and 10P + 0E configurations. The rest of the model lineup looks like yet more Raptor Lake (B0) rebadges.

I think the biggest remaining questions are what the lithography and microarchitecture of those P-core only models will be. I assume still Raptor Cove on Intel 7+. Also, what might the emergence of such models say about the gaming potential of Arrow Lake?
I saw in Moore's Law is Dead's latest YouTube video that the possible cause of the whole RPL generation dying may be overloading of the much-stretched ring bus. I suspected this as well: keeping that many cores coherent on such a complex shared network, distributing workload, sharing the L3$, and also managing frequency differences across all the E- and P-cores seems physically mind-blowing to me.

Intel Raptor Lake Ring Bus Flaw Leak: Bartlett Lake is Affected, and there’s no Instability Fix!

As long as the overstretched ring bus structure is the same, Bartlett Lake may not stand a chance of a better fate than RPL(r).
 
You're overthinking this. Intel is only on the hook for the warranty period...
And what about performance impact?

Also, what about selling expensive unlocked chips for overclocking which they now tell you that you will have to lock them down if you want to avoid damage long AFTER the money has changed hands? How is that not false advertising? How is that not taking back part of the product after the sale has been made and thus a breach of sale contract? Or are we now renting our CPUs as well? CPU as a service?
Why stop sales? Most of these CPUs take a while to fail. Customers who buy new ones and apply the microcode update shortly thereafter will (theoretically) still get a CPU that lasts at least through the warranty period.
I hate to break it to you, but 99% of customers won't update the microcode. Even if Microsoft includes it in a Windows patch, between BIOS POST and the loading of that patch there's enough time for a single core spinning at 100% CPU usage at uncontrolled voltage (the BSP waiting for the APs to come up and respond to the Startup IPI, as well as single-threaded parts of the Windows boot process) to take some damage every time the system is (re)booted.
Probably, but are you certain?
I am not, but if it was possible to fully fix it, then why didn't they bundle that fix with the eTVB microcode fix released two months ago? Also, what happened to transparency? How come the errata documents don't even mention the eTVB microcode bug and subsequent fix?

And finally, while we're on errata documentation, isn't it weird that the Sapphire Rapids errata share some entries with the 13th and 14th generation errata? Doesn't that imply SPR and RPL have basically the same P-cores, given that they suffer from the same issues? Couldn't that imply that SPR is also partially vulnerable to this issue (it doesn't have eTVB, but the other wrong-voltage issue that the August microcode is supposed to fix might still be there)?
 
That is completely normal. AMD did the same thing with the Ryzen 1000 series, they replaced them if they died or started throwing errors and left them alone if they didn't. Same thing happens to vehicles with potentially catastrophic issues, they fix them if they start showing issues and leave them alone if they don't. Why? You said it yourself.
Holy heck, no no no NO NO NO.

I work in automotive, and you do NOT leave vehicles alone if they have known or suspected catastrophic defects and wait until the customer takes it to a dealer and asks about it. That's how you end up with 8-to-9 figure fines from the US government, and in the worst cases, mainstream news coverage and a body count.

If there's something like a potentially suspect frame weld or a potentially incorrectly torqued seat belt mounting bolt, you identify the range of suspect VINs and start mailing every owner you can possibly locate, telling them what is potentially wrong and that they need to take their car into a dealership for inspection as soon as possible, and stress that inspection and any necessary repairs will be at no cost to them, because people will ignore warning signs and delay service if they think it's gonna cost them dealership service rates. If you've identified an issue and don't have replacement parts, you still mail them and let them know you've identified an issue and are prioritizing repairs, and promise them another letter when parts are widely available. In really bad cases, you even do like-for-like replacements on parts that catastrophically fail due to unexpected deterioration, just to get the oldest ones out of play while you work on a better design.

The stakes are lower for gaming PCs, but I was not the first owner of my old car, and that model had a manufacturing problem with the brazing on an air conditioner line. Not considered catastrophic, since A/C is an accessory item and not a safety item. The manufacturer still tracked me down and sent me a letter explaining the nature of the defect, the extension of warranty coverage for the A/C system (out to 10 years!), and instructions on how to apply for reimbursement if I had paid for any repairs related to that issue at either a dealership or independent shop before the warranty extension was announced.
 
A lot of people are thinking that these chips are deficient if they degrade over 1.5v.
I thought the node was indestructible, the most durable node since the triple digits, because of the voltages my CPU was subjected to while remaining unscathed.
Like over 1.6v a little over a month ago when I applied Intel fail safe settings.
But after hearing that others are having trouble, maybe it is a more normal node and 1.5v may be around the upper safe limit (my chip runs an OC at under 1.4v, btw).
There was even a story on these forums on how a user's adequately cooled 13900k would peg at 100c starting many games. Most owners of i9s can attest those aren't normal volts. And there is little chance they were chosen by the user.

If the microcode stops the degradation then the degradation should be stopped.

Seeing how this recent high voltage single core boost fad is ending badly, it wouldn't surprise me if Ryzen got recalled to scale it back. It isn't like TSMC can handle 1.6v or anything. And it is generally tough to see any worthwhile improvements going from 5.5 to 6.0GHz on 1-2 core loads.
 
Intel can and must do better: https://www.cnn.com/2024/07/26/busi...eplacement-tundra-trucks-lexus-suv/index.html


In early June, Toyota announced the recall of nearly 100,000 Tundra pickups and about 3,500 Lexus luxury SUVs to fix a problem that could cause their engines to lose power while driving.

At the time, Toyota said it was working to find a solution to the issue. The solution, it now says, is simply to replace the entire engine on each one of the 103,500 big trucks and SUVs.

Intel can and must replace all the chips they sold, as there's no way to know if they are actually damaged or not. The assumption is that as long as they do not crash, they are not damaged. That's a bad assumption to make.
 
And what about performance impact?
Every manufacturer datasheet or specifications page I can recall seeing in recent memory has a get-out-of-jail footnote, like: "specification subject to change without notice". Also, I'm sure Intel has some similar language around turbo boosting, since that already couldn't be guaranteed due to thermal throttling.

I'm not arguing what's morally right, here. I'm just trying to focus on what they're actually liable for and what I think they will likely do.

Also, what about selling expensive unlocked chips for overclocking which they now tell you that you will have to lock them down if you want to avoid damage long AFTER the money has changed hands?
It's still an unknown just what the performance impact will be. I'm not going to speculate about that. Let's just wait and see.

if it was possible to fully fix it, then why didn't they bundle that fix with the eTVB microcode fix released two months ago?
I think they would've if they could've.

Also, what happened to transparency?
Agreed. I'm sure their PR strategy is not ideal, but we won't be able to properly judge it until all of the facts are known (plus, what they knew and when). I assume they're holding back information for good reasons (from their perspective), but hopefully time will tell.

How come that the errata documents don't even mention eTVB microcode bug and subsequent fix?
Well, do those documents cover microcode bugs or just hardware defects?

And finally, while we're on errata documentation, isn't it weird that the Sapphire Rapids errata share some entries with the 13th and 14th generation errata? Doesn't that imply SPR and RPL have basically the same P-cores, given that they suffer from the same issues? Couldn't that imply that SPR is also partially vulnerable to this issue (it doesn't have eTVB, but the other wrong-voltage issue that the August microcode is supposed to fix might still be there)?
Good questions. Hopefully, we'll get some answers.

BTW, Sapphire Rapids' cores have 2 MB of L2 cache, like Raptor Cove. However I think it's fabbed on the same Intel 7 node as Alder Lake. If it's a process-related issue, then Sapphire Rapids might be clear but Emerald Rapids is probably affected. If it's a microarchitecture flaw, then perhaps both could be in trouble?

I guess the main argument that Sapphire Rapids is in the clear would be that it's been shipping in volume and running 24/7 workloads for like 18 months. If it were affected by the same issue, I assume we'd have heard about it, by now.
 
I kept asking and Tom's finally reported the answer. The 14700 is affected. Now I have to figure out if I should tell the person who bought one that they should return it and wait for the next gen or keep it and hope it works out. It's still in the box.
 
I kept asking and Tom's finally reported the answer. The 14700 is affected. Now I have to figure out if I should tell the person who bought one that they should return it and wait for the next gen or keep it and hope it works out. It's still in the box.
In your shoes, I'd tell them, but characterize the risk for that model as "low, as far as we know" and that Intel promises a mitigation is coming within the next month or so.

The main factors that should bias the advice on their best course of action would be:
  • do they run lots of heavy compute jobs or do a lot of gaming?
  • do they plan on keeping the CPU well beyond the 3 year warranty period of retail boxed Intel CPUs?

If the answer to one or both of those questions is "no", then I'd probably go ahead and use it, if I were in their shoes.
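Spelled out as a tiny sketch, the rule of thumb above looks like this. The function and parameter names are made up for illustration, and it encodes nothing beyond the two questions in this post.

```python
def ok_to_keep(heavy_use: bool, keep_past_warranty: bool) -> bool:
    """Rule of thumb from the post: keep the CPU unless BOTH risk factors apply."""
    # "If the answer to one or both of those questions is 'no', go ahead and use it."
    return not (heavy_use and keep_past_warranty)


# Light browsing/office use, but kept well past the warranty period:
print(ok_to_keep(heavy_use=False, keep_past_warranty=True))  # True
```

Only the combination of heavy sustained load and a long planned service life tips the advice toward returning it.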
 
Intel looks like the bad guy who pushed his machine too hard trying to beat the hero (or win the race, or whatever). "AMD is getting ahead, we need more power!" "But sir..."

(I'm not calling Intel the bad guy, just comparing the anecdote)
"Prepare chips for Light Speed!"
"No no no Light Speed's too slow!"
"Light Speed too slow?"
"We're gonna have to go right to- Ludicrous Speed!"
Ludicrous speed? Sir, we've never made chips that fast before. I don't know if the Fabs can take it!
 
In your shoes, I'd tell them, but characterize the risk for that model as "low, as far as we know" and that Intel promises a mitigation is coming within the next month or so.

The main factors that should bias the advice on their best course of action would be:
  • do they run lots of heavy compute jobs or do a lot of gaming?
  • do they plan on keeping the CPU well beyond the 3 year warranty period of retail boxed Intel CPUs?

If the answer to one or both of those questions is "no", then I'd probably go ahead and use it, if I were in their shoes.
First answer is no; it will be mostly browsing and productivity (office) work. Second answer is yes (their last pc is 14 years old 😅). It's a mixed bag, but then again, so is Intel right now.
 
This answers the question I had, namely whether it was only the high-end K SKUs or all of them. This is super unfortunate for anyone that has Intel 12-14th gen chips. The other aspect of this is now, how long do we wait to see what happens with 15th gen? A thing like this being smoothed over for a year and a half or more surely isn't going to leave me confident in buying an Intel replacement/upgrade for years to come.

I suppose that many of us will have to take a wait-and-see attitude. If there is some method by which we can know for a fact that a chip is a 'post-problem' chip, it may make them a relevant choice again. Personally, it sort of ticks me off, since I went with a 12th gen chip (they're priced quite attractively right now) with the plan of upgrading to a 14th gen a couple of years from now.
Wait, but aren't you in the clear? I thought that 12th gen was safe, and only 13th and 14th were affected.

EDIT: ah, @yc1 beat me to it.
 
Your description of the auto industry is not accurate. If a major issue shows up, they will/must recall all the vehicles potentially affected for repair or replacement. And they certainly won't sell new vehicles with those problems. Doing anything else would lead to legal liability, major lawsuits and financial catastrophe.

Boeing tried it; I am not certain if they will recover in the near future.

Except they're not. Ford, for example, has a known issue on Bronco and Escape SUVs for engine fires due to cracked fuel injectors, but they aren't fixing them, they're only "monitoring" them for signs of a fuel leak and they will fix them if it's detected. And then there's the Kia and Hyundai recall that doesn't fix the fluid leak issue that causes fires, it only makes the fires less likely by putting in a fuse.

It would be NICE if US law required car manufacturers to actually fix potentially catastrophic issues, but until then they're not actually required to.
 
That is completely normal. AMD did the same thing with the Ryzen 1000 series, they replaced them if they died or started throwing errors and left them alone if they didn't. Same thing happens to vehicles with potentially catastrophic issues, they fix them if they start showing issues and leave them alone if they don't. Why? You said it yourself.
So, since this is an Intel problem, why are you hiding behind AMD?
And, were the Ryzen 1000 series actually suffering degradation like the 13th and 14th gen Intel chips are?
 
Holy heck, no no no NO NO NO.

I work in automotive, and you do NOT leave vehicles alone if they have known or suspected catastrophic defects and wait until the customer takes it to a dealer and asks about it. That's how you end up with 8-to-9 figure fines from the US government, and in the worst cases, mainstream news coverage and a body count.

If there's something like a potentially suspect frame weld or a potentially incorrectly torqued seat belt mounting bolt, you identify the range of suspect VINs and start mailing every owner you can possibly locate, telling them what is potentially wrong and that they need to take their car into a dealership for inspection as soon as possible, and stress that inspection and any necessary repairs will be at no cost to them, because people will ignore warning signs and delay service if they think it's gonna cost them dealership service rates. If you've identified an issue and don't have replacement parts, you still mail them and let them know you've identified an issue and are prioritizing repairs, and promise them another letter when parts are widely available. In really bad cases, you even do like-for-like replacements on parts that catastrophically fail due to unexpected deterioration, just to get the oldest ones out of play while you work on a better design.

The stakes are lower for gaming PCs, but I was not the first owner of my old car, and that model had a manufacturing problem with the brazing on an air conditioner line. Not considered catastrophic, since A/C is an accessory item and not a safety item. The manufacturer still tracked me down and sent me a letter explaining the nature of the defect, the extension of warranty coverage for the A/C system (out to 10 years!), and instructions on how to apply for reimbursement if I had paid for any repairs related to that issue at either a dealership or independent shop before the warranty extension was announced.

Tell that to Ford, which has at least 150,000 cars on the road right now with fuel leak and fuel injector issues from the last couple of years that they will not fix unless they catch fire, or to the millions of Kia and Hyundai cars which will not be fixed but will only have a fuse put in to lower the chance of a fire.
 
>I thought it was just a couple of models that Intel was pushing too hard to keep up with AMD. This paints a different picture.

Not really. It makes sense that the defective microcode would be in all ranges of CPUs and not just some. But per reported evidence, apparently only CPUs running at high voltages (i7/i9 using AIBs' default unlimited power limits) and/or running at sustained high-loads (servers) were reported as unstable/damaged. I've yet to see any confirmed damaged i5 or non-K report on reddit or in the wild.

So, the potential for damage exists for all chips, but damage only happens under certain conditions. Assuming said microcode defect is the root cause, the fix would suffice, but not for CPUs already damaged. The vast majority of CPUs--non-Ks (65W), i5, those running at default Intel power limit ie mobile CPUs--should be fine with or without the fix.

Intel is correct in not invoking a wholesale recall, given the actual extent of the issue, which boils down to some small percentage of damaged CPUs. A wholesale recall of every 13/14th CPU would be idiotic.

In echo chambers like this, there's a lot of faux outrage and righteous indignation that feed upon each other, so we get tempest-in-a-tea-cup skewed perspective. I've a 65W RPL and I would go bonkers if I paid any attention to the moaning & groaning here. The world isn't ending, and neither is Intel.

That said, once awareness of issue spreads wider, there will be some loss of public confidence in RPL/RPL-R, which will be reflected in their pricing in the coming months. Bargain hunters would be on the alert for deals.

It's all about awareness and management of risk, which is normal for me, as an investor.


>This is super unfortunate for anyone that has Intel 12-14th gen chips. The other aspect of this is now, how long do we wait to see what happens with 15th gen?

To wit, misinformation and FUD like the above is why I don't pay attention to gossip chambers.
 
Well, do those documents cover microcode bugs or just hardware defects?
Microcode is used to fix the hardware defects when possible so that's a yes. Any CPU erratum which requires microcode change is listed there and this one with eTVB and the other one with wrong voltage should be there as well.
If it's a microarchitecture flaw, then perhaps both could be in trouble?
That's what I am afraid of since I saw several identical flaws.
I guess the main argument that Sapphire Rapids is in the clear would be that it's been shipping in volume and running 24/7 workloads for like 18 months. If it were affected by the same issue, I assume we'd have heard about it, by now.
Those work in different conditions -- no spiky workloads, no insane boosting (everyone probably runs them in power efficient mode), and insanely good cooling (18 degrees Celsius in datacenters). I wonder whether X models (i.e. the OC unlocked ones like mine) are affected though.
 
>I thought it was just a couple of models that Intel was pushing too hard to keep up with AMD. This paints a different picture.

Not really. It makes sense that the defective microcode would be in all ranges of CPUs and not just some. But per reported evidence, apparently only CPUs running at high voltages (i7/i9 using AIBs' default unlimited power limits) and/or running at sustained high-loads (servers) were reported as unstable/damaged. I've yet to see any confirmed damaged i5 or non-K report on reddit or in the wild.

So, the potential for damage exists for all chips, but damage only happens under certain conditions. Assuming said microcode defect is the root cause, the fix would suffice, but not for CPUs already damaged. The vast majority of CPUs--non-Ks (65W), i5, those running at default Intel power limit ie mobile CPUs--should be fine with or without the fix.

Intel is correct in not invoking a wholesale recall, given the actual extent of the issue, which boils down to some small percentage of damaged CPUs. A wholesale recall of every 13/14th CPU would be idiotic.

In echo chambers like this, there's a lot of faux outrage and righteous indignation that feed upon each other, so we get tempest-in-a-tea-cup skewed perspective. I've a 65W RPL and I would go bonkers if I paid any attention to the moaning & groaning here. The world isn't ending, and neither is Intel.

That said, once awareness of issue spreads wider, there will be some loss of public confidence in RPL/RPL-R, which will be reflected in their pricing in the coming months. Bargain hunters would be on the alert for deals.

It's all about awareness and management of risk, which is normal for me, as an investor.


>This is super unfortunate for anyone that has Intel 12-14th gen chips. The other aspect of this is now, how long do we wait to see what happens with 15th gen?

To wit, misinformation and FUD like the above is why I don't pay attention to gossip chambers.
Pure speculation on my part, and perhaps some laziness, but it may still be Thermal Velocity Boost. The mobile chips which Intel claims are unaffected do not have TVB listed as a feature. I haven't bothered to go check on the S and T versions ..
 
If the microcode stops the degradation then the degradation should be stopped.
Intel themselves say it doesn't -- if the chip started degrading then it's too late. It only prevents degradation if applied before it started (which is by definition impossible for all CPUs already sold and in the supply chain), and that's why everyone should be pissed off.
 
This answers the question I had, namely whether it was only the high-end K SKUs or all of them. This is super unfortunate for anyone that has Intel 12-14th gen chips.
I thought this only affects 13-14th gen, and not 12th?

Genuine question here, would appreciate clarification - as this seems to be adding to FUD.

Have any 12th gen users been reporting the same crashing behaviour?
 
I thought this only affects 13-14th gen, and not 12th?

Genuine question here, would appreciate clarification - as this seems to be adding to FUD.

Have any 12th gen users been reporting the same crashing behaviour?
So far all the known cases involve the B0 die, and none involve the C0 die. The B0 die has 8 Raptor Cove P-cores (with 2MB L2$) and 4 clusters of Gracemont E-cores; the i7 configuration has some cores disabled. The C0 die is from ADL, with 1.25MB L2$ and only 2 clusters of E-cores. Some 13th gen i5s use C0 instead of B0 dies, and those are not affected.
 