News Intel's CPU instability and crashing issues also impact mainstream 65W and higher 'non-K' models — damage is irreversible, no planned recall

Page 4
Status
Not open for further replies.

NinoPino

Respectable
May 26, 2022
489
305
2,060
You're overthinking this. Intel is only on the hook for the warranty period and we're told they believe they have a fix in the works that will fulfill that obligation for most customers. So:
  1. Why stop sales? Most of these CPUs take a while to fail. Customers who buy new ones and apply the microcode update shortly thereafter will (theoretically) still get a CPU that lasts at least through the warranty period.
The problems in your sentence are the "Most of", the "(theoretically)", and the fact that a user affected by the fault loses a lot of time debugging and adjusting settings to mitigate the problem. That time has a cost, even for gamers.
Intel could ship new CPUs with patched microcode that prevents damage, even if it is not optimal for performance, while waiting for the definitive patch to be released.
Consider that Intel has stated that, even after the microcode fix is released in August, it cannot guarantee that new CPUs will come with the fix already applied. This implies that new customers who are unaware of the problem could be affected for months if they do not update the motherboard BIOS. Currently there is no information about a possible OS-level fix; Intel has only ever talked about a motherboard fix.

Because it probably affects all that are currently in circulation?
But an official statement with a list is the bare minimum in a case like this.
If the problem is related to TVB, as suggested by @CelicaGT, then why not simply say so?
But so far the only statement is the one given to The Verge, which is not satisfactory at all.
  • Not sure it's detectable via software.
Good point and interesting to investigate. 😀
Some good points.
; )
As always. 😀
 
Reactions: Nitrate55
Jul 13, 2024
7
7
15
Any socketed desktop CPUs at 65W and above are for sure potentially impacted by this issue. That doesn't mean that they are going to have problems just that the potential is there. Your specific CPU might be perfectly fine, but it also might not as it's not a straightforward problem.
But that means i3 models could also be at risk: the i3-14100 can draw up to 110 W.
 

NinoPino

Respectable
May 26, 2022
489
305
2,060
...
I hope they get taken to the cleaners via a class action lawsuit, and if FTC doesn't do anything, then maybe WTO or EC will do something about it.
...
The only thing that will have an impact on Intel is mass-media coverage of the issue.
If the news hits newspapers or TV shows, then... Bang!
 

NinoPino

Respectable
May 26, 2022
489
305
2,060
The thing is, those CPUs aren't just in gaming PCs -- they are in embedded systems, like medical equipment, calculating the strength of an x-ray for CAT scanner, or the current for the MRI magnet coils, or controlling cash dispensers in ATMs, serving in PoS machines at retail chains, etc.
I doubt that these consumer-grade or workstation-grade CPUs with high power consumption are used in medical machinery. Any source?
 

A Stoner

Distinguished
Jan 19, 2009
377
140
18,960
Looks like I jumped off the good ship Intel just in time with the 7950X3D from AMD powering my current flagship computer. I will have to wait for the 9950X3D to come out to see if it is worth ~$700 to upgrade or not. Generally, I like to skip at least one generation before upgrading. By then, the latest and greatest motherboards will be out and will require a completely new full build. Then this computer can be retired to doing background labors.
 
I thought it was just a couple of models that Intel was pushing too hard to keep up with AMD. This paints a different picture.
A few tech tubers were able to discover there was an oxidation problem on 13th and likely some 14th gen chips coming out of the Phoenix, Arizona fab; this caused the rapid unscheduled degradation of these chips in seemingly random and difficult-to-identify ways. It's believed that while this oxidation issue isn't the source of all of the problems, it's definitely contributing to some of them.

Additional problems identified are the P and E cores getting their voltage through the same ring bus, causing the ring bus to basically burn itself out/melt down, and general heat/power issues on the top-end chips simply breaking them down due to the insane power and heat being driven through them.

That's 3 separate issues causing problems on the 13th and 14th gen.

The microcode fix may limit damage from issues 2 and 3, but any existing damage won't be fixed by it, and this fix won't affect damage caused by oxidation.
 
Reactions: slightnitpick

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
428
760
SPR is GC not RC thus it won't be there unless you think somehow GC is affected too.
From the errata tables for 13th/14th Gen Core and 4th Gen Xeon respectively:

- RPL003 (SPR97): Debug Exceptions May Be Lost or Misreported When MOV SS or POP SS Instruction is Not Followed By a Write to SP

- RPL007 (SPR99): Processor May Generate Spurious Page Faults On Shadow Stack Pages

- RPL008 (SPR100): Processor May Hang if Warm Reset Triggers During BIOS Initialization

- RPL017 (SPR91): Intel® PT Trace May Contain Incorrect Data When Configured With Single Range Output Larger Than 4KB

- RPL057 (SPR128): Disabling The APIC While an Interrupt is Being Delivered May Cause a System Hang

Etc, etc... I hope that's enough to establish architectural similarity of Raptor Lake and Sapphire Rapids P-cores.



65W doesn't mean 65W for the socketed parts which I assume you know...
Read again what Intel said "65W and up are affected".
For embedded, the 65W SKUs do not carry the higher TDP, which is why they don't even boost as high as the T parts, which are 35W (92-109W).
That's irrelevant if they already said "65W and up" and didn't exclude embedded. After all, they are cut from the same wafer just packaged differently.
That's most certainly incorrect as the Minecraft servers (largely single threaded) reported on had been set to disable TVB and yet were still dying they just lasted longer.
I think you people are missing the point that there are two issues -- eTVB bug which was fixed in June microcode through BIOS updates, and incorrect voltage requests bug in microcode which will be fixed in August microcode update. So even if you disable eTVB and even if the CPU doesn't advertise eTVB because it is disabled, the other problem remains. That's also the reason I am thinking that SPR might be affected too based on how many bugs are shared with RPL.
I'm not sure why you think this situation is a binary problem. If every single CPU was actually being degraded there would be a significantly higher rate of failure than what has been seen. Anybody who didn't undervolt/set a voltage cap would have CPUs that are failing.
Because it is binary -- you either powered the CPU on and used it with:

- Aggressive BIOS defaults
- Buggy eTVB
- (still unpatched) microcode wrong voltage request issue

Or you didn't.

How much the CPU has degraded (and whether such degradation is already noticeable) depends on how much you used it. It's normal that gamers, developers, and tech-savvy users are seeing it and reporting it. Regular people even if encountering it probably just shrug and reboot cursing Windows.
 

NinoPino

Respectable
May 26, 2022
489
305
2,060
I thought this only affects 13-14th gen, and not 12th?

Genuine question here, would appreciate clarification - as this seems to be adding to FUD.

Have any 12th gen users been reporting the same crashing behaviour?
I suppose @punkncat refers to the missed upgrade path for customers of 12th gen and not to the fact that 12th gen is also affected.
 
Reactions: punkncat

NinoPino

Respectable
May 26, 2022
489
305
2,060
...

Intel is correct in not invoking a wholesale recall, given the actual extent of the issue, which boils down to some small percentage of damaged CPUs. A wholesale recall of every 13/14th CPU would be idiotic.
Agree.
...

It's all about awareness and management of risk, which is normal for me, as an investor.
For sure this is not a normal issue.

>This is super unfortunate for anyone that has Intel 12-14th gen chips. The other aspect of this is now, how long do we wait to see what happens with 15th gen?

To wit, misinformation and FUD like the above is why I don't pay attention to gossip chambers.
Where is the misinformation? It is true that 12th-14th gen customers are unfortunate: 12th gen owners because they have lost much of their upgrade path.
It is understandable to be wary of the next generation.
Where is the misinformation?
 
Reactions: Guardians Bane

abufrejoval

Reputable
Jun 19, 2020
592
426
5,260
Looks like I jumped off the good ship Intel just in time with the 7950X3D from AMD powering my current flagship computer. I will have to wait for the 9950X3D to come out to see if it is worth ~$700 to upgrade or not. Generally, I like to skip at least one generation before upgrading. By then, the latest and greatest motherboards will be out and will require a completely new full build. Then this computer can be retired to doing background labors.
Somewhat similar story here, just moved to a 7950X3D a few months ago, because one of my kids wanted an upgrade and I saw the chance to pass on a 5800X3D.

That 7950X is certainly good enough for all the workstation stuff I need it for. With an RTX 4090--I mostly bought for CUDA work--any game I throw at the pair just seems to do well enough at 4k and the max 144 FPS my 42" monitor (and HP Reverb VR headset) can handle so that my gaming skills are the only limiting factor remaining...

As much as I'd have liked that 7950X3D to be a 9950X3D, I may just not need it to be better for quite a while. The kids stick with 2-3k resolutions so far, so they'll be happy with their 5800X3Ds for quite a while, too: if there is a bottleneck, it's their GPUs.

I really wanted NUC-alikes based on earlier mobile Ryzens, but while those seem to flood AliExpress these days, back when I needed them they just could not be had at all, so I went with Intel NUCs and was happy enough with them.

The H-class mobile Intel chips had a lot of appeal in the µ-server arena, because they don't share their desktop cousins' ravenous thirst for wattage and can be operated on modest power. But I need RAM and 10Gbit Ethernet on those µ-servers, and that might become an issue with RAM expandability going DIMM-Dodo.
 
Reactions: A Stoner

NinoPino

Respectable
May 26, 2022
489
305
2,060
Not K-series, but some of the 65W models are indeed recommended by Intel for use in embedded applications. Intel has an IoT group which (among other things) promotes the use of some of their CPU models for specialized applications, including:
  • Retail, Banking, Education, and Hospitality - Integrated graphics supports immersive and interactive digital signage, video walls, AI-driven in-store advertising, and interactive flat panel displays (IFPDs) for services and storefronts.
  • Healthcare - Performance for more devices, apps, and multitasking—alongside built-in AI acceleration—support more diagnostics and medical procedures, ultrasound imaging, medical carts, endoscopy, and clinical devices.
  • Industrial - Enable machine vision use cases on the factory floor as well as real-time capabilities for critical workloads in AI-based industrial process control (AIPC), industrial PCs, and human-machine interfaces (HMIs).
  • Smart Cities and Transportation - Support network video recorder (NVR) solutions with AI box and roadside units (RSUs) for computer vision, smart city, and smart transportation use cases with Intel® UHD Graphics and fast CPU image classification performance.

That's all from their website, verbatim.

In the Gen 13 models being promoted for use in such applications, they include the i9-13900, i7-13700, and i7-13700T. I wonder if the i9-13900T used to be included, but was subsequently removed.

Details, here:
It is mostly marketing material, but in the end Intel lists as IoT devices only the E and TE versions, which are specific to embedded use.
Does anybody have info on whether those are affected by the problem? I think and hope not, for Intel's sake.
 

Taslios

Proper
Jul 11, 2024
54
76
110
I saw in the latest Moore's Law is Dead video on YT that the possible cause of the whole generation of RPL dying may be overloading of the much-stretched ring bus, which I also suspected, because the structure needed to keep that many cores coherent on such a complex shared network, distribute the workload, share L3$, and additionally manage frequency differences across all the E- and P-cores is just physically mind-blowing.

Intel Raptor Lake Ring Bus Flaw Leak: Bartlett Lake is Affected, and there’s no Instability Fix!

As long as the overstretched ring bus structure is the same, Bartlett Lake may not stand a chance of a better fate than RPL(R).
Tom may deny it but he's a hard core AMD fanboy... he is purely speculating. Also if part of the issue is the complexity of the differing needs between E and P cores then Bartlett may be the ONLY viable fix... it doesn't have E cores so won't have the issues with the down stepping/upstepping etc.

However... I've yet to hear what sort of bus Intel is using for Arrow Lake. If Arrow Lake has the same ring-bus-style power delivery, it may also be affected.

AMD uses a mesh-ring on Zen 5... curious if Intel will need to do something similar?
 
Mar 10, 2020
421
387
5,070
Tom may deny it but he's a hard core AMD fanboy... he is purely speculating. Also if part of the issue is the complexity of the differing needs between E and P cores then Bartlett may be the ONLY viable fix... it doesn't have E cores so won't have the issues with the down stepping/upstepping etc.

However... I've yet to hear what sort of bus Intel is using for Arrow Lake. If Arrow Lake has the same ring-bus-style power delivery, it may also be affected.

AMD uses a mesh-ring on Zen 5... curious if Intel will need to do something similar?
Buildzoid said similar in his video monitoring voltages on a scope….
 

NinoPino

Respectable
May 26, 2022
489
305
2,060
I never claimed that -- I said embedded, and @bit_user shared some info on that.
You wrote "...they are in embedded systems, like medical equipment, calculating the strength of an x-ray for CAT scanner, or the current for the MRI magnet coils, or controlling cash dispensers in ATMs, serving in PoS machines at retail chains, etc."

For me, "medical equipment, calculating the strength of an x-ray for CAT scanner, or the current for the MRI magnet coils" is medical machinery, and I doubt that the 13th/14th gen CPUs affected by the problem are used in such machines.

@bit_user linked the Intel marketing material for IoT and embedded 13-14th use cases (thanks), but Intel refers to E and TE models, not normal consumer models. I presume that those SKUs are not affected, otherwise Intel would have an even bigger problem.
 

Taslios

Proper
Jul 11, 2024
54
76
110
Buildzoid said similar in his video monitoring voltages on a scope….
Sorry... what did Buildzoid say? I love his MB reviews, but he rambles so much I can't watch his vids all the way through.

So... did Buildzoid say that the removal of the E cores may be the fix, or that Intel may be in a world of hurt if the ring bus is still in place for Arrow Lake? I made two suppositional statements in my post :)
 

vanadiel007

Distinguished
Oct 21, 2015
376
368
19,060
Why? Can you even give a reason why they must? Comparing to car companies is the most disingenuous nonsense that keeps going around. Those are government regulated and can face huge fines along with potentially barred sales. Even then they still don't always issue recalls unless there's a direct threat during operation.

There's no such regulation for CPUs and it'd be virtually impossible to prove properly working products are defective which is what you'd need to be able to do for a company to care.

Because this is in effect a product defect that can result in product degradation and chip failure.
Governments, hospitals, police forces use Intel chips in their individual work stations.

I don't see why they would not do a voluntary recall of all these affected products, unless they are planning on putting the Company stocks under extreme pressure.

They for sure are not going to "win" the performance crown from AMD under these circumstances...
 
Reactions: Nitrate55
From the errata tables for 13th/14th Gen Core and 4th Gen Xeon respectively:

- RPL003 (SPR97): Debug Exceptions May Be Lost or Misreported When MOV SS or POP SS Instruction is Not Followed By a Write to SP

- RPL007 (SPR99): Processor May Generate Spurious Page Faults On Shadow Stack Pages

- RPL008 (SPR100): Processor May Hang if Warm Reset Triggers During BIOS Initialization

- RPL017 (SPR91): Intel® PT Trace May Contain Incorrect Data When Configured With Single Range Output Larger Than 4KB

- RPL057 (SPR128): Disabling The APIC While an Interrupt is Being Delivered May Cause a System Hang

Etc, etc... I hope that's enough to establish architectural similarity of Raptor Lake and Sapphire Rapids P-cores.
So what you're actually saying is that you think Intel is lying about GC being affected. This certainly isn't an avenue I'm willing to entertain without any evidence to back it up, and no the architecture being similar is not evidence.
I think you people are missing the point that there are two issues -- eTVB bug which was fixed in June microcode through BIOS updates, and incorrect voltage requests bug in microcode which will be fixed in August microcode update. So even if you disable eTVB and even if the CPU doesn't advertise eTVB because it is disabled, the other problem remains. That's also the reason I am thinking that SPR might be affected too based on how many bugs are shared with RPL.
No, you just missed the point about the workloads degrading the CPU being ones that run high clocks, which trigger extreme voltages irrespective of the TVB bug. This would be why, in the wild, the CPUs predominantly affected are the i9 variety. It even seems like the 14th Gen i7 is more susceptible than the 13th, which seemingly makes sense as the boost clocks are higher.
Because it is binary -- you either powered the CPU on and used it with:

- Aggressive BIOS defaults
- Buggy eTVB
- (still unpatched) microcode wrong voltage request issue

Or you didn't.

How much the CPU has degraded (and whether such degradation is already noticeable) depends on how much you used it. It's normal that gamers, developers, and tech-savvy users are seeing it and reporting it. Regular people even if encountering it probably just shrug and reboot cursing Windows.
You're absolutely ignoring binning, which controls the programmed VID. While the bug may exist in every CPU, that doesn't mean the negative effects will. It's not like there's an arbitrary voltage number being pumped through every CPU with the faulty algorithm.

edit:
Forgot to add this:
That's irrelevant if they already said "65W and up" and didn't exclude embedded. After all, they are cut from the same wafer just packaged differently.
You don't seem to grasp that this isn't some sort of architectural silicon bug. If it was then the T series parts would be listed as being affected. It's about the way the CPUs boost and the voltage required to get there.
 
Because this is in effect a product defect that can result in product degradation and chip failure.
Governments, hospitals, police forces use Intel chips in their individual work stations.
What does any of this matter? If it was guaranteed to be affecting operation of every single part I'd agree, but there's no evidence to support that.
I don't see why they would not do a voluntary recall of all these affected products, unless they are planning on putting the Company stocks under extreme pressure.
Money... you don't seem to understand how much a recall of all 65W+ 13th/14th Gen parts would cost them. It would cost them in the billions of dollars to do a recall like that and there's no evidence to support that being warranted.

You can just take a gander at how Intel was dragged kicking and screaming into replacing Pentiums in the 90s which had an unfixable silicon bug (predated microcode) where a recall was the only option.
 