News Intel's CPU instability and crashing issues also impact mainstream 65W and higher 'non-K' models — damage is irreversible, no planned recall

Status
Not open for further replies.

NedSmelly

Prominent
Feb 11, 2024
740
399
770
So far all the known cases involve the B0 die, and none involve the C0 die. The B0 die has 8 Raptor Cove P-cores (with 2MB L2$) and 4 clusters of Gracemont E-cores; the i7 configuration has some cores disabled. The C0 die is from ADL, with 1.25MB L2$ and only 2 clusters of E-cores. Some 13th gen i5s use C0 instead of B0 dies, and none of those are affected.
Thank you. According to Wccftech, Alder Lake 12th Gen uses C0 and H0 dies.
 
The thing is, those CPUs aren't just in gaming PCs -- they are in embedded systems, like medical equipment, calculating the strength of an x-ray for CAT scanner, or the current for the MRI magnet coils, or controlling cash dispensers in ATMs, serving in PoS machines at retail chains, etc.
Embedded CPUs absolutely do not use the same power controls as socketed (no TVB/TB3 and lower boost clocks even at the same TDP). Also, can we drop the hyperbole? Nobody's putting high-power CPUs in ATMs/PoS.
And finally, while we're on the subject of errata documentation, isn't it weird that the Sapphire Rapids errata share some entries with the 13th and 14th generation errata? Doesn't that imply SPR and RPL have basically the same P-cores when they suffer from the same issues? Couldn't that imply that SPR is also partially vulnerable to this issue (it doesn't have eTVB, but the other wrong-voltage issue that the August microcode is supposed to fix might still be there)?
EMR would be more likely to be susceptible to these issues than SPR. I doubt either one would be, though, because they don't boost in the same fashion as their desktop counterparts (they do have TB2, and some TB3 for W, though). The runaway voltages are seemingly attached to lighter workloads when the cores boost to maximum. The highest turbo on SPR/EMR is 4.2GHz in Scalable form and 4.8GHz in W form. This means that even if they had the same type of boosting algorithm, their clocks are so much lower that the tables would never have such high voltages.

Here's an example of the top end W VID tables: https://skatterbencher.com/2023/09/...00-mhz/#Xeon_w7-3465X_Voltage-Frequency_Curve

As for the errata documentation, Intel got way worse at updating it with ongoing discoveries years ago, and sadly that hasn't changed.
Intel themselves say it doesn't -- if the chip started degrading then it's too late. It only prevents degradation if applied before it started (which is by definition impossible for all CPUs already sold and in the supply chain), and that's why everyone should be pissed off.
Ah except you're making the false assumption that every chip has been damaged. What this more likely comes down to is binning which is also why Intel can't just wave their hands and come up with a list of affected CPUs. They should be able to write some software which could identify likely suspects.
 
Intel can and must replace all the chips they sold, as there's no way to know if they are actually damaged or not. The assumption is that as long as they do not crash, they are not damaged. That's a bad assumption to make.
Why? Can you even give a reason why they must? Comparing to car companies is the most disingenuous nonsense that keeps going around. Those are government regulated and can face huge fines along with potentially barred sales. Even then, they still don't always issue recalls unless there's a direct threat during operation.

There's no such regulation for CPUs and it'd be virtually impossible to prove properly working products are defective which is what you'd need to be able to do for a company to care.
 

andrep74

Distinguished
Apr 30, 2008
14
19
18,515
What's the news on the laptop chips?

After all that's where things would get really ugly...

I have a Minisforum MS-01 that would randomly get BSOD and I put up with it because it's a quick enough reboot. But last week it happened again and this time wouldn't reboot. The MS-01 uses an i9-13900H (laptop CPU), so I'm wondering if this was the cause. In that case Minisforum has a big problem with these popular machines.
 

bit_user

Titan
Ambassador
Embedded CPUs absolutely do not use the same power controls as socketed (no TVB/TB3 and lower boost clocks even at the same TDP). Also, can we drop the hyperbole? Nobody's putting high-power CPUs in ATMs/PoS.
Not K-series, but some of the 65W models are indeed recommended by Intel for use in embedded applications. Intel has an IoT group which (among other things) promotes the use of some of their CPU models for specialized applications, including:
  • Retail, Banking, Education, and Hospitality - Integrated graphics supports immersive and interactive digital signage, video walls, AI-driven in-store advertising, and interactive flat panel displays (IFPDs) for services and storefronts.
  • Healthcare - Performance for more devices, apps, and multitasking—alongside built-in AI acceleration—support more diagnostics and medical procedures, ultrasound imaging, medical carts, endoscopy, and clinical devices.
  • Industrial - Enable machine vision use cases on the factory floor as well as real-time capabilities for critical workloads in AI-based industrial process control (AIPC), industrial PCs, and human-machine interfaces (HMIs).
  • Smart Cities and Transportation - Support network video recorder (NVR) solutions with AI box and roadside units (RSUs) for computer vision, smart city, and smart transportation use cases with Intel® UHD Graphics and fast CPU image classification performance.

That's all from their website, verbatim.

The Gen 13 models being promoted for use in such applications include the i9-13900, i7-13700, and i7-13700T. I wonder if the i9-13900T used to be included, but was subsequently removed.

Details, here:
 
Not K-series, but some of the 65W models are indeed recommended by Intel for use in embedded applications.
Right, and as I said, the embedded parts don't have the same power or boost profiles as their socketed counterparts. The 13900T has a higher rated boost clock than the 13900E, for example, despite the latter being a 65W part.

Also none of that is ATM/PoS.

edit:
I wonder if the i9-13900T used to be included, but was subsequently removed.
If the regular 13900 is listed, I can't think of any logical reason the T wouldn't be, so maybe availability?
 
Jul 13, 2024
7
7
15
Not K-series, but some of the 65W models are indeed recommended by Intel for use in embedded applications. Intel has an IoT group which (among other things) promotes the use of some of their CPU models for specialized applications, including:
  • Retail, Banking, Education, and Hospitality - Integrated graphics supports immersive and interactive digital signage, video walls, AI-driven in-store advertising, and interactive flat panel displays (IFPDs) for services and storefronts.
  • Healthcare - Performance for more devices, apps, and multitasking—alongside built-in AI acceleration—support more diagnostics and medical procedures, ultrasound imaging, medical carts, endoscopy, and clinical devices.
  • Industrial - Enable machine vision use cases on the factory floor as well as real-time capabilities for critical workloads in AI-based industrial process control (AIPC), industrial PCs, and human-machine interfaces (HMIs).
  • Smart Cities and Transportation - Support network video recorder (NVR) solutions with AI box and roadside units (RSUs) for computer vision, smart city, and smart transportation use cases with Intel® UHD Graphics and fast CPU image classification performance.

That's all from their website, verbatim.

The Gen 13 models being promoted for use in such applications include the i9-13900, i7-13700, and i7-13700T. I wonder if the i9-13900T used to be included, but was subsequently removed.

Details, here:
So is my i9-13900 safe or not?
 

rluker5

Distinguished
Jun 23, 2014
904
584
19,760
Even though the motherboard ultimately decides what voltage to provide the CPU, isn't that just a master voltage, with the voltage of individual cores derived by the CPU internally stepping it down?

I'm having a little trouble finding much info on FIVR, but this quote directly supports the idea that each core can run at a different voltage:
"Note that while each of the CPU cores now has its own PLL (and its own V/F curve), for processors without FIVR and where the cores share a single VccIA or VccCore voltage rail, only one voltage is applied across all cores.
Alder Lake and Raptor Lake inherit the overclocking feature from Rocket Lake and offers ratio limits for each of the P-cores and each cluster of 4 E-cores."

Then, the question is whether FIVR is stepping down the voltage it thinks it's getting or the voltage it actually gets.

According to SkatterBencher, the cores, cache and iGPU don't have a FIVR on Alder and Raptor Lake, just power gating. How effectively capping the voltage the CPU requests will cap the voltage it receives remains to be seen, but it certainly won't be absolute, and it will vary per motherboard manufacturer.

HWiNFO monitors both the voltage requested by the CPU and the voltage measured by an external sensor on the Z690 boards I have. I changed some settings and took some screenshots that I apparently can't share from an Imgur link anymore, so I'll just summarize. I checked voltage readings under the light all-core CPU-Z stress test for consistency.
My very cheap Aorus ITX showed 2 distinct behaviors in my 2 different types of tests. Adjustments to core voltage apparently adjusted the displayed core voltage request (core VID), while adjustments to the LLC setting completely ignored core requests and just changed the Vcore external sensor reading. Taking the LLC to the second-highest option on the Aorus increased the externally read voltage to 160mV over what the CPU requested. At the stock (also lowest) LLC setting, the external sensor read about 25mV higher.

My Asus board kept the externally monitored voltage about 65mV higher than the requested voltage at stock, stock with an autotune applied, and stock with Intel failsafe SVID, and 50mV higher under a tuned undervolt (with the second-highest LLC setting). Also, the Asus board defaults XMP to run the memory controller at 1.56V, which seems to just be asking for degradation.

If you have a tip on how to post these images here, I wouldn't mind sharing them under a spoiler as they have a lot of specific and corroborating information.

A cap on requested voltage would have to take these motherboard variances into account to be effective. If Intel wants to cap some chips at 1.5V, they will have to set the cap lower than 1.5V.
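
To put rough numbers on that last point, here is a minimal Python sketch. It is not anything Intel or the board vendors publish; it simply assumes the board adds a fixed offset on top of whatever the CPU requests, and it uses the four offsets I measured above. The function and dictionary names are made up for illustration.

```python
# Offsets measured above: external Vcore sensor reading minus requested VID, in mV.
REPORTED_OFFSETS_MV = {
    "Aorus ITX, stock LLC": 25,
    "Aorus ITX, second-highest LLC": 160,
    "Asus, stock": 65,
    "Asus, tuned undervolt": 50,
}


def max_vid_cap_mv(target_vcore_mv: int, board_offset_mv: int) -> int:
    """Highest VID cap that keeps the delivered voltage at or below the target.

    Assumption: delivered voltage = requested VID + a constant board offset.
    Real boards will not behave this simply (load, LLC and temperature all
    shift the offset), so treat the result as a ballpark figure only.
    """
    return target_vcore_mv - board_offset_mv


def cap_table(target_vcore_mv: int = 1500) -> dict:
    """VID cap needed per board configuration to stay at or below the target."""
    return {
        name: max_vid_cap_mv(target_vcore_mv, offset)
        for name, offset in REPORTED_OFFSETS_MV.items()
    }


if __name__ == "__main__":
    for name, cap in cap_table().items():
        print(f"{name}: cap VID at {cap} mV to stay at or below 1500 mV")
```

Under that simple assumption, the worst case among these four configurations decides how far below 1.5V a one-size-fits-all cap would have to sit.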
 

TheHerald

Respectable
BANNED
Feb 15, 2024
1,633
502
2,060
According to SkatterBencher, the cores, cache and iGPU don't have a FIVR on Alder and Raptor Lake, just power gating. How effectively capping the voltage the CPU requests will cap the voltage it receives remains to be seen, but it certainly won't be absolute, and it will vary per motherboard manufacturer.

HWiNFO monitors both the voltage requested by the CPU and the voltage measured by an external sensor on the Z690 boards I have. I changed some settings and took some screenshots that I apparently can't share from an Imgur link anymore, so I'll just summarize. I checked voltage readings under the light all-core CPU-Z stress test for consistency.
My very cheap Aorus ITX showed 2 distinct behaviors in my 2 different types of tests. Adjustments to core voltage apparently adjusted the displayed core voltage request (core VID), while adjustments to the LLC setting completely ignored core requests and just changed the Vcore external sensor reading. Taking the LLC to the second-highest option on the Aorus increased the externally read voltage to 160mV over what the CPU requested. At the stock (also lowest) LLC setting, the external sensor read about 25mV higher.

My Asus board kept the externally monitored voltage about 65mV higher than the requested voltage at stock, stock with an autotune applied, and stock with Intel failsafe SVID, and 50mV higher under a tuned undervolt (with the second-highest LLC setting). Also, the Asus board defaults XMP to run the memory controller at 1.56V, which seems to just be asking for degradation.

If you have a tip on how to post these images here, I wouldn't mind sharing them under a spoiler as they have a lot of specific and corroborating information.

A cap on requested voltage would have to take these motherboard variances into account to be effective. If Intel wants to cap some chips at 1.5V, they will have to set the cap lower than 1.5V.
VID is how much the CPU requests; Vcore is how much it actually gets from the mobo. The "goal" should be to make these 2 match, for starters, because that's the only way to get actual power usage from HWiNFO. If VID and Vcore don't match, your power readings will be off. To make these 2 match, you need to play around with AC / DC LL in the advanced Lite Load settings. The correct values depend mostly on the motherboard's VRMs.

As you can see from my screenshot, min / avg / max Vcore and VID are as close as possible. Maybe I could tune a bit further to close that 0.003 difference, but I'm too bored to do that.

image-2024-07-29-081155957.png
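
For a feel of how far off the power reading can get, here is a rough Python sketch. It rests on an assumption that is not confirmed anywhere in this thread: that the reported package power is effectively voltage times current with VID used as the voltage, so the error scales with the gap between VID and Vcore. The function name and the 1.300 V operating point are hypothetical.

```python
def vid_power_error_percent(vid_v: float, vcore_v: float) -> float:
    """Percent error of a power figure computed from VID instead of measured Vcore.

    Assumes power = voltage * current with the same current in both cases,
    so the current cancels out of the ratio. Negative means under-reporting.
    """
    return (vid_v - vcore_v) / vcore_v * 100.0


if __name__ == "__main__":
    # Hypothetical operating point: 1.300 V requested, board delivers 65 mV more.
    print(f"65 mV gap: {vid_power_error_percent(1.300, 1.365):.2f}%")
    # Nearly matched pair, 3 mV apart (assuming the 0.003 figure above is in volts).
    print(f"3 mV gap:  {vid_power_error_percent(1.300, 1.303):.2f}%")
```

If that assumption holds, a board that overshoots the request by tens of millivolts would under-report power by a few percent, while a closely matched VID and Vcore keeps the error to a fraction of a percent.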
 

KyaraM

Admirable
This answers the question I had, specifically whether it was only the high-end K SKUs, or all of them. This is super unfortunate for anyone that has Intel 12-14th gen chips. The other aspect of this is: how long do we wait to see what happens with 15th gen? A thing like this being smoothed over for more than a year and a half surely isn't going to leave me confident in buying an Intel replacement/upgrade for years to come.

I suppose that many of us will have to take a wait-and-see attitude. If there is some method by which we can know for a fact that a chip is a 'post-problem' chip, it may make them a relevant choice again. Personally it sort of ticks me off, since I went with a 12th gen chip, which are priced quite attractively right now, with the plan to upgrade to a 14th gen a couple of years from now.
12th gen isn't affected.
 

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
428
760
Embedded CPUs absolutely do not use the same power controls as socketed (no TVB/TB3 and lower boost clocks even at the same TDP). Also, can we drop the hyperbole? Nobody's putting high-power CPUs in ATMs/PoS.
As far as I know, 65W isn't considered a high-power CPU. Intel said 65W and up.
The runaway voltages are seemingly attached to lighter workloads when the cores boost to maximum.
That's for the eTVB bug.
The highest turbo on SPR/EMR is 4.2GHz in Scalable form and 4.8GHz in W form. This means that even if they had the same type of boosting algorithm, their clocks are so much lower that the tables would never have such high voltages.
As I said, SPR doesn't have eTVB, but that doesn't mean the other incorrect voltage selection bug isn't present if the issue is architectural.
Ah except you're making the false assumption that every chip has been damaged.
Every 65W+ chip that has been installed and powered on, as well as those that have the microcode bug which will be installed by end users or which are installed in premade systems which won't be updated since regular people don't update their BIOS like ever.
 

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
428
760
I kept asking and Tom's finally reported the answer. The 14700 is affected. Now I have to figure out if I should tell the person who bought one that they should return it and wait for the next gen or keep it and hope it works out. It's still in the box.
If it is still in the box, then assuming they can get a mainboard with an updated BIOS which contains both microcode fixes (the eTVB one and the wrong-voltage one), or do a recovery BIOS flash to such a version before installing the CPU (some mainboards support this), their CPU would probably be OK.
 
Mar 10, 2020
421
387
5,070
You're overthinking this. Intel is only on the hook for the warranty period and we're told they believe they have a fix in the works that will fulfill that obligation for most customers. So:
  1. Why stop sales? Most of these CPUs take a while to fail. Customers who buy new ones and apply the microcode update shortly thereafter will (theoretically) still get a CPU that lasts at least through the warranty period.
  2. Same reason as above. No compelling reason to stop sales, if the microcode fix works well enough.
  3. Because it probably affects all that are currently in circulation?
  4. Not sure it's detectable via software.
  5. Yes, good question. They clearly need to fill us in on more details.
  6. Yes, this would be a good move by them and I think it's really owed. This is something that could still happen, but perhaps they're currently focused on how to keep the chip from sinking. Once they've stabilized the crisis, perhaps they'll turn attention to secondary matters, like customers whose chips haven't yet started failing.
  7. Probably, but are you certain?
  8. Assume it did. The question is how much it's degraded and I'm not sure that's something you can determine via software.

Some good points.
; )

All that is good for Intel from their point of view.

It could be seen thus:-
1. We will keep selling a device that we know is operating in a flawed manner; if we stop, our share price tanks. So long as we can get through the warranty period then we are fine.
2. We have a possible fix to the system... let’s hope it works. So long as we can get through the warranty period then we are fine.
3. Hope it works or we are screwed.
4. As you wrote.
5. As you wrote.
6. We could increase the warranty period, we might have to if the failure rate becomes catastrophic. It has the potential to do so. We had better hope the fix works.
7. We really have to pray the fix works.
8. No way to tell if the chips in the wild are already damaged, hope the fix works.

I know that the above is overly negative. I hope that, as a one-off for the 13 and 14 series of processors and as a gesture of good faith, Intel places very few restrictions on RMA requests, whether in warranty or not. Chips cost a lot of money. Intel needs to respect their customers’ good faith in their products.

A one off hit on the balance sheet to make good on the problems would go a long way to restoring trust.
 

abufrejoval

Reputable
Jun 19, 2020
592
426
5,260
I have a Minisforum MS-01 that would randomly get BSOD and I put up with it because it's a quick enough reboot. But last week it happened again and this time wouldn't reboot. The MS-01 uses an i9-13900H (laptop CPU), so I'm wondering if this was the cause. In that case Minisforum has a big problem with these popular machines.
Just who would have the big problem and how it's going to be dealt with is the ugliness I was alluding to.

The vast majority of these chips would be in laptops and all of them are soldered implying non-trivial cost and logistics to replace.

Everyone would like to have the consumer hold the candle, except of course the consumers, who'd expect to be served a good replacement device with a minimum of disruption, effort or cost.

Can't see how those positions could be met, nor how OEMs in the middle of all this could help, let alone manage these volumes... to run a process that (hopefully) will never be necessary again.

As several others have predicted, also: if this spills in significant volumes into soldered variants within warranty, it will get really ugly.

So everyone who has a device like that right now: make sure you test extensively, make backups and cry wolf early, before warranties run out and Intel tries to duck out of this.
 

NinoPino

Respectable
May 26, 2022
489
305
2,060
That is completely normal. AMD did the same thing with the Ryzen 1000 series, they replaced them if they died or started throwing errors and left them alone if they didn't.
Not the same thing, because Intel continues to sell faulty CPUs knowing what the problem is. In this case they should stop selling new CPUs until the fix is applied.
This is the most scary thing of the whole story.

Same thing happens to vehicles with potentially catastrophic issues, they fix them if they start showing issues and leave them alone if they don't. Why? You said it yourself.
Absolutely not true, at least with European cars or cars sold in Europe. If a dangerous or potentially dangerous defect is confirmed, first of all the production lines are immediately updated to fix the problem on the next cars produced, then a recall campaign immediately starts that informs every customer to bring their car in for service, specifying the exact cause of the recall.
 
As I said, SPR doesn't have eTVB, but that doesn't mean the other incorrect voltage selection bug isn't present if the issue is architectural.
SPR is GC (Golden Cove), not RC (Raptor Cove), so it won't be there unless you think GC is somehow affected too.
As far as I know, 65W isn't considered a high-power CPU. Intel said 65W and up.
65W doesn't mean 65W for the socketed parts, which I assume you know, but just in case: they're 154-219W. Maybe you don't consider that high power, but I certainly do. For embedded, the 65W SKUs do not carry the higher TDP, which is why they don't even boost as high as the T parts, which are 35W (92-109W).
That's for the eTVB bug.
That's most certainly incorrect, as the Minecraft servers (largely single-threaded) that were reported on had been set to disable TVB and yet were still dying; they just lasted longer.
Every 65W+ chip that has been installed and powered on, as well as those that have the microcode bug which will be installed by end users or which are installed in premade systems which won't be updated since regular people don't update their BIOS like ever.
I'm not sure why you think this situation is a binary problem. If every single CPU was actually being degraded there would be a significantly higher rate of failure than what has been seen. Anybody who didn't undervolt/set a voltage cap would have CPUs that are failing.
 