News Warframe devs report 80% of game crashes happen on Intel's overclockable Core i9 chips — Core i7 K-series CPUs also have high crash rates

It also seems to make overclocking 13th Gen and 14th Gen Intel CPUs risky, especially if you aren't going "all the way" with liquid cooling.
Going "all the way" with liquid is what will make it even riskier, since you will be able to push even more watts, amps and volts through the CPU at the same maximum of 100°.

If you don't allow your CPU to reach 100° in the first place, then going liquid or not will make no difference to the crashing, though it may make a lot of difference in temps.
 
As noted by the Warframe team, the staff member in question wasn't overclocking and was even using a new PC, but even the most basic in-game tasks resulted in hard crashes.

Curiously, his computer at the office was fine: he was playing with the same loadout, the same customizations, with the same people, but he would only crash at home.

Dollars to donuts his computer at home was using an outdated BIOS in which the "stock" settings were set by the OEMs to be beyond Intel's specifications, which resulted in system instability, while his office computer was managed by IT and properly updated so that it runs within them.
 

bit_user

Titan
Ambassador
This graph is quite fascinating, since these are the first statistics I've seen about the distribution of the failures:

[Image: pie chart of the distribution of crashes by CPU model]


I'd love it, if they provided normalized data, so we knew what proportion of each CPU type was experiencing the issue. If we had that, probably the share of 13th gen CPUs would drop - there should be more of those among their users, since they've been on the market much longer.

However, since the problem seems to develop over time, perhaps it would go the other way, specifically because the 13th Gen CPUs would tend to have more runtime on them. Either way, I'd expect the slices for the Gen 14 CPUs to be occupying more and more of that pie, as time goes on.
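
To make the normalization idea concrete, here is a minimal sketch of what it would look like. The figures are made up purely for illustration (neither the crash shares nor the install-base shares come from the Warframe chart), and the function name `normalized_crash_rate` is hypothetical:

```python
def normalized_crash_rate(crash_share, install_share):
    """Divide each CPU's share of crashes by its share of the user base.

    A result above 1.0 means the CPU crashes more often than its
    popularity alone would predict.
    """
    return {cpu: crash_share[cpu] / install_share[cpu] for cpu in crash_share}


# HYPOTHETICAL numbers, for illustration only (not from the chart).
crash_share = {"13900K": 0.30, "14900K": 0.20}    # fraction of all crashes
install_share = {"13900K": 0.06, "14900K": 0.02}  # fraction of all players

rates = normalized_crash_rate(crash_share, install_share)
for cpu, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cpu}: {rate:.1f}x its share of the user base")
```

With these invented inputs, the CPU with the smaller slice of the pie ends up with the higher normalized rate, which is exactly why the raw chart can't tell us which model is worse.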
 

bit_user

Titan
Ambassador
Clearly it’s a non-fixable problem or Intel would have fixed it by now.
I can think of a reason they might not issue a fix for a fixable problem: too much performance cost.

It'll be really interesting to see if they issue a fix around the time that Arrow Lake launches. If they issue a performance-robbing fix now, a significant number of people will probably jump ship to AMD, especially with Zen 5 (Ryzen 9000) launching in about 2 weeks. If Intel can string those people along, maybe they'll eventually switch to Arrow Lake.

They’re trying to delay the inevitable lawsuit until their new line of CPU launch to draw attention away from the failed line of processors.
Yeah, I think a class action lawsuit also might result from a fix that costs too much performance.

I gather their Q2 quarterly report is due out, soon. It'll be fascinating to see if they include any sort of guidance about a liability from warranty claims.
 
This graph is quite fascinating, since these are the first statistics I've seen about the distribution of the failures:
I am a bit suspicious of this pie chart because it lists the 14700K and the 14700KF twice. It is probably a typo, but I am not sure which CPUs were meant instead. The only candidates left to replace the duplicates are non-K SKUs, and those would not be consistent with the rest of the chart, which only includes K/S SKUs of the 1x700 and 1x900 CPUs.
 

PCWarrior

Distinguished
May 20, 2013
215
100
18,670
Hopefully this will bring the price of the 13th and 14th gen cpus down. Let most people think that these CPUs are faulty. In any case here is a well-known simple fix that pretty much always works.

1. Go to your BIOS.
(a) For ASUS go to the Extreme Tweaker section.
(b) For Gigabyte go to Advanced mode and then choose Tweaker.
(c) For MSI go to the OC (Overclocking) section.

2. Sync all cores and set the all-core ratio limit of the P-cores to the Turbo Boost 2.0 frequency for your CPU. It is listed in the Intel ARK database as Performance-core (i.e. P-core) Max Turbo Frequency. Below is a list of these P-core Turbo 2.0 frequencies:
13700K – 5.3GHz
13900K/KF/KS – 5.4GHz
14700K – 5.5GHz
14900K – 5.6GHz
14900KS – 5.7GHz

(a) For ASUS go to Performance Core Ratio. The default is set to Auto. Change it to Sync All Cores. And then go to ALL-CORE Ratio Limit and type the multiplier corresponding to the frequencies mentioned above (so type 53 for the 13700K, 54 for the 13900K/KF/KS, 55 for the 14700K, 56 for the 14900K and 57 for the 14900KS)
(b) For Gigabyte go to Performance Core Ratio and change Auto by typing the multiplier in the field (type 53 for the 13700K, etc)
(c) For MSI go to Adjusted CPU frequency. Then go to Per P-core Ratio Limit. The default is set to Auto. Change it to Manually. For every P-core type the same multiplier (53 for all 8 cores of the 13700K, 54 for all 8 cores of the 13900K/KF/KS, etc., see above)

No need to do anything about the e cores.

3. If you are going to run RAM that is above 5600MHz, buy Karhu (the licence costs just 10 euros, about 11 USD). Run the test for 24 hours. If it passes you are good.

4. Intel officially supports 5600MHz and 4400MHz for 1 and 2 DIMMs per channel respectively. Virtually all i7 and i9 K SKUs should pass the Karhu test at anything up to 6400MHz with 1 DIMM per channel (i.e. using two sticks of RAM) and up to 5200MHz with 2 DIMMs per channel (i.e. using four sticks of RAM). For higher speeds it will depend on the IMC. In any case, if it fails the Karhu test you can either lower the RAM frequency from the advertised XMP and try again, or keep the same frequency with manually tuned timings and try again.

5. If your CPU has been running unstably or has been on the verge of instability for a while you might:
(i) need to re-install your OS as it may be corrupted. If you did the above steps, the hardware issue is most likely fixed, but problems might persist due to a corrupted OS. So re-install Windows.
(ii) have silicon that has degraded over time, in which case you will need to lower the above-mentioned speeds by 100 or 200MHz.

6. Enjoy a fully stable crash-free system for several years. Come here and thank me. Ignore any cynical posts/replies below. Just do this and 99.9% of the time it works.
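
For reference, the multipliers in step 2 are just the target frequency divided by the base clock. Below is a minimal sketch of that arithmetic, assuming the stock 100MHz BCLK (an assumption, though it matches the 5.3GHz to 53 mapping above). The frequencies are the ones listed in step 2; the function name `p_core_ratio` and the `derate_mhz` parameter are made up for illustration and are not BIOS settings:

```python
BCLK_MHZ = 100  # ASSUMPTION: stock base clock; ratio x BCLK = core frequency

# P-core Turbo 2.0 frequencies as listed in step 2 (GHz).
P_CORE_TURBO_GHZ = {
    "13700K": 5.3,
    "13900K": 5.4,
    "13900KF": 5.4,
    "13900KS": 5.4,
    "14700K": 5.5,
    "14900K": 5.6,
    "14900KS": 5.7,
}


def p_core_ratio(cpu, derate_mhz=0):
    """Return the all-core P-core ratio to type into the BIOS.

    derate_mhz is the optional 100 or 200MHz reduction from step 5 (ii)
    for silicon that has already degraded.
    """
    target_mhz = round(P_CORE_TURBO_GHZ[cpu] * 1000) - derate_mhz
    return target_mhz // BCLK_MHZ


print(p_core_ratio("13700K"))                  # ratio for a healthy 13700K
print(p_core_ratio("13900K", derate_mhz=200))  # 13900K dropped by 200MHz
```

So dropping a degraded chip by 100 or 200MHz simply means entering a ratio one or two steps lower than the value from the list.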
 

Neilbob

Distinguished
Mar 31, 2014
242
299
19,620
Oh yeah, that 1% difference in performance for multithreaded workloads when going from unlocked 315W to 253W is really going to make the judge throw away the key...
I typically emerge from my cave to comment only periodically (and even then it's usually just to make snarky remarks), but I do read most threads, and I have to say, it is rather tiresome how you appear to consistently be operating under the assumption that this problem stems from a clock/power issue. You don't know, and neither do any of the rest of us.

But more than that, you seem intent on just brushing the whole thing aside: that doesn't help you or anyone else here, and it sure doesn't help Intel. It's a bit like those people who insist that because THEY haven't experienced problems, that must mean it doesn't exist.

If the problem really is as simple as dodgy voltage/clock/power settings, it'd probably be fixed already and horror stories like this wouldn't keep emerging.

For some reason, any time Intel experiences even a tiny bit of negativity, you come out on the attack. Most of us would rather this wasn't happening at all (rabid fanboys notwithstanding), because competitors are generally not charitable and will possibly do their best to take advantage in a way that is ultimately detrimental to customers. Acknowledging problems is the best way forward for everyone.

Now, back to my cave.
 
Hopefully this will bring the price of the 13th and 14th gen cpus down. Let most people think that these CPUs are faulty. In any case here is a well-known simple fix that pretty much always works.

...
Although this may fix or help the situation for some, I have some questions. Where did you get this information? Did you figure this out yourself? Did you directly solve the issue for you or others in this manner before?
 

PCWarrior

Distinguished
May 20, 2013
215
100
18,670
Although this may fix or help the situation for some, I have some questions. Where did you get this information? Did you figure this out yourself? Did you directly solve the issue for you or others in this manner before?
I figured this fix out myself, but YouTubers like Frame Chasers have posted videos describing a similar fix, and there are posts on the Intel forums like this. The reason this fix works so well is that the issue seems to be the heat transients that result from the boosting of the P-cores and the failure of the motherboard VRM to keep up and supply the correct voltage.

The default boosting algorithm of the i9s (and the i7s to a lesser extent) causes the P-cores to fluctuate rapidly in frequency between the Turbo 2.0 single-core limit and the max turbo limit (e.g. for the 13900K that is between 5.4GHz and 5.8GHz). These max speeds (which, by the way, are not supported by all the P-cores) also require higher voltages. The motherboard's VRM occasionally fails to keep up with the rapidly fluctuating voltage demand, and instead of supplying the correct voltage it shoves a much higher core voltage, resulting in a thermal transient that drives the boosted core above TJmax and causes an instant crash. This frequent overvolting and the frequent heat transients also likely accelerate silicon degradation/aging through electromigration, thermomechanical stress, thermal cycling and other failure modes. Anyway, for anyone buying a new CPU, if you do what I said above you are good. If you have experienced silicon degradation, drop 100-200MHz and you are good.

At this point in time, however, I would wait for Zen 5 or Arrow Lake. Personally I would wait till November, as by then both Intel and AMD will have launched their entire enthusiast CPU portfolio.
 

bit_user

Titan
Ambassador
Hopefully this will bring the price of the 13th and 14th gen cpus down.
If not, the launch of Zen 5 certainly will.

here is a well-known simple fix that pretty much always works.
...
2. Sync all cores and set the all-core ratio limit of the p cores to whatever their Turbo boost 2.0 frequency is for your CPU. It is listed on the Intel ARC database as Performance-core (i.e. P Core) Max Turbo Frequency. Below is a list of these p-core turbo 2.0 frequencies:
13700K – 5.3GHz
13900K/KF/KS - 5.4GHz.
14700K – 5.5GHz
14900K – 5.6GHz
14900KS – 5.7GHz
Nope. This video mentioned article already said they tried 5.3 GHz on the i9's and they still would eventually encounter the errors.

It's actually mentioned in the video, itself (thanks, Terry) at 17:57:

"the most stable configuration for testing YC cruncher 24 hours at a time on the Linux side was definitely configuring a Max multiplier of 53"

View: https://youtu.be/QzHcrbT5D_Y?t=1075


He said it's the "most stable", but didn't fix the problems.

In that video, the data he managed to get from his contacts at the big PC OEMs said that (at 16:53):

"between 10 and 25% of CPUs have a problem or are marginal in some way"
...
"based on what I saw from game Telemetry and game server crash data ... I would have guessed I would have guessed that about half of these CPUs have some type of issue with some clearly a lot worse than others"
So, that suggests the 24/7 game servers are acting as an accelerated test bed for encountering these problems. Given enough time, the failure rate of consumer CPUs should eventually approach that of the servers.

3. If you are going to run RAM that is above 5600MHz buy Karhu (the licence costs just 10Euros ~ 11USD). Run the test for 24 hours. If it passes you are good.
That video also talks about RAM configurations, and it has to do with the number of DIMMs. It's right after the part I linked with the multipliers.

And no, just down-clocking your RAM isn't a long-term solution, either, though it helps.

6. Enjoy a fully stable crash-free system for several years.
How can you possibly know that??? Are you a time traveler?

If you do buy any bad CPUs on the cheap, good luck. I wouldn't waste my time or money.
 
Oh yeah... I remember several people saying AMD's issue and Intel's issue were the same. Yeah... Mhm... Totally. Especially the people saying Intel's wasn't Intel's fault, just the motherboard makers', while AMD's was exclusively AMD's.

Especially given the way they've approached the communication and solution expectations, of course.

So, when should we expect the first class action lawsuit and OEM with big enough family jewels to sue Intel for loss of revenue in the server side? Because yes, Xeons also have this issue.

Regards.
 

bit_user

Titan
Ambassador
So, when should we expect the first class action lawsuit and OEM with big enough family jewels to sue Intel for loss of revenue in the server side?
I plan to follow coverage of the forward-looking statements in their Q2 quarterly report, very carefully.

Because yes, Xeons also have this issue.
Very interesting. Source?
 

KyaraM

Admirable
...
So, that suggests the 24/7 game servers are acting as an accelerated test bed for encountering these problems. Given enough time, the failure rate of consumer CPUs should eventually approach that of the servers.
...
Just stumbled over this video talking about the problem, specifically the server use of the affected CPUs:

View: https://www.youtube.com/watch?v=oAE4NWoyMZk


They say multiple times in the video that they might have a lead, and it sounds like they will investigate it soon. They also talk about speeds and other details. Quite interesting.
 
That's 80% of crashes that specifically involve the Nvidia driver, not 80% of game crashes, which is a huge difference.

The first question that leapt to mind was: do AMD graphics driver crashes happen with these systems as well?
 

PCWarrior

Distinguished
May 20, 2013
215
100
18,670
Nope. This video mentioned article already said they tried 5.3 GHz on the i9's and they still would eventually encounter the errors.
It's actually mentioned in the video, itself (thanks, Terry) at 17:57:
"the most stable configuration for testing YC cruncher 24 hours at a time on the Linux side was definitely configuring a Max multiplier of 53"
He said it's the "most stable", but didn't fix the problems.

So, that suggests the 24/7 game servers are acting as an accelerated test bed for encountering these problems. Given enough time, the failure rate of consumer CPUs should eventually approach that of the servers.

That video also talks about RAM configurations, and it has to do with the number of DIMMs. It's right after the part I linked with the multipliers.

And no, just down-clocking your RAM isn't a long-term solution, either, though it helps.

How can you possibly know that??? Are you a time traveller?
The counterpoints you provide are for systems that have experienced silicon degradation and a heavy one at that as they were operating heavily non-stop for 6-12 months. These systems will need to settle for lower speeds (in both cpu core and RAM frequency) as the damage has already been done. But the fix still works stopping crashes and most importantly stopping further silicon deterioration. Just to be clear all silicon is deteriorating over time. 13th and 14th gen, even with this fix, will deteriorate eventually too. But it will be similar to the deterioration rate of previous generations. With the current stock boosting behaviour, they are experiencing an unusually faster deterioration which is further accelerated in server 24/7 operation.

I don’t need to be a time traveller when I know the root cause of an issue and know how the aforementioned fix stops what is causing the issue and prevents it from developing in the first place. The rest of my confidence stems from the known good reliability of previous Intel generations that are still going strong without any issues for at least a decade. 13th and 14th gen should be no different to previous gens when this fix is applied. People who applied this fix from the get go have not experienced any of the reported issues. And people who have had an issue at the beginning stopped having it after they applied the fix and still enjoy issue-free systems.
 

bit_user

Titan
Ambassador
The counterpoints you provide are for systems that have experienced silicon degradation and a heavy one at that as they were operating heavily non-stop for 6-12 months. These systems will need to settle for lower speeds (in both cpu core and RAM frequency) as the damage has already been done. But the fix still works stopping crashes and most importantly stopping further silicon deterioration.
First, I didn't catch how long these systems have been running. Did they actually say that, or are you just assuming?

Second, they said that the only help they got from Intel was more trays of CPUs to try. So, I assume some of the faulty systems are with CPUs that already had their settings dialed back, to some extent, since they were put into service. Even so, the customer still has yet to find a way to prevent these failures, other than by switching to AMD.

Just to be clear all silicon is deteriorating over time. 13th and 14th gen, even with this fix, will deteriorate eventually too.
Yes, but not necessarily by the same mechanism as what we're seeing. Intel famously has a 10 year operating target for their process nodes, which makes these developments all the more shocking.

But it will be similar to the deterioration rate of previous generations.
How do you know?

I don’t need to be a time traveller when I know the root cause of an issue and know how the aforementioned fix stops what is causing the issue and prevents it from developing in the first place.
Forgive me if I don't quite believe you. I think you listed some steps that are effective in mitigating the issue (which aligns with what L1Techs reported in that video), but we can't yet know just how effective they'll be, in the long run. The fact that the Raptor Lake servers are still having issues, with some of the CPUs being replacements a lot newer than 6-12 months, suggests that it isn't as solid a mitigation as you believe.

People who applied this fix from the get go have not experienced any of the reported issues.
Yeah, just how long have those systems been in service and what's their duty cycle?
 

Zarathustra3612

Prominent
Mar 23, 2023
2
3
515
I keep hearing about these issues with Intel cpu's, but I'm not having any issues on my side. I built a PC for me and one for my nephew about a year ago with 13900k's and neither of our systems have crashed.

I had one crash on mine but that was my doing because stupid me forgot to plug my AIO pump back in after working on it.
 

bit_user

Titan
Ambassador
I keep hearing about these issues with Intel cpu's, but I'm not having any issues on my side. I built a PC for me and one for my nephew about a year ago with 13900k's and neither of our systems have crashed.
There's apparently some lottery aspect to it, but it also matters how you've configured it and how much & how hard you push it.

In the L1Techs video, he claimed his sources at big PC OEMs are seeing failure rates on the order of 10% to 25%. So, that's still pretty decent odds of not encountering it, but more than enough for those who are inclined to worry.
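
As a rough illustration of those odds for someone who owns two of these CPUs, here is a minimal sketch. It assumes the 10% to 25% figure applies to each chip independently and ignores the fact that the problem appears to develop over time, so treat it as back-of-the-envelope arithmetic only; the function name `chance_all_unaffected` is made up:

```python
def chance_all_unaffected(failure_rate, cpu_count):
    """Probability that none of cpu_count CPUs is affected.

    ASSUMPTION: each CPU fails independently with the same probability.
    """
    return (1.0 - failure_rate) ** cpu_count


for rate in (0.10, 0.25):
    odds = chance_all_unaffected(rate, 2)
    print(f"failure rate {rate:.0%}: {odds:.0%} chance that both CPUs are fine")
```

Under those assumptions, two trouble-free 13900Ks would be the more likely outcome at either end of the quoted range, so a pair of healthy systems doesn't contradict the OEM numbers.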
 
I keep hearing about these issues with Intel cpu's, but I'm not having any issues on my side. I built a PC for me and one for my nephew about a year ago with 13900k's and neither of our systems have crashed.

I had one crash on mine but that was my doing because stupid me forgot to plug my AIO pump back in after working on it.
In a video, Wendell said that the failures typically started after 3-4 months of heavy usage. New CPUs were fine initially. Therefore, depending on usage, it might take years to develop for you.