News Game publisher claims 100% crash rate with Intel CPUs – Alderon Games says company sells defective 13th and 14th gen chips

The article doesn't mention supermicro, the L1 video doesn't mention supermicro and only says that in the datacenter most systems are deployed with a motherboard based around the w680 chipset.
They do show supermicro in the video but they don't say anything about datacenters only using supermicro, they say they use w680 in general.
Just because supermicro was the first result he found when googling w680 doesn't mean anything.
Actually the one and only example he gave from a system that crashed hard was this.
Oh, look, weirdly it says asus right there!
Ud4Y3oT.jpg

That's low even for you!
If you had listened to the interview with Wendell he specifically says SuperMicro. The data center has a mix of Asus and SuperMicro motherboards and they are having the same failure rate which means CPU issue.
 

King_V

Illustrious
Ambassador
They would need to prove that this happens with nothing more than the specs that are stated in the datasheet for the given CPU.
It will never happen.

A class action against mobo makers that didn't state that they were using out-of-spec settings, though, would be a lot more doable.
Especially since they only started stating it after the issues began.
The assertion that the motherboard makers are using specs that Intel objects to is absurd.

Intel allows, and encourages, I daresay, whatever defaults are being used by the motherboard makers because that gives Intel the performance comparison numbers that Intel wants.

Intel's CPUs aren't hot-running power hogs because the motherboard makers are flouting the rules of an Intel that is helpless to stop them, and the 13th and 14th gen degradation is also not motherboard makers flouting the rules set down by a helpless Intel.

Between the motherboard makers and Intel, do you really believe it's NOT Intel that has the upper hand?
 

SyCoREAPER

Honorable
Jan 11, 2018
876
327
13,220
I've been fortunate with my 13700K, overclocking included. Though I know it's mostly the 13900, 14700 and 14900 that have the most issues.

That said, I'm not defending Intel or crapping on the game studio, since games are started years in advance, but did they not test their game with the most popular CPUs throughout development?
 

bit_user

Titan
Ambassador
The article doesn't mention supermicro,
Second paragraph:

"The Linux enthusiast obtained crash data from thousands of servers running Intel's Raptor Lake Core i9 K-series processors. He discovered that roughly 50% of the Raptor Lake servers he obtained telemetry information from have stability issues, despite each of them running server-grade LGA1700 socket motherboards from Asus and Supermicro."
Don't tell me you don't know how to use the text search feature of your web browser.

They do show supermicro in the video but they don't say anything about datacenters only using supermicro, they say they use w680 in general.
Again, this is nonsense. You're just spinning lies and hoping nobody will go to the trouble of fact-checking you.

At 12:46:

"I've seen similar crash screens from the super micro w680 boards which is the other game provider they use super micro one of them uses Asus and the Crash rate is pretty similar between these two
...
W680 was created to go along with motherboards designed for maximum stability neither Asus nor super micro motherboards really support giving tons of extra power to the CPU or doing insane overclocking for things on a desktop so I really don't think both Asus and super micro have colored really far outside the lines on this motherboard and I really don't think Super Micro or Asus have just lazily copy paste the voltage settings from their desktop motherboards to the server class motherboard boards.

Fully 50% of the systems deployed for both companies with either one of these processors to within one percentage Point are experiencing the same stability issues even disabling ecores has not fully resolved the issue. For one of these companies, the error rate also seems to be going up over time on the server side as well."

That's low even for you!
Just regurgitating my line is toothless, when you're the one spinning lies, here.
 

bit_user

Titan
Ambassador
The whole instability issue seems to have a lot more evidence against degradation than for it,
LOL! And just what is this evidence against degradation?

It seems there is a common issue with these recently reported crashes. Mismatching sub z690 vrm w680 boards with consumer hardware and i9k chips, pretending it is enterprise grade server hardware and having instability with not the most compatible hardware.
The W680 boards do not support Xeons. The W680 chipset launched in 2022 and it wasn't until Q4 of the following year that Intel even decided to launch the Xeon E-2400 series. Had Meteor Lake landed on the desktop, as planned, it looks like Xeon E would've entirely skipped the Alder/Raptor generations of CPUs. For a long time, pairing W680 boards with upper-end consumer CPU models was the solution Intel offered for entry-level servers and workstations. There was nothing else.

Furthermore, companies like Supermicro and ASRock Rack only make server & workstation boards. The whole point of the W680 chipset is to enable such products! These are not consumer products being misused.

Like that company that tried using AMD gaming graphics cards for professional AI workloads.
No, there's no comparison. Those were literally just gaming GPUs. AMD never claimed you could use them to build an AI server, unlike how Intel pitched the W680 chipset.

I thought Wendell knew something about servers.
Trying to spin this as some sort of crazy, mismatched configuration is either ignorant or dishonest. I knew you were an Intel fan, but trying to carry water for them, like this, is just pathetic.
 

bit_user

Titan
Ambassador
That said, I'm not defending Intel or crapping on the game studio, since games are started years in advance, but did they not test their game with the most popular CPUs throughout development?
The issue isn't lack of testing. The issue is that they set up servers using i9s and those started failing after 3-4 months. They're saying they've also had issues on their other systems used in development.

This is a hardware problem, nothing to do with their software. Trying to claim the problem is due to their lack of testing either misses that point or tries to shift the blame. That might work if it were just them, and if the failures didn't also happen in a wide variety of software, with people reporting crashes even when compiling large software projects.
 

tamalero

Distinguished
Oct 25, 2006
1,154
163
19,470
Their solution to that will be bribing a bunch of influencers and hiring (more?) social media trolls to plug their products and trash AMD.

Note: If we had anyone like that in these forums, you would not be allowed to name them.
Reminds me of this site's sister tech site... they had a die-hard, The Boys-style anti-AMD guy who bordered on lunacy XD
 

SyCoREAPER

Honorable
The issue isn't lack of testing. The issue is that they set up servers using i9s and those started failing after 3-4 months. They're saying they've also had issues on their other systems used in development.

This is a hardware problem, nothing to do with their software. Trying to claim the problem is due to their lack of testing either misses that point or tries to shift the blame. That might work if it were just them, and if the failures didn't also happen in a wide variety of software, with people reporting crashes even when compiling large software projects.
I know it's not their software, I'm just surprised they didn't encounter it sooner. There's no doubt it's hardware and Intel isn't owning up.

This has been a dumpster fire year (no pun intended) between PSUs blowing up, GPU PCBs cracking at the PCIe slot, 12VHPWR-gate, AMD CPUs burning holes into themselves and motherboards, Intel and their unstable chips... This has all been within the last 8-12 months. Just imagine what comes next...
 

bit_user

Titan
Ambassador
I know it's not their software, I'm just surprised they didn't encounter it sooner.
Because it's progressive! You deploy a system and it's 100% stable. Then, over time, the errors start to occur and get more and more frequent. That's what's so insidious about it. It's like a cancer.

There's no doubt it's hardware and Intel isn't owning up.
Actually, they did acknowledge the problem and even stated they still have yet to find the root cause.
 

tamalero

Distinguished
Because it's progressive! You deploy a system and it's 100% stable. Then, over time, the errors start to occur and get more and more frequent. That's what's so insidious about it. It's like a cancer.


Actually, they did acknowledge the problem and even stated they still have yet to find the root cause.
Ironically, I'm pretty sure my 13900K is affected, because it would always throttle and sometimes crash when running Cinebench, even on water cooling. I set the power limit so that it never reaches throttle temps, while tweaking the E-core speeds, and so far no crash.

But if degradation is the issue, I'm worried, because right now I'm like Chappelle once said:
"N..ga.. I'm broke!"
 

slightnitpick

Upstanding
Nov 2, 2023
173
114
260
No it is not. 100x times fewer takes you into negative numbers. If you take it at face value, if Intel crashes 100 times, 100x fewer takes you into -9900 crashes for amd.

That coming from a developer, you know - people writing code - just yikes.
For real numbers, multiplication and division don't change the sign of the number unless an odd number of the factors is negative.

Intel = AMD (Intel and AMD are equivalent)
Intel = 100 x AMD  <=>  (1/100) x Intel = AMD (AMD is 100 times fewer, or Intel is 100 times more)

But only IF you translate it into a multiplicand out of your own accord, they should have used one from the beginning.

Whatever the amount of failure rate is, one time fewer would already be 0.
Only multiplication by zero can return a value of zero.
"Time(s) " is the amount in question, 100 times fewer would multiply the original amount by 100. It is generally used as a common language expression for multiplication because you multiply by full values, by "the whole thing" , three times a whole apple and so on.
Three times the failure rate of, whatever it is, would be 300% and three times the failure rate less would be -300%

The original quote did not refer to percentages. A percentage is more similar to a log transformation in its use of positive and negative signs, not a straight multiplication. They are dealt with very differently than unitless numbers in multiplication. The use of negative percentages to indicate a proportionate decrease of a positive number is a convention used only with percentages and logarithms and the like, not with integer or real numbers.
Common sense says it is incorrect. Again, "times" refers to a 100% multiplication of the original value. Just flip the "fewer" with "more" and figure it out. 100 times more means you take the original value, and you increase it by 100 times. So in reverse, 100 times fewer means you take the original value and you decrease it 100 times. This is really elementary school level math.
Yes, this is exactly what I interpreted the original statement to mean. What's the problem?


All of us can agree to disagree, but if that is the case then it means that a couple of native English speakers who also passed math courses say that they see nothing syntactically wrong with the usage in the original statement. So we've got an argument about what mathematical expressions in English should be, not what they definitively are.
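
For anyone who wants to see the two readings side by side, here is a minimal sketch. The crash count of 100 is the hypothetical from the quoted post, not a figure from the article, and the function names are made up for illustration.

```python
def times_fewer_multiplicative(value: float, factor: float) -> float:
    """Reading argued in this post: 'factor times fewer' divides the value by factor."""
    return value / factor


def times_fewer_subtractive(value: float, factor: float) -> float:
    """Reading from the quoted post: subtract 'factor times the value' from the value."""
    return value - factor * value


intel_crashes = 100  # hypothetical count taken from the quoted example

print(times_fewer_multiplicative(intel_crashes, 100))  # 1.0
print(times_fewer_subtractive(intel_crashes, 100))     # -9900
```

Only the subtractive reading ever produces a negative count, which is why the disagreement is about which English reading to pick, not about the arithmetic itself.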
 

slightnitpick

Upstanding
Like, if you don't have the receipt or didn't buy from an authorized reseller, or bought an OEM/tray part. Those technically don't have warranty coverage, but returns might've been rare enough that they'd take them anyway and only now started to get picky?
They should still have an implied warranty for merchantability and fitness of purpose, though the period of time for an implied warranty may vary from state to state (a quick search shows the maximum term for an implied warranty is 1 year in California, for instance). They may have to return it to the reseller though (and then the reseller to Intel, or whoever they purchased it from).
"Supports Intel® Extreme Memory Profile (XMP)"
https://www.intel.com/content/www/us/en/gaming/extreme-memory-profile-xmp.html
"Intel® Extreme Memory Profile (Intel® XMP) lets you overclock "
The memory controller is integrated in the CPU....
If Intel makes a processor allowing memory overclock then by the Uniform Commercial Code they are warrantying it for this purpose.
 
Sep 2, 2023
24
22
15
We have already discussed ways to stabilize using power regulation to the default Intel settings. But what we haven't discussed is whether disabling HT is the final solution?
 

bit_user

Titan
Ambassador
They should still have an implied warranty for merchantability and fitness of purpose, though the period of time for an implied warranty may vary from state to state (a quick search shows the maximum term for an implied warranty is 1 year in California, for instance). They may have to return it to the reseller though (and then the reseller to Intel, or whoever they purchased it from).
Intel clearly states that OEM/Tray processors are warrantied through the reseller. They have a different product code, making it relatively easy to tell if a CPU was purchased in retail packaging or via OEM. I think Intel cannot dictate the terms of the reseller's warranty, although laws of the applicable state or jurisdiction might.
 

bit_user

Titan
Ambassador
We have already discussed ways to stabilize using power regulation to the default Intel settings. But what we haven't discussed is whether disabling HT is the final solution?
A significant number of gamers run with HT disabled. If the solution were that simple, I think it would've been noticed, by now.

I wouldn't expect HyperThreading to be a direct cause or solution, as the problem seems much more fundamental than that. However, anything you do to lower core utilization seems like it might at least help.
 
Common sense says it is incorrect. Again, "times" refers to a 100% multiplication of the original value. Just flip the "fewer" with "more" and figure it out. 100 times more means you take the original value, and you increase it by 100 times. So in reverse, 100 times fewer means you take the original value and you decrease it 100 times. This is really elementary school level math.
But only IF you translate it into a multiplicand out of your own accord, they should have used one from the beginning.

Whatever the amount of failure rate is, one time fewer would already be 0.

"Time(s) " is the amount in question, 100 times fewer would multiply the original amount by 100. It is generally used as a common language expression for multiplication because you multiply by full values, by "the whole thing" , three times a whole apple and so on.
Three times the failure rate of, whatever it is, would be 300% and three times the failure rate less would be -300%
No. This is factually incorrect. 'Common sense' is the ultimate gotcha: an appeal to authority implying that anyone who says otherwise is incorrect in some fundamental way, without that fundamental way ever being explained or quantified.

Addition refers to an increase in value, correct? So, as a way to deflate your entire reality, what happens when you add a negative number? It's exactly equivalent to multiplying by less than 1 to get a number smaller than the original, which according to Terry is not multiplication, because multiplication uses only "full values," which is not even the correct way to refer to whole numbers. Or maybe that is an incorrect interpretation; correct me if I am wrong.

Every single case of division is multiplication in disguise and vice versa, and this leads to being able to say things that, although uncommon, are grammatically correct and mathematically correct, whether you like the word problem or not.
 
Sep 2, 2023
24
22
15
The article doesn't mention supermicro, the L1 video doesn't mention supermicro and only says that in the datacenter most systems are deployed with a motherboard based around the w680 chipset.
They do show supermicro in the video but they don't say anything about datacenters only using supermicro, they say they use w680 in general.
Just because supermicro was the first result he found when googling w680 doesn't mean anything.
Actually the one and only example he gave from a system that crashed hard was this.
Oh, look, weirdly it says asus right there!
Ud4Y3oT.jpg

That's low even for you!
If you watch the whole video you will find out that "they didn't use XMP". I'll even tell you why - servers use ECC modules that do not support XMP (W680 is the only Intel chipset for LGA1700 with full ECC support). Moreover, even with non-ECC memory, Intel does not officially support 2 DPC 2R XMP and it is common knowledge that this has never worked well.
 

bit_user

Titan
Ambassador
It is said that the memory controller and RAM speeds have something to do with the issue, but this is second-hand information from another Tom's article. I do not know the source.
This video talks about RAM speeds and configurations (i.e. number of DIMMs). It seems like their server use case is a fantastic stress test for this issue.

At 17:43:

"The two populations of systems were a little different. The one provider uses dual DIMM configurations and that seemed to suffer a lot. The single DIMM configurations seem to work a little better. Concerning 2x 48 gig DIMMs versus 4x 32 gig DIMMs, opt for 2x 48 gig DIMMs, every time. The most stable configuration for testing y-cruncher 24 hours at a time, on the Linux side, was definitely configuring a max multiplier of 53 and configuring the DDR5 speed to 4200 for the 4x DIMM configuration. 5200 was fine for single DIMM."

View: https://youtu.be/QzHcrbT5D_Y?t=1075

Even these memory configuration changes (plus everything else they tried) didn't completely avoid the errors.
 

rluker5

Distinguished
Jun 23, 2014
706
431
19,260
LOL! And just what is this evidence against degradation?


The W680 boards do not support Xeons. The W680 chipset launched in 2022 and it wasn't until Q4 of the following year that Intel even decided to launch the Xeon E-2400 series. Had Meteor Lake landed on the desktop, as planned, it looks like Xeon E would've entirely skipped that generation of CPUs. For a long time, pairing W680 boards with upper-end consumer CPU models was the solution Intel offered for entry-level servers and workstations. There was nothing else.

Furthermore, companies like Supermicro and ASRock Rack only make server & workstation boards. The whole point of the W680 chipset is to enable such products! These are not consumer products being misused.


No, there's no comparison. Those were literally just gaming GPUs. AMD never claimed you could use them to build an AI server, unlike how Intel pitched the W680 chipset.


Trying to spin this as some sort of crazy, mismatched configuration is either ignorant or dishonest. I knew you were an Intel fan, but trying to carry water for them, like this, is just pathetic.
Supermicro w680s support up to 150w TDPs: https://www.supermicro.com/en/products/motherboard/x13sae
The ones you linked have lower supported TDPs.
What is the TDP of the 13900k, 14900k chips that were having issues? Are these motherboards designed for them? You don't know that the power systems are set up to stably manage these unintended CPUs. If they are doing worse on average than the motherboards designed for higher power CPUs then it probably isn't the CPUs causing the problems.
If you saw those VRMs in a 300w dGPU you were repasting because of inexplicably bad performance you would be like WTF? That cheap trash is the source of my troubles. You can see it in a glance, I can see it in a glance, Wendell can see it in a glance and probably knows how much wattage those motherboards are good for off the top of his head. It is a mismatched configuration, probably with gaming GPUs and cheap power supplies to boot.

Also the evidence against degradation aka electromigration caused degradation is: that since none of us have access to look at chips on an atomic level to see it, it is conjecture defined by both cause and effect. The cause being relatively high volts, amps, temps, and the effect being increasing instability as a result of being exposed to these over time.
On its own, exposure to relatively high volts, amps, temps without increasing instability is just rough use.
Also on its own, increasing perceived instability could be from many things. Corrupted OS, bad software, viruses, being a fed up complainer, capacitors that are all worn out on underpowered motherboards, migrating thermal paste causing hotspots, etc.

You need both for degradation, and we are hearing that the instability is coming whether you have excessive volts, amps, temps or not. So we do not have degradation by electromigration because the evidence of a cause for electromigration is irrelevant in the symptom of instability.

So, by definition, we do not have chip degradation. Even if it makes some people feel good to claim that there is, they are really just slandering others to feel better. Also, if some chips are experiencing increased instability at 100-200mV less than others that are not experiencing instability (talking about those chips running 5.8GHz+ all core, delidded), then the instability is apparently caused by something other than electromigration, because the fast and slow chips are very similar.

An analogy is if you thought you were allergic to peanut butter because you got hives occasionally and you overheard people saying that peanut butter causes hives in some people. If you completely cut peanut exposure out of your life and still got hives sometimes then it wouldn't be a peanut allergy would it. The peanut allergy in this case being defined by having bad effects (hives, instability) from a cause (peanuts, high volts, amps, temps). Just like you wouldn't be suffering from electromigration caused degradation if you had instability but your CPU weren't exposed to excessively high volts, amps or temps.
 

bit_user

Titan
Ambassador
Supermicro w680s support up to 150w TDPs: https://www.supermicro.com/en/products/motherboard/x13sae
...
What is the TDP of the 13900k, 14900k chips that were having issues?
125W.

Are these motherboards designed for them?
Yes.

Wendell can see it in a glance and probably knows how much wattage those motherboards are good for off the top of his head. It is a mismatched configuration, probably with gaming GPUs and cheap power supplies to boot.
That's funny, because he made no such claim.

Also the evidence against degradation aka electromigration caused degradation is: that since none of us have access to look at chips on an atomic level to see it,
So, the lack of evidence is equivalent to evidence against?? Gosh, I sure hope you never end up on a jury!

Why do you even leap to "electromigration" as the sole mechanism of degradation? You're certainly no lithography process engineer, so how the heck do you know what other possible mechanisms of degradation might exist?

The only evidence we need of degradation is the fact that the failure doesn't occur on new systems, then starts occurring, and eventually occurs with increasing frequency. That's an evidence-based finding of degradation. It's the very definition of degradation. You're just trying to obfuscate the issue by launching into a tangent about unknowables.

Also on its own, increasing perceived instability could be from many things. Corrupted OS, bad software, viruses, being a fed up complainer, capacitors that are all worn out on underpowered motherboards, migrating thermal paste causing hotspots, etc.
No, those are red herrings and gaslighting.

So, by definition, we do not have chip degradation. Even if it makes some people feel good to claim that there is, they are really just slandering others to feel better.
Unless you're calling L1Techs and their sources all liars (see vid from above post), there is indeed degradation!