News Linus Torvalds Blasts Intel For Strangling the ECC Memory Market

CerianK

Distinguished
Nov 7, 2008
260
50
18,870
About half of the computers I have owned and/or built in the last 30 years have used ECC, which has taught me a few things:
  1. Initial stress testing of memory on a PC should be done with ECC disabled.
  2. With ECC disabled, an error rate of even 1 bit/day is excessive: at that rate multi-bit errors become more probable, and ECC cannot necessarily correct those.
Also, Linus seems to be unaware that memory manufacturers could build ECC into the memory chips themselves (e.g. internal error correction), which could be transparent to the CPU architecture. However, this brings up another important point: regardless of how ECC is achieved, the end user must be advised when excessive correctable errors have occurred, and be able to track the progression (e.g. OS-wide SMART), so they can make timely decisions (e.g. reseat and/or replace DIMMs, etc.) to correct the issue.
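As a rough illustration of the "track the progression" point: on Linux, the kernel's EDAC subsystem exposes per-memory-controller error counters in sysfs when an EDAC driver is loaded for the platform. The sketch below only reads those counters. The `needs_attention` helper and its threshold of 10 are hypothetical, chosen for illustration and not taken from any standard.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} from EDAC sysfs.

    Returns an empty dict if no EDAC driver is loaded (no mc* directories).
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def needs_attention(counts, ce_threshold=10):
    """Hypothetical policy: flag any uncorrectable error, or many correctable ones."""
    return [name for name, (ce, ue) in counts.items()
            if ue > 0 or ce >= ce_threshold]


if __name__ == "__main__":
    counts = read_edac_counts()
    for name, (ce, ue) in counts.items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
    print("flagged:", needs_attention(counts))
```

Logging these numbers periodically would give the kind of trend a user needs to decide when to reseat or replace a DIMM. The counters normally reset at reboot, so a real tool would have to persist them itself.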

Since accurately predicting the future use-case of a new non-mission-critical computer cannot be done, I agree with Linus that the issue should be addressed, but I don't think one can rationally blame Intel, at least as long as bean-counters have a say in organizations, as implementation and proper support of ECC has a real cost.
 

chaz_music

Distinguished
Dec 12, 2009
84
51
18,640
Linus Torvalds blasts Intel for strangling the ECC memory market, praises AMD for making it an option on Ryzen platforms.


I agree with Linus 100%. We have hardware and software error correction on nearly everything in the PC including SSDs, PCIe bus, and even RAID. But Intel continues to force the consumer to pay their server chip tax to get ECC in the memory controller on the mainstream CPUs. This isn't even a cost issue: you can get ECC in $4 CPUs now. I have not purchased a consumer level Intel system in nearly a decade because of this.

At least with the new DDR5, ECC is built in, and Intel can't continue to be a profiteering bad apple. Good marketing practice says that you are supposed to listen to the voice of the customer. Force-feeding your market means that when the monopoly is over, the consumer is going to punish you severely, as has happened with Ryzen and Epyc.

Most technical PC owners do not realize that RAM errors are not just a hardware phenomenon, but are also due to EMI (RF) and solar storms / gamma rays. Google published a report in 2011 showing their study on dramatically increased RAM ECC hits during high solar storm activity. They used data from their own server farms. Anyone who has designed for aerospace systems knows this. It is not IF you will have an ECC hit, but WHEN. And this is true whether you are in space or right at sea level. There was another report saying that they found lack of ECC to be another cause of BSOD screens, but people attribute that to an MS issue and not the hardware, due to lack of knowledge of hardware limitations.

My first PC with ECC was an Intel system in 1992. It is now 29 years later - so Intel should stop milking that cow.
 

neojack

Reputable
Apr 4, 2019
605
173
5,140
Just one point to address is performance. I mean, can ECC RAM perform as well as non-ECC RAM (e.g. 3200/3600 C14 for DDR4)? If not, would we lose performance?

If there is a performance loss, that would slow down adoption among enthusiasts and the rest of the market.

@chaz_music thanks for the information about ECC being built into DDR5. Is it managed by the memory sticks, or by the controller (CPU)?
 

CerianK

Distinguished
Nov 7, 2008
260
50
18,870
... can ECC ram perform as well as non-ECC ram ?
@chaz_music thanks for the information about ECC being built into DDR5. Is it managed by the memory sticks, or by the controller (CPU)?
I am not sure about DDR5, but there is typically a 2% memory performance impact with ECC in general. This can be negated by larger CPU caches.

DDR5 ECC is built into the memory chips themselves, which raises a few good points:
  1. Only single-bit errors can be corrected, and I assume that multi-bit errors can still be detected and reported to the OS. This may be insufficient for some use-cases.
  2. Most bit errors, but not all, originate in the memory cells. Stable voltage supply (VRMs are built into the DDR5 spec now), shielding and connection integrity also play a part in minimizing bit errors throughout the data path. Still, bit errors that arise outside the memory chips will not be detected by the on-chip ECC.
 

jchang6

Reputable
Apr 22, 2016
4
0
4,510
I have been buying Intel Xeon E3 for some time, and the Xeon non-MP before that. Between the moderate price hike over the near-equivalent Core, the higher-priced chipset/motherboard, and the memory, the premium is perhaps $200-300. It's very difficult to get a good desktop-use configuration from major vendors, in either their workstation or entry server product lines, so this meant building my own from a Supermicro motherboard. But in the last few years, it's been impossible to get the current-generation Xeon as a boxed CPU. Intel does not do a good job of segmenting the group above desktop but below extreme high-end server.
 

ezst036

Honorable
Oct 5, 2018
516
394
11,920
Maybe Intel (the largest contributor to the Linux codebase) should stop fixing his 2nd banana operating system and let him figure it all out.

Intel has clients which are billion-dollar corporations, and as you know, Linux is the most widely used operating system in servers in general and dominates, oh you know, all 500 of the top 500 supercomputers on the planet in particular. Perhaps this isn't the hobbyist OS you think it is? In any case, even Microsoft has said that Linux dominates Azure, and their rising contributions confirm this. Why else would they contribute?

So no, Intel can't afford to push Linux contributions aside, unless it decided it wanted to go out of business.
 

Sleepy_Hollowed

Distinguished
Jan 1, 2017
501
195
19,070
I don't know where all this "All DDR5 has ECC" talk comes from, but that's not the case, and Linus is correct.

The amount of data corruption that can happen on modern operating systems from RAM going bad all of a sudden is insane. I had one stick go bad, and I was just thankful that snapshots were available from earlier in the day.

It's a much bigger deal on laptops with integrated RAM, as those are a bit harder to troubleshoot because you can't remove sticks to isolate the fault. Intel deserves the worst, honestly.
 

chaz_music

Distinguished
Dec 12, 2009
84
51
18,640
I don't know where all this "All DDR5 has ECC" talk comes from, but that's not the case, and Linus is correct.


Your comment made me dig deeper, and I was surprised to find that you are correct, with some clarification needed.

Due to the lower voltages and very low CMOS threshold voltages being used in DDR5, they are expecting significant numbers of poor cell reads, much like what happens with SSD cells. The solution they are using is to add on-chip ECC, just like with SSDs. The implementation can vary from vendor to vendor, so they can change the level of ECC they use depending upon IC process yield and intended error reliability. Again, this is much like SSDs. SSDs for servers are more expensive, and also more reliable. Hence the need for ReFS and ZFS file systems (software-level error correction within the file system).

So this is chip-level ECC. Only.

The actual DDR5 spec also allows for ECC on the memory bus, just as is presently used for DDR4 on back through the original DDR. This allows for catching bad reads throughout the motherboard bus all the way to the CPU. This ECC level is optional, if I read the DDR5 spec correctly. It is the same ECC scheme used before on the system DRAM bus.

I have to say I am bummed at this. They could have used a system-wide solution to improve overall robustness, and they missed the opportunity. Hopefully, they did spend some time on the bus voltage control and noise, as well as impedance controls to improve the signal integrity.

For more reading, here is an article on Anandtech with good comments at the end:
https://www.anandtech.com/show/1591...sed-setting-the-stage-for-ddr56400-and-beyond

And yes - shame on Intel.
 
Jan 5, 2021
1
1
10
A quick clarification on DDR5 ECC: a DDR5 chip implements single error correction only (mostly for refresh-related cell errors); there is no detection of more than one error within one internal array read (128 bits internally). The spec makes no provision for a DRAM supplier to change the level of ECC depending on IC process/yield (they cannot implement SECDED instead of SEC for DDR5).
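To make the SEC vs SECDED distinction concrete, here is a toy-scale sketch using the textbook Hamming(7,4) code and its extended (8,4) variant. This is not the code DDR5 uses on the 128-bit internal word; it only shows the same behaviour in miniature: a SEC-only decoder silently "corrects" a double-bit error into wrong data, while one extra overall parity bit lets a SECDED decoder flag it as uncorrectable. All function names are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def syndrome(code7):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code7, start=1):
        if bit:
            s ^= pos
    return s


def extract_data(code7):
    return [code7[2], code7[4], code7[5], code7[6]]


def sec_decode(code7):
    """SEC only: flip whatever bit the syndrome points at, no questions asked."""
    c = list(code7)
    s = syndrome(c)
    if s:
        c[s - 1] ^= 1
    return extract_data(c)


def secded_encode(data):
    """Hamming(7,4) plus one overall parity bit appended as bit 8."""
    c = hamming74_encode(data)
    overall = 0
    for bit in c:
        overall ^= bit
    return c + [overall]


def secded_decode(code8):
    """Return (data, status); data is None when a double error is detected."""
    c = list(code8[:7])
    s = syndrome(c)
    overall = 0
    for bit in code8:
        overall ^= bit
    if s == 0 and overall == 0:
        return extract_data(c), "ok"
    if overall == 1:            # odd number of flips: treat as a single error
        if s:
            c[s - 1] ^= 1       # s == 0 means only the parity bit itself flipped
        return extract_data(c), "corrected"
    return None, "uncorrectable"  # syndrome set but parity even: two flips


def flip(bits, *positions):
    """Return a copy of bits with the given 1-based positions inverted."""
    out = list(bits)
    for p in positions:
        out[p - 1] ^= 1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    print("SEC, 1 flip:    ", sec_decode(flip(hamming74_encode(data), 5)))
    print("SEC, 2 flips:   ", sec_decode(flip(hamming74_encode(data), 1, 2)))
    print("SECDED, 2 flips:", secded_decode(flip(secded_encode(data), 1, 2)))
```

With SEC only, the two-flip case comes back as different data with no indication that anything went wrong, which is why on-chip SEC by itself does not replace SECDED on the memory bus.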

DDR5 ECC DIMMs need more chips on the DIMM to implement ECC: the overhead is 10 chips per ECC DIMM for DDR5 vs 9 chips for DDR4 (8 chips for a non-ECC DIMM). So ECC is not free, and most PC users (not power users) would be unwilling to pay around 20-25% more for a PC DIMM that has ECC.
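A quick back-of-the-envelope check of those chip counts, as a sketch only. It assumes every chip on the DIMM costs the same and ignores the PCB, PMIC and margins, so it gives chip-count overhead, not an actual price premium. The function name is made up for this example.

```python
def ecc_chip_overhead(ecc_chips, non_ecc_chips=8):
    """Return (extra chips relative to a non-ECC DIMM, ECC share of the ECC DIMM)."""
    extra = ecc_chips - non_ecc_chips
    return extra / non_ecc_chips, extra / ecc_chips


if __name__ == "__main__":
    for label, chips in [("DDR4 ECC", 9), ("DDR5 ECC", 10)]:
        more, share = ecc_chip_overhead(chips)
        print(f"{label}: {more:.1%} more chips than non-ECC, "
              f"{share:.1%} of the DIMM's chips are for ECC")
```

Under that assumption, a 10-chip DDR5 ECC DIMM carries 25% more chips than an 8-chip non-ECC DIMM, and 20% of its chips exist only for ECC. For the 9-chip DDR4 case the figures are 12.5% and about 11.1%.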
 