I would not consider faulty RAM that's catching errors with ECC as properly functional in any way.
How else, then? In modern computing hardware, I've seen only two methods: ECC and full mirroring. Mirroring is obviously terrible for both performance and capacity, whereas OOB ECC has virtually no impact on either and just requires that you buy more expensive DIMMs. As I mentioned, in-band ECC has only a modest impact on both, but (at least as Intel has implemented it) offers less protection than regular ECC DIMMs.
Here's an old study of DRAM reliability from Google. If anyone knows of anything more recent, please share.
"In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.
The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf
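To put those numbers in perspective, here's a rough back-of-the-envelope calculation. The DIMM size and uptime are my assumptions, and keep in mind the paper's rate is a fleet-wide average that (if memory serves) is dominated by a minority of DIMMs with hard faults, not a uniform background rate:

```python
# Back-of-the-envelope: expected correctable errors per year for one DIMM,
# using the Google paper's fleet-average rate of 25,000-70,000 errors
# per billion device-hours per Mbit. The 16 GiB DIMM size is my assumption.
GIB_TO_MBIT = 8 * 1024           # 1 GiB = 8192 Mbit
dimm_mbit = 16 * GIB_TO_MBIT     # hypothetical 16 GiB DIMM
hours_per_year = 24 * 365        # assume it runs 24/7

for rate in (25_000, 70_000):    # errors per 1e9 device-hours per Mbit
    errors_per_year = rate * dimm_mbit * hours_per_year / 1e9
    print(f"{rate:>6}/1e9 dev-hrs/Mbit -> ~{errors_per_year:,.0f} errors/year")
```

Even the low end works out to tens of thousands of correctable errors per year for a single large DIMM, averaged across the fleet - which is exactly why the paper's finding that hard errors dominate matters so much.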
ECC can also be faulty. Is it less probable? Of course, but if we go down this route, why ignore that?
ECC circuitry is incredibly simple. If the memory controller's ECC logic failed, you'd probably know it from a raft of false ECC errors being reported. But it's something like 0.001% of the logic in a CPU, and it's pretty much impossible that it alone fails while the rest of the CPU continues to work flawlessly.
The most probable failure of ECC - and the only one seriously worth considering - is that there's a software or configuration issue in ECC error reporting. That's why it's important to use validated solutions. When I bought an AM4 board for my home fileserver, I bought a server-grade board from ASRock Rack - not a more mainstream-style board where ECC support is a minor bullet point not important for most of its users.
And yes, ECC has limitations: standard SECDED can correct single-bit errors and reliably detect only up to 2-bit errors. However, I would again say that it's not worth seriously considering a scenario where your RAM is working flawlessly and the first/only errors you ever see are 3+ simultaneous bit flips in a single word, which ECC can fail to detect.
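To make that 2-bit limitation concrete, here's a toy SECDED (single-error-correct, double-error-detect) code in Python - a Hamming(7,4) code plus an overall parity bit. Real memory ECC uses the same construction at 64/72-bit scale; this is an illustrative sketch, not how any particular memory controller implements it:

```python
from functools import reduce

def encode(d1, d2, d3, d4):
    """Hamming(7,4) plus overall parity = an (8,4) SECDED codeword."""
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    c = [p1, p2, d1, p4, d2, d3, d4]
    return c + [reduce(lambda a, b: a ^ b, c)]   # append overall parity

def decode(cw):
    """Returns (data, status). Status is what the decoder *believes* happened."""
    c, s = list(cw[:7]), 0
    for pos in range(1, 8):          # syndrome = XOR of positions of 1-bits
        if c[pos - 1]:
            s ^= pos
    parity_ok = reduce(lambda a, b: a ^ b, cw) == 0
    if s and not parity_ok:          # single-bit error: flip it back
        c[s - 1] ^= 1
        return [c[2], c[4], c[5], c[6]], "corrected"
    if s and parity_ok:              # two flips: detectable, not correctable
        return None, "uncorrectable"
    if not s and not parity_ok:      # believes only the parity bit flipped
        return [c[2], c[4], c[5], c[6]], "corrected"
    return [c[2], c[4], c[5], c[6]], "ok"

data = [1, 0, 1, 1]
cw = encode(*data)

one = cw.copy();   one[2] ^= 1                                   # 1 flip
two = cw.copy();   two[1] ^= 1; two[5] ^= 1                      # 2 flips
three = cw.copy(); three[0] ^= 1; three[1] ^= 1; three[2] ^= 1   # 3 flips
print(decode(one))    # ([1, 0, 1, 1], 'corrected')
print(decode(two))    # (None, 'uncorrectable')
print(decode(three))  # wrong data, yet the decoder reports success!
```

The third case is the one worth noticing: three simultaneous flips can alias to a valid-looking syndrome, so the decoder "corrects" the wrong thing and hands back bad data without complaint. That failure mode, scaled up, is the theoretical blind spot of 72-bit SECDED.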
In the maybe half dozen failed ECC DIMMs I've seen in the wild, only one of them ever even had any 2-bit errors. However, most of my experience with bad RAM is of the non-ECC variety.
What I am saying is that in almost all arenas of digital data, we implement overhead in abstraction layers that are invisible to the user, in any capacity - to make sure small errors get fixed.
When it comes to DRAM, that's called ECC memory. As I mentioned, DDR5 now has on-die ECC, which has jealously been hidden from us, so we cannot know how many problems it's papering over, of what sort, and just how close a given DDR5 DIMM or memory chip is to generating uncorrectable errors.
Hard drives especially deploy massive overhead to keep your data safe. I don't remember the exact numbers, but I remember from school my jaw dropping a bit when I realised how much space is "wasted" in a hard drive for pure data-integrity reasons.
They also have tracking information, which imposes its own overhead. Be careful the figures you're looking at don't include that.
Anyway, yeah, I remember back when HDDs started embedding DSPs and doing on-the-fly error checking & correction. I think that was way back in the 1990s.
Frankly, implementing some simple error-catching code is trivial, and you get a lot of integrity for the first 5-10% of overhead.
Yes, that's ECC. Before DDR5, ECC DIMMs added 12.5% overhead (8 bits per 64). With DDR5, it's now 8 bits per 32, because DDR5 split each 64-bit DIMM into two logically independent 32-bit channels.
With the "in-band" ECC solution I mentioned, AnandTech found the overhead of Intel's implementation is just 8 bits per 256 (i.e. 3.1%), showing just how much weaker it is. I wish Intel would let us use in-band ECC on all of their CPUs & platforms that don't support full OOB ECC.
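Working out the overhead figures from the bit ratios above (pure arithmetic; the labels are mine):

```python
# ECC storage overhead = check bits / data bits, for the schemes mentioned.
schemes = {
    "Pre-DDR5 ECC DIMM (8 check bits per 64 data bits)":  (8, 64),
    "DDR5 ECC DIMM (8 check bits per 32-bit channel)":    (8, 32),
    "Intel in-band ECC (8 check bits per 256 data bits)": (8, 256),
}
for name, (check, data) in schemes.items():
    print(f"{name}: {100 * check / data:.1f}% overhead")
```

Note that by this arithmetic, DDR5's 8-per-32 layout actually raises the storage overhead to 25% (an 80-bit ECC DIMM versus the old 72-bit one), while in-band ECC trades protection strength for a much smaller 3.1% footprint.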