It would be great if the RAM itself could exclude the faulty chip, lowering the total capacity but keeping its working status.
Mainframes let you do things like that, although it turns out Chipkill is actually an ECC scheme rather than a way of directly disabling a specific IC.
Such schemes usually come at the expense of complexity and additional cost (e.g. for the extra IC, motherboard traces, etc.). There should also be some hit to energy efficiency.
Why exchange everything when only a small part is not working properly?
Well, at least using removable DIMMs gives you the option to replace only a stick, rather than the CPU (if memory is in-package) or the entire board (if soldered down).
However, we could return to the point about ECC: conventional ECC can correct single-bit errors and detect double-bit errors. At work, we had a server that had a low rate of single-bit errors for years. We tried replacing the DIMM, but it seems the problem was in either the motherboard or the CPU. So, in the end, it kept running like that for quite a while and remained stable the entire time.
BTW, it was a compute server - not a fileserver or database server - mostly running regression tests. So, the potential memory errors didn't put any data at risk.
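For anyone curious what "correct single-bit, detect double-bit" (SECDED) looks like in practice, here is a toy sketch using an extended Hamming(8,4) code. It is only an illustration: real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions 1..7 form a Hamming(7,4) code: parity at 1, 2, 4; data at 3, 5, 6, 7.
    # Index 0 holds an extra overall parity bit, which is what enables double-error detection.
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is 0 for a valid word.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 while the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat it as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself was flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but overall parity intact: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1              # simulate a single-bit memory error
print(decode(word))       # the error is repaired and the data recovered
word[2] ^= 1              # a second flipped bit in the same word
print(decode(word))       # detected, but no longer correctable
```

The "corrected" case is what our server was reporting for years: the data read back was still right, and the error just got logged.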