News Revamped MemTest86 Can Highlight Bad ICs on Your DIMMs

Admin · Aug 11, 2022

Highlighting precise DRAM chip error locations could open up repair and salvage options. Only works on Intel Alder Lake DDR5 and Z690 platforms for now.


tennis2 · Aug 11, 2022

Back from the dead eh?
When was the last version release? Somewhere around 2011 - 2013?

I gave up on Memtest cuz it couldn't reliably/efficiently test high-GB sticks (aka >2GB) that are common in modern PCs.

InvalidError · Aug 11, 2022

Figuring out which chips are bad isn't rocket science: DRAM chips have a strict bit grouping due to per-byte strobes. Look at the pattern of bad bits and divide each error bit position by eight, rounding down. The result is the chip number, counting from zero.
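The divide-by-eight rule can be sketched in a few lines of Python. This is only an illustration, under the assumption of x8 chips (eight data bits per chip) and chip numbering that follows the byte lanes. The function name `failing_chips` and the test patterns are hypothetical, and where chip N physically sits on a given module is not something this sketch can tell you.

```python
def failing_chips(expected: int, actual: int, bits_per_chip: int = 8) -> list[int]:
    """Return zero-based chip indices whose data bits differ.

    Assumes each chip drives `bits_per_chip` consecutive data bits.
    """
    diff = expected ^ actual  # set bits mark the positions that read back wrong
    chips = set()
    bit = 0
    while diff:
        if diff & 1:
            chips.add(bit // bits_per_chip)  # integer division rounds down
        diff >>= 1
        bit += 1
    return sorted(chips)


# A 64-bit test pattern that came back with bit 18 cleared.
print(failing_chips(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFBFFFF))
```

With x4 chips the same call would take `bits_per_chip=4`.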

tennis2 said:
Back from the dead eh?
When was the last version release? Somewhere around 2011 - 2013?

I gave up on Memtest cuz it couldn't reliably/efficiently test high-GB sticks (aka >2GB) that are common in modern PCs.

The open-source Memtest86+ came back to active development a year or two ago.

The commercial PassMark MemTest86 is a whole different thing.

Darkbreeze · Aug 11, 2022

tennis2 said:
Back from the dead eh?
When was the last version release? Somewhere around 2011 - 2013?

I gave up on Memtest cuz it couldn't reliably/efficiently test high-GB sticks (aka >2GB) that are common in modern PCs.

Yeah, I think you're confusing Memtest86+ with PassMark's MemTest86. They are totally different products, and MemTest86 has no such problems testing any memory size or configuration that I've seen.

JWNoctis · Aug 11, 2022

I wonder what kind of assumptions would have to be made with the layout of double-sided modules, or modules with multiple rows of chips aka SO-DIMM. Are those really the same across manufacturers?

But yes, translating failed address range to specific module is already much better than nothing.

Darkbreeze · Aug 11, 2022

Does it really matter which side of a module the failure is on? It's not like anybody, including the manufacturer, is going to bother fixing it (or be capable of doing so), so whether one side or the other is bad, the result is the same: you return it for replacement under the lifetime warranty that most memory manufacturers offer. And if we're really being honest, for most users it probably doesn't matter which module is bad either, unless you are using DIMMs from multiple kits, which you often CAN do but really isn't recommended as a "best practice". The reason it doesn't matter which DIMM is bad when all DIMMs came in one kit is that you'd want to return the WHOLE KIT for replacement, not just a single DIMM. Kits are tested for compatibility at the factory, and you really don't want them sending you a single replacement DIMM to add back into a kit that it wasn't tested and validated with.

At lower speeds, within the JEDEC defaults, probably not as big of a deal as compatibility is pretty good there across the board when mixing DIMMs, but at higher speeds, not so much, so I'd recommend sending the entire kit for replacement and insisting on it, rather than settling for only having a single DIMM replaced.

bit_user · Aug 12, 2022

FYI, if you have a DIMM with only a handful of errors that are consistently at the same addresses, you can have Linux exclude them from use. I'm not sure if Windows has a similar capability, but I wouldn't be surprised if it did.

It's not a bad option for a DIMM that's out of warranty, but once some errors start cropping up, they're likely to be followed by others. Therefore, it should be seen as a stop-gap solution, while procuring replacement hardware.

BTW, I wish Intel supported in-band ECC on all their CPUs. So far, not even all of their Elkhart Lake models support it. It comes at a performance cost, but the upside is that it could work with any DIMM or motherboard (since no extra traces or chips are needed).

bit_user · Aug 12, 2022

Also, I'm a fan of MemTest86 and bought a personal copy to help fund its development.

It's saved me on a couple occasions. I always do an overnight run (or at least 2 full passes) whenever I build a machine or change its RAM.

salgado18 · Aug 12, 2022

It would be great if the RAM itself could exclude the faulty chip, lowering the total capacity but keeping its working status. Why exchange everything when only a small part is not working properly? And, if you don't want the lower capacity, there's always an RMA or buying another stick.

InvalidError · Aug 12, 2022

salgado18 said:
It would be great if the RAM itself could exclude the faulty chip

The entire memory system and cache structure is designed around DIMMs transferring 128 bytes per burst. If you lose 1/8th of that by writing off a chip altogether, there would be a significant performance penalty from most memory accesses being oddly aligned.

If you want to keep using bad memory, Linux lets you flag memory pages with bad bits. That way, the only thing you lose is a 4kB page.
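One way to do this that I know of is the kernel's `memmap=nn[KMG]$ss[KMG]` boot parameter, which marks a region as reserved. It may not be the mechanism meant above. The sketch below rounds failing physical addresses down to their 4 kB page and merges neighbouring pages into ranges. The helper name `memmap_args` and the sample addresses are made up for illustration.

```python
PAGE_SIZE = 4096  # 4 kB pages, as on typical x86 Linux


def memmap_args(bad_addresses, page_size=PAGE_SIZE):
    """Turn failing physical addresses into memmap= reservation strings."""
    pages = sorted({addr // page_size for addr in bad_addresses})
    args = []
    start = prev = None
    for page in pages:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # extend the current run of adjacent pages
        else:
            args.append(_fmt(start, prev, page_size))
            start = prev = page
    if start is not None:
        args.append(_fmt(start, prev, page_size))
    return args


def _fmt(first_page, last_page, page_size):
    size_kib = (last_page - first_page + 1) * page_size // 1024
    return f"memmap={size_kib}K${first_page * page_size:#x}"


# Hypothetical failing addresses reported by a memory tester.
print(memmap_args([0x12345678, 0x12345FF0, 0x12346004]))
```

Depending on the bootloader, the `$` may need escaping when the string goes into its config file.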

bit_user · Aug 12, 2022

salgado18 said:
It would be great if the RAM itself could exclude the faulty chip, lowering the total capacity but keeping its working status.

Mainframes let you do things like that, although it turns out Chipkill is actually an ECC scheme rather than directly disabling a specific IC.

Such schemes usually come at the expense of complexity and additional cost (i.e. for the extra IC, motherboard traces, etc.). There should be some hit on energy efficiency as well.

salgado18 said:
Why exchange everything when only a small part is not working properly?

Well, at least using removable DIMMs gives you the option to replace only a stick, rather than the CPU (if memory is in-package) or the entire board (if soldered down).

However, we could return to the point about ECC: conventional ECC can correct single-bit errors while detecting double-bit errors. At work, we had a server which had a low rate of single-bit errors for years. We tried replacing the DIMM, but it seems the problem was either in the motherboard or the CPU. So, in the end, it kept running like that for quite a while and remained stable the entire time.

BTW, it was a compute server - not a fileserver or database server - mostly running regression tests. So, the potential memory errors didn't put any data at risk.
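To make the "correct single-bit, detect double-bit" behaviour concrete, here is a toy extended Hamming code over 4 data bits. It is a sketch of the general technique only: ECC DIMMs conventionally protect 64 data bits with 8 check bits, and the actual code a given memory controller uses is not something I can vouch for. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def decode(code: list[int]):
    """Return (data, status); data is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        code[syndrome] ^= 1  # single flip: syndrome names the bad position
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable"  # even parity, nonzero syndrome: two flips
    else:
        status = "ok"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return data, status


word = encode(0b1011)
word[6] ^= 1                 # one flipped bit is repaired
print(decode(word))
word[2] ^= 1                 # a second flip is only detected
print(decode(word))
```

A machine with a steady trickle of single-bit errors keeps running on the "corrected" path. It is the second simultaneous flip in the same word that the code can only report.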
