I've been using "nearly nonstop" extended selftests to try to mitigate a bug in the firmware of the Crucial MX500 ssd that causes excessive write amplification: Before I began running the selftests, the Write Amplification Factor (WAF) averaged about 40 during a period of many weeks. With extended selftests running 19.5 minutes of every 20 minutes, WAF has been averaging about 3 since I began the trial on March 1st. (Smaller WAF is better, all else being equal, because the lifespan of an ssd is inversely proportional to WAF.) My theory is that the selftests keep the ssd busy reading itself so that the ssd gives much less runtime to its buggy lower-priority background process (perhaps Static Wear Leveling and/or Garbage Collection).
The reason why I don't run the selftests completely nonstop is that I don't know whether that would be okay for the ssd's health. In other words, nonstop selftests might not allow the ssd background processes enough runtime. That's why I'm providing 30 seconds of "idle" time out of every 20 minutes, even though nonstop selftests caused WAF to average about 1.5.
To be specific, I'm using Windows CMD.exe to run an infinite-loop .bat file that performs the following steps on each pass: (1) "smartctl.exe -t long -t force /dev/sda" to start an extended selftest, (2) "timeout 1170" to pause the .bat file for 19.5 minutes, (3) "smartctl -X" to abort the selftest, and (4) "timeout 30" to pause for 30 seconds. (Note: an extended selftest allowed to run to completion, rather than aborted, would take about 25 minutes on my 500 GB Crucial MX500 ssd.)
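For reference, a minimal sketch of that loop looks like the following (this is only an outline, and it assumes smartctl.exe is on the PATH and that /dev/sda is the correct device name on this pc):

    @echo off
    rem keep an extended selftest running for 19.5 of every 20 minutes
    :loop
    rem (1) start an extended selftest, forcing a new one even if a test is already running
    smartctl.exe -t long -t force /dev/sda
    rem (2) let the selftest run for 19.5 minutes (1170 seconds)
    timeout /t 1170 /nobreak
    rem (3) abort the selftest
    smartctl.exe -X /dev/sda
    rem (4) give the drive 30 seconds of idle time for its background processes
    timeout /t 30 /nobreak
    goto loop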
I'm starting this thread because I'm concerned that the periodic aborting of the selftests might cause the physically highest-numbered sectors of the ssd to NEVER be selftested or read. In other words, each loop might selftest within the same subset of physical sectors, and the periodic abort commands might prevent the ssd from ever selftesting the physical sectors outside that subset. I don't know whether it's safe for the long-term health of the ssd, and the data written on it, if some sectors are never selftested.
So my questions are:
Q1. Does the periodic aborting of the extended selftests indeed cause some sectors to NEVER be selftested?
Q2. If the answer to Q1 is Yes, is that unsafe for the ssd's health, or does it have any other negative consequences or side effects?
Q3. If the answers to Q1 and Q2 are Yes, is there some way to ensure all sectors will (eventually or periodically) be selftested, while also ensuring that a selftest is running for 19.5 minutes of every 20, and without requiring the pc to frequently read SMART data to check whether a selftest is in progress?
The three goals in Q3 seem to conflict. If I use the "smartctl -t select,N-max" command to selftest a subset of sectors that includes all of the high sectors, I assume the selftest will terminate after it reaches the highest sector but before 19.5 minutes. (Or the pc will abort it at 19.5 minutes before it reaches the highest sector, if the starting point N is chosen too small.) If the selftest terminates after it reaches the highest sector, a new selftest would need to be immediately restarted and allowed to run for the remainder of the 19.5 minutes. Unless there's a way to command the ssd to automatically start another selftest after a selftest concludes, the pc would need to intervene to immediately start the next selftest, and I assume this would require the pc to frequently (every second or so) read SMART data to check whether a selftest is in progress.
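For illustration, the polling fallback described above might look roughly like the following (untested, and the starting LBA 900000000 is only a placeholder for "a range covering the high sectors" of my 500 GB drive):

    rem hypothetical polling sketch: selftest the high sectors, poll once per second,
    rem then immediately start an extended selftest when the selective test finishes
    smartctl.exe -t select,900000000-max /dev/sda
    :poll
    timeout /t 1 /nobreak >nul
    smartctl.exe -c /dev/sda | findstr /C:"in progress" >nul
    if %errorlevel%==0 goto poll
    smartctl.exe -t long -t force /dev/sda

The once-per-second "smartctl -c | findstr" check is exactly the frequent SMART reading I'd prefer to avoid.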
The smartctl documentation doesn't say what happens with the "smartctl -t select,N+size" command if N+size exceeds the highest LBA of the ssd. If the selftest would immediately wrap around to LBA 0 and continue selftesting, I think I could construct a solution. But my hunch is that the selftest would terminate (before 19.5 minutes) when it reaches the max LBA. Does the SMART spec specify whether a "select,N+size" selftest must wrap around or terminate after it reaches the highest LBA?
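If nobody knows the answer, I suppose I could test it empirically with something like the following (placeholder numbers, chosen so the span runs past the last LBA of a 500 GB drive) and then see what span and status the drive reports:

    rem start a selective selftest whose span extends past the last LBA (placeholder values)
    smartctl.exe -t select,970000000+20000000 /dev/sda
    rem later, check the reported spans and selftest status in the selective selftest log
    smartctl.exe -l selective /dev/sda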
I suspect the reason why other people haven't noticed the MX500 WAF bug is that my pc writes to the ssd much less than a typical pc does. It writes only about 100 kBytes/second, on average. For some reason, WAF isn't as large when the pc writes a lot more to the ssd. A small rate of writing may be a use case that the ssd firmware engineers didn't test. Or perhaps the engineers intentionally made write amplification unnecessarily high for small rates of writing, in order to make the ssd Remaining Life reach 0% soon after its warranty expires... planned obsolescence.
I should also mention that the high WAF problem accelerated before I began the selftests. The ssd's Remaining Life reached 99% on Aug 31, a few weeks after the new ssd was installed in the new pc, after the pc had written 1,772 GB to the ssd. Remaining Life reached 95% on Dec 23, after the pc had written a total of 5,782 GB to the ssd. The pc wrote an additional 390 GB to reach 94% on Jan 15, and an additional 138 GB to reach 93% on Feb 4.
Two other interesting observations about the WAF problem:
1. The bug appears to manifest itself as occasional high-speed bursts that write multiples of about 37,000 NAND pages... multiples of about 1 GByte. (Most of the bursts are approximately 37,000 pages. Some are about 74,000 pages. I've seen a few as high as 5 x 37,000 pages.) I don't know whether ALL of those bursts are due to the bug, or whether some bursts are the result of proper design and the problem is simply that there are many more bursts than necessary. During a burst, the ssd takes nearly 5 seconds to write each 37,000 pages. When the ssd isn't writing bursts, its background processes write an average of about 15 pages per 5 seconds. Occasionally there are "mini-bursts" of about 2,400 pages; perhaps those are the result of a different background process. The huge bursts that are multiples of 37,000 pages occur only during the idle time between selftests. The mini-bursts can occur during selftests.
2. Other people have reported another Crucial MX500 bug that I've determined is correlated with the huge NAND write bursts: the "Current Pending Sector Count" SMART attribute sometimes mysteriously becomes 1, and goes back to zero after a few seconds. (The change to 1 triggers health alerts from the monitoring software that many people use, and the consensus among MX500 users and Smartmontools is that it's a firmware bug, even though Crucial denies it's a bug.) Smartmontools began calling the attribute "bogus" on MX500 ssds. By logging SMART data every 2 seconds (a rough sketch of such a logging loop appears below), I determined the correlation is perfect: the Pending Count becomes 1 at the start of each huge NAND write burst and goes back to 0 at the end of each burst. (I'm unsure whether the mini-bursts correlate with the Pending Count bug too, but I think not. By accident, during the hour after the switch to Daylight Saving Time on March 8, my logging app logged SMART data as fast as it could, without pausing. During that hour there were several mini-bursts, and the logged Pending Counts stayed at zero.) I don't know what the causal relation is underneath the correlation... whether the bogus change to 1 causes the huge write burst as a response, or whether the huge write burst causes the bogus change.
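My real 2-second logger is a separate app, but a rough .bat equivalent would look like this (smart_log.txt is just a hypothetical file name):

    rem append a timestamp and the full SMART attribute table every 2 seconds,
    rem so the NAND-page counters can be lined up against Current Pending Sector Count
    :log
    echo %date% %time% >> smart_log.txt
    smartctl.exe -A /dev/sda >> smart_log.txt
    timeout /t 2 /nobreak >nul
    goto log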