3rd Annual Report of Effect of Nearly Nonstop Selftests on my Crucial MX500 SSD
EXECUTIVE SUMMARY
The selftests regime has been running for three years. It continues to be effective at reducing the SSD's buggy Write Amplification, and there are no signs of any negative consequences (other than slightly increased power consumption due to the SSD not entering the "idle" low power mode).
During the three years of the selftests regime, the Write Amplification Factor (WAF) has averaged 3.25.
During these three years, Remaining Life has decreased by 4.87 percentage points (from 92.2% to 87.33%). Extrapolating linearly from these three years, Remaining Lifetime (RLT) is now about 54 years. Extrapolating instead from the 5 weeks prior to the start of the selftests regime (1/15/2020 to 2/22/2020), RLT was 5 years on 2/22/2020, and would presumably now be about 2 years. (Death soon after the 5 year warranty expires. Coincidence?)
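The 54-year figure can be reproduced with a simple linear extrapolation. The sketch below is only an illustration of that arithmetic; the function name is hypothetical, and it assumes wear continues at the same average rate as in the three years of the selftests regime.

```python
def remaining_lifetime_years(life_now_pct, life_used_pct, period_years):
    """Linear extrapolation: years left if wear continues at the observed rate."""
    wear_rate_pct_per_year = life_used_pct / period_years
    return life_now_pct / wear_rate_pct_per_year

life_start_pct = 92.2   # Remaining Life at the start of the selftests regime
life_now_pct = 87.33    # Remaining Life three years later

rlt = remaining_lifetime_years(life_now_pct, life_start_pct - life_now_pct, 3.0)
print(f"Remaining Lifetime: about {rlt:.0f} years")
```

The same function applied to the 5 weeks before the regime would need the Remaining Life values for that period, which are not given to the same precision in this report.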
I still suspect there's a correlation between excessive Write Amplification and the amount of time the SSD has been powered on. Every few weeks I paste the recent daily log data into my spreadsheet, and if I observe that the daily Write Amplification Factor has exceeded 4 for a few days, I power cycle the SSD by sleeping the pc for a few seconds. I also shut down the pc for a few seconds when a Windows update requires a restart or shutdown. But I haven't verified or debunked the correlation by analyzing the log data. I would need to study the statistical functions available in the spreadsheet or learn a statistics software package, and I haven't had a strong incentive to spend time on it because the RLT is already satisfactorily long.
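One way the suspected correlation could eventually be checked is with a Pearson correlation coefficient between "days since the last power cycle" and "daily WAF". The sketch below is only an illustration: the two lists are made-up placeholder values, not data from my logs, and the variable names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT measurements).
days_since_power_cycle = [1, 2, 3, 4, 5, 6]
daily_waf = [2.0, 2.5, 2.4, 3.5, 4.2, 4.8]

r = pearson(days_since_power_cycle, daily_waf)
print(f"r = {r:.2f}")  # a value near +1 would support the suspicion
```

A coefficient near zero on the real daily data would argue against the correlation. Most spreadsheets also offer a built-in correlation function that performs the same calculation.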
LOG OF REMAINING LIFE
Remaining Life % | Date | Total Host Writes (GB) | Host Writes (GB) Per Row |
100 | 07/28/19 | 0 | |
99 | 08/31/19 | 1772 | 1772 |
98 | not logged | not logged | not logged |
97 | not logged | not logged | not logged |
96 | not logged | not logged | not logged |
95 | 12/23/19 | 5782 | |
94 | 01/15/20 | 6172 | 390 |
93 | 02/04/20 | 6310 | 138 |
92 | 03/13/20 | 6647 | 337* |
91 | 10/19/20 | 8178 | 1531 |
90 | 09/16/21 | 9395 | 1217 |
89 | 05/20/22 | 10532 | 1137 |
88 | 11/12/22 | 12082 | 1550 |
Comparing the "Host Writes Per Row" logged after 3/13/2020 (during which Remaining Life decreased from 92% to 88%) to the 138 GB logged on 2/04/2020 (when Remaining Life decreased from 94% to 93%) shows that the selftests regime has allowed about ten times as much host data to be written per percent of SSD life used: [1531 GB + 1217 GB + 1137 GB + 1550 GB] / [92% - 88%] = 1359 GB per Percent of Life Used, versus 138 GB before the regime.
* The row logged on 3/13/2020 corresponds to the period in which Remaining Life decreased from 93% to 92%, which included the last three weeks prior to the start of the selftests regime plus the first three weeks of the selftests regime. Because of that mix, the 337 GB written during this period isn't useful data, so it was excluded from the previous paragraph's calculation of "ten times as much host data."
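The "about ten times" comparison can be rechecked directly from the Total Host Writes column. The sketch below only repeats that arithmetic; the variable names are mine.

```python
# Total Host Writes (GB) when each Remaining Life % was first logged (from the table above)
total_host_writes_gb = {94: 6172, 93: 6310, 92: 6647, 91: 8178, 90: 9395, 89: 10532, 88: 12082}

# Before the selftests regime: the 94% -> 93% row
gb_per_pct_before = total_host_writes_gb[93] - total_host_writes_gb[94]   # 138 GB

# During the selftests regime: 92% -> 88% (the mixed 93% -> 92% row is excluded)
gb_written_during = total_host_writes_gb[88] - total_host_writes_gb[92]   # 5435 GB
gb_per_pct_during = gb_written_during / (92 - 88)                         # 1358.75 GB

ratio = gb_per_pct_during / gb_per_pct_before
print(f"{gb_per_pct_during:.0f} GB per percent, {ratio:.1f} times the pre-regime figure")
```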
HISTORICAL BACKGROUND
Early in 2020, after about 5 months of use, my MX500 SSD's Write Amplification was excessive and growing worse. During the 5 weeks prior to the start of the selftests regime, the Write Amplification Factor (WAF) was 45.63. Analysis of SMART attribute F8 (logged every second for a few hours) revealed excess writing by the SSD controller many times per day in brief bursts, each burst approximately 37,000 NAND pages (about 1 GByte) or a small integer multiple of that amount. Each burst lasted about 5 seconds, or a small integer multiple of about 5 seconds. The write bursts correlated perfectly with a well-known MX500 bug: Current Pending Sector Count occasionally briefly becomes 1 (which triggers warning alerts for users who run software that monitors SMART data).
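The burst analysis described above can be sketched roughly as follows. This is not the procedure I actually used, just an illustration under assumptions: F8 is treated as a cumulative page counter sampled once per second, the threshold of 1000 pages per second is an arbitrary hypothetical value, and the sample data is synthetic.

```python
def find_bursts(f8_samples, threshold_pages_per_sec=1000):
    """Group consecutive high-delta seconds into bursts.

    f8_samples: cumulative F8 counter values, one sample per second.
    Returns a list of (start_index, duration_seconds, pages_written).
    """
    bursts = []
    start = None
    pages = 0
    for i in range(1, len(f8_samples)):
        delta = f8_samples[i] - f8_samples[i - 1]
        if delta >= threshold_pages_per_sec:
            if start is None:          # a new burst begins
                start = i
                pages = 0
            pages += delta
        elif start is not None:        # the burst just ended
            bursts.append((start, i - start, pages))
            start = None
    if start is not None:              # burst still running at end of log
        bursts.append((start, len(f8_samples) - start, pages))
    return bursts

# Synthetic log: 3 quiet seconds, a 5-second burst of 37,000 pages, 3 quiet seconds.
samples = [0]
for delta in [10, 10, 10] + [7400] * 5 + [10, 10, 10]:
    samples.append(samples[-1] + delta)

bursts = find_bursts(samples)
for start, duration, pages in bursts:
    print(f"burst at t={start}s: {duration} s, {pages} pages ({pages / 37000:.1f} x 37,000)")
```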
I guessed that an SSD selftest, which reads SSD blocks to check for issues, might have higher runtime priority than the SSD firmware routine responsible for the excessive writes. Testing showed that the guess was correct, and benchmarking showed that selftests don't reduce read or write performance. So I wrote a .BAT script to automatically run SSD selftests nearly nonstop, to drastically reduce the runtime available to the SSD's buggy firmware routine. The selftests regime began in late February 2020 and has been running ever since (whenever the pc is powered on, which is 24 hours per day except for occasional maintenance or power outages). I also wrote a .BAT script that logs SMART data in comma-delimited format daily and every 2 hours, and have copy/pasted all the daily data into a spreadsheet to analyze it.
In the early days I experimented to find the optimal duty cycle of the selftests, to optimize the health of the SSD. Nonstop selftests reduced the SSD's Write Amplification the most, to less than 2. Running selftests 19.5 of every 20 minutes reduced it more than running them 19 of every 20 minutes did. I settled on 19.5 of every 20 minutes. The 30-second pauses between selftests are available for the SSD firmware to run any low priority maintenance tasks essential for SSD health. Examination of logged data showed that most pauses contain no write bursts, so it seems a good bet that pausing for 30 seconds every 20 minutes has been providing sufficient runtime for essential low priority tasks, if there are any.
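For readers who would rather not use a .BAT file, the 19.5-of-20-minutes cycle could be sketched in Python roughly like this. It is not my actual script. It assumes smartctl (from smartmontools) is installed and run with administrator rights, that an extended selftest on the drive takes longer than 19.5 minutes, and that the drive is /dev/sda; the function names are hypothetical.

```python
import subprocess
import time

RUN_SECONDS = 19.5 * 60   # selftest active for 19.5 minutes
PAUSE_SECONDS = 30        # pause so the firmware can do low priority housekeeping

def one_cycle(device, run=subprocess.run, sleep=time.sleep):
    """Start an extended selftest, let it run, abort it, then pause."""
    run(["smartctl", "-t", "long", device], check=False)   # start selftest
    sleep(RUN_SECONDS)
    run(["smartctl", "-X", device], check=False)           # abort selftest
    sleep(PAUSE_SECONDS)

def duty_cycle():
    """Fraction of each cycle during which a selftest is running."""
    return RUN_SECONDS / (RUN_SECONDS + PAUSE_SECONDS)

def run_forever(device="/dev/sda"):
    """Repeat the cycle until the process is stopped (assumed device name)."""
    while True:
        one_cycle(device)

print(f"duty cycle: {duty_cycle():.1%}")
```

Calling run_forever() would start the loop. The run and sleep parameters exist only so the cycle can be dry-run without touching a real drive.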