Just read a post I think you made on reddit, here is my reply to it, tell me what you think?
https://www.reddit.com/r/unRAID/comments/gwighw/_/fthxdfv
I note also that these drives are very cheap relative to their market reputation; they had rave reviews everywhere, yet they seem to be constantly on sale. That's what attracted me to buying two in the first place. It almost seemed too good to be true, and now we know it is.
I have offered my MX500 to a reviewer to see if it can get media coverage; he hasn't made a decision yet. I pointed him to the Reddit thread and this thread.
I see you have this as a theory as well; sorry, I missed your post earlier.
https://forums.tomshardware.com/threads/crucial-mx500-500gb-sata-ssd-remaining-life-decreasing-fast-despite-few-bytes-being-written.3571220/post-22477904
I wrote that post elsewhere (on the Unraid forum), and it was copied to Reddit.
Regarding the resetting of the Current Pending Sectors attribute from 1 to 0, you raise a good question. I haven't been running SMART-monitoring software that alerts me when the CPS attribute becomes 1, so I'm not 100% certain that CPS stays 0 during the 19.5 minute selftests (and only changes to 1 during the 30 second pauses between selftests). If it changes to 1 during a selftest, and stays 1 until a write burst during a pause resets it to 0, then I think my logs would show CPS=1 at the beginning of most pauses; to the contrary, they show CPS=0 at the beginning of nearly all of the pauses. So, here's a 2-part question:
1. If CPS changes briefly to 1 during a selftest, what changes it back to 0?
2. If CPS rarely or never changes to 1 during a selftest, why doesn't it?
If CPS changes to 1 during a selftest, perhaps it gets quickly reset to 0 by a process with higher priority than a selftest, after spawning the lower priority process that writes the huge bursts.
The article about the Samsung 840 firmware update that caused those ssds to write a lot in order to refresh cheap NAND didn't say the Samsung writing is as excessive as the Crucial MX500 writing. To totally rewrite a 500GB drive once every few months would require writing only about 5GB per day. The following excerpt from my logs in February 2020 (before I began the selftests regime) shows the FTL controller was writing upwards of 10 million NAND pages per day (the "ΔF8" column), which is roughly 300 GB per day (assuming 1 GB is about 37,000 NAND pages). So the Crucial writing is extremely excessive. Furthermore, only about 100 GB of the ssd was in use; most of the ssd was free space that should not need to be refreshed. If Crucial's goal was to mimic Samsung's firmware "fix", then Crucial's implementation of the algorithm is terrible.
ΔF7 (1 day) | ΔF8 (1 day) | Daily WAF = 1 + ΔF8/ΔF7 |
231,144 | 12,894,568 | 56.79 |
260,934 | 10,066,176 | 39.58 |
278,028 | 16,578,426 | 60.63 |
281,524 | 2,807,244 | 10.97 |
230,270 | 8,203,271 | 36.62 |
269,722 | 14,042,509 | 53.06 |
594,613 | 5,228,740 | 9.79 |
352,795 | 7,810,689 | 23.14 |
144,904 | 12,980,755 | 90.58 |
399,835 | 21,234,970 | 54.11 |
229,493 | 1,941,470 | 9.46 |
237,292 | 8,271,372 | 35.86 |
221,996 | 12,748,544 | 58.43 |
262,998 | 18,637,064 | 71.86 |
287,574 | 6,699,994 | 24.30 |
201,811 | 9,833,974 | 49.73 |
275,300 | 6,697,966 | 25.33 |
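For anyone who wants to reproduce the arithmetic, here is a small Python sketch (just an illustration, not part of any monitoring script) that applies the same assumption of roughly 37,000 NAND pages per GB to the first row of the table above:

# Rough sanity check of the figures above, using the assumption that
# 1 GB is about 37,000 NAND pages (actual page size varies by NAND generation).
PAGES_PER_GB = 37_000

def daily_stats(delta_f7, delta_f8):
    """delta_f7 = host pages written per day; delta_f8 = FTL pages written per day."""
    waf = 1 + delta_f8 / delta_f7            # Daily WAF, as in the table header
    ftl_gb_per_day = delta_f8 / PAGES_PER_GB
    return waf, ftl_gb_per_day

# First row of the log excerpt:
waf, ftl_gb = daily_stats(231_144, 12_894_568)
print(f"WAF = {waf:.2f}, FTL writes ~ {ftl_gb:.0f} GB/day")
# -> WAF = 56.79, FTL writes ~ 349 GB/day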
Regarding your "extension" of my theory... I don't think ssds require a background process to detect hard-to-read cells ("soft errors"). I think they're detected whenever ANY read process (host reads, selftests, etc.) tries to read them. I don't know whether hard-to-read cells are triggering the bug. Maybe. My theory was that slow-to-read cells trigger the bug... that Crucial pushes cheap NAND to read at faster speeds than it can reliably handle, in order to perform at speeds comparable to the competition.
I don't think selftesting refreshes cell data; I think only writing can refresh it. If the purpose of Crucial's write bursts is to refresh cells -- similar to Samsung's 840 "fix" -- and the selftests are preventing most of the write bursts, and if this prevention is risking data loss, perhaps I could find evidence of data loss by examining the output of a selftest. My selftests software has been throwing away the output of each selftest. It doesn't analyze the selftest output to see if the selftest encountered any errors.
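As an aside, checking the result instead of discarding it would be easy to bolt on. Here's a minimal sketch of the idea in Python, assuming smartmontools is installed; the device path is just an example:

# Minimal sketch: read the drive's SMART selftest log after a selftest finishes
# and flag anything other than "Completed without error".
import subprocess

DEVICE = "/dev/sdb"  # example; substitute the MX500's device node

def latest_selftest_entry(device=DEVICE):
    """Return the most recent entry from 'smartctl -l selftest'."""
    out = subprocess.run(["smartctl", "-l", "selftest", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.lstrip().startswith("# 1"):   # "# 1" is the newest log entry
            return line.strip()
    return None

entry = latest_selftest_entry()
if entry and "Completed without error" not in entry:
    print("Selftest reported a problem:", entry)

If the selftests in the regime are aborted before completion, their log entries would read "Aborted by host" and would need to be filtered out rather than treated as errors.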
Perhaps Crucial's selftest algorithm is smart enough to arrange for hard-to-read "soft error" cells to be refreshed, by appending their addresses to a list of cells that another process will deal with sooner or later. This would be a good thing, because instead of interfering with the discovery & handling of flaky cells, a selftest would aid their discovery.
Here's a wild theory: Perhaps the FTL controller runs a low priority background task that checks for flaky cells by trying to read cells at a speed that's faster than normal. Or something equivalent, such as measuring how long it takes to read cells and considering a cell flaky if the time exceeds some threshold. This theory could explain why other read processes -- host reads, selftests -- don't set CPS to 1, and why CPS goes to 1 (for several seconds) only during pauses between selftests. To prove that the only processes that cause CPS to be set to 1 have lower priority than a selftest, I think someone would need to simultaneously run both of the following for many hours: (1) a selftests regime, and (2) software that monitors CPS at a high polling rate (once per second?), discarding the CPS samples that occur during selftest pauses. Simplest would be to run nonstop selftests (no pauses) for that test, so that no data needs to be discarded (except perhaps during the brief moments between when a selftest is aborted and the next selftest is started). Not sure that I'll be able to find time to run that test.
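If someone does attempt that test, the monitoring half could be as simple as the following Python sketch. It assumes smartctl is available; the device path and polling interval are just examples, and it only logs transitions of attribute 197 (Current Pending Sector Count) with timestamps so they can be lined up against the selftest schedule afterward:

# Poll SMART attribute 197 about once per second and log every change.
import subprocess, time

DEVICE = "/dev/sdb"      # example device node
POLL_INTERVAL = 1.0      # seconds

def read_cps(device=DEVICE):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.lstrip().startswith("197"):
            return int(line.split()[-1])   # raw value is the last column
    return None

last = None
while True:
    cps = read_cps()
    if cps != last:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "CPS =", cps, flush=True)
        last = cps
    time.sleep(POLL_INTERVAL)

One caveat: polling smartctl once per second is itself a small workload the drive sees, and I don't know whether that would perturb the behavior being measured.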