Question: Crucial MX500 500GB SATA SSD - Remaining Life decreasing fast despite only a few bytes being written to it?


Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
The Remaining Life (RL) of my Crucial MX500 ssd has been decreasing rapidly, even though the pc doesn't write much to it. Below is the log I began keeping after I noticed RL reached 95% after about 6 months of use.

Assuming RL truly depends on bytes written, the decrease in RL is accelerating and something is very wrong. The latest decrease in RL, from 94% to 93%, occurred after writing only 138 GB in 20 days.

(Note 1: After RL reached 95%, I took some steps to reduce "unnecessary" writes to the ssd by moving some frequently written files to a hard drive, for example the Firefox profile folder. That's why only 528 GB have been written to the ssd since Dec 23rd, even though the pc is set to Never Sleep and is always powered on. Note 2: After the pc and ssd were about 2 months old, around September, I changed the pc's power profile so it would Never Sleep. Note 3: The ssd still has a lot of free space; only 111 GB of its 500 GB capacity is occupied. Note 4: Three different software utilities agree on the numbers: Crucial's Storage Executive, HWiNFO64, and CrystalDiskInfo. Note 5: Storage Executive also shows that Total Bytes Written isn't much greater than Total Host Writes, implying write amplification hasn't been a significant factor.)

My understanding is that Remaining Life is supposed to depend on bytes written, but it looks more like the drive reports a value that depends mainly on its powered-on hours. Can someone explain what's happening? Am I misinterpreting the meaning of Remaining Life? Isn't it essentially a synonym for endurance?


Crucial MX500 500GB SSD in desktop pc since summer 2019

Date         Remaining Life   Total Host Writes (GB)   Host Writes (GB) Since Previous Drop
12/23/2019   95%              5,782
01/15/2020   94%              6,172                    390
02/04/2020   93%              6,310                    138
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
[snip]
From what I have seen though, the Health value of these affected drives actually rolls over to 100% and starts counting down again, which isn't ideal since you'd expect a drive with 0% health NAND to go into read-only mode.

That's interesting about Remaining Life rollover from 0% back to 100%. Where have you seen this?

I'm unsure whether it's desirable or undesirable for the ssd to remain writeable after rollover; being able to keep writing to the drive may be a benefit.

Do you have any idea whether the ssd's write amplification algorithm changes after rollover? (Or earlier, when Remaining Life is low.) At that point, continuing to run a wear-leveling routine makes no sense to me and would seem entirely self-destructive. Unless my thinking about this has grown fuzzy since I last thought about it two years ago, I think the only desirable write amplification when Remaining Life is low is the minimal amplification that's necessary: copying (and then erasing) an entire block when the host pc partially rewrites the block's contents.
 

Diceman_2037

Distinguished
Dec 19, 2011
53
3
18,535
That's interesting about Remaining Life rollover from 0% back to 100%. Where have you seen this?

I'm unsure whether it's desirable or undesirable for the ssd to remain writeable after rollover; being able to keep writing to the drive may be a benefit.

Do you have any idea whether the ssd's write amplification algorithm changes after rollover? (Or earlier, when Remaining Life is low.) At that point, continuing to run a wear-leveling routine makes no sense to me and would seem entirely self-destructive. Unless my thinking about this has grown fuzzy since I last thought about it two years ago, I think the only desirable write amplification when Remaining Life is low is the minimal amplification that's necessary: copying (and then erasing) an entire block when the host pc partially rewrites the block's contents.

Here
TechPowerUp user posted this

[attachment: CrystalDiskInfo screenshot, 2022-02-23]


The SMART data has exceeded 100% and the value has rolled over from 200, which it shouldn't; it's actually at 165% lifetime used.
 
The Average Block Erase Count is 0x9AB (2475) and the Percentage Life Used is 0xA5 (165%). This means that the rated number of P/E cycles is ...

(100% / 165%) x 2475 = 1500 P/E cycles​

If a 512GiB NAND array is reprogrammed 1500 times, that's a total of ...

1500 x 512GiB = 824 TB​
... and for 2475 times the figure is ...

2475 x 512GiB = 1360 TB​

CrystalDiskInfo is reporting 22.9 TB for the Total Host Writes. Has this figure rolled over, too?
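For anyone who wants to reproduce the arithmetic above, here is a minimal sketch in Python (it assumes, as the post does, a 512 GiB raw NAND array and decimal terabytes):

```python
# Sketch of the P/E-cycle arithmetic above (assumes a 512 GiB raw NAND array).
GIB = 2**30
TB = 10**12

abec = 0x9AB          # Average Block Erase Count raw value = 2475
pct_used = 0xA5       # Percent Lifetime Used raw value = 165

rated_pe = abec * 100 / pct_used     # rated P/E cycles implied by the ratio
nand_bytes = 512 * GIB

print(round(rated_pe))                        # 1500
print(round(rated_pe * nand_bytes / TB, 1))   # ~824.6 TB at the rated cycle count
print(round(abec * nand_bytes / TB, 1))       # ~1360.6 TB at 2475 cycles
```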
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
Here
TechPowerUp user posted this
<screencapture omitted>
The SMART data has exceeded 100% and the value has rolled over from 200, which it shouldn't; it's actually at 165% lifetime used.

Where do you get the "rolled over from 200" idea? The Raw Values column shows Percent Lifetime Used is A5, which in decimal is 165. The Average Block Erase Count (ABEC) value is 9AB, which in decimal is 2475. As is common knowledge, 100% Lifetime Used corresponds to ABEC = 1500, and the ratio of 2475 to 1500 is 1.65, the same as the ratio of 165% to 100%. To me, Lifetime Used and ABEC appear to be counting up normally, with neither having rolled over.

It's unfortunate that the screencapture failed to capture the value of Background Program Page Count. It would be interesting to use it in combination with Host Program Page Count to see the total NAND pages written to the ssd and to calculate the ssd's Write Amplification Factor.
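If that attribute were visible, the calculation would be a one-liner: total programmed pages divided by host-programmed pages. A sketch (the background page count below is a placeholder, not a value from the screenshot):

```python
# Rough WAF estimate from the two program-page counters:
# (host pages + controller/background pages) / host pages.
host_pages = 4_098_820_898   # attribute F7, Host Program Page Count (shown in the screenshot)
background_pages = 0         # attribute F8, Background Program Page Count (NOT shown; placeholder)

waf = (host_pages + background_pages) / host_pages
print(f"WAF = {waf:.2f}")
```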

The ssd's Total Host Writes is 22,926 GB (and its Total Host Sector Writes is b31c9b88c, which in decimal is 48,079,943,820). That is so small relative to the 2475 average block erases that it seems a reasonably good bet that the ssd is a victim of the FTL write bug.

It would be interesting to learn how far beyond 165% that ssd reaches before it fails, because if it reaches something like 1000% it might mean the FTL write bug doesn't really exist. The actual bug might be grossly inaccurate counting of several SMART values: ABEC, Lifetime Used, and Background Program Page Count. In other words, perhaps those values can increase without actually writing to the ssd.
 
The ssd's Total Host Writes is 22,926 GB (and its Total Host Sector Writes is b31c9b88c, which in decimal is 48,079,943,820). That is so small relative to the 2475 average block erases that it seems a reasonably good bet that the ssd is a victim of the FTL write bug.

It would be interesting to learn how far beyond 165% that ssd reaches before it fails, because if it reaches something like 1000% it might mean the FTL write bug doesn't really exist. The actual bug might be grossly inaccurate counting of several SMART values: ABEC, Lifetime Used, and Background Program Page Count. In other words, perhaps those values can increase without actually writing to the ssd.

I suspect that the raw values of the various attributes may be affected by roll-over. For example, the real value of Total Host Sector Writes could be 0x10b31c9b88c, which would correspond to a maximum allocation of 40 bits. The total host writes would then be 587 TB.

https://www.google.com/search?client=opera&q=0x10b31c9b88c+x+512+bytes+in+TB
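A quick check of that roll-over arithmetic (a sketch; the 40-bit counter width is, as stated, only a conjecture):

```python
# If the displayed raw value actually wrapped around a 40-bit counter,
# the true sector count would be the displayed value plus 2**40.
raw_sectors = 0xB31C9B88C               # displayed Total Host Sector Writes
true_sectors = raw_sectors + (1 << 40)  # == 0x10B31C9B88C, the conjectured real value

print(f"{true_sectors:,}")              # 1,147,591,571,596 sectors
print(true_sectors * 512 / 10**12)      # ~587.6 TB of host writes
```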

Where do you get the "rolled over from 200" idea?

I think @Diceman_2037 is referring to attribute 0xAD. The normalised value of this attribute usually counts down from 100, but in this case it appears to be counting down from 200.

Here are other MX500 SMART data:

https://www.techpowerup.com/forums/attachments/xxxx-png.217963/
https://i1.wp.com/www.thessdreview.com/wp-content/uploads/2017/12/Crucial-MX500-1TB-CDI.png?resize=686,768&ssl=1
https://i.imgur.com/mmUzcBN.jpeg
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
<snip>
If a 512GiB NAND array is reprogrammed 1500 times, that's a total of ...
1500 x 512GiB = 824 TB...
<snip>

CrystalDiskInfo is reporting 22.9 TB for the Total Host Writes. Has this figure rolled over, too?

and in a later post:
I suspect that the raw values of the various attributes may be affected by roll-over. For example, the real value of Total Host Sector Writes could be 0x10b31c9b88c, which would correspond to a maximum allocation of 40 bits. The total host writes would then be 587 TB.

An ssd can't be rewritten in place in small chunks like ram can: NAND is programmed in pages but erased only in whole blocks, so rewriting even a single byte eventually requires copying the block's still-valid data elsewhere and erasing the block... write amplification. Also, the wear leveling algorithm spends erases throughout the lifetime of the ssd trying to give every block a similar erase count (though I don't understand why that's worth the extra erases). And the FTL write bug is very wasteful, relatively speaking, if the host write rate isn't large. Thus most of the "reprogramming" is written by the ssd controller, not by the host pc. That 824 TB calculation is misleading, even without the FTL write bug.
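To make the block-copy point concrete, here is a toy worst-case calculation with hypothetical NAND geometry (the MX500's actual page and block sizes aren't given here):

```python
# Toy illustration of write amplification from rewriting part of an erase block.
# The geometry below is hypothetical, not the MX500's actual layout.
PAGE = 16 * 1024                  # assumed NAND page size
PAGES_PER_BLOCK = 512             # assumed pages per erase block
BLOCK = PAGE * PAGES_PER_BLOCK    # 8 MiB erase block

host_change = 4 * 1024            # host rewrites 4 KiB that lives inside the block

# Worst case: the controller copies every still-valid page to a fresh block and
# erases the old one, so the NAND sees a whole block's worth of program activity.
worst_case_wa = BLOCK / host_change
print(worst_case_wa)              # 2048.0
```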

The FTL write bug might explain why the Total Host Writes is only 22.9 TB, much less than the 180 TB endurance specification advertised by Crucial.

The ssd's 2475 Average Block Erase Count (ABEC) implies ABEC incremented at an average rate higher than once per day, because once per day would correspond to 6.8 years in service and Crucial launched the MX500 in 2018. Before I tamed the bug in my ssd using selftests, the rate at which my ssd's ABEC was incrementing appeared to be accelerating... it was nearly once per day, and perhaps it would have gone much higher if the bug hadn't been tamed.

Your conjecture that Total Host Sector Writes rolled over (and thus Total Host Writes too) seems plausible. One way to check it is with the Host Program Page Count, which is f44f0b22 (4,098,820,898 in decimal) and perhaps hasn't rolled over. Total Host Writes and Total Host Sector Writes are about twice my ssd's (11,784 GB and 24,713,381,021) -- but Host Program Page Count is 4,098,820,898, which is much more than twice my ssd's 445,568,819. It's closer to nine times mine (about 9.2x), which suggests the actual value of Total Host Writes may be approximately 9.2 times mine: 9.2 x 11,784 GB = roughly 108,000 GB (unless Host Program Page Count also rolled over).

108 TB is much larger than 22.9 TB, so it suggests rollover of Total Host Sector Writes occurred; but it's much smaller than 587 TB, so it suggests the true value of Total Host Sector Writes isn't 10b31c9b88c and that the counter isn't 40 bits wide. On the other hand, I would expect Crucial to have allocated enough bits to track Total Host Sector Writes beyond the published endurance spec of 180 TB, so perhaps Host Program Page Count rolled over too, or isn't a reliable measure of host writing.
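Here is the page-count comparison as a sketch (it assumes Host Program Page Count hasn't rolled over on either drive):

```python
# Estimate the TechPowerUp drive's true host writes by scaling from my drive,
# assuming Host Program Page Count (F7) has not rolled over on either drive.
their_f7 = 0xF44F0B22            # 4,098,820,898 host program pages
my_f7 = 445_568_819
my_host_writes_gb = 11_784

ratio = their_f7 / my_f7                          # ~9.2
est_host_writes_gb = ratio * my_host_writes_gb    # ~108,000 GB
print(round(ratio, 1), round(est_host_writes_gb))
```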
 
It seems to me that a programmer would be more likely to use an integral number of bytes to store a variable. Any other arrangement would seem to be convoluted and would complicate the programming.

That said, I've been racking my brain to make sense of attribute 0xAD, but nothing leaps out at me. :-?
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
It seems to me that a programmer would be more likely to use an integral number of bytes to store a variable. Any other arrangement would seem to be convoluted and would complicate the programming.

That said, I've been racking my brain to make sense of attribute 0xAD, but nothing leaps out at me. :-?

Yes, all else being equal, a programmer should prefer to allocate an integer multiple of bytes, or an integer multiple of "words" (a word is often 16 bits or 32 bits or 64 bits, depending on the processor's data bus width) to each variable for the sake of speed & simplicity. But all else isn't always equal... in particular, in some small devices storage for variables might be such a scarce resource that bits must not be wasted. I don't know what kind of storage the MX500 uses and whether it's a scarce resource... perhaps a combination of ram for speed plus occasional copying from ram to NAND for nonvolatility?

Regarding attribute AD (Average Block Erase Count, or ABEC), perhaps the uppermost bits are a rollover count, and the lower 10 bits are the average erases after the most recent rollover. This seems unlikely, but if true it might mean 1500 + 427 = 1927. (1500 = one rollover; 427 = 0x1ab.) It seems unlikely for at least two reasons: (1) it would be simpler just to use all the bits for the count of erases, like 2475; and (2) the 165% ratio of 2475 to 1500 matches the 165 Percent Lifetime Used.

Something else to scratch one's brain about is the 191 in the Current and Worst columns for ABEC and Percent Lifetime Used. My ssd lists 89, which corresponds to the formula 100 - Percent Lifetime Used. (100 - 11 = 89.) Using that formula, 100 - 165 is -65, which would look like 191 if one byte of storage is used: 191 = 256 - 65. This seems like additional evidence in support of the theory that ABEC really does equal 2475, even though 2475 is a very high number of erases during only a few years of operation. Perhaps the rate of block erases rapidly increased after Percent Lifetime Used reached 100%, or after some error event associated with an old ssd beginning to fail.
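A quick check that 191 is exactly what an unsigned one-byte "100 minus Percent Lifetime Used" would produce:

```python
# Normalised value = 100 - Percent Lifetime Used, stored in one unsigned byte.
pct_used = 165
normalised = (100 - pct_used) & 0xFF   # -65 wraps to 256 - 65
print(normalised)                      # 191
```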

EDIT: Thinking about it some more, I suppose I shouldn't have been surprised by the high ABEC count, 2475. Depending on what the ssd was used for, it could have had a very high rate of host writes. We can presume it really did exceed 100% Lifetime, and 165% is in the same order of magnitude as 100%.
 

Diceman_2037

Distinguished
Dec 19, 2011
53
3
18,535
EDIT: Thinking about it some more, I suppose I shouldn't have been surprised by the high ABEC count, 2475. Depending on what the ssd was used for, it could have had a very high rate of host writes. We can presume it really did exceed 100% Lifetime, and 165% is in the same order of magnitude as 100%.

Precisely.

Plus the disk is already demonstrating Erase fails and reallocated blocks.
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
Precisely. Plus the disk is already demonstrating Erase fails, and reallocated blocks.

The ssd's count of reallocated sectors is 0xA (10 decimal), which is tiny compared to the 500GB ssd capacity (assuming the count is meant to be interpreted literally).

We cannot deduce from the screenshot when the sector failures occurred. They might have occurred a long time before the ssd reached 100% Lifetime Used... perhaps due to manufacturing defects or lightning surges.

I had a hard drive that was new in 2008, which developed a few bad sectors within a year of being placed in service. It never developed any more bad sectors, and I retired it in 2019 when I built my current pc using the MX500 ssd.

EDIT:
Note: It's still unclear how many of the 2475 average erases per block were caused by the FTL write bug or by an overly aggressive wear leveling algorithm, rather than by host writes. If we assume Host Program Page Count didn't roll over, comparison to my ssd's Host Program Page Count and my 11,800 GB Total Host Writes implies the host pc wrote about 108,000 GB (my calculation is in a recent message). If we estimate the portion that had been written when the ssd reached 1500 average erases per block, by dividing 108,000 GB by 1.65, we get about 65 TB, which is much less than the 180 TB endurance spec.
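The 65 TB estimate works out as follows (a sketch that inherits the assumption above that Host Program Page Count didn't roll over):

```python
# Estimate the host writes accumulated when the drive crossed 100% Lifetime Used,
# by scaling the ~108,000 GB estimate back from 165% to 100%.
est_total_host_gb = 108_000
lifetime_used = 1.65

host_gb_at_100pct = est_total_host_gb / lifetime_used
print(round(host_gb_at_100pct / 1000, 1), "TB")   # ~65.5 TB, well under the 180 TB spec
```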
 
I had a hard drive that was new in 2008, which developed a few bad sectors within a year of being placed in service. It never developed any more bad sectors, and I retired it in 2019 when I built my current pc using the MX500 ssd.

Hard drives in those days were more likely to be affected by bad media than bad heads. Today it's the reverse. If you start noticing reallocations today, then it's likely to be a sign of a degrading disc head.
 
I purchased an MX500 1TB SSD earlier this year. I haven't used it much, but now I'll be paying special attention to it. :-(

Here are scans of the PCBs:

http://users.on.net/~fzabkar/SSD/Micron/MX500/

The NAND flash is Micron MT29F2T08EMLEEJ4-QA:E (part marking NY133).

The flash controller is a Silicon Motion SM2259H-AC with a YYWW (Year/Week) date code of 2137 (week 37 of 2021).

The SDRAM is Micron MT41K256M16TW-107:P (part marking D9SHD).

https://www.micron.com/products/dram/ddr3-sdram/part-catalog/mt41k256m16tw-107
https://media-www.micron.com/-/medi...dr3l.pdf?rev=c2e67409c8e145f7906967608a95069f

Firmware is M3CR043.

You can decode NAND part numbers here:

https://nand.gq/#/decode

Click the icon in the top right corner to switch between Chinese and English.
 

MWink64

Prominent
Sep 8, 2022
154
42
620
I purchased an MX500 1TB SSD earlier this year. I haven't used it much, but now I'll be paying special attention to it. :-(

Here are scans of the PCBs:

http://users.on.net/~fzabkar/SSD/Micron/MX500/

The NAND flash is Micron MT29F2T08EMLEEJ4-QA:E (part marking NY133).

The flash controller is a Silicon Motion SM2259H-AC with a YYWW (Year/Week) date code of 2137 (week 37 of 2021).

The SDRAM is Micron MT41K256M16TW-107:P (part marking D9SHD).

https://www.micron.com/products/dram/ddr3-sdram/part-catalog/mt41k256m16tw-107
https://media-www.micron.com/-/medi...dr3l.pdf?rev=c2e67409c8e145f7906967608a95069f

Firmware is M3CR043.

You can decode NAND part numbers here:

https://nand.gq/#/decode

Click the icon in the top right corner to switch between Chinese and English.

Well, this is incredibly interesting!

I've been looking for ways to decode some of these part numbers and now I seem to have a way. Most of the results were not surprising, with one big exception. Is DRAM with the D9SHD marking always 512MiB in capacity? If so, it looks like Crucial has started cheaping out even more than I realized. EVERY single MX500 I've seen has D9SHD printed on the DRAM chip(s). The older 1TB models have two of these chips, which would equal 1GiB DRAM, which is what I expected. However, ALL the newer models I've encountered have only one of these chips. I've seen them in 250GB, 500GB, and 2TB drives. That would mean the 250GB drive has more DRAM than expected, the 500GB has the expected amount, and the 1TB+ drives have less than expected. If true, maybe it's time to start lambasting Crucial along with all the other brands (admittedly, nearly all of them) that have done silent downgrades. If I am wrong about what the D9SHD means, please correct me.

On the other hand, the NAND decoder actually seems to bolster my theory that firmware version M3CR023 (and any that can be upgraded to it) have 64-layer NAND, M3CR033 = 96-layer, and M3CR04x = 176-layer. At least, that appears true of all the drives I've encountered. I find it very concerning that there is now a claim that there may even be a QLC variant of the MX500. I was originally going to say that would get them in trouble for false advertising, but I combed over Crucial's site and I can't actually find anywhere that they claim the MX500 uses TLC (only "3D NAND"). There are actually no claims about DRAM at all. I guess I've been going on what has historically been reported about these drives. It's still incredibly scummy. I looked at Samsung's site and they do specify TLC (well, 3-bit MLC) and specific DRAM amounts for the 860/870 EVO.

If it is confirmed that QLC variants of the MX500 exist, I won't be recommending it anymore. I'm already not super happy with Crucial because of my terrible experience with one of the presumably TLC versions of the BX500. It's the only SSD I've been dissatisfied with, despite having used plenty of even cheaper DRAM-less drives from far less reputable companies. It's getting really hard to find a SATA SSD to recommend these days. Even before this realization, I've seen a few comments about elevated failure rates on recent MX500s. I've seen even more comments about failures with the Samsung 870 EVO. I'm not even sure what's going on with the Western Digital Blue. The well regarded Blue 3D seems to have been replaced with the Blue SA510. I haven't been able to find any professional reviews of that model but the user reviews are terrible. The people that aren't complaining that it failed within a few weeks/months are saying that it's much slower than its predecessor. Those three were considered the top SATA drives. What's left now?

As for the original point of this thread, I've got a couple questions for the people who are experiencing the rapid decrease in lifespan remaining.

  1. What is the use case for the drive? Is it a system drive (if so, what OS) or just a storage drive?
  2. What filesystems are being used on the drive (not counting EFI partitions)?
  3. Is the drive being used in a way where the proportion of reads:writes is unusual? For example, is it a drive where a lot of data was written once and then the drive was mostly just read from?
  4. Is the drive left powered but idle for large periods of time?

I've got a little speculation on the issue but I'd like some more data points before I consider sharing it.
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
<snip>
As for the original point of this thread, I've got a couple questions for the people who are experiencing the rapid decrease in lifespan remaining.
  1. What is the use case for the drive? Is it a system drive (if so, what OS) or just a storage drive?
  2. What filesystems are being used on the drive (not counting EFI partitions)?
  3. Is the drive being used in a way where the proportion of reads:writes is unusual? For example, is it a drive where a lot of data was written once and then the drive was mostly just read from?
  4. Is the drive left powered but idle for large periods of time?
I've got a little speculation on the issue but I'd like some more data points before I consider sharing it.

I'll try to answer your questions, even though there are more than a couple:
  1. My MX500 is the Windows 10 system drive C:. My pc also has two internal hard drives that store most of my data and some of my installed programs.
  2. The file system of the C: volume is NTFS.
  3. My pc doesn't write much to the ssd, approximately 0.1 MBytes/second on average. (I think most of the writing is by Windows logging.) During the most recent 51 hours (following the most recent pc restart) the pc's writes averaged 0.072 MB/s. The pc reads roughly 4 billion sectors every 5 weeks, which is about 0.7 MBytes/second if I didn't make an arithmetic mistake and assuming a sector is 512 bytes (a quick check is sketched just below this list).
  4. The pc is powered on essentially all the time... I typically turn it off only for brief hardware maintenance. Every few weeks I also "sleep" the pc for a few seconds, because that power cycles the ssd, which seems to restrain the ssd's excessive FTL writing. I don't know what you mean by "idle," since Windows logs to C: incessantly, I usually have dozens of Firefox tabs open, etc.
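Here is the quick arithmetic check referenced in item 3 (a sketch assuming 512-byte sectors):

```python
# Average read rate from "roughly 4 billion sectors every 5 weeks".
sectors = 4_000_000_000
sector_bytes = 512
seconds = 5 * 7 * 24 * 3600      # five weeks in seconds

mb_per_s = sectors * sector_bytes / seconds / 1_000_000
print(round(mb_per_s, 2))        # ~0.68 MB/s
```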

Note: My ssd no longer experiences rapid decrease of Remaining Life, due to the nearly nonstop (19.5 minutes of every 20 minutes) ssd selftests that my pc runs. Remaining Life reached 92% on 3/13/2020 and 88% on 11/12/2022, a decrease of 4% over about 2.7 years. That's not a rapid decrease in terms of chronological time, but I expect it will fall short of the 180 TB endurance spec since the 4% RL decrease corresponds to about 5.5 TB written by the pc, which extrapolates to about 140 TB. Before I began the selftests regime, it appeared that the rate of decrease of RL, relative to bytes written by the pc, was accelerating... WAF was roughly 50 during the weeks before I began the selftests regime. WAF has averaged 3.3 under the selftests regime.
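The ~140 TB extrapolation in the note above works out roughly like this (a sketch using the figures quoted there):

```python
# Extrapolate total host writes at 0% Remaining Life from the 92% -> 88% interval.
rl_drop_pct = 92 - 88            # 4 percentage points of Remaining Life consumed
host_tb_in_interval = 5.5        # ~5.5 TB written by the pc over that interval

tb_per_pct = host_tb_in_interval / rl_drop_pct
print(round(tb_per_pct * 100))   # ~138 TB projected over the full 100% of life
```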
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
<snip>
I've got a little speculation on the issue but I'd like some more data points before I consider sharing it.

Are you still planning to share your speculation? Considering how much time has elapsed since you requested more data points, it seems unlikely you'll receive additional data in the future. Mine appears to be the only response to your four questions.
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
3rd Annual Report of Effect of Nearly Nonstop Selftests on my Crucial MX500 SSD

EXECUTIVE SUMMARY
The selftests regime has been running for three years. It continues to be effective at reducing the SSD's buggy Write Amplification, and there are no signs of any negative consequences (other than slightly increased power consumption due to the SSD not entering the "idle" low power mode).

During the three years of the selftests regime, WAF has been 3.25.

During these three years, Remaining Life has decreased by 4.87% (from 92.2% to 87.33%). Extrapolating from these three years, Remaining Lifetime (RLT) is now 54 years. Extrapolating instead from the 5 weeks prior to the start of the selftests regime (1/15/2020 to 2/22/2020), RLT was 5 years on 2/22/2020, and would presumably now be about 2 years. (Death soon after the 5 year warranty expires. Coincidence?)
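The 54-year extrapolation works out like this (a sketch using the figures above):

```python
# Remaining Lifetime (RLT) extrapolated from the three years of the selftests regime.
rl_now = 87.33
rl_drop = 92.2 - 87.33           # 4.87 percentage points over 3 years
years = 3

rlt_years = rl_now / (rl_drop / years)
print(round(rlt_years))          # ~54 years at the current rate of decrease
```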

I still suspect there's a correlation between excessive Write Amplification and the amount of time the SSD has been powered on. Every few weeks I paste the recent daily log data into my spreadsheet, and if I observe that the daily Write Amplification Factor has exceeded 4 for a few days, I power cycle the SSD by sleeping the pc for a few seconds. I also shut down the pc for a few seconds when a Windows update requires a restart or shutdown. But I haven't verified or debunked the correlation by analyzing the log data. I would need to study the statistical functions available in the spreadsheet or learn a statistics software package, and I haven't had a strong incentive to spend time on it due to the satisfactorily long RLT.

LOG OF REMAINING LIFE

Remaining Life %   Date         Total Host Writes (GB)   Host Writes (GB) Per Row
100                07/28/19     0
99                 08/31/19     1772                     1772
98                 not logged   not logged               not logged
97                 not logged   not logged               not logged
96                 not logged   not logged               not logged
95                 12/23/19     5782
94                 01/15/20     6172                     390
93                 02/04/20     6310                     138
92                 03/13/20     6647                     337*
91                 10/19/20     8178                     1531
90                 09/16/21     9395                     1217
89                 05/20/22     10532                    1137
88                 11/12/22     12082                    1550

Comparing the "Host Writes Per Row" logged after 3/13/2020 (during which Remaining Life decreased from 92% to 88%) to the 138 GB logged on 2/04/2020 (when Remaining Life decreased from 94% to 93%) shows that the selftests regime has allowed about ten times as much host data to be written per percent of SSD life used. [1531GB + 1217GB + 1137GB + 1550GB] / [92% - 88%] = 1359 GB per Percent of Life Used.

* The row logged on 3/13/2020 corresponds to the period of time in which Remaining Life decreased from 93% to 92%, which included the last three weeks prior to the start of the selftests regime plus the first three weeks of the selftests regime. Due to that mix, the 337 GB written during this period isn't useful data, so it was neglected in the previous paragraph's calculation of "ten times as much host data."
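The per-percent comparison above, as a sketch:

```python
# Host GB written per 1% of life used, before vs. under the selftests regime.
under_selftests = (1531 + 1217 + 1137 + 1550) / (92 - 88)   # GB per percent, 92% -> 88%
before_selftests = 138                                      # GB for the 94% -> 93% drop

print(round(under_selftests))                     # ~1359 GB per percent
print(round(under_selftests / before_selftests))  # ~10x more host data per percent
```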

HISTORICAL BACKGROUND
Early in 2020, my MX500 SSD's Write Amplification was very excessive after about 5 months of use and was growing worse. During the 5 weeks prior to the start of the selftests regime, the Write Amplification Factor (WAF) was 45.63. Analysis of SMART attribute F8 (logged every second for a few hours) revealed excess writing by the SSD controller many times per day in brief bursts, each burst approximately 37,000 NAND pages (about 1 GByte) or a small integer times that amount. Each burst lasted about 5 seconds, or a small integer times about 5 seconds. The write bursts correlated perfectly with a well-known MX500 bug: Current Pending Sector Count occasionally briefly becomes 1 (which triggers warning alerts for users who run software that monitors SMART data).

I guessed that an SSD selftest, which reads SSD blocks to check for issues, might have higher runtime priority than the SSD firmware routine responsible for the excessive writes. Testing showed that guess is correct, and benchmarking showed selftests don't reduce read or write performance. So I wrote a .BAT script to automatically run SSD selftests nearly nonstop, to drastically reduce the runtime available to the SSD's buggy firmware routine. The selftests regime was begun in late February 2020 and has been running ever since (whenever the pc is powered on, which is 24 hours per day except for occasional maintenance or power outages). I also wrote a .BAT script that logs SMART data in comma-delimited format daily and every 2 hours, and have copy/pasted all the daily data into a spreadsheet to analyze it.

In the early days I experimented to try to find the optimal duty cycle of the selftests, to optimize the health of the SSD. Nonstop selftests reduced the SSD's Write Amplification the most, to less than 2. Running selftests 19.5 of every 20 minutes reduced it more than 19 of every 20 minutes did. I settled on 19.5 of every 20 minutes. The 30-second pauses between selftests are available for the SSD firmware to run any low priority maintenance tasks essential for SSD health. Examination of logged data showed that most pauses contain no write bursts, so it seems a good bet that pausing for 30 seconds every 20 minutes has been providing sufficient runtime for essential low priority tasks, if there are any.
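For readers who want to approximate the regime, here is a rough sketch of the schedule described above. The original is a Windows .BAT script that isn't reproduced here; this version assumes smartmontools is installed, guesses that an extended selftest is what gets started (the post doesn't say which test type), and uses a placeholder device path:

```python
# Rough sketch of the selftest schedule described above: 19.5 minutes of every 20.
# Assumptions: smartmontools is available, and the SSD is /dev/sda (placeholder).
import subprocess
import time

DEVICE = "/dev/sda"   # placeholder device path; adjust for your system

while True:
    subprocess.run(["smartctl", "-t", "long", DEVICE])   # start an extended selftest
    time.sleep(19.5 * 60)                                # let it run for 19.5 minutes
    subprocess.run(["smartctl", "-X", DEVICE])           # abort the running selftest
    time.sleep(30)                                       # 30-second pause for any SSD housekeeping
```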
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
Did you ever get any feedback from Crucial/Micron?

Yes, my interaction with Crucial's tech support was a little over three years ago. They agreed the write amplification was excessive and agreed to exchange the SSD, but they said they would ship the replacement only after they received my SSD. That would have caused a serious problem since my computer would have become inoperable with no system drive for an unknown period of time, possibly weeks. Also, they didn't promise the replacement would be new, and they didn't say they understood the cause of the excessive write amplification, so it seemed likely the replacement would have the same bug and might have been a refurbished drive in worse condition. Since the selftests regime mitigated the problem, I decided not to replace it.
 
It must be frustrating that, despite all your investigative work, Crucial appears not to have done anything about the problem.

On a side note, I notice that Samsung's 990 Pro is/was affected by excessive wear. Samsung claims to have addressed their problem with a firmware update. I'm wondering whether the 990 Pro also had a WAF bug.