Question: Crucial MX500 500GB SATA SSD - Remaining Life decreasing fast despite only a few bytes being written to it?


Lucretia19

Reputable
Feb 5, 2020
The Remaining Life (RL) of my Crucial MX500 ssd has been decreasing rapidly, even though the pc doesn't write much to it. Below is the log I began keeping after I noticed RL reached 95% after about 6 months of use.

Assuming RL truly depends on bytes written, the decrease in RL is accelerating and something is very wrong. The latest decrease in RL, from 94% to 93%, occurred after writing only 138 GB in 20 days.

(Note 1: After RL reached 95%, I took some steps to reduce "unnecessary" writes to the ssd by moving some frequently written files to a hard drive, for example the Firefox profile folder. That's why only 528 GB have been written to the ssd since Dec 23rd, even though the pc is set to Never Sleep and is always powered on. Note 2: After the pc and ssd were about 2 months old, around September, I changed the pc's power profile so it would Never Sleep. Note 3: The ssd still has a lot of free space; only 111 GB of its 500 GB capacity is occupied. Note 4: Three different software utilities agree on the numbers: Crucial's Storage Executive, HWiNFO64, and CrystalDiskInfo. Note 5: Storage Executive also shows that Total Bytes Written isn't much greater than Total Host Writes, implying write amplification hasn't been a significant factor.)

My understanding is that Remaining Life is supposed to depend on bytes written, but it looks more like the drive reports a value that depends mainly on its powered-on hours. Can someone explain what's happening? Am I misinterpreting the meaning of Remaining Life? Isn't it essentially a synonym for endurance?


Crucial MX500 500GB SSD in desktop pc since summer 2019

Date       | Remaining Life | Total Host Writes (GB) | Host Writes (GB) Since Previous Drop
12/23/2019 | 95%            | 5,782                  |
01/15/2020 | 94%            | 6,172                  | 390
02/04/2020 | 93%            | 6,310                  | 138
 

Lucretia19

Reputable
Feb 5, 2020
it does indeed appear this is resolved in the MX500's with the new controller/firmware.

[Deleted Diceman's CrystalDiskInfo screenshot. You can see it in his post above.]

There haven't been any significant increases to F8 that weren't linked to F7s.

It's unclear what you meant by "this" where you wrote "it does indeed appear this is resolved [...]" Perhaps you meant the "excessive F8 writing" WAF bug. But you might have meant the "low Power On Hours" bug that we've also discussed here. Or maybe you meant both. You highlighted POH in your screenshot, but you wrote about F8. (People often use pronouns ambiguously without realizing it.)

I don't think your CrystalDiskInfo data is solid evidence that the new controller/firmware has fixed the excessive F8 writing bug. Your ssd's F6, F7 and ABEC are still very low, indicating the host pc hasn't yet written much to the ssd:

F6 = 2,042,516,406 sectors. Since each sector is 512 bytes, it means 974 GB written by the host.​
F7 = 27,672,197 NAND pages. Since each NAND page is approximately 37,000 bytes, it means approximately 954 GB written by the host.​
ABEC = 10 (Average Block Erase Count).​
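
As a quick sketch, here's the arithmetic above in Python (assuming 512-byte sectors and the roughly 37,000-bytes-per-NAND-page figure I measured on my own 500GB drive; the variable names are just mine):

Code:
# Rough conversion of the raw SMART values quoted above.
F6_SECTORS = 2_042_516_406   # F6: total LBAs (512-byte sectors) written by the host
F7_PAGES   = 27_672_197      # F7: host program page count (~37,000 bytes per page, measured on my drive)

host_gb_from_f6 = F6_SECTORS * 512 / 2**30     # ~974 GB
host_gb_from_f7 = F7_PAGES * 37_000 / 2**30    # ~954 GB

print(f"F6 implies {host_gb_from_f6:,.0f} GB written by the host")
print(f"F7 implies {host_gb_from_f7:,.0f} GB written by the host")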

When I first noticed the problem on my 500GB MX500 about 2 years ago, my pc had written more than 6 times as much to the ssd as yours has. The first time that I logged F7 was on 1/15/2020 (about 3 weeks after I first noticed the WAF problem), and F7 was 214,422,794, F6 was 6172 GB, and ABEC was 90. Your ssd's ratio of F7 to ABEC is similar to mine before I began running the selftests mitigation.

On 8/31/2019 when my ssd was about a month old, its ABEC reached 15 (which corresponds to 1% of lifetime used). F6 indicated the host pc had written 1,772 GB. This ABEC=15 data is the closest data I have to your ABEC=10 data, which is why it's very relevant. My ssd's ratio of F6 to ABEC when its ABEC reached 15 is similar to your ssd's ratio: 1772GB/15 versus 974GB/10.

Anecdotal evidence suggests it takes a while for the "excessive F8 writing" WAF bug to become noticeable, and then the magnitude of the problem accelerates. So, please keep logging your ssd's SMART data at least occasionally, and occasionally post it here so we can see whether your ssd eventually develops the problem.

At least one person in this forum thread suggested the problem might not manifest on every unit, and might be triggered on those units by a bad event that corrupts a database maintained by the firmware: perhaps a power-off before a clean pc shutdown, or perhaps a power surge. If the chance of having a triggering event isn't zero, the chance would increase with time.

One thing for potential buyers to keep in mind, while it's still in doubt whether the new version fixes the bug, is that the new firmware might also prevent selftests from mitigating the bug. Presumably the selftests work because the buggy routine runs at a lower priority than the selftest routine, and the new firmware might reverse the priorities or make them equal.

It does look like Crucial's new controller/firmware may have fixed the "low Power On Hours" bug. Your POH is 1506. My POH was only 883 on 1/15/2020, which was about 4,000 hours (5.5 months) after the ssd was installed, and the pc had been powered on nearly 24 hours per day.
 

Diceman_2037

Distinguished
Dec 19, 2011
It's fixed; the drive has not logged a single C5, and the F7 to F8 ratio has held consistent.

[Two CrystalDiskInfo screenshots]
 

Lucretia19

Reputable
Feb 5, 2020
It's fixed; the drive has not logged a single C5, and the F7 to F8 ratio has held consistent.

Your new ssd, which has the M3CR043 firmware, has only 2701 power-on hours in the screencapture you posted here on 2/07/2022. That might not be long enough for the WAF problem to begin to cause trouble. But let's hope the new firmware really fixed the problem.

Someone posted here the hypothesis that the problem only begins in drives that experience some bad event, such as a power surge or loss of power that corrupts a control table in the ssd. If true, that could explain why the problem doesn't begin soon after installation. (But other explanations are possible too.)

I don't believe there's a way to collect stats from a representative sample of MX500 owners. The people whose MX500 has a problem are the people most motivated to describe their experience here or in other forums, and thus aren't a representative sample unless most MX500s have the problem. If the new firmware doesn't really fix the problem, eventually someone will recognize something is wrong, and will hopefully be motivated enough to find this forum thread and write about it.
 

Lucretia19

Reputable
Feb 5, 2020
It's been about 2 years since I began running the selftests regime on my MX500, beginning in late February 2020. Here's an update.

First, a quick summary. Running the selftests, it took 18 months -- from 3/13/2020 to 9/16/2021 -- for the ssd to lose 2% of its remaining life, during which the host pc wrote about 2.7 TB to the ssd. Then, during the 5.5 months from 9/16/2021 to now, the ssd lost approximately 0.5% of remaining life, during which the host pc wrote about 0.45 TB. Remaining life reached 90% on 9/16/2021, and at this rate it will be about 65 more years before remaining life reaches zero. I don't expect to need the ssd that many years, so it's tempting to move some frequently written files back to the ssd from the hard drive, both for performance and to simplify the replacement procedure when the hard drive fails.

Two tables are below. Table 1 shows some info logged each time the ssd's Remaining Life decreased by a percent. (Three of Table 1's rows are missing their dates and host-writes values, from when the ssd was fairly new and I had not yet acquired the habit of logging ssd data.) Table 2 shows a lot of info logged each time the Average Block Erase Count increased by one, for the last 5.5 months (since 9/16/2021).

There's one item of concern: Table 2 shows that WAF and the ratio of LifeUsed to HostWrites -- two key measures of how well the ssd is doing -- have been getting worse during the last couple of months. In previous posts I wrote a couple of times about the hypothesis that the WAF bug worsens when the ssd is allowed to run many days without being power-cycled (even with the selftests running, because the 30 second pause between 19.5 minute selftests allows some time for the buggy routine to execute). If this hypothesis is true, it could explain the worsening LifeUsed/HostWrites, because the overall success has led me to become less diligent about periodically power-cycling the ssd (by manually putting the pc to sleep for a few seconds). I used to power-cycle the ssd about once per week when I saw daily WAF rise, but during the last few months I've let it go about 2 or 3 weeks between power cycles.

In Table 2 you can deduce a rough correlation between the Power Cycle Count column and the LifeUsed/HostWrites column, which supports the hypothesis. To test the hypothesis I could alternate periods of power-cycling diligence with periods of laxity, to see whether the correlation is strong. I already have two years of daily log data that could be analyzed to check the correlation, if I could find the time to analyze it (and to figure out which functions in LibreOffice Calc or some statistics software would make it easy to do). (A sketch of that calculation appears after Table 2 below.)
TABLE 1. Data logged when Remaining Life decreased by a percent, 8/31/2019 to 9/16/2021

Date       | Attribute 173 (ABEC) = 15 x (100-RL) | Remaining Life % | Total Host Writes (GB)
08/31/2019 | 15  | 99 | 1,772
           | 30  | 98 |
           | 45  | 97 |
           | 60  | 96 |
12/23/2019 | 75  | 95 | 5,782
01/15/2020 | 90  | 94 | 6,172
02/04/2020 | 105 | 93 | 6,310
03/13/2020 | 120 | 92 | 6,647
10/19/2020 | 135 | 91 | 8,178
09/16/2021 | 150 | 90 | 9,395

TABLE 2. Data logged when Average Block Erase Count increased, from 9/16/2021 to 2/15/2022
(The last six columns -- ΔLifeUsed/ΔHostWrites in %/TB, ΔF7, ΔF8, WAF = 1 + ΔF8/ΔF7, days, and NAND pages written per ABEC increment -- are computed over the interval since the previous row.)

Date       | Time  | Host Writes (GB) | F7          | F8            | Power On Hours | ABEC | Power Cycles | %/TB | ΔF7       | ΔF8        | WAF  | Days | NAND pages
09/16/2021 | 12:14 | 9,395 | 345,207,056 | 1,695,530,220 | 14,452 | 150 | 189 | 0.91 | 3,405,618 | 11,007,662 | 4.23 | 27.5 | 14,413,280
10/12/2021 | 20:43 | 9,470 | 348,580,443 | 1,706,549,894 | 15,074 | 151 | 191 | 0.91 | 3,373,387 | 11,019,674 | 4.27 | 26.3 | 14,393,061
11/04/2021 | 08:22 | 9,535 | 351,502,829 | 1,718,348,100 | 15,606 | 152 | 192 | 1.05 | 2,922,386 | 11,798,206 | 5.04 | 22.5 | 14,720,592
11/21/2021 | 11:22 | 9,587 | 353,847,516 | 1,731,298,811 | 16,011 | 153 | 193 | 1.31 | 2,344,687 | 12,950,711 | 6.52 | 17.1 | 15,295,398
12/11/2021 | 18:28 | 9,654 | 356,828,550 | 1,742,861,459 | 16,491 | 154 | 195 | 1.02 | 2,981,034 | 11,562,648 | 4.88 | 20.3 | 14,543,682
12/28/2021 | 00:28 | 9,700 | 358,939,286 | 1,756,221,383 | 16,875 | 155 | 196 | 1.48 | 2,110,736 | 13,359,924 | 7.33 | 16.2 | 15,470,660
01/16/2022 | 22:28 | 9,761 | 361,697,086 | 1,768,359,822 | 17,345 | 156 | 198 | 1.12 | 2,757,800 | 12,138,439 | 5.40 | 19.9 | 14,896,239
01/31/2022 | 06:52 | 9,804 | 363,669,473 | 1,781,983,176 | 17,684 | 157 | 199 | 1.59 | 1,972,387 | 13,623,354 | 7.91 | 14.3 | 15,595,741
02/15/2022 | 18:52 | 9,846 | 365,566,422 | 1,795,898,857 | 18,051 | 158 | 199 | 1.63 | 1,896,949 | 13,915,681 | 8.34 | 15.5 | 15,812,630
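
For anyone who wants to reproduce the derived columns in Table 2 (or try the correlation check mentioned above), here's a minimal Python sketch using the last two rows of Table 2. The variable names are just mine, not from any logging tool:

Code:
# Recompute the derived columns of Table 2 from two consecutive snapshots
# (the 01/31/2022 and 02/15/2022 rows above).
rows = [
    # (host_GB, F7,          F8,            ABEC, power_cycles)
    (9_804, 363_669_473, 1_781_983_176, 157, 199),   # 01/31/2022
    (9_846, 365_566_422, 1_795_898_857, 158, 199),   # 02/15/2022
]
(gb0, f7_0, f8_0, abec0, pc0), (gb1, f7_1, f8_1, abec1, pc1) = rows

d_f7 = f7_1 - f7_0                     # NAND pages written for the host
d_f8 = f8_1 - f8_0                     # NAND pages written by the FTL (background activity)
waf = 1 + d_f8 / d_f7                  # write amplification factor for the interval

d_life_used = (abec1 - abec0) / 15     # percent of life used (15 ABEC = 1%)
d_host_tb = (gb1 - gb0) / 1024         # host TB written in the interval
life_per_tb = d_life_used / d_host_tb  # the %/TB column

print(f"ΔF7={d_f7:,}  ΔF8={d_f8:,}  WAF={waf:.2f}  %/TB={life_per_tb:.2f}")

# Over the full daily log, statistics.correlation() (Python 3.10+) applied to the
# power-cycle intervals and the %/TB values would quantify the power-cycling
# hypothesis discussed above.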
 

Diceman_2037

Distinguished
Dec 19, 2011
Your new ssd, which has the M3CR043 firmware, has only 2701 power-on hours in the screencapture you posted here on 2/07/2022. That might not be long enough for the WAF problem to begin to cause trouble. But let's hope the new firmware really fixed the problem.

It's plenty of time for it to manifest. I've also now seen evidence that the issue has been fixed since M3CR033 too.

diskinfo64_zoc0bwczxq-png.237796


diskinfo64_7txkzzlelk-png.237797
 

Lucretia19

Reputable
Feb 5, 2020
It's plenty of time for it to manifest. I've also now seen evidence that the issue has been fixed since M3CR033 too.

Maybe we'll have to agree to disagree. On your side now is data from your two additional ssds that have the 033 firmware. They have about 5000 power-on hours and don't (yet) show the WAF problem. So that's even more reason to be optimistic that the problem has been fixed on new MX500s.

On the other hand, your pc writes about twice as much to each of your ssds as mine does, and examination of data from several MX500s suggests the problem is much more noticeable when the pc doesn't write much to the ssd. There's also the hypothesis mentioned earlier, that the onset of the WAF bug is triggered by an event such as a power surge or power loss, so your experiences with the newer firmware might mean only that your ssds haven't experienced the trigger event.

It would be interesting to compare the average power consumption of the MX500s that have the newer firmware to the power consumption of the MX500s that have the older firmware. It's pretty clear from your data that the Power On Hours count of MX500s that have the newer firmware increases at a much faster rate than on the older MX500s, but it's unclear why. It might mean that the newer firmware doesn't allow the ssd to enter the low power mode, which consumes (wastes?) more energy. Or it might mean that time spent in the low power mode is simply now counted as On. These two possibilities could presumably be distinguished by their different power consumption. If the former, and assuming the WAF bug has been fixed, one can speculate that Crucial deduced that the WAF bug is related to low power mode, and avoiding low power mode was a convenient solution. (The selftests regime also avoids low power mode, and perhaps that's related to why the selftests mitigate the bug.)

Another thing that's unclear is whether Crucial changed the MX500 hardware too, or only revised the firmware. Crucial's website still shows no upgrade path from MX500s that have firmware 023, and the only upgrade path from MX500s that have firmware older than 023 is to 023. These are hints that the newer firmware requires different hardware. It's also a hint that the bug in the old MX500s can't be solved by newer firmware and is due to a hardware design flaw.
 
Apr 14, 2022
I wonder if mine has this firmware issue too. 16% remaining for 19 TB written


Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.13.0-39-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT1000MX500SSD1
Serial Number:    1748E10778DC
LU WWN Device Id: 5 00a075 1e10778dc
Firmware Version: M3CR010
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 14 16:26:29 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  30) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x0031)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22248
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       128
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   016   016   000    Old_age   Always       -       1273
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       58
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       38
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   067   044   000    Old_age   Always       -       33 (Min/Max 0/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       4
202 Percent_Lifetime_Remain 0x0030   016   016   001    Old_age   Offline      -       84
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       42522657351
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       11369347341
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       35128149298

SMART Error Log Version: 1
ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17711         -
# 2  Short offline       Completed without error       00%     10278         -
# 3  Short offline       Completed without error       00%     10055         -
# 4  Extended offline    Completed without error       00%      6857         -
# 5  Short offline       Completed without error       00%      6856         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Another thing that's unclear is whether Crucial changed the MX500 hardware too, or only revised the firmware. Crucial's website still shows no upgrade path from MX500s that have firmware 023, and the only upgrade path from MX500s that have firmware older than 023 is to 023. These are hints that the newer firmware requires different hardware. It's also a hint that the bug in the old MX500s can't be solved by newer firmware and is due to a hardware design flaw.

https://forums.tomshardware.com/thr...g-in-storage-executive.3757723/#post-22658376

The 023 firmware contains a reference to Silicon Motion's SM2258AA controller whereas the 033 firmware refers to SM2259AA. They also differ in the NAND flash chips (512Gbit versus 2Tbit).
 

chrysalis

Distinguished
Aug 15, 2003
I finally replaced mine with an 870 EVO, which (guess what) has its own issues lol. Luckily my 870 EVO is ok so far, but it's gaining a bad rep on the internet.

In terms of my two MX500s, I have an RMA approved for one of them and am fighting them for the second. They are really resisting. Both have unreadable data now.
 

Pextaxmx

Reputable
Jun 15, 2020
I finally replaced mine with an 870 EVO, which (guess what) has its own issues lol. Luckily my 870 EVO is ok so far, but it's gaining a bad rep on the internet.

In terms of my two MX500s, I have an RMA approved for one of them and am fighting them for the second. They are really resisting. Both have unreadable data now.
The 840EVO was a disaster; the 850EVO (2 versions) and 860EVO turned out to be rock solid. Then the 870EVO's early batches didn't look promising, even though Samsung is no doubt capable of making it robust. Probably manufacturers don't care about the quality of their SATA products anymore. We might find ourselves having only Chinese no-name brand SATA drives to choose from in the near future.

(If you find a lightly used PM883 drive with FW version HXT7904Q (a must... earlier versions are buggy), which is a datacenter-grade 64-layer TLC SATA drive - large OP (higher TBW), PLP, binned higher-quality NAND chips - I would grab it as long as the price is right. It's the most robust TLC SATA drive you can get... the SM series (MLC) is too expensive $$$.)
 

Lucretia19

Reputable
Feb 5, 2020
I wonder if mine has this firmware issue too. 16% remaining for 19 TB written
Code:
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       42522657351
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       11369347341
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       35128149298

Your attribute 246 "Total LBAs Written" implies about 19.8 TB written by the host pc, as you noted. However, that amount is inconsistent with the amount implied by attribute 247 "Host Program Page Count." My experiments indicate each NAND page written by the host pc corresponds to about 37,000 bytes. (On my 500GB MX500.) Multiplying 37,000 bytes/page x 11369347341 pages is about 382 TB written by the host pc (unless I erred). That's a LOT more than 19.8 TB.

Your ssd is 1TB, not 500GB, so maybe 37,000 bytes per NAND page is a bad approximation. But even in that case, 19.8 TB looks like a very incorrect value.

Your write amplification factor (WAF) is about 4, which isn't horrible. This is another indication that the 19.8 TB value is very wrong. The formula for WAF is:
1 + (FTL_Program_Page_Count / Host_Program_Page_Count)
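
Here's that arithmetic as a quick Python sketch, using your three raw values. (The 37,000-bytes-per-page figure comes from my 500GB drive and may not hold exactly for a 1TB model, so treat it as an assumption.)

Code:
# Cross-check of the three raw SMART values from the smartctl output above.
LBAS_WRITTEN = 42_522_657_351   # attribute 246, Total_LBAs_Written (512-byte LBAs)
HOST_PAGES   = 11_369_347_341   # attribute 247, Host_Program_Page_Count
FTL_PAGES    = 35_128_149_298   # attribute 248, FTL_Program_Page_Count

tb_from_246 = LBAS_WRITTEN * 512 / 2**40     # ~19.8 TB
tb_from_247 = HOST_PAGES * 37_000 / 2**40    # ~382 TB -- wildly inconsistent with 246
waf = 1 + FTL_PAGES / HOST_PAGES             # ~4.1, not horrible

print(f"246 implies {tb_from_246:.1f} TB, 247 implies {tb_from_247:.1f} TB, WAF is about {waf:.1f}")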

My tentative conclusion is that your MX500 has NOT been killed by the WAF bug. If your pc writes to the ssd at a high rate, the excess NAND writes caused by the bug would be only a relatively small fraction of the total FTL NAND writes.

Perhaps you've discovered a bug in attribute 246? Are you certain that you haven't accidentally truncated the lengths of the lines of the SMARTCTL output?

Do you have any other ways to estimate the total bytes written by the host to the ssd?
 
Jul 27, 2022
It's been about 2 years since I began running the selftests regime on my MX500, beginning in late February 2020. Here's an update.
-snip-
I'm extremely late to the party and I'm not big on disk science and coding, but I want to ask a question.
I have the 023 firmware with the 58H controller; will your self-test routine help me with decreasing health and a constant 197 0<->1 error?
If yes, can I have your .bat file for this regime?
 

Lucretia19

Reputable
Feb 5, 2020
192
14
5,245
I have the 023 firmware with the 58H controller; will your self-test routine help me with decreasing health and a constant 197 0<->1 error?
If yes, can I have your .bat file for this regime?

Yes, my firmware is 023 too. I believe you can find my .bat files already posted in this thread... maybe during the first half of 2020? If memory serves, the earliest one I posted was a simple one that doesn't log any status and doesn't try for precise timing, but will do the selftests job well. You'll also need to download the free Smartmontools package available elsewhere, because the .bat file calls the SMARTCTL.exe utility included in the package.

You'll probably need to edit some of the lines in the .bat to match your system's setup. Folder locations, for example.

You can set Windows Task Scheduler to run the .bat every time Windows starts. Note that SMARTCTL.exe requires administrator privileges, so check that box in the Task Scheduler task definition.
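
In case it helps while you look for the old .bat, here's a rough sketch of the regime's timing in Python. This is NOT my actual script; the smartctl path and device name are placeholders you'd adjust for your system, and it needs to run with administrator privileges:

Code:
# Sketch of the selftests regime described in this thread: start an extended
# selftest, let it run about 19.5 minutes, abort it, pause 30 seconds, repeat.
import subprocess, time

SMARTCTL = r"C:\Program Files\smartmontools\bin\smartctl.exe"  # adjust for your install
DEVICE   = "/dev/sda"   # smartmontools maps Windows disks to /dev/sdX; adjust as needed

while True:
    subprocess.run([SMARTCTL, "-t", "long", DEVICE])  # start an extended selftest
    time.sleep(19.5 * 60)                             # let it run ~19.5 minutes
    subprocess.run([SMARTCTL, "-X", DEVICE])          # abort the selftest
    time.sleep(30)                                    # 30-second idle window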

How long ago did you purchase your MX500 ssd? How much Remaining Life does it have? How many bytes or NAND pages have been written to it by the host pc?

I'm unsure what you mean by "constant 197 error." I'll assume that by "constant" you mean frequent... many times per day. While that value is 1, it indicates the ssd's controller is writing a large burst of data to the ssd: a multiple of approximately 37,000 NAND pages, and each 37,000 NAND pages is about 1 gigabyte. (It's a good bet that it's moving that data around in the ssd, in other words reading as much as it's writing in order to do the moving.) That's the buggy behavior that wastes some of the ssd's remaining life. It wastes a lot, relatively speaking, unless the pc writes to the ssd at a much higher average rate than the ssd controller's buggy write rate.

The selftests don't entirely eliminate those write bursts, but their frequency becomes MUCH less.

The ssd selftests regime continues to work well for me. My MX500 Remaining Life has decreased only about 3.5% since I began the nearly nonstop selftests in late February 2020. My spreadsheet log predicts the ssd has 67 years remaining (if the pc's average write rate remains the same as it's averaged the last couple of years).

Sorry I didn't see your post earlier. The incoming email server at Microsoft stored the email from Tomshardware in my junk folder, and I didn't notice it until today.
 

worstalentscout

Distinguished
Nov 1, 2016
Yes, my firmware is 023 too. I believe you can find my .bat files already posted in this thread... maybe during the first half of 2020?
-snip-


i don't think you need to worry about your SSD unless you're planning to use it for over 5-6 years.......

recently i saw a Youtube video that was posted a few years back...............several 128GB SSDs from Transcend, PNY and other minor brands were tested............the worst performing SSD ''died'' after writing 100TB............the last SSD standing wrote 1000TB...............so your 500GB SSD should be okay, especially since it's a better brand too..........i'm also using the MX500 (1 x 500GB + 2 x 250GB)
 

Lucretia19

Reputable
Feb 5, 2020
worstalentscout wrote on 8/08/2022:
"i don't think you need to worry about your SSD unless you're planning to use it for over 5-6 years......."
-snip-

But I DO want to be able to use my drives longer than 5-6 years, and my ssd is supposed to last much longer than that because my pc doesn't write a lot to it. On average during the last few weeks, the pc wrote 0.088 MBytes per second (according to HWINFO64) to the ssd, which is roughly 3TB per year. During the last 2.5 years (3/01/2020 to today) the pc wrote 4,680 GB to the ssd, which is roughly 2TB per year.

Also, there was some evidence that the problem was worsening, before I tamed it using the ssd selftests regime:
Date       | Remaining Life % | Total Host Writes (GB) | Host Writes (GB) per recent 1% decrease
08/31/2019 | 99 | 1,772  | 1,772
12/23/2019 | 95 | 5,782  |
01/15/2020 | 94 | 6,172  | 390
02/04/2020 | 93 | 6,310  | 138
03/13/2020 | 92 | 6,647  | 337
10/19/2020 | 91 | 8,178  | 1,531
09/16/2021 | 90 | 9,395  | 1,217
05/20/2022 | 89 | 10,532 | 1,137
During its first 5 months of service, the ssd lost 5% of Remaining Life while the pc wrote 5782 GB to it, an average of 1,156 GB per percent of Remaining Life. I began logging each 1% decrease in late December 2019 when I noticed Remaining Life was decreasing faster than I expected (based on the ssd's durability rating). From 12/23/2019 to 1/15/2020 Remaining Life decreased from 95% to 94% while the pc wrote 390 GBytes, and from 1/15/2020 to 2/04/2020 Remaining Life decreased from 94% to 93% while the pc wrote 138 GBytes. (I started the selftests regime in late February 2020, and the log shows the host has been writing much more per 1% decrease since then, similar to the rate when the ssd was very young.) Those two values, 390 and 138 GBytes, shown in the table above, suggest the problem was worsening.
 

worstalentscout

Distinguished
Nov 1, 2016
But I DO want to be able to use my drives longer than 5-6 years, and my ssd is supposed to last much longer than that because my pc doesn't write a lot to it.
-snip-


i too wanted to use my SSD as long as possible................but i think after 5 years, reliability will be quite suspect............my boot drive (MX500 250GB) is now at 92% after 10 months.............i think if you have a separate boot drive + storage drive, it's better as in case of a failure - you still have storage drive keeping your data intact..........

what i do is this............i have a 250GB boot drive and a 250GB clone (also MX500) of it and i re-clone the clone every month or so............my older 500GB (MX500) holds my info...........

the SSD test i mentioned earlier had a PNY 128GB SSD write 1000TB before it went bust.............so potentially your 500GB MX500 (definitely better than PNY) will write 4000-5000TB ?
 

Lucretia19

Reputable
Feb 5, 2020
i too wanted to use my SSD as long as possible................but i think after 5 years, reliability will be quite suspect............my boot drive (MX500 250GB) is now at 92% after 10 months.............i think if you have a separate boot drive + storage drive, it's better as in case of a failure - you still have storage drive keeping your data intact..........

what i do is this............i have a 250GB boot drive and a 250GB clone (also MX500) of it and i re-clone the clone every month or so............my older 500GB (MX500) holds my info...........

the SSD test i mentioned earlier had a PNY 128GB SSD write 1000TB before it went bust.............so potentially your 500GB MX500 (definitely better than PNY) will write 4000-5000TB ?

Your argument about "suspect reliability" appears to be based on the fact that your SSD's Remaining Life dropped to 92% after only 10 months. That reasoning doesn't apply to my SSD, which has Remaining Life dropping less than 2% per year (due to the selftests regime) as shown by the table in my previous post.

If any of your SSDs have a high WAF, the buggy behavior is probably responsible and the selftests regime would likely benefit you too. How many GB has your pc written to each of your SSDs? Do you record this value each time Remaining Life decreases, to check whether the problem is worsening for any of your SSDs?

I disagree with your calculation that my 500GB MX500 might endure 4000 to 5000 TB. Here's my calculation: If memory serves, my understanding is that an SSD can no longer be written to after its Remaining Life reaches 0%... it becomes a read-only device, which isn't very useful. My pc has written about 3.2 TB to the SSD during the last 2 years. Assuming write rates in the future continue like the last 2 years, my spreadsheet predicts Remaining Life will reach 0% about 67 years from now. Continuing like the last 2 years means about 107.2 TB will be written by the pc during the next 67 years. (3.2 TB x 67 / 2.) Add 107.2 to the 11 already written, and this is MUCH less than your 4000-5000 estimate. If I were to abandon the selftests regime, and assuming Remaining Life would resume decreasing at its 1/15/2020 to 2/04/2020 rate when Remaining Life dropped from 94% to 93% while the pc wrote only 138 GB, then Remaining Life would reach 0% after only about 12 TB more is written by the pc. (Note: Crucial advertises a 180 TB durability spec.)

The only reasons I'm aware of to NOT run the selftests regime are: (1) its 2.5 years of apparent success doesn't prove beyond a reasonable doubt that there will be no long term negative effect, and (2) the selftests cost about 1 watt of power because they prevent the SSD from entering low power mode, and this extra watt of electricity costs about $1 per year. A possible beneficial side effect is that avoiding low power mode prevents the SSD's temperature from fluctuating... temperature changes tend to cause wear & tear on electronic and non-electronic materials.
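
For clarity, here's the projection as a few lines of Python, using the same figures as above:

Code:
# Back-of-the-envelope projection from the figures in this post.
tb_written_last_2y = 3.2    # host TB written during the last 2 years
years_to_0_percent = 67     # spreadsheet projection for Remaining Life reaching 0%
tb_already_written = 11     # approximate host TB written so far

future_tb = tb_written_last_2y * years_to_0_percent / 2    # ~107.2 TB
lifetime_tb = tb_already_written + future_tb               # ~118 TB

print(f"Projected lifetime host writes: about {lifetime_tb:.0f} TB "
      f"(vs. the 180 TB rating, and far below 4000-5000 TB)")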
 

worstalentscout

Distinguished
Nov 1, 2016
Your argument about "suspect reliability" appears to be based on the fact that your SSD's Remaining Life dropped to 92% after only 10 months. That reasoning doesn't apply to my SSD, which has Remaining Life dropping less than 2% per year (due to the selftests regime) as shown by the table in my previous post.
-snip-


the 5 year thing is not because of write endurance but 5 years is a long time and so i would use a clone to run the pc and ''retire'' the original SSD............

as for the percentage of lifespan.................that's based on the warranty.............so if the warranty is based on 100TB written and you've written 5TB.................then it'll show 95% left

remember the test i told you of a 128GB SSD writing 100TB before it died ?...............the best performing one wrote 1000TB before dying.............the MX500 is a better brand and with 500GB, i think you need not worry..........
 

Lucretia19

Reputable
Feb 5, 2020
the 5 year thing is not because of write endurance but 5 years is a long time and so i would use a clone to run the pc and ''retire'' the original SSD............

as for the percentage of lifespan.................that's based on the warranty.............so if the warranty is based on 100TB written and you've written 5TB.................then it'll show 95% left

remember the test i told you of a 128GB SSD writing 100TB before it died ?...............the best performing one wrote 1000TB before dying.............the MX500 is a better brand and with 500GB, i think you need not worry..........

1. Five years isn't necessarily a long time. One of the hard drives in my system is 14 years old and has no bad sectors. My previous computer lasted 11 years and could have been repaired by replacing a few electrolytic capacitors on the motherboard.

2. Your claim that the decrease of Remaining Life equals the percentage of the manufacturer-rated endurance (180 TB for a 500GB MX500 ssd) already written is incorrect, as you should be able to deduce from the table I posted. It's actually directly related to the Average Block Erase Count attribute... each 1% of Remaining Life corresponds to 15 average erases per NAND memory block (see the sketch at the end of this post). Many of the erases are caused by a low priority background process in the ssd that moves data from block to block with the intent of leveling the wear on the memory blocks. You might want to google 'ssd write amplification.'

3. My hunch is that the test you cited is unrealistic. To test how fast an ssd can be killed by writing to it, the test will presumably write and erase data as fast as possible. This continual writing minimizes the runtime given to the ssd's low priority wear-leveling process that I mentioned above... it's similar to how continual reading by a selftest minimizes the runtime of the wear-leveling process. (In order to make a small amount of runtime available to the wear-leveling process, my selftest regime doesn't run a selftest for 30 seconds out of every 20 minutes.)

Although you keep saying Crucial is a better brand than PNY in general, you appear to be neglecting the destructive effect of the Crucial bug (which the selftests tame).
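
Here's the relationship from point 2 as a tiny Python sketch. The divide-by-15 figure is what my 500GB MX500 reports; it happens to match the 1TB drive data posted earlier in the thread, but treat it as an assumption for other capacities:

Code:
# Remaining Life as derived from Average Block Erase Count (attribute 173).
# On my 500GB MX500, every 15 average erases per block costs 1% of life.
ERASES_PER_PERCENT = 15

def remaining_life(abec: int) -> int:
    """Percent of rated life remaining for a given ABEC value."""
    return 100 - abec // ERASES_PER_PERCENT

print(remaining_life(150))    # my drive on 9/16/2021 -> 90%
print(remaining_life(1273))   # the 1TB drive posted earlier -> 16%, matching its attribute 202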
 

Lucretia19

Reputable
Feb 5, 2020
So here we are in August, and my two drives are respectively

Affected
[screencapture deleted]

Still Unaffected
[screencapture deleted]

Thanks for the status update. For the sake of other readers who don't remember your previous comment, what you're showing here is evidence that the MX500 hardware revision plus firmware update has eliminated the WAF bug... a lot of time has passed since your newer ssd was placed into service, and it appears to be "still unaffected" by the bug... which is evidence that it doesn't have the bug. Presumably the bug is in the older controller chip, and can't be fixed by just a firmware update.
 

Diceman_2037

Distinguished
Dec 19, 2011
Yeah, luckily I only have one of these affected drives in a difficult-to-replace install (placed in a Toshiba Qosmio I use for a low-load Terraria/Starbound server).

From what I have seen though, the Health value of these affected drives actually rolls over to 100% and starts counting down again, which isn't ideal since you'd expect a drive at 0% health to go into read-only mode.