Question Crucial MX500 500GB sata ssd Remaining Life decreasing fast despite few bytes being written


Lucretia19

Prominent
Feb 5, 2020
142
11
595
2
@fzabkar: Yes, I get the data from attributes, not from the log. My selftests .bat and my logger .bat use the "smartctl -A" or "smartctl -x" command, depending on whether they want to read the basic SMART attributes or the extended attributes.

I became aware of Crucial's 32bit SectorsReadByHost bug from a thread in the HWiNFO forum. Martin, the developer of HWiNFO, patched HWiNFO so it would obtain Total Written from the SMART attribute instead of from the ATA Statistics. He wrote on Feb 29 that Crucial's 32bit bug makes Total Read impossible to obtain: https://www.hwinfo.com/forum/threads/crucial-mx500-tbw-not-accurate.5243/page-2#post-23895
 
Is this the log from GSmartControl?
Code:
smartctl 6.6 2017-11-05 r4594 [x86_64-w64-mingw32-w10-b18363] (sf-6.6-1)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SiliconMotion based SSDs
Device Model:     CT1000BX100SSD1
Serial Number:    1504F0023EEC
LU WWN Device Id: 5 00a075 1f0023eec
Firmware Version: MU02
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 26 01:11:02 2020 GMTST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x71) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0002)    Does not save SMART data before
                    entering power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x0035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     ------   100   100   000    -    0
  5 Reallocated_Sector_Ct   ------   100   100   000    -    0
  9 Power_On_Hours          ------   100   100   000    -    3079
12 Power_Cycle_Count       ------   100   100   000    -    3378
160 Uncorrectable_Error_Cnt ------   100   100   000    -    0
161 Valid_Spare_Block_Cnt   ------   100   100   000    -    129
163 Initial_Bad_Block_Count ------   100   100   000    -    70
164 Total_Erase_Count       ------   100   100   000    -    86861
165 Max_Erase_Count         ------   100   100   000    -    119
166 Min_Erase_Count         ------   100   100   000    -    6
167 Average_Erase_Count     ------   100   100   000    -    41
168 Max_Erase_Count_of_Spec ------   100   100   000    -    2000
169 Remaining_Lifetime_Perc ------   100   100   000    -    100
175 Program_Fail_Count_Chip ------   100   100   000    -    0
176 Erase_Fail_Count_Chip   ------   100   100   000    -    0
177 Wear_Leveling_Count     ------   100   100   000    -    26
178 Runtime_Invalid_Blk_Cnt ------   100   100   000    -    0
181 Program_Fail_Cnt_Total  ------   100   100   000    -    0
182 Erase_Fail_Count_Total  ------   100   100   000    -    0
192 Power-Off_Retract_Count ------   100   100   000    -    76
194 Temperature_Celsius     ------   100   100   000    -    36
195 Hardware_ECC_Recovered  ------   100   100   000    -    1705221
196 Reallocated_Event_Count ------   100   100   000    -    0
197 Current_Pending_Sector  ------   100   100   000    -    0
198 Offline_Uncorrectable   ------   100   100   000    -    0
199 UDMA_CRC_Error_Count    ------   100   100   000    -    0
232 Available_Reservd_Space ------   100   100   000    -    100
241 Host_Writes_32MiB       ------   100   100   000    -    664066
242 Host_Reads_32MiB        ------   100   100   000    -    1167628
245 TLC_Writes_32MiB        ------   100   100   000    -    1389776
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL,SL  R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL,SL  R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O     64  Current Device Internal Status Data log
0x25       GPL     R/O     64  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
SCT Support Level:                   0
Device State:                        Active (0)
Current Temperature:                     0 Celsius
Power Cycle Min/Max Temperature:     33/38 Celsius
Lifetime    Min/Max Temperature:      0/41 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        0 minutes
Min/Max recommended Temperature:      0/100 Celsius
Min/Max Temperature Limit:            0/100 Celsius
Temperature History Size (Index):    128 (34)

Index    Estimated Time   Temperature Celsius
  35    2020-06-25 23:04    36  *****************
...    ..(126 skipped).    ..  *****************
  34    2020-06-26 01:11    36  *****************

SCT Error Recovery Control command not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4            3378  ---  Lifetime Power-On Resets
0x01  0x010  4            3079  ---  Power-on Hours
0x01  0x018  6       570578176  ---  Logical Sectors Written
0x01  0x020  6       710027405  ---  Number of Write Commands
0x01  0x028  6      3507257538  ---  Logical Sectors Read
0x01  0x030  6       984137480  ---  Number of Read Commands
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              76  ---  Resets Between Cmd Acceptance and Completion
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4           20751  ---  Number of Hardware Resets
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               2  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            0  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
 
Code:
Device Statistics (GP Log 0x04)

0x01  0x018  6       570578176  ---  Logical Sectors Written    -> 272 GiB
0x01  0x028  6      3507257538  ---  Logical Sectors Read       -> 1672 GiB
Code:
SMART Attributes Data

241 Host_Writes_32MiB       ------   100   100   000    -    664066     -> 20.3 TiB
242 Host_Reads_32MiB        ------   100   100   000    -    1167628    -> 35.6 TiB
245 TLC_Writes_32MiB        ------   100   100   000    -    1389776    -> 42.4 TiB
It would appear that this model also has a 32-bit bug when reporting the Logical Sectors Written / Read in the Device Statistics log.

I can see where the erroneous (?) 272 GiB and 1672 GiB figures come from, but I can't see how the Total NAND Writes figure of 10 GiB is computed. Perhaps the units for NAND writes are assumed to be 8 KiB rather than 32 MiB, in which case the result would be about 10 GiB (1389776 x 8 KiB = 10.6 GiB)?

In any case I would go with the SMART attribute data (not Device Statistics) reported by GSmartControl. They seem to be the most plausible.

Code:
Host_Writes_32MiB ------ 20.3 TiB
Host_Reads_32MiB  ------ 35.6 TiB
TLC_Writes_32MiB  ------ 42.4 TiB
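
The conversions above, and the suspected 32-bit wrap, can be checked with a few lines of Python. This is only a sketch: it assumes 512-byte logical sectors (as the smartctl output reports) and that attributes 241/242 really count 32 MiB units, as their names suggest. The function names are made up for illustration.

```python
SECTOR_BYTES = 512            # logical sector size reported by smartctl
UNIT_BYTES = 32 * 2**20       # assumed size of one Host_Writes/Reads unit (32 MiB)

def sectors_to_gib(sectors):
    """Convert a Device Statistics sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30

def units_to_tib(units):
    """Convert a count of 32 MiB units to TiB."""
    return units * UNIT_BYTES / 2**40

def wrap_gap(units, logged_sectors):
    """Sectors by which the logged value exceeds the low 32 bits of the
    sector count implied by the SMART attribute."""
    implied_sectors = units * (UNIT_BYTES // SECTOR_BYTES)
    return logged_sectors - implied_sectors % 2**32

print(round(sectors_to_gib(570578176)), round(sectors_to_gib(3507257538)))
print(round(units_to_tib(664066), 1), round(units_to_tib(1167628), 1))
# A gap smaller than one 32 MiB unit (65536 sectors) is consistent with a wrap.
print(wrap_gap(664066, 570578176), wrap_gap(1167628, 3507257538))
```

With the figures posted above, both gaps come out smaller than 65536 sectors, which is what one would expect if the Device Statistics counters hold only the low 32 bits of the true sector counts.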
 

Lucretia19

@Flayed: Your smartctl (GSmartControl) output says the ssd's Remaining Life is 100%, which disagrees with the 98% reported by HWiNFO. There's a much newer version of smartctl that might give different (more accurate?) results. Are you using an old version of HWiNFO too?

@fzabkar: Yes. Martin, the developer of HWiNFO, wrote there are 32bit bugs in both the Device Statistics bytes written and bytes read extended attributes. For the MX500 he worked around the bytes written bug by patching HWiNFO to report bytes written from the S.M.A.R.T. attribute instead of from Device Statistics. There's no similar workaround for the bytes read bug for the MX500 because it has no S.M.A.R.T. bytes read attribute.
 
I have HWiNFO v6.28-4200, which may not be the latest but is less than a month old. I downloaded GSmartControl when I posted its output.
 
Jun 16, 2020
5
0
10
0
@Lucretia19

Thank you for your bat file, it started working immediately once I activated it on the 28th of June 2020 as you can see in the results below.

As you also posted earlier, the rate at which the life expectancy deteriorated increased dramatically each month, and the ssd would definitely not have lasted the 5 years.

I will keep monitoring it and post my results after a month again.

These are the latest stats I recorded, which show the WAF decreasing again:

Date      | Total Host Writes (GB) | S.M.A.R.T. F7 | S.M.A.R.T. F8 | WAF = (F7+F8)/F7
16 Jun 20 |                        | 246,525,596   | 3,668,514,207 | 15.88
17 Jun 20 |                        | 247,596,431   | 3,693,337,626 | 15.92
18 Jun 20 |                        | 250,351,148   | 3,718,474,384 | 15.85
19 Jun 20 |                        | 250,933,222   | 3,738,149,215 | 15.90
20 Jun 20 |                        | 251,584,547   | 3,757,898,839 | 15.94
21 Jun 20 |                        | 252,222,762   | 3,784,978,908 | 16.01
22 Jun 20 |                        | 252,910,747   | 3,804,643,159 | 16.04
23 Jun 20 |                        | 253,875,446   | 3,831,992,628 | 16.09
24 Jun 20 |                        | 257,148,500   | 3,853,351,613 | 15.98
25 Jun 20 |                        | 258,080,161   | 3,865,554,431 | 15.98
26 Jun 20 |                        | 258,831,585   | 3,889,189,851 | 16.03
27 Jun 20 |                        | 259,300,872   | 3,920,259,800 | 16.12
28 Jun 20 |                        | 260,232,165   | 3,940,894,467 | 16.14
29 Jun 20 |                        | 261,136,679   | 3,953,413,414 | 16.14
30 Jun 20 |                        | 262,089,122   | 3,961,810,405 | 16.12
1 Jul 20  |                        | 264,238,853   | 3,966,676,450 | 16.01
2 Jul 20  |                        | 265,930,626   | 3,971,530,721 | 15.93
3 Jul 20  |                        | 266,778,652   | 3,975,586,906 | 15.90
4 Jul 20  |                        | 267,539,923   | 3,979,862,300 | 15.88
5 Jul 20  |                        | 268,322,199   | 3,983,989,224 | 15.85
6 Jul 20  |                        | 269,666,723   | 3,987,879,415 | 15.79
7 Jul 20  |                        | 270,300,779   | 3,992,871,529 | 15.77
8 Jul 20  |                        | 271,490,831   | 3,996,957,070 | 15.72
9 Jul 20  |                        | 274,936,859   | 4,003,075,490 | 15.56
 

Lucretia19

@NamZIX: Which of my .bat files have you been running? I'll guess the selftests .bat since your Total WAF has been decreasing steadily since June 29. (Total WAF covers not just the recent period, but all the months/years that your ssd has been in operation.)

It would be easier to see what's been happening recently if you copy your data to a spreadsheet that has extra columns of formulas that automatically calculate "recent WAF." Here's the calculation of your WAF for the period from June 29 to July 9:
Date       | F7          | F8            | Total WAF | F7 Increase | F8 Increase | WAF
06/29/2020 | 261,136,679 | 3,953,413,414 | 16.14     |             |             |
07/09/2020 | 274,936,859 | 4,003,075,490 | 15.56     | 13,800,180  | 49,662,076  | 4.60
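
The "recent WAF" arithmetic can be written as a small Python function instead of spreadsheet formulas. This is a sketch using the thread's definition, WAF = (F7 increase + F8 increase) / F7 increase; the function name is invented for illustration.

```python
def recent_waf(f7_start, f8_start, f7_end, f8_end):
    """WAF over a period: (F7 increase + F8 increase) / F7 increase."""
    f7_inc = f7_end - f7_start
    f8_inc = f8_end - f8_start
    if f7_inc <= 0:
        raise ValueError("F7 must increase over the period")
    return (f7_inc + f8_inc) / f7_inc

# June 29 to July 9, using the F7/F8 values posted above
waf = recent_waf(261_136_679, 3_953_413_414, 274_936_859, 4_003_075_490)
print(f"{waf:.2f}")
```

Feeding it any two rows of the daily table gives the WAF for just that interval, which is more sensitive to recent changes than the lifetime total.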

Have you also kept records of the Average Block Erase Count attribute? You can use the rate of increase of ABEC to estimate how many years of ssd life remain, since each 15 increments of ABEC correspond to 1% of Remaining Life. The estimate assumes, of course, that the host pc will continue to write to the ssd at a rate comparable to the rate it has been writing.
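
A rough sketch of that estimate in Python, assuming (as stated above) that 15 ABEC increments correspond to 1% of life, so 0% is reached at ABEC 1500, and that the write rate stays constant. The function name and the sample readings are hypothetical.

```python
ABEC_PER_PERCENT = 15   # from the post: 15 ABEC increments per 1% of life

def years_remaining(abec_now, abec_earlier, days_between):
    """Linear extrapolation of remaining life from the ABEC growth rate."""
    increase = abec_now - abec_earlier
    if increase <= 0 or days_between <= 0:
        raise ValueError("need a positive ABEC increase over a positive interval")
    abec_per_day = increase / days_between
    abec_left = 100 * ABEC_PER_PERCENT - abec_now   # increments until 0% remains
    return abec_left / abec_per_day / 365.25

# Hypothetical readings, not taken from anyone's drive:
print(round(years_remaining(abec_now=115, abec_earlier=100, days_between=365.25), 1))
```

Because ABEC moves in whole steps, the estimate is only as good as the length of the interval it is measured over; a few increments give a very coarse rate.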
 

Lucretia19

@NamZIX: For comparison with your 4.60 recent WAF, your WAF was 20.87 for the period from June 16 (your earliest data) to June 28 (the day you began running ssd selftests), as the following calculation shows:
Date       | F7          | F8            | Total WAF | F7 Increase | F8 Increase | WAF
06/16/2020 | 246,525,596 | 3,668,514,207 | 15.88     |             |             |
06/28/2020 | 260,232,165 | 3,940,894,467 | 16.14     | 13,706,569  | 272,380,260 | 20.87
So the selftests have definitely improved your WAF.

If you benchmark the ssd read & write speeds while a selftest is running, I hope you'll post those results here too, so we can see whether your experience matches mine: selftests don't cause slowdown.

My pc writes to my ssd (attribute F7) at about one fourth the rate yours does... my pc wrote about 3.7 million NAND pages during the most recent 12 days, and your pc wrote about 13.8 million NAND pages during the most recent 10 days. So there may be more that you can do to preserve your ssd. In particular, if you also have a hard drive in the system, you could relocate the folders of frequently written files from the ssd to a hard drive like I did. One way to do that is to create directory junctions (using the "mklink /J" command).
 

Lucretia19

I have a new idea that might reduce WAF even more than the selftests regime already does. Recall that my selftests .bat file runs an ssd selftest every 20 minutes (or whatever) and aborts the selftest 30 seconds (or whatever) before the end of each 20 minute loop. The purpose of the 30 second idle pause is to allow any necessary, lower priority ssd background processes a little time to run. But the occasional FTL write bursts occur during some of those pauses. The new idea is that during each 30 second pause, the .bat file could monitor the ssd SMART attributes every second or so to check whether an FTL write burst is occurring, and if so start the next selftest immediately (or after a few seconds) instead of waiting until the end of the 30 second pause. (A write burst is presumably occurring if the C5 SMART attribute incremented. Alternatively, a write burst is occurring if the F8 SMART attribute is increasing at the maximum speed.)

I've observed that there are about one or two large FTL write bursts per day with the selftests regime running, which is a small fraction of the number of 30 second pauses per day. So, if I interrupt the occasional bursts by starting the next selftest when a burst is detected, other necessary lower priority ssd background processes should continue to receive nearly all the runtime they've been receiving.

The reason why I wrote "or after a few seconds" above is that it might be safest to let the FTL write burst make some progress before interrupting the burst with a selftest.
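
The idea can be sketched in Python as follows. This is not the actual .bat file; `read_c5` and `start_selftest` are hypothetical callables standing in for whatever reads the C5 attribute and launches the selftest, and the 3 second grace period is an arbitrary placeholder for "a few seconds".

```python
import time

def monitored_pause(read_c5, start_selftest, pause_seconds=30,
                    grace_seconds=3, sleep=time.sleep):
    """Poll C5 once per second during the pause; if a burst is seen, let it
    run for grace_seconds, then start the next selftest early.
    Returns the number of seconds spent in the pause."""
    burst_seen_at = None
    for elapsed in range(pause_seconds):
        if read_c5() != 0 and burst_seen_at is None:
            burst_seen_at = elapsed           # first sample showing a burst
        if burst_seen_at is not None and elapsed - burst_seen_at >= grace_seconds:
            start_selftest()                  # interrupt the burst early
            return elapsed
        sleep(1)
    start_selftest()                          # normal case: full-length pause
    return pause_seconds
```

Note that this version starts the selftest once the grace period has passed even if the burst has already ended by then; checking that C5 is still nonzero before interrupting would be an equally plausible variant.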

Does anyone have thoughts about whether interrupting the FTL write bursts would be unwise?

Here's an update: My ssd's Remaining Life is still 92%, which it reached on March 13th. The ABEC attribute has increased by only 12 since March 13th, and is now 132, which it reached on August 27th. (When ABEC reaches 135, Remaining Life will reach 91%.) At this rate, the remaining lifetime is about 45 years.

Recently I found a way to reduce the host pc's ssd writing a little more than I'd already done: I created a Windows startup task that runs Windows' logman.exe utility to redirect two frequently written log files (NetCore.etl and LwtNetLog.etl) to the hard disk instead of ssd. I started redirecting those two files 10 days ago. The result is that the host pc has been writing to the ssd at an average rate of 72 kBytes/second over the last 10 days (according to HWiNFO). According to the F7 SMART attribute, F7 has increased about 214,000 NAND pages per day during the 10 days. For the 5 months prior, F7 increased about 320,000 NAND pages per day. However, the rate of total ssd NAND writing, F7+F8, hasn't decreased much because the FTL write bursts have increased, which has increased the rate of F8 writing and increased WAF. A good measure of the rate that ssd remaining life is decreasing is F7+F8, which has been increasing at about 818,000 NAND pages per day during the 10 days, compared to about 839,000 per day for the prior 5 months. I should add that I'm not confident that the rate of F7+F8 writing is truly less than it was before I redirected the two log files, because the fluctuations in the rate of F8 writing are large. More time (more data points) will be needed before I can reach a conclusion.
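
For illustration, the rounded daily rates quoted above imply the following WAF values and change in total NAND writing. This is just arithmetic on the numbers in the post; the function name is made up.

```python
def waf_from_daily_rates(f7_per_day, total_per_day):
    """WAF implied by the daily increase of F7 and of F7+F8."""
    return total_per_day / f7_per_day

recent = waf_from_daily_rates(214_000, 818_000)   # last 10 days
prior = waf_from_daily_rates(320_000, 839_000)    # previous 5 months
change = (818_000 - 839_000) / 839_000            # relative change in F7+F8

print(f"WAF {prior:.2f} -> {recent:.2f}, total NAND writes {change:+.1%}")
```

So host writes fell by about a third while total NAND writes fell by only a few percent, which is why WAF alone looks worse even though wear is (slightly) lower.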
 

Lucretia19

I implemented the idea described in my previous post: During the 30 second pauses between 19.5 minute selftests, a .bat file named "monitorFTLWriteBursts.bat" reads the ssd's Current Pending Sectors attribute once per second, logs when it toggled from 0 to 1 or from 1 to 0, and -- optionally, depending on the version of the .bat file -- interrupts the FTL write burst by starting a selftest immediately (ahead of schedule). The .bat is started by the selftests controller .bat just before beginning its 30 second pause, and the two .bat files execute in parallel.
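
For readers not on Windows, reading one attribute out of "smartctl -A" output might look like the Python sketch below. It assumes smartctl's default ten-column attribute table (ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw value), not the shorter layout printed by "smartctl -x". The function name is hypothetical and the sample rows are ordinary attribute lines used only to exercise the parser.

```python
def raw_value(smartctl_output, attr_id):
    """Return the integer RAW_VALUE of one attribute from `smartctl -A` text,
    or None if the attribute is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None

sample = """\
197 Bogus_Current_Pend_Sect 0x0032   100   100   000    Old_age   Always       -       0
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       327654315
"""
print(raw_value(sample, 197), raw_value(sample, 247))
```

Attribute 197 (C5 in hex) is the one the monitor watches; polling it once per second and comparing with the previous reading is enough to log the 0-to-1 and 1-to-0 toggles.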

I stopped using the version that interrupts the FTL write bursts because it might have been responsible for two extreme slowdowns of the pc that occurred hours later. Perhaps it was the ssd that slowed down, which would look like the pc as a whole slowed down. However, the cause of the slowdowns might not have been my interruptions of the write bursts... another possible culprit is Firefox, which had about 100 tabs open, probably more than I'd ever had open before. To recover from the slowdowns, I had to hold down the pc's power button long enough to shut down the pc. After the second occurrence, I saved & closed most of those Firefox tabs and reverted the .bat so it only keeps a log of the write bursts and doesn't interrupt them.

Someday I'll experiment again with interrupting the FTL write bursts, with Firefox not so heavily loaded, to check whether the interruptions cause the extreme slowdowns. If they're the culprit, I could try interrupting only the longest bursts -- say, bursts that last longer than 6 seconds -- to see whether that causes slowdowns too.

If interruptions of FTL write bursts by selftests can cause the ssd to slow down, I would categorize that as another ssd bug. Starting selftests shouldn't cause problems undocumented by the ssd manufacturer.

Here's an excerpt of the BurstSummary log, covering the last 24-ish hours:
09/04/2020 6:12:17.21, None
09/04/2020 6:32:17.17, None
09/04/2020 6:52:12.18, None
09/04/2020 7:12:12.15, None
09/04/2020 7:31:39.11, Burst began 1 seconds into selftest pause, duration 4
09/04/2020 7:52:14.11, None
09/04/2020 8:12:15.21, None
09/04/2020 8:32:16.15, None
09/04/2020 8:52:13.20, Burst began 25 seconds into selftest pause, duration 2
09/04/2020 9:12:17.13, None
09/04/2020 9:32:12.11, None
09/04/2020 9:52:12.13, None
09/04/2020 10:12:13.18, None
09/04/2020 10:32:14.13, None
09/04/2020 10:52:15.12, None
09/04/2020 11:12:17.16, None
09/04/2020 11:32:17.17, None
09/04/2020 11:52:02.16, Burst began 18 seconds into selftest pause, duration 5
09/04/2020 12:12:12.21, None
09/04/2020 12:32:12.13, None
09/04/2020 12:52:13.21, None
09/04/2020 13:12:14.14, None
09/04/2020 13:31:43.15, Burst began 2 seconds into selftest pause, duration 5
09/04/2020 13:52:16.12, None
09/04/2020 14:12:17.13, None
09/04/2020 14:32:48.17, None
09/04/2020 15:12:12.12, None
09/04/2020 15:32:13.18, None
09/04/2020 15:52:14.12, None
09/04/2020 16:12:15.12, None
09/04/2020 16:32:16.15, None
09/04/2020 16:52:17.20, None
09/04/2020 17:12:17.17, None
09/04/2020 17:32:12.16, None
09/04/2020 17:52:12.14, None
09/04/2020 18:12:13.20, None
09/04/2020 18:32:14.13, None
09/04/2020 18:52:15.13, None
09/04/2020 19:12:16.18, None
09/04/2020 19:32:17.19, None
09/04/2020 19:52:18.21, None
09/04/2020 20:12:12.11, None
09/04/2020 20:32:12.19, None
09/04/2020 20:52:13.19, None
09/04/2020 21:12:14.18, None
09/04/2020 21:32:15.17, None
09/04/2020 21:52:16.12, None
09/04/2020 22:12:17.15, None
09/04/2020 22:32:17.15, None
09/04/2020 22:52:12.15, None
09/04/2020 23:12:12.19, None
09/04/2020 23:32:13.19, None
09/04/2020 23:52:14.14, None
09/05/2020 0:12:15.17, None
09/05/2020 0:32:16.16, None
09/05/2020 0:52:17.16, None
09/05/2020 1:12:18.19, None
09/05/2020 1:32:12.18, None
09/05/2020 1:52:12.12, None
09/05/2020 2:12:13.12, None
09/05/2020 2:32:14.12, None
09/05/2020 2:52:15.17, None
09/05/2020 3:12:16.20, None
09/05/2020 3:32:17.19, None
09/05/2020 3:52:18.17, None
09/05/2020 4:12:12.16, None
09/05/2020 4:32:12.14, None
09/05/2020 4:52:13.13, None
09/05/2020 5:12:14.14, None
09/05/2020 5:32:15.21, None
09/05/2020 5:52:16.16, None
09/05/2020 6:12:17.15, None
09/05/2020 6:32:17.11, None
09/05/2020 6:52:12.23, None
09/05/2020 7:12:12.14, None
Here's an excerpt of the BurstDetails log over the same 24-ish hours period; note the two timestamps (8:52:08.24 and 8:52:12.16, highlighted in red in the original post) that were delayed longer than the programmed one second delay:
____
2 FTL burst began 1 seconds into selftest pause.
2 09/04/2020 7:31:34.23, C5=1
3 09/04/2020 7:31:35.18, C5=1
4 09/04/2020 7:31:36.23, C5=1
5 09/04/2020 7:31:37.18, C5=1
6 09/04/2020 7:31:38.17, C5=0
____
26 FTL burst began 25 seconds into selftest pause.
26 09/04/2020 8:52:02.21, C5=1
27 09/04/2020 8:52:08.24, C5=1
28 09/04/2020 8:52:12.16, C5=0
____
19 FTL burst began 18 seconds into selftest pause.
19 09/04/2020 11:51:56.23, C5=1
20 09/04/2020 11:51:57.19, C5=1
21 09/04/2020 11:51:58.25, C5=1
22 09/04/2020 11:51:59.22, C5=1
23 09/04/2020 11:52:00.19, C5=1
24 09/04/2020 11:52:01.14, C5=0
____
3 FTL burst began 2 seconds into selftest pause.
3 09/04/2020 13:31:37.16, C5=1
4 09/04/2020 13:31:38.25, C5=1
5 09/04/2020 13:31:39.25, C5=1
6 09/04/2020 13:31:40.25, C5=1
7 09/04/2020 13:31:41.23, C5=1
8 09/04/2020 13:31:42.19, C5=0
I was surprised that write bursts occasionally begin near the end of the pause period. (The most common time for bursts to begin is not at all, as shown by the Burst Summary log, which shows that most pauses do not include a burst. The next most common time is as soon as the pause begins, although the excerpt doesn't show any of those.) A write burst that starts near the end of the pause period might be interrupted by the next regularly scheduled selftest. The burst shown above that began at 9/04/2020 8:52:02 might have been interrupted by the next regularly scheduled selftest, as the burst began 25 seconds after the pause began, near the end of the pause. (Note that the pause durations are actually shorter than 30 seconds due to the precise timing by my selftests controller .bat, which takes into account the non-constant several seconds that smartctl.exe takes to execute the command to launch a selftest.) The timestamps of the entries in the BurstDetails log show that more than one second elapsed between log entries for the burst that began 25 seconds after the pause began, and I believe this is because the ssd queued the smartctl -A command that read the ssd attributes rather than executing the command immediately. I assume the ssd queued the command because it was working on the command from the selftests controller .bat to start the next selftest.
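
The relationship between the two logs can be sketched in Python. The convention used here is inferred from the excerpts, not taken from the .bat file: the start offset appears to be the index of the first C5=1 sample minus one, and the duration appears to be the number of samples with C5=1. The function name is invented.

```python
def summarize_burst(samples):
    """samples: list of (sample_index, c5) pairs logged about once per second.
    Returns (start_offset, duration) in the summary log's apparent convention,
    or None if no sample shows a burst."""
    active = [index for index, c5 in samples if c5 != 0]
    if not active:
        return None
    start_offset = active[0] - 1      # the log seems to number samples from 1
    duration = len(active)            # counts samples, not wall-clock seconds
    return start_offset, duration

# The 7:31 and 8:52 bursts from the BurstDetails excerpt
print(summarize_burst([(2, 1), (3, 1), (4, 1), (5, 1), (6, 0)]))
print(summarize_burst([(26, 1), (27, 1), (28, 0)]))
```

If that reading is right, the "duration 2" logged for the 8:52:02 burst understates it, since its three entries span roughly 10 seconds of wall-clock time.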
 
Oct 13, 2020
2
0
10
0
Wow, this is kinda interesting. I checked the health of my MX500 250GB the other day and I was shocked. After just under 8 months (bought mid-February) or so it had only 69% of its health remaining.
This was the state of the SSD on October 13, when I first checked the SMART values.

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1453
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       253
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   069   069   000    Old_age   Always       -       475
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       34
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   035   000    Old_age   Always       -       38 (Min/Max 0/65)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Bogus_Current_Pend_Sect 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   069   069   001    Old_age   Offline      -       31
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       18634704239
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       327654315
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       3754477478
So I googled a lot and found this thread and immediately started to let the long selftest run on repeat, it takes just a bit under 27 minutes for me. I'm on Linux. My system is encrypted and thus TRIM disabled. The Overall WAF was around 12.45 at that point.

Here's the WAF of the last ~10 hours of running the tests:
1+(1329199/1399009)
= 1.95010039249211406074

At first I let it run at 30 minutes and messed up a bit, but then dropped it down to 27. Over the past 5 hours:
1+(57446/671656)
= 1.08552890169967959789

Which is crazy. At first I thought it went up again, but then I realized I was mistaking the 247 value for the 248 value, because it was way lower than the 248 was. I'm not sure whether such a low value due to the selftesting is bad or not, but 1.08 seems awfully low and I think the SSD did feel sluggish at times. So I'm now back at 30 minutes per run, to let it have some more space to write stuff.

The value for 173 reached 480 yesterday, it's still at 480 since. So it indeed slows down the wear, I guess.

As reference, here the values from after booting the computer:
247 - 330712167
248 - 3794770309

5 hours ago:
247 - 331439520
248 - 3796042062

Now
247 - 332111176
248 - 3796099508
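
The WAF figures above can be reproduced from these three readings with a short Python sketch, assuming the formula used in the post, WAF = 1 + (increase of 248) / (increase of 247). The names are made up for illustration.

```python
snapshots = {                      # (attribute 247, attribute 248) as posted
    "boot":        (330_712_167, 3_794_770_309),
    "5 hours ago": (331_439_520, 3_796_042_062),
    "now":         (332_111_176, 3_796_099_508),
}

def waf_between(start, end):
    """WAF = 1 + (increase of 248) / (increase of 247)."""
    host_pages = end[0] - start[0]
    ftl_pages = end[1] - start[1]
    return 1 + ftl_pages / host_pages

print(f"{waf_between(snapshots['boot'], snapshots['now']):.2f}")
print(f"{waf_between(snapshots['5 hours ago'], snapshots['now']):.2f}")
print(f"{waf_between(snapshots['boot'], snapshots['5 hours ago']):.2f}")
```

The third figure covers the earlier part of the session, before the interval was dropped to 27 minutes, and works out noticeably higher than the last 5 hours.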
 

Lucretia19

"So I googled a lot and found this thread and immediately started to let the long selftest run on repeat, it takes just a bit under 27 minutes for me."
Where you say "on repeat" do you mean you don't have any delay between selftests?

"I'm now back at 30 minutes per run, as 1.08 seems awfully low and I think the SSD did feel sluggish at times. I'm not sure if such a low value due to the selftesting is bad or not, but I think the SSD did feel sluggish at times. Anyway, I'm now back to 30 minutes to let it have some more space to write stuff."
Do you mean the 30 minutes (per loop) includes a 27 minute selftest plus 3 minutes where the ssd is free to behave poorly?

There's an "abort selftest" command that you could use if you want to control the duty cycle of "selftest running" versus "selftest not running." I've been using a duty cycle of 19.5 minutes of selftest followed by 30 seconds of no selftest. In other words: each loop launches a selftest, aborts it 19.5 minutes later, and begins the next loop 30 seconds after that. All of the annoying FTL controller write bursts occur during the 30-second periods.
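For readers on Linux, the loop described above could be sketched roughly as follows. This is only an illustration, not the actual .bat controller used here: the function names and the injected `run`/`sleep` parameters are made up for the sketch. It relies on `smartctl -t long` to start an extended selftest and `smartctl -X` to abort one in progress, and it assumes the full selftest would take longer than 19.5 minutes on the drive in question.

```python
import subprocess
import time


def run_selftest_cycle(device, run=subprocess.run, sleep=time.sleep,
                       test_seconds=19.5 * 60, pause_seconds=30):
    """Run one loop: start a long selftest, abort it later, then pause.

    `run` and `sleep` are injectable so the loop can be dry-run without
    touching a real drive (hypothetical design choice for this sketch).
    """
    # Start an extended (long) selftest; smartctl returns immediately.
    run(["smartctl", "-t", "long", device], check=False)
    # Let the selftest run. This must be shorter than the full selftest.
    sleep(test_seconds)
    # Abort the selftest that is still in progress.
    run(["smartctl", "-X", device], check=False)
    # Leave the drive without a selftest for a short while.
    sleep(pause_seconds)


def duty_cycle(test_seconds=19.5 * 60, pause_seconds=30):
    """Fraction of each loop during which a selftest is running."""
    return test_seconds / (test_seconds + pause_seconds)


print(f"duty cycle: {duty_cycle():.1%}")
```

To use it, one would call `run_selftest_cycle("/dev/sdX")` repeatedly in a loop with sufficient privileges, where `/dev/sdX` is a placeholder for the ssd's device path.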

To determine whether your ssd is truly sluggish -- and not just your imagination or that it was busy doing a lot of reading and/or writing to service pc background processes -- why not run benchmark tests of its speed during a selftest? When I did that, I found my ssd was a tiny bit faster than its specs. (Described in a post here about 7 months ago.)

You should expect a lot of day-to-day fluctuation in the rate that the 248 attribute increases, so resist the temptation to make hasty conclusions. I've experienced great weeks and not-so-great weeks since I began the selftests regime in early March. On average, I'm pleased... my ssd's Remaining Life is still 92%, the same as it was in mid-March. (I expect it to reach 91% a few days from now, based on the rate that my attribute 173 -- ABEC -- is increasing. ABEC reached 134 on Oct 4th and when it reaches 135 that will be 9 x 15, or 9% of Total Life used.)
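The relation between ABEC and Remaining Life mentioned above can be written as a small sketch. The constant of 15 erase counts per 1% is inferred only from the "135 = 9 x 15, or 9%" arithmetic in this post; it is an assumption for this drive, not a documented Crucial formula, and the function names are made up.

```python
# Assumption: 1% of life is consumed per 15 counts of attribute 173 (ABEC),
# inferred from "135 = 9 x 15, or 9% of Total Life used".
ERASES_PER_PERCENT = 15


def life_used_percent(abec):
    """Whole percent of life used for a given Average Block Erase Count."""
    return abec // ERASES_PER_PERCENT


def remaining_life_percent(abec):
    """Remaining Life as it would be reported for a given ABEC."""
    return 100 - life_used_percent(abec)


for abec in (134, 135, 480):
    print(abec, remaining_life_percent(abec))
```

Under that assumption, an ABEC of 480 corresponds to 32% of life used, which is consistent with "a third" lost on the other drive discussed in this thread.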

Also, keep in mind that WAF isn't the best measure of how rapidly the Remaining Life is being consumed. A much better measure is the rate of increase of the sum of attributes 247 and 248, since that sum is the total written to the ssd NAND. The selftests reduce the increase of 248, and you might be able to reduce the increase of 247 by configuring your OS and apps to write less to ssd. (I've redirected a lot of writes to a hard drive. Maybe too much. I have the impression that there's a sweet spot somewhere around 50 kBytes/second average write rate, and below that rate the number of FTL write bursts mysteriously increases. But I don't yet have enough data points to say for sure.)

Quote:
5 hours ago:
247 - 331439520
248 - 3796042062
Now
247 - 332111176
248 - 3796099508

Increase of 247 over last 5 hours is 332111176 - 331439520 = 671,656
Increase of 248 over last 5 hours is 3796099508 - 3796042062 = 57,446

During those 5 hours, the amplification appears to have been very tame, possibly too low to be healthy. (Allow more pause time between selftests? Abort each selftest earlier?) But 5 hours is insignificant, and there can be a long lag time between writing by the pc and the amplification.

Your large increase of 247 during those 5 hours, 671,656, could be a concern if that rate continues. It corresponds to more than 3.2 million NAND pages per day. My 247 increases about 100,000 NAND pages per day (and based on the ratio of 247 to 246 I think the size of each NAND page on my 500GB drive is about the same as the size on your 250GB drive).
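The arithmetic in the last few paragraphs can be reproduced with a short sketch. The function name and dictionary keys are made up for illustration; the inputs are the two pairs of 247/248 readings quoted above, and the per-day figure is a simple linear extrapolation from 5 hours.

```python
def nand_write_summary(a247_start, a248_start, a247_end, a248_end, hours):
    """Summarize NAND page writes between two SMART readings.

    247 = host program page count, 248 = FTL (background) program page count.
    """
    d247 = a247_end - a247_start
    d248 = a248_end - a248_start
    return {
        "delta_247": d247,
        "delta_248": d248,
        # Write amplification over the interval (undefined if d247 is 0).
        "waf": 1 + d248 / d247,
        # Total NAND pages written; this is what actually consumes life.
        "total_pages": d247 + d248,
        # Host pages per day if the same rate continued for 24 hours.
        "host_pages_per_day": d247 * 24 / hours,
    }


summary = nand_write_summary(331439520, 3796042062,
                             332111176, 3796099508, hours=5)
for key, value in summary.items():
    print(key, value)
```

The extrapolated host rate comes out a little above 3.2 million pages per day, which is the figure compared against roughly 100,000 pages per day here.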

Your high rate of 247 writing by your pc might explain why it appeared sluggish. If so, that's unrelated to the selftests.

Here's a chart of my ssd's daily increases of 247 and 248 from Oct 1 to Oct 16:
   Δ247        Δ248   Δ247 + Δ248
 68,912     287,268       356,180
 58,960     377,314       436,274
116,280     486,270       602,550
 66,455   1,807,065     1,873,520
 63,976     468,118       532,094
 94,992     331,787       426,779
 47,197     486,413       533,610
139,079     420,262       559,341
100,982     542,747       643,729
161,132   1,271,672     1,432,804
145,152     794,243       939,395
115,457   1,036,807     1,152,264
 70,452     836,945       907,397
 53,717     589,290       643,007
 62,926   1,049,829     1,112,755
 75,428   1,412,108     1,487,536
You can see that the numbers fluctuate a lot from day to day. You can see that my 247 rate is very low compared to yours. My daily WAF (1 + Δ248/Δ247) is high, because the denominator Δ247 is so small. My WAF used to be much lower than it has been recently, averaging 2.5 from March 24 to August 20; the end of August is when I began to reduce 247 further by redirecting more writes to hard drive. But it's the sum in the third column that matters. I suspect I could reduce the sum by increasing 247, because I suspect 247 is below the sweet spot, but there's so much fluctuation that it's too soon to be confident about that. I kept notes on all those redirections to hard drive, so I know the day when each began, and someday I'll analyze my 247 and 248 data to try to determine whether I went too far.
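As a rough cross-check of the chart, the 16 daily rows can be aggregated like this. It is only a sketch; the variable names are made up, and the pairs are copied from the Δ247 and Δ248 columns of the Oct 1 to Oct 16 chart.

```python
# (delta 247, delta 248) per day, copied from the Oct 1 to Oct 16 chart.
daily = [
    (68912, 287268), (58960, 377314), (116280, 486270), (66455, 1807065),
    (63976, 468118), (94992, 331787), (47197, 486413), (139079, 420262),
    (100982, 542747), (161132, 1271672), (145152, 794243), (115457, 1036807),
    (70452, 836945), (53717, 589290), (62926, 1049829), (75428, 1412108),
]

total_247 = sum(d247 for d247, _ in daily)
total_248 = sum(d248 for _, d248 in daily)
total_nand = total_247 + total_248

# WAF over the whole period, and the daily values for comparison.
period_waf = 1 + total_248 / total_247
daily_waf = [1 + d248 / d247 for d247, d248 in daily]

# Average NAND pages written per day: the third column is what matters.
mean_daily_nand = total_nand / len(daily)

print(f"host pages: {total_247}, FTL pages: {total_248}")
print(f"period WAF: {period_waf:.2f}")
print(f"daily WAF range: {min(daily_waf):.2f} to {max(daily_waf):.2f}")
print(f"mean NAND pages per day: {mean_daily_nand:.0f}")
```

The wide range of the daily WAF values next to the much steadier period totals illustrates why single-day figures should not be over-interpreted.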
 
Oct 13, 2020
2
0
10
0
The test itself takes a bit under 27 minutes, and I was launching it every 27 minutes at first and every 30 minutes later. The latter leaves about 3 minutes of pause between selftests.

October 17
Whole day:
1+(2619206/2276672)
= 2.15

This was with 28 minutes for a short while, and later switched to 30m (the average for that time was 2.48)

I did a short write speed test and it was abysmal with and without the selftest running.
dd if=/dev/zero of=/home/shared/tmp/tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress
with selftest: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 24.3312 s, 44.1 MB/s
without selftest: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.87136 s, 121 MB/s

Not sure what was at fault there, maybe due to having run the tests all day?
However, next day after the reboot it was fine
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.87233 s, 374 MB/s

October 18
The whole day
1+(19679722/1662932)
= 12.83

I gave a different I/O scheduler and kernel a go and let it run without the selftests for a while to see if that changes anything. I checked after half a day and unfortunately it didn't change a thing.
1+(15068580/975547)
= 16.44

I turned on the selftest every 31 minutes and for the rest of the day I got
1+(4611142/687385)
= 7.70
So quite a bit higher than previously when I had the test running for about that interval.

Now I don't know anything else I could try that could maybe get rid of this, so I'll just let the selftest run every 30 minutes and just note the 247 and 248 once after booting and before shutting down the computer and see what happens.
I'll also check the write speed in the evening. If the speed doesn't change and I get less WAF I'm okay with it. Don't really want to bother even more with it, it's stressing me out and consuming time. I'll just suck it up and save up for a new SSD from a different brand that hopefully doesn't have that issue.

I definitely wouldn't mind a 5% loss after 6 months like yours had, but a third after 8 months is ridiculous. Is my increase of 247 really 'too' high, or just high compared to yours? Maybe you just don't use your SSD as much? I have the complete OS and everything running on it and I'm using the computer all day. Is there any data from other people for reference? Is there a known limit for 247 and 248?
It looks like I wrote 200GB in under 6 days.
October 13 afternoon 246 Total_LBAs_Written 18634704239 ~ 8.68 TiB (at 512 bytes per LBA)
October 19 morning 246 Total_LBAs_Written 19077392553 ~ 8.88 TiB (at 512 bytes per LBA)
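The conversion from attribute 246 to bytes can be sketched as follows. It assumes each LBA is a 512-byte sector, which matches the 8.68 and 8.88 figures above when they are read as binary units; the names are made up for illustration.

```python
# Assumption: attribute 246 (Total_LBAs_Written) counts 512-byte sectors.
SECTOR_BYTES = 512


def lbas_to_bytes(lbas):
    """Convert a count of LBAs to bytes."""
    return lbas * SECTOR_BYTES


start_lbas = 18634704239  # October 13 afternoon
end_lbas = 19077392553    # October 19 morning

written = lbas_to_bytes(end_lbas - start_lbas)

print(f"start total: {lbas_to_bytes(start_lbas) / 2**40:.2f} TiB")
print(f"end total:   {lbas_to_bytes(end_lbas) / 2**40:.2f} TiB")
print(f"written:     {written / 1e9:.1f} GB (decimal)")
print(f"written:     {written / 2**30:.1f} GiB (binary)")
```

Depending on whether decimal or binary units are used, the amount written in that period is somewhat above the rounded 200GB figure.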
 

Lucretia19

Here's a brief update about the long term effectiveness of the ssd selftests regime: a chart showing data logged on the days when the ssd's Remaining Life decreased, followed by a bit of analysis.
Date             Attribute 173 (ABEC)  Remaining Life %  Total Host Writes (GB)  Host Writes (GB) per 1% of SSD Life*
08/31/2019       15                    99                1,772                   1,772
was not logging  unknown
was not logging  unknown
was not logging  unknown
12/23/2019       75                    95                5,782                   unknown
01/15/2020       90                    94                6,172                   390
02/04/2020       105                   93                6,310                   138
03/13/2020       120                   92                6,647                   337
10/19/2020       135                   91                8,178                   1,531
* In other words, the increase of Total Host Writes over the previous row.
The rightmost column shows the selftests have been very effective. The bottom right number, 1531 GB, is how much the host pc wrote to the ssd between 3/13/2020 and 10/19/2020. That 1531 GB cost only 1% of ssd Life. Before the selftests, 1% of Life corresponded to much less host writing, as shown by the three smaller numbers above the 1531 GB.
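The rightmost column can be recomputed from the Total Host Writes column with a short sketch. The variable names are made up, and the rows are the logged ones from 12/23/2019 onward, since the earlier rows have unknown values.

```python
# (date, remaining life %, total host writes in GB) from the chart above.
rows = [
    ("12/23/2019", 95, 5782),
    ("01/15/2020", 94, 6172),
    ("02/04/2020", 93, 6310),
    ("03/13/2020", 92, 6647),
    ("10/19/2020", 91, 8178),
]

# Host GB written for each 1% drop: difference from the previous row.
per_percent = [
    (cur[0], cur[2] - prev[2]) for prev, cur in zip(rows, rows[1:])
]

# Compare the latest 1% against the mean of the three before it.
# Note the 337 GB row already includes about 12 days of selftests.
earlier = [gb for _, gb in per_percent[:-1]]
latest = per_percent[-1][1]
ratio = latest / (sum(earlier) / len(earlier))

for date, gb in per_percent:
    print(date, gb)
print(f"latest vs. mean of earlier three: {ratio:.1f}x")
```

By this crude measure, the most recent 1% of life absorbed roughly five times as much host writing as the three drops before it.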

The 1531 GB is nearly as large as when the ssd was new: 1772 GB had been written when the ssd dropped from 100% to 99%. However, the host pc was writing at a much higher rate when the ssd was new: it was installed at the beginning of August 2019, which means 1772 GB written during August 2019 versus 1531 GB written during the 7 months since 3/13/2020. I don't know whether the host write rate can affect the "Life Used versus Host Bytes Written" performance, but my hunch is that it can if the host write rate is very high or very low: A very high host write rate might not give the ssd's FTL controller as much idle time in which to misbehave, or it might overflow the ssd's SLC mode NAND cache and force the ssd to switch to direct TLC mode NAND writing. Also, preliminary data with a very low host write rate suggests the existence of a host write rate "sweet spot" that I wrote about recently.

Notes:
1. The drop of Remaining Life to 91% occurred this morning. (Or possibly very late last night... it happened during the 2 hour log period between 11:40pm last night and 1:40am this morning.)

2. The selftests regime began on 3/01/2020.

3. Because the selftests regime began on 3/01/2020, the 337 GB written by the host pc between 2/04/2020 and 3/13/2020 includes approximately 26 days before the selftests regime began and approximately 12 days with the selftests regime running. My logs include daily SMART data beginning on 2/06/2020, so I could do a more precise analysis of the "before 3/01 versus after 3/01" effect of the selftests, but I don't think it would be worth the effort.
 

Lucretia19

Quote:
The test itself takes a bit under 27 minutes and I was running it every 27 and 30 minutes. The latter has about 3 minutes of pause in between each selftest.
-snip-
I turned on the selftest every 31 minutes and for the rest of the day I got
1+(4611142/687385)
= 7.70
So quite a bit higher than previously when I had the test running for about that interval.

Now I don't know anything else I could try that could maybe get rid of this, so I'll just let the selftest run every 30 minutes and just note the 247 and 248 once after booting and before shutting down the computer and see what happens.

-snip-

Is my increase of 247 really 'too' high or just compared to yours? Maybe you just don't use your SSD as much? I have the complete OS and everything running on it and I'm using the computer all day. Is there any other people's data for reference? Is there a limit known for 247 and 248?
It looks like I wrote 200GB in under 6 days.

-snip-

You say you don't know anything else you could try. But you could reduce that 3 minutes of pause between selftests and see whether that helps. My selftests regime has only 30 seconds of pause time between selftests. (My selftests controller aborts each selftest after 19.5 minutes, and since each selftest would take about 26 minutes if not aborted, the selftest is known to run for the entire 19.5 minutes.)

I think your host pc write rate is high compared not just to mine, but also compared to typical users. But this is just from vague memory and I'm not 100% sure. However, many months ago I posted a message here that shows SMART data googled from several other users' Crucial MX500 drives, and you could google to research for other users' write rates.

If Linux offers a tool that lets you see which OS and app processes account for most of the writing (like Windows' Resource Monitor and Microsoft's Procmon do), you might be able to take simple steps to reduce unnecessary writing, such as disabling unnecessary logs (or redirecting them to a hard drive if your system has a hard drive) or increasing the period between automatic data saves. Also, if any of your drivers are old, try updating them... that significantly reduced the ssd writing on my brother's pc.
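On Linux, one possible starting point is the per-process I/O accounting that the kernel exposes in `/proc/<pid>/io`, where the `write_bytes` counter is the number of bytes the process caused to be sent to the storage layer. The sketch below is an assumption-laden illustration, not a tested monitoring tool: the function names are made up, the counters are cumulative since each process started and are not broken down per drive, and reading other users' processes normally requires root.

```python
import os


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def top_writers(limit=10, proc_root="/proc"):
    """Return (write_bytes, pid, name) for the processes that wrote the most."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io_counters = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited, or permission denied
        results.append((io_counters.get("write_bytes", 0), int(entry), name))
    return sorted(results, reverse=True)[:limit]
```

Calling `top_writers()` twice some minutes apart and comparing the `write_bytes` values would show which processes wrote the most during that interval.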
 
Oct 19, 2020
4
0
10
0
Hi

I saw your post on the hdsentinel forum about the relation of this bug to the Current_Pending_Sector bug.

The smartmontools developers recently decided to ignore Attribute #197 on MX500 SSDs, but they don't seem to be aware of the relation between this bug and that one. It would be a good idea to contact them and explain. (It's ticket 1227 on the smartmontools bug tracker: www.smartmontools.org/ticket/1227 )
 

Lucretia19

Quote:
I did a short write speed test and it was abysmal with and without the selftest running.
dd if=/dev/zero of=/home/shared/tmp/tempfile bs=1M count=1024 conv=fdatasync,notrunc status=progress
with selftest: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 24.3312 s, 44.1 MB/s
without selftest: 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.87136 s, 121 MB/s

Not sure what was at fault there, maybe due to having run the tests all day?
However, next day after the reboot it was fine
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.87233 s, 374 MB/s

Assuming your pc's write rate to the ssd is high like it was, and assuming that was going on at the same time as your speed test, perhaps that would make the speed appear abysmal.

I understand the value of keeping ssd write tests short, but on the other hand there can be large fluctuations of the other reading and writing that goes on at the same time as the speed test.

Do you have enough data to say what the average rate of host pc writing to ssd has been over a reasonably long period of time? (The increase of SMART attribute 246 or 247, divided by the elapsed time.) Also, does that average drop significantly after a reboot?
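As an illustration of that calculation, here is a minimal sketch using figures from my own Remaining Life chart earlier in the thread: 1,531 GB written by the host between 3/13/2020 and 10/19/2020. It assumes decimal gigabytes and whole days; the function name is made up.

```python
from datetime import date


def average_write_rate(bytes_written, start, end):
    """Average host write rate in bytes per second between two dates."""
    seconds = (end - start).days * 86400
    return bytes_written / seconds


start, end = date(2020, 3, 13), date(2020, 10, 19)
# Assumption: the chart's GB are decimal (1 GB = 1e9 bytes).
rate = average_write_rate(1531e9, start, end)

print(f"{(end - start).days} days, about {rate / 1000:.1f} kBytes/second")
```

The same function works with an increase of attribute 246 converted to bytes, as long as the two readings have known dates or times.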

I've noticed my write amplification appears to be better for the first few days after a power-off/power-on cycle than after the pc has been running for many days. My ssd logs record the ssd's Power Cycle Count attribute too, so in principle the logs could be used to determine whether that notion is true. Sadly, I'm too busy to do that.
 

Lucretia19

Quote:
I saw your post on hdsentinel forum about relation of this bug to Current_Pending_Sector bug.

Developers of smartmontools decided to ignore the Attribute #197 on MX500 SSDs recently, but they don't seem to be aware of the relation between this bug and that one. It'd be a good idea to contact and explain them. (it's ticket 1227 on smartmontools bug tracker www.smartmontools.org/ticket/1227 )

I tried to report it at smartmontools months ago, and I tried again today after I read your post. I can't, though, because I never receive the account verification email they send to me. (If the problem is that they don't like my email provider's domain, they should say so instead of saying the email has been sent.)

I hope you will take a moment to let them know about the discussions here and at hdsentinel. You could paste a link to each discussion, and you could paste the following:

Tom's Hardware forum user Lucretia19 wrote: "By logging SMART data at a high rate (every second) using smartctl.exe, I established that the Bogus_Current_Pending_Sectors bug correlates perfectly with the Crucial MX500's excessive write amplification bug. Specifically, Current_Pending_Sectors changes to 1 when the ssd's FTL controller begins writing a multiple of about 37000 NAND pages (37000 NAND pages is approximately 1 GByte) and changes back to 0 when the FTL write burst ends. Although the correlation is perfect, it's unknown which is more closely related to the cause and which is more closely related to the effect. (Crucial presumably knows.) Fortunately, the excessive write amplification can be largely tamed by running ssd selftests nearly nonstop. (I insert a 30-second pause between 19.5-minute selftests as a precaution, just in case the ssd's health depends on occasional FTL write bursts.) My logs show that the FTL write bursts occur only during the pauses between selftests, presumably because an FTL write burst is a lower priority process than a selftest. Selftests appear not to slow the ssd performance, presumably because a selftest is a lower priority process than host reads and writes. The only known downside is that the ssd appears to consume about 1 watt extra while running a selftest. The selftests raise the ssd temperature by a few degrees Celsius and keep the ssd temperature more stable."

UPDATE (2020-10-20): I sent an email to a Smartmontools developers mailing list mentioned in their website's Help tab. My email contains an expanded and edited version of the above and a description of the problem I had with their email address verification system. An automatic reply said my email is being held for review by a moderator since I'm not a member of the mailing list. As yet, no response from the moderator.
 