Question: Crucial MX500 500GB SATA SSD Remaining Life decreasing fast despite few bytes being written


danyulc

Your calculations of erase cycles appear to neglect write amplification. Did you look at the actual Average Block Erase Count in your 250GB ssd that has 94% remaining? Is that SMART attribute available?
Sadly the 940 EVO doesn't include any information about WAF or Average Block Erase Count. They wouldn't even give the press an estimated TBW endurance rating when the product was released as it was the first mainstream product to use TLC. The 840 also used TLC, but wasn't really a big seller like the 840 EVO.

I just based it off the writes made so far and the endurance remaining, which, as you said, is not correct, but it's all I have to go on. The "wear leveling count" attribute has a raw value of 62, its threshold is set to 0, and its current (normalized) value is 94, which is the health rating of the SSD according to HD Sentinel.
 

Lucretia19

Sadly the 940 EVO doesn't include any information about WAF or Average Block Erase Count.
I assume you mean 840 EVO, not 940 EVO.

Average Block Erase Count is an ssd SMART attribute that can be read by the host pc using SMART monitoring software, such as CrystalDiskInfo or Smartmontools, both of which are free. The attribute might have a different label depending on the software you use to display it. (Good software may adhere to the manufacturer's choices of labels.) It appears to be attribute 177 (which is B1 in base 16) and is labeled "Wear Leveling Count" at https://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/3 and at https://www.smartmontools.org/ticket/692

WAF is calculated using two other values: (1) NAND pages written by the host pc, and (2) NAND pages written by the ssd's internal controller. The formula is (#2 / #1) + 1, which is equivalent to total NAND pages written divided by host pages written. But I don't see either of these values in the 840 EVO's SMART output. Perhaps they're available in "extended" SMART output. Assuming not, here's a well-written article that seems relevant to how to do an alternative calculation: Request for Samsung 840 EVO SSD owners (write amplification calculation)
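
For a drive that does expose those two counters, such as the MX500 in this thread (where they appear to be SMART attributes 247 and 248, a.k.a. F7 and F8), here's a rough sketch of the calculation using smartmontools. The device path /dev/sda is just a placeholder, and the parsing assumes the raw value is a plain number in the last column of "smartctl -A" output:

Code:
import subprocess

def read_attr_raw(device, attr_id):
    """Return the raw value of one SMART attribute, parsed from 'smartctl -A'."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == str(attr_id):
            return int(fields[-1])        # raw value is the last column
    raise ValueError(f"attribute {attr_id} not reported by {device}")

def waf(device="/dev/sda"):
    host_pages = read_attr_raw(device, 247)   # F7: NAND pages written by the host
    ftl_pages = read_attr_raw(device, 248)    # F8: NAND pages written by the controller
    return (ftl_pages / host_pages) + 1       # same formula as above

print(f"WAF so far: {waf():.3f}")

The same helper can also read attribute 177 (Wear Leveling Count / Average Block Erase Count) on drives that report it.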

Here's an interesting article about a problem with the 840 EVO, which might also be relevant to the MX500:
Samsung 840 EVO - how to easily lose your data forever
https://forum.acelaboratory.com/viewtopic.php?t=8735
That article claims a firmware "fix" by Samsung causes a lot of extra NAND writing in order to reduce the risk of data loss. If true, maybe the MX500 firmware does the same "fix"... could it be the cause of the high WAF?
 

danyulc

Yes, I meant the 840 EVO, my apologies.

I know that when they initially released the updated firmware, it was to fix issues with data that had been on the SSD for a long time reading back at very slow speeds. The program SSD Read Tester was written specifically because of the 840 EVO issues; it was meant to test the disks and see the impact of read performance degradation over time. I believe it was an issue with the voltage levels of the TLC memory. Eventually Samsung managed to fix the problem and released a firmware update. In addition to updating the firmware, you were supposed to run a utility which would re-write all the data on the drive to 'refresh' it.

People were concerned that Samsung's newest firmware would just re-write the data to the drive over and over to keep it fresh, which would of course lower the drive's overall endurance. It appears they actually fixed the issue by adjusting the voltage levels of the cells when the data is written to the drive. I last ran SSD Read Tester on 9-26-2021. Here is an image of the results I received.



When I look at the text-based output, 35 of the 2,313 entries in total read below 400 MB/s, and 107 are under 450 MB/s. All of those files are smaller in size.

Most of the files on there are 858 days old and read at full speed.

The 840 EVO firmware is incredibly generous with its "life remaining" percentage, possibly because of the increased scrutiny the drive received after the firmware update and the ensuing concerns about reduced endurance. They would likely have done anything to end the 840 EVO saga.

The 840 EVO never had a TBW value published. The 850 EVO did have endurance specs published and the 250GB version is rated for 75TBW. TechReport ran an endurance test on the 840 EVO 250GB and managed to write something like 900+TB before the drive completely failed and stopped functioning.

Judging from the TechReport article on SSD endurance, most of the drives tested are capable of handling far more writes than the specs state. There's no telling if that holds for newer drives, as I don't think a similar test has been performed. Obviously, once you pass the specified TBW the drive is no longer covered under warranty. My 840 EVO has almost 7 years of power-on time.

I'm going to assume the TBW for my 840 EVO is 50TB at the absolute most. In all likelihood I'll never even get anywhere near that amount of data written. Based on my past usage, it's much more likely the drive will fail before the NAND begins wearing out. In the last 2 years the drive has seen 1TB of writes; in total I wrote 14TB in just under 7 years.

I use the 840 EVO as a scratch drive, so anything on there is either backed up elsewhere or it isn't a problem if I lose it. I did the math for the WAF: following the steps suggested in the link, my WAF is 1.05049. That seems low, but the raw wear leveling count value on my drive is only 62, so 62 x 250 = 15,500 and 15,500 / 14,755 = 1.050491358861403. I double checked with CrystalDiskInfo to make sure the numbers were the same. Below are the SMART stats for the 840 EVO in decimal format. The 11 CRC errors were from a bad cable that has long since been replaced.
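
For anyone checking the arithmetic, the estimate works out like this (the only assumption is that the raw wear leveling count approximates average erase cycles per block):

Code:
erase_cycles = 62      # raw Wear Leveling Count (assumed to be average block erase cycles)
capacity_gb = 250      # user capacity of the 840 EVO in GB
host_gb = 14755        # GB written by the host so far

nand_gb = erase_cycles * capacity_gb   # rough GB physically written to NAND: 15500
waf = nand_gb / host_gb                # 15500 / 14755
print(round(waf, 5))                   # 1.05049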



Regarding my old 3TB Seagate HDD, I had a 5-year extended warranty that I bought for $10 back in 2017. I finally got a shipping label for it, and the e-mail said that once they receive it I'll get a gift card for the purchase price. That being said, it wasn't easy to get the shipping label; I had to jump through some hoops. It was the usual effort to wear the customer down and avoid a warranty claim.
 

chrysalis

Thank you for converting it to a table and calculating WAF.

I am running your script now for data collection; however, I have already decided to buy an EVO for the laptop as a long-term replacement. The drive adds circa 12-13C to the running temperature with continuous SMART checks, and I would rather have a drive that just performs as it should.

As for Crucial, I will tell them I would rather have a refund, which would also let them use the ssd I send back for testing to find their firmware problem. I will keep my other MX500 as a spare.

I will post some more data next week collected from your script.

This all points, to me, to an issue with either wear levelling or moving data from SLC to TLC. Their code probably has some kind of issue that shows itself with high uptime. I expect, as I think you have already posted, that the SMART checks prevent this maintenance from happening, which may or may not have other consequences down the line. I wonder what would happen if I did a speed test after a week of continuous SMART checks; would I be writing direct to TLC? Hmm.

The scary thought is that if this happens on NVMe, we wouldn't know, as the SMART data there doesn't show erase cycles.
 

dermoth

This seems to be a common theme for the MX500 series - OTOH I have plenty of other Crucial and Micron SSDs (the latter being an OEM laptop drive sharing the same controller/SMART attributes as the Crucial MX/MB series) that have no issues.

I have had all SMART attributes logged every day since I installed the MX500 SSD in my computer, and also for 4.5 years of the 7 years of the previous one (a Crucial M4, half the size, so running almost full until the switch). I remember doing some really heavy writes at the beginning of the M4's life. It doesn't have the same SMART attributes, but percent life used can be summed up as follows:

First 2.5 years: pct life used went from 0 to 26 or 27 (best guess based on the value after reinstalling). You can understand why I stopped doing heavy IO on my SSD.
Next 4.5 years: went from 27 to 35 (only an 8% increase!)

Then I switched to the MX500, same OS, same usage:

First year: pct life went up to 20%, lifetime WAF of 9.11
Next 9 months: pct life went from 20% to 46%, lifetime WAF of 12.78, but 17.79 for the 9-month period.
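
(In case it's useful, the per-period WAF is just the change in total NAND writes divided by the change in host writes between two snapshots; a quick sketch, with purely illustrative numbers rather than my actual counters:)

Code:
def period_waf(host_1, nand_1, host_2, nand_2):
    """WAF over the interval between two snapshots of the cumulative
    host-written and total NAND-written counters (same units for both)."""
    return (nand_2 - nand_1) / (host_2 - host_1)

# purely illustrative numbers, just to show the shape of the calculation
print(period_waf(host_1=1.0e9, nand_1=9.11e9, host_2=1.5e9, nand_2=18.0e9))  # ~17.8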

Another interesting comparison: an older BX300 SSD that is slightly smaller and has almost twice as many power-on hours, used in a laptop until that machine died, then in an old desktop. Although I don't have historical SMART data for that one, I matched it against the MX500 at the same wear point, based on host program count (attr 246), which turned out to be almost 3 weeks before this SSD's 1-year anniversary.

BX300 is at 3% pct lifetime used, MX500 is at 18%
BX300 WAF is at 1.4170 vs. 8.7385 for the MX500
BX300 has written 23.2B LBA sectors (512b), MX500 has written 12.2B

Another interesting metric: my MX500, currently at 48% life used, reports 11.4 TBW (converted from the LBAs, attr. 246). My son's Crucial P5 NVMe, about 6 months old, reports 16.3 TBW (guaranteed 300 TBW or 5 years). If that's accurate, then I can understand how some people ran through an MX500 in its first year.
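
(The TBW conversion from attribute 246 is just the sector count times 512 bytes; for example, using the BX300 figure above:)

Code:
lbas_written = 23_200_000_000           # attr 246 raw value: 23.2B LBA sectors of 512 bytes
tb_written = lbas_written * 512 / 1e12  # decimal terabytes written by the host
print(round(tb_written, 1))             # ~11.9 TB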

I contacted Crucial after 9 months, as I could see the drive was at best going to fail around the end of the 5-year warranty. They were trying to argue it was normal back then, so I contacted them a year later when the SSD was clearly going to fail much sooner than that. I had to insist to get it escalated and I'm still waiting for an answer, but tbh I'm tempted to just buy a bigger BX500 (they're running pretty cheap these days) and hopefully stop worrying about this one. Ugh... No. It appears the BX500 is much worse than the BX300, especially the relative write endurance of the 480GB one.
 

Lucretia19

[snip]
I am running your script now for data collection; however, I have already decided to buy an EVO for the laptop as a long-term replacement. The drive adds circa 12-13C to the running temperature with continuous SMART checks [selftests], and I would rather have a drive that just performs as it should.
By "the running temperature" I assume you mean the temperature of the ssd. My desktop pc presumably has much better air flow and cooling than your laptop, and my ssd's temperature rise due to selftests is about 5C.

Is there room in your laptop to add a heat sink onto the case of the ssd? I stuck an M.2 heat sink (about $7 at Amazon) onto the 250GB MX500 M.2 ssd in my laptop -- which doesn't need selftests because the laptop is usually off -- and the heat sink passively cools the ssd by about 5C. (I would have chosen a heat sink with taller cooling fins, but there's only about 1/2 inch of air space between the M.2 ssd and the inside of the laptop case bottom.) If your MX500 is a 2.5" ssd and not an M.2 ssd, you might be able to put a much wider, more effective heat sink on it than I did to my M.2 ssd.

[snip]
This all points, to me, to an issue with either wear levelling or moving data from SLC to TLC. Their code probably has some kind of issue that shows itself with high uptime. I expect, as I think you have already posted, that the SMART checks [selftests] prevent this maintenance from happening, which may or may not have other consequences down the line.
NOTE: I don't advise truly continuous selftests. My selftest regime aborts each selftest after 19.5 minutes and then pauses 30 seconds before launching the next selftest.

I'm not too concerned about negative consequences, because during most of the 30-second pauses I've seen no indication that accumulated, deferred maintenance operations are finally getting a brief opportunity to run: during most of the pauses the increase of F8 is tiny (not a write burst).
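
For anyone curious about the mechanics of the regime, here's a rough sketch of that kind of selftest cycle (not my actual script; it assumes smartmontools is installed and uses /dev/sda as a placeholder device path):

Code:
import subprocess
import time

DEVICE = "/dev/sda"      # placeholder; point this at the MX500
SELFTEST_MINUTES = 19.5  # let each extended selftest run this long
PAUSE_SECONDS = 30       # idle gap before launching the next selftest

def smartctl(*args):
    subprocess.run(["smartctl", *args, DEVICE], check=True,
                   stdout=subprocess.DEVNULL)

while True:
    smartctl("-t", "long")            # start an extended selftest
    time.sleep(SELFTEST_MINUTES * 60)
    smartctl("-X")                    # abort it after 19.5 minutes
    time.sleep(PAUSE_SECONDS)         # 30-second pause, then repeat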

I wonder what would happen if I did a speed test after a week of continuous SMART checks [selftests]; would I be writing direct to TLC? Hmm.
The selftest is lower priority than host pc reads & writes so selftests don't reduce short term performance. I don't know whether selftest is also lower priority than compacting to TLC mode any data that was written in high speed SLC mode. So the ssd write speed benchmark test that you suggested might be revealing. On the other hand, it could be very sensitive to the host pc write rate during the week of selftests... if the host pc wrote at a low rate, there won't be much SLC data that needs to be compacted to TLC.

Doesn't the MX500 dynamically allocate NAND to SLC write mode as needed? If the ssd has a lot of available space, it could take a long time (a lot of writing) to fill that space in SLC mode. Before the available space is filled, I wouldn't expect the ssd's write performance to suffer -- in other words, I wouldn't expect the ssd to write "direct to TLC" and make the host pc wait.
 

chrysalis

It allocates an SLC cache which all writes go to, but part of the background activity is moving data from the SLC to the TLC area. The selftests might be preventing that, and it might be that this is the buggy background activity; no idea.

I am doing the 30-second pauses as in your script, and it's a SATA SSD with a very tight fit.

These 500GB MX500s can hit almost the same speed direct to TLC anyway, so it wouldn't be much of a performance hit.
 

Lucretia19

It allocates an SLC cache which all writes go to, but part of the background activity is moving data from the SLC to the TLC area. The selftests might be preventing that, and it might be that this is the buggy background activity; no idea.
Below is the last 24 hours of one of my other logs. Each row corresponds to a 30-second pause between selftests, and it shows the duration of any write burst that occurred during the pause. Most of the pauses -- the rows that say "none" -- don't include a write burst. This is why I think the selftests don't have a negative side effect. In other words, I'm assuming that if unperformed internal maintenance (such as copying SLC to TLC) were accumulating, then the ssd would use most of the pauses to perform some of that maintenance.

The units of the three right-most columns are seconds.
Date      Time         PauseDuration  BurstStartOffset  BurstDuration
10/20/21  10:10:34.10  29             none
10/20/21  10:30:35.15  29             none
10/20/21  10:50:36.12  28             none
10/20/21  11:10:37.12  27             none
10/20/21  11:30:38.13  26             none
10/20/21  11:50:39.11  25             none
10/20/21  12:10:40.16  24             none
10/20/21  12:30:34.14  23             none
10/20/21  12:50:34.13  29             none
10/20/21  13:10:35.11  29             none
10/20/21  13:30:36.20  28             none
10/20/21  13:50:37.19  27             none
10/20/21  14:10:38.17  26             none
10/20/21  14:30:39.14  25             none
10/20/21  14:50:40.11  24             none
10/20/21  15:10:34.20  23             none
10/20/21  15:30:34.19  29             none
10/20/21  15:50:35.14  29             none
10/20/21  16:10:36.20  28             none
10/20/21  16:30:37.16  27             none
10/20/21  16:50:38.15  26             none
10/20/21  17:10:39.13  25             none
10/20/21  17:30:40.14  24             none
10/20/21  17:50:34.12  23             none
10/20/21  18:10:34.11  30             none
10/20/21  18:30:35.12  29             none
10/20/21  18:50:36.10  28             none
10/20/21  19:10:04.19  27             1                 5
10/20/21  19:30:38.18  26             none
10/20/21  19:50:39.13  25             none
10/20/21  20:10:40.18  24             none
10/20/21  20:30:34.18  23             none
10/20/21  20:50:34.14  29             none
10/20/21  21:10:35.14  29             none
10/20/21  21:30:36.15  28             none
10/20/21  21:50:04.15  27             1                 5
10/20/21  22:10:38.17  26             none
10/20/21  22:30:39.13  25             none
10/20/21  22:50:39.13  24             none
10/20/21  23:10:34.19  23             none
10/20/21  23:30:34.16  29             none
10/20/21  23:50:35.15  29             none
10/21/21  00:10:36.15  28             none
10/21/21  00:30:37.16  27             none
10/21/21  00:50:38.14  26             none
10/21/21  01:10:39.14  25             none
10/21/21  01:30:40.20  24             none
10/21/21  01:50:34.15  23             none
10/21/21  02:10:34.19  29             none
10/21/21  02:30:35.13  29             none
10/21/21  02:50:37.16  28             none
10/21/21  03:10:37.15  27             none
10/21/21  03:30:38.19  26             none
10/21/21  03:50:39.10  25             none
10/21/21  04:10:39.11  24             none
10/21/21  04:30:34.18  24             none
10/21/21  04:50:34.15  29             none
10/21/21  05:10:35.11  29             none
10/21/21  05:30:04.12  28             2                 5
10/21/21  05:50:37.19  27             none
10/21/21  06:10:38.17  26             none
10/21/21  06:30:05.19  25             0                 5
10/21/21  06:50:40.12  24             none
10/21/21  07:10:34.11  23             none
10/21/21  07:30:01.14  29             1                 4
10/21/21  07:50:36.13  29             none
10/21/21  08:10:36.20  28             none
10/21/21  08:30:37.16  27             none
10/21/21  08:50:05.17  26             1                 5
10/21/21  09:10:39.16  25             none
10/21/21  09:30:40.19  24             none
10/21/21  09:50:34.12  23             none
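
(If anyone wants to tally a log like this themselves, here's a quick sketch. It assumes the rows were saved as tab-separated text with the same column headers, which is just an assumption about the export format, not what my logging script actually writes:)

Code:
import csv

bursts = 0
pauses = 0
with open("selftest_pauses.tsv", newline="") as f:   # hypothetical export of the table above
    for row in csv.DictReader(f, delimiter="\t"):
        pauses += 1
        if row["BurstStartOffset"].strip().lower() != "none":
            bursts += 1

print(f"{bursts} of {pauses} pauses contained a write burst")   # 6 of 72 for this table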
 

chrysalis

I just read a post I think you made on Reddit; here is my reply to it. Tell me what you think.

The post is here.

https://www.reddit.com/r/unRAID/comments/gwighw/solution_to_i_can_not_recommend_crucial_ssds_for/fthxdfv/


My reply

You might be on to something with this second theory.

Remember the original 840 from Samsung, which I believe was their first-gen planar TLC? They overestimated the NAND's capabilities, and it resulted in data becoming unreadable only a few months after being written, so their eventual fix was to frequently refresh the data. That would have the same side effect as what we're seeing here: excessive internal writes.

As you said, pending sectors are caused by read errors that are not yet confirmed hardware errors. I have had one on a WD spindle before, which got cleared when the sector was written to.

The only issue I have, though, is that if selftests significantly slow down the frequency of these data refreshes, one would maybe expect the pending counter to sit at a non-zero value for much longer periods, since the corrective work is being prevented from running by the selftests. So I extend your theory: this background activity is perhaps also what detects the soft errors, by routinely checking whether data is still readable. Maybe if the error correction controller hits a certain workload, or if pending ever goes above 0, it triggers the refresh cycle; then it fully makes sense to me.
I note also that these drives are very cheap for what their market reputation puts them at; they had rave reviews everywhere, yet they seem to be constantly on sale. That is what attracted me to buying two in the first place. It almost seemed too good to be true, and now we know it is.

I have offered my MX500 to a reviewer to see if it can get media coverage; he hasn't made a decision yet. I pointed him to the Reddit thread and this thread.

I see you have this as a theory as well; I missed your post earlier, sorry.

https://forums.tomshardware.com/threads/crucial-mx500-500gb-sata-ssd-remaining-life-decreasing-fast-despite-few-bytes-being-written.3571220/post-22477904

The spare MX500, which I will keep, will probably be repurposed as a scratch drive for my video editing.
 

TRENDING THREADS