Question: Crucial MX500 500GB SATA SSD - Remaining Life decreasing fast despite only a few bytes being written to it?


Lucretia19

The Remaining Life (RL) of my Crucial MX500 ssd has been decreasing rapidly, even though the pc doesn't write much to it. Below is the log I began keeping after I noticed RL reached 95% after about 6 months of use.

Assuming RL truly depends on bytes written, the decrease in RL is accelerating and something is very wrong. The latest decrease in RL, from 94% to 93%, occurred after writing only 138 GB in 20 days.

(Note 1: After RL reached 95%, I took some steps to reduce "unnecessary" writes to the ssd by moving some frequently written files to a hard drive, for example the Firefox profile folder. That's why only 528 GB have been written to the ssd since Dec 23rd, even though the pc is set to Never Sleep and is always powered on.
Note 2: After the pc and ssd were about 2 months old, around September, I changed the pc's power profile so it would Never Sleep.
Note 3: The ssd still has a lot of free space; only 111 GB of its 500 GB capacity is occupied.
Note 4: Three different software utilities agree on the numbers: Crucial's Storage Executive, HWiNFO64, and CrystalDiskInfo.
Note 5: Storage Executive also shows that Total Bytes Written isn't much greater than Total Host Writes, implying write amplification hasn't been a significant factor.)

My understanding is that Remaining Life is supposed to depend on bytes written, but it looks more like the drive reports a value that depends mainly on its powered-on hours. Can someone explain what's happening? Am I misinterpreting the meaning of Remaining Life? Isn't it essentially a synonym for endurance?


Crucial MX500 500GB SSD in desktop pc since summer 2019

Date | Remaining Life | Total Host Writes (GB) | Host Writes (GB) Since Previous Drop
12/23/2019 | 95% | 5,782 |
01/15/2020 | 94% | 6,172 | 390
02/04/2020 | 93% | 6,310 | 138
 

Lucretia19

@fzabkar: I think your analysis of CrystalDiskInfo neglected the possibility of some unexpected interaction between it and the ssd's internal operations and scheduling. But I don't think CrystalDiskInfo can be responsible for the high WAF, since I tried disabling it for a couple of days.

Although I like the idea of cloning the ssd back and forth with a secure erase step to reset the FTL, to see whether that affects the subsequent WAF, it would require me to buy a new drive.
 

Lucretia19

The most recent experiment failed to reduce WAF; in fact, the most recent "daily WAF" increased to 90.58. My log is below. It now includes some additional columns, which I recently began tracking, that might be relevant.

The most recent experiment began around mid-day on Feb 12th, and included the following actions:
  1. Uninstalled Storage Executive.
  2. Reenabled CrystalDiskInfo.
  3. Disabled Bill2 Process Manager (which is a resident utility that can control process priorities and core affinities).
  4. Brief shutdown of pc.
  5. Didn't run Firefox. (Occasionally briefly ran MS Edge.)
  6. A Windows update installed on the evening of Feb 12th.

The host write rate has continued to average less than 0.1 MByte/second.

Here's my log... The three Daily WAFs in red are for the three days of the most recent experiment. As with the previous experiment, the Daily WAF for the first day was much lower, which makes me continue to wonder whether power-cycling the pc (including the ssd) has a short-term benefit and, if so, why that might be. Another observation is that the total host writes during the most recent day were the smallest yet (3 GB), corresponding to the largest-yet Daily WAF (and causing Average Block Erases to increment).

Date | Time | Total Host Reads (GB) | Total Host Writes (GB) | S.M.A.R.T. F7 | S.M.A.R.T. F8 | Average Block Erases | Power On Hours | WAF = 1 + F8/F7 | Total NAND Writes (GB) | ΔF7 (1 row) | ΔF8 (1 row) | Daily WAF = 1 + ΔF8/ΔF7
02/06/2020 | | | 6,323 | 219,805,860 | 1,229,734,020 | | | 6.59 | 41,698 | | |
02/07/2020 | | | 6,329 | 220,037,004 | 1,242,628,588 | | | 6.65 | 42,071 | 231,144 | 12,894,568 | 56.79
02/08/2020 | | | 6,334 | 220,297,938 | 1,252,694,764 | 108 | | 6.69 | 42,351 | 260,934 | 10,066,176 | 39.58
02/09/2020 | | | 6,342 | 220,575,966 | 1,269,273,190 | 109 | | 6.75 | 42,836 | 278,028 | 16,578,426 | 60.63
02/10/2020 | | | 6,351 | 220,857,490 | 1,272,080,434 | 109 | | 6.76 | 42,931 | 281,524 | 2,807,244 | 10.97
02/11/2020 | | | 6,357 | 221,087,760 | 1,280,283,705 | 109 | | 6.79 | 43,169 | 230,270 | 8,203,271 | 36.62
02/12/2020 | | | 6,365 | 221,357,482 | 1,294,326,214 | 110 | | 6.85 | 43,583 | 269,722 | 14,042,509 | 53.06
02/13/2020 | 11:35 | | 6,380 | 221,952,095 | 1,299,554,954 | 110 | 1,010 | 6.86 | 43,736 | 594,613 | 5,228,740 | 9.79
02/14/2020 | 13:29 | | 6,390 | 222,304,890 | 1,307,365,643 | 111 | 1,014 | 6.88 | 43,969 | 352,795 | 7,810,689 | 23.14
02/15/2020 | 07:42 | 1,669 | 6,393 | 222,449,794 | 1,320,346,398 | 112 | 1,021 | 6.94 | 44,339 | 144,904 | 12,980,755 | 90.58
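
To make the rightmost column concrete, here is the most recent row worked out from the two delta columns:

Daily WAF = 1 + ΔF8/ΔF7 = 1 + 12,980,755 / 144,904 ≈ 1 + 89.58 = 90.58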
 

USAFRet

Although I like the idea of cloning the ssd back and forth with a secure erase step to reset the FTL, to see whether that affects the subsequent WAF, it would require me to buy a new drive.
You could do it with an Image, rather than a full clone back and forth.
A Macrium Reflect Image off to some other drive.
Secure Erase the SSD.
Recover that Image back to the SSD.

Assuming, of course, you have enough free space on some other drive to hold an Image the size of your current data on the Crucial.
Doesn't have to be a blank drive, just enough 'free space'.
 

Lucretia19

@USAFRet: My backup drive does indeed have enough free space to hold the approximately 116 GB currently in use on the ssd. When I find time (unlikely during the next week or two) I'll need to learn how to run Macrium Reflect (or some other backup software such as Paragon Backup & Restore) from a bootable cd or usb drive, and how to secure erase.

By the way, my most recent data log (see above) undermines my theory that an unusually large number of ssd transitions between the normal power state and the low power state might be related to the high WAF (the guess being that each transition normally causes a NAND write, so an abnormally large number of transitions would cause an abnormally large amount of NAND writes). During the most recent 18 clock hours, Power On Hours increased by 7, versus an increase of only 4 during the previous 26 clock hours. That suggests fewer power-state transitions during the most recent 18 hours than during the previous 26, yet WAF was much larger during the most recent 18 hours.
 

USAFRet

Macrium Reflect. The free version will do.
Create a Rescue USB. You'll use this later.
Run the software, and create an Image of the SSD, all partitions.
This results in a single file off on some other drive.
Secure Erase the SSD.
Boot from that Rescue USB, and tell it where the Image is and which drive (the SSD) to recover it to.

Yes, this works. I've done it.
 

Lucretia19

Macrium Reflect. The free version will do.
Create a Rescue USB. You'll use this later.
Run the software, and create an Image of the SSD, all partitions.
This results in a single file off on some other drive.
Secure Erase the SSD.
Boot from that Rescue USB, and tell it where the Image is and which drive (the SSD) to recover it to.

Yes, thanks, I think I understand those steps. I just need to make sure I can boot okay from a usb flash drive (may need to enable it in BIOS) and learn how to secure erase. No need to spend your time explaining those details... I expect I can find the BIOS boot order setting if needed, and can learn how to secure erase by googling. The reason I don't expect to try the experiment soon is that I have a lot of work to do using my pc during the next week or two... this isn't a convenient time for the pc to be unavailable for hours.
 

Lucretia19

I found some more SMART snapshots for Crucial MX500 500GB ssds (and for many other drives) at: https://github.com/linuxhw/SMART

Five of the additional MX500 500GB ssds (URLs below) have enough Power On Hours to be worth looking at. A cursory look shows that a large ratio of POH to Power Cycle Count corresponds with "lifetime WAF" that's higher than normal (but not nearly as crazy high as the Daily WAF mine has reached).

Here's a compilation of SMART data from ten MX500 500GB drives. The first five are the new ones I found. The next three are the ones linked above by fzabkar. The next one is Charlie98's, which is missing the F7 & F8 data, so its WAF is an estimate based on total host writes and average block erase count. The last two rows are my drive, using the SMART values I logged when it reached 94% RL and the values from today.
SSD ID# | Power On Hours | Power Cycle Count | F7 Host Page Writes | F8 FTL Background Page Writes | Average Block Erase Count | Remaining Life % | POH/PCC | WAF = 1 + F8/F7
B9A1A3E78127 | 889 | 1067 | 263343469 | 210499276 | 56 | 97 | 0.83 | 1.80
58A732AD4741 | 1565 | 1738 | 100555421 | 141793361 | 18 | 99 | 0.90 | 2.41
D472FF24BE77 | 571 | 200 | 113059179 | 105747605 | 24 | 99 | 2.86 | 1.94
78DA41D1D402 | 5912 | 545 | 214838823 | 758396846 | 76 | 95 | 10.85 | 4.53
7E902D148339 | 1813 | 14 | 132850575 | 765980653 | 65 | 96 | 129.50 | 6.77
1828E148B537 | 13 | 14 | 1127008 | 46217 | 0 | 100 | 0.93 | 1.04
1813E134D584 | 3468 | 2058 | 735469391 | 3295350788 | 288 | 81 | 1.69 | 5.48
1826E1466CB5 | 85 | 115 | 12191194 | 10203030 | 2 | 100 | 0.74 | 1.84
Charlie98 | 1550 | 131 | | | 105 | 93 | 11.83 | ~25
Mine 2020-02-17 | 1029 | 99 | 223137063 | 1344405734 | 113 | 93 | 10.39 | 7.03
Mine 2020-01-15 | 883 | 94 | 214422794 | 959278784 | 90 | 94 | 9.39 | 5.47

The two rightmost columns show a strong hint that frequently turning off the computer (in other words, a high ssd Power Cycle Count) helps keep WAF low.

Yesterday I began a new experiment (uninstalled VMware Player, Oracle VirtualBox, and two apps installed in January just in case they're relevant). After uninstalling, I briefly shut off the pc (and the ssd). The "Daily WAF" since the experiment began was 9.46 (not great but better than usual). This is consistent with the theory that cycling the ssd power temporarily helps WAF. Each of my last three shutdowns has been followed by a (daily) WAF approximately 10 during the 24-ish hours after the shutdown, which then jumped considerably during the subsequent days. (I'm assuming the latest experiment will also fail... that the 9.46 will be followed by much higher WAFs over the next days. I'll continue to log data daily and will post it in a few days.)

For all the drives except mine, the data is just a one-time SMART snapshot, which unfortunately doesn't reveal whether their WAF has been increasing over time. For mine, the two snapshots show "lifetime WAF" of 5.47 on the day the ssd's Remaining Life reached 94%, and 7.03 today... the "lifetime WAF" has been gradually climbing because the Daily WAF has been huge during the past few weeks/months, consistent with the theory that WAF grows over time.

One of those drives has a fairly large "lifetime WAF" (5.48) despite also having a fairly low POH/PCC ratio (1.69). However, it also has the largest POH (3468), the largest Avg Block Erases (288) and the largest Total Host Writes, which is consistent with the theory that WAF grows over time, independent of the POH/PCC ratio... like my ssd has been doing.

Here are the five additional URLs:

https://github.com/linuxhw/SMART/blob/master/SSD/Crucial/CT500/CT500MX500SSD1/58A732AD4741

https://github.com/linuxhw/SMART/blob/master/SSD/Crucial/CT500/CT500MX500SSD1/78DA41D1D402

https://github.com/linuxhw/SMART/blob/master/SSD/Crucial/CT500/CT500MX500SSD1/7E902D148339

https://github.com/linuxhw/SMART/blob/master/SSD/Crucial/CT500/CT500MX500SSD1/B9A1A3E78127

https://github.com/linuxhw/SMART/blob/master/SSD/Crucial/CT500/CT500MX500SSD1/D472FF24BE77
 

Lucretia19

At this stage my advice would be: Resource Monitor, Performance Monitor, etc. See what's engaging the drive. You can track this.
What would be gained by investigating which host processes are writing to the ssd? The host write rate to my ssd has been VERY low since late December, when I moved frequently written files to the hard drive. For example, HWiNFO reports the host write rate to the ssd during the last 73 hours has averaged only 68 KBytes/second. Although other people's ssd issues can often be solved by finding out which software has been writing excessively, in my case excessive host writing isn't the problem.

As for power on hours: the drive will still go to sleep/idle and I do have some drives that only track active controller time. This inherently doesn't impact WAF though.
Are you absolutely certain that transitions between low power state and normal power state can't somehow cause additional NAND pages to be written? It seems to me that, in principle, a firmware bug could cause a NAND write after each transition from low power to normal. Such a bug might have gone unnoticed because most people have a much larger host write rate than I have. In other words, perhaps a high rate of writing (or reading) prevents most transitions to low power state, reducing the number of times the bug is triggered.

One of the experiments I've been considering is to move frequently written files back to the ssd from the hard drive, to increase the ssd write rate, to see whether that affects WAF. But perhaps a better experiment to try first would be to run a process that reads frequently, because if it successfully reduces WAF, reading would be a better solution than writing.
 

Lucretia19

I found an intriguing claim that may explain why a low rate of writing by the host pc causes very high WAF:
"Wear Leveling makes sure that all blocks receive approximately the same number of P/E cycles. The best wear leveling is of the static type, whereby data blocks circulate, even if rarely written to (static data is moved)."

That's an excerpt from a blog post at: https://www.delkin.com/blog/managing-errors-nand-flash/

If true, the implication is that it's not straightforward to increase an ssd's years of life by reducing the host write rate (by moving frequently written temporary files to a hard drive). "Life Consumed as a function of Host Writes" isn't necessarily monotonic; it may have a sweet spot.

Here's a similar excerpt from Wikipedia ( https://en.wikipedia.org/wiki/Wear_leveling#Static_wear_leveling ):
Static wear leveling works the same as dynamic wear leveling except the static blocks that do not change are periodically moved so that these low usage cells are able to be used by other data. This rotational effect enables an SSD to continue to operate until most of the blocks are near their end of life.

This is another reason to test the theory that WAF will be helped by running an app that reads frequently from the ssd (such as a virus scanner). It's plausible that reading reduces the rate at which the ssd can circulate blocks.

It seems to me the truly "best" wear leveling algorithm should depend on the host write rate... the algorithm when write rate is low shouldn't be the same as the algorithm when write rate is high or moderate. I see no value in "over-circulating" static data blocks when write rate is low... it seems counter-productive to heavily waste scarce NAND writes in order to maximize equality of P/E cycles. [EDIT:] I think it makes more sense to tolerate more inequality through most of the life of the ssd, and maybe near the end of its life restore the balance. Or perhaps the problem could be simply solved by having the controller perform static wear leveling less frequently... waiting until a reasonably large number of bytes (dozens of GB?) have been written by the host since the last time the static wear leveling process was run.
 
Last edited:

Lucretia19

I found an academic paper about wear leveling that seems relevant (although my guess is that it's more than 10 years old, based on the dates of its references): https://pdfs.semanticscholar.org/2665/66067b541a15f9553e3d5b41f5669ef65457.pdf

Here's an excerpt:
By carefully examining the existing wear leveling algorithms, we have made the following observations. First, one important aspect of using flash memory is to take advantage of hot and cold data. If hot data is being written repeatedly to a few blocks then those blocks may wear out sooner than the blocks that store cold data. Moreover, the need to increase the efficiency of garbage collection makes placement of hot and cold data very crucial. Second, a natural way to balance the wearing of all data blocks is to store hot data in less-worn blocks and cold data in most-worn blocks. Third, most of the existing algorithms focus too much on reducing the wearing difference of all blocks throughout the lifetime of flash memory. This tends to generate additional migrations of cold data to the most-worn blocks. The writes generated by this type of migration are considered as an overhead and may reduce the lifetime of flash memory. In fact, a good wear-leveling algorithm only needs to balance the wearing level of all blocks at the end of flash memory lifetime. In this paper, as our main contribution, we propose a novel wear leveling algorithm, named Rejuvenator. This new algorithm optimally performs stale cold data migration and also spreads out the wear evenly by natural hot and cold data allocation. It places hot data in less-worn blocks and cold data in the more-worn blocks. Storing hot data in less-worn blocks will allow the wearing level of these blocks to increase and catch up with the more-worn blocks.

(By "hot data" it means data that changes frequently.)

A slideshow by the same author is at: https://www.storageconference.us/2011/Presentations/Research/14.Murugan.pdf
It's 9 years old. Page 30 of the slideshow shows a graph that indicates the Rejuvenator algorithm reduces Wear Leveling Block Erase Overhead by a factor of 15 to 18. (In simulations.)

I found abstracts of more recent papers by various researchers that, taken together, suggest research into optimal wear leveling algorithms is still a hot topic without a consensus about what kind of algorithm is best. I assume manufacturers don't reveal much information about which algorithm(s) they use.
 

Lucretia19

Does anyone know of SMART monitoring software that's able to periodically append the SMART data to a log file?

I've recently started recording some SMART values multiple times per day to see whether WAF variations have a pattern that once-a-day logging won't reveal, and this has required me to manually type values displayed by HWiNFO & CrystalDiskInfo into a spreadsheet, multiple times per day. I'd prefer to automate as much of that recording as possible. It would become practical to record the values MANY times per day.

I googled to try to find SMART software capable of saving data to a file, and couldn't find even one winner.

I'd settle for a DOS-like cli program that can write to Standard Output, since I could set Windows Scheduler to periodically launch it and use ">> smartlog.txt" at the end of the command line to redirect its text output to (append to) a text file.

EDIT: I received an answer in the AnandTech forum. The smartctl.exe utility in SmartMonTools can 'print' SMART data to stdout, which I could redirect to append to a file as described above.

The smartctl utility can also launch the SMART self-test, which I believe is a read-only test of ssd sectors. I'm testing whether frequently running a self-test in the ssd will improve WAF by slowing the ssd's ability to background-write NAND pages. I suppose it will depend on which process is given a higher priority by the ssd: the self-test or the wear leveler.
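
Here's the kind of command I have in mind, as a rough untested sketch (it assumes smartctl.exe from SmartMonTools is on the PATH, that the ssd is drive C:, and that the log folder already exists; Windows Task Scheduler would launch it every few minutes):

@echo off
rem Append a timestamped SMART snapshot to a log file each time this .bat runs.
echo ==== %date% %time% ==== >> C:\logs\smartlog.txt
smartctl -a C: >> C:\logs\smartlog.txt
rem An extended (read-only) selftest could be launched the same way:  smartctl -t long C: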
 
Last edited:

Lucretia19

Problem solved? Using smartctl to command the ssd to run non-stop self-tests has reduced average WAF to 2.04 over the last 280 minutes. Here are the fourteen most recent 20 minute WAFs logged today:
1.32, 2.07, 1.96, 2.11, 2.22, 2.17, 2.30, 1.91, 2.37, 1.95, 1.68, 2.15, 2.25, 1.91

At this rate, it will take about 30 years before 180 TB have been written to NAND.
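
Roughly, the arithmetic behind that estimate (assuming the host keeps writing at the ~68 KBytes/second average mentioned earlier, and WAF stays near 2):

NAND write rate ≈ 0.068 MB/s x 2 ≈ 0.14 MB/s
Remaining NAND endurance ≈ 180 TB - ~45 TB already written ≈ 135 TB
135,000,000 MB / 0.14 MB/s ≈ 1,000,000,000 seconds ≈ 30 years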

My theory is that self-tests keep the ssd busy reading itself, so the static wear leveling process doesn't get much time to run.

I guess the questions now are about negative side effects:

1. Power consumption. Only the ssd does more work to run self-tests; the cpu and motherboard aren't involved. An increase of power consumption is obvious, but the 5 degree C rise in ssd temperature suggests the increase is small. I want to check that out, to estimate whether the increased cost of electricity over 5 years will be much less than the cost of a new ssd.

2. Speed? Eventually I'll benchmark read & write speed to see if self-tests reduce performance. There's reason to believe the ssd pauses the self-test whenever the host tries to read or write, so that performance won't be hurt at all. It's even possible that performance increases, since the ssd will never need to transition from low power state.

3. Endurance? The consequence of preventing the Static Wear Leveling process from running is unclear. Is it running enough? I don't know whether there's a way to see how much wear inequality accumulates over time; the Average Block Erase Count doesn't give a clue about inequality. (Also, it's possible that ABEC and Remaining Life will be inaccurate if the SWL process doesn't run as much as the firmware designers assumed it would run.) It might be prudent to revert to what I was doing before this morning -- starting a new self-test every 26 minutes or so, with each self-test taking about 25 minutes to complete -- in case SWL needs a little time to run and isn't getting enough with non-stop self-tests. Or maybe it would make sense to wait 10 years before reverting to 26 minutes. I plan to look at ABEC every few days for a while to make sure it grows reasonably slowly, since zero growth would be a strong hint that there's a problem, and a clear reason to try reverting to 26 minutes.
 

Lucretia19

WAF averaged 1.44 while running ssd selftests nonstop for nearly a day:
Date | Time | Total Host Writes (GB) | F7 | F8 | WAF
02/24/2020 | 12:55 | 6,456 | 224,844,172 | 1,408,951,022 |
02/25/2020 | 07:21 | 6,473 | 225,441,341 | 1,409,211,435 | 1.44

That WAF is better than I need. So to be conservative -- to give the ssd a little "idle" time to run low priority background processes in case they're necessary for ssd health and longevity -- I modified my "infinite loop ssd selftests" .bat file so that, each loop, it aborts the selftest and pauses ("idles") for a little while before starting the next loop (a rough sketch of the loop appears a bit further below). I've been testing values of selftest time and idle time with the goal of finding the maximum ratio of idle time to selftest time that doesn't blow up WAF excessively. I think I'd be satisfied if WAF averages 3 or 4, assuming I continue to write to the ssd at a low rate.

The two ssd background ("idle") processes that I'm aware of are:
  1. Static Wear Leveling
  2. The writing associated with Dynamic Write Acceleration, which uses TLC NAND in fast SLC mode as a write cache and later, during idle time, writes the data from the less dense "SLC" to the more dense TLC.
(There might be additional background processes I don't know about.)
Running the selftests nonstop might starve the background idle processes and cause the ssd to fail prematurely... that was my concern about endurance in my previous post.
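
For anyone curious, here's a simplified, untested sketch of the duty-cycle loop (the real .bat has more frills; this assumes smartctl.exe is on the PATH and the ssd is drive C:, and uses an 18.5-minutes-selftest / 1.5-minutes-idle cycle with a SMART snapshot appended to a log every 6 loops, i.e. about every 2 hours):

@echo off
rem Selftest duty-cycle loop: run an extended (read-only) selftest for most of each
rem 20-minute loop, then abort it and give the ssd a little idle time.
set /a SELFTEST_SECS=1110
set /a IDLE_SECS=90
set /a LOG_EVERY=6
set /a COUNT=0

:loop
set /a COUNT+=1
smartctl -t long C: >nul
timeout /t %SELFTEST_SECS% /nobreak >nul
rem Abort the selftest, then idle so the ssd's background processes can run a little.
smartctl -X C: >nul
timeout /t %IDLE_SECS% /nobreak >nul
set /a MODCOUNT=COUNT %% LOG_EVERY
if %MODCOUNT%==0 smartctl -a C: >> selftest_smart.log
goto loop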

My initial experiment with some idle time had a ratio of 1 minute of idle to each 19 minutes of selftest. WAF averaged 2.31 over a 24-hour period:
Date | Time | Total Host Writes (GB) | F7 | F8 | WAF
02/25/2020 | 08:30 | 6,473 | 225,450,621 | 1,409,224,585 |
02/26/2020 | 08:30 | 6,481 | 225,754,369 | 1,409,623,443 | 2.31

My next experiment was 2 minutes of idle to each 18 minutes of selftest. WAF averaged 5.63 over nearly a day:
Date | Time | Total Host Writes (GB) | F7 | F8 | WAF
02/26/2020 | 08:30 | 6,481 | 225,754,369 | 1,409,623,443 |
02/27/2020 | 08:05 | 6,488 | 226,012,071 | 1,410,817,654 | 5.63

The experiment that's currently running has 1.5 minutes of idle to each 18.5 minutes of selftest. It's been running about 13 hours, averaging 3.57 so far.

The .bat file also logs SMART data (using 'smartctl -a c: >>file.log' to append the data to a file). A third command-line parameter controls how often the data is logged; it's a number of loops, currently set to 6, so that 6 x 20 minutes = 2 hours. From each 2-hour snapshot, I've been pasting several SMART attributes (F7, F8, Average Block Erase Count, Power On Hours and Power Cycle Count) into a spreadsheet for analysis. The snapshots indicate WAF deviates a lot from its average over short periods (although some of the deviation might have been caused by brief stops and restarts of the .bat file as I tinkered with it).

All of these WAF results are preliminary. To produce more trustworthy results, I plan to eventually allow each experiment to run for a week or two, and without any interruptions caused by tinkering with the .bat file.
 

Lucretia19

I wrote another .bat file that monitors the ssd Write Amplification Factor over whatever interval of time I prefer. I've been running it set to 5-minute intervals, while the .bat file that controls the ssd selftests runs in a separate window.

Each loop of the selftest controller is currently set to last 20 minutes: 19.5 minutes of ssd extended selftest plus 30 seconds of ssd idle time.
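
In case it's useful, here's a simplified, untested sketch of what the WAF monitor does (the real .bat also reports host MBytes written; this version assumes smartctl.exe is on the PATH, the ssd is drive C:, and that attributes 247 (F7) and 248 (F8) have their raw values in the 10th field of 'smartctl -A' output, which may need adjusting for other smartctl versions):

@echo off
setlocal enabledelayedexpansion
rem WAF monitor: every INTERVAL_SECS, read the F7 (attribute 247) and F8 (attribute 248)
rem raw counters, compute their deltas, and append WAF = 1 + deltaF8/deltaF7 to a log.
set /a INTERVAL_SECS=300
set "PREV_F7="
set "PREV_F8="

:loop
for /f "tokens=10" %%v in ('smartctl -A C: ^| findstr /b /c:"247 "') do set F7=%%v
for /f "tokens=10" %%v in ('smartctl -A C: ^| findstr /b /c:"248 "') do set F8=%%v
if defined PREV_F7 (
    set /a DF7=F7-PREV_F7
    set /a DF8=F8-PREV_F8
    rem set /a is 32-bit integer math, so WAF is logged multiplied by 100.
    if !DF7! GTR 0 (set /a WAFx100=100+100*DF8/DF7) else (set "WAFx100=n/a")
    echo %date% %time%  dF7=!DF7!  dF8=!DF8!  WAFx100=!WAFx100! >> waf_monitor.log
)
set "PREV_F7=%F7%"
set "PREV_F8=%F8%"
timeout /t %INTERVAL_SECS% /nobreak >nul
goto loop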

By comparing the timestamps in the two programs' display streams, I observed the following:
  1. All of the 5-minute intervals that had very high WAF included the 30 seconds of idle time.
  2. In some of the 5-minute intervals that included the 30 seconds of idle time, WAF was low.
  3. WAF was about 2 during all 5-minute intervals that had no idle time, except once when WAF was about 6.

Observations 1 & 3, taken together, are strong evidence that selftests successfully limit the ssd background processes that amplify writes.

Observation 2 is (weak) evidence that the ssd block wear inequality isn't growing terribly with the "19.5 minutes of each 20" selftest duty cycle. In other words, my intuition is that it's safe for the ssd's health and data integrity if the selftest duty cycle is such that WAF is low during some of the idle intervals. If the intuition is true, then I would want to add a feature to the selftests controller program, to automatically dynamically adjust the duty cycle so that idle intervals occasionally have low WAF. This might be safer than my earlier idea, to automatically adjust the duty cycle to try to keep average WAF within a reasonable range.

I'm also thinking about shortening each selftest controller loop to make them much shorter than 20 minutes. One of the two ssd background processes that I'm aware of is the SLC-to-TLC delayed writes of the ssd's Dynamic Write Acceleration feature. (DWA uses TLC NAND operating in fast SLC mode as a write cache, and later copies the data to regular TLC NAND blocks. This is the process mentioned earlier by fzabkar.) I presume DWA writes to SLC occur only if the ssd's dram cache overflows. Perhaps reducing the length of each controller loop will reduce the number of times that the dram cache overflows? At some point I'll start experimenting with selftest loops of about 2 minutes, with a few seconds of idle time in each loop.

I'd appreciate people's thoughts about the above ideas.
 
Maxxify

I briefly cover the two algorithms used with wear-leveling here. The first, evenness-aware, is effectively what you mean by "dynamic" as it keeps data from getting too stale, although this is already done by garbage collection anyway (eventually) because of voltage drift. The second, dual-pool, covers what they stated with "Rejuvenator" in that it tracks both hot/cold data as well as young/old. Although I think you greatly overestimate the impact of this on WAF (yes, of course, unless it's a bug). Honestly you're well beyond being able to track this with software; you would need to JTAG/OpenOCD (e.g. DLP-USB1232H) to get more, since it's clearly not normal behavior.
 

Lucretia19

@Maxxify: Thanks for the link to your Reddit article... good enough that I bookmarked it!

My previous comment used the word "dynamic" in two places: (1) Crucial's Dynamic Write Acceleration feature (which uses TLC in fast SLC mode to handle write bursts from the pc), and (2) my ideas about a .bat file that will "dynamically" adjust the selftest duty cycle (either to maintain average WAF within a range, or so that some of the idle intervals will have low WAF). I don't understand why you think one of those two uses of "dynamic" corresponds to evenness-awareness.

Where you wrote that I overestimate the impact of "this" on WAF, which of the processes being discussed did you mean by "this?"

Where you wrote that "it" is not normal behavior did you mean by "it" the high WAF before I began running selftests, or the high WAF during "idle" times between selftests, or both, or something else?

Do you have a reason to believe that Crucial's firmware implements both the evenness-aware algorithm and the dual-pool algorithm? The little research I did on static wear leveling suggested to me that there isn't a consensus yet on the optimal algorithm.

If the firmware has a bug, I suspect it's related to the case where the host pc doesn't write much to the ssd. It may be a use case that the firmware engineers didn't test, or didn't test long enough to elicit the high WAF.

For concreteness, here's a sample of the display output of my WAF Monitor .bat file, with highlighting of the two WAFs that overlapped the 30 second ssd idle intervals (twenty minutes apart):
Total Host LBAs Written (last 8 digits) = 68419710
Total Host NAND Pages Written (last 8 digits) = 27195289
Total FTL Pages Written (last 8 digits) = 18657120
Mon 03/02/2020 11:51:59.96 Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 24.4 MB, deltaHostPages= 1027, deltaFTLPages= 785
WAF= 1.76
Mon 03/02/2020 11:57:00.59 Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 15.6 MB, deltaHostPages= 706, deltaFTLPages= 37952
WAF= 54.75
Mon 03/02/2020 12:02:00.64
Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 16.9 MB, deltaHostPages= 762, deltaFTLPages= 523
WAF= 1.68
Mon 03/02/2020 12:07:00.58 Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 10.7 MB, deltaHostPages= 491, deltaFTLPages= 646
WAF= 2.31
Mon 03/02/2020 12:12:00.63 Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 10.1 MB, deltaHostPages= 485, deltaFTLPages= 610
WAF= 2.25
Mon 03/02/2020 12:17:00.56 Waiting 300 seconds (ctrl-C aborts)...
____
deltaHostWrites= 14.4 MB, deltaHostPages= 649, deltaFTLPages= 3157
WAF= 5.86
Mon 03/02/2020 12:22:00.58
Waiting 300 seconds (ctrl-C aborts)...
 
Maxxify

I was talking about dynamic wear-leveling. And DWA (dynamic SLC) is only additive with regard to WA at a maximum of the amount of base cells (thus 3 with TLC). But nothing will spike WAF that high; it would require low-level diagnostics via JTAG to get anywhere.
 

Lucretia19

@Maxxify: WAF can presumably spike infinitely high over a time period that's very short. In particular, it's infinity during the time to write one amplified block (during which zero blocks from the host pc are written). So, how high can WAF spike in, say, a 30-second interval or a 5-minute interval? Couldn't the FTL controller have a large backlog of writes to handle, while the host pc is writing little during the interval, that would cause a large spike?

Should I infer that you're talking about my very high average WAF before I began running any ssd selftest loops (the high WAF that led me to start this forum thread), and you're not talking about the short term high WAF during the brief selftest loop idle times?

[EDIT] I should also have asked, WHY can't WAF spike that high?
 
Last edited:

Lucretia19

Assuming the known FTL background algorithms aren't buggy, is 5 GBytes (186,000 NAND pages) an unreasonably or impossibly large amount for the FTL controller to (occasionally) choose to write during a 30 second period of time, given a pc that averages only about 100 KBytes/second of writing to the ssd?

Here's why I'm asking: For the last 10 hours, every 30 seconds my pc has logged the changes of some ssd SMART data: Host MBytes Written, Host Pages Written, and FTL Pages Written (and the corresponding 30-second WAF). The log shows that occasionally there's a huge burst of FTL page writing, and each burst is approximately a multiple of 37,000 pages.

Here are the huge FTL page writing counts:
74573 at 21:23:12
186127 (actually the sum of two consecutive counts: 139963 at 21:43:12 and 46164 at 21:43:42)
37308 at 22:03:12
37138 at 22:23:12
37230 at 23:23:12
37269 at 23:43:12
74278 at 1:43:12
37271 at 2:23:12
74328 at 3:23:12
37254 at 3:43:12
37362 at 4:43:12
37306 at 5:03:12
(Each timestamp is the end of the 30-second interval.)

The time between FTL bursts is always a multiple of 20 minutes, which matches the ssd selftest loop time. In fact, each burst corresponds to a 30-second "idle" interval between 19.5 minute selftests.

The logging app isn't synchronized with the ssd selftest controller app, so there's an offset between the log's 30-second intervals and the selftest loop's 30-second idle intervals. For example, one of the idle intervals was from 21:42:55 to 21:43:25. Comparison of the timestamps shows that each idle interval overlaps 17 seconds of one log interval and 13 seconds of the next log interval. This probably explains why the 186127 FTL burst started during one log interval and finished during the next... both start and finish were probably within a single idle interval.

The offset implies each FTL burst took no more than 17 seconds to complete, except perhaps for the 186127 burst.

The 30-second log intervals that didn't include huge FTL counts had FTL counts that were MUCH smaller. Most FTL counts were less than 100. Only a few exceeded 100, and only one exceeded 152:
2517 at 1:17:42 (which was during a selftest, not during an idle period between selftests).

Each page is approximately 30,000 bytes. (Divide Host MBytes Written by Host Pages Written.) If we assume the 186127 FTL count took at most 30 seconds to complete, that's more than 5 GBytes in at most 30 seconds... an FTL NAND write speed of at least 175 MBytes/second. Speed benchmarks that I found by googling indicate the MX500 500GB ssd can sustain a write speed of about 400 MBytes/second from the host pc, so the 186127 burst doesn't seem impossibly fast. I assume the FTL reads about as much as it writes during the bursts, since it's presumably copying data.
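
Spelling out that arithmetic:

186,127 pages x ~30,000 bytes/page ≈ 5.6 GB
5.6 GB / 30 seconds ≈ 186 million bytes/second ≈ 178 MiB/second

which is consistent with the "at least 175 MBytes/second" figure above.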

(UPDATE: The log has continued to run, and I noticed two FTL counts of about 112000. Each was entirely within one log interval (not split like the 186127 burst), which implies each burst completed in less than 17 seconds. So these two bursts had write speeds of at least 188 MBytes/second.)

The 12 FTL bursts totaled about 20 GB during the 10 hours. The host wrote about 3.6 GB during the 10 hours.

I think there are two new questions: (1) Why do FTL controller background processes occasionally perform GBs of burst writes, and (2) why did those processes take advantage of only 12 of the 30 idle intervals that occurred during the 10 hours? (I might be able to gain some insight into #1 by turning off the selftests while logging, to see if the FTL still performs occasional bursts of GBs, or frequent smaller bursts.)

And there's still my other recent question: will high duty cycle ssd selftests interfere with the ssd background processes in a way that's unhealthy for the ssd, or healthy?

Here's an excerpt from the beginning of the log file; I imported it into a spreadsheet so I could paste it here as a table. I marked the first huge FTL count in red:
Day | Date | Time | WAF | Host MB Written | Host Pages Written | FTL Pages Written
Mon | 03/02/2020 | 21:09:12.45 | 1.64 | 1.7 | 101 | 65
Mon | 03/02/2020 | 21:09:42.39 | 1.9 | 0.7 | 54 | 49
Mon | 03/02/2020 | 21:10:12.44 | 1.81 | 1.2 | 79 | 64
Mon | 03/02/2020 | 21:10:42.36 | 1.85 | 0.6 | 54 | 46
Mon | 03/02/2020 | 21:11:12.35 | 1.4 | 5 | 210 | 85
Mon | 03/02/2020 | 21:11:42.40 | 1.45 | 3 | 129 | 59
Mon | 03/02/2020 | 21:12:12.51 | 1.88 | 2 | 101 | 89
Mon | 03/02/2020 | 21:12:42.42 | 2.6 | 1.1 | 66 | 70
Mon | 03/02/2020 | 21:13:12.48 | 1.12 | 11.8 | 415 | 50
Mon | 03/02/2020 | 21:13:42.36 | 2.8 | 0.7 | 45 | 49
Mon | 03/02/2020 | 21:14:12.38 | 1.91 | 1.5 | 79 | 72
Mon | 03/02/2020 | 21:14:42.39 | 1.58 | 2.4 | 129 | 76
Mon | 03/02/2020 | 21:15:12.43 | 1.37 | 3 | 132 | 50
Mon | 03/02/2020 | 21:15:42.44 | 1.44 | 3.1 | 138 | 62
Mon | 03/02/2020 | 21:16:12.38 | 1.84 | 1.9 | 106 | 90
Mon | 03/02/2020 | 21:16:42.43 | 1.62 | 1.5 | 82 | 51
Mon | 03/02/2020 | 21:17:12.43 | 1.9 | 1.4 | 80 | 72
Mon | 03/02/2020 | 21:17:42.37 | 1.73 | 2.3 | 123 | 91
Mon | 03/02/2020 | 21:18:12.44 | 1.83 | 1.7 | 95 | 79
Mon | 03/02/2020 | 21:18:42.38 | 1.88 | 0.9 | 62 | 55
Mon | 03/02/2020 | 21:19:12.40 | 2 | 0.7 | 48 | 48
Mon | 03/02/2020 | 21:19:42.43 | 1.8 | 1.4 | 82 | 66
Mon | 03/02/2020 | 21:20:12.43 | 1.4 | 4.8 | 197 | 80
Mon | 03/02/2020 | 21:20:42.45 | 2.9 | 1.3 | 76 | 83
Mon | 03/02/2020 | 21:21:12.42 | 1.77 | 1.1 | 67 | 52
Mon | 03/02/2020 | 21:21:42.39 | 1.66 | 1 | 72 | 48
Mon | 03/02/2020 | 21:22:12.41 | 1.61 | 1.1 | 65 | 40
Mon | 03/02/2020 | 21:22:42.44 | 1.82 | 1.3 | 81 | 67
Mon | 03/02/2020 | 21:23:12.36 | 761.94 | 1.5 | 98 | 74573
Mon | 03/02/2020 | 21:23:42.38 | 1.3 | 1.8 | 127 | 39
Mon | 03/02/2020 | 21:24:12.38 | 1.45 | 4.2 | 176 | 80
Mon | 03/02/2020 | 21:24:42.37 | 1.71 | 1.5 | 89 | 64
 
Last edited:

Lucretia19

In case anyone is interested in how fast the ssd's FTL controller can read from or write to NAND during its background processes (garbage collection, static wear leveling, etc)... I've been running my "Monitor the ssd WAF & some SMART data" .bat file, now set to log data every 5 seconds. (Also I've continued to run, in a separate window, my .bat file that runs ssd selftests, each with a duty cycle of 19.5 minutes of every 20.) The Monitor has been running long enough that it's logged a lot of the large FTL NAND Write bursts.

A few of the bursts lasted long enough that they started and finished within three consecutive 5-second intervals. Those that took three intervals can presumably be used to measure the speed of FTL NAND reading & writing, since the ssd was presumably reading & writing nonstop (or nearly nonstop) to NAND for the entire 5 seconds of the middle interval. The middle bursts ranged from 36,829 pages written to 40,712 pages written. The 40,712 burst corresponds to 233 MBytes/second written (40,712 pages x 30,000 bytes/page / 5 seconds). If we assume the background processes must read from NAND as much as they write to NAND, it implies a read speed and a write speed of twice the 233 MBs/second, or approximately 466 MBytes/second.

I enhanced the Monitor so it also logs to a second file whenever the number of FTL NAND pages written during the 5-second interval exceeds 1000. This means the second log contains only the bursts and thus is much smaller than the primary log. Here's the second log file (with blank rows manually inserted between intervals that were at least 20 minutes apart, and the middle burst counts highlighted in blue):
Day | Date | Time | WAF | Host MB Written | Host Pages Written | FTL Pages Written
Wed | 03/04/2020 | 14:03:01.38 | 715.42 | 0.67 | 52 | 37,150

Wed | 03/04/2020 | 14:23:01.34 | 4870.00 | 0.09 | 8 | 38,952
Wed | 03/04/2020 | 14:23:06.36 | 960.78 | 0.66 | 37 | 35,512

Wed | 03/04/2020 | 15:03:02.37 | 6186.33 | 0.06 | 6 | 37,112

Wed | 03/04/2020 | 16:23:02.38 | 1424.00 | 0.40 | 18 | 25,614
Wed | 03/04/2020 | 16:23:07.34 | 1424.25 | 0.39 | 27 | 38,428
Wed | 03/04/2020 | 16:23:12.38 | 426.54 | 0.29 | 24 | 10,213

Wed | 03/04/2020 | 16:43:02.48 | 2642.57 | 0.30 | 14 | 36,982

Wed | 03/04/2020 | 17:22:59.38 | 360.24 | 0.71 | 61 | 21,914
Wed | 03/04/2020 | 17:23:04.42 | 326.02 | 0.89 | 47 | 15,276

Wed | 03/04/2020 | 17:42:59.39 | 312.68 | 0.78 | 66 | 20,571
Wed | 03/04/2020 | 17:43:04.43 | 759.48 | 0.92 | 50 | 37,924
Wed | 03/04/2020 | 17:43:09.36 | 554.44 | 0.33 | 29 | 16,050

Wed | 03/04/2020 | 18:03:00.40 | 454.06 | 0.84 | 66 | 29,902
Wed | 03/04/2020 | 18:03:05.48 | 529.09 | 1.37 | 71 | 37,495
Wed | 03/04/2020 | 18:03:10.40 | 562.29 | 0.19 | 17 | 9,542

Wed | 03/04/2020 | 18:23:00.38 | 474.84 | 0.75 | 65 | 30,800
Wed | 03/04/2020 | 18:23:05.41 | 1486.88 | 0.42 | 26 | 38,633
Wed | 03/04/2020 | 18:23:10.35 | 232.09 | 0.33 | 22 | 5,084

Wed | 03/04/2020 | 18:43:00.36 | 472.31 | 0.52 | 41 | 19,324
Wed | 03/04/2020 | 18:43:05.36 | 292.48 | 1.42 | 62 | 18,072

Wed | 03/04/2020 | 19:23:00.43 | 582.44 | 0.40 | 36 | 20,932
Wed | 03/04/2020 | 19:23:05.33 | 561.93 | 0.49 | 29 | 16,267

Wed | 03/04/2020 | 20:23:00.35 | 441.75 | 0.83 | 69 | 30,412
Wed | 03/04/2020 | 20:23:05.38 | 220.03 | 0.51 | 31 | 6,790

Wed | 03/04/2020 | 21:03:00.45 | 588.73 | 0.62 | 49 | 28,799
Wed | 03/04/2020 | 21:03:05.35 | 272.58 | 0.48 | 31 | 8,419

Wed | 03/04/2020 | 21:23:00.43 | 786.26 | 0.48 | 41 | 32,196
Wed | 03/04/2020 | 21:23:05.35 | 183.25 | 0.46 | 28 | 5,103

Wed | 03/04/2020 | 21:43:00.42 | 782.15 | 0.58 | 38 | 29,684
Wed | 03/04/2020 | 21:43:05.34 | 155.06 | 1.10 | 48 | 7,395

Wed | 03/04/2020 | 23:03:00.32 | 448.37 | 0.86 | 67 | 29,974
Wed | 03/04/2020 | 23:03:05.36 | 224.65 | 0.52 | 32 | 7,157

Thu | 03/05/2020 | 0:03:00.42 | 468.92 | 0.85 | 65 | 30,415
Thu | 03/05/2020 | 0:03:05.34 | 282.83 | 0.36 | 24 | 6,764

Thu | 03/05/2020 | 0:43:00.37 | 518.90 | 0.57 | 43 | 22,270
Thu | 03/05/2020 | 0:43:05.44 | 752.61 | 0.98 | 49 | 36,829
Thu | 03/05/2020 | 0:43:10.38 | 767.95 | 0.23 | 20 | 15,339

Thu | 03/05/2020 | 1:03:00.41 | 2531.33 | 0.06 | 6 | 15,182
Thu | 03/05/2020 | 1:03:05.33 | 646.52 | 0.62 | 34 | 21,948

Thu | 03/05/2020 | 1:43:01.34 | 1237.80 | 0.50 | 30 | 37,104

Thu | 03/05/2020 | 2:22:58.42 | 188.82 | 0.39 | 34 | 6,386
Thu | 03/05/2020 | 2:23:03.33 | 1623.89 | 0.39 | 19 | 30,835

Thu | 03/05/2020 | 2:42:58.34 | 194.02 | 0.82 | 68 | 13,126
Thu | 03/05/2020 | 2:43:03.38 | 1198.41 | 0.55 | 34 | 40,712
Thu | 03/05/2020 | 2:43:08.41 | 233.27 | 2.37 | 90 | 20,905

Thu | 03/05/2020 | 3:02:58.39 | 126.01 | 2.00 | 105 | 13,127
Thu | 03/05/2020 | 3:03:03.33 | 332.31 | 1.73 | 73 | 24,186

Thu | 03/05/2020 | 4:02:58.37 | 216.67 | 0.70 | 61 | 13,156
Thu | 03/05/2020 | 4:03:03.37 | 1045.43 | 0.32 | 23 | 24,022

Thu | 03/05/2020 | 4:22:58.41 | 221.88 | 0.41 | 35 | 7,731
Thu | 03/05/2020 | 4:23:03.43 | 2675.27 | 0.16 | 11 | 29,417

Thu | 03/05/2020 | 4:42:58.34 | 223.50 | 0.72 | 62 | 13,795
Thu | 03/05/2020 | 4:43:03.37 | 1121.42 | 0.25 | 21 | 23,529

Thu | 03/05/2020 | 5:42:58.37 | 295.93 | 0.60 | 47 | 13,862
Thu | 03/05/2020 | 5:43:03.37 | 3890.00 | 0.06 | 6 | 23,334
Thu | 03/05/2020 | 5:50:18.37 | 66.89 | 0.43 | 38 | 2,504

Thu | 03/05/2020 | 6:42:58.45 | 232.59 | 0.94 | 64 | 14,822
Thu | 03/05/2020 | 6:43:03.37 | 2789.37 | 0.12 | 8 | 22,307

Thu | 03/05/2020 | 7:02:58.34 | 404.97 | 0.43 | 35 | 14,139
Thu | 03/05/2020 | 7:03:03.36 | 2878.00 | 0.11 | 8 | 23,016

Thu | 03/05/2020 | 7:43:02.41 | 1682.45 | 0.49 | 22 | 36,992

Thu | 03/05/2020 | 8:43:03.43 | 818.47 | 0.96 | 40 | 32,699
Thu | 03/05/2020 | 8:43:08.32 | 154.79 | 0.56 | 29 | 4,460

Once again, the burst sizes were approximately multiples of 37,000 pages: each of the logged bursts was approximately 37,000 pages or 74,000 pages. I assume the Monitor will eventually log one of the rare bursts that's a larger multiple of 37,000, which should occupy more than three consecutive 5-second intervals, and thus have more than one middle interval, and thus provide evidence whether the speed during a burst is fairly constant.

The log includes one anomaly (highlighted in red): the 2,504 "mini-burst" at 5:50am this morning. It's the only "burst" that occurred while a selftest was in progress. I don't know what caused it; the primary log indicates the host pc didn't write an unusual amount during the previous hours. But I think 2,504 is small enough not to be a concern.
 

Lucretia19

I have the beginning of a theory about the ssd firmware "bug," based on an observation about changes in the ssd's Write Amplification Factor during the currently running ssd selftests trial.

Below is the portion of my WAF spreadsheet that includes the currently running selftest trial, which began on 3/1/2020. (Only the relevant columns are shown here.) In this trial, selftests are run 19.5 minutes of each 20 (in an infinite loop). The Time and SMART values below were pasted into the spreadsheet from a logfile to which the selftest .bat app appends every 2 hours. The interesting thing is that the 2-hour WAFs have decreased noticeably during the last couple of days, compared to what they were during the first few days of the trial:
Date | Time | SMART F7 | SMART F8 | 2-hour change of F7 | 2-hour change of F8 | 2-hour WAF
03/01/2020 | 17:23 | 227,001,355 | 1,417,282,614 | | |
03/01/2020 | 19:23 | 227,031,773 | 1,417,409,200 | 30,418 | 126,586 | 5.16
03/01/2020 | 21:23 | 227,048,225 | 1,417,573,722 | 16,452 | 164,522 | 11.00
03/01/2020 | 23:23 | 227,066,301 | 1,417,890,851 | 18,076 | 317,129 | 18.54
03/02/2020 | 01:23 | 227,096,519 | 1,418,055,445 | 30,218 | 164,594 | 6.45
03/02/2020 | 03:23 | 227,108,579 | 1,418,181,339 | 12,060 | 125,894 | 11.44
03/02/2020 | 05:23 | 227,120,760 | 1,418,195,794 | 12,181 | 14,455 | 2.19
03/02/2020 | 07:23 | 227,151,839 | 1,418,285,175 | 31,079 | 89,381 | 3.88
03/02/2020 | 09:23 | 227,173,465 | 1,418,487,473 | 21,626 | 202,298 | 10.35
03/02/2020 | 11:23 | 227,189,564 | 1,418,652,362 | 16,099 | 164,889 | 11.24
03/02/2020 | 13:23 | 227,222,450 | 1,418,745,828 | 32,886 | 93,466 | 3.84
03/02/2020 | 15:23 | 227,238,915 | 1,418,761,562 | 16,465 | 15,734 | 1.96
03/02/2020 | 17:23 | 227,267,401 | 1,418,854,995 | 28,486 | 93,433 | 4.28
03/02/2020 | 19:23 | 227,304,346 | 1,419,020,568 | 36,945 | 165,573 | 5.48
03/02/2020 | 21:23 | 227,339,985 | 1,419,185,050 | 35,639 | 164,482 | 5.62
03/02/2020 | 23:23 | 227,360,183 | 1,419,496,230 | 20,198 | 311,180 | 16.41
03/03/2020 | 01:23 | 227,395,995 | 1,419,549,733 | 35,812 | 53,503 | 2.49
03/03/2020 | 03:23 | 227,416,739 | 1,419,748,783 | 20,744 | 199,050 | 10.60
03/03/2020 | 05:23 | 227,436,267 | 1,419,874,517 | 19,528 | 125,734 | 7.44
03/03/2020 | 07:23 | 227,479,043 | 1,419,925,740 | 42,776 | 51,223 | 2.20
03/03/2020 | 09:23 | 227,504,631 | 1,420,237,857 | 25,588 | 312,117 | 13.20
03/03/2020 | 11:23 | 227,542,479 | 1,420,366,483 | 37,848 | 128,626 | 4.40
03/03/2020 | 13:23 | 227,610,679 | 1,420,386,280 | 68,200 | 19,797 | 1.29
03/03/2020 | 15:23 | 227,665,284 | 1,420,403,408 | 54,605 | 17,128 | 1.31
03/03/2020 | 17:23 | 227,704,871 | 1,420,530,135 | 39,587 | 126,727 | 4.20
03/03/2020 | 19:23 | 227,743,690 | 1,420,583,950 | 38,819 | 53,815 | 2.39
03/03/2020 | 21:23 | 227,759,499 | 1,420,782,429 | 15,809 | 198,479 | 13.55
03/03/2020 | 23:23 | 227,774,130 | 1,421,017,866 | 14,631 | 235,437 | 17.09
03/04/2020 | 01:23 | 227,825,749 | 1,421,145,616 | 51,619 | 127,750 | 3.47
03/04/2020 | 03:23 | 227,883,641 | 1,421,238,023 | 57,892 | 92,407 | 2.60
03/04/2020 | 05:23 | 227,941,645 | 1,421,479,068 | 58,004 | 241,045 | 5.16
03/04/2020 | 07:23 | 228,020,275 | 1,421,571,280 | 78,630 | 92,212 | 2.17
03/04/2020 | 09:23 | 228,065,715 | 1,421,738,828 | 45,440 | 167,548 | 4.69
03/04/2020 | 11:23 | 228,127,025 | 1,421,906,384 | 61,310 | 167,556 | 3.73
03/04/2020 | 13:23 | 228,212,171 | 1,422,042,768 | 85,146 | 136,384 | 2.60
03/04/2020 | 15:23 | 228,286,797 | 1,422,212,234 | 74,626 | 169,466 | 3.27
03/04/2020 | 17:23 | 228,370,213 | 1,422,383,252 | 83,416 | 171,018 | 3.05
03/04/2020 | 19:23 | 228,434,135 | 1,422,703,351 | 63,922 | 320,099 | 6.01
03/04/2020 | 21:23 | 228,492,657 | 1,422,833,827 | 58,522 | 130,476 | 3.23
03/04/2020 | 23:23 | 228,550,875 | 1,422,926,289 | 58,218 | 92,462 | 2.59
03/05/2020 | 01:23 | 228,636,627 | 1,423,094,496 | 85,752 | 168,207 | 2.96
03/05/2020 | 03:23 | 228,709,240 | 1,423,302,087 | 72,613 | 207,591 | 3.86
03/05/2020 | 05:23 | 228,767,785 | 1,423,431,386 | 58,545 | 129,299 | 3.21
03/05/2020 | 07:23 | 228,863,539 | 1,423,565,200 | 95,754 | 133,814 | 2.40
03/05/2020 | 09:23 | 228,934,463 | 1,423,658,990 | 70,924 | 93,790 | 2.32
03/05/2020 | 11:23 | 229,006,067 | 1,423,679,531 | 71,604 | 20,541 | 1.29
03/05/2020 | 13:23 | 229,076,103 | 1,423,736,689 | 70,036 | 57,158 | 1.82
03/05/2020 | 15:23 | 229,146,293 | 1,423,794,184 | 70,190 | 57,495 | 1.82
03/05/2020 | 17:23 | 229,244,101 | 1,423,816,885 | 97,808 | 22,701 | 1.23
03/05/2020 | 19:23 | 229,346,625 | 1,423,874,656 | 102,524 | 57,771 | 1.56
03/05/2020 | 21:23 | 229,420,627 | 1,423,893,991 | 74,002 | 19,335 | 1.26
03/05/2020 | 23:23 | 229,496,215 | 1,423,952,754 | 75,588 | 58,763 | 1.78
03/06/2020 | 01:23 | 229,631,509 | 1,424,010,905 | 135,294 | 58,151 | 1.43
03/06/2020 | 03:23 | 229,702,017 | 1,424,031,696 | 70,508 | 20,791 | 1.29
03/06/2020 | 05:23 | 229,758,885 | 1,424,086,830 | 56,868 | 55,134 | 1.97
03/06/2020 | 07:23 | 229,851,637 | 1,424,144,508 | 92,752 | 57,678 | 1.62

During the early days of the trial, 3/01, 3/02 and 3/03, fifteen of the twenty-seven 2-hour WAFs exceeded 5, and nine of them exceeded 10. But since 3/04, all twenty-eight 2-hour WAFs have been less than 7, twenty-five of them have been less than 4, eighteen have been less than 3, and the most recent eleven have been less than 2. A nice downward trend.

The trend has suggested a theory: that the FTL amplifies not only Host Page Writes (a necessary evil), but also amplifies some FTL Page Writes (perhaps an unnecessary evil). Thus, by using ssd selftests to reduce the runtime of FTL background processes, there were fewer FTL Page Writes to amplify after a few days of the trial, so that in the latter days most of the amplification has been amplification of Host Page Writes.

One possible reason why it might amplify FTL Page Writes is if some of the FTL page writing increases inequality of block wear, rather than decreasing it. Would this imply a bug? Or perhaps a sub-optimal algorithm?

For typical users' workloads which write much more to the ssd than my pc does, the amplification of Host Page Writes is a large fraction of the total amplification, so the amplification of FTL Page Writes might go unnoticed. Nevertheless, amplification of FTL Page Writes could contribute to somewhat premature ssd death even for typical users. It would be right to call it "premature" if amplification of FTL Page Writes is unnecessary, or if the algorithm could be optimized so there isn't as much amplification.

Any thoughts about whether the theory is plausible?

The other conclusion to be drawn from the trend is that these selftest trials need to be allowed to run for many days to find valid long term results. I plan to let the current trial keep running for many more days; hopefully WAF will remain low.
 

fc5

(Still NOT Solved) current pending sector is 1 on new SSD Cache

Do you know this bug?
I noticed that SMART attribute F8 increased when this bug happened.
 

Lucretia19

(Still NOT Solved) current pending sector is 1 on new SSD Cache

Do you know this bug?
I noticed that SMART attribute F8 increased when this bug happened.

SMART F8 is always increasing during normal operation. Even during intervals as short as 5 seconds, my log shows that F8 increases, typically by dozens of page writes. I think an increasing F8 is only a sign of a problem (excessively high WAF) if F8 increases a lot compared to the increase of F7.

I see no relation between the "pending sector" warnings behavior described in that unraid forum thread, and the high WAF that my ssd has experienced.

EDIT: Please be careful about "hijacking" a topic. It violates the forum rules. I've decided to click "report" on that off-topic post, to see if the forum moderators agree.
 
Last edited:

Lucretia19

It seems like the suggestion is relevant to your original question/issue? Not necessarily that they are asking for help.

I see no suggestion in that post, nor any relevance to this thread's "excessively high Write Amplification" topic. Perhaps you misunderstood which bug s/he was referring to where s/he wrote about F8 increasing when "this" bug happened. I believe s/he meant the "pending sector" bug in the unraid forum thread, not the "high WAF" bug in our thread. If you're aware of any possible link between the two bugs, please let us know.

However, I can't rule out the possibility that the post was a genuine attempt to be helpful, rather than asking us to help with his/her "pending sector" bug. I don't want to discourage attempts to be helpful, so I have some regret about having reported it.
 
Last edited: