I've been using "nearly nonstop" extended selftests to try to mitigate a bug in the firmware of the Crucial MX500 ssd that causes excessive write amplification: Before I began running the selftests, the Write Amplification Factor (WAF) averaged about 40 during a period of many weeks. With extended selftests running 19.5 minutes of every 20 minutes, WAF has been averaging about 3 since I began the trial on March 1st. (Smaller WAF is better, all else being equal, because the lifespan of an ssd is inversely proportional to WAF.) My theory is that the selftests keep the ssd busy reading itself so that the ssd gives much less runtime to its buggy lower-priority background process (perhaps Static Wear Leveling and/or Garbage Collection).
The reason why I don't run the selftests completely nonstop is that I don't know whether that would be okay for the ssd's health. In other words, nonstop selftests might not allow the ssd background processes enough runtime. That's why I'm providing 30 seconds of "idle" time out of every 20 minutes, even though nonstop selftests caused WAF to average about 1.5.
To be specific, I'm using Windows CMD.exe to run an infinite-loop .bat file that performs the following steps on each pass: (1) "smartctl.exe -t long -t force /dev/sda" to start an extended selftest, (2) "timeout 1170" to pause the .bat file for 19.5 minutes, (3) "smartctl -X" to abort the selftest, and (4) "timeout 30" to pause for 30 seconds. (Note: an extended selftest allowed to run to completion, rather than aborted, would take about 25 minutes on my 500 GB Crucial MX500 ssd.)
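For reference, a minimal sketch of that loop looks like the following (this is only an outline, and it assumes smartctl.exe is on the PATH and that /dev/sda is the correct device name on this pc):

    @echo off
    rem keep an extended selftest running for 19.5 of every 20 minutes
    :loop
    rem (1) start an extended selftest, forcing a new one even if a test is already running
    smartctl.exe -t long -t force /dev/sda
    rem (2) let the selftest run for 19.5 minutes (1170 seconds)
    timeout /t 1170 /nobreak
    rem (3) abort the selftest
    smartctl.exe -X /dev/sda
    rem (4) give the drive 30 seconds of idle time for its background processes
    timeout /t 30 /nobreak
    goto loop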
I'm starting this thread because I'm concerned that the periodic aborting of the selftests might cause the physically highest-numbered sectors of the ssd to NEVER be selftested or read. In other words, each loop might selftest within the same subset of physical sectors, and the periodic abort commands might prevent the ssd from ever selftesting the physical sectors outside that subset. I don't know whether it's safe for the long-term health of the ssd, and the data written on it, if some sectors are never selftested.
So my questions are:
Q1. Does the periodic aborting of the extended selftests indeed cause some sectors to NEVER be selftested?
Q2. If the answer to Q1 is Yes, is that unsafe for the ssd's health, or does it have any other negative consequences or side effects?
Q3. If the answers to Q1 and Q2 are Yes, is there some way to ensure all sectors will (eventually or periodically) be selftested, while also ensuring that a selftest is running for 19.5 minutes of every 20, and without requiring the pc to frequently read SMART data to check whether a selftest is in progress?
The three goals in Q3 seem to conflict. If I use the "smartctl -t select,N-max" command to selftest a subset of sectors that includes all of the high sectors, I assume the selftest will terminate after it reaches the highest sector but before 19.5 minutes. (Or the pc will abort it at 19.5 minutes before it reaches the highest sector, if the starting point N is chosen too small.) If the selftest terminates after it reaches the highest sector, a new selftest would need to be immediately restarted and allowed to run for the remainder of the 19.5 minutes. Unless there's a way to command the ssd to automatically start another selftest after a selftest concludes, the pc would need to intervene to immediately start the next selftest, and I assume this would require the pc to frequently (every second or so) read SMART data to check whether a selftest is in progress.
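For illustration, the polling fallback described above might look roughly like the following (untested, and the starting LBA 900000000 is only a placeholder for "a range covering the high sectors" of my 500 GB drive):

    rem hypothetical polling sketch: selftest the high sectors, poll once per second,
    rem then immediately start an extended selftest when the selective test finishes
    smartctl.exe -t select,900000000-max /dev/sda
    :poll
    timeout /t 1 /nobreak >nul
    smartctl.exe -c /dev/sda | findstr /C:"in progress" >nul
    if %errorlevel%==0 goto poll
    smartctl.exe -t long -t force /dev/sda

The once-per-second "smartctl -c | findstr" check is exactly the frequent SMART reading I'd prefer to avoid.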
The smartctl documentation doesn't say what happens with the "smartctl -t select,N+size" command if N+size exceeds the highest LBA of the ssd. If the selftest would immediately wrap around to LBA 0 and continue selftesting, I think I could construct a solution. But my hunch is that the selftest would terminate (before 19.5 minutes) when it reaches the max LBA. Does the SMART spec specify whether a "select,N+size" selftest must wrap around or terminate after it reaches the highest LBA?
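If nobody knows the answer, I suppose I could test it empirically with something like the following (placeholder numbers, chosen so the span runs past the last LBA of a 500 GB drive) and then see what span and status the drive reports:

    rem start a selective selftest whose span extends past the last LBA (placeholder values)
    smartctl.exe -t select,970000000+20000000 /dev/sda
    rem later, check the reported spans and selftest status in the selective selftest log
    smartctl.exe -l selective /dev/sda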
I suspect the reason why other people haven't noticed the MX500 WAF bug is that my pc writes to the ssd much less than a typical pc does. It writes only about 100 kBytes/second, on average. For some reason, WAF isn't as large when the pc writes a lot more to the ssd. A small rate of writing may be a use case that the ssd firmware engineers didn't test. Or perhaps the engineers intentionally made write amplification unnecessarily high for small rates of writing, in order to make the ssd Remaining Life reach 0% soon after its warranty expires... planned obsolescence.
I should also mention that the high WAF problem accelerated before I began the selftests. The ssd's Remaining Life reached 99% on Aug 31, a few weeks after the new ssd was installed in the new pc, after the pc had written 1,772 GB to the ssd. Remaining Life reached 95% on Dec 23, after the pc had written a total of 5,782 GB to the ssd. The pc wrote an additional 390 GB to reach 94% on Jan 15, and an additional 138 GB to reach 93% on Feb 4.
Two other interesting observations about the WAF problem:
1. The bug appears to manifest itself as occasional high-speed bursts that write multiples of about 37,000 NAND pages... multiples of about 1 GByte. (Most of the bursts are approximately 37,000 pages. Some are about 74,000 pages. I've seen a few as high as 5 x 37,000 pages.) I don't know whether ALL of those bursts are due to the bug, or whether some bursts are the result of proper design and the problem is simply that there are many more bursts than necessary. During a burst, the ssd takes nearly 5 seconds to write each 37,000 pages. When the ssd isn't writing bursts, its background processes write an average of about 15 pages per 5 seconds. Occasionally there are "mini-bursts" of about 2,400 pages; perhaps those are the result of a different background process. The huge bursts that are multiples of 37,000 pages occur only during the idle time between selftests. The mini-bursts can occur during selftests.
2. Other people have reported another Crucial MX500 bug that I've determined is correlated with the huge NAND write bursts: the "Current Pending Sector Count" SMART attribute sometimes mysteriously becomes 1, and goes back to zero after a few seconds. (The change to 1 triggers health alerts from the monitoring software that many people use, and the consensus among MX500 users and Smartmontools is that it's a firmware bug, even though Crucial denies it's a bug.) Smartmontools began calling the attribute "bogus" on MX500 ssds. By logging SMART data every 2 seconds (a rough sketch of such a logging loop appears below), I determined the correlation is perfect: the Pending Count becomes 1 at the start of each huge NAND write burst and goes back to 0 at the end of each burst. (I'm unsure whether the mini-bursts correlate with the Pending Count bug too, but I think not. By accident, during the hour after the switch to Daylight Saving Time on March 8, my logging app logged SMART data as fast as it could, without pausing. During that hour there were several mini-bursts, and the logged Pending Counts stayed at zero.) I don't know what the causal relation is underneath the correlation... whether the bogus change to 1 causes the huge write burst as a response, or whether the huge write burst causes the bogus change.
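My real 2-second logger is a separate app, but a rough .bat equivalent would look like this (smart_log.txt is just a hypothetical file name):

    rem append a timestamp and the full SMART attribute table every 2 seconds,
    rem so the NAND-page counters can be lined up against Current Pending Sector Count
    :log
    echo %date% %time% >> smart_log.txt
    smartctl.exe -A /dev/sda >> smart_log.txt
    timeout /t 2 /nobreak >nul
    goto log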