SSD partial failure after wiping partition table

ciab_ch

Commendable
Mar 21, 2016
4
0
1,510
Hello!

I manage a fleet of laptops and tablets for our company. As part of their upkeep, we routinely wipe the HDD/SSD and repartition it to install new operating systems via a scripted process. This is done in Windows PE using PowerShell scripts; we don't use Ghost or other cloning products.

I've been doing it the same way for a few years now, yet recently I've had a disturbing problem -- only with the SSD machines, not the ones still using HDDs.

The issue: After some number of repetitions of this process (we do this quarterly, so the number isn't high; maybe 10-20 times), I wipe the drive and repartition, then format the partitions. Typically the SSD will have a small partition of about 2 GB, with the rest allocated to a Windows OS. The small partition formats properly, but then the large one does not -- it may appear to be running but never complete, or it may freeze the whole system after a few minutes.

Details I've gathered over time:
  • Not SSD vendor specific -- this has happened with Intel drives but also with whatever's in the Surface Pro 3.
  • Not machine specific -- it's happened with Lenovo machines and with Surface Pro 3.
  • GPT or MBR, makes no difference.
  • UEFI or BIOS, makes no difference.
  • PowerShell, DiskPart, or cmd.exe "format," also makes no difference.

One thing I tried is to pull the drive, which is normally installed as the boot drive, and put it in a caddy, then install that into another machine as a secondary drive. Doing that allowed me to wipe it and repartition by hand. However, this is obviously not an option for the Surface Pro 3, as I cannot remove those drives.

Checking SMART info shows me no obvious red flags; for instance, Intel's Toolbox gave the drives a 100% good status.

I'm starting to think that SSDs have some kind of hard limit on the number of times they can be wiped and repartitioned, whereas I've seen no such problem with our HDD-based machines. I'm very concerned that eventually this will render all of our SSD-based machines unusable, which would have a huge impact on our business.

Has anyone faced this problem before? Is there some built-in feature or limitation that I'm running up against?

Thank you for any assistance.
 

popatim

Titan
Moderator
Have you tried 'not wiping' the SSDs?
Wiping is bad for an SSD; you should use Secure Erase or just delete and redefine the partitions and let garbage collection handle erasing the cells.
This is where I think you are hitting a wall. Garbage collection now basically has a whole drive to erase, which takes quite a while, yet you are already trying to store stuff on it. Just a guess without further investigating...
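
One way to test this guess, sketched under assumptions: as far as I know, a quick format on Windows 7 and later normally sends TRIM (delete notifications) for the whole volume, so temporarily turning those notifications off and retrying the format would show whether TRIM handling is involved. The commands below use the built-in fsutil tool; whether it is present and takes effect in a given Windows PE image is an assumption.

```powershell
# Query the current setting. "DisableDeleteNotify = 0" means TRIM notifications are enabled.
fsutil behavior query DisableDeleteNotify

# Diagnostic only: temporarily disable TRIM notifications, then retry the format.
fsutil behavior set DisableDeleteNotify 1

# ... run the usual partition and format steps here ...

# Re-enable TRIM notifications afterwards so the installed OS is not affected.
fsutil behavior set DisableDeleteNotify 0
```

If the format completes quickly with notifications disabled, that would point at the drive's handling of a large TRIM request rather than at the partitioning commands themselves.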
 

ciab_ch

OK, that's interesting! I understand the concept of "garbage collection" so that makes a bit of sense. (What doesn't make sense, at least not intuitively, is why it would work for a while but then stop working.)

To clarify, by "wipe" I do not mean doing a long format; I just issue a command to clear the partition table, then create new partitions.

If I recall correctly, Secure Erase is essentially zeroing all sectors of the drive, similar to a low-level format, and must be done using a special tool. I did this on a couple of drives while testing, and I had to use the Intel SSD Toolbox, so I wouldn't be able to script that. Also, it took a long time.

Now, I'm not opposed to performing an operation that takes a long time, provided it can be automated. For our purposes, we don't actually need to obliterate the data, just write a new partition table and then quick-format the new partitions.

Up until now, I've been using the equivalent of DiskPart's "clean" command, through PowerShell:

Code:
$WorkingDisk | Clear-Disk -RemoveData -RemoveOEM -Confirm:$false
$WorkingDisk | Initialize-Disk -PartitionStyle MBR
$NewPt = New-Partition -DiskNumber $HDD -AssignDriveLetter -Size $PtSize
...
$NewPt = New-Partition -DiskNumber $HDD -AssignDriveLetter -UseMaximumSize
...
Format-Volume -InputObject $NewVol -FileSystem NTFS ...

"Format-Volume" defaults to a quick format, and that's what I use. This is the point at which things seem to hang up.

If I want to avoid coding separate solutions for HDD vs SSD, I could instead get the existing partitions on a disk and remove them all:

Code:
$WorkingDisk | Get-Partition | Remove-Partition -Confirm:$false

I'm testing a new procedure using that method and will get back to you with the results. So far, it appears to be showing the same behavior, but I will keep my hands off for a while.
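
Since the format step is where things hang, a scripted run could at least detect the hang instead of waiting indefinitely. The following is a minimal sketch, not part of my current script: the function name Format-VolumeWithTimeout and the 600-second default are hypothetical, and it assumes background jobs (Start-Job) work in the Windows PE image being used.

```powershell
# Hypothetical helper: run a quick format in a background job and give up after a timeout.
function Format-VolumeWithTimeout {
    param(
        [Parameter(Mandatory)] [char] $DriveLetter,
        [int] $TimeoutSeconds = 600   # assumed limit; tune for your hardware
    )

    # Format-Volume defaults to a quick format; it runs in a separate process here.
    $job = Start-Job -ScriptBlock {
        param($Letter)
        Format-Volume -DriveLetter $Letter -FileSystem NTFS -Confirm:$false
    } -ArgumentList $DriveLetter

    # Wait-Job returns the job if it finished in time, or nothing on timeout.
    $finished = Wait-Job -Job $job -Timeout $TimeoutSeconds
    if ($finished) {
        $result = Receive-Job -Job $job
        Remove-Job -Job $job
        return $result
    }

    Write-Warning "Format of ${DriveLetter}: did not finish within $TimeoutSeconds seconds."
    # Stopping the job may itself block if the device is stuck; treat this as best effort.
    Stop-Job -Job $job
    Remove-Job -Job $job -Force
    return $null
}
```

A call such as `Format-VolumeWithTimeout -DriveLetter 'C'` would return the formatted volume object on success and nothing on timeout, so the script can log the failing machine and move on.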
 

ciab_ch

OK, I left it overnight attempting to format a ~200 GB partition. 14 hours and counting, still just sitting there. Is this considered expected behavior?
 

popatim

It's very strange indeed. I would have had you let the drive sit after deleting the partitions. This is what gives garbage collection the job of writing zeros to the whole drive. The lockup, I'm thinking, is caused when the write buffer overflows during garbage collection.

Being in laptops, the drives are all probably OEM with OEM firmware, so there's no official support from the real manufacturers, and you're not likely to get Lenovo to look into it unless a whole lot more people start complaining.

Is there any consistency among the drives? All TLC drives, perhaps?
Any problems reported by SMART, like reallocated or uncorrectable sectors? (Though I doubt SMART would report failed buffer sectors, SLC or DDR.)
Lastly, can you warranty the drives or are they too old?
 

ciab_ch

Regarding the firmware, yes, someone on the Intel board told me much the same thing. The SSDs are Intel but were shipped from Lenovo, so they have Lenovo firmware. I can probably return the drives, but I'm not happy with the idea of returning 40 drives. (I'm an IT guy; when there's a problem, I want to know what's going on and how it can be fixed.)

I would be content to say "Oh, must be crap hardware, better not buy Intel next time..." -- except that I now have the same problem with Surface Pro 3 tablets. Those don't have an SSD that I can remove, so we would have to replace the whole unit.

To me, "same problem, different hardware" makes it less likely that the problem is with the hardware. Therefore I wonder if this is some problem that exists for all SSDs, or something that I'm doing wrong.

Also: Remember that these same units worked fine with this same procedure, weeks or months ago. Now, we have mass failures. My method hasn't changed, so is this something that breaks the drives over time?

I will get back to you on the SMART status. Previously when I checked, I saw nothing. Can you recommend a tool that will let me dump the results to a file so I can paste it up here?
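
In the meantime, one option I may try from PowerShell itself, sketched under assumptions: the Storage module's Get-PhysicalDisk and Get-StorageReliabilityCounter cmdlets can write health counters to a text file. These are not the raw SMART attribute table, some drives report only a few of the fields, and the output path below is a hypothetical placeholder.

```powershell
# Assumed output path; change to a writable location that survives a reboot (e.g. a USB stick).
$ReportPath = 'D:\smart-report.txt'

Get-PhysicalDisk | ForEach-Object {
    # One header line per disk, then every reliability counter the drive exposes.
    "==== $($_.FriendlyName) (Media: $($_.MediaType), Health: $($_.HealthStatus)) ===="
    $_ | Get-StorageReliabilityCounter | Format-List * | Out-String
} | Out-File -FilePath $ReportPath -Encoding ascii
```

For the full attribute table (reallocated sectors and so on), a dedicated tool such as smartctl from smartmontools, with its output redirected to a file, is probably the better fit.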