Why do HDDs and SSDs have so many issues?

Tommy Volt

Dec 19, 2013
Hi!

The majority of posts in the storage area of this forum are from people having trouble with their HDDs and SSDs. And it obviously happens to consumer drives as well as to enterprise drives.
What are the root causes?

All drives nowadays are able to detect and map out bad blocks/sectors during a WRITE. If a block becomes faulty afterwards - due to a weakness of the magnetic media or the Flash - this will only be detected during a READ of that block. The drive will then try to re-read the same block multiple times until it gives up and reports an error. The block will be mapped out.
A CRC checksum (cyclic redundancy check) is added to each block, so the drive controller can verify the block data for errors.
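To illustrate how a per-block checksum lets a controller notice a damaged block, here is a minimal sketch in Python. It assumes a 512-byte block and uses CRC-32 from the standard `zlib` module as a stand-in; the function names are hypothetical, and real drives use their own vendor-specific codes.

```python
import zlib

def store_block(data: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (stand-in for the per-block checksum)."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def verify_block(stored: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored one."""
    payload, checksum = stored[:-4], stored[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum

def flip_bit(stored: bytes, bit_index: int) -> bytes:
    """Simulate a single bit going bad on the media."""
    corrupted = bytearray(stored)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

block = store_block(b"A" * 512)   # hypothetical 512-byte sector
damaged = flip_bit(block, 100)    # one weak bit somewhere in the payload

print(verify_block(block))    # the intact block passes the check
print(verify_block(damaged))  # the damaged block fails it
```

Note that a checksum like this only detects the error. It cannot tell which bit is wrong, so it cannot repair the block.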
Magnetic media as used in HDDs is far more reliable than the NAND flash used in SSDs. NAND flash can often only withstand around 5000 rewrite cycles and tends to have lots of errors in its data bits. Additionally, NAND flash does not like 'sitting around holding data'. It suffers from "read disturb": blocks which are only read, but never get overwritten, have a high potential to lose their data.
This makes it important to count how often each block has been overwritten and to move the data around periodically (wear levelling). Additionally, SSDs need much stronger data redundancy. A simple CRC checksum as used on HDDs would not help much, because there are just too many bit errors. This is why NAND flash requires ECC (error correction code) parity data to be added to each block. This parity information is several bytes long, and with it the controller can correct 1, 4, 8 or more bad bits in a data block. NAND flash also has a number of spare blocks to replace bad ones while still keeping the capacity up to a certain level.
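The idea of counting rewrites per block and spreading them out can be sketched as follows. This is only a toy model under my own assumptions: the class name, the 5000-cycle limit as a default, and the "always pick the least-worn block" policy are illustrative, not how any particular SSD controller works.

```python
class WearLeveller:
    """Toy wear levelling: always write to the least-worn usable block."""

    def __init__(self, num_blocks: int, max_cycles: int = 5000):
        self.erase_counts = [0] * num_blocks  # rewrite counter per block
        self.max_cycles = max_cycles          # assumed endurance limit

    def pick_block(self) -> int:
        # Only blocks that have not reached the endurance limit are candidates.
        candidates = [i for i, count in enumerate(self.erase_counts)
                      if count < self.max_cycles]
        if not candidates:
            raise RuntimeError("all blocks are worn out")
        return min(candidates, key=lambda i: self.erase_counts[i])

    def write(self) -> int:
        block = self.pick_block()
        self.erase_counts[block] += 1
        return block

leveller = WearLeveller(num_blocks=4)
for _ in range(10):
    leveller.write()

print(leveller.erase_counts)  # the 10 writes are spread evenly over the 4 blocks
```

Without the counter, ten rewrites of the same logical address would all land on one physical block and wear it out long before the others.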

Worst case, your total available storage capacity will slowly shrink over the years, on both the HDD and the SSD. Most likely the capacity reduction will occur faster on the SSD, or even accelerate with the age of the drive, due to the very limited number of rewrites possible on NAND flash. I guess an HDD will have a better lifetime.

Also, an HDD does not suffer from read-disturb effects and less often encounters bit errors in old stored data. It also does not require the data to be moved periodically. This ultimately speaks for the performance of HDDs vs SSDs, especially when they get older.

Next we have the mechanics. I rarely hear about mechanical HDD failures, but they are still a risk and surely do happen from time to time. Everything that is mechanical can age. At least it is possible to save the data on the drive by repairing or replacing the mechanics. An exception is the famous "head crash", when a read head touches the media and scratches it. Then you're done with the drive, and recovering the data is extremely difficult.
SSDs have a clear benefit here, because there are no mechanics at all.

The logic and interface electronics could also fail. If this is the case, then replacing the controller board can solve the issue in most cases, or at least make it possible to recover the data. But this only works on HDDs, because on SSDs the controller board also carries the media, i.e. the NAND flash memory. They are soldered to the same circuit board, and desoldering the chips to replace them would be a doubtful method to try. Here the advantage is clearly on the HDD side, because the drive and the controller board are separate!

There are also two types of RAM in most drives. A very small amount of RAM is inside the drive's microcontroller chip (you could also call it the CPU of the drive) for fast execution of the firmware. Another, larger-capacity RAM can be found as a separate DRAM chip on the PCB, directly connected to the CPU. Let's call this larger DRAM the 'external memory'. This external DRAM is mainly used for write-buffering and caching the data transfer between the computer and the media, making the transfers smoother and faster. If you, for example, save a file to the drive, it goes into the DRAM as a write buffer first. Your computer already reports "Finished saving" while the drive is still writing the data from the write buffer to the media. When reading, the drive reads quickly into the cache and then transfers the data to the computer. If your computer tries to load the same file again, it does not need to be read from the drive media again, since it is still stored in the cache and will be loaded from there much faster.
The firmware of the drive is typically on a little EPROM or serial flash chip on the controller PCB. Upon starting the drive, the firmware is loaded from the EPROM partially into the CPU-internal RAM. But I assume that a big portion of the firmware is loaded into the external RAM, because the internal RAM is just too small. So the external RAM is not only for write-buffering and caching, but also holds some parts of the firmware.
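The write-buffer and read-cache behaviour described above can be modelled roughly like this. It is a deliberately simplified sketch: the class and attribute names are made up, the "media" is just a dictionary, and real firmware handles flushing, ordering and power loss in far more complicated ways.

```python
class BufferedDrive:
    """Toy model of a drive with a DRAM write buffer and read cache."""

    def __init__(self):
        self.media = {}          # slow storage: LBA -> data
        self.write_buffer = {}   # writes acknowledged but not yet on the media
        self.read_cache = {}     # recently read data kept in DRAM
        self.media_reads = 0     # counts how often the slow media was read

    def write(self, lba: int, data: bytes) -> str:
        self.write_buffer[lba] = data
        self.read_cache.pop(lba, None)   # a cached copy would now be stale
        return "Finished saving"         # reported before the media is written

    def flush(self) -> None:
        """Move everything from the write buffer onto the media."""
        self.media.update(self.write_buffer)
        self.write_buffer.clear()

    def read(self, lba: int) -> bytes:
        if lba in self.write_buffer:     # newest data may still be in the buffer
            return self.write_buffer[lba]
        if lba in self.read_cache:       # cache hit: no media access needed
            return self.read_cache[lba]
        self.media_reads += 1
        data = self.media[lba]
        self.read_cache[lba] = data
        return data

drive = BufferedDrive()
status = drive.write(7, b"my file")
still_only_in_dram = 7 not in drive.media   # the media has not been written yet
drive.flush()
first = drive.read(7)    # comes from the media
second = drive.read(7)   # comes from the cache
print(status, still_only_in_dram, drive.media_reads)
```

The window between the acknowledgement and the flush is exactly where a bit error in the DRAM would end up on the media unnoticed.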

To make a long story short: RAM can also have bit errors. You can see these problems when you look at your computer, your smartphone, routers or other electronics. I think all of us know that they sometimes malfunction or crash and then need a RESET. Servers use ECC error correction with 72-bit-wide RAM containing 64 data bits plus 8 parity bits. This is a big part of why servers crash so rarely and can run stably for years.
Bit failures in dynamic RAM (DRAM) occur far less often than in NAND flash. The technology of DRAM is very different from NAND flash. Without a refresh, the data in a DRAM cell is lost within milliseconds, which is why the DRAM must be refreshed periodically. RAM allows an almost infinite number of accesses and rewrite cycles. It has no read disturb and does not require wear levelling. DRAM errors are very random and typically happen only on single data bits. There are many root causes, such as weak cells, radiation from antennas and other electronic devices, or even the natural radiation surrounding us. A failing bit in the DRAM is not a repeatable defect, but an effect. The next bit error can appear at a totally different address of the RAM, which is why bad-block management would not work here. But ECC is extremely effective, as it corrects single-bit errors in DRAM on the fly.
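To show how ECC can repair a single flipped bit without knowing in advance where it will occur, here is a small sketch using the classic Hamming(7,4) code. This is a toy with 4 data bits and 3 parity bits, chosen only because it fits in a few lines; server memory uses a much wider code, and the function names are my own.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into 7 bits (positions 1..7, parity at 1, 2 and 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code_bits):
    """Return (corrected data bits, position of the repaired bit or 0)."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4   # the syndrome is the 1-based error position
    if error_pos:
        c[error_pos - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_pos

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1                         # a random single-bit flip at position 6
recovered, repaired_at = hamming74_decode(stored)
print(recovered, repaired_at)
```

The decoder does not care which address or which bit is hit next time, which is why this approach suits random DRAM errors where bad-block management does not.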

Unfortunately, the DRAM on the drives is just one single memory chip with a 16-bit data width, which does not make it possible to perform ECC error correction, as that would require extra parity bits on top of the 16-bit width. But many controllers of enterprise hard disks at least add a CRC checksum when writing the data to the DRAM, so some errors can at least be detected and then appear in the S.M.A.R.T. attributes under IOEDC End-to-End errors. This does not repair the corrupted file and unfortunately also does not tell us which one of the many files was affected, but at least we can see the counter ;-).
What if the error hits the drive's firmware, which is also in the RAM? Best case, the complete drive stops working until a RESET (which will re-load the firmware), but worst case the drive continues working erratically and messes up the directory or other data on the drive.
I already mentioned in another post that Intelligent Memory released new ECC DRAMs where the memory chip itself performs its own automatic error correction. This would be a perfect solution for HDDs and SSDs, but so far I have not seen any manufacturer adopting the technology.
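As a rough worked example of the extra width involved, the sketch below counts the check bits a Hamming-style code needs for single-error correction plus double-error detection (SECDED). It assumes that kind of code, which matches the 64 + 8 layout of server memory; the function name is hypothetical, and a drive vendor could of course choose a different scheme.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code protecting `data_bits` bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1
    to locate any single bad bit; one more overall parity bit is added
    so that double-bit errors can be detected as well.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

print(secded_check_bits(64))  # server DIMM: 64 data bits
print(secded_check_bits(16))  # single 16-bit DRAM chip as found on a drive
```

Under this assumption a 16-bit-wide DRAM would need 6 extra bits, i.e. a 22-bit-wide interface, which a single standard 16-bit chip cannot provide.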

Anyway, as you can see there are many things that can happen on a drive and for sure I am missing some of them here.
Let me know your thoughts!
 
AIUI, the typical failure mode for modern hard drives is read/write head degradation or total failure. In most cases there has been no sustained head crash, so the platters are fine. I suspect that the occasional "head slap" may be at least partly responsible for the failures.

The serial flash memory (aka "ROM") on the PCB contains a small part of the firmware, and is sometimes referred to as "BIOS". The ROM consists of boot code plus unique, drive specific, "adaptive" information. These "adaptives" include such things as preamplifier gains and head map. The ROM also points to the location of the System Area (SA) in a hidden section of the platters. The SA is sometimes referred to as the negative cylinders or maintenance cylinders or service area. It contains the bulk of the drive's firmware.

When the boot code is loaded into DRAM and executed, it retrieves additional firmware modules from the SA. These include a directory of all the modules plus the "loader", i.e. the main code overlay that is responsible for parsing ATA commands. The loader is a static module, i.e. its contents do not change. Other modules include the defect lists, SMART modules, CHS-to-LBA translator, and others. These are dynamic modules, i.e. they are periodically updated. Essentially the SA consists of an HDD operating system, a module directory, and a number of firmware "files". In other words, the HDD is a computer system of its own.

See the following thread:

newbie info, from and for newbies :) About firmware, SA, etc:
http://forum.hddguru.com/viewtopic.php?f=16&t=6562

When a head begins to degrade, some sectors become unreadable. The drive then updates the defect list. It also updates the SMART data and rebuilds the translator to account for the remapped sector. Unfortunately, this gives rise to a Catch-22 situation. How can the drive rewrite its SA if the head is unreliable? Moreover, if the head cannot read the firmware modules, then the drive cannot complete its power up sequence. When this happens, some drives will remain busy, others will report a generic model number or serial number, and still others may report a capacity of zero. In each of these cases the data are inaccessible.
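The interplay between the defect list and the translator can be pictured with a very small model. Everything here is hypothetical and heavily simplified (the class name, a flat sector numbering, spares placed after the user area); the real structures are vendor-specific and live in the System Area.

```python
class Translator:
    """Toy LBA-to-physical translator with a defect list and spare sectors."""

    def __init__(self, user_sectors: int, spare_sectors: int):
        self.user_sectors = user_sectors
        # Assumed layout: spare sectors sit directly after the user area.
        self.spares = list(range(user_sectors, user_sectors + spare_sectors))
        self.defect_list = {}   # LBA -> spare physical sector now used instead

    def mark_defective(self, lba: int) -> None:
        if lba in self.defect_list:
            return               # already remapped
        if not self.spares:
            raise RuntimeError("no spare sectors left")
        self.defect_list[lba] = self.spares.pop(0)

    def physical(self, lba: int) -> int:
        # Healthy sectors map to themselves, defective ones to their spare.
        return self.defect_list.get(lba, lba)

translator = Translator(user_sectors=100, spare_sectors=2)
translator.mark_defective(42)
print(translator.physical(41), translator.physical(42))
```

In this picture the Catch-22 is that `defect_list` itself has to be written back to the platters by the same head that is causing the defects.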

Often a drive can be repaired by using a copy of the damaged module on another platter, but special software is required for this. This is because HDD manufacturers use their own proprietary Vendor Specific ATA Commands for accessing the SA and ROM.
 
@fzabkar: Really well written article (yours and the one on the HDD Guru page).
It confirms my thinking that the DRAM on HDDs is not purely used for caching and write-buffering, but also to execute firmware code. And if the DRAM chip was ECC-protected with the new ECC DRAMs, it would reduce cases of data corruption caused by bit flips in the buffer and would also avoid bit flips in the executable code.

The System Area on a drive is surely heavily used. Bad-block management and wear levelling are not usable here, as this data has to be at the same place at all times to be able to boot. This is a critical point especially for SSDs, which makes me doubt that they can ever be reliably used in enterprise systems or other systems that need high availability.
 
Tommy:
>>The System Area on a drive is surely heavily used. Bad-block management and wear levelling are not usable here, as this data has to be at the same place at all times to be able to boot. This is a critical point especially for SSDs, which makes me doubt that they can ever be reliably used in enterprise systems or other systems that need high availability.>>

Do you think the latest Samsung SM863 3D V-NAND could help to manage those risks with DRAM on SSD?
http://www.tomshardware.co.uk/samsung-sm863-3d-v-nand-enterprise-ssd,review-33347.html

http://www.tomsitpro.com/articles/samsung-sm863-3d-vnand-enterprise-ssd-review,2-957.html

They show great QoS performance, too.