RAID 5 May Be Doomed in 2009

Status
Not open for further replies.
I have never read a more flawed article in my life. What a load of Bullsh*t.

The 10^14 figure is the drive's unrecoverable read error rate: roughly one error per 10^14 bits read. If you send the same read command for the same sector to the drive again, then in 10^14 - 1 cases out of 10^14 the error won't happen the next time. During the purported RAID 5 rebuild, the controller just has to read one sector from one drive again, and the rebuild completes perfectly, just like normal.
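For context, here is a rough sketch of the kind of arithmetic a "doomed" claim rests on. It assumes (and this is only an assumption) that every bit read has an independent 1-in-10^14 chance of an unrecoverable error, and it ignores retries entirely. The function name and the 12 TB example size are illustrative and not taken from the article.

```python
import math

# Spec-sheet figure quoted in this thread: one unrecoverable error per 10^14 bits.
URE_RATE_PER_BIT = 1e-14


def p_at_least_one_ure(bytes_read, rate_per_bit=URE_RATE_PER_BIT):
    """Chance of hitting at least one URE while reading `bytes_read` bytes.

    Assumes independent errors per bit and no retries (a simplification).
    """
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, written with log1p/expm1 so the tiny rate
    # is not lost to floating-point rounding.
    return -math.expm1(bits * math.log1p(-rate_per_bit))


if __name__ == "__main__":
    # Hypothetical example: reading 12 TB (decimal) during a rebuild.
    print(f"{p_at_least_one_ure(12e12):.3f}")
```

Under those assumptions the number comes out large, which is exactly why the retry behaviour argued above matters: the model only holds if a failed read is never attempted again.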

The author of this article obviously has no clue whatsoever how RAID arrays and controllers work, and is unqualified to write an article in this area. If he worked for me, this article would have never gone to print, and would have been rejected instantaneously.
 
Seriously, one day they say "we'll have 10 kajillion bit hard drives," the next they say "ohh no, the world will stop at 10TB." RAID 5 is freaking awesome! 12 TB of space is normal. Maybe those of you who work from a garage don't see it that often, but to enterprise users 12TB is normal. If drives get so wobbly that RAID 5 is not practical, then you're also going to be pissing off non-RAID users tenfold.

Also, you can rebuild drives in the background; I've done it several times. Consistency checks can be scheduled in the background too. If the mentioned problem becomes serious, then manufacturers can release firmware to just mark the sector bad or inconsistent instead of just saying "rebuild failed." I'm not trying to be coarse, but some of you really need to read about RAID 5 before you comment. Especially the author/editor.
 
The article is complete nonsense based on a fundamental misunderstanding of what the unrecoverable read error rate represents.

SomeJoe7777 is on the right track, but failed to explain it completely.

First consider the soft error rate, the rate at which a bit may be read incorrectly one time, but read correctly most of the time. That number isn't really important other than as context for what the unrecoverable read error (URE) rate is. The URE rate is the rate at which raw bits passing under the read head can't be consistently read AND match what was written. Note that we're talking about the raw bit stream from the media. All HDs for at least 15 years have had advanced ECC encoding on each sector which is capable of recovering from at least a one-bit error (typically several bits) in the bit stream for that sector. This means that the ECC which is built into the drive will correct that 10^-14 URE.

The URE applies to the raw bit stream, not to the ECC corrected data coming out of the drive, and the ECC can recover from at least 1 bit error in the bit stream for each (and every) sector on the disk.

Because of that, the size of the drive is essentially irrelevant; it is the sector size and the amount of ECC per sector that are relevant. With sector-level ECC, the size of the drive has only a trivial effect on the likelihood of a sector becoming unreadable.

RAID (other than the improperly named RAID 0) offers an additional level of recoverability: any part of a drive (e.g. one sector, one head, etc.) or a whole drive (e.g. motor, actuator, controller, etc.) can fail and the data can still be recovered.

Likewise, the size of the RAID volume is only a trivial factor in the chances that a RAID volume will be unrecoverable. The number of drives in a RAID 5 or RAID 6 array is many orders of magnitude more significant than the size of the drive or of the volume/array. The more drives, the greater the chance of a failure of one or more drives. The same applies to the seldom/never used RAID 2, 3, or 4 and to a few implementations of RAID 0 + 1 (where the data is striped before it's mirrored).
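To put a rough shape on the "more drives, more chances of a failure" point, here is a minimal sketch. It assumes each drive fails independently with the same probability over some period; the 3% figure is a made-up illustration, not a number from this thread or from any vendor.

```python
def p_any_drive_fails(num_drives, p_single):
    """Chance that at least one of `num_drives` drives fails.

    Assumes independent, identical per-drive failure probability
    `p_single` over the period of interest (a simplification).
    """
    return 1.0 - (1.0 - p_single) ** num_drives


if __name__ == "__main__":
    # Hypothetical 3% per-drive failure probability for the period.
    for n in (2, 4, 8, 16):
        print(n, round(p_any_drive_fails(n, 0.03), 4))
```

The chance of at least one failure grows with every drive added, regardless of how big each drive is.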

Because RAID 1 (and RAID 10, aka 1 + 0) mirrors pairs of drives, the number of drives and the size of the array are almost irrelevant. In order to be unrecoverable, a RAID 10 array must have a failure in the identical portion of the two drives that are mirrored, within a short enough timeframe that the first failed drive is not repaired/rebuilt before the second failure. That probability does not change significantly by adding more drive pairs to the array.
 
RAID5 is in no way doomed. If HDDs ever become so unreliable that they are unfit for RAID5, then they will most certainly be unfit for use as standalone drives (i.e. the most common use). HDDs simply must store enough parity information (per sector) that data loss is a VERY rare event. As we pack information more and more densely onto the platters it may be necessary to add more parity to compensate.

IOW ... move along, there's nothing to see here.
 
"Upon encountering such a read error during a reconstruction process, it is claimed that the array volume will be declared unreadable and the recovery processes will be halted."

Please stop with the FUD.
If a parity stripe fails in a RAID set, only that single stripe is failed, not the entire RAID volume.

Check with Adaptec or LSI tech support for more information on this.
I rarely will say an author does not know what he is talking about, but in this case it is appropriate.
 
I guess this could be good for me, as my company provides RAID data recovery services. Unfortunately, due to the capacity of the drives and the size of the RAID volume, it could take a week or more just to get all the drives mirrored and could take just as long to reconstruct the file system, if it is damaged.

I'm curious to know how companies are going to keep regular backups of their 10TB RAID arrays. Even with an eSATA connection, it will take days to transfer 10TB of data to another drive.
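How long a full copy takes depends almost entirely on the sustained throughput you actually get, so here is a small back-of-the-envelope sketch. The throughput values are assumptions picked for illustration, not measured eSATA figures, and the helper name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(capacity_bytes, throughput_bytes_per_s):
    """Days needed to copy `capacity_bytes` at a constant sustained rate."""
    return capacity_bytes / throughput_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    ten_tb = 10e12  # 10 TB, decimal
    # Assumed sustained rates in MB/s; real-world numbers vary widely.
    for mb_per_s in (40, 100):
        days = transfer_days(ten_tb, mb_per_s * 1e6)
        print(f"{mb_per_s} MB/s -> {days:.2f} days")
```

Anything that lowers the sustained rate (small files, a busy array, a slow target drive) stretches the window proportionally.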
 