Can Heterogeneous RAID Arrays Work?

What if the homogeneous setups had all used an 8 MB, 3-platter drive like the slowest drive included?


My interpretation of the data is that RAID arrays not only have the usable storage of n x (smallest drive), but also a performance cap: the array will not perform better than a multiple of the speed of its slowest drive.
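To put rough numbers on that, here's a quick back-of-the-envelope sketch. The drive sizes and speeds below are made-up illustration values (not the article's drives), and it assumes a simple stripe where every member does equal work:

```python
# Rough model of a striped array built from mismatched drives.
# Sizes and sequential speeds are hypothetical illustration values.
drives = [
    {"name": "A", "size_gb": 320, "seq_mb_s": 90},
    {"name": "B", "size_gb": 320, "seq_mb_s": 75},
    {"name": "C", "size_gb": 250, "seq_mb_s": 60},  # the slow one
]

n = len(drives)
smallest = min(d["size_gb"] for d in drives)
slowest = min(d["seq_mb_s"] for d in drives)

usable_raid0_gb = n * smallest       # capacity: n x smallest member
ceiling_raid0_mb_s = n * slowest     # throughput cap: every stripe waits on the slowest member

print(f"RAID 0 usable capacity: {usable_raid0_gb} GB")
print(f"RAID 0 sequential ceiling: ~{ceiling_raid0_mb_s} MB/s (n x slowest, not n x fastest)")
```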

This article is a nice casual read, but I think it goes without saying that a business will invest the capital to get identical drives, whereas a guy with 4 x 160 GB drives from various manufacturers probably won't mess around with matching drives and firmware before setting up an array for storing pictures or music, etc.
 
Just a note on that: typically, businesses that implement RAID go with a storage solution that has a warranty, like a DMX3 or something from EMC. So all drives will be the same anyway.
 
When I saw the I/O results, the thought that came to mind was FAIL. The explanation I came up with was that one or both of the "other" drives they swapped in have better I/O at large queue depths than the Samsung drives.

-mcg
 
I purposefully run heterogeneous arrays -- say, a WDC and a Seagate from the same generation. The reason is the risk of coincident failure. If a bunch of drives sitting in the same enclosure, serving the same workload, all have identical mechanisms, the chances are too high that you'll lose a second drive before you've been able to replace and rebuild the first. With different mechanisms, that risk is mitigated somewhat.

Btw: what's with Seagate not providing a cross-ship warranty exchange any longer? I just got errors from one half of my second-oldest array, and now I have to run degraded for weeks? I'd rather just buy a new drive. From a company that offers cross-shipping.
 


I have the same setup with my array. I used different brands and different batches, though I went with Linux's software RAID, since cards would add too much to the cost of supporting my (now) 7-disk RAID 5 array. I wonder if the performance issues between heterogeneous and homogeneous arrays are limited to hardware controllers, or whether they affect software RAID as well.



I was thinking the same thing. IIRC, striped RAID (0, 5, 6) distributes the data evenly, so it would then follow that the throughput of the whole array will be bottlenecked by its slowest member.
 


I'm not quite sure you could justify that statistically. If you have two manufacturers, both with 1,200,000-hour MTBFs, I'd say one drive has just as much chance of dying tomorrow as the next one. I have had multiple drives in my machines (one behind me had 4 SCSI drives) and, with one exception, I have never had two drives fail in a box within 3 months... maybe not even in the same year. I did have one box where everything died -- optical drive twice, vid card once, HD -- but that was just a poor case design that ran way too hot... back in the days (1995-ish) before reliable temperature monitoring.

Btw: what's with Seagate not providing a cross-ship warranty exchange any longer? I just got errors from one half of my second-oldest array, and now I have to run degraded for weeks? I'd rather just buy a new drive. From a company that offers cross-shipping.

Yikes... that's the reason I stopped using WD a few years back. Seagate always let me cross-ship as long as I gave them a credit card number. I wonder if it depends on whether it's a "consumer" or "enterprise" drive? Tell me more.
 

Although I have no personal experience, I have always heard that heterogeneous is best for redundancy. As can be seen in this article, the performance hit is small and the benefits that you stated are real.
 
The idea of running heterogeneous because of different mechanisms, and thus different failure points, is somewhat intriguing.

From a mathematical standpoint, two drives rated at the same MTBF, whether both Seagate or one Seagate and one Western Digital, have the same chance of failing at any given time.

Thus, heterogeneous helps only if you assume that doesn't hold -- that is, that you're in an unusual situation, for example an intrinsically flawed batch of homogeneous drives (exactly the thing you hope to avoid by running heterogeneous). The odds that a drive from the same batch as a known failed drive will also fail are higher than for unrelated drives: once one drive has died, the chance that the cause is a manufacturing defect shared across the batch is non-zero, which raises the risk of failure for your second, still-working drive.
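To put that "bad batch" argument in numbers, here's a toy Bayes' rule sketch. Every probability in it is an assumption picked purely for illustration, not real failure data:

```python
# Toy conditional-probability sketch of the "bad batch" argument.
# All numbers are illustrative assumptions, not measured failure rates.
p_bad_batch = 0.01     # prior: 1% of batches ship with a shared defect
p_fail_bad = 0.30      # a drive from a defective batch fails within a year
p_fail_good = 0.02     # a drive from a healthy batch fails within a year

# Posterior probability the batch is defective, given one drive has already failed (Bayes' rule).
p_bad_given_fail = (p_fail_bad * p_bad_batch) / (
    p_fail_bad * p_bad_batch + p_fail_good * (1 - p_bad_batch)
)

# Chance the *second* drive from the same batch fails, before vs. after seeing the first failure.
p_second_prior = p_fail_bad * p_bad_batch + p_fail_good * (1 - p_bad_batch)
p_second_given_fail = p_fail_bad * p_bad_given_fail + p_fail_good * (1 - p_bad_given_fail)

print(f"P(batch defective | one failure)     = {p_bad_given_fail:.1%}")
print(f"P(second drive fails), no info       = {p_second_prior:.1%}")
print(f"P(second drive fails | first failed) = {p_second_given_fail:.1%}")
```

With these made-up numbers the second drive's risk roughly doubles once the first one has died, which is the whole point of the argument: the risk is still small, just no longer independent.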

However, in my experience, a mechanically flawed drive will exhibit problems within a short time span -- a month or two. Granted, this is enough time to become dependent on the array, but in an enterprise situation they would be running RAID 5 or RAID 6 and have a hot spare. You could easily hot-swap in a different brand, and if you ran RAID 6 you could just pull that one drive.

Another way to run homogeneous without the risk of a flawed batch is to buy the same model at different times or from different sources. I have four 7200.10 320 GB drives, but I purchased them as two, and then two more a couple of months later. So they have different batch codes, and I haven't had a problem with either set of drives. They're all in a RAID 5 for storage, with two more drives [7200.10 250 GB] in RAID 1 for the OS.
 
Bleh, this belief that 2 drives in a homogeneous array will fail at the same time is crap. :pfff:

It should be pretty obvious that if it were a real concern, you would have heard of actual problems by now rather than speculation about the mere possibility of an issue.

I run a server farm in ATL that has 500+ servers and 6000+ SCSI drives, same model servers and same model hard drives. Never once in the past 7 years of maintaining this server farm has more than one drive failed in any one server.

I asked around the data center to get some more input. Out of a 15-floor data center filled with thousands of servers and tens of thousands of hard drives, no one has ever experienced this so-called problem.

It's not even worth worrying about. 😉
 

Just as you are telling me it's crap, I have had others tell me that IT HAS HAPPENED TO THEM. Maybe not at the -exact- same time, but it isn't too crazy to assume that 2 virtually identical drives doing the exact same thing in the same operating environment would fail at around the same time.
 


The author's idea of a "small" performance hit doesn't really coincide with mine... those H2 bench graphs show the heteros only 78% as fast as the homos! HomoRAID rules!

Access times are 54% longer for the heteros!

Bar charts show homos are 22% faster in reads, 36% faster in writes.

Homos are 42% faster as a file server.

Homos are 74% faster as a database server.

Homos are 37% faster as a workstation.

If I'm building a RAID box, I don't want those kinda penalties; my HDs will be homos.

Tho I am still wondering where the line might be... are a 7200.10 and a 7200.11 close enough to be homo? Or is that hetero? Or is it somewhere in the middle?
 
This test is a joke. How can you know that the poor results of the heterogeneous RAID are caused by the heterogeneous nature of the RAID, and not because you used a slow hard disk along with fast disks? Why didn't you compare a homogeneous RAID consisting of only the slowest disk model you used against the heterogeneous one? I bet it would have been slower than the heterogeneous array.

Everyone in the storage business knows that heterogeneous is the way to go. Network Appliance (one of the biggest storage companies, Nasdaq 100 / Fortune 1000) uses all kinds of hard disks in their RAIDs.

And it makes sense. When you go and buy the same disks, chances are that they are from the same production batch, so they have a much greater chance of failing at the same time than a disk from a completely different production batch.

MTBF means nothing. You don't care how long, on average, it takes for a disk to fail (and even that average isn't really accurate, but never mind). What you do not want is 2 disks failing at the same time. You don't care whether those disks fail after 2 years or after 4 years on average. You care whether you have 2 disks failing simultaneously.

Imagine having 100 identical disks distributed across many RAIDs, and let's say 1 hot spare disk (you are a cheapskate and you don't want to have more 😛). You want to tell me that homogeneous RAIDs are the way to go? Seriously? I am writing this large-scale example so that you can understand more clearly the risk of using only identical disks.

So try to do this with similar 500 GB disks (same amount of cache, similar performance). Use 3 disks from one brand and 1 disk from a different brand. Also compare the performance of that heterogeneous RAID with the performance of 2 homogeneous RAIDs (one consisting of only the first brand's disks and a second consisting of only the second brand's disks).
 

:pt1cable:
Do you always contradict yourself when making a point?

You can assume all you want, but the only way it could happen is if a lazy system admin didn't replace the first drive when it failed.

Anyone can damage two drives and kill an array through gross negligence; that's not an enticing point.
 


In our enterprise we have 6 DMX3s with over 20 TB of storage apiece, plus over 100 TB of storage in other equipment, most of it previous versions of EMC SANs. All of which use the same friggen disks. And never in the life of the equipment has more than one disk failed at once.

This also includes the thousands of RAID 1 and RAID 5 local drives on our thousands of 1850s, 1950s, 2550s, 2650s, 2850s, 2950s, and 6650s.

Any network that scales to the size you are speaking of will use the same equipment, due to warranty availability, the ability to hot-swap drives, and the lower overall hardware costs involved.
 

Okay, in this article they were purposefully trying to find drives that were dissimilar. For example:
The WD drive has even less per-platter capacity, with SATA/150 instead of SATA/300 and only 8 MB cache, but we thought the device was particularly appropriate for our purposes, which was to create a scenario with three entirely different drives.
In the real world you would want to choose drives that were as similar as possible i.e. at least having the same interface and the same amount of cache.

Are you seriously going to fuse two of my sentences together, cut one of them in half with surgical precision, and state that I am contradicting myself? You should write for the Daily Show.
 
While this topic can be debated over and over again... the fact is that for home use, while you should choose drives with very similar performance, it's not going to ruin anything if you don't. And from an enterprise standpoint, again I say the exact same disk is chosen NOT because of performance but BECAUSE of the warranty, costs, and vendor support -- mostly because of costs. Performance is achieved by choosing a product that meets the system's performance requirements based on statistics (i.e. users hitting it concurrently, type of usage, type of I/O and so on), not by the brand of disks placed in a configuration.

For example, Dell uses several HDD vendors for their SCSI 15K RPM drives; I have personally stuck Seagate and WD 74 GB drives in various servers that were shipped under warranty from Dell.

What am I trying to say? I guess that for huge networks, it's all determined by the system (i.e. storage) vendor and what they supply as far as disks go.
 

😴 Your point was so crappy I figured I would cut out the BS and get to the meat of your point.

Your assumption is homogeneous drives will fail together.
My proof is years of experience with thousands of the same drives.
Your proof was a lazy, negligent admin who sat idly by while a degraded array eventually had a second drive fail. :pfff:

:non:

The plain and simple truth is that drives fail randomly and MTBF means jack squat. One drive may last 7 years while the drive beside it in the same array may have failed twice over in the same period. Heterogeneous vs. homogeneous is a faulty argument for drive failure, as MTBF is just an educated guess. Performance tells you everything you need to know about which setup is best.
 
Even if drives from the same batch are more likely to fail than drives from different batches (and this itself is an unproven assumption), given the spectrum of a hard drive's lifespan, which I'd say is about 1 day to 20 years, it's extremely unlikely that more than one drive fails within several days of another without a third-party 'intervention' like bad power or dropping the drive.

RAID will run degraded for as long as it takes to replace the drive, which depending on circumstances will vary from half an hour or so to maybe 10 days for a home user who needs to mail it in and doesn't get cross-shipping. Either way, the odds of two drives failing in that span BECAUSE they are in the same batch are very low. Even if it happens, it's probably more likely from a statistical standpoint that it's coincidence than that a batch of HDDs is only going to last X number of days and then 50% of them will fail within a week, or some such.
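For what it's worth, here's a quick sketch of how unlikely an independent second failure is during that degraded window, using a simple exponential lifetime model. The MTBF figure is a spec-sheet-style assumption, and the window lengths just span the half-hour-to-ten-days range above:

```python
import math

# Probability that any surviving member of a degraded array fails before the rebuild finishes,
# assuming independent drives with exponential lifetimes. Numbers are illustrative assumptions.
mtbf_hours = 1_000_000
survivors = 3  # remaining drives in a 4-drive array

for window_hours in (0.5, 24, 10 * 24):
    p_one = 1 - math.exp(-window_hours / mtbf_hours)   # one drive dies inside the window
    p_any = 1 - (1 - p_one) ** survivors                # any survivor dies inside the window
    print(f"window = {window_hours:>5} h -> P(second failure) ~= {p_any:.4%}")
```

Even at the ten-day end the independent-failure probability stays well under a tenth of a percent, which is why the batch-correlation argument is the only version of the "two drives at once" worry that carries any weight.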
 

For starters, you seem to be focusing on large datacenters with enterprise class SCSI or SAS drives. I'm just talking about home users with regular old off-the-shelf SATA drives. I'm not surprised that you haven't had many failures given that you are working with much higher quality hardware than the average joe would be using.

Also, I am not assuming that homogeneous arrays will fail together, but I am assuming that they are more likely to fail at around the same time. It could easily take up to a week to replace a drive if the replacement is not locally available. Laziness has little to do with it.

I agree that MTBF is next to meaningless, but for different reasons. Western Digital recently switched the WD3200AAKS from two 160 GB platters to a single 320 GB platter. There wasn't even a model number change, much less an updated MTBF rating. Somehow I doubt that changing the platters would have no impact whatsoever on MTBF.
 
Uh... no, it couldn't "easily take a week" to replace the drive. If you can't get something locally, you can Newegg it overnight or 2-day. The only way it would "easily" take a week to get the drive replaced is if you were lazy and didn't act fast enough.

As for this article, as others have pointed out, it's rather flawed because it used a slower drive in the test. If you can get 4 different drives with the same specs (speed/cache/platters/etc.), the comparison would be a lot more meaningful. Either that, or run the tests as-is but also add another set of 4 drives that includes an above-average-speed drive, to really test whether it's a bottleneck due to drive configuration or actual drive hardware.
 
The idea of running heterogeneous because of different mechanisms, and thus different failure points, is somewhat intriguing.

From a mathematical standpoint, two drives rated at the same MTBF, whether both Seagate or one Seagate and one Western Digital, have the same chance of failing at any given time.

you should appreciate this quote:

In theory, practice and theory are the same.

In practice, they aren't.
 
Didn't the whole rule of same-drives-only for RAID originate with SCSI-based RAID arrays in which the drive heads were synced? I seem to remember reading somewhere that the very old RAID controllers (e.g. mainframe and minicomputer era, I suppose?) would run the SCSI drives synchronized, so that when one drive was looking at a particular block, so was the other. From that, it wasn't just better but necessary that the two drives be identical.

This same-drive-only concept sounds like it's based on the kind of folklore that gets propagated down through time, and no one remembers why. Like, "You have to use the exact same memory in all slots on a mobo". At one time, it was necessary to use the exact same memory in each slot or your computer wouldn't run. Now, you can mix-n-match and it usually works fine. Yeah, you might get reduced performance because of a loss of dual-channel, or least-common-denominator (actually, greatest common factor) memory timings. But it works.

It looks like the same situation here: the ATA standard is based on having the drive controller hardware inside the drive itself, right? The processes of managing the timing on the drive head, moving the head across the platters, etc. are all handled by the drives. The O/S just asks the drives to find Block X in a given sector and track. So as long as the drives can do this with relatively equal performance, it shouldn't matter much.

As for homo vs. heterogeneous impacting failure safety: if one mfr's drives are particularly susceptible to a particular type of power fluctuation, temperature, humidity, G-shock (earthquake?), dust, or cyclic frequency of any of the above more than other mfrs', then employing a hetero environment could conceivably reduce the risk of multiple failures from some environmental factor (e.g. the A/C fails, or there's an earthquake). Most data centers are protected against that kind of failure by environmental controls and auto-shutdowns; most homes are not.