How to choose the best hard drives for Intel's PCH RAID?

dude2

Distinguished
Nov 15, 2008
12
0
18,510
Hi,

I've received no response on the Intel community forum regarding this topic:
"TLER/ERC/CCTL capable drives needed for PCH RAID?"
http://communities.intel.com/message/91107
Nor have I received a concrete answer to Intel support ticket 8000033427.
I wonder whether my question is even valid.

Question in brief:
When choosing hard drives for a RAID on P55/H57 boards, are TLER/ERC/CCTL-capable drives needed for a more stable RAID setup?

TLER/ERC/CCTL is a hard drive feature that limits how long the drive spends on error recovery, so that it stays in step with the RAID controller's drive management. I have collected some information and called hard drive manufacturers about it.
WD TLER defaults to 7 seconds.
Seagate ERC defaults to 10 seconds.
Samsung CCTL defaults to 7 seconds.
These numbers may change for different versions/models. Some models may even allow users to change the timeout settings.
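
For drives that expose the timeout through SCT Error Recovery Control, the smartmontools utility smartctl is one way to read or change it. That tool is not mentioned anywhere in this thread, and whether a given drive accepts the command depends on its model and firmware, so treat this as a minimal sketch. smartctl takes the read and write limits in tenths of a second, so 7 seconds becomes 70. The code only builds the command line; the device path is a placeholder.

```python
def scterc_args(device, read_seconds=None, write_seconds=None):
    """Build a smartctl argument list to query or set SCT ERC timers.

    smartctl expects the limits in units of 100 ms (deciseconds).
    With no limits given, the command only reports the current setting.
    """
    if read_seconds is None and write_seconds is None:
        return ["smartctl", "-l", "scterc", device]
    if read_seconds is None or write_seconds is None:
        raise ValueError("give both read and write limits, or neither")
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# /dev/sdX is a placeholder device path, not something taken from this thread.
print(" ".join(scterc_args("/dev/sdX")))        # query the current setting
print(" ".join(scterc_args("/dev/sdX", 7, 7)))  # request 7 s read / 7 s write
```

Running the printed command needs administrator rights, and on many drives the setting does not survive a power cycle.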

In ticket 8000033427, an Intel engineer said that the P55/H57 PCH, just like the ICH7R, gives a RAID member hard drive 10 seconds to reply to a R/W command before declaring it unresponsive and dropping it from the RAID array. He added, however, that Intel's PCH software RAID does not truly support hardware-enabled TLER/ERC/CCTL.

Thus, I don't know how to pick the best hard drives for the Intel PCH RAID array. If I choose hard drives built without TLER/ERC/CCTL, these drives may, during their error recovery process, be detected as unresponsive and dropped from the RAID array by the PCH. On the other hand, if I choose TLER/ERC/CCTL drives, I've been warned about incompatibility issues.
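
The timing concern can be put into a small model. This is only a sketch under stated assumptions: the 10-second controller limit and the vendor defaults are the figures quoted in this post, the 30-second recovery time is a made-up example, and treating "exactly at the limit" as still in time is my own guess, not something Intel confirmed.

```python
CONTROLLER_LIMIT_S = 10.0  # limit quoted by the Intel engineer in the ticket

# Vendor defaults as reported in this post; they may differ by model.
DRIVE_RECOVERY_LIMIT_S = {
    "WD TLER": 7.0,
    "Seagate ERC": 10.0,
    "Samsung CCTL": 7.0,
    "no TLER/ERC/CCTL": None,  # recovery runs until the drive gives up
}


def worst_case_response(needed_recovery_s, drive_limit_s=None):
    """Time before the drive answers; capped when a recovery limit exists."""
    if drive_limit_s is None:
        return needed_recovery_s
    return min(needed_recovery_s, drive_limit_s)


def is_dropped(response_s, controller_limit_s=CONTROLLER_LIMIT_S):
    """True if the drive stays silent longer than the controller waits."""
    return response_s > controller_limit_s


# Hypothetical bad sector that would need 30 s of retries if left alone.
NEEDED_S = 30.0
for name, limit in DRIVE_RECOVERY_LIMIT_S.items():
    response = worst_case_response(NEEDED_S, limit)
    margin = CONTROLLER_LIMIT_S - response
    status = "dropped" if is_dropped(response) else "kept"
    print(f"{name:18s} responds after {response:4.1f} s, "
          f"margin {margin:5.1f} s -> {status}")
```

Under these assumptions only the drive without a recovery limit is dropped, but a 10-second drive default leaves no margin against a 10-second controller limit.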

Any suggestions before I can get a clear answer from Intel chipset support on this? Or, shouldn't I be concerned about this issue?
 

dude2

I received the following response from an Intel tech support engineer in ticket #8000033427.

He said on 5/28,
"P55 only uses Software RAID. Therefore, the P55 documents do not list items such as TLER. .... TLER is not supported."

He didn't specifically mention the ICH10R, but he said that the 10-second limit is unchanged from the ICH7R to the PCH, and that an unresponsive drive will be marked as failed 10 seconds after the PCH issues a R/W command.

AFAIK, Western Digital RE drives use TLER to address the out-of-sync problem between the RAID controller and a member drive during that drive's error recovery process.

WD RE may work, at first glance, because it will definitely respond to the PCH within 10 seconds, but how well the two will work together in various situations is uncertain. I am evaluating the RAID stability risk of using either the Intel PCH or a dedicated RAID card.
 

dude2

WD RE (RAID Edition) disks definitely have TLER.
http://www.tomshardware.com/reviews/sbm-high-end-system,1689-7.html
Even some desktop-edition WD disks used to be "TLER available" via the TLER utility; that was before WD decided to widen the gap between desktop disks and server RAID ones at the beginning of this year.

I'm looking for a more reliable RAID solution, though of course faster is always better. For example, maybe I should just pick a RAID card that is marked CCTL compliant and pair it with CCTL-ready drives to achieve RAID stability. Any suggestions on which to choose: CCTL, TLER or ERC? Or SAS maybe?
 
Other than SSDs, nothing beats a bunch of 15K SAS drives connected to a high performance RAID controller and a BBU to enable write caching. I wouldn't use it for a desktop PC, but it's an excellent solution for servers. On a desktop I'd use RE3 drives on the ICH10R combined with an SSD if booting/applications loading performance is important. No matter what RAID solution is used, you need to perform regular backups.
 

MRFS

Distinguished
Dec 13, 2008
1,333
0
19,360
> Any suggestions before I can get a clear answer from Intel chipset support on this?

We've seen a few requests here for assistance,
because users assembled a RAID using
WD's Caviar Black series, which do not support
time-limited error recovery ("TLER").

Yes, they will drop out of the RAID array
after they start to fill up, because the
firmware's error recovery logic may take too long,
and Intel's I/O controller hub will conclude
that one or more HDDs are not responding.

BEST WAY is to stay with WD's RE (RAID Edition)
HDDs, which are designed with TLER --
time-limited error recovery.

The specs for each WD HDD will state if TLER
is supported by any given HDD:

http://www.wdc.com/en/products/productcatalog.asp?language=en


p.s. WD has sold so many millions of HDDs in recent years,
and Intel's I/O controller hubs are also so ubiquitous,
it's extremely unlikely that WD's RAID Edition HDDs
are incompatible with any recent ICHx.


MRFS
 

dude2

Hi GhislainG,

With SATA II and 6G closing the speed gap between SATA and SAS, I wonder whether there is any public information or benchmark data showing that SAS is more reliable than the new generation of SATA. Anyway, it seems you favor adding a RAID card with a battery backup unit for stability and speed.

===============================
Hi MRFS,

Yes, given the reputation and prevalence of WD RE hard drives and Intel I/O controller hubs, I shouldn't worry too much about incompatibility between them.

As you suggested, TLER will most likely lessen the chance of hard drives being dropped from a RAID array. This has also been mentioned by an Intel engineer.
http://communities.intel.com/message/12098#12098

However, by looking at these reported problems:
"Random drive fails with new Matrix Storage Manager 8.9"
http://communities.intel.com/message/51299#51299
"Random drive fails with new Rapid Storage Technology 9.5 ?"
http://communities.intel.com/thread/8139?start=0&tstart=0

I am afraid WD RE TLER-enabled disks will only alleviate the drop-out symptom rather than provide a rock-solid array: even though they will be more responsive to the ICHxR/PCH, they may still not be a perfect match. The Intel support engineer told me that the ICHxR/PCH does not truly support TLER/ERC/CCTL, and he couldn't provide me with a list of compatible hard drives for the ICHxR/PCH.

I searched Intel's site for tested and supported parts for the PCH motherboards but failed to find information on compatible/tested hard drives. Tested memory is listed, though.
http://www.intel.com/support/motherboards/desktop/sb/CS-029945.htm

I think two things would help assure stability when matching the ICHxR/PCH with TLER/ERC/CCTL drives:
-----------------------------------------------------------------------------------------------------
1. Even if Intel's integrated RAID solution does not claim to be a perfect match for any specific vendor's (or vendor group's) standard, it should still claim categorical compatibility, to the extent that the ICHxR/PCH is guaranteed not to cause drop-outs due to a disk's prolonged error recovery process (e.g., 5~10 seconds) as long as the disk implements TLER/ERC/CCTL.
2. A compatibility list of hard drives for the ICHxR/PCH boards.
 
I read the threads and it appears that version 9.6 fixes the issues reported with 8.9 and 9.5. It also looks like using RE drives wasn't as bad as using Caviar Black drives. What will you do? Buy a RAID controller, use version 9.6 or no RAID at all?

Edit: SAS drives are not more reliable than SATA drives, but they are faster. When connected to a good RAID controller with a BBU, writing is very fast. Just add more drives to improve performance.
 

MRFS

From a scientific point of view, controlled tests need to compare
all permutations involving all IDE, AHCI and RAID modes
with and without TLER (or similar) support in the HDDs attached.
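
The set of configurations such a controlled test would have to cover can be enumerated mechanically. A minimal sketch, assuming only the three controller modes and the two drive categories named above (real testing would add further variables such as driver version and OS):

```python
from itertools import product

controller_modes = ["IDE", "AHCI", "RAID"]
drive_types = ["with TLER/ERC/CCTL", "without TLER/ERC/CCTL"]

# Every controller mode paired with every drive category.
test_matrix = list(product(controller_modes, drive_types))

for mode, drive in test_matrix:
    print(f"{mode:5s} | {drive}")
```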

And, with or without Intel's ICHxR chipsets, there is also the option
to create "software RAID" arrays with Windows XP
e.g. starting with dynamic disks.

Thus, IDE and AHCI can still be configured in such a software RAID.


We have 2 x 6G WD HDDs configured as a software RAID 0 (for speed);
each is 1TB for a total of 2TB; and, this RAID 0 array
is far from being full :)

Here's that 6G HDD:

http://www.newegg.com/Product/Product.aspx?Item=N82E16822136533&Tpk=N82E16822136533


So far, so good; and, it's pretty fast too!
The following test was done with a 96MB file,
to force the test to read from the 2 HDD caches only,
in order to get a feel for the 6G difference (if any):


[Benchmark screenshot: ASUS.PCIE.GEN2.SATA6G.96MB.read.2xHDD6G.XP.RAID0.bmp]



MRFS
 

dude2

GhislainG>>....RE drives wasn't as bad as using Caviar Black drives<<
I pretty much concur. But I still can't conclude that WD RE will be trouble-free, at least not from reading the threads on the Intel Communities forum.

GhislainG>>What will you do? Buy a RAID controller, use version 9.6 or no RAID at all?<<
If I just need a home desktop with some basic and convenient RAID capabilities, Intel ICHxR/PCH may be powerful enough. But, if I need a reliable RAID without the possible mismatches waiting somewhere in the lifespan of the hard drive, I may wait for some answers and look around all options before jumping to a conclusion. No RAID is not that bad, if some cluster setup works. Redundancy can be achieved in many ways.

MRFS,
A RAID 0 of two 6Gb/s 1TB drives is definitely a killer on cost/performance. RAID 0's average throughput is almost double that of a non-RAID drive. Based on your data, the native non-RAID reading should average around 115MB/s, which is right on par with Legit Reviews' test results for SATA II and SATA 6Gb/s.
"Seagate XT 2TB SATA 6Gb/s Hard Drive Testing"
http://www.legitreviews.com/article/1127/3/
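
The per-drive estimate is just the RAID 0 reading divided by the number of striped drives. A rough sketch of that arithmetic: the 230 MB/s input is a placeholder chosen only so that the result matches the 115MB/s estimate above, it is not read off the screenshot, and the `scaling` parameter is my own addition to allow for less-than-perfect striping.

```python
def per_drive_throughput(array_mb_s, n_drives, scaling=1.0):
    """Estimate single-drive throughput from a RAID 0 measurement.

    scaling=1.0 assumes the array is exactly n_drives times as fast as
    one drive; lower values model striping overhead.
    """
    if n_drives < 1:
        raise ValueError("n_drives must be at least 1")
    if not 0 < scaling <= 1:
        raise ValueError("scaling must be in (0, 1]")
    return array_mb_s / (n_drives * scaling)


# Placeholder array reading, not a measured value from this thread.
ARRAY_MB_S = 230.0
print(per_drive_throughput(ARRAY_MB_S, 2))                 # perfect scaling
print(round(per_drive_throughput(ARRAY_MB_S, 2, 0.95), 1)) # 95% scaling
```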

How stable is your RAID 0 setup? Are there I/O intensive applications running on it 24/7/365?
 
> But, if I need a reliable RAID without the possible mismatches waiting somewhere in the lifespan of the hard drive, I may wait for some answers and look around all options before jumping to a conclusion.

To start with, no RAID is 100% reliable. Unfortunately you never mentioned that it's for a critical application in your original post. You can mitigate the risks with RAID0+1, RAID6, RAID60 and a few other combinations with the use of at least one hot spare drive just in case one fails at the most inopportune time. That's how an enterprise SAN should be set up, but I don't necessarily do it for servers that run 24/7/365. You also need a dual processor server, dual NICs, dual PSU, dual UPS, etc. Achieving 100% uptime is expensive.
 

dude2

I don't expect RAID to provide full system redundancy; it only provides disk redundancy. Full system redundancy takes much more, just as you said. But I still expect a "relatively reliable" RAID that doesn't cost an arm and a leg.

If we all agree that this "error recovery gets out of sync" problem exists between the ICHxR/PCH and TLER/ERC/CCTL drives, then until Intel openly claims that the ICHxR/PCH is categorically compatible with such drives, shouldn't this potential problem be noted by users and dealt with by vendors? For example, can we users ask Intel to make a new firmware/software version for ICHxR/PCH to address this issue? BTW, can AMD's desktop chipset coordinate with TLER/ERC/CCTL drives and handle error recovery well without dropping them accidentally?
 

MRFS

> BTW, can AMD 's desktop chipset coordinate with TLER/ERC/CCTL drives and handle error recovery well without dropping them accidentally?

Very good question: WHY DON'T YOU ASK AMD DIRECTLY?

And, let us know what they say, please.

Another reliable expert to ask this same question
is Allyn Malventano at www.pcper.com .


p.s. It occurs to me that a user option should be
added to the SATA protocol, with a reasonable DEFAULT
value, based on the best engineering expertise.
This option should be accessible via Intel's
RST and Matrix Storage Technology.


MRFS
 
> For example, can we users ask Intel to make a new firmware/software version for ICHxR/PCH to address this issue?

Based on the threads that you linked, version 9.6 seems to address the dropped drives issue. No additional issues have been posted by people who upgraded to that version. It also is interesting that several complaints are from people using Intel motherboards and/or hard disks without TLER.

> BTW, can AMD's desktop chipset coordinate with TLER/ERC/CCTL drives and handle error recovery well without dropping them accidentally?

That's harder to determine because Intel have been supporting RAID since the ICH5R, therefore there is more info available about Intel than AMD. However you can google RAIDXpert and you'll find that the issue also exists with AMD. If I were you I'd probably stick to the Intel ICH10R and use the latest Intel Rapid Storage drivers (version 9.6.1014) and RAID enabled hard disks (WD TLER / Samsung CCTL / Seagate ERC). Or buy a controller like the 3ware 9650SE-4LPML and the BBU. It will cost less than $500 and provide good RAID5 performance.

Edit: MRFS' suggestion to use software RAID shouldn't be ignored if you're leery of Intel's solution.
 

MRFS

p.s. Re: a "relatively reliable" RAID without losing an arm and a leg

There are a lot of factors to consider, such as these observations
which we offer, after several years of using RAID 0 primarily for speed:

(1) 5-year warranties are superior to 3-year warranties,
especially when retail cost per warranty year is considered;
time passes swiftly when you're having fun, and 3 years
can happen before you know it;

(2) input power quality is crucial, which mandates a
good UPS and PSU on every discrete system, with a
feedback cable to initiate SHUTDOWN whenever
the power grid fails;

(3) active cooling on all HDDs is another necessity,
ideally with a removable dust filter on all intake fans;

(4) short-stroking partitions that host the most frequently
used files increases performance and reduces wear
on the armature assembly and bearing;

(5) we also suspect, without conclusive proof, that
regular disk checking does maintain the strength of
raw magnetic recordings on the platter media;

(6) installing HDDs with vibration-reducing mounts
is another good idea, particularly when several HDDs
share the same drive cage;

(7) being kind to your HDD manufacturer is also a
good policy, particularly if/when any given HDD fails:
there is a measurable amount of "infant mortality"
in this industry, so don't blow up if you experience
your share of same!
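
Point (1) comes down to dividing the retail price by the warranty length. A minimal sketch of that comparison; the prices below are made-up placeholders, not quotes for any real drive:

```python
def cost_per_warranty_year(price, warranty_years):
    """Retail price divided by the number of warranty years."""
    if warranty_years <= 0:
        raise ValueError("warranty_years must be positive")
    return price / warranty_years


# Hypothetical prices, for illustration only.
drives = {
    "3-year warranty drive": (90.0, 3),
    "5-year warranty drive": (120.0, 5),
}

for name, (price, years) in drives.items():
    per_year = cost_per_warranty_year(price, years)
    print(f"{name}: ${price:.2f} total, ${per_year:.2f} per warranty year")
```

With these placeholder prices the 5-year drive costs more up front but less per covered year, which is the effect point (1) describes.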


MRFS

 

MRFS

> MRFS' suggestion to use software RAID shouldn't be ignored if you're leery of Intel's solution.


Software RAID doesn't come in all flavors, however,
particularly with XP.

I am told that server editions of Windows do support
more flavors of software RAID.


RTFM (Read The Fine Manual -- not always "F"ine however :)


MRFS
 
XP isn't an issue as people running mission critical applications use Windows Server 2003 or 2008 and dedicated servers with hardware RAID controllers, BBU, etc.

Edit: I agree with the 7 points in your previous post. I have a home server that runs Windows Server 2003 and it's been running 24x7 for several years. It has a good PSU, a server motherboard and a Smart-UPS 1500 in case of power fluctuations or failures.
 

MRFS

> XP isn't an issue ...

It certainly is for XP users who try to configure a RAID array
using Intel's ICHxR chipset with HDDs that don't support
TLER/ERC/CCTL.

Some of those customers have complained bitterly
e.g. with RMAs to Western Digital's support staff,
that their RAID arrays are failing after lots of WRITEs
and long before their factory warranties have expired.

BUT, those customers failed to take note that
their Caviar Black HDDs do NOT support TLER.

And, even though I don't yet have any experience with Win7,
my money is on a bet that similar things can be expected
to happen with that OS too.

So, part of the problem here is sheer customer ignorance.

As I understand the crux of this problem, it results from
an interaction among a HDD's firmware,
the RAID controller at the other end of the data cable,
and the logic of the device driver running that controller:

if the HDD's firmware initiates an error recovery
sequence, that sequence can take more time as
the HDD has more data to check: if that sequence
makes it "appear" to the controller that the HDD has died,
the controller's internal logic very probably will "drop"
that HDD from the RAID array.

Thus, the problem can occur with any OS and with
any HDDs that do not support one of these features:
TLER/ERC/CCTL


We've also observed something like the reverse of this
situation: our Highpoint RocketRAID 2322 was dropping
WD's RE HDDs only if we enabled periodic "polling"
from that controller's User Interface. When polling
was DISabled, our WD RE HDDs no longer dropped out.


Yes, Windows Server 2003 or 2008 do support more
software RAID options than does XP, but that is not
the main point of this discussion.


MRFS
 
I simply meant that XP is not an issue for mission critical applications as it isn't the right platform. I agree that it's an issue for end users who select the wrong drives for RAID, but WD are partly at fault for not clearly warning users. On the other hand, IT people should be able to determine what's best suited for a given environment.
 

dude2

> if the HDD's firmware initiates an error recovery
> sequence, that sequence can take more time as
> the HDD has more data to check: if that sequence
> makes it "appear" to the controller that the HDD has died,
> the controller's internal logic very probably will "drop"
> that HDD from the RAID array.
>
> Thus, the problem can occur with any OS and with
> any HDDs that do not support one of these features:
> TLER/ERC/CCTL

Even more so, I am afraid that some TLER/ERC/CCTL hard drives may still get dropped when they start talking different languages to ICHxR/PCH. I hope they all speak a common language in their basic dialogue, even if not for some advanced error recovery features. That is what I meant by relative reliability and categorical compatibility.

> Based on the threads that you linked, version 9.6 seems to address the dropped drives issue. No additional issues have been posted by people who upgraded to that version.
> ...
> That's harder to determine because Intel have been supporting RAID since the ICH5R, therefore there is more info available about Intel than AMD. However you can google RAIDXpert and you'll find that the issue also exists with AMD. If I were you I'd probably stick to the Intel ICH10R and use the latest Intel Rapid Storage drivers (version 9.6.1014) and RAID enabled hard disks (WD TLER / Samsung CCTL / Seagate ERC).

I haven't found cases on the Intel Communities forum where the relatively young IRST 9.6 has proved itself a cure for the random disk drop-out problem, confirmed by a substantial number of users.
Besides, Intel tech support confirmed that the ICHxR/PCH family can't talk to TLER/ERC/CCTL drives, even though such drives may be less likely to step on the controller's toes.
It seems I owe the AMD forum a visit on this question, as MRFS recommended.
 

dude2

Once I finish researching AMD's take and make my trip to the AMD forum, I will definitely leave a note here. Please update me with your new findings as well.
 

MRFS

> I am afraid that some TLER/ERC/CCTL hard drives may still get dropped when they start talking different languages to ICHxR/PCH


I don't think it's a syntax issue:

the firmware in HDDs is already coded
to respond to "polling" requests issued
by SATA controllers: as I understand it,
this is a feature of the ATA command set.

SATA is merely the Serial version of that ATA command set.

It's when a HDD does NOT respond
to such a polling request, that the controller
then decides to "drop" it from a RAID array.

The failure to respond, in this context,
is due to the fact that the HDD's firmware
is simply BUSY doing error checking,
and it also does NOT permit "interruptions"
at the moment polling requests are received.

Put differently, it is not a "real-time" process,
but one which queues polling requests
until such time as the firmware is ready
to handle such a request.

The same thing can and does happen
whenever the HDD's cache is full:
it will send a command back to the
controller to wait until that cache
has been emptied enough for more
controller output to be received
by that cache.

I'm sure many of you have already
had the experience of trying to "kill"
a running process that has gone rogue,
but all attempts to "kill" it fail.

This was much more common with
older versions of Windows, like 98SE,
which did not always respond if/when
a User tried to kill a rogue process.

I wouldn't bet on this, until we receive
absolute confirmation from the HDD manufacturers:

But, from observing WD's HDD behavior over many years,
I would have to make an educated guess that
WD's TLER capability is simply a feature of firmware
in their RAID Edition ("RE") drives which only does
error checking for a very limited amount of time
-AND-
then, at the instant when that amount of time has passed,
it checks for any polling requests before going back
to error checking ...

... something like that.
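
To make that guess concrete, here is a toy model of it. It is not WD's firmware and rests entirely on the assumptions stated above: recovery runs in bounded slices, and after each slice the firmware answers whatever polling requests arrived in the meantime. The function name, the 30-second recovery time and the poll arrival times are all hypothetical.

```python
def poll_latencies(recovery_needed_s, slice_limit_s, poll_arrivals_s):
    """Model firmware that checks for polls after each bounded recovery slice.

    Returns (arrival, answered_at) pairs for polls that arrive while
    recovery is still in progress.
    """
    answered = []
    clock = 0.0
    remaining = recovery_needed_s
    queue = sorted(poll_arrivals_s)
    while remaining > 0:
        step = min(slice_limit_s, remaining)
        clock += step
        remaining -= step
        # Slice finished: answer every poll that has arrived by now.
        while queue and queue[0] <= clock:
            answered.append((queue.pop(0), clock))
    return answered


def worst_wait(pairs):
    """Longest time any poll waited for its answer."""
    return max(done - arrived for arrived, done in pairs)


polls = [1.0, 8.0, 20.0]                       # hypothetical arrival times
limited = poll_latencies(30.0, 7.0, polls)     # 7 s slices, TLER-like
unlimited = poll_latencies(30.0, 30.0, polls)  # one uninterrupted 30 s pass

print("worst wait with 7 s slices:", worst_wait(limited))
print("worst wait without a limit:", worst_wait(unlimited))
```

In this model no poll waits longer than one slice when the slice is bounded, whereas the uninterrupted pass keeps the earliest poll waiting far beyond a 10-second controller limit.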


MRFS
 

dude2

> I don't think it's a syntax issue:
>
> the firmware in HDDs is already coded
> to respond to "polling" requests issued
> by SATA controllers: as I understand it,
> this is a feature of the ATA command set.
>
> SATA is merely the Serial version of that ATA command set.
>
> It's when a HDD does NOT respond
> to such a polling request, that the controller
> then decides to "drop" it from a RAID array.
>
> The failure to respond, in this context,
> is due to the fact that the HDD's firmware
> is simply BUSY doing error checking,
> and it also does NOT permit "interruptions"
> at the moment polling requests are received.

Are you implying that, even though TLER, ERC, and CCTL each have their own error recovery procedure/protocol, they all adhere to the same basic set of instructions (i.e., the ATA command set) when communicating with Intel ICHxR/PCH, and that this basic command set already takes the basic error recovery process into account?

From my understanding, there is no common baseline protocol/instruction for handling a RAID member's error recovery process, other than trying not to step on each other's toes too quickly, e.g. within 10 seconds. But will ICHxR/PCH wait forever as long as a TLER-enabled drive keeps telling the controller it is busy? A basic command set should take this into account to ensure relative reliability.