ICH10R - RAID 5 Failure

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Greetings all,

After having two 1TB NAS failures, I have decided to build my own File Server based on Windows XP (Least common denominator). I have built many data centers in my past with SUN, HP, IBM, and other vendors. I have been lured by the prospect of using SATA II drives in RAID 5 at an affordable price with software level RAID.

I embarked on designing and building a system as a media and file server for my home network. My home has a 3COM switch with 2 dedicated 1 Gb/s ports and 24 100 Mb/s ports. This provides a healthy backbone for file server operation.

So here is the system I designed and built:

Gigabyte EP45-UD3R (45 NorthBridge, ICH10R SouthBridge)
Intel QUAD core 2.8GHz processor
4 GB memory DDR2 1066Mhz
5 (500GB) Seagate 7,200RPM drives

While I realize this is way too much processing power for a file server, I decided to have it double as a powerful desktop for everyday work and some gaming, since it will already be chewing power on the network backbone.

The configuration uses SATA2, NCQ, and the ICH10R's ability to set up RAID 5 across the 5 drives. The array was segmented into 100GB for the OS and the remainder, close to 1.8TB, for storage. Intel Matrix Storage Manager 8.5 picked up the drive array, initialized it, and everything worked great.
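
For anyone following along, the capacity math behind that split works out roughly like this (a quick Python sketch, assuming decimal gigabytes on the drive labels and single-parity RAID 5):

# Rough capacity math for the Matrix RAID 5 layout described above.
DRIVES = 5
DRIVE_GB = 500                                   # decimal GB per drive

raid5_usable_gb = (DRIVES - 1) * DRIVE_GB        # one drive's worth goes to parity
os_volume_gb = 100
storage_volume_gb = raid5_usable_gb - os_volume_gb
raid5_usable_tib = raid5_usable_gb * 1e9 / 2**40

print(f"RAID 5 usable: {raid5_usable_gb} GB (~{raid5_usable_tib:.2f} TiB as Windows reports it)")
print(f"OS volume: {os_volume_gb} GB, storage volume: {storage_volume_gb} GB")
# -> 2000 GB usable (~1.82 TiB), leaving ~1.9TB for storage, which matches
#    the "close to 1.8TB" figure above.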

One small hiccup along the way was that write speed was abysmal! It was really horrible and gave me a bad feeling in my stomach, as I was getting at best 5MB/s write speed! After much research, I found that the write-back cache flag was turned off. Turning it on gave me close to 80MB/s, which was very nice. Read performance on RAID 5, as you know, is amazing as-is, with very little optimization required.
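
For what it's worth, the reason the write-back cache makes such a dramatic difference on RAID 5 comes down to the read-modify-write cycle behind every small write (a rough Python sketch with assumed numbers, not a benchmark):

# Without caching, each small random write on RAID 5 becomes a
# read-modify-write: read old data, read old parity, write new data,
# write new parity.
IOS_PER_SMALL_WRITE = 4          # 2 reads + 2 writes
DISK_RANDOM_IOPS = 80            # assumed rough figure for a 7,200 RPM SATA drive
WRITE_KB = 64

mb_per_s = DISK_RANDOM_IOPS / IOS_PER_SMALL_WRITE * WRITE_KB / 1024
print(f"~{mb_per_s:.1f} MB/s")   # lands in single-digit MB/s territory,
                                 # the same ballpark as the 5MB/s observed

# With write-back caching, the controller gathers whole stripes in RAM,
# computes parity once per stripe, and streams full-stripe writes, which
# is how the same array can jump to sequential-class speeds (~80MB/s here).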

Now to the problem: my rig worked wonderfully for about 1 month in 24x7 mode with no problems. Then the Blue Screens of Death on Windows XP started occurring. I have done the basic debugging and found the following types of failures:

Module - Error
-------------------
nt - DRIVER_FAULT
nt - DRIVER_FAULT memory_corruption
iaStor.sys - DRIVER_FAULT iaStor.sys
nt - DRIVER_FAULT memory_corruption
vax347b - COMMON_SYSTEM_FAULT Vax347b.sys
nt - COMMON_SYSTEM_FAULT memory_corruption
nt - COMMON_SYSTEM_FAULT ntkrpamp.exe
Vax347b - COMMON_SYSTEM_FAULT Vax347b.sys
ntfs - NULL_CLASS_PTR_DEREFERENCE ntfs.sys
nt - COMMON_SYSTEM_FAULT ntkrpamp.exe
nt - COMMON_SYSTEM_FAULT memory_corruption
intelppm - COMMON_SYSTEM_FAULT intelppm.sys
nt - COMMON_SYSTEM_FAULT memory_corruption
sr - DRIVER_FAULT

To try to keep this from happening, I have been starting XP with the last known good configuration. I know without a doubt that the ICH10R and the Storage Manager are the root cause of this. I was able to upgrade the driver to Storage Manager 8.8 and get the array back in working order (rebuild on XP). Interestingly, I experienced no data loss (thumbs up for RAID 5).

The system worked for about 1 week and then the crashes started again. The uptime ranges from minutes to a couple of hours before the crash occurs.

Now I am at a loss and not sure how to proceed. I would like to fix the problem because I was very happy with the performance of the box, even for intense gaming. If I can't find a solution, I am thinking of turning my attention towards a hardware RAID controller and giving software-level RAID the thumbs down.

Any assistance with my problem would be greatly appreciated.

Regards.
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
It sounds like a faulty chipset. Can you still return the motherboard? Or at least get another to try. You should have no problem moving your RAID array to another board with an ICH10R. Also, did you know about Intel's Matrix RAID? This simple feature alone should keep you from moving to hardware RAID, because you can create a RAID 0 array for the OS and a RAID 5 for the file server all on the same 5 drives. Also, you won't be able to go above 2TB of total available space under 32-bit Windows and onboard RAID. You would need 'carving' for anything above 2TB, or 64-bit XP/Vista. Just something to think about.
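
For reference, the 2TB ceiling mentioned above comes from the MBR partition table that 32-bit XP boots from (a quick Python sketch, assuming the usual 512-byte sectors):

# MBR partition tables store the sector count in a 32-bit field, so with
# 512-byte sectors the largest addressable volume is:
SECTOR_BYTES = 512
MAX_SECTORS = 2**32

limit_bytes = MAX_SECTORS * SECTOR_BYTES
print(f"{limit_bytes / 1e12:.2f} TB ({limit_bytes / 2**40:.0f} TiB)")
# -> 2.20 TB (2 TiB); going beyond that needs GPT (hence 64-bit XP/Vista)
#    or the 'carving' of the array into sub-2TB volumes mentioned above.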

OH YES, I was just about to hit the Submit button when I remembered that you are using a Gigabyte motherboard and their boards tend not to work with hardware Raid cards. I don't know the exact reason, but I have seen quite a few people try and fail. I know firsthand that ASUS boards work just fine.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Thank you very much for the reply. It would be great if I could isolate hardware vs. software problems. As you can imagine, it would be a royal pain to exchange the motherboard due to the mounting and configuration.

Are there any diagnostic tools that can isolate chipset problems? I assume you mean the ICH10R could be having the problem. It would be wonderful to isolate this quickly in this debugging quest.

Thanks
 

daship

Distinguished
If your drives are 7200.11, that would be the first place to start pointing fingers at. Those drives suck and even the new firmware isn't 100% stable. My guess is one or more bad drives.
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Wow....another person with such great information. 7200.11s suck? Really? That is why all 8 of mine have not had any problems running 24/7 for the last 15 months. And yet, 1 of my Raptors just died a few weeks ago. Surely those Raptors don't suck.

Wolf2: I don't know of any specific tools to test the southbridge/ICH10R.

Are you overclocking your cpu at all? This alone can cause problems with the southbridge if the voltage is not changed in the bios.

Also, have you tried changing the placement of your 5 drives among the 6 sata ports? Maybe it is a single port.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Thanks for the post and the suggestions. I would like to add my two cents about the 7200 RPM drives and give you folks an update on my latest endeavours with the setup.

First, on the drives: the 1TB NAS I had ran for 14 months non-stop before crashing. Inside the NAS were two WD Caviar 7200 SATA2 drives. Interestingly, I believe that the Unix software that came with the WD Book did something to mess up the drives.

After running Spinrite 6 on each drive individually, refreshing the surface, and plugging the drives back in, the NAS went back into operation. Nonetheless, I was upset by the incident as I lost my storage for 5 months until I stumbled onto Spinrite and was able to recover the data. Since then I have moved the data to the RAID5 server I am discussing on this thread.

Secondly, my RAID 5 server finally crashed beyond repair. The machine kept rebooting right after POST and the SATA drive inventory, just before going into Windows.

After many desperate attempts at recovering it, I ended up going back to the shop and having the tech run full diagnostics on the motherboard, memory, CPU, and hard drives. Luckily, all the components were in working order. We performed a BIOS upgrade on the Gigabyte board, re-initialized the RAID array, and started the trek of re-installing Win XP from scratch (wiping all my data; no worries, I back up regularly as I have learned my lesson).

The machine is back on the home network now and I plan to restore my data back onto the RAID array. My faith is a bit shaky on this ICH10R and the Intel Matrix Storage manager though. I am worried that as I get into higher amounts of data (over 1TB) on the Array, I might end up facing BSODs and other problems again.

I will give it another shot, but if it fails again, I believe I am headed for controller-based RAID, folks: one with a dedicated XOR CPU and onboard cache memory.

I will keep you posted.

 

neiroatopelcc

Distinguished
Oct 3, 2006
3,078
0
20,810
Are you sure the system is okay? The behaviour you describe resembles how the system would act if the memory were on the edge of stability (check memory and power supply).

As for the write speed - it can vary a lot with the Intel controller, according to Gigabyte.
For the last year or so I've been fighting to get my ICH9R RAID 5 (5x500GB, like yours) to work stably, and in the end I gave up and simply bought a single 2TB drive for storage, using the others as stand-alone drives with backup. It doesn't deliver the read performance of the RAID 5, but your network adapter limits read speed to under 125MB/sec anyway - and only if the other system uses the other gigabit port; otherwise it'll be around 12MB/sec max, slower than USB, which makes the need for a fast RAID non-existent.
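
Rough math on that network ceiling, for reference (a quick Python sketch; real SMB transfers land noticeably below these theoretical numbers):

def link_mb_per_s(megabits_per_s):
    """Theoretical payload ceiling in MB/s, ignoring protocol overhead."""
    return megabits_per_s / 8

print(link_mb_per_s(1000))   # gigabit port   -> 125.0 MB/s
print(link_mb_per_s(100))    # 100 Mbit port  ->  12.5 MB/s
# So unless both ends sit on the gigabit ports, the wire, not the array,
# is the bottleneck for file serving.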

Reply from Gigabyte regarding the less-than-stellar ICH9R speeds (assuming the same is true of any other software RAID controller):

Answer - 679114
Answer : Sorry for our late reply.

One reason is that your OS is located on the RAID; this causes the drops in transfer speed.
You will get much better results if the OS and the tools are installed on a separate disk.

Best regards

GIGABYTE-Team


Answer : Dear [edited],

You can't compare the performance of ICH9R RAID with a real hardware RAID controller that has its own processor and RAM.
The speed of RAID 5 is nearly the same as that of a single drive, and the complete Windows driver stack plus the Matrix Storage Manager software are involved. This all uses processor time and decreases the speed.

The reliability of the RAID is nearly the same as on a professional hardware RAID controller. That a rebuild sometimes starts might be caused by data loss due to faulty RAM or software problems. Don't overclock the system if running a RAID, to increase the reliability of the RAID.
You can check your memory with memtest86 in a long test run to make sure there is no error.
www.memtest.org

S.M.A.R.T. errors are not reported because the controllers in AHCI or RAID mode can't handle this information. It only works if the controllers operate in standard IDE mode.
Please check the drives with the diagnostic utilities from the drive manufacturers and replace any faulty drives.

Best regards

GIGABYTE-Team


Edit: Removed my name from the reply.
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690
I've had similar problems with both the 9 and the 10 many times. The problem doesn't come from running an OS off of a RAID 5 set, it just magnifies it. RAID 5 is never, ever a good choice for an OS/program drive, even with a hardware controller. Most of the software RAID 5 (mobo-based) setups I've done constantly rebuild or "resync" for no apparent reason. At the time, I just put up with it as a fact of life, as surprisingly it didn't drastically reduce performance as expected. With an OS installed on the RAID 5 set, however, any little glitch had the potential to blue screen the system. Gigabyte's advice was perfectly sound, especially from a troubleshooting point of view. The benefits of RAID 5 have been truly muddled over time, resulting in a lot of bad advice since its rise in popularity, painting it as the best thing since sliced bread. The only real benefit of RAID 5 vs. any other level is the increased capacity vs. cost ratio. Web servers are a possible exception, but even there a smart RAID 10 setup beats it. Mobo-based RAID 5 should only be used for storage, and only if price vs. capacity necessitates it. As to the exact cause of the problem for both the 9 and the 10? I never did solve it; I just moved on to the more appropriate hardware RAID. Mobo RAID is great for getting your feet wet, but should not necessarily be relied on for data security in RAID 5. I never had any issues with RAID 0, 1, or 10 (really 0+1 anyhow), only RAID 5.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Thanks for the info and insight, guys! It's great to get the conclusions of people who have faced problems with these PC builds, as it tends to save a lot of time and heartache.

Just to give you an update: since getting the machine back, I have been restoring the data onto the RAID array - close to 1TB of data, software setup files, etc. Things ran fine for a few days, and then it blue screened again last night. I am beginning to believe that the root cause comes down to a couple of things:

1) The volume of data managed by the mobo RAID. As it goes north of 1TB, it might choke - I started seeing these problems once I crossed the 1TB threshold. The machine ran fine for 1 full month with no problems while the overall data was less than 1TB. (Note that my array's total storage limit is 1.85TB, as I have 5 drives x 500GB each in the array.)

2) The OS installed on the same array, based on the last couple of posts. I also believe that the short reads/writes of the OS (and other software, for that matter) degrade the performance of the RAID array, as the drives are constantly cranking to update the cache, the virtual memory file, etc.

So my plan of attack for the next week is as follows:

a) Verify that Memory Modules are sane and not causing the BSODs
b) Update the BIOS to the latest possible version
c) Update the Intel Storage Manager to the latest version
d) Run this for another week in 24x7 mode (the data is already above 1TB, so if my assumption is correct I should see the problem)
e) Report back to this forum with the results; in case of a crash I will need help migrating to controller-based RAID

On a side note, Shadowflash, the allure of RAID 5 has always been the cost benefit, the ability to run with a drive down (hot-swap when you get a replacement), and the peace of mind that your system is resilient to failures. I realize that 0+1 can give you similar benefits, but it requires more money and more drives to reach an extended amount of storage (2TB+).
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690
Yep...the allure of cheap redundancy is pretty strong; however, price is not as large a factor nowadays, as TB drives are pretty cheap. I think the greater problem with RAID 10 is the increased physical space required, which some cases and PSUs cannot support. I run into that problem on builds more often than the financial concern of the drives themselves...

On a side note...the real reason RAID 5 should be avoided and why it's just a false sense of security to the end-user.
http://miracleas.com/BAARF/RAID5_versus_RAID10.txt
I personally have experienced this problem and was able to replicate it under controlled conditions. My opinion is that this phenomenon is the root cause of most RAID 5 mystery failures, especially using non-RAID model SATA drives, but no one wants to hear that anyhow....these inherent flaws have been known and ignored for a long time now.

Funny you should mention degraded array performance and rebuilding as an asset for RAID 5, as in reality, it is by far the worst of all redundant RAID levels at these tasks.

Even knowing all this and experiencing many parity related malfunctions, I too am continually tempted to use RAID 5....LOL
 

sammeow

Distinguished
Apr 29, 2009
1
0
18,510
ShadowFlash,

Can you elaborate more on "My opinion is that this phenomenon is the root cause of most RAID 5 mystery failures, especially using non-RAID model SATA drives"?

I think I have the problem with my new upgraded system right now. :pfff:
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690
First, a good percentage of people here will disagree with me on this. My theories and opinions are based on the theoretical workings of RAID plus extensive testing and real-world experience. In recent years, RAID 5 and 6 have become increasingly popular due to their inherent economic advantages and improved controller design. This has led to many people choosing these levels without proper consideration of the potential for disaster.

Did you read the link I posted? That should explain the inherent flaws in RAID 5 and 6. RAID 3 and 4 do not have this problem, as they provide a "free" parity check on reads, which 5 and 6 do not. That point aside, we come down to the drives and controller.....

The following are quotes from WD posted in other threads here....

Quote :

Question
What is the difference between Desktop edition and RAID (Enterprise) edition hard drives?

Answer
Western Digital manufactures desktop edition hard drives and RAID Edition hard drives. Each type of hard drive is designed to work specifically in either a desktop computer environment or on RAID controller.

If you install and use a desktop edition hard drive connected to a RAID controller, the drive may not work correctly. This is caused by the normal error recovery procedure that a desktop edition hard drive uses.

When an error is found on a desktop edition hard drive, the drive will enter into a deep recovery cycle to attempt to repair the error, recover the data from the problematic area, and then reallocate a dedicated area to replace the problematic area. This process can take up to 2 minutes depending on the severity of the issue. Most RAID controllers allow a very short amount of time for a hard drive to recover from an error. If a hard drive takes too long to complete this process, the drive will be dropped from the RAID array. Most RAID controllers allow from 7 to 15 seconds for error recovery before dropping a hard drive from an array. Western Digital does not recommend installing desktop edition hard drives in an enterprise environment (on a RAID controller).

Western Digital RAID edition hard drives have a feature called TLER (Time Limited Error Recovery) which stops the hard drive from entering into a deep recovery cycle. The hard drive will only spend 7 seconds to attempt to recover. This means that the hard drive will not be dropped from a RAID array.

If you install a RAID edition hard drive in a desktop computer, the computer system may report more errors than a normal desktop hard drive (due to the TLER feature). Western Digital does not recommend installing RAID edition hard drives into a desktop computer environment.


Quote :

Q: Regular 7200 RPM desktop drives run fine in RAID environments; why do I need these drives? A: Unlike regular desktop drives, WD RE SATA and EIDE hard drives are engineered and manufactured to enterprise-class standards and include features such as time-limited error recovery that make them an ideal solution for RAID.
Q: What is time-limited error recovery and why do I need it?
A: Desktop drives are designed to protect and recover data, at times pausing for as much as a few minutes to make sure that data is recovered. Inside a RAID system, where the RAID controller handles error recovery, the drive needn't pause for extended periods to recover data. In fact, heroic error recovery attempts can cause a RAID system to drop a drive out of the array. WD RE2 is engineered to prevent hard drive error recovery fallout by limiting the drive's error recovery time. With error recovery factory set to seven seconds, the drive has time to attempt a recovery, allow the RAID controller to log the error, and still stay online.


OK, that should start to explain the problems with using desktop drives in RAID configurations. Many people DO successfully use desktop editions in RAID. The problem is compounding errors. When using a fast controller card, parity overhead is reduced, thus allowing more time for error recovery. The advantage of a hardware controller card when using parity RAID is its ability to completely off-load system overhead. The problem with on-board RAID is that any "system hang", for any reason, can affect the stability of the RAID array. Rebuilding or re-synching takes time...lots of time...and usually a user will shut down or go to standby long before the process can complete. Even if your machine is "always on", what happens when there is another error before the first process is complete? I've used both 3ware cards and the Intel Matrix controller, and although their performance ( especially the 3ware card ) is admirable, my arrays were almost constantly rebuilding or re-synching. This did not cause all that much performance loss ( surprisingly ), but left unchecked it resulted in an ever-increasing number of errors. Worst case scenario, a drive is completely dropped from the array. The hoops you have to jump through to re-add the supposedly failed drive are ridiculous.

These problems, however, are not limited to on-board or software RAID, only magnified by it. The situations I've been able to replicate have occurred with mid to high-end controller cards with on-card XOR engines and battery-backed cache. The problem in those cases was partially dying drives which were not reported by S.M.A.R.T. and also did not trigger a failed drive. This occurred on an enterprise-level SCSI disk shelf where bad data was written, not reported, and subsequently bad parity information was written to disk. If this had been reported on a read access, perhaps I could have caught it in time, but due to the nature of RAID 5, it wasn't. The end result was not complete data loss ( luckily ), but a corrupt directory structure which caused a gobbledygook rebuild that had to be manually re-sorted file by file. I replicated this error on both an LSI MegaRAID 1600 enterprise and a Compaq Smart Array controller, both using the same set of 14 questionable disks. To this day, that exact same disk shelf is still in problem-free operation using RAID 10. Both now-ancient controllers are also still in problem-free use.
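
To make that failure mode concrete, here's a toy Python sketch (assuming a simple stripe of three data blocks plus XOR parity; real controllers are far more involved, but the principle is the same):

from functools import reduce

def parity(blocks):
    """XOR parity across equal-length blocks, as RAID 5 computes it."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

# Write one stripe: three data blocks plus their parity.
d = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(d)

# A drive silently writes garbage for d[1]: no error reported, no S.M.A.R.T. hit.
d[1] = b"BxBB"

# Normal RAID 5 reads never check parity, so the corruption is served as-is.
print(d[1])                     # b'BxBB'

# Later the d[2] drive fails and is rebuilt from the survivors plus parity.
rebuilt_d2 = parity([d[0], d[1], p])
print(rebuilt_d2 == b"CCCC")    # False: the rebuild propagates the corruption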

As you can see, I've had AND been able to replicate serious RAID 5 related errors with ALL forms of controllers. This substantiates the theoretical flaw in RAID 5. Just look through the forums, here and elsewhere, and see how many failed RAID 5 recovery questions exist. Now compare them to the iron-clad RAID 1 and the slightly more volatile RAID 10, and you'll begin to understand why I dislike RAID 5. From a performance standpoint, RAID 10 almost always beats RAID 5 anyhow, so why use it on a desktop or workstation? Web servers are the only exception I can think of in terms of performance, and even that is questionable. My understanding, in fact, is that Oracle database servers are increasingly moving away from RAID 5 for both performance reasons AND this very issue.

The only thing that magnifies these problems more than on-board controllers and desktop edition drives is the use of RAID 5 as an OS/system drive. ANY small error will more than likely cause boot problems, further restricting your ability to take action. Not to mention, small random writes are the weakness of parity RAID and exactly what the OS needs to perform well, hence the instability associated with 1st generation SSDs.

By all means, don't believe me; do your own research and you'll find this is not a "mystery problem" but a well-defined flaw. I am not a professional, just a RAID junkie who's done a lot of testing in search of the holy grail of storage systems.....RAID 5 is not it.

AFAIK, there is no way to prevent ANY of these problems from occurring. Many people will tell you that I'm just too paranoid and that they have never had any of these problems. That does not mean the potential for these scenarios does not exist. Use parity RAID at your own risk.

Sorry for the essay....but you asked me to elaborate. This is actually the "short version" LOL......
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Thanks for the extensive post, Shadowflash! I will tell you that ever since the failure of the server and its rebuild, I have been uneasy about its full use within my home network. Also, the constant churn of the disks and the sometimes slow transfer rates are beginning to push me over to your camp (for desktops at least).

So let me ask you this, would the following setup, based on your experience, yield the most reliable RAID system using SATA2 drives?

2 Disks in Raid 1 - For Operating System and Application Files

6 Disks in Raid 10 - For Data files

Intel Matrix Storage Manager in lieu of a hardware controller.

Quite frankly, this will be a much cheaper solution for me than opting for a hardware controller. The cost would be three additional 500GB disks and reconfiguring the system. On the other hand, a hardware controller would cost four times as much. As for speed, going with what you are suggesting eliminates the parity calculation and synching, which should significantly speed up data transfer.

I would also believe that with the data spread across three disks you will get faster read speed, so the system should see an improvement in that area as well.
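
Putting rough numbers on it (a quick Python sketch, assuming 500GB drives throughout):

DRIVE_GB = 500

raid5_usable = (5 - 1) * DRIVE_GB        # current: 5 drives, single parity
raid1_usable = (2 // 2) * DRIVE_GB       # proposed OS mirror: 2 drives
raid10_usable = (6 // 2) * DRIVE_GB      # proposed data array: 6 drives

print(f"RAID 5, 5 drives:           {raid5_usable} GB usable")
print(f"RAID 1 + RAID 10, 8 drives: {raid1_usable + raid10_usable} GB usable")
print(f"Extra drives to buy:        {8 - 5}")
# -> 2000 GB usable either way, so the three extra drives buy redundancy
#    and simplicity rather than capacity.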

What are your thoughts?

One other question: I tried to run memtest86 but it did not work. The system would boot the CD and stop at the "Loading......................" prompt, but nothing happened. Can you recommend another tool for memory testing? Can SiSoftware Sandra do this?

Thanks,
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690
My home server doubles as a workstation ( also an 8 drive set-up ), so I run 3 RAID sets: RAID 1 for the OS, and a RAID 0 for large apps and the log file. I use the extra space on the RAID 1 for set-up files, drive images, and such. The extra space on the larger RAID 0 array I use as a scratch disk for Nero or a download drive. The remaining 4 drives I run in RAID 10 for network-accessible storage.

The Intel Matrix controller should be fine; it's pretty decent on a 6 drive RAID 10. It is suggested that you put the log file on a separate spindle, but in the case of a small home server, I doubt it would be that big an issue.

You could try this little real-world experiment I like to do for testing.....use "media player classic" and set the options to open a new window for each new media file clicked. Then, start opening movies and see how many you can simultaneously stream before any get choppy. Lay them all out across your desktop and randomly jump to different parts of the movies and see how it does. If you already have a RAID 5 set-up, try it there first so you can compare later. This really isn't a very accurate way to test as opposed to benchmarking, but it can simulate some real-world home server loads and give you a general "feel" of performance.

I can't help you with memtest though...sorry...
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Excellent info, Shadowflash! What has your experience been with setting up stripe size on these RAID sets with the Intel Matrix Storage Manager? There isn't great guidance out there, not even in the Intel technical manuals.

I have read somewhere that it should be aligned to the NTFS block (allocation unit) size, but it was not 100% clear what works and what doesn't. The default is 64KB for the stripe size, but it can be set lower or higher.

Your thoughts would be appreciated.

Thanks
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Shadowflash, I could really use your help. I just moved from XP x86 to Vista x64 with a 3ware 9650SE-8 and 3 500GB drives in RAID 5 on the 3ware (I know what you're thinking, but I will be going back to RAID 10 very soon). Apparently, 3ware doesn't have the best Vista/Server 2008 64-bit support, and I had to update the firmware to get Vista to work. I need Vista x64 because I use Adobe CS4 (Premiere Pro, PS & AE) and I must have 8GB of RAM. And I must use my 3ware card for storage, while the OS + apps reside on 4 Raptors in RAID 10.

My problem is the read speed of my RAID 5 array: 35MB/s when copying, versus 150MB/s for writes. It wasn't like this under XP. HD Tune Pro is able to show read speeds up to 112MB/s with 256KB blocks and only 50ish with 512KB blocks. The stripe is 64KB, btw. I have 3 Raptors that were part of a RAID 10 (1 died a few weeks ago) and they had XP on them. Since I'm not using XP anymore, I created a RAID 0 array first and then a RAID 5 array to test my 3ware card. The 3 Raptors had no problems in either RAID 0 or RAID 5, and I used the same stripe for that RAID 5 array as for my 3x500GB array.

I bought a 1.5TB drive to back up everything, as well as backing up important data to the Raptor RAID 5 and to the RAID 0 array on the 4 Raptors connected to the onboard Intel (the RAID 10 holds the OS; I used Matrix RAID). So everything is backed up twice, and I was thinking of deleting the 3x500GB RAID 5 array and creating a new array to see if that fixes it. I'm just very hesitant to delete all that data.

Do you or anyone else have any ideas as to what the problem is?

Could it be something XP did that doesn't work well with Vista?

Also, Shadowflash, you said your 3ware RAID 5 arrays were also rebuilding all the time. Are you certain they were rebuilding and not verifying? Since I updated the firmware, I came across a new feature that allows a schedule for verifying instead of it being automatic. Also, with my 3ware card, my PC alarm/speaker will let me know if there is any problem with a drive. I don't know if there are newer features that weren't in the cards you were using, or if it's something else.


PS Its nice to see another Raid junkie here.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Greetings all,

All right, here is an update to my saga with the ICH10R Raid 5 Failures. First of all, to answer one of my own questions posted earlier, Memory Tester from HCI Design can be downloaded for free to test your computer's memory and check whether there are any errors in read or write. The free version, for home use only, will require you to open multiple instances of the tool at the same time to consume all available memory.
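
For anyone curious, these testers all boil down to write-a-pattern-then-verify; here is a minimal Python sketch of the idea (illustrative only - a user-mode script cannot scrub memory the way memtest86 or a dedicated tester does):

import os

CHUNK_MB = 64

def test_chunk():
    pattern = os.urandom(CHUNK_MB * 1024 * 1024)   # random fill pattern
    buf = bytearray(pattern)                       # write it into a fresh buffer
    return bytes(buf) == pattern                   # read it back and compare

for i in range(8):                                 # touch a few hundred MB of RAM
    if not test_chunk():
        print(f"mismatch in chunk {i} - suspect the RAM")
        break
else:
    print("no mismatches (which does not prove the RAM is good)")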

Running the memory test on my machine with no problems has squarely put RAID 5, the ICH10R, and Intel Matrix Storage Manager as the root cause of the BSODs I have been experiencing. This is a good point in my journey, as I now know that the memory is OK and I need to do something different with the RAID setup.

Following the suggestion from Shadowflash, I gutted my server last night to add additional hard drives for the new RAID setup. The first problem I hit was running out of SATA ports! Yes, believe it, my friends: when I bought the mobo I thought, who could ever want 8 SATA ports? Well, it turns out that I do! One of the SATA ports is taken up by the CD-ROM, leaving 7 SATA ports available between the ICH10R and the Gigabyte SATA (GSATA) controller.

Faced with this problem, I changed my configuration to have 6 SATA drives hanging off the ICH10R, and 1 SATA drive and 1 DVD drive off the GSATA controller. I figured I would run the 6 SATA drives off the ICH10R in RAID 10 and live with one drive for the OS (realizing that it would be a single point of failure). Then I hit the second problem: the ICH10R only allows 4 drives max in RAID 10! And it's RAID 0+1, not RAID 1+0. From my research, RAID 1+0 is better than RAID 0+1, but there is no option within the ICH10R for 1+0, even though they call it RAID 10.

Having hit this second problem, I had to change my configuration again. I ended up configuring two of the 6 SATA drives on the ICH10R in RAID 1 for the OS and putting the remaining four disks in RAID 0+1. On the GSATA controller I installed one disk (no fault tolerance) and the DVD drive. I made the required configuration changes in the BIOS, added the member disks to two RAID volumes, and installed Windows XP.

After the install and driver patching (updating Intel Matrix Storage Manager to 8.8), the system is up and running again. So I ended up with the following configuration:

RAID-1 Volume with 500 GB
RAID-10 (0+1) Volume with 1TB
500GB scratch disk (no fault tolerance)

I will start the restore process and the burn-in during the upcoming week. All in all, while it was not what I expected, the final setup is pretty good, with a lot of fault-tolerant storage (1.5TB worth) and 2TB of total storage.

I will keep you folks posted with the results of this reconfiguration and the performance within a couple of weeks.
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Wait, wait, wait. RAID 10 is 1+0. It was Intel's idiot engineer who wrote the BIOS wrong and put 0+1 next to RAID 10.

Here is a quote from Intel's site, from the "Intel Matrix Storage Manager" page:
"A RAID 10 array uses four hard drives to create a combination of RAID levels 0 and 1 by forming a RAID 0 array from two RAID 1 arrays."

"Intel® Matrix Storage Manager
RAID 10 Volume Recovery

A RAID 10 volume will be reported as Degraded if one of the following conditions exists:

* One of the member hard drives fails or is disconnected.
* Two non-adjacent member hard drives fail or are disconnected.

A RAID 10 volume will be reported as Failed if one of the following conditions exists:

* Two adjacent member hard drives fail or are disconnected.
* Three or four member hard drives fail or are disconnected. "

Look at the "two non-adjacent member hard drives fail but the volume is only degraded" part - this means it is 1+0, because a 0+1 is only guaranteed to survive a single drive failure. With 0+1, there are 2 sets of RAID 0, and when 1 drive fails, all that is left is a single RAID 0 array.
With a 1+0 array of drives A, B, C & D, where A+B form mirror X and C+D form mirror Y, one drive from each of X and Y can fail without losing data.
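
To make the adjacency point concrete, here is a small Python sketch (the A+B / C+D pairing is an assumption for illustration; the controller decides the actual grouping):

from itertools import combinations

def survives_10(failed):
    # RAID 1+0: mirrors (A,B) and (C,D), striped together.
    # It survives as long as each mirror keeps at least one live drive.
    return not ({"A", "B"} <= failed or {"C", "D"} <= failed)

def survives_01(failed):
    # RAID 0+1: stripes (A,B) and (C,D), mirrored.
    # It survives only if at least one whole stripe is untouched.
    return not (failed & {"A", "B"}) or not (failed & {"C", "D"})

for pair in combinations("ABCD", 2):
    failed = set(pair)
    print(pair, "1+0 survives:", survives_10(failed), " 0+1 survives:", survives_01(failed))
# Two non-adjacent failures (e.g. A & C) leave 1+0 degraded but kill 0+1,
# which matches Intel's "degraded, not failed" wording quoted above.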
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690


There is a whole range of opinions on stripe size, and at one point I focused quite a bit on this. Unfortunately, I ended up driving myself insane with constant benchmarking of different stripe sizes and NTFS block sizes. In theory, setting the block size to the stripe size would be a good idea; however, it never works, because you have no way to be sure that the blocks are precisely aligned with the stripes. Years ago, someone figured out how to do it on a specific nForce chipset, and he did get some crazy fast benchmarking results, but the "trick" was only good under very specific conditions, with very specific hardware. I actually tried finding the thread that described how a while back, but never could. It was somewhere at StorageReview, that much I remember. I usually run 64k for my OS/program drives, which increases random IO, and 32k on my storage drives, which increases sequential performance. Deviating any further from this did yield more performance in the specific area needed, but sacrificed too much in other areas IMHO. Block size never seemed to make that much of a difference to me, but admittedly I gave up testing on that. I just use the default size and call it a day.
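
Here's a small Python sketch of the alignment issue with assumed numbers (the sector-63 partition start was the old XP default, which is what usually breaks the block-equals-stripe idea):

STRIPE = 64 * 1024            # 64 KB stripe unit
CLUSTER = 64 * 1024           # NTFS cluster sized to "match" the stripe

def units_touched(offset, length, stripe=STRIPE):
    """How many stripe units a single write of `length` bytes at `offset` spans."""
    return (offset + length - 1) // stripe - offset // stripe + 1

# Partition aligned on a stripe boundary: each cluster write hits one unit.
print(units_touched(offset=10 * CLUSTER, length=CLUSTER))              # -> 1

# Old XP default: the partition starts at sector 63 (63 * 512 bytes in).
xp_offset = 63 * 512
print(units_touched(offset=xp_offset + 10 * CLUSTER, length=CLUSTER))  # -> 2
# Every "aligned" cluster write now straddles two stripe units, and each
# partial-unit write on RAID 5 costs an extra parity read-modify-write.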

I had no idea there was a 4 drive limit for RAID 10, as that's all I ever used with the on-board controller. 8 ports with a 4 drive limit sucks. Sorry to hear about the DVD drive being SATA. I hope it still works out for you.
 

ShadowFlash

Distinguished
Feb 28, 2009
166
0
18,690
@ specialk90.....

It's entirely possible that my 3ware array was verifying some of the time and rebuilding at other times. The problem was, this was an "always on" machine, and even after months, the constant activity did not stop. After 2-3 months, the BSODs would start. This was back on an XP 32-bit machine. When I went to Server '03 64-bit, the array would not function correctly; I chalked it up to bad 64-bit support too. I still use that RAID 10 array in a 32-bit XP machine at work for my "off-site" back-up ( I host work's backups at home in exchange ). The scheduled verify is a nice addition. I loved my 3ware card for performance, but it was definitely "feature" weak.

I can't say I've seen READ speeds that low with any RAID 5; reads are, after all, what it's good at. 150MB/s read and 35MB/s write would sound more normal to me for a slow RAID 5 array. Why are you using block sizes that big with a 64k stripe? What you are essentially doing is forcing EVERY write to be striped, which could explain the high write speeds. That same effect means that on reads, multiple spindles must be synched for ANY file, though this should be a positive for large sequential access. I've never had good luck with the larger block or stripe sizes. As a rule, I usually keep block size at least a little smaller than stripe size. As I said to wolf, I've seen increasingly diminished returns when straying too far from the default 64k, and I've actually gone backwards by going too far. How does the array benchmark with the Windows default block size? That would probably be my first step. I usually benchmark first at default stripe and format at default block size to establish a baseline. Then, if I'm tweaking for a specific purpose, I rebuild the array and test with different stripe sizes first. When I have my desired stripe size, then I start working with block sizes and testing. If you do go this far, I'd be curious what the results are, because this drove me a little crazy when I did it. It took 2-3 weeks of constantly rebuilding and benchmarking, and I still couldn't establish any clear pattern. I think this is why there are so many different opinions on the subject.

When you say "while copying", is that referring to a benchmark or a "real-world" file transfer? If it benchmarks OK but doesn't perform well in real life, it could be the OS. I know almost nothing about Vista ( I know, I'm stubborn ), but XP and Server '03 always had problems with large file copies, even 64-bit. I used a program called TeraCopy, which helped speed things up. The problem was with file copies that exceeded available resources. My typical copies would be 50-100GB in size, far surpassing my available RAM ( someday it won't...LOL ), and they would just choke. If you don't want to hassle with deleting and rebuilding, try that program first and see if it helps. There are registry hacks that do the same thing, but I avoided those because they are always in effect, and you don't want that for normal file copies.

Sorry it took so long to get back to you guys....I've finally "cleaned my plate" of all my "side job" projects. Today's projects: a custom-wired PWM 6-fan hot-swap control board, shoe-horning a SCSI U320 rack into a case it doesn't belong in, and resurrecting my old quad-socket Opteron server. Good luck to us all........
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Shadowflash, thank you for your comments so far.

1) The allocation/block size is the Windows default. The larger block size I was referring to was within HD Tune Pro while benchmarking.

2) I have a 3 drive RAID 5 with Raptors, using the exact same stripe (64k) and block size (default), and it's on the 3ware card. This array is extremely fast, with 140+MB/s writes and reads while copying files.

I have a tendency to confuse people with what I write so let me try and start over.
These are my specs:
1) 4 74GB Raptors connected to the Intel (ICH8R), using Matrix RAID with a RAID 10 for the OS and a RAID 0 for Adobe stuff. Both arrays are fast, with the RAID 0 array averaging 225MB/s read/write speed (according to HD Tune)

2) 3 500GB(7200.11) in Raid 5 on 3ware 9650, stripe=64, allocation=default

3) 3 150GB Raptors in RAID 5 on the 3ware 9650, stripe=64, allocation=default

4) 1 1.5TB(7200.11) on motherboard(ICH8R)

I use the RAID 0 array for the real-world copy tests, and everything except #2 works like a charm.

On a side note: 3ware's 64bit support is seriously lacking. I'm rather glad LSI bought 3ware so maybe now their cards will have better support and more features.

In XP Pro x86, my write speeds were about 90MB/s but that was with only 100GB free. I once had this card in a Vista x86 and it performed just fine.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Quick update folks,

After the changes were made, the system has been running superbly. It's close to 1 week now without a single BSOD. The other interesting thing is that I was amazed at how snappy the system has become with the change. It truly does feel like a quad-core machine with a 1GB Radeon graphics card now.

The response is excellent, running multiple tasks at the same time is a snap, and the hard drive churn feels like it's at 10% of what it used to be with RAID 5. It seems that software RAID 5 is a bit of a stretch, and the amount of power that gets wasted does not justify the potential redundancy.

I will keep you posted as the acid test is for the server to run for 1 month without a crash in a 24x7 operation.
 

wolf2

Distinguished
Apr 10, 2009
16
0
18,510
Ok folks, some sad news!

Close to two weeks into operation, the server started experiencing BSODs again. They range from VMX.sys to iaStor.sys errors. The machine stays up for between 2 and 4 hours before crashing with errors such as iaStor.sys. I know that the iaStor.sys error is related to the Intel Matrix Storage Manager and the driver for the ICH10R chipset.

Before any suggestions from the readers, please realize that I have updated the BIOS, updated the drivers, and tested the RAM, CPU, and motherboard. Everything is updated, checks out, and seems to be working fine. The setup I have now is:

2 Drives mirrored as OS volume
4 Drives Raid 10 as Media volume
1 Drive as scratch with no fault tolerance
Intel Matrix Storage Manager(iMSM) 8.8

All these drives are 500GB. I have two options at the moment: either try hardware RAID and avoid any software/mobo RAID combination, or simply forgo RAID altogether and stick with regular drives plus a good backup routine.

Needless to say, I am a bit frustrated with the whole issue, especially after dropping a couple of thousand dollars on an unreliable system.
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
I don't remember if you ever told us the cooling setup you have for the drives.

For this to suddenly come back sounds like it could be a heat issue, which could involve the southbridge, any of the drives, or even the video card, or possibly something else entirely.

Have you considered the higher-speed RAM, its compatibility, and/or the RAM voltage? One thought I have is that the board uses the RAM as write-back cache, and a possible error with the RAM could cause this problem.

What could be happening is that an error corrupts something to do with the Intel driver(s). It seems like something is occurring at random that causes corruption, because you were able to reinstall everything and have it last for over a week without a problem.
 

sub mesa

Distinguished
You're dealing with software errors that result in crashes. This could be due to faulty hardware (so check whether the chipset could be overheating), or it could just be a driver bug. Since Intel supports more features, its driver is larger and thus more complicated.

Basically you should:
1) Check the chipset temperature by touching its heatsink. If it feels warm, that's normal; if it hurts so much that you must pull your hand away, it might be too hot. Adding a fan might help too.
2) Re-check your memory with memtest86. It's the best memory test there is because it tests all memory, including memory that iaStor.sys might already have in use and that therefore doesn't get tested by any in-Windows application. You should download the UBCD (Ultimate Boot CD), which contains memtest, or any Ubuntu Linux CD, which also contains memtest (it's in the menu you get when booting from the CD). This might remedy the boot problems you had with the memtest86 ISO.

Also, I would like to stress that running software RAID 5 under Windows has several disadvantages. First, you have stripe misalignment, which amplifies the RAID 5 small-write performance penalty and causes more I/O than necessary. Also, all software RAID 5 solutions available on Windows do a very bad job, with one exception being the ICHxR drivers, which have 'write caching' that can give you at least moderate performance. Obviously they don't work well for you, and you probably should look at another solution.

Have you ever thought about building a computer dedicated to handling all your storage? A RAID NAS might not be something casual computer users have, but it would allow you to run advanced RAID 5 setups using the ZFS filesystem, with its own implementation of RAID 5: RAID-Z. Combining both filesystem and RAID engine in one package allows for variable stripe sizes, so the 2-phase writes of RAID 5 disappear, adding to performance. ZFS is also packed with features that would allow for a maintenance-free, corruption-resistant and self-healing storage solution.

If you would like to know more about this path, I'll indulge you, but check the two points addressed above first. If your hardware really is working properly, that would imply the Intel drivers are at fault here. If that is the case, it's worth looking at an alternative solution.