help with massive RAID system

grafixmonkey

Distinguished
Feb 2, 2004
435
0
18,790
Hi guys, many thanks for the help building my graphics workstation and its huge RAID arrays. Now I've got another big project, this time for a company I'm working for, and I need a little help with some throughput issues. The issues are complex, so sorry for writing a novel. I do have a couple quick questions that I'll put up at the beginning though.

Quicky #1: what's the best stripe size for serving large files over the network? I assumed smaller was better (16K, which is what it was set to), because the machine is serving Ethernet-packet-sized chunks, but I don't know how caching and interrupt coalescing on the Ethernet card factor in. And how does stripe size factor into raid-50? If you write 64k chunks to the three "logical" raid-5 arrays that make up the raid-0, are you really writing ~21k chunks to each disk? Or are you writing 64k chunks to each disk, making the effective chunk size for the raid-0 array (of raid-5 arrays) 192k?
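
For reference, here's the arithmetic I'm picturing for those two readings, as a little Python sketch. The layout is an assumption on my part (I genuinely don't know which reading the 3ware card uses), and note that if only the two data disks in each 3-drive raid-5 count, the per-disk chunk comes out to 32k rather than 21k:

def raid50_stripe_math(stripe_kb, subarrays=3, disks_per_subarray=3):
    # a 3-drive raid-5 has 2 data disks plus parity
    data_disks = disks_per_subarray - 1

    # Reading A: the stripe size is what the raid-0 layer hands each raid-5,
    # so each sub-array splits it across its data disks.
    per_disk_a = stripe_kb / data_disks                  # 64k / 2 = 32k per disk
    full_stripe_a = stripe_kb * subarrays                # 64k * 3 = 192k across the raid-0

    # Reading B: the stripe size is the per-disk chunk inside each raid-5,
    # so the effective chunk per sub-array (and per raid-0 stripe) is bigger.
    per_disk_b = stripe_kb                               # 64k per disk
    full_stripe_b = stripe_kb * data_disks * subarrays   # 64k * 2 * 3 = 384k

    return (per_disk_a, full_stripe_a), (per_disk_b, full_stripe_b)

print(raid50_stripe_math(64))   # ((32.0, 192), (64, 384))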

Quicky #2: I had a hard time finding benchmarks, so which is faster, raid-5 or raid-50? Raid-50 supplies more redundancy than raid-5, so it should be slower, correct? But one benchmark I found showed 50 as much faster, and one showed them being the same (though that one was very poorly done; I think it was limited by the PCI bus). Those are the only two benchmarks I've managed to find.


So on to the big problem. A little background... We capture old 8mm/16mm video film to digital video, from multiple stations around the office, edit it in Premiere and export it to DVD. Before I joined, they used sneaker-net with external firewire hard drives (not a good system, but functional) so I took on the task of building them a centralized network storage machine that could handle the load of uncompressed video flying around everywhere.

So, on to the hardware... I built this system for them to handle the networking:
Tyan S2720-533 (http://www.tyan.com/products/html/thunderi7501.html)
(with dual gigabit ports and a single 10/100 port; upgradable to as many as 14 gigabit ports)
3Ware Escalade 9500S-12 (http://www.3ware.com/products/serial_ata9000.asp)
Nine 250GB SATA drives in hot-swap enclosures. Initially all drives were put into a RAID-50 (i.e. raid-0 of three raid-5 arrays of three drives each, I think; 16k stripe size).
Windows XP Pro (need to learn to use linux for this stuff, I know)

Here's the problem. The stations do everything in real time, so the video each station puts out has to be read and written in realtime during editing, and written in realtime during capturing. SiSoft Sandra reports that the array has PLENTY of random read and random write throughput, but the system seems to get hammered when the typical number of stations is working at the same time. I don't think it's network bandwidth, because the office is divided into two gigabit segments and I can have three stations doing stuff on segment 1, but one station starting up on segment 2 after that kills it. The three network ports are bridged in Windows, which seems to be smart enough not to send traffic through to other ports if it's not destined for that segment (at least, one switch can be blinking like nuts while the other is silent). So it's not the network bridge. I don't much trust the Windows performance monitoring applet (the one in AdminTools), but it does say that the hard drive array is running with only 5%-7% idle time once three stations are using it. It also says that it's spending 170% of its time in reads, and another 130% of its time in writes... so... yeah, thanks Windows...

Last night I divided the single raid-50 array into multiple smaller raid-5 arrays and a couple of single disks. I'm hoping that way we can use one array for capturing, one for exporting, etc., and get more total throughput out of the same number of drives, since RAID doesn't exactly scale performance linearly (especially random read/write performance). But it would be much more efficient for us to have a single array.

So does anybody have some performance-improving suggestions to get this single-array system working right? All we need is about 4-ish MB/sec per station: write-only for the two capturing stations, and both read and write for the three editing/exporting stations.

SiSoft Sandra said last night that the 9-drive raid-50 array had 110MB/sec random read and 30MB/sec random write, and warned that this discrepancy between read and write performance might be due to a write-verify setting being on. I can't find that setting if it's there. It's not in the 3ware card's BIOS or its management software; I looked there last night.

Another note. Jumbo frames did not seem to help, for whatever reason. I specifically built the network to support them, but apparently the problem gets worse when they are on.

And, anybody know of any software that can monitor the amount of bandwidth each station is using? Apart from starting the windows performance monitor on each station, because when they are exporting / capturing they should not be refreshing that thing, and it would be hidden behind Premiere where you can't see it anyway.
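
Worst case, I guess I could script something crude like this on each station to log its NIC throughput every few seconds (this assumes Python plus the psutil package on each machine, and the interface name is just a placeholder for whatever the adapter is actually called):

import time
import psutil

INTERVAL = 5  # seconds between samples

def log_throughput(iface="Local Area Connection"):   # placeholder adapter name
    last = psutil.net_io_counters(pernic=True)[iface]
    while True:
        time.sleep(INTERVAL)
        now = psutil.net_io_counters(pernic=True)[iface]
        sent = (now.bytes_sent - last.bytes_sent) / (1024 * 1024) / INTERVAL
        recv = (now.bytes_recv - last.bytes_recv) / (1024 * 1024) / INTERVAL
        print(f"{time.strftime('%H:%M:%S')}  out {sent:.2f} MB/s  in {recv:.2f} MB/s")
        last = now

if __name__ == "__main__":
    log_throughput()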

And, should I even bother trying a single raid-5 if a single raid-50 didn't work? Raid-0 is out, because the likelihood of failure in a 9-drive system is too high, and if we lost all our data suddenly it would be pretty bad.

Thanks to all the hardcore people who read this far. Give me some hardcore advice so I can get full use out of this system! :cool:

Edited by grafixmonkey on 08/31/04 02:08 PM.
 

Crashman

Polypheme
Former Staff
Wow, I didn't read your whole post, it's too long! But I always thought RAID50 was a combination of 5 and 0, for example:

2 RAID 5 arrays, striped together

That gives you the redundancy of RAID 5 if one drive fails, but also gives you the improved throughput of having twice as much theoretical bandwidth.

Only a place as big as the internet could be home to a hero as big as Crashman!
Only a place as big as the internet could be home to an ego as large as Crashman's!
 

grafixmonkey

Distinguished
Feb 2, 2004
435
0
18,790
Yeah, it's long I know. The whole problem is the throughput of multiple stations doing that specific task over the network I set up, and I had to explain the whole setup. If I just said I was doing video, people would be like "oh that's sequential read/write, you'll have no problems with the array" but apparently I need big random read/write performance.

I gathered that raid-50 was raid-5 arrays striped together, from the formula I have for the resulting capacities of different raid arrays. For raid-5, capacity = (#drives - 1) * (size of smallest drive). But for raid-50, capacity = (#drives - #subunits) * (size of smallest drive), so it has to be multiple striped 5's.
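
Written out as a quick sketch (the raid-50 formula is just my reading of how the sub-arrays stack, so treat it as an assumption):

def raid5_capacity(drives, drive_gb):
    # one drive's worth of space goes to parity
    return (drives - 1) * drive_gb

def raid50_capacity(drives, subunits, drive_gb):
    # one drive's worth of parity per raid-5 sub-array
    return (drives - subunits) * drive_gb

print(raid5_capacity(9, 250))        # 2000 GB for a single 9-drive raid-5
print(raid50_capacity(9, 3, 250))    # 1500 GB for three 3-drive raid-5s striped together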

Nobody much seems to have played with raid-50 and posted benchmarks, though. Too bad SCSI would have tripled the cost of this machine or more, because I bet it would have had higher throughput. 2 terabytes of SCSI is just prohibitively expensive.


So, here are some things I think would help, if anyone out there knows how to do them...

* Can the system memory be devoted entirely to caching? Having a multi-gigabyte buffer, especially if it's managed intelligently and will clump writes to the same file together, would help a lot.

* Can network card settings be tweaked in any way to help with this? For example, I thought my read/write performance would be less "random" and more "sequential" if I turned Jumbo Frames on for all machines (and my switches support it btw), but it seems to make it worse and I don't know why.

* What about the capture/export stations themselves? Can they be told to have a larger amount of cache for network output, so that frames don't get dropped the instant the hard drives all happen to be busy when it writes? I'm guessing network cache is in the lower KB right now, and it needs to be in the upper 10's or 100's of MB. I don't know if network cache is stored on the NIC itself, or if it uses system memory for that (assuming system memory)...

* Anyone know about this "write verify" thing and why my write performance would be low? Because it's 1/3 of my read performance, and SiSoft Sandra says that's not quite normal?

Still crankin' at this... thanks guys
 

Crashman

Polypheme
Former Staff
Think about it: If your ideal RAID5 array used 4-5 drives (we'll pick 4) and the minimum number of drives for RAID5 is 3:
1.) You'd need at least 8 drives to run an ideal RAID 50
2.) You'd need a minimum of 6 drives to run any RAID 50.
3.) All those drives would be limited to the size of the smallest, so they might as well match.

That puts RAID50 out of reach for the average enthusiast, let alone gamers.

Only a place as big as the internet could be home to a hero as big as Crashman!
Only a place as big as the internet could be home to an ego as large as Crashman's!
 

woz

Distinguished
Jun 25, 2004
44
0
18,530
I do not know what your SATA controller will support, but you have plenty of hardware, because any one of those drives should be able to capture video in real time unless they are keeping it uncompressed. If you have 3 workstations and 9 disks, why not just break your drives into 3 level-5 RAIDs with 3 drives each? Do this for a starter, to isolate out other issues (like bus and network).

But really, check and see if your editors are using any compression. Also, do the clients have gigabit network, or just 100Mb?
 

grafixmonkey

Distinguished
Feb 2, 2004
435
0
18,790
Nine 250GB SATA drives. Tried it in Raid-50 first, with all the drives in a single array. That supported the most simultaneous video tasks, but could not do everything at once. Today we tried it with the following configuration:

Array 1: 4-drive RAID-5
Array 2: 3-drive RAID-5
Array 3: 1-drive independent disk
Array 4: 1-drive independent disk

The network consists of two independent segments of copper gigabit Ethernet, with three stations on one segment and four on the other. The RAID server also has a 100-megabit port uplinked to our router (internet connection), and the three network connections are bridged together in Windows XP Pro. I'm beginning to wonder about the bridge, and whether it is wasting Ethernet bandwidth by echoing packets to segments that don't need them, but it really doesn't look like that's the case.

We found that the independent disk arrays could not support an mpeg-2 video stream exporting to them in real-time across the network. That surprised me, because these drives are faster than the external firewire drives we've been using up till now. The two raid-5 arrays could each support a couple simultaneous tasks, but we still couldn't get everything going at once. I'm hesitant to go to raid-0 because of the chance of failure.

Got some better debug information today. I watched windows performance monitor like a hawk while my coworkers worked over the network, and every time they reported dropped frames it was because one of the drive arrays got down to 10%-0% idle time. So it is the arrays, and not the network, that do not have enough throughput. Sandra says the raid-50 array had 110MB/sec random read (yes I said random read, it's freaking fast) but only 30-35MB/sec random write, and warned that something might be doing write verification. But I can't find a write verify setting anywhere... Anyone know where it might be? Possibly a feature of the hard drives themselves? Should I try disabling SMART or running some kind of Western Digital diagnostic utility on the drives to change something?

Here's one more oddity. When dropped frames occurred, the typical drive array usage was only 50% to 60%. Occasionally a couple samples would read low, like in the lower 10's and 20's, and then the usage would jump way up to 90%, and then back down to 50%-60%. So I think that the hard drives are "taking breaks" or something, causing the system cache to fill up and then dump a bunch of data to disk when they come back.

Average "disk bytes per read" and "disk bytes per write" were 64,000-ish, right around the stripe block size of the arrays. (I created this set of arrays with 64K blocks instead of 32K blocks like the last one.)

I really want this thing up and running. I really think it should be able to handle this throughput. Here's a better throughput breakdown for people. Some stations read and write at the same time, some only write, some only read. In addition, some must do it in real-time, and some just read at the rate they can get and the speed of that task changes accordingly. The data rate for our video is 4.5-ish MB/sec. (DV compression.)

Station    Writes        Realtime?    Reads         Realtime?
   1       4.5 MB/sec    yes          0             X
   2       4.5 MB/sec    yes          0             X
   3       1 MB/sec      yes          4.5 MB/sec    yes
   4       1 MB/sec      yes          4.5 MB/sec    yes
   5       1 MB/sec      yes          4.5 MB/sec    yes
   6       ? *           no           0             X
   7       ? *           no           0             X

* - captures locally and then transfers a file in a single block transfer, from its hard drive, to the network. This has not been going on when the dropped frames occur.

Also, occasionally, stations 5 and 7 perform different tasks:
   5       4.5 MB/sec    yes          0             X
   7       0             X            4.5 MB/sec    yes


So we have between 12MB/sec and 16.5MB/sec writing, and between 13.5MB/sec and 18MB/sec reading, simultaneously. Cross reference that with SiSoft Sandra random read/write performances for the single RAID-50:
Random Read: 110MB/sec
Random Write: 35MB/sec
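
In case anyone wants to sanity-check my totals, here's the same table added up in a few lines of Python (nothing new in it, just the numbers from above):

# baseline realtime load per station in MB/sec: (write, read)
stations = {
    1: (4.5, 0.0),
    2: (4.5, 0.0),
    3: (1.0, 4.5),
    4: (1.0, 4.5),
    5: (1.0, 4.5),
}
base_write = sum(w for w, r in stations.values())   # 12.0 MB/sec
base_read  = sum(r for w, r in stations.values())   # 13.5 MB/sec

# with the occasional tasks for stations 5 and 7 going on top of that,
# add another 4.5 MB/sec of capture writing and 4.5 MB/sec of realtime reading
peak_write = base_write + 4.5                        # 16.5 MB/sec
peak_read  = base_read + 4.5                         # 18.0 MB/sec

print(f"write: {base_write}-{peak_write} MB/sec, read: {base_read}-{peak_read} MB/sec")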

Tomorrow's experiment will be to eliminate the network bridge (thus taking all computers off the internet, which is OK I suppose) and install a free DHCP server on the two connections. Also, it is possible for us to have a single dedicated gigabit line to each computer, with no hubs or switches involved. I'm also willing to look into other network protocols, because TCP/IP is unnecessary in this case, and there may be something better with less overhead.

Edited by grafixmonkey on 09/01/04 09:30 PM.
 

woz

Distinguished
Jun 25, 2004
44
0
18,530
Back when I was young... <old man voice> I configured a cutting-edge system: a Targa 1000(?) video capture card and a PowerComputing S900 with 120MHz of raw speed and 32MB of RAM, and the RAID was two 1GB SCSI drives at level 0. If the drive was freshly formatted we were able to hit an amazing 5.5MB a second, and it was just fast enough for MJPG2 capture in real time with no dropped frames.

Unfortunately, at times the computer would just start dropping frames, and I spent many, many nights reformatting those darn drives with every block size possible... reinstalling the OS, adjusting the RAM drive, etc., etc.

It turned out that the Targa card had a loose chip! AAAGH! But it worked out for the better, because we replaced it a couple months later with a Media 100 and then an Avid :)

Basically, I learned 2 very good lessons:

1. Never assume that a component is working correctly.
2. Software can mask or cause hardware issues.

You have done a great job of trying configurations, but you need to TEST things. (Test, not try.)

1. Have you tested the network for saturation? (A simple way is ping -a <server>; there are much better tools.)
2. Are the drives saturated? You already looked at this, but you should have plenty of hardware.

Q: 30-35MB/sec random write, and warned that something might be doing write verification...
A: On some drives this is set by reflashing the drive or via a utility provided by the manufacturer. It may also be a setting in the controller. (For a financial database this would be a good thing.) Do the drives you are using support command queuing, and is it turned on? (I have no idea how to check that on a SATA drive.) Also, is there a newer version of the firmware?

Q: So I think that the hard drives are "taking breaks" or something
A: Drives will recalibrate themselves from time to time so they can correct for expansion caused by heat. On some drives this timing can be set via tools from the manufacturer.

Q: Should I try disabling SMART or running some kind of Western Digital diagnostic utility on the drives to change something?
A: SMART shouldn't cause a problem, but it's possible that one of the drives could be bad. Some people say never to do a low-level format of a drive, but I always do it to every new drive; then I know how many bad clusters or blocks it had from the factory, and if it has a lot it goes back ASAP.

3. Is there other software causing a problem? Is virus protection on?
4. The 3Ware Escalade 9500S-12 looks to be a great controller, but is XP able to handle it? Have you tried running Win2k Server (as a test)?
5. Call 3ware and see what they have to say.
 

woz

Distinguished
Jun 25, 2004
44
0
18,530
Try copying a large movie onto the same drive while a capture is going on, and keep doing this to see how many copy/write operations you can do locally before you drop frames. This should remove the network as an issue.
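
If it helps, something crude like this run on the server itself would do it (the paths are just placeholders; point it at a big existing file on the same array the capture is writing to):

import os, shutil, time

SRC = r"D:\test\big_movie.avi"     # placeholder: an existing multi-GB file on the array
DST = r"D:\test\copy_{n}.avi"      # placeholder: copies land on the same array

def copy_loop(copies=10):
    size_mb = os.path.getsize(SRC) / (1024 * 1024)
    for n in range(copies):
        start = time.time()
        shutil.copyfile(SRC, DST.format(n=n))
        secs = time.time() - start
        print(f"copy {n}: {size_mb / secs:.1f} MB/sec")

if __name__ == "__main__":
    copy_loop()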
 

grafixmonkey

Distinguished
Feb 2, 2004
435
0
18,790
Thanks WOZ those are all good suggestions. Let me see what I can figure out today.

I'm pretty sure it's the random write performance of the array that's crashing things. I think I will run SiSoft Sandra between several different nodes on the network and see what each node reports in the storage and network benchmarks; that should tell me if the switches or cables are causing dropped packets, or if one machine is on a problem connection. I'm trying to figure out a way to run the WD diagnostic utilities on those drives... In my experience those diagnostic utilities have not liked being run through RAID cards, and I may have to find a system that has a port on the motherboard. SATA is much different from PATA though, and last time I tried that I was using PATA, so who knows. At least now we have some arrays that are "single disk" type and essentially turning out to be useless, so there's no information on them anyway. I did notice that WD now has a "Raid Edition" drive which advertises a "time-limited error correction" feature, whatever that means... I hope the drives I bought can have error correction turned off, and I double hope that turning it off doesn't cause all the data to be corrupted.

What boggles me is that exporting doesn't work to those single drives over the network?? Exporting requires only 1MB/sec of bandwidth! Also, when exporting to single drives the average "bytes per write" is around 32K, rather than 64K like the other arrays. Maybe that's telling me something about how the network likes to send its data.
 

grafixmonkey

Distinguished
Feb 2, 2004
435
0
18,790
it what, the network? or the raid card? They'd both better support that, or everything I think I know about both is wrong...

Still no luck with this. I found one computer where something weird was going on: 18KB/sec of bandwidth and 97% packet loss reported in Sandra's networking benchmark, but it would only occasionally report the packet loss. And I could run that benchmark and immediately afterwards copy a four-gigabyte file over the network, and the file would copy in only a minute or so. Maybe Sandra is screwed up by that system's config somehow; it's the first time I've ever run the Sandra net bench between two computers that both had PCI-X network cards and RAID drives.
 

woz

Distinguished
Jun 25, 2004
44
0
18,530
I have used Sandra on everything up to a true Fiber connection with no problems.

97% packet loss! Swap the network cable.

What kind of router or switch are you running? Does your router support auto cross-over detection? If this is the case one computer could flood your network with broken packets.

I have also seen that on some routers/switches (D-Link) because they support auto cross-over detection. It's kinda cool because you don't have to use a cross-over cable or an uplink port when going switch to switch, because the router is smart enough to do everything internally. Unfortunately, at times the router will try to test the connection and it loses lots of packets; in normal use it's not a problem, but with realtime data it becomes an issue. Also, some routers/switches have a good-sized (multi-MB) internal RAM buffer that can be monitored.

Have you talked to the RAID vendor yet to see what they suggest as a configuration?