Need advice on what kind of storage I need

icedown

Distinguished
Apr 11, 2009
4
0
18,510
I have set up a satellite data feed at my house. My current receiver system has one 320GB SATA drive in it, and it is not able to keep up. I am looking for the best type of hardware for this situation.

The data comes in via a network multicast off of a satellite receiver. Average speeds are around 500-700kB/sec, with peak speeds exceeding 1.2MB/sec, sometimes for around 10 minutes. This data is broken down into files, averaging between 2,000 and 4,000 files per minute. These files range in size from less than 1kB to more than 400MB depending on the product. The problem comes during the processing of this stream. Each file is written into a spool directory, and the file type and name are handed to various decoding programs. They each read the file, process it, then write it back to the drive in another directory. Each file goes through at least one decoder but can be passed to as many as four simultaneously. Then those files are read across the network by various programs that we use.

I'm not sure what the best way of dealing with this will be. It also doesn't help that I don't make much money, so I have to keep this thing cheap considering how much I've already spent on this setup. I've been looking at SATA II, SCSI, and various RAID setups but can't really find any good information on which direction to take with this problem.
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Hi there.

The problem seems to be random-access speed, with lots of random reads and writes. I think I can give you as good a solution as possible while keeping costs low.
Hopefully you have an Intel motherboard with their RAID chipset (ICH8R/9R/10R).
$200 - get two 74GB 10k rpm Raptor drives and set up the OS/apps and 'reads' in RAID 0. Then set up a RAID 1 array for 'writes'. I am assuming that reads are the incoming data and writes are the decoded final files, and that your software is capable of sending reads and writes to different paths.
If you need more storage, then you could try a second 320GB drive (it must be the same model for RAID). If a second drive in the RAID setup I explained still doesn't provide enough speed, then try getting three more 320GB drives and running them in RAID 10, which combines RAID 1 with RAID 0.
Just FYI, RAID 1 'mirrors' two drives, so if one fails you don't lose any data. RAID 0 'stripes' or splits the data between drives; however, if one drive fails, you lose ALL data.

Another pointer is the type of drive: a regular desktop drive is certainly not designed for what you need. Seagate's SV and ES.2 drives are designed for 24/7 heavy usage. The SV line is designed for video recording (i.e. surveillance) and the ES.2 line for servers. They cost more but are made for heavy usage, and you are certainly in the heavy-usage arena.

Side thought: isn't the software you are using designed to hold the partial data in RAM and then write to disk once a file is completely downloaded? That simple change alone would save your drives a lot of work.

I'll add more once you can answer my questions.

PS: I highly doubt you will need SAS/15k drives. Even the prior-generation 74GB Raptors can handle quite a bit, and the current VelociRaptors provide even better speed but cost a lot more.
 

icedown

Distinguished
Apr 11, 2009
4
0
18,510
Thank you for such a detailed reply. As far as the total amount of data, the feed usually totals ~20GB/day; after decoding, storage is usually around 60GB. The data is weather data, so it usually doesn't stay on the drive for more than 48 hours.

The system is running on Linux, so reading and writing to separate disks is not a problem. I can even separate the small data from the big data, because the data placed in a given directory tends to stay around the same size, with the exception of the spool directory, that is.

As far as holding it in RAM until completely downloaded, I'm not really sure. I would think that it does for most files, because it has to piece each file together from frames using sequence numbers. RAID 1 is not needed due to the general volatility of the data.

Also, what about onboard RAID vs. a controller card?
 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
I really think the pieces are not being held in RAM, given that your drive is being overtaxed. It might be set up to only store small chunks, like 5MB, and then write to disk, which would still cause problems for a single drive.

Onboard vs. hardware RAID: now that I reread your original post, I realize that you really need some storage power. What motherboard do you have, and how many PCI-Express slots does it have? (This is for hardware RAID.)

I really think two RAID 0 sets would be best: one for incoming/decoding and the other for reads from your network, both at least 10k rpm. Before I go any further, what is your budget?
SSD drives are almost what you need, but the large size variation of your files combined with constant writes would bog down even the top-of-the-line Intel X25. SSDs just aren't ready for prime time in heavy-write environments. I can also tell you that using the smallest stripe size in your RAID will help with all the small files but hurt the speed of larger files.

The best solution is using RAM to store the pieces and RAID for decoding and reading. Is the decoding done by other hardware, or by software on the host PC? If on the host PC, can it keep the files in RAM and decode from there, so it only needs to write to disk when decoding is done?
 

icedown

Distinguished
Apr 11, 2009
4
0
18,510
The software I'm using is written in Tcl. I'm not very familiar with the language, so it will take me some time to decipher how it deals with the fragmented data in the feed. As far as hardware, I'm currently running an AMD Athlon 2.5GHz with 1.5GB RAM on a GA-7N400-Pro2. It has a flaky SATA controller on it, but currently it's the most powerful spare I have. When I get the Cat5 cable buried from my satellite dish to the house, I'm probably going to put it on my house server. That's a dual Xeon 2.4GHz box with open PCI and PCI-X slots. It's an Intel server motherboard; I don't remember the model.

A statistics sample is at http://www.stormguardsolutions.net/stats.html. These are one-minute intervals, and each product is a file. This is not a peak time; it's around average, and it's before decoding. Maybe this will help explain the data stream a little better. Missed products are ones for which not all the frames arrived; this is due to a slight misalignment of the dish itself. I'm still refining the dish settings.

 

specialk90

Distinguished
Apr 8, 2009
303
0
18,790
Because your current motherboard is not an Intel board (so it doesn't have Intel's excellent onboard RAID), I would try a second drive and separate the reads and writes between the two drives. If you want to spend time on trial and error, you could try another 7200rpm drive.

Oh wait, that reminds me of something great for your needs: short stroking. This is something you can already do with your current drive. You create a single partition on the first 40-50GB and leave the rest of the drive empty (no other partitions at all). That's short stroking. It shortens the physical distance the heads must seek, GREATLY improving random access. Tom's Hardware just did an article on this, testing 250GB Hitachis, 1TB Hitachis, 15k Seagates, and a couple of SSDs. On the single 250GB drive, performance improved 65% when going from the full drive to the first 34GB, and about 90% using only the first 12GB. The law of diminishing returns suggests ~34GB is the sweet spot for a 250GB drive.

Because you said that about 60GB of data is collected and dumped every few days, you can benefit greatly from short stroking; I don't think RAID will be needed after all. Another thing to consider is the amount of space already used by the OS and software: is that on the 320GB drive?

If you want to compare drives, look at 'Web Server' IOPS benchmarks, because web servers tend to be the most demanding workload, accessing lots of small files quickly.

Look at the Seagate ES.2 500GB drives ($90 at Newegg). Their web-server performance is very close to the 10k rpm VelociRaptor, and yet it's only a 7200rpm drive. The ES.2 is also designed to run 24/7 under heavy use, whereas regular desktop drives are not. The ES.2 is an 'enterprise class' drive, meaning it is designed for higher IOPS (transactions per second). Compared with its desktop sibling, the 7200.11, the ES.2 is about 15-20% faster in server benchmarks (i.e. IOPS). With a 500GB drive, you can use only the first 65-70GB and get speeds near 15k drives. This setup will certainly be faster than using a 74GB Raptor (10k rpm), since with the Raptor you would be using the full drive, and it's $10 less. FYI, I'm not a Seagate fanboy. I use both WD Raptors and Seagate 7200.11 500GB drives (and 7200.10 250GB drives) and love them both. However, of the 13 Seagate drives and 8 Raptors I have used over the last 2 years, only one drive has died: a Raptor, a few weeks ago.

Also, if you want to know about other enterprise-class drives, there is the Hitachi Ultrastar (Deskstar is the desktop line) and the WD RE3 (the Caviar Green/Black are the desktop lines).

Sorry for going on and on, but I really want to see how well short stroking works for you.
 

basalt51

Distinguished
May 20, 2008
18
0
18,510
Or create a RAID 0 (or whatever RAID level you like) setup with the short-stroked drives!

Also, I know Linux (well, Ubuntu/Fedora/OpenSUSE, that is) doesn't recognize my ICH10R RAID, so be sure to plan ahead with your server-class motherboard, though I'm sure that one is fine (ICH8/9) and probably uses a true hardware RAID controller.
 

Zenthar

Distinguished
Damn, I love this question: so many possible causes/solutions :D. I'm not being sarcastic here; I just love system architecture (both software and hardware).

If you run the command "top", can you see what your CPU load is, how much I/O wait you get, and how much RAM is in use? This will tell you what the bottleneck currently is.
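For reference, the CPU line that top summarizes comes from /proc/stat. Here's a small Python sketch that reads the iowait share out of one snapshot (on a live box you would read the counters twice and diff them; the sample line below is made up for illustration):

```python
def iowait_percent(stat_line: str) -> float:
    """Parse the first 'cpu' line of /proc/stat.
    Field order (kernel 2.6+): user nice system idle iowait irq softirq ..."""
    fields = [int(x) for x in stat_line.split()[1:]]
    iowait = fields[4]
    return 100.0 * iowait / sum(fields)

# hypothetical snapshot; on Linux: open("/proc/stat").readline()
sample = "cpu 1000 0 500 7000 1500 0 0 0"
print(iowait_percent(sample))  # → 15.0
```

If a number like that stays high while the CPU columns stay low, the processes are mostly sitting around waiting on the disk, which is exactly the symptom described in this thread.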

If your CPU is maxed out by many distinct heavy processes (like data crunchers), you might even lose additional performance to context switching; this would be best solved by a dual/quad-core CPU.

If you get too much I/O wait, you have to know what is taxing your HD the most. For example, from what I understand, your data feed is written to multiple files, these files are then read by multiple processes (sometimes simultaneously), and their output is then written AGAIN to files. If the "satellite files" don't need to be kept for long and don't take too much space, a small SSD might give you a hell of a boost, since 1.2MB/s of throughput isn't really a big deal by itself. As others suggested, having the "satellite files" and "final files" on 2 separate disks would also reduce the load considerably. You could also give short stroking a try (whoever thought of that name should be shot...).

If too much RAM/swap is used... then add RAM... duh!

Unfortunately, it could also be that the application is badly written. There are many ways to code such an application that make it an unnecessary resource hog... but there are ways to make it nicer to your system too. For example, instead of sending just the file name to all the "crunchers", there are ways to send the content itself directly. It will require more RAM, but RAM is much cheaper than fast HDDs.
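That last idea (handing the crunchers the content instead of a path on disk) could look something like this in Python; the "cruncher" and the product names are trivial stand-ins, not the real decoders:

```python
import queue
import threading

def cruncher(inbox: queue.Queue, results: list) -> None:
    """Consume (name, bytes) work items from RAM; never touch the disk."""
    while True:
        item = inbox.get()
        if item is None:                    # sentinel: no more work
            break
        name, data = item
        results.append((name, len(data)))   # "decode" in memory
        inbox.task_done()

inbox: queue.Queue = queue.Queue()
results: list = []
worker = threading.Thread(target=cruncher, args=(inbox, results))
worker.start()

# the feed handler puts file *content* on the queue, not a filename
inbox.put(("product1", b"\x00" * 1024))
inbox.put(("product2", b"frame data"))
inbox.put(None)
worker.join()
print(results)  # → [('product1', 1024), ('product2', 10)]
```

The tradeoff is exactly as stated above: every in-flight product lives in RAM until its crunchers are done with it, but the spool-directory round trip through the disk disappears entirely.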