Logical Drive problems - 14.5TB raid6 under Win7 x64 NTFS

sirv

Distinguished
Feb 16, 2008
27
0
18,540
Walking the line between being descriptive, providing all the details, and avoiding a wall of text is tough, so I've divided this into sections - skip any you think are irrelevant.

Contents:
Problem: short description
Hardware
Software
Partition Info
Raid Controller
Process of Error Finding
Things it's probably not


Problem: short description
I have 10x 2TB drives in a raid6 configuration (roughly 14.5TB of actual storage, excluding the parity drives). The raid controller initializes the logical drive fine; however, a full (slow) format in Windows simply reverts to unallocated upon OS restart, while a quick format appears to work for a while (as in, a few reboots) and then resets to a raw, unformatted partition (after 3.5TB had been copied to it).

Hardware:
Promise SuperTrak EX16350 (16x SATA300 RAID, 128MB PCI-e x8) - Oddly, the controller's BIOS claims it has 256MB memory
Samsung EcoGreen F3EG, 2TB - x10, in a raid6 (roughly 14.5TB of actual storage) - connected to promise controller
Western Digital Caviar Green WD10EACS, 1TB - x6, in a raid6 (roughly 3.6TB of actual storage) - connected to promise controller
OCZ Vertex 2 SOCZSSD2-2VTXE120G 120GB - connected to mobo
HDD 320GB 7200RPM S-ATA300 Seagate 7200.10 16MB Cache - x2, connected to mobo, soft (in windows) mirror raid
MSI DKA790GX
AMD Phenom X4 9950
OCZ Platinum Dual Channel OCZ2P10004GK - 2GB, x4
Silverstone Strider PSU 750W (SST-ST75F-P) - powers everything else
POWER SUPPLY YESICO 560W SilentCool w/modular cable - 'always on' - powers 5 drives: 4 2TB drives and 1 1TB drive

Software:
Windows 7 Ultimate x64

Partition Info
The computer has 4 partitions even without the new 14.5TB raid6 - all are NTFS and Healthy:
[SSD, MBR] C: - Simple, Basic - Boot, Page File, Crash Dump, Primary Partition
[SSD, MBR] System Reserved (no drive letter) - Simple, Basic - System, Active, Primary Partition
[2x 320GB, MBR] D: - Mirror, Dynamic
[3.6TB raid6, GPT] T: - Simple, Basic - Primary Partition
[14.5TB raid6, GPT] S: - Simple, Basic, RAW - Primary Partition (sometimes; before, it kept reverting to unallocated)

Raid Controller
The raid controller goes through initialization fine (I set the sector size to the maximum of 4096 and the stripe size to 64k); it takes roughly 32 hours. One drive took roughly 8 hours to check (a full drive read) using Samsung's ESTool, and the 10 drives are split 2, 4 and 4 over 3 connections to the raid card, so 32 hours is about what you'd expect. All 10 hard disks and the logical drive check out as OK in the controller BIOS, and the disks all run the same firmware. Aside from the logical drive initialization there is no background activity. The raid card currently also holds a 6x 1TB raid6 which is working fine (unfortunately it's also filled to the brim), and for years I ran that raid6 alongside a 4x 320GB 0+1 array on the same card with no problems.
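For what it's worth, here is the back-of-the-envelope reasoning behind why 32 hours sounded about right to me, written out as a tiny sketch (the shared-bandwidth-per-connection assumption is mine, not anything Promise documents):

```python
# Rough sanity check on the ~32 hour initialization estimate.
# My assumption: all drives are written in parallel, but drives sharing one
# connection to the card split that connection's bandwidth, so the connection
# with the most drives sets the overall pace.
hours_per_full_drive_pass = 8       # one 2TB drive, full read, per ESTool
drives_per_connection = [2, 4, 4]   # how the 10 drives hang off the card

estimated_init_hours = max(drives_per_connection) * hours_per_full_drive_pass
print(estimated_init_hours)         # 32, matching the controller's estimate
```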

Process of Error Finding
First I allocated the drive in the raid controller BIOS. Fool that I was, I did not wait for the logical drive to finish initializing. After a first (slow) format in Windows and a restart, the drive showed as 22% initialized in the raid controller BIOS, and in Windows it was back to unallocated. This time I waited for the drive initialization to complete in the raid BIOS and then did another (slow) format in Windows. Because the stripe size is 64k, I picked an "Allocation Unit Size" of 64k, a volume name of "Storage" (same as T: ), and a file system of NTFS (no other choice was offered, anyway). "Quick format" was left unticked. "Enable file and folder compression" was unavailable (because the allocation unit size wasn't the default, I assume) - and undesired anyway. This second format did the same as the previous one: the drive was unallocated (Computer Management -> Storage -> Disk Management) after an OS restart.
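In hindsight, rather than trusting the format dialog, I could have double-checked what Windows actually reports for the sector and allocation unit sizes with something like the sketch below (assuming Python with ctypes on the Win7 box; the S: drive letter is just a placeholder):

```python
# Ask Windows what it thinks the sector size and allocation unit (cluster)
# size of a volume are, via the Win32 GetDiskFreeSpaceW call (Windows only).
import ctypes
from ctypes import wintypes

def volume_geometry(root="S:\\"):
    sectors_per_cluster = wintypes.DWORD()
    bytes_per_sector = wintypes.DWORD()
    free_clusters = wintypes.DWORD()
    total_clusters = wintypes.DWORD()
    ok = ctypes.windll.kernel32.GetDiskFreeSpaceW(
        root,
        ctypes.byref(sectors_per_cluster),
        ctypes.byref(bytes_per_sector),
        ctypes.byref(free_clusters),
        ctypes.byref(total_clusters),
    )
    if not ok:
        raise ctypes.WinError()
    cluster_bytes = sectors_per_cluster.value * bytes_per_sector.value
    return bytes_per_sector.value, cluster_bytes

# e.g. (4096, 65536) would confirm 4k sectors and a 64k allocation unit size
print(volume_geometry())
```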

So I removed the logical drive in the controller BIOS, recreated it (same options as before: 4k sector size (options: 512 to 4096), 64k stripe size (options: 32k, 64k, 128k)) and waited some 50+ hours (far more than the estimated 32 hours required) before rebooting. After the reboot the drive showed as fully initialized in the raid BIOS. I did another slow format in Windows. The drive (apparently) worked just fine before rebooting: I copied over a few gigs of music, directory listings appeared fully functional and music playback was no problem. After another reboot the drive was once again unallocated. I did a quick format (same settings still), copied over some music, tested playback, rebooted. The drive appeared to be fully functional even after several reboots. Next I copied over 3.5TB of data. Suddenly some directories became unavailable: you could see most directories, but (I think) could not open most of the ones that contained files. The drive's properties showed 3.5TB in use, yet selecting all folders in the root (excluding, of course, the empty recycle bin and the System Volume Information directory) showed a mere 300GB of actual files. After a reboot the drive reverted to a RAW filesystem status in Computer Management - but not to unallocated.

Things it's probably not
At this point I'm stumped.
- NTFS is limited to 256TB with 64k clusters (aka Allocation Unit Size, I believe). Even with 4k clusters the limit is just under 16TB, which is still above the 14.5TB this volume actually is (the arithmetic is sketched just after this list).
- Windows 7 editions do have RAM limits, but no drive size limitation that I could find.
- Before I did all this, I contacted Promise support. Their tech guy assured me that while a 32-bit OS might have size limitations, a 64-bit OS could handle a virtually endless size (this was in response to my question of whether the raid controller card had a drive size limit, given that it's handling roughly 18.1TB including both parity drives).
- I have read (in a 4-year-old forum post) that mixing MBR with GPT results in a maximum of 4 primary partitions. I'm guessing that's either outdated or circumvented for some other reason. I had previously added a 2TB drive directly, and while I can't remember whether it was MBR or GPT, it worked fine; when it was removed and the 14.5TB drive was added in its stead, the number of partitions and drives stayed the same, so I don't think this can be it either.
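To spell out the NTFS size arithmetic from the first bullet (this is just the well-known 32-bit cluster addressing limit, nothing specific to my hardware):

```python
# NTFS addresses clusters with 32-bit numbers, so the volume size limit is
# roughly 2**32 clusters times the cluster (Allocation Unit) size.
# (Strictly 2**32 - 1 clusters, hence "just under".)
TiB = 1024**4

for cluster_bytes in (4 * 1024, 64 * 1024):
    limit = 2**32 * cluster_bytes
    print(f"{cluster_bytes // 1024}k clusters -> up to {limit / TiB:.0f} TiB")

# 4k clusters  -> up to 16 TiB
# 64k clusters -> up to 256 TiB
# Either way the limit is above this array's ~14.5TB, so NTFS's size limit isn't it.
```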
 
Guest

Guest
Hi.

It seems we have the same problem, pretty much.

I have the Promise SuperTrak EX8350, with 8x 2TB WD Green HDDs in raid 6.

I have tried several different approaches, but every time it fails just like yours, after I copy about 2TB+ onto it. That seems to be the magic limit here, at least. No idea why!
And the full initialization takes about 130 hours!!!! So it's a pain in the f** butt to experiment and try different things here!

I also set the sector size to 4096 and the cluster size to 64k, btw.

I have no idea what to do here now. It's really starting to get on my nerves!

If you have found any solution to this I would be very grateful if you would share it.

Tx..

/aleks
 

sirv

Distinguished
Feb 16, 2008
27
0
18,540
Well it's been a month and I still haven't solved this (of course it doesn't help that I use this machine so I can't just let it run diagnostics all day).

It's definitely not the Allocation Unit Size you pick when formatting. I've used 64k, 4096 and 512, with both quick and slow/full formats, and none of it helped. I've also run Samsung's ESTool (with a full read surface scan) on 6 of the 10 disks so far, just in case the controller doesn't know one of them is broken, but no errors so far. I've also tried several volume sizes (when formatting), both at the front and at the end of the disk. Whether it's the full 14.5TB, the front or end half, or the front or end quarter, the error persists. I made the quarters smaller than the current raid 6 that does work, just to be sure.

Then one day I had the brilliant thought of looking at the event viewer. It seems so obvious but I don't deal with these kinds of problems daily so... Anyway:
Event ID: 55
Source: Ntfs
"The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume P:."
13 errors (not critical) within 3 seconds: 12 of them mention "New Volume" and one mentions "P:". When I google this, the Microsoft solution is to format using a 4096-byte allocation unit size instead of 512. ...
But what's more baffling is that if it really is an NTFS problem, why is my other raid 6 (with all the same data and more) working just fine? I should note that the last time the volume failed I was watching, and it happened right around the 2TB (data on volume) mark.

I also thought that perhaps, even on a 64-bit OS, Windows might not like copying more than 2TB of data at once. But when I split the copying into smaller parts (none bigger than 1.25TB) it still broke down.

So, after all that here's what I'm looking at as possible culprits right now:
* The raid controller doesn't like 4096-byte sectors. I went and checked: the other raid 6 does indeed use 512-byte sectors.
* Some way, somehow it's NTFS's problem. Never mind why the other raid 6 would be working fine, though.
* As a very, very long shot it might be Avast's (antivirus) fault. I doubt it.
* One of the drives is failing but the raid controller (and perhaps even ESTool) doesn't realize it. I have trouble believing something like that would get past raid 6 redundancy, though.

So, one possible cause as unlikely as the next but I'm starting to grasp at straws.
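One more bit of arithmetic that keeps nagging me about that ~2TB mark (pure speculation on my part, tied to the first culprit above): 2TiB is exactly where a 32-bit count of 512-byte sectors runs out, so if some layer in the chain is still doing 32-bit sector math despite the 4096-byte setting, it would fall over right around the point where things break.

```python
# Why ~2TB is a suspicious boundary (my speculation only): 2**32 sectors of
# 512 bytes is the classic 2TiB ceiling of 32-bit sector addressing.
TiB = 1024**4

print(2**32 * 512 / TiB)    # 2.0  - right where the volume keeps breaking
print(2**32 * 4096 / TiB)   # 16.0 - what a 32-bit count should cover at 4k sectors
```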
 

sirv

Distinguished
Feb 16, 2008
27
0
18,540
Everything was up-to-date before I started - firmware, drivers, ...

The symptoms don't seem to point towards the controller, and I did ask them before making any purchases (meaning the extra 10 hard drives and a bigger PC case), but I'll open another ticket and see if they have any advice to add.
 
Does the controller support some sort of integrity check of the volume, and have you tried doing that when the problems occur? If the integrity check fails, then you either have some bad drives or you've got some sort of controller glitch. If the integrity check passes, the drives and controller firmware are probably OK and you may have some sort of OS-level driver glitch.

The last thing I'd suspect would be NTFS itself. It has a very solid track record. There's always the possibility of some strange issue with it, but it's not where I'd start looking.
 

shelob

Distinguished
Sep 24, 2010
1
0
18,520
Hey again.

I had the exact same thought, actually. First I tried all kinds of format parameters etc., without any luck.
Every time I copied more than about 2TB it crashed, just like yours.

Then I deleted the raid and set it up again with a 512-byte sector size (and 64k stripe size). I formatted it with a 64k allocation unit size as usual, and now it seems to be working fine!! I have copied about 5TB to it so far, it has completed init, no errors whatsoever, and it has been running for about 10 days now. I have restarted a lot of times, copied to and deleted from the drive, etc.

In other words, it seems the problem was the sector size (don't ask me why), but it is working perfectly now!!! =)

(And btw, I created the logical drive immediately, formatted it, and started copying to it while the full background init was still running. I didn't have the patience to wait for init to complete first, but it still worked perfectly =)

/Aleks
 
Solution

sirv

Distinguished
Feb 16, 2008
27
0
18,540
First, shelob appears to have found the solution (that was going to be my very next thing to try). Although I haven't done extensive testing yet, I have now copied over all my data, filling nearly a quarter of the volume. Yay.

Next, thanks to sminlal for getting me to revisit Promise's WebPAM (management) software. I don't know if other factors were involved, but I never got it working before (and since everything was running fine I wasn't terribly motivated to). It's not mentioned anywhere, but apparently it requires Java, so I've now got it working for the first time.

When the volume(s) disappeared, I also lost drives 13 & 14, making the raid first degraded, then critical. After that I let the raid rebuild (14 dropped out once more, and there was a building-wide power outage somewhere in there). Even though a redundancy check was then somewhat pointless (if you've used both parity sets to rebuild 2 lost drives, all parity checks are bound to pan out), I ran one anyway, to make sure the array was at least accessible and reading correctly (a broken drive might, perhaps, still return bad data even if it was recently rebuilt from good data). All of that panned out, and throughout, the volumes were still 'raw'. I made another attempt to see if I could get through the redundancy check without 2 drives dropping out, but after copying over 2TB both 13 and 14 dropped out again.

So the next two nights were spent running ESTool on drives 13 and 14. Drive 14 couldn't even get past the random read test (and, apparently, was still throwing errors 8 hours later). Drive 13, which is perhaps more worrying, came through the tests clean, including the full drive read. Perhaps it's because they were on the same 4-pin-to-2x-SATA power connector, or on the same Promise-to-4x-SATA data cable (but then, so were drives 15 and 16). Or perhaps drive 13 fluked the ESTool test and will fail later. That's beyond this topic anyway.

Finally, I wasn't suggesting NTFS itself has a problem, just that there's a tiny chance there might be a problem in its implementation, considering these are rather rare circumstances. It was a pretty desperate guess. Fortunately, it turns out this isn't it.

Thanks for all the help; hopefully time will prove this was indeed the solution (which still seems odd, since the controller now has to keep track of 8x as many sectors - oh well, as long as it's fixed). I'll try to remember to post back in a few days to confirm.
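Just to put numbers on that "8x as many sectors" remark (rough figures, using the volume's approximate capacity):

```python
# Sector bookkeeping for the same ~14.5TiB volume at the two sector sizes.
capacity_bytes = round(14.5 * 1024**4)     # approximate volume capacity

sectors_at_4096 = capacity_bytes // 4096   # ~3.9 billion (the old, broken setup)
sectors_at_512 = capacity_bytes // 512     # ~31 billion (the new 512-byte setup)
print(sectors_at_512 // sectors_at_4096)   # 8
```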
 

sirv

Distinguished
Feb 16, 2008
27
0
18,540
Okay. First I replaced the bad drive 14. Drive 13, at least so far (and, as mentioned, according to Samsung's ESTool), seems perfectly fine. Next I deleted the old logical drive, created the new one (using 512-byte sectors), shut down the PC and prepared the replacement drive 14 for an ESTool check, just in case. This was in the morning, and I run the 8+ hour ESTool checks while sleeping, so for the rest of the day I played with the logical drive, confirming the fix. I mention this because, oddly enough, the initialization stopped as soon as drive 14 was removed and the logical drive was marked 'degraded'. I realize initialization is merely zeroing out the physical drives, but it's still odd.

So once the new drive 14 also checked out, I recreated the logical drive once more (just in case) and, unlike shelob, I mustered the patience to sit out the full initialization. Once that was done I created a volume covering the full logical drive and copied over my data. Conveniently, my data is just under 25% of the new volume, so I could test its bounds by simply copying my data over 4 times. I also ran a chkdsk (for the hell of it), which passed with flying colors. I've added some numbers below, but long story short, everything appears to (finally) be working.

Tested:
414,796 files in 27,552 folders (Maximum directory depth is 8 or so - nothing taxing there)
Capacity: 16,002,518,548,480 bytes (14.5 TB)
Free Space: 28,362,866,688 bytes (26.4 GB)

Also (I admit I've zipped quite a few smaller files up front for this), a comparison between 4kB and 64kB allocation unit size (formatting option):
Comparison Size: 3.62 TB (3,987,850,488,877 bytes)
Size on disk (4kB): 3.62 TB (3,988,105,908,224 bytes)
Size on disk (64kB): 3.63 TB (3,993,143,869,440 bytes)
Because of the rounding, the difference looks like 10 GB, but it's actually 4.7 GB, or 0.126%. Keep in mind this holds only for my current storage situation, where the average file size is 9.17 MB.
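For anyone curious where that overhead comes from, the naive slack arithmetic looks like the sketch below (assuming, idealistically, that every file wastes about half a cluster; my real number came in lower mainly because I zipped a lot of the small files beforehand):

```python
# Naive estimate of the extra slack space from 64k clusters versus 4k clusters.
# Model assumption: each file wastes about half a cluster on average.
files = 414796                                      # file count in the comparison set
extra_slack_per_file = (64 * 1024 - 4 * 1024) / 2   # ~30 KiB more waste per file at 64k

predicted_extra = files * extra_slack_per_file
print(predicted_extra / 1024**3)   # ~11.9 GiB predicted by this naive model

# The measured difference was only ~4.7 GB (0.126%), largely because many small
# files had already been bundled into zip archives before copying.
```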

So thanks to you both for your help.
 
