Raid 5 Building Stuck at 0%

orion_supernova

Dec 27, 2018
So:

Windows 7 Home Premium Edition
Asus 990FX R1.0
6x WD Red 4tb HDD 5600 rpm
1x 240 GB SSD OS
1x 120 GB SSD Games

The 6 WD drives are run in a RAID 5 config directly from the 990FX mobo, configured through a 2-part pre-BIOS/BIOS setup.
-You set the SATA port config through the BIOS but manage the RAID upon restart.
I have 8x SATA ports on the mobo, all occupied; port two is connected to a vertical hot-swap connection on the top of the case, and apparently platter drives can't run in it.

To skip all the backstory, look for the ############## near the bottom

Once upon a time, many years ago, the RAID was fully functioning.
I looked upon my creation and it was good, and I smiled.
5x drives RAID'd, which showed in both the pre-BIOS and RAIDXpert as a 19,999 GB partition.
Windows showed it as an 18.1 TB drive.
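
The two size figures can be reconciled with a little arithmetic. This is only a minimal sketch, assuming each drive is a nominal 4,000 GB and that Windows reports sizes in binary units (what it labels 1 TB is 2**40 bytes); the function names are made up for illustration.

```python
# Nominal capacity of one drive in decimal gigabytes (a "4 TB" drive as sold).
DRIVE_GB = 4000


def raid5_usable_gb(member_count, drive_gb=DRIVE_GB):
    """Usable RAID 5 capacity: one drive's worth of space goes to parity."""
    if member_count < 3:
        raise ValueError("RAID 5 needs at least three member drives")
    return (member_count - 1) * drive_gb


def gb_to_tib(gigabytes):
    """Convert decimal GB (10**9 bytes) to binary TiB (2**40 bytes)."""
    return gigabytes * 10**9 / 2**40


for members in (5, 6):
    usable = raid5_usable_gb(members)
    print(f"{members} members: {usable} GB = {gb_to_tib(usable):.2f} TiB")
# 5 members: 16000 GB = 14.55 TiB
# 6 members: 20000 GB = 18.19 TiB
```

Under these assumptions, 19,999 GB and 18.1 TB line up with six member drives rather than five plus an idle spare, which would fit RAIDXpert listing all six drives as integrated.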

RAIDXpert showed 6x drives as being integrated, even though one drive was specifically set as a hot spare through the BIOS before initial creation. I decided that this meant the spare was just dedicated to the RAID.
No config was done through RAIDXpert to start with.

One day, coming home from a friend's, a cable remained unplugged upon boot and the RAID went offline in the pre-BIOS.
The computer took exactly 7 minutes to boot through the Windows screen (yes, I timed it, as it was the first issue I had had with the RAID).
When it did finally boot, it showed the RAID as offline.
I found the issue, restarted with the cable plugged in, and the RAID came back, but with the same 7-minute delay on every boot.
Relieved I hadn't lost the data, I didn't question this further, and for months I trolled the interwebs with abandon.

Later a drive did actually die and the RAID went offline completely. Sadly, all 14 TB of data was lost, as I was not able to recover it. Much of it was original footage and records, plus some beloved TV shows & movies that I had lost the discs to many years ago.
I tried many places over about 3 months to recover it, and I even had Microsoft tech support trying to help me, but with no backups to fall back on I eventually gave up, wiped the RAID and all associated drivers, and started anew.
I replaced the dead drive with a Seagate Barracuda 7200 rpm (I did not know at the time whether the drives had to be exactly the same) and reinitialized the RAID to a blank canvas of 19,999 GB.
Seeing that come up on the screen was both a heartbreaking and a hopeful moment.

The pre-BIOS and RAIDXpert both showed the RAID in the CRITICAL state with only 5 discs, and I just assumed that was because the Seagate drive was not accepted as it was too different.
I wrote the tragedy off as a fluke of nature, an act of God if you will, and without the funds to buy a correct disc I was determined to rebuild as best I could.

Sadly, just before Christmas this year, tragedy struck again.
I sat at my computer in the morning to play a bit of Rimworld before work, and found that the shortcut on the taskbar would click but then not do anything.
So I tried opening Steam, to the same response.
As a moment of dread settled over me, I opened Windows Explorer, and my heart sank as the little message in the corner of the screen from RAIDXpert contritely exclaimed that the RAID had gone offline.
There was nothing I could do but shut down.

After work that afternoon, I booted the machine with bated breath.
The pre-BIOS showed the RAID in its usual CRITICAL condition!
Windows detected the RAID!
I was able to play movies and TV shows; it was all there.
Everything seemed normal until I was about 10 minutes into a Rimworld game and the RAID suddenly went offline! (It's possible that the game was attempting to save, but I have no way of knowing or even speculating on the chances of that.)
Anticipating a dead drive, I had stopped by the PC store on the way home and bought 2x new WD 4tb 5600 rpm drives, exactly the same as the rest of the drives in the stack, except 4 years newer.
I tried a number of variations, switching out drive 4 (the dead drive) and drive 5 (the Seagate), and found that as long as 4 was in, the RAID would boot but would be unstable.
I've jiggled the cables, blown out the dust, and have the original discs in slots 1, 2, 3, 4, 6, plus I have replaced the Seagate drive with a new one.
The RAID has been stable while being downloaded to for the last hour, so I am starting to get my hopes up that I have overreacted to a loose or bad cable, but it still worries me that I have had so much trouble recovering from what could be seen as a dead disc, something that a RAID 5 should be resilient to.

Here's the kicker
All through this process of switching out drives 4 and 5, whenever the RAID had failed and I had rebooted, the pre-BIOS would show the RAID as REBUILDING, and when it boots into Windows, RAIDXpert shows that it is running the rebuilding process, but it never progresses past 0%.
I am able to pause the rebuilding, and abort it, but aborting only has the process immediately restart and hang at 0% again.
I am able to start a patrol schedule, it will progress to 30% or 31% on 5 drives, but will hang at 0% on drive 4.

####################################

The RAID is currently in what I would call a normal state.

The RAID currently has 5x original discs in it (1, 2, 3, 4, 6), plus I have replaced disc 5 with a WD to match the stack. Disc 4 was the one that was most recently giving me issues.
I was not able to designate disc 5 as a spare through the BIOS; it would seem that this has to be done at initialization.
RAIDXpert, however, allowed me to designate the drive as a spare, at which point it was immediately integrated into the RAID as far as both RAIDXpert and the pre-BIOS are concerned.
This happens even if the drive is added while Windows is running.
Even before the disc is added, when the RAID shows up as a 5-disc array, RAIDXpert shows it to be rebuilding and stuck at 0%.
And lastly, for a real twist: if I got the RAID online and working in a critical state and then had disc 4 fail, the RAID would just go offline. It would show discs 1, 2, 3, 5, 6 as all connected and functioning, but the rebuild would just sit at 0%.

Now for the part I need help with.

I currently have access to all 3.7tb of data on the array. The obvious answer on how to protect it is to back it up, but I would like to try to understand the issue at the heart of this, as I'm worried that if a disc actually does die, I might lose all of my data forever.

What's the point of a RAID array if losing a single disc is going to lose all the data anyway? I might as well just go for an extended disc; it seems that would at least have more recoverability.
Obviously this isn't how it's supposed to be. How can I make this not happen in the future?
 
RAID is not totally bulletproof. It does give you some redundancy to allow for a failed drive that is easily replaced, but you still need to back up all the data in case 2 drives fail at once or a second fails during a rebuild.
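
To make the one-drive limit concrete, here is a toy sketch of the XOR parity idea behind RAID 5. It is a simplified model of a single stripe on a hypothetical 4-drive array, not how the AMD controller or any real firmware lays out data, and the helper name is made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (illustrative values).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Lose any ONE block: XOR of the three survivors gives it back exactly.
lost = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
print(rebuilt == stripe[lost])  # True

# Lose TWO blocks (0 and 1): the survivors only yield the XOR of the two
# missing blocks, which is not enough to tell them apart.
combined = xor_blocks([stripe[2], stripe[3]])
print(combined == xor_blocks([stripe[0], stripe[1]]))  # True
```

A rebuild has to read every surviving drive in full to do this for each stripe, which is why a second drive that throws read errors partway through can take the whole array down.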

I would recommend that if you really want to use RAID and have a PCIe slot (full size x16, although it will only require x8), you buy an inexpensive Adaptec 6805T with two 8087-to-SATA fanout cables. There are always a ton of them on eBay, and some are new out of China and are quite cheap, although shipping is slow at a couple of weeks.

THIS is an example for $47 with shipping.

HERE is more information on the card and all the drivers/software.

Motherboard RAID is worthless and the array breaks easily. You can't run software RAID 5 in W7, although you can in W10 with Storage Spaces.
 
So I started up the PC today to find the array online and functioning.
I have successfully replaced and integrated a new spare disc in slot 5.
Disc 4 dropped out again and the array went offline.
I replaced disc 4 with a new one and rebooted.

I got the same preBIOS message telling me that the array was rebuilding.
When I got into Windows I started up RAIDXpert to see what was happening and found the rebuild status had now stuck at 3% rather than the usual 0%.

I have left it like this for nearly 2 hours and it has not moved, so I am fairly certain it is frozen again.

Could there be some sort of conflict between the preBIOS and RAIDXpert, where they are both trying to run an independent rebuild process, that could be causing the freeze?

Are there any other diagnostic or management programs out there that I might be able to use instead of RAIDXpert to see if that is the issue?

AMD are giving me the expected basic level customer support, linking me to pages from the RAIDXpert manual.

I understand there are other solutions out there for RAIDing and am seriously considering moving all of the drives over to a NAS to free up space on the mobo, but I'd like to find out what is happening to this RAID rather than just giving up and moving on, because that's not how you learn.

Does anyone have any insight into the issues I'm having?
 
Strangely, when I took out the new disc in slot 4, put the original WD disc back in and rebooted:
PreBIOS shows the array as rebuilding.
Windows shows the array as functioning fine.
RAIDXpert shows the array functioning fine, but rebuilding and still at 3%...

I'm really getting suspicious of RAIDXpert and would really like to find an alternative to it.
 
A dedicated NAS is the way to go here.
Qnap, Synology, Thecus...

I have a 4 bay Qnap, 4 x 4TB Seagate IronWolf, RAID 5.
Rock solid for 2 years.

No way I'd do all that on my daily driver PC.

And of course, there is also a full weekly backup of the entire array. Just because...
 
If it helps, here is the error report from the most recent system startup.
Configuration:
5 Original discs in ports 1,2,3,4,6
Disc 5 (Original disc was a spare that was not integrated) freshly replaced and fully integrated.


# | Source | Severity | Time | Description
1 | AMD Chipset SATA Controller - Controller 1 | Warning | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 4,Target ID 1) aborted at 16% because of error
2 | AMD Chipset SATA Controller - Controller 1 | Error | 2019/01/01 17:24:26 | Rebuild on logical drive "The_Clans" aborted at 3% because of error
3 | AMD Chipset SATA Controller - Controller 1 | Warning | 2019/01/01 17:24:26 | Disk (Port Number 4,Target ID 1) unplugged
4 | AMD Chipset SATA Controller - Controller 1 | Error | 2019/01/01 17:24:26 | Logical drive "The_Clans" goes offline
5 | AMD Chipset SATA Controller - Controller 1 | Error | 2019/01/01 17:24:26 | Disk (Port Number 4,Target ID 1) Setdown
6 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | BSL update on disk (Port Number 5,Target ID 1) at LBA 0x010eba501
7 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | BSL update on disk (Port Number 4,Target ID 1) at LBA 0x010eba501
8 | AMD Chipset SATA Controller - Controller 1 | Warning | 2019/01/01 17:24:26 | Task 20 disk error on disk (Port Number 4,Target ID 1) at LBA 0x010eba501 (Length 0x7f) with status 20; Error register: 0
9 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | BSL update on disk (Port Number 4,Target ID 1) at LBA 0x012c018
10 | AMD Chipset SATA Controller - Controller 1 | Warning | 2019/01/01 17:24:26 | Task 20 disk error on disk (Port Number 4,Target ID 1) at LBA 0x012c018 (Length 0x8) with status 20; Error register: 0
11 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 6,Target ID 1) resumed
12 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 5,Target ID 1) resumed
13 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 4,Target ID 1) resumed
14 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 3,Target ID 1) resumed
15 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 2,Target ID 1) resumed
16 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Media patrol on disk (Port Number 1,Target ID 1) resumed
17 | AMD Chipset SATA Controller - Controller 1 | Information | 2019/01/01 17:24:26 | Rebuild on logical drive "The_Clans" resumed
18 | AMD Chipset SATA Controller - Controller 1 | Warning | 2019/01/01 17:24:26 | Logical drive "The_Clans" goes critical
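
A report like this is easier to read once the warnings and errors are tallied per port. The following is a rough sketch only: it assumes each row carries its severity as a standalone word and names the disk as "Port Number N", it embeds six rows from the report above (rows 1, 3, 5, 6, 8 and 12, whitespace-separated), and the function name is hypothetical.

```python
import re
from collections import Counter

# Rows 1, 3, 5, 6, 8 and 12 of the report, one per line.
LOG = """\
1 AMD Chipset SATA Controller - Controller 1 Warning 2019/01/01 17:24:26 Media patrol on disk (Port Number 4,Target ID 1) aborted at 16% because of error
3 AMD Chipset SATA Controller - Controller 1 Warning 2019/01/01 17:24:26 Disk (Port Number 4,Target ID 1) unplugged
5 AMD Chipset SATA Controller - Controller 1 Error 2019/01/01 17:24:26 Disk (Port Number 4,Target ID 1) Setdown
6 AMD Chipset SATA Controller - Controller 1 Information 2019/01/01 17:24:26 BSL update on disk (Port Number 5,Target ID 1) at LBA 0x010eba501
8 AMD Chipset SATA Controller - Controller 1 Warning 2019/01/01 17:24:26 Task 20 disk error on disk (Port Number 4,Target ID 1) at LBA 0x010eba501 (Length 0x7f) with status 20; Error register: 0
12 AMD Chipset SATA Controller - Controller 1 Information 2019/01/01 17:24:26 Media patrol on disk (Port Number 5,Target ID 1) resumed
"""

# The first severity word on a row is the Severity column (case-sensitive,
# so "because of error" in a description is not picked up).
SEVERITY = re.compile(r"\b(Information|Warning|Error)\b")
PORT = re.compile(r"Port Number (\d+)")


def problems_by_port(log_text):
    """Count Warning/Error rows per port; rows that name no port are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        severity = SEVERITY.search(line)
        port = PORT.search(line)
        if severity and port and severity.group(1) != "Information":
            counts[int(port.group(1))] += 1
    return counts


print(problems_by_port(LOG))  # Counter({4: 4})
```

On these rows every warning or error that names a port points at port 4, while port 5 only shows up in informational entries.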


I'm confused by the line:
6 AMD Chipset SATA Controller - Controller 1 Information 2019/01/01 17:24:26 BSL update on disk (Port Number 5,Target ID 1) at LBA 0x010eba501