Penguin server won't boot after swapping failed hard drive in RAID 6

Feb 9, 2023
So I have a Penguin Relion 1900 1U rack-mounted server running CentOS 7 that recently had a hard disk failure. It had 4 hard drives configured in RAID 6. To replace the failed HDD, I powered off the machine, swapped the failed drive for a new one of the same size (4 TB), and powered the machine on again.
The system booted into the EFI shell instead of loading the OS. I exited the shell, went into the BIOS, and noticed under the RAID menu in the Advanced settings that the array status was rebuilding.

Six to seven hours later the rebuild had completed and the RAID status was 'Optimal', with everything looking good. I saved and exited the BIOS, but the machine dropped into the EFI shell again instead of loading the OS.

I rebooted the machine and could not see the virtual drive provided by the RAID controller in the BIOS boot priority list; instead there was a "SCSI Hard Drive, ..." option.

My boot mode is set to UEFI, so I switched to Legacy under the CSM configuration settings in the BIOS. When I reboot in Legacy mode, the machine goes into network boot and the "SCSI Hard Drive ..." entry disappears from the boot order list.

In Legacy boot I do get the option to enter the RAID BIOS, which I did to take a look. Every status shows Optimal, and the virtual drive is created and present in the RAID BIOS. It's just not being picked up by the system BIOS, so the OS never loads.
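One thing I can try from the UEFI shell is launching the CentOS bootloader by hand (a rough sketch only, assuming the EFI system partition survived the rebuild, maps as fs0:, and still has the stock CentOS file layout — the actual mapping on my machine may differ, which is why I've attached the map -r output below):

    map -r
    fs0:
    ls \EFI\centos
    \EFI\centos\shimx64.efi

If shimx64.efi launches GRUB, the data side is fine and only the NVRAM boot entry was lost.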

I have attached pictures of my BIOS settings.

RAID setting in the System BIOS (only appears when booting in UEFI mode; also notice the boot device is set to [None])

map -r output on UEFI shell

Boot order in BIOS

CSM config

Any help would be much appreciated. I've already spent a whole day trying to fix it!
 
Have you tried booting a USB copy of CentOS? Are the files actually intact on the RAID?

Why did you build RAID 6 with 4 drives? Seems like a waste.
Just tried booting off a CentOS USB today and went into troubleshooting mode. The auto-repair couldn't detect a Linux partition and dropped into the troubleshooting shell, though I could see that the RAID volume exists and is visible. It's just not mounted, and the system isn't booting from it. When I try mounting it with mount /dev/sda3 /mnt, I get a "wrong fs type, bad option, ..." error (pic attached)

Using lsblk also shows the 7.3 TB RAID volume (pic attached)

fdisk -l
lsblk
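Here's roughly what I plan to run next from the rescue shell (a sketch only; I'm assuming /dev/sda3 is the root partition and that it's XFS, the CentOS 7 default — both are guesses until blkid confirms them):

    # identify the filesystem on the partition the mount failed on
    blkid /dev/sda3
    lsblk -f
    # if blkid reports xfs, retry the mount with the type given explicitly
    mount -t xfs /dev/sda3 /mnt
    # dry-run check of the XFS metadata; -n means nothing is written
    xfs_repair -n /dev/sda3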

As for why it's RAID 6: the previous admin handling the server messed up and configured it as RAID 6 instead of RAID 5. It's been that way ever since.

Appreciate the help!
 
If your data is intact, I would get an 8+ TB external drive and back up the contents of the array, then wipe it and start over.
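From a live USB, something along these lines would do it (a sketch, assuming the array mounts at /mnt and the external drive at /backup — swap in your actual device names):

    # mount the array read-only so nothing on it can be modified
    mount -o ro /dev/sda3 /mnt
    mount /dev/sdb1 /backup
    # copy everything, preserving permissions, ownership, ACLs, and xattrs
    rsync -aHAX --progress /mnt/ /backup/

Mounting read-only matters here: until the boot problem is understood, you don't want anything writing to the array.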
 
How much is your time worth? How much does downtime cost? That has to be balanced too...
So keeping the server up isn't critical; we're just a research lab, and the server hosts a bunch of PhD students' data. Considering that the backup we have is a month old, we'd prefer to get the current volume fixed, and how long it takes doesn't matter too much. As far as I can tell, all the data is intact on the volume; the OS just cannot see it or boot from it. That's what I'm trying to figure out.
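Once the filesystem actually mounts, my plan is to chroot in from the live USB and rebuild the boot entry (a sketch, assuming /dev/sda3 is the root partition and /dev/sda1 is the EFI system partition — I still need to confirm both):

    # mount the installed system plus the EFI partition, then bind the host's
    # dev, proc, and sys so tools inside the chroot can see the hardware
    mount /dev/sda3 /mnt
    mount /dev/sda1 /mnt/boot/efi
    for d in dev proc sys; do mount --bind /$d /mnt/$d; done
    chroot /mnt
    # regenerate the GRUB config and recreate the lost UEFI boot entry
    grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    efibootmgr -c -d /dev/sda -p 1 -L "CentOS" -l '\EFI\centos\shimx64.efi'

If the "boot device [None]" screenshot above means the UEFI NVRAM entry was dropped during the rebuild, recreating it with efibootmgr should be enough.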
 
If you can't see the data via the USB-booted OS, then maybe the data isn't really there.