[SOLVED] Tried to Replace SSD in Degraded RAID 1, now both drives "Offline Members"

Status
Not open for further replies.

micahfaulkner75

Thank you in advance for your help. I confess, I know only enough about computers to be dangerous!

A friend of mine set up a SuperMicro server for me years ago. I use it for media and file storage. The setup is a CentOS 7 OS installed across a 2-SSD RAID 1 array, and then I have 4 platter drives in the front slots as a RAID 5 with all my content.

The other day, I saw that one of the OS drives had failed. So I ordered the same SSD again, removed the dead drive, inserted the new one, and rebooted.

After that, I got this screen. Seemed good to go, so I clicked "exit" and left it to do its thing.

[Image: 5fsYgjH.jpg]


When I returned, I found this ugly list of errors:

[Image: EiEjhHN.jpg]


Eventually this froze for a long time. Then I rebooted it and got this screen that lists both drives as "Offline Member."

[Image: enMFXLz.jpg]


At this point, I am mildly panicked. I have not done anything since then. I have only tried to find advice online and am still stumped.

I assume that my next step is to Reset the new drive to Non-RAID and try again...maybe by remounting the original one in some way?

I suspect I am near a point where I might do terrible damage if I do something wrong. The server was chosen as a safe solution with RAID 5 so that we'd never lose our photos, etc. So, worrisome.

Here is a picture of the Boot screen that I usually see. Hopefully this will help someone point me in the right direction for troubleshooting.

[Image: SazZ1lw.jpg]


Thanks again for any help solving this.

Micah
 
This is when the following procedure is recommended.

1. Instantiate the RAID 1 with 2 blank drives.
2. Recover from the full backup you made before this happened.
3. Attach the RAID 5 array. Hopefully it is recognized.
4. If not, recreate the RAID 5 and recover the data from the backup of that.


But....
I'm assuming that backup stuff does not exist?
 
Yeah, beyond what USAFRet suggests, if the data is important, I'd look for someone local who can physically access your PC. This is one of the hardest things to walk people through remotely.

Whether or not this turns out well, please do a serious re-evaluation of your home setup. The issue wasn't you, but your friend who knows "just enough to be dangerous." RAID makes zero sense in the use case you've provided, nor was anything about this RAID setup well considered.
 
I actually might have a backup. I have the two original SSD drives still from way back (not sure what state they're in). If either or both of them are readable, would they still be able to attach to the RAID 5? Would the OS updates between those eras prohibit that, or no?
 
I'm not sure why RAID makes zero sense. Isn't RAID a safer way to store my data than on a single drive? Or did I miss your point.
 
I actually might have a backup. I have the two original SSD drives still from way back (not sure what state they're in). If either or both of them are readable, would they still be able to attach to the RAID 5? Would the OS updates between those eras prohibit that, or no?
Completely unknown.
There are 2 parts to 'backups'.
One is actually doing it.
The other is how to recover.
 
I'm not sure why RAID makes zero sense. Isn't RAID a safer way to store my data than on a single drive? Or did I miss your point.
A RAID 1 or 5 is good for continued uninterrupted uptime, in the face of a physically dead drive.
It does nothing for all the other forms of data loss.
Virus, ransomware, accidental deletion or formatting, corruption.

It would be good for a webserver, when unscheduled downtime = lost sales.
Any company that runs RAID 1 on their system also has (or should have) a comprehensive backup routine.

The RAID by itself is almost worse than nothing, because it gives a false sense of security.

If you can afford an hour downtime while you recover data from a real backup, the RAID 1 or 5 is not needed, and just gets in the way.
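
To illustrate the difference between a dead drive and every other kind of loss, here is a toy Python sketch of RAID 5 style XOR parity. It is a simplification: real arrays rotate parity across the drives and work on much larger stripes, and the block contents and names below are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three hypothetical data drives, plus its parity block.
d0, d1, d2 = b"PHOT", b"OS!!", b"2019"
parity = xor_blocks(d0, d1, d2)

# Drive 1 dies: its block is rebuilt from the surviving drives and the parity.
rebuilt_d1 = xor_blocks(d0, d2, parity)

# An accidental overwrite is different: parity is recomputed for the new
# contents, so the array can only ever give back the overwritten block.
d1_after_mistake = b"\x00\x00\x00\x00"
parity_after_mistake = xor_blocks(d0, d1_after_mistake, d2)
recovered_after_mistake = xor_blocks(d0, d2, parity_after_mistake)
```

The rebuilt block matches the lost one, which is the dead-drive case RAID handles. After the overwrite, parity only reproduces the overwritten contents, so the original would have to come from a separate backup.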
 
So, it's better to just have the OS on a single drive and then also cloned onto a backup drive for if things go south?

The RAID 5 was meant to secure the data (photos, movies, etc). It's like 7 TB of content that we use regularly. I don't understand how that would be backed up, or am I misapprehending?

All that aside, should I try to plug in either or both of the old drives into those slots?

Do you have any idea why the REBUILD function suddenly bonked the working half of the RAID 1 into that OFFLINE MEMBER state?

If something is an OFFLINE MEMBER, does it mean that the drive is dead? Or just unmounted? Something else?

Thanks again.
 
My current system has 6x drives.
Each with specific data (mostly).
CAD, photo, video, games, etc.
Each individually backed up nightly to a folder tree on my NAS. But that could just as easily be to an external drive or two.

I can recover any drive, or the entire system, to any state it was in the last 30 days.
This covers all the forms of software loss, as well as physical drive death.

Currently, with your RAID 5 and its data...if you accidentally delete something, it is gone.
There is no other copy to get back.

A good backup routine will do a Full backup of all the data, then Incremental or Differential thereafter.
For instance, if you add a new movie, that is ALL that happens in the subsequent Incremental. Not the entire 7TB again.
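
A rough Python sketch of the "only copy what changed" idea, not the backup software described in this post. The function name and the size/modification-time comparison are assumptions for illustration, and it keeps a single mirror rather than 30 days of restore points.

```python
import shutil
from pathlib import Path
from typing import List


def incremental_backup(source: Path, dest: Path) -> List[Path]:
    """Copy only files that are new or changed since the previous run.

    Returns the relative paths that were copied this time.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        target = dest / src.relative_to(source)
        if target.exists():
            s, t = src.stat(), target.stat()
            # Assumption: same size and a backup copy that is not older
            # than the source means the file is unchanged.
            if s.st_size == t.st_size and s.st_mtime <= t.st_mtime:
                continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy2 also preserves the modification time
        copied.append(src.relative_to(source))
    return copied
```

On the first run everything is copied (the full backup). On later runs, adding one new movie results in exactly one file being copied.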

And I have had to use that, after the death of a 960GB SanDisk SSD.
It died suddenly. No idea why, and mostly didn't care. It was dead dead dead.

Slot in a new drive, click click....wait about an hour.
All 605GB data recovered exactly as it was at 4AM that morning, when that drive ran its nightly backup.

My procedure is the first post here.
Somewhat modified since I wrote it, but that's the basics:
 
So....

Ignore the RAID 1 for the OS drive(s).
On one of those, or some other drive, install CentOS.
Then, connect the drives in the RAID 5.
See if the newly installed OS can recognize that RAID 5 array.


What's the likelihood that a fresh install of CentOS will recognize that RAID 5 array?

Should this work? Or am I in a bad place?
 
More likely than not(?).

But absolutely no guarantees.
We don't know how your friend set it up originally.


He didn't seem to be doing anything weird when he was installing. Seemed to be doing a pretty default install.

I assume having the exact same release of CentOS 7 will matter, right? I wonder if there's a way I can figure that out...

I assume I can do that...

Let's assume I can ID the exact release and he didn't do anything fishy in the RAID setup, the RAID 5 should be recognizable?

What happens if people a thousand years from now find a RAID 5 array in a bunker somewhere, but in separate platter drives sitting in a cardboard box? How are they able to reassemble the data? How do people like the FBI, etc, get that stuff back out when it's spread over multiple drives?
 
assume, maybe, should be.....
All unknowns, until you see the actual data.

How might the FBI/CIA/GCHQ/FSB do it?
There are tools that can probably reassemble a RAID array.
ReclaiMe, for one.
http://www.freeraidrecovery.com/?s=rd

But the tool is only one part of it. The skills to use it are the other.
And also, it depends on what data is presumed to be on it.
The FBI does not care about your movie collection, and will spend exactly 0 seconds on recovering it.
Now...if it was known that you had the plans for a working cold fusion reactor....

But what they would do is make a forensic copy of all drives. Probably a couple.
Work on those, rather than mess with the only copy of the originals.


And a thousand years from now?
The drives won't work anyway, so no worries there.
 
Is it smart for me to pull my platter drives from the front of the machine while I'm messing with this OS problem? Or will that matter? Or will it cause problems?
 
Also: I still have my terminal open from when I first started seeing a problem and looking at it, I see this bit of text:

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.10.1.el7.x86_64] (local build)

Looking at the CentOS wiki, it looks like this would be version 7.6-1810 (here's a link).

Am I doing this right?
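
For anyone repeating this lookup, here is a small Python sketch that maps the kernel build number in a string like the one above to a CentOS 7 point release. The table is an assumption written from the kernel versions each release initially shipped with, so check it against the CentOS release notes. An older install that was later updated can also run a newer kernel, so /etc/centos-release on the installed system is the more direct answer when it can be read.

```python
import re
from typing import Optional

# First kernel build (3.10.0-<build>) shipped with each CentOS 7 release.
# Assumed values for illustration; verify against the release notes.
CENTOS7_RELEASES = [
    (123, "7.0-1406"),
    (229, "7.1-1503"),
    (327, "7.2-1511"),
    (514, "7.3-1611"),
    (693, "7.4-1708"),
    (862, "7.5-1804"),
    (957, "7.6-1810"),
    (1062, "7.7-1908"),
    (1127, "7.8-2003"),
    (1160, "7.9-2009"),
]


def centos7_release(kernel: str) -> Optional[str]:
    """Return the latest release whose first kernel build is <= this build."""
    match = re.search(r"3\.10\.0-(\d+)[\d.]*\.el7", kernel)
    if match is None:
        return None
    build = int(match.group(1))
    release = None
    for first_build, name in CENTOS7_RELEASES:
        if build >= first_build:
            release = name
    return release


print(centos7_release("3.10.0-957.10.1.el7.x86_64"))  # prints 7.6-1810
```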
 
Update:

So, I think I'm fixed up. Here's what I did and what went on.

First, I downloaded a live USB image for CentOS 7.6-1810 from here (I downloaded a few. The Gnome version didn't work, but the KDE one did) and installed it to USB using Etcher.

It booted up just fine and I could see the RAID-5 storage just fine. I plugged in a 1TB external drive.

I used the cp command to copy the vital data (photos, etc) onto the drive (if I lost TV shows or whatever, I could always get those back).
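
Since this was the only copy of the vital data at that moment, it may be worth checking that the copy actually matches before touching the array again. Below is a minimal Python sketch that compares checksums after a copy. It is not what was run here (that was plain cp); the function names and the example paths in the comments are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source: Path, dest: Path) -> List[Path]:
    """Return relative paths that are missing or different in the copy."""
    mismatched = []
    for src in sorted(source.rglob("*")):
        if src.is_file():
            rel = src.relative_to(source)
            copy = dest / rel
            if not copy.is_file() or sha256_of(src) != sha256_of(copy):
                mismatched.append(rel)
    return mismatched


# Example use with hypothetical mount points (the destination must not exist yet):
# shutil.copytree(Path("/mnt/raid5/photos"), Path("/mnt/external/photos"))
# print(verify_copy(Path("/mnt/raid5/photos"), Path("/mnt/external/photos")))
```

An empty list means every file on the source was found on the external drive with identical contents.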

After everything was safe-and-sound, I rebooted the server and got to the Option-ROM screen and reverted the NEW drive to Non-RAID.

On reboot, the whole machine fired up (still with all the errors). It worked for about 8 hours, then failed.

I think there was just so much erroring going on, that it committed suicide or something.

BUT...

This morning, I got to thinking about a pair of drives from this that were quite old. Could one of them still work?

I plugged it in, and it fired right up. I stuck the new drive next to it in the other slot and within an hour or so, the full RAID1 was back in order and happy (though, still in the state it was in 5 years ago or whatever, so needed updates and adjusted software preferences).

After it was fully cloned, I pulled the original out, wrote "MASTER" on it, and put it back in the little drawer by my server. Then I plugged another blank drive into its slot and let the system clone the RAID1 back over to IT.

So, what's the lesson here?

Lesson 1: Have a physical backup of your OS ready (and don't forget that you have one!).

Lesson 2: If your server is running software, pushing out media, etc, it is NOT a backup plan for your content. That content needs a backup plan too!

Thanks USAFRet for all your help here. And thank you for your advice about safety.

I hope my ordeal can help someone else!
 