News: 77TB of Research Data Lost Because of HPE Software Update

Apr 1, 2020
What part of "offline backup" was unclear?

Saddest thing is it sounds like they haven't learned that lesson yet.

HP Supercomputer System Caused 77TB Data Loss At Japan's Kyoto Uni (gizchina.com)

Since it became impossible to restore the files in the area where the backup was executed after they disappeared, going forward we will implement not only mirrored backups but also enhancements such as retaining incremental backups for a period of time. We will work to improve not only the functionality but also the operational management to prevent a recurrence.
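For anyone wondering what "mirroring plus retained incremental backups" might look like in practice, here's a minimal sketch. The paths, the 10-day retention window, and the hard-link approach are my own assumptions for illustration; this is not the script Kyoto or HPE actually use.

```python
#!/usr/bin/env python3
"""Sketch of retained incremental backups: each run creates a timestamped
increment; unchanged files are hard-linked to the previous increment, and old
increments are kept for a retention window. All paths are hypothetical."""
import os
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/large0")          # hypothetical source volume
BACKUP_ROOT = Path("/backup/large0")   # hypothetical backup volume
RETENTION_DAYS = 10                    # keep increments around "for some time"

def make_increment() -> Path:
    """Create a timestamped increment; unchanged files are hard-linked to the
    previous one, so each run only costs the space of what actually changed."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    snaps = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir()) if BACKUP_ROOT.exists() else []
    prev = snaps[-1] if snaps else None
    new_snap = BACKUP_ROOT / stamp

    for dirpath, _dirs, files in os.walk(SOURCE):
        rel = Path(dirpath).relative_to(SOURCE)
        (new_snap / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            src = Path(dirpath) / name
            dst = new_snap / rel / name
            old = prev / rel / name if prev else None
            if old is not None and old.exists() \
                    and old.stat().st_mtime >= src.stat().st_mtime \
                    and old.stat().st_size == src.stat().st_size:
                os.link(old, dst)       # unchanged file: hard link, no extra space
            else:
                shutil.copy2(src, dst)  # new or modified file: real copy
    return new_snap

def prune_old_increments() -> None:
    """Drop increments older than the retention window; never touches SOURCE."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for snap in BACKUP_ROOT.iterdir():
        if snap.is_dir() and snap.stat().st_mtime < cutoff:
            shutil.rmtree(snap)

if __name__ == "__main__":
    print(f"created increment {make_increment()}")
    prune_old_increments()
```

The point is simply that each increment is cheap and sticks around long enough to recover from a bad run, instead of a mirror that faithfully replicates a deletion.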
 
Sometimes people at the office whine a lot about following "due process" when moving things into live/production environments, especially new people (think grads) and "cowboys" who come from small companies. This is why there are people second-guessing your work (in a good way) and asking questions about what you're doing and whether you're 150% sure you understand what it is you're doing. As sad as it is, this is a good reminder that you always have to question anyone, even vendors, when they say "I have to do something in your system."

To all you people in SysOps and Development who hate filling out forms and going to review meetings: this is why due process exists within companies, especially big ones.

Regards.
 
Reactions: RodroX

USAFRet

Titan
Moderator
Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
True.
Obviously, multiple layers of brokenness.

It just weirds me out... every day we are admonished to back up our data, use good passwords, and keep good browsing habits...
And then the major companies you entrust your data and info to... screw it up.
 
Even if you delete data your snapshots should still have the data in them.

I can only assume they weren't keeping snapshots.
It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?
 

InvalidError

Titan
Moderator
Even if you delete data your snapshots should still have the data in them.

I can only assume they weren't keeping snapshots.
That only works when the backup script, the snapshot software, or whatever backup strategy they were using actually does its job as intended instead of destroying the files it was meant to preserve.

They got screwed over by a buggy backup script. Their data would likely have been fine if they hadn't attempted to back it up with the "updated" backup script that ended up destroying two days' worth of data before they realized something had gone wrong.
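To make that failure mode concrete, here's a toy sketch of how a "cleanup" step in a backup or maintenance script can silently widen its scope when a path variable comes back empty (say, because the script was changed while it was running). The paths and the 10-day threshold are invented and deletion is left as a dry run; this is not HPE's actual script.

```python
#!/usr/bin/env python3
"""Toy illustration: an empty path variable turns a targeted log cleanup into
a volume-wide purge. Hypothetical paths; dry run by default."""
import os
import time

STORAGE_ROOT = "/data/large0"   # hypothetical storage mount
MAX_AGE_DAYS = 10               # invented threshold for "old" files

def purge_old_logs(log_dir: str, dry_run: bool = True) -> None:
    """Delete files older than MAX_AGE_DAYS under STORAGE_ROOT/log_dir.
    Bug: if log_dir comes back empty, os.path.join(STORAGE_ROOT, "") is just
    STORAGE_ROOT itself, so the purge walks the whole volume instead of the
    log directory."""
    target = os.path.join(STORAGE_ROOT, log_dir)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for dirpath, _dirs, files in os.walk(target):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                if dry_run:
                    print(f"would delete {path}")
                else:
                    os.remove(path)

def purge_old_logs_guarded(log_dir: str, dry_run: bool = True) -> None:
    """Same purge, but it refuses to run when its scope variable is missing."""
    if not log_dir.strip().strip("/"):
        raise ValueError("log_dir is empty; refusing to purge the whole volume")
    purge_old_logs(log_dir, dry_run=dry_run)

if __name__ == "__main__":
    # An empty environment variable silently turns a log cleanup into a
    # volume-wide sweep (dry run only here).
    purge_old_logs(os.environ.get("BACKUP_LOG_DIR", ""))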
 
Reactions: dalauder
Apr 1, 2020
It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?

According to the Gizchina article, only 4 groups were not recoverable. If I understand the article correctly, the 77TB of files include all 14 groups, so the actual loss may be a small fraction of that.

SCOPE OF THE FILE LOSS
  • Target file system: /LARGE0
  • File deletion period: December 14, 2021, 17:32 to December 16, 2021, 12:43
  • Files deleted: files that had not been updated since 17:32 on December 3, 2021
  • Lost file capacity: approximately 77TB
  • Number of lost files: approximately 34 million
  • Number of affected groups: 14 (of which 4 cannot be restored from backup)
 

TheOtherOne

Distinguished
Oct 19, 2013
What about multiple mirrors of your files when it's CRITICAL RESEARCH data? ¯\_(ツ)_/¯
Do they really not have the resources? Any random crappy file hosting website has multiple mirrors for its files, and here we are talking about a Uni doing important research. They should've had all those files stored at multiple locations/servers as they get uploaded, and of course the usual double backup of the main server on whatever schedule they have set.
 
Reactions: drtweak

TheOtherOne

Distinguished
Oct 19, 2013
If it is your backup management script that is destroying source files, having 100 backups wouldn't help since the files are being deleted as you are attempting to back them up.
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server. And again, a double or even triple backup would prevent that too, unless the backup software (for some n00b reason) is given the wrong permissions and can do whatever it wants willy-nilly on multiple machines.
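A rough sketch of that "fan out at upload time" idea, assuming hypothetical mount points for the mirrors; a real deployment would put each destination on a different machine and failure domain:

```python
#!/usr/bin/env python3
"""Sketch: copy each uploaded file to several independent destinations at
ingest time and verify the copies, so a later backup bug on the primary
cannot touch the other copies. Mount points are hypothetical."""
import hashlib
import shutil
from pathlib import Path

# Hypothetical destinations; ideally separate machines / failure domains.
MIRRORS = [Path("/mnt/primary"), Path("/mnt/mirror-a"), Path("/mnt/mirror-b")]

def sha256_of(path: Path) -> str:
    """Checksum used to verify every copy matches the uploaded original."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def ingest(upload: Path, relative_name: str) -> None:
    """Fan an uploaded file out to every mirror and verify each copy."""
    expected = sha256_of(upload)
    for root in MIRRORS:
        dest = root / relative_name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(upload, dest)
        if sha256_of(dest) != expected:
            raise OSError(f"checksum mismatch writing {dest}")

if __name__ == "__main__":
    ingest(Path("/tmp/incoming/results.csv"), "group01/results.csv")
```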
 

InvalidError

Titan
Moderator
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server.
If HP could screw up a backup script, they could just as easily screw up a mirroring script.

Nothing is ever safe from screw-ups, especially new data that hasn't been backed up anywhere else yet.

And if you are backing up ~40 TB daily, chances are your entire data set isn't something you want to store mirrors of on top of backups and whatever additional redundancy may be built into the system itself.
 
What about multiple mirrors of your files when it's CRITICAL RESEARCH data? ¯\_(ツ)_/¯
Do they really not have the resources? Any random crappy file hosting website has multiple mirrors for its files, and here we are talking about a Uni doing important research. They should've had all those files stored at multiple locations/servers as they get uploaded, and of course the usual double backup of the main server on whatever schedule they have set.
It doesn't really sound like it's "CRITICAL RESEARCH data". From what I can figure, they've probably already caught up on all the data they lost.
 
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server. And again, a double or even triple backup would prevent that too, unless the backup software (for some n00b reason) is given the wrong permissions and can do whatever it wants willy-nilly on multiple machines.
That's how you back up a major submission milestone, maybe, or especially finalized research data. But this is just daily intermediate stuff that will presumably get revised anyway next semester. A backup script is what you use for that stuff.
 

ThatMouse

Distinguished
Jan 27, 2014
A billion-dollar supercomputer and they can't afford a real backup? The excuses are astonishing and make zero sense. I don't know why HPE is taking the blame; the systems administrator is not someone you can just outsource. HPE doesn't know what files need backing up. Did the researchers just send HPE an email and say "all our important files are backed up"? Yes, that's good enough for me!
 

InvalidError

Titan
Moderator
That's why you rotate tapes/backups. It is effective then.

The article is unclear, however, on whether it's all their data or just the new data that was created in that couple of days.
Since it is two specific days of data that were lost, I'd guess it was the NEW DATA that was being destroyed by the borked script. So that data wouldn't exist in any other backups since it didn't exist yet when older backups were made and no longer existed for subsequent backups to pick up.
 
Reactions: digitalgriffin
If it is your backup management script that is destroying source files, having 100 backups wouldn't help since the files are being deleted as you are attempting to back them up.

That's what a differential rolling tape backup is for. When the backup starts, you take an image of all the file time/date stamps and lock those files until they are backed up (compared against the previous backup). If a file modification is requested, the old file gets copied to special storage until it is archived on tape. With a rolling differential tape backup you are covered this way.
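Roughly, the differential pass described above could look like this sketch: record a baseline of path to (mtime, size), copy anything new or changed since the previous pass into a staging area, and leave the file locking and actual tape archival out of scope. The paths and the JSON manifest are assumptions for illustration, not any specific tape product.

```python
#!/usr/bin/env python3
"""Sketch of a differential pass: compare the current tree against the
previous pass's manifest and copy changed files into staging ("special
storage") before they go to tape. Hypothetical paths."""
import json
import os
import shutil
from pathlib import Path

SOURCE = Path("/data/large0")        # hypothetical source volume
STAGING = Path("/backup/staging")    # the "special storage" awaiting tape
MANIFEST = Path("/backup/manifest.json")

def scan(root: Path) -> dict:
    """Record (mtime, size) for every file under root."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            st = p.stat()
            state[str(p.relative_to(root))] = [st.st_mtime, st.st_size]
    return state

def differential_pass() -> None:
    """Copy anything new or changed since the previous pass into STAGING,
    then update the manifest (the tape archival step itself is out of scope)."""
    baseline = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = scan(SOURCE)
    for rel, stamp in current.items():
        if baseline.get(rel) != stamp:          # new or modified file
            dest = STAGING / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(SOURCE / rel, dest)    # preserve a copy before archival
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(current))

if __name__ == "__main__":
    differential_pass()
```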
 

InvalidError

Titan
Moderator
That's what a differential rolling tape backup is for. When the backup starts, you take an image of all the file time/date stamps and lock those files until they are backed up (compared against the previous backup). If a file modification is requested, the old file gets copied to special storage until it is archived on tape. With a rolling differential tape backup you are covered this way.
If it is the backup script that is screwing up files, your new files are still screwed.