News: 77TB of Research Data Lost Because of HPE Software Update

Apr 1, 2020
What part of "offline backup" was unclear?

Saddest thing is it sounds like they haven't learned that lesson yet.

HP Supercomputer System Caused 77TB Data Loss At Japan's Kyoto Uni (gizchina.com)

Since it became impossible to restore the files in the area where the backup was executed after they disappeared, going forward we will implement not only mirrored backups but also enhancements such as retaining incremental backups for a period of time. We will work to improve not only the functionality but also the operational management to prevent a recurrence.
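For anyone wondering what "mirroring plus retained incremental backups" might look like in practice, here's a minimal sketch. The paths, the 10-day retention window, and the hard-link approach are my own assumptions for illustration; this is not the script Kyoto or HPE actually use.

```python
#!/usr/bin/env python3
"""Sketch of retained incremental backups: each run creates a timestamped
increment; unchanged files are hard-linked to the previous increment, and old
increments are kept for a retention window. All paths are hypothetical."""
import os
import shutil
import time
from pathlib import Path

SOURCE = Path("/data/large0")          # hypothetical source volume
BACKUP_ROOT = Path("/backup/large0")   # hypothetical backup volume
RETENTION_DAYS = 10                    # keep increments around "for some time"

def make_increment() -> Path:
    """Create a timestamped increment; unchanged files are hard-linked to the
    previous one, so each run only costs the space of what actually changed."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    snaps = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir()) if BACKUP_ROOT.exists() else []
    prev = snaps[-1] if snaps else None
    new_snap = BACKUP_ROOT / stamp

    for dirpath, _dirs, files in os.walk(SOURCE):
        rel = Path(dirpath).relative_to(SOURCE)
        (new_snap / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            src = Path(dirpath) / name
            dst = new_snap / rel / name
            old = prev / rel / name if prev else None
            if old is not None and old.exists() \
                    and old.stat().st_mtime >= src.stat().st_mtime \
                    and old.stat().st_size == src.stat().st_size:
                os.link(old, dst)       # unchanged file: hard link, no extra space
            else:
                shutil.copy2(src, dst)  # new or modified file: real copy
    return new_snap

def prune_old_increments() -> None:
    """Drop increments older than the retention window; never touches SOURCE."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for snap in BACKUP_ROOT.iterdir():
        if snap.is_dir() and snap.stat().st_mtime < cutoff:
            shutil.rmtree(snap)

if __name__ == "__main__":
    print(f"created increment {make_increment()}")
    prune_old_increments()
```

The point is simply that each increment is cheap and sticks around long enough to recover from a bad run, instead of a mirror that faithfully replicates a deletion.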
 
Sometimes people at the office whine a lot about following "due process" when moving things into live/production environments, especially new people (think grads) and "cowboys" who come from small companies. This is why there are people second-guessing your work (in a good way) and asking questions about what you're doing and whether you're 150% sure you understand what it is you're doing. As sad as it is, this is a good reminder that you always have to question anyone, even vendors, when they say "I have to do something in your system."

To all you people in SysOps and Development who hate filling out forms and going to review meetings: this is why due process exists within companies, especially big ones.

Regards.
 
Reactions: RodroX

USAFRet

Titan
Moderator
Offline backups won't save you when it is your broken backup script that is deleting files instead of actually backing them up.
True.
Obviously, multiple layers of brokenness.

It just weirds me out... every day we are admonished to back up our data, use good passwords, and keep good browsing habits...
And then the major companies you entrust your data and info to... screw it up.
 
Even if you delete data your snapshots should still have the data in them.

I can only assume they weren't keeping snapshots.
It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?
 

InvalidError

Titan
Moderator
Even if you delete data your snapshots should still have the data in them.

I can only assume they weren't keeping snapshots.
That only works when the backup script, the snapshot software, or whatever backup strategy they were using actually does its job as intended instead of destroying the files it was meant to preserve.

They got screwed over by a buggy backup script. Their data would likely have been fine if they hadn't attempted to back it up with the "updated" backup script that ended up destroying two days' worth of data before they realized something had gone wrong.
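To make that failure mode concrete, here's a toy sketch of how a "cleanup" step in a backup or maintenance script can silently widen its scope when a path variable comes back empty (say, because the script was changed while it was running). The paths and the 10-day threshold are invented and deletion is left as a dry run; this is not HPE's actual script.

```python
#!/usr/bin/env python3
"""Toy illustration: an empty path variable turns a targeted log cleanup into
a volume-wide purge. Hypothetical paths; dry run by default."""
import os
import time

STORAGE_ROOT = "/data/large0"   # hypothetical storage mount
MAX_AGE_DAYS = 10               # invented threshold for "old" files

def purge_old_logs(log_dir: str, dry_run: bool = True) -> None:
    """Delete files older than MAX_AGE_DAYS under STORAGE_ROOT/log_dir.
    Bug: if log_dir comes back empty, os.path.join(STORAGE_ROOT, "") is just
    STORAGE_ROOT itself, so the purge walks the whole volume instead of the
    log directory."""
    target = os.path.join(STORAGE_ROOT, log_dir)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for dirpath, _dirs, files in os.walk(target):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                if dry_run:
                    print(f"would delete {path}")
                else:
                    os.remove(path)

def purge_old_logs_guarded(log_dir: str, dry_run: bool = True) -> None:
    """Same purge, but it refuses to run when its scope variable is missing."""
    if not log_dir.strip().strip("/"):
        raise ValueError("log_dir is empty; refusing to purge the whole volume")
    purge_old_logs(log_dir, dry_run=dry_run)

if __name__ == "__main__":
    # An empty environment variable silently turns a log cleanup into a
    # volume-wide sweep (dry run only here).
    purge_old_logs(os.environ.get("BACKUP_LOG_DIR", ""))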
 
Reactions: dalauder
Apr 1, 2020
It only says "days of work are gone" from December 14th to 16th. So it sounds like they didn't actually lose that much, they just store WAY too much data. Seriously, what research system puts 30TB in permanent storage per day?

According to the Gizchina article, only 4 groups were not recoverable. If I understand the article correctly, the 77TB of files include all 14 groups, so the actual loss may be a small fraction of that.

SCOPE OF THE FILE LOSS
  • Target file system: /LARGE0
  • File deletion period: December 14, 2021, 17:32 to December 16, 2021, 12:43
  • Files deleted: files that had not been updated since 17:32 on December 3, 2021
  • Lost file capacity: approximately 77TB
  • Number of lost files: approximately 34 million
  • Number of affected groups: 14 (of which 4 cannot be restored from backup)
 

TheOtherOne

Distinguished
Oct 19, 2013
What about multiple mirrors of your files when it's CRITICAL RESEARCH data? ¯\_(ツ)_/¯
Do they really not have the resources? Any random crappy file hosting website has multiple mirrors for its files, and here we are talking about a Uni doing important research. They should've had all those files stored at multiple locations/servers as they get uploaded, and of course the usual double backup of the main server on whatever schedule they have set.
 
Reactions: drtweak

TheOtherOne

Distinguished
Oct 19, 2013
If it is your backup management script that is destroying source files, having 100 backups wouldn't help since the files are being deleted as you are attempting to back them up.
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server. And again, a double or even triple backup would prevent that too, unless the backup software (for some n00b reason) is given the wrong permissions and can do whatever it wants willy-nilly on multiple machines.
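A rough sketch of that "fan out at upload time" idea, assuming hypothetical mount points for the mirrors; a real deployment would put each destination on a different machine and failure domain:

```python
#!/usr/bin/env python3
"""Sketch: copy each uploaded file to several independent destinations at
ingest time and verify the copies, so a later backup bug on the primary
cannot touch the other copies. Mount points are hypothetical."""
import hashlib
import shutil
from pathlib import Path

# Hypothetical destinations; ideally separate machines / failure domains.
MIRRORS = [Path("/mnt/primary"), Path("/mnt/mirror-a"), Path("/mnt/mirror-b")]

def sha256_of(path: Path) -> str:
    """Checksum used to verify every copy matches the uploaded original."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def ingest(upload: Path, relative_name: str) -> None:
    """Fan an uploaded file out to every mirror and verify each copy."""
    expected = sha256_of(upload)
    for root in MIRRORS:
        dest = root / relative_name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(upload, dest)
        if sha256_of(dest) != expected:
            raise OSError(f"checksum mismatch writing {dest}")

if __name__ == "__main__":
    ingest(Path("/tmp/incoming/results.csv"), "group01/results.csv")
```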
 

InvalidError

Titan
Moderator
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server.
If HP could screw up a backup script, they could just as easily screw up a mirroring script.

Nothing is ever safe from screw-ups, especially new data that hasn't been backed up anywhere else yet.

And if you are backing up ~40 TB daily, chances are your entire data set isn't something you want to store mirrors of on top of backups and whatever additional redundancy may be built into the system itself.
 
What about multiple mirrors of your files when it's CRITICAL RESEARCH data? ¯\_(ツ)_/¯
Do they really not have the resources? Any random crappy file hosting website has multiple mirrors for its files, and here we are talking about a Uni doing important research. They should've had all those files stored at multiple locations/servers as they get uploaded, and of course the usual double backup of the main server on whatever schedule they have set.
It doesn't really sound like it's "CRITICAL RESEARCH data". From what I can figure, they've probably already caught up on all the data they lost.
 
I am not referring to the usual backup methods here. Multiple mirrors, as in: when they upload a file, it gets uploaded simultaneously to multiple servers. The backup software won't have access to the paths of those files anyway, since it would back up the data on the main server. And again, a double or even triple backup would prevent that too, unless the backup software (for some n00b reason) is given the wrong permissions and can do whatever it wants willy-nilly on multiple machines.
That's how you back up a major submission milestone, maybe, or especially finalized research data. But this is just daily intermediate stuff that will presumably get revised anyway next semester. A backup script is what you use for that stuff.
 

ThatMouse

Distinguished
Jan 27, 2014
A billion-dollar supercomputer and they can't afford a real backup? The excuses are astonishing and make zero sense. I don't know why HPE is taking the blame; the systems administrator is not someone you can just outsource. HPE doesn't know what files need backing up. Did the researchers just send HPE an email and say "all our important files are backed up"? Yes, that's good enough for me!
 

InvalidError

Titan
Moderator
That's why you rotate tapes/backups. It is effective then.

The article is unclear, however, on whether it's all their data or just the new data that was created in that couple of days.
Since it is two specific days of data that were lost, I'd guess it was the NEW DATA that was being destroyed by the borked script. So that data wouldn't exist in any other backups since it didn't exist yet when older backups were made and no longer existed for subsequent backups to pick up.
 
Reactions: digitalgriffin
If it is your backup management script that is destroying source files, having 100 backups wouldn't help since the files are being deleted as you are attempting to back them up.

That's what a differential rolling tape backup is for. When the backup starts, you take an image of all the file time/date stamps and lock those files until they are backed up (compared against the previous backup). If a file modification is requested, the old file gets copied to special storage until it is archived on tape. With a rolling differential tape backup you are covered this way.
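Roughly, the differential pass described above could look like this sketch: record a baseline of path to (mtime, size), copy anything new or changed since the previous pass into a staging area, and leave the file locking and actual tape archival out of scope. The paths and the JSON manifest are assumptions for illustration, not any specific tape product.

```python
#!/usr/bin/env python3
"""Sketch of a differential pass: compare the current tree against the
previous pass's manifest and copy changed files into staging ("special
storage") before they go to tape. Hypothetical paths."""
import json
import os
import shutil
from pathlib import Path

SOURCE = Path("/data/large0")        # hypothetical source volume
STAGING = Path("/backup/staging")    # the "special storage" awaiting tape
MANIFEST = Path("/backup/manifest.json")

def scan(root: Path) -> dict:
    """Record (mtime, size) for every file under root."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            st = p.stat()
            state[str(p.relative_to(root))] = [st.st_mtime, st.st_size]
    return state

def differential_pass() -> None:
    """Copy anything new or changed since the previous pass into STAGING,
    then update the manifest (the tape archival step itself is out of scope)."""
    baseline = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = scan(SOURCE)
    for rel, stamp in current.items():
        if baseline.get(rel) != stamp:          # new or modified file
            dest = STAGING / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(SOURCE / rel, dest)    # preserve a copy before archival
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(current))

if __name__ == "__main__":
    differential_pass()
```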
 

InvalidError

Titan
Moderator
That's what a differential rolling tape backup is for. When the backup starts, you take an image of all the file time/date stamps and lock those files until they are backed up (compared against the previous backup). If a file modification is requested, the old file gets copied to special storage until it is archived on tape. With a rolling differential tape backup you are covered this way.
If it is the backup script that is screwing up files, your new files are still screwed.