Question Erratic external hard drive behavior

Apr 1, 2020
So, at the beginning of February, our 20TB G-RAID Thunderbolt 2 external hard drive failed. We use this hard drive to store backups for 9 different machines (it's connected via Thunderbolt cable to one of them, and the rest are backed up over the network via a shared folder). We had not configured it to use RAID 1, and we didn't need to recover any of the lost backups that were stored on it, so we chose to take advantage of our warranty and swap it out.

They sent us a new one, and we chose to configure it with RAID 1. I did this after formatting it to NTFS, etc. for Windows use. All seemed well at this point. Our backups run over the weekends, alternating between two groups of 4-5 computers. I try to stagger them a little more by starting 2 on Friday nights and the other 2-3 on Saturday mornings. The first weekend that this new external hard drive was in use, the power went out. Both the hard drive and the computer were plugged into the UPS, so they didn't lose power, but the backup jobs did fail with errors no more detailed than "An error has occurred and the backup did not complete." All of the computers being backed up were also plugged into a UPS. It's worth noting as well that Jenkins never lost connection with any of these machines during that time, and there were no other indications of network loss.

I ran chkdsk on the drive, and it reported a plethora of bad sectors. I ran it again with the /r and /f flags, but this failed with read errors saying "The disk does not have enough space to replace bad clusters". We decided to wipe and reformat it to NTFS again, to start fresh. chkdsk reported no errors, and our backups ran fine for about 3 weeks.

The power went out again, late on a Thursday night before the backups were triggered to run. The backups failed with the error "The request could not be performed because of an I/O device error". Checking event logs, Windows Logs > System was filled with Event 140 errors from "Ntfs (Microsoft-Windows-Ntfs)" stating "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, \Device\HardDiskVolume7. (The I/O device reported an I/O error)." Chkdsk reported the file system to be RAW, and could not scan the hard drive. HDDScan detected the drive, but not the size and could not run any SMART scans or read/write tests. Interestingly, Disk Management showed the drive as active, healthy, and NTFS, even after rescanning and refreshing.

The first thing I always do when troubleshooting is reboot, so I rebooted the computer that the external hard drive was connected to, and surprisingly the disk went back to normal. chkdsk detected the NTFS file system, completed the scan, and reported no bad sectors or other errors. HDDScan recognized the drive and ran full read/write tests with no worrisome issues, and the SMART scan was fine except for "UltraDMA CRC Errors" with a Value/Worst of "200". Backups worked fine, yet again.

Two weeks later now... there have been no more instances of power outages of any kind. Backups have been going fine for two weeks, but then failed again over this weekend with the same "I/O device" errors from above. Chkdsk reported a RAW file system, HDDScan couldn't scan the drive, etc. Once again, a reboot has restored functionality to the external hard drive as before, and all disk checks have been clean of errors.

What in the world is going on here? I have not seen behavior like this from any external hard drive before. I'm not sure if this unit has just been defective from the beginning, and so much time has passed since this first started that it's getting harder to determine the cause. Research keeps pointing my attention to the instances of power loss and the possibility of a faulty cable, but our office is shut down and inaccessible right now due to the quarantine restrictions, so I can't test swapping in a new one. Has anyone seen symptoms similar to this?
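
Since the failures are intermittent and only noticed when a backup fails, one thing that might help narrow down the timing is a small probe that periodically writes and reads back a file on the drive and logs the result. The following is only a rough sketch under assumptions: the function names `probe_volume` and `log_probe` are made up for illustration, and the read-back may be served from the OS cache, so a passing probe does not prove the disk itself is healthy.

```python
import os
import time
import uuid


def probe_volume(root, payload=b"probe"):
    """Write, flush, read back, and delete a small file under root.

    Returns (True, "ok") on success, or (False, reason) on any OSError
    or content mismatch. Hypothetical helper, not part of any tool
    mentioned in this thread.
    """
    path = os.path.join(root, f".probe-{uuid.uuid4().hex}.tmp")
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the write to the device
        with open(path, "rb") as f:
            data = f.read()  # note: may be served from the OS cache
        os.remove(path)
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    if data != payload:
        return False, "read-back mismatch"
    return True, "ok"


def log_probe(root, log_path):
    """Append one timestamped probe result to log_path and return the line."""
    ok, detail = probe_volume(root)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} {root} {'OK' if ok else 'FAIL'} {detail}"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

Run from a scheduled task every few minutes against the drive root (for example `log_probe("G:\\", ...)` with the log file kept on a different disk), the first FAIL timestamp could then be compared against the Event 140 entries and any UPS or power events.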
 
The issue may not be with the external drive but with the PC it is plugged into, since a reboot of the PC seems to fix everything. Do you have another PC you can plug it into to test that? Have you also tried rebooting just the external drive to see if that corrects the issue, or only the PC?

Not sure if it works with Thunderbolt drives, but look into HD Sentinel for checking SMART status. I put it on all my clients' servers and have it email me if anything changes.
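
To illustrate the "alert me if anything changes" idea, here is a minimal sketch that compares two snapshots of SMART raw values. It is not how any particular monitoring tool works; the function name `diff_smart` and the sample numbers are hypothetical, and the snapshots are assumed to be plain dicts mapping attribute ID to raw value, however you manage to export them. Attribute 199 is the UltraDMA CRC error count, and a raw value that keeps rising is commonly associated with cable or connection problems rather than the platters, though that is a rule of thumb and not a diagnosis.

```python
def diff_smart(previous, current):
    """Return {attr_id: (old_raw, new_raw)} for raw values that changed.

    Attributes missing from the previous snapshot are ignored, since
    there is nothing to compare them against.
    """
    changes = {}
    for attr_id, new_raw in current.items():
        old_raw = previous.get(attr_id)
        if old_raw is not None and old_raw != new_raw:
            changes[attr_id] = (old_raw, new_raw)
    return changes


if __name__ == "__main__":
    # Hypothetical sample values, not readings from the drive in this thread.
    last_week = {5: 0, 197: 0, 199: 0}
    this_week = {5: 0, 197: 0, 199: 12}
    print(diff_smart(last_week, this_week))  # prints {199: (0, 12)}
```

The point of tracking the raw value is that the normalized Value/Worst columns can stay at their starting level even while the raw counter climbs.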
 
I did not think the fault would lie with the computer it is plugged into. We have an identical 20TB G-RAID Thunderbolt 2 external hard drive also plugged into the same computer, and we have never had a single error with that drive (after over 2 1/2 years of use). For this reason, I did not think it was computer-specific but rather drive- or cable-specific.

I have not "rebooted" the drive itself. As we cannot access our office to reconnect it if something goes wrong, we have avoided all forms of ejecting or dismounting.

I'll look into HD Sentinel.