Question Should I worry about this drive? (12TB Western Digital Gold archive drive with a lot of corrupted files and folders, but all tests claim it's healthy)

MaxT2

Commendable
Apr 14, 2021
82
6
1,545
Context
(Note for reading: I often switch the words "drive" and "disk", consider them equal in this post.)
Hardware
I have large archive drives; most are Western Digital Gold drives of at least 10 TB (some are a bit different).
Everything is in .RAR archive files (because they are testable and repairable).
Each volume has at least one "twin" that is, most of the time, stored in a different building.
Most of the time, the drives are stored outside of any computer.
I swap them directly into a (non-running) computer using what I call SATA slots, also known as hot-swap bays among other names. They're the same as the "doors" of a NAS, but in the front of a computer case. (Something like this: https://media.startech.com/cms/products/main/hsb100satbk.main.jpg )
Synchronisation is done by hand. I don't use RAID because, to my understanding, it would make the drives depend on each other: both would have to stay together in the computer, which would rule out swapping them and storing one of them in a chest in a different building, and it would bring other downsides. (To my understanding, RAID redundancy is a system resilience solution, not a data archive/backup solution. And I do not want to depend on any archive/backup software.)

Software
(Windows 10, NTFS)

Things I usually do when testing the drives:
  • [optional] Check S.M.A.R.T. info with CrystalDiskInfo
  • [optional] Check S.M.A.R.T. info with Western Digital Dashboard
  • [optional] Run short test with Western Digital Dashboard.
  • [optional] Run long test with Western Digital Dashboard.
  • Test all the .RAR files
  • [optional] Repair broken .RAR files or copy pristine file from the "twin" drive.
  • If I have both copies at hand, test disk against twin using WinMerge.
I have been using this solution for years and, other than occasionally having to repair a .RAR file here and there, I never had real issues... (Only one 8 TB Western Digital Red drive started failing and was replaced.)

What happened
Define two disks: the "disk being tested" and its "twin disk".
Both are 12 TB Western Digital Gold drives, each filled with approximately 7 TB.

  • Tested all .RAR files on a drive. (It may be useful to mention: I test files using a homemade C# archive-testing UI that usually works well. However, on purpose, it skips (silently, I think) folders that cannot be accessed; otherwise it would have problems with system folders like "System Volume Information". So, reading further, you will understand that it may have missed some folders that it could not access. What is certain is that it doesn't write to or modify the files themselves: it calls the official rar.exe or unrar.exe and only writes text logs to another drive.)
  • All tested files were OK.
  • WinMerge detected plenty of differences: folders that were present on the twin drive but not on the drive I was testing.
  • I checked, and these folders existed on both drives, BUT on the drive that I was testing, opening them gave a message like "This folder cannot be accessed but is [something] or corrupted..." (not the exact message; I didn't take a screenshot).
  • I removed the twin copy.
  • I started running chkdsk /r on the drive being tested.
  • I saw that the ETA was 306 hours or so and stopped chkdsk; I thought I should try shorter solutions first.
  • I checked S.M.A.R.T. info with Western Digital Dashboard: it was reported as "Excellent".
  • I ran the short test with Western Digital Dashboard: no problem found.
  • I ran the long test with Western Digital Dashboard: no problem found.
  • I checked S.M.A.R.T. info with CrystalDiskInfo too: Good.
  • I re-inserted the twin copy.
  • Compared the tested drive and its twin with WinMerge again: not perfect, but much better, with much less red in the list. And instead of "corrupted folders", it seems all folders can be accessed. But in a few places, WinMerge detected that some of the .RAR archives differ from one disk to the other.
  • Checked those archives that were different (but shouldn't be) (usually very large files): I found that, on the disk being tested, those files were zero bytes. They are fine on the "twin disk".
  • So I am now repairing the archives by copying a few files from the "twin disk" to the "disk being tested".
(There was no computer restart in the middle of an operation or anything like this.)
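
The comparison steps above can be sketched in a few lines of Python. This is only a minimal sketch and not what WinMerge does internally: it compares two trees by relative path and size, flags files that are zero bytes on the tested side but not on the twin, and records folders it could not read instead of skipping them silently. The names `list_files` and `compare_trees` are made up for this illustration.

```python
import os


def list_files(root, skipped):
    """Map relative path -> size for every file under root.

    Folders or files that cannot be read are appended to `skipped`
    instead of being ignored silently.
    """
    sizes = {}

    def on_error(err):
        # os.walk reports unreadable folders here.
        skipped.append(err.filename)

    for folder, _dirs, files in os.walk(root, onerror=on_error):
        for name in files:
            full = os.path.join(folder, name)
            try:
                sizes[os.path.relpath(full, root)] = os.path.getsize(full)
            except OSError:
                skipped.append(full)
    return sizes


def compare_trees(tested_root, twin_root):
    """Compare two drive roots by relative path and file size."""
    skipped = []
    tested = list_files(tested_root, skipped)
    twin = list_files(twin_root, skipped)
    common = set(tested) & set(twin)
    return {
        "missing_on_tested": sorted(set(twin) - set(tested)),
        "missing_on_twin": sorted(set(tested) - set(twin)),
        "size_mismatch": sorted(p for p in common if tested[p] != twin[p]),
        "zero_byte_on_tested": sorted(
            p for p in common if tested[p] == 0 and twin[p] > 0
        ),
        "unreadable": sorted(skipped),
    }
```

A call such as `compare_trees("E:\\", "F:\\")` (drive letters are placeholders) returns a dictionary of lists. A size-only comparison is fast but cannot detect files whose content changed while the size stayed the same.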

Edit: I just noticed that the "drive being tested" has a "bootTel.dat" file at its root, created this morning (I need to go and can't investigate what it is right now).


What I still plan to do
- I will still run a test of all the .RAR files on the "disk being tested" and I expect/hope to find no further errors (this may take something like 24h so I won't be 100% sure before tomorrow).

Edit: I actually started this last test and it seems MANY .RAR files are broken. (Current result: 27 good, 11 bad so far, out of a total of 425 files. I expect that these are files that were inside the folders that were initially "corrupted"... This is a gigantic number, as I would usually expect somewhere between 0 and 2 bad files on a whole drive, and more often 0.) This is despite all the files seeming OK according to WinMerge (fast compare) and the drive being good according to Western Digital Dashboard.
So I think I'm going to re-test the other drive (the "twin" one), reformat the tested drive and re-copy everything.



My questions
  • What do you think happened? (I would guess something messed up the drive's file table, but I don't know much about file tables.)
  • What partly repaired it? (Maybe I shouldn't have interrupted chkdsk /r?)
  • Should I worry about the drive? Should I replace it? Should I re-format it? Or should I consider this an isolated software/system incident (since all tests from Western Digital Dashboard were OK)?
 
What is the bootTel.dat file in Windows?

https://superuser.com/questions/1341021/what-is-the-boottel-dat-file-in-windows

I think that running CHKDSK in repair mode is not a good idea. Microsoft cares more about the consistency of your file system than it does about your data. This means that your "bad" files may have been scrubbed to preserve the consistency of your file system. In any case, CHKDSK has now probably masked the original source of your problems.

My approach would have been to run CHKDSK in read-only mode. That would have identified those aspects of the NTFS metadata that were in error. You could then use a free disk editor such as DMDE to investigate further.
 

MaxT2

Thanks. I read that, and I also read that, as a user, not much can be done with this .dat file (I haven't investigated it further). So that means that this morning, the computer ran tests when I restarted it. This doesn't give me much information.
And the question remains: should I still trust this drive and consider this a software incident, or replace it...
 

MaxT2

Oh OK. I think that after clicking your link I forgot to read the rest of your first reply.
I don't know tools like DMDE; I should check what it does.

But if I understand right, it is now too late to do that.
The questions remain: should I trust the drive? Should I reformat it? Should I replace it?...
 

MaxT2

Thanks.
Well, I have owned that Western Digital Gold 12TB (WD121KRYZ-01W0RB0 *) drive since 2018 or 2019, so I guess that if it corrupted data I would already have noticed, and my other drives of the same model don't have this issue (maybe just one unit?). So I think/hope that I can exclude this option for this drive.

* From a quick Internet search, I don't see obvious reports of data corruption with this model.

I will try a memory test on that computer tomorrow...
Is GSmartControl better than CrystalDiskInfo? (From a first look it seems to have more features. Though for tests, I usually rely on the hard drive manufacturer's software.)
 
The hard drive manufacturer's software is usually the least informative. Seagate's SeaTools, for example, won't tell you how many bad sectors a drive has, even if there are thousands of them. The drive will continue to receive a passing grade until the number of bad sectors exceeds Seagate's threshold. WD's DataLifeGuard isn't much better.
 

MaxT2

Oh OK. I thought that, since they manufactured it, they might be better at detecting errors and fixing their specific drives... I had noticed, though, that their tools are less informative.

Then which software would you recommend for testing drives? GSmartControl?

(Though in my case, I usually rely primarily on testing the .rar archives and on WinMerge, assuming that if this data "layer" is alright on at least one drive of a twin pair, that is what matters most to me, as I can always transfer to other drives. I do other drive tests less often.)
 
Every other SMART tool will report the raw data, including the number of bad sectors. The "short" and "extended" tests are included in the ATA standard -- they are not manufacturer specific.

HDDScan and Victoria for Windows are just two tools which will identify "slow sectors" during a surface scan. No manufacturer's tool will do this.
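
To illustrate what "raw data" means here, the sketch below extracts the raw values of the sector-related attributes from the JSON report that smartctl (the command-line tool behind GSmartControl) can produce. It assumes smartmontools 7.0 or later, which added the `--json` option. The helper names and the device path in the usage note are placeholders, not part of any tool.

```python
import json
import subprocess

# SMART attribute IDs commonly associated with bad or unstable sectors.
SECTOR_ATTRIBUTES = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def sector_counts(report):
    """Return {attribute name: raw value} from a parsed smartctl JSON report."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    counts = {}
    for attr in table:
        if attr.get("id") in SECTOR_ATTRIBUTES:
            counts[SECTOR_ATTRIBUTES[attr["id"]]] = attr["raw"]["value"]
    return counts


def read_smart(device):
    """Run `smartctl -A --json <device>` and parse its output.

    Usually needs administrator rights. smartctl's exit status is a
    bitmask, so a non-zero status is not treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-A", "--json", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)
```

For example, `sector_counts(read_smart("/dev/sda"))` would return the three raw counters for that device. Any non-zero raw value is worth watching even when the overall health status still reads "Good".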
 

MaxT2

I ran Windows Memory Diagnostic on the 3 computers* that can potentially hold these archive drives. I let the "Standard" tests run; no error was found on any computer.

(* If 1 = main/recent computer, 2 = older computer, 3 = oldest computer, the drive problem was found on computer 2. The drive was likely used in computer 1 at some point, and probably never in computer 3.
Not sure why, but the test was much slower on computer 2 than on computer 3... maybe computer 3 has much less RAM, but I must say I don't remember.)

Next steps (later this afternoon):
1- I think I'll copy the pristine copies of the corrupted files from the pristine "twin" drive to the problematic drive, so that I keep having 2 copies of everything.
2- Then I think I'll do more tests on the drive that had the problems, trying these tools: GSmartControl, HDDScan, Victoria...
 

MaxT2

  • Various tests with Victoria show nothing red (nothing actually worse than a few "light grey or so" blocks, the second status from the top).
  • HDDScan short SMART data: all green.
  • A quick test with HDDScan showed no issues.
  • Another test with HDDScan, which I thought would be a short one but ended up taking about 12 hours (I just don't know all the differences between the tests), showed no errors. It showed some "worse" results than Victoria, but still nothing in red.
  • WinMerge "Data, name, size" comparison shows everything OK.
- Now running a "Quick data" test with WinMerge, but it seems it'll take quite long (I thought that there used to be a "test first and last 15 kb of each file" option or something like this, but it seems that changed).

So, so far, there is no sign that the disk is bad anywhere and no hint about what happened.
 
I'm a little unclear as to how you are determining that certain RAR files are bad. Are you saying that some of the files within the RAR archive are bad or zeroed?

Can you show us the script you are using to test these RARs?

Have you examined your RARs with a hex editor, eg HxD (freeware)? You can use HxD to perform a byte-for-byte comparison of two RARs.
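
For multi-gigabyte archives, the same byte-for-byte comparison can also be scripted. The following is a minimal sketch, unrelated to how HxD works internally; the function name `first_difference` is made up for this example. It reads both files in chunks and reports the offset of the first differing byte.

```python
def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the offset returned is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact position inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No mismatch in the overlap: one file ended early.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a difference
            offset += len(a)
```

An offset of 0 against a zero-byte file, or a difference that starts at a cluster-aligned offset (a multiple of 4096 on this volume), would be a useful clue about how the damage happened.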
 

MaxT2

I cannot really show the "script": it's a complete Visual Studio C# project, and it's not code I want to share for now, but I can explain it further:
Usual tests: My software calls the test function of rar.exe or unrar.exe on the command line, receives the command-line output (the text that the user reads), and from there it can tell if a .rar file is broken (it can actually test other archive formats, and also *.flac). This usual use case works perfectly fine (otherwise I would end up with 100% false negatives or something... and I have not noticed false positives or false negatives). (By the way, if it reports a bad file, I also open it in WinRAR afterwards for confirmation.)

Also, it lists files that are not testable formats (like a few *.txt files), but currently it silently skips folders that it cannot access.
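
The general shape of such a tester, with the silent skip replaced by an explicit list of unreadable folders, might look like the sketch below. It is not the C# program, only an illustrative Python equivalent under stated assumptions: it relies on the exit status of `unrar t` (0 means no errors) rather than parsing the text output, and it assumes `unrar` is on the PATH. The names `run_unrar_test` and `check_archives` are hypothetical.

```python
import os
import subprocess


def run_unrar_test(path, unrar="unrar"):
    """Return True if `unrar t <path>` exits with status 0 (no errors)."""
    result = subprocess.run(
        [unrar, "t", path],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0


def check_archives(root, tester=run_unrar_test):
    """Test every .rar under root; record folders that could not be read."""
    good, bad, skipped = [], [], []

    def on_error(err):
        # os.walk calls this for each folder it cannot list,
        # so the folder is recorded instead of silently skipped.
        skipped.append(err.filename)

    for folder, _dirs, files in os.walk(root, onerror=on_error):
        for name in sorted(files):
            if name.lower().endswith(".rar"):
                path = os.path.join(folder, name)
                (good if tester(path) else bad).append(path)
    return good, bad, skipped
```

With this shape, a run that reports "all files OK" together with a non-empty `skipped` list would have pointed at the inaccessible folders on the first pass.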

The unusual scenario that happened:
  • Test .RAR files: no error.
  • WinMerge "data, size, path" comparison: many folders existed on the pristine copy and not on the tested copy.
  • I realised these were not accessible in Windows Explorer...
  • I started running chkdsk and different tests, then I created this thread.
  • I re-ran the RAR archive tester, and realised that the folders could now be accessed, but the files in them were broken. (I assume these are the files that were inside the previously inaccessible folders; I can only assume because I did not list them back then.)
  • So what I called bad .RAR files in my recent posts seem to be very different from the pristine files, and from what I would usually get when a .RAR file has a little corrupted data and can usually be repaired from its recovery record... These bad .RAR files cannot even be opened by WinRAR at all. I did not write down what WinRAR said, but it was as if these files were just random data, though the sizes were similar. (Again, there are parts of this that I can only assume, as I haven't made bit-for-bit comparisons of these files, and I have now copied the pristine copies from the "twin" drive as I wanted to keep having two copies of everything.)
My guess so far is that something messed up the file system, and that chkdsk or something else restored the file system but was unable to match the files to the data that they initially contained.
 

MaxT2

WinMerge "Quick Contents" test finished. It took somewhere between 12 and 24 hours.
At the end it detected that one of the files differed from one drive to the other. It was one of the files that were previously broken, and I think I remember already copying it from the pristine drive, but maybe I forgot it.
I corrected the file and I'll run this WinMerge test again.

Here is also the result of a read-only chkdsk I just ran. What I notice first is that it mentions "orphan" stuff:
WARNING! /F parameter not specified.
Running CHKDSK in read-only mode.

Stage 1: Examining basic file system structure ...
2816 file records processed.
File verification completed.
Phase duration (File record verification): 1.88 seconds.
0 large file records processed.
Phase duration (Orphan file record recovery): 0.16 milliseconds.
0 bad file records processed.
Phase duration (Bad file record checking): 0.15 milliseconds.

Stage 2: Examining file name linkage ...
222 reparse records processed.
4184 index entries processed.
Index verification completed.
Phase duration (Index verification): 766.47 milliseconds.
0 unindexed files scanned.
Phase duration (Orphan reconnection): 2.39 milliseconds.
0 unindexed files recovered to lost and found.
Phase duration (Orphan recovery to lost and found): 0.78 milliseconds.
222 reparse records processed.
Phase duration (Reparse point and Object ID verification): 2.48 milliseconds.

Stage 3: Examining security descriptors ...
Security descriptor verification completed.
Phase duration (Security descriptor verification): 15.61 milliseconds.
684 data files processed.
Phase duration (Data attribute verification): 1.24 milliseconds.

Windows has scanned the file system and found no problems.
No further action is required.

11444093 MB total disk space.
7840469 MB in 1935 files.
1204 KB in 686 indexes.
0 KB in bad sectors.
426399 KB in use by the system.
65536 KB occupied by the log file.
3603206 MB available on disk.

4096 bytes in each allocation unit.
2929688063 total allocation units on disk.
922420903 allocation units available on disk.
Total duration: 2.67 seconds (2676 ms).
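
As a sanity check on the summary figures, the allocation-unit counts can be converted back to MB. This is a small sketch that uses only numbers printed in the report above and assumes that chkdsk's "MB" means 2^20 bytes and "KB" means 2^10 bytes (the arithmetic bears this out).

```python
ALLOCATION_UNIT = 4096          # bytes in each allocation unit
TOTAL_UNITS = 2929688063        # total allocation units on disk
FREE_UNITS = 922420903          # allocation units available on disk
MB = 1024 * 1024                # assumed meaning of "MB" in the report

total_mb = TOTAL_UNITS * ALLOCATION_UNIT // MB
free_mb = FREE_UNITS * ALLOCATION_UNIT // MB
used_mb = (TOTAL_UNITS - FREE_UNITS) * ALLOCATION_UNIT // MB

# Space the report attributes to files, indexes and the system, in MB.
accounted_mb = 7840469 + (1204 + 426399) / 1024

print(total_mb)   # 11444093, matches "total disk space"
print(free_mb)    # 3603206, matches "available on disk"
print(used_mb)    # 7840887
print(abs(used_mb - accounted_mb) < 1)  # True
```

The totals agree with each other to within 1 MB, so the file system's own bookkeeping is consistent after the repair. That says nothing about whether the contents of individual files are intact.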
 

MaxT2

When I search Google for "hard drive file table broken but disk has no flaws" (better keyword suggestions?), it is often suggested to reformat the drive (though the results never seem to describe the precise case I have here).
I may do that...