[SOLVED] How to scan a folder for corrupted DOC/DOCX files?

Soul7aker

Honorable
Apr 26, 2014
Hello, I have over 3000 Office files (doc, docx, ppt, and xls), and I want to separate the corrupted files from the "healthy" ones.
// I'm not interested in fixing the corrupted ones.
Is there any way or software to scan the entire folder and sort the corrupted files from the rest?
 
I've not seen anything that can tell a corrupted Word file from a non-corrupted one.
 
docx/xlsx/pptx files are in fact ZIP archives, so if you can write a script to traverse your folders, and check whether these documents are valid ZIP archives, you can separate them.
Powershell would be a good way to start learning about scripting on Windows (and even Linux).
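A minimal sketch of that ZIP check, not tested against real damaged documents. The function name Test-ZipReadable and the example path are made up for illustration. It uses the .NET ZipFile class rather than Expand-Archive, since some versions of Expand-Archive refuse files that do not end in .zip, and it reads every entry in memory so nothing gets extracted to disk. It will only catch damage that breaks the archive itself.

```powershell
# Sketch only: Test-ZipReadable and the example path are hypothetical names.
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Test-ZipReadable {
    param([Parameter(Mandatory)][string]$Path)   # pass a full path
    try {
        # Throws if the archive directory cannot be read
        $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
        try {
            foreach ($entry in $zip.Entries) {
                # Decompress each entry into a null stream; nothing is written to disk
                $stream = $entry.Open()
                try { $stream.CopyTo([System.IO.Stream]::Null) }
                finally { $stream.Dispose() }
            }
        }
        finally { $zip.Dispose() }
        return $true
    }
    catch {
        # Any failure to open or decompress counts as "not readable"
        return $false
    }
}

# Example call (hypothetical file):
# Test-ZipReadable 'C:\OfficeFiles\report.docx'
```

This only applies to docx/xlsx/pptx. The older doc/xls/ppt formats are not ZIP archives, so they would always fail this test even when healthy.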
 
And I will ask: what criteria, metrics, etc. are being used to define a "corrupted" *.zip, *.doc, or other applicable file type?

Or is a failed attempt to unzip a file all that is needed?
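One possible criterion, sketched here as an assumption rather than a definition of "corrupted": check whether each file still starts with the signature its format should have. Legacy doc/xls/ppt files are OLE compound files that begin with the bytes D0 CF 11 E0 A1 B1 1A E1, while docx/xlsx/pptx are ZIP archives that begin with 50 4B 03 04. The function names below are hypothetical.

```powershell
# Sketch only: both function names are hypothetical. Pass full paths.
function Get-OfficeContainerType {
    param([Parameter(Mandatory)][string]$Path)
    $buffer = New-Object byte[] 8
    $stream = [System.IO.File]::OpenRead($Path)
    try { $read = $stream.Read($buffer, 0, 8) }
    finally { $stream.Dispose() }
    # Render the bytes that were read as text, e.g. "50-4B-03-04-..."
    $hex = [System.BitConverter]::ToString($buffer, 0, $read)
    if ($hex -eq 'D0-CF-11-E0-A1-B1-1A-E1') { return 'OLE' }
    if ($hex.StartsWith('50-4B-03-04'))     { return 'ZIP' }
    return 'Unknown'
}

function Test-OfficeSignature {
    param([Parameter(Mandatory)][string]$Path)
    # Container each extension is expected to use
    $expected = @{
        '.doc'  = 'OLE'; '.xls'  = 'OLE'; '.ppt'  = 'OLE'
        '.docx' = 'ZIP'; '.xlsx' = 'ZIP'; '.pptx' = 'ZIP'
    }
    $ext = [System.IO.Path]::GetExtension($Path).ToLowerInvariant()
    if (-not $expected.ContainsKey($ext)) { return $false }
    return (Get-OfficeContainerType $Path) -eq $expected[$ext]
}

# Example call (hypothetical file):
# Test-OfficeSignature 'C:\OfficeFiles\budget.xls'
```

A matching signature only shows that the first few bytes are intact, so treat a pass as "not obviously broken" rather than "healthy". Also, as far as I know, password-protected docx/xlsx/pptx files are stored in an OLE container, so they would show up as a mismatch here without being corrupted.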

From the proverbial "10,000 foot view" it may be possible to use PowerShell (as suggested by @Alabalcho) to open each file; if the opening fails (due to corruption or whatever other issue) and generates an error of any sort, the file is moved to an archive folder or simply deleted.

Logic being:

If file unzip = fail, then delete the file; else open, install, list the unzipped files, etc.

Very likely that a Powershell approach could be customized to specific failure errors. Overall, likely messy and cumbersome.

Could go deeper if file hashes are used.
 
General FYI:

Unzipping files via Powershell is very straightforward via the PS prompt:

Expand-Archive -Path 'C:\input.zip' -DestinationPath 'C:\output'

Reference:

https://www.shellhacks.com/windows-zip-unzip-command-line/

There are many other similar links.

Deleting files is done via Remove-Item.

As to a failed unzip, I am not sure about handling that per se.

What you would need to do is to use Powershell and attempt to unzip a zipped file you already know is corrupted.

Note the error code(s).

However, it may be even easier than that via Try-Catch

Reference:

https://www.webservertalk.com/powershell-try-catch-tutorial-guide

Sort of putting it (the above references) together:

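try
{
    # -ErrorAction Stop makes any unzip error terminating so that catch runs
    Expand-Archive -Path 'C:\input.zip' -DestinationPath 'C:\output' -ErrorAction Stop
}
catch
{
    # Only reached when the unzip failed
    Remove-Item 'C:\input.zip'
    Write-Output "Unzip failed - Zip file removed"
}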


Note: above script not tested. Just thinking out loud......

Plus the script would need to be inside some "loop" that parses through all of the target 3000 files.
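For the loop part, a rough sketch that only reports and never deletes or moves anything. Get-ZipScanReport and the paths in the usage comment are hypothetical names. It walks a folder tree, tries to open each docx/xlsx/pptx/zip file as an archive, and returns one row per file. Opening the archive reads only its directory, so this is a quick first pass and will miss damage inside individual entries.

```powershell
# Sketch only: Get-ZipScanReport and the paths below are hypothetical.
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Get-ZipScanReport {
    param([Parameter(Mandatory)][string]$Folder)
    Get-ChildItem -LiteralPath $Folder -Recurse -File |
        Where-Object { $_.Extension -in '.docx', '.xlsx', '.pptx', '.zip' } |
        ForEach-Object {
            $file = $_            # keep the file; $_ means the error inside catch
            $status = 'OK'
            $detail = ''
            try {
                $zip = [System.IO.Compression.ZipFile]::OpenRead($file.FullName)
                $zip.Dispose()
            }
            catch {
                $status = 'Failed'
                $detail = $_.Exception.Message
            }
            # One report row per file
            [pscustomobject]@{ File = $file.FullName; Status = $status; Detail = $detail }
        }
}

# Example usage (hypothetical paths):
# Get-ZipScanReport -Folder 'C:\OfficeFiles' |
#     Export-Csv -Path 'C:\scan-report.csv' -NoTypeInformation
```

Writing a report first makes it possible to look at the error messages before deciding what to do with the failed files.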

Another consideration is that if any given file does unzip successfully, you may end up filling C:\output. You would probably need to delete the contents of C:\output between unzip (Expand-Archive) attempts.

Overall appears straightforward to test and try.

Caveat: Just do so in a safe test environment where, if things go astray, no real data will be lost.

I still have lots to learn about Powershell. :)
 
Solution
I'd not have it delete the files while still trying to figure out the script. Instead, just have it move them to a separate folder.

If you run the practice script on a copy of the files, then I guess it would be OK, since you still have the main copy to start over with.
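Something along these lines would move instead of delete. It is a sketch under assumptions: Move-UnreadableZipFile and the folder names in the usage comment are invented for the example, and "unreadable" here just means the file cannot be opened as a ZIP archive.

```powershell
# Sketch only: the function name and folder names are hypothetical.
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Move-UnreadableZipFile {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)][string]$Folder,
        [Parameter(Mandatory)][string]$Quarantine
    )
    if (-not (Test-Path -LiteralPath $Quarantine)) {
        New-Item -ItemType Directory -Path $Quarantine | Out-Null
    }
    # Top level only, so same-named files from subfolders cannot collide
    $files = Get-ChildItem -LiteralPath $Folder -File |
        Where-Object { $_.Extension -in '.docx', '.xlsx', '.pptx' }
    foreach ($file in $files) {
        try {
            $zip = [System.IO.Compression.ZipFile]::OpenRead($file.FullName)
            $zip.Dispose()
        }
        catch {
            # Move, never delete, so a script mistake is easy to undo
            Move-Item -LiteralPath $file.FullName -Destination $Quarantine
        }
    }
}

# Example usage on a copy of the data (hypothetical paths); -WhatIf only
# prints what would be moved:
# Move-UnreadableZipFile -Folder 'C:\OfficeFiles-copy' -Quarantine 'C:\Suspect' -WhatIf
```

Running it with -WhatIf first shows what would be moved without touching anything. Note that a file that fails to open for some other reason (for example, because another program has it locked) would also be treated as unreadable here.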


But as noted above, you really do need to define what a "corrupt" file actually means to you, so you can be very specific about what you are doing in the script. It can be powerful, but it does not read your mind and will do what you tell it to do, whether it's what you actually meant or not!!
 
But - but - is corruption of the ZIP archive the only way for an MS Office file to be corrupted? If not, then a batch ZIP consistency check won't guarantee that all the files are free of corruption.
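One step beyond the plain ZIP check, as a sketch and not a guarantee: a docx/xlsx/pptx package must contain a part named [Content_Types].xml, and its .xml and .rels parts should parse as XML. The function name Test-OoxmlStructure is hypothetical. This can catch a file that is a valid ZIP with mangled contents, but it still cannot tell whether text or slides are missing.

```powershell
# Sketch only: Test-OoxmlStructure is a hypothetical name. Pass a full path.
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Test-OoxmlStructure {
    param([Parameter(Mandatory)][string]$Path)
    try {
        $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
        try {
            # Every Office Open XML package has this part at the archive root
            if ($null -eq $zip.GetEntry('[Content_Types].xml')) { return $false }
            foreach ($entry in $zip.Entries) {
                if ($entry.FullName -notmatch '\.(xml|rels)$') { continue }
                $stream = $entry.Open()
                try {
                    # Stream through the part; malformed XML throws
                    $reader = [System.Xml.XmlReader]::Create($stream)
                    try { while ($reader.Read()) { } }
                    finally { $reader.Dispose() }
                }
                finally { $stream.Dispose() }
            }
        }
        finally { $zip.Dispose() }
        return $true
    }
    catch {
        # Unreadable archive or malformed XML part
        return $false
    }
}

# Example call (hypothetical file):
# Test-OoxmlStructure 'C:\OfficeFiles\slides.pptx'
```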

Also - what defines a corrupted file? If it means some content is missing, chances are that even MS Office itself won't detect (i.e. put up an error message on opening) a less than complete document file.