Question Can a file be corrupted while having the same size?

Status
Not open for further replies.

ditrate

Great
Sep 4, 2022
122
1
85
You record a checksum taken from the known uncorrupted version, then compare it to a checksum of the suspected corrupted version, computed with the very same program used to create the first. If they don't match, there have been one or more changes in the second copy. If they match, the files are almost certainly identical.
No, I'm not doing anything that deep. I'm only storing the original size in bytes (in logs).
 
You could use Windows' built-in PowerShell from the CMD prompt to compute the checksums of your files. The following session computes the MD5 hashes of all files matching *.* in the current directory and writes the results to a log file (\temp\MD5_sums.txt).

Code:
C:\>powershell
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Try the new cross-platform PowerShell https://aka.ms/pscore6

PS C:\> Get-FileHash *.* -Algorithm MD5 > \temp\MD5_sums.txt
PS C:\> exit

C:\>

Code:
Algorithm       Hash                                                                   Path                                                                                                                                                
---------       ----                                                                   ----                                                                                                                                                
MD5             D9EBEC6668A6092FCBD1713C347AA5E0                                       C:\autoexec.bat                                                                                                                                    
MD5             ED4FC5980BD8B1AD869FF725C7776338                                       C:\config.sys                                                                                                                                      
MD5             2F25C43273A10EE038518BD68B177966                                       C:\exelist.txt                                                                                                                                      
MD5             1DE223B9ED10E519311D2B9BC48FB651                                       C:\junk.txt                                                                                                                                        
MD5             9C78D40D5758BC788AF84BD5E09D1465                                       C:\Log.txt                                                                                                                                          
MD5             949A9099694308AA2099DABB2855BF91                                       C:\Reflect_Install.log
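
For anyone without PowerShell at hand, the same idea can be sketched in Python. This is only an illustration, not what the session above runs: `hash_file`, `write_manifest`, and the "digest  filename" manifest layout are made-up names and assumptions, and SHA-256 is swapped in for MD5 (either is fine for spotting accidental corruption).

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory, manifest_path):
    """Write one 'digest  filename' line per regular file directly inside directory."""
    directory = Path(directory)
    manifest_path = Path(manifest_path)
    lines = []
    for entry in sorted(directory.iterdir()):
        # Skip subdirectories and the manifest file itself.
        if entry.is_file() and entry.resolve() != manifest_path.resolve():
            lines.append(f"{hash_file(entry)}  {entry.name}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Keep the manifest somewhere other than the disk being checked if you can, so that the record does not get damaged along with the files it describes.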
 
No, I'm not doing anything that deep. I'm only storing the original size in bytes (in logs).
You need to perform some sort of checksum method (MD5, SHA, or whatever you want) to verify the integrity of any file.

The size means nothing, especially when the corruption doesn't make the data obviously invalid. For example, if you have a plaintext log file with a list of numbers, corruption can cause one of those numbers to change. How would you know whether that number was just a freak blip from whatever produced the log or the result of data corruption?
 

ditrate

Thanks. So, if the size in bytes is identical, it doesn't matter?
 
Absolutely NOT. As an example, by your method, a file containing financial data is recorded as 100 bytes. Now I come along and replace that file with another file containing exactly 100 bytes of completely random garbage. I use the same filename and put it in the same place with the same size (100 bytes), and your method will conclude that the file is completely unchanged. Checksums catch this.
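
To make that concrete, here is a rough Python sketch of the swap described above. The file contents are invented, `same_size` and `same_hash` are hypothetical helper names, and SHA-256 is just one reasonable choice of checksum; the ones mentioned in this thread behave the same way for this purpose.

```python
import hashlib
import os


def same_size(original: bytes, candidate: bytes) -> bool:
    """The size-only check: compares lengths and nothing else."""
    return len(original) == len(candidate)


def same_hash(original: bytes, candidate: bytes) -> bool:
    """The checksum check: compares SHA-256 digests of the contents."""
    return hashlib.sha256(original).digest() == hashlib.sha256(candidate).digest()


# 100 bytes standing in for the financial data (made-up content).
financial = b"balance=1000;".ljust(100, b" ")
# 100 bytes of random garbage to replace it with.
garbage = os.urandom(100)

print(same_size(financial, garbage))  # the size check is fooled
print(same_hash(financial, garbage))  # the hash check notices the swap
```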
 

ditrate

So I should delete everything and hash every file? Bruh, that's a bit difficult.
 
Thanks. So, if the size in bytes is identical, it doesn't matter?

Take a text file, change one character, then compute the MD5 hashes of each file. The hashes will be completely different.

Code:
C:\>echo abc > abc.txt

C:\>echo bbc > bbc.txt

C:\>powershell
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Try the new cross-platform PowerShell https://aka.ms/pscore6

PS C:\> Get-FileHash ?bc.txt -Algorithm MD5 > \temp\MD5_sums.txt
PS C:\> exit

C:\>type \temp\MD5_sums.txt

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
MD5             20ABE48E15DEAF253A389062CF29DF72                                       C:\abc.txt
MD5             7514E5791EACB3FCC134C2FEC37FF9C2                                       C:\bbc.txt
 
If you're trying to detect corruption based on file size, consider this scenario:
File           Byte 1  Byte 2  Byte 3  Byte 4  Byte 5  Byte 6  Byte 7  Byte 8
Non corrupted  00      01      02      03      04      05      06      07
Corrupted      FF      01      02      03      04      DE      AD      07

These two files are going to show up as being 8 bytes long. Now without looking at the bytes themselves, tell me, which one is corrupted?
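
Spelling out the table above in Python (a quick sketch; `differing_offsets` is a made-up helper): both byte strings report the same length, and the damaged positions only show up when the corrupted copy is compared byte by byte against the good one.

```python
good = bytes.fromhex("0001020304050607")
bad = bytes.fromhex("FF01020304DEAD07")


def differing_offsets(a, b):
    """Zero-based offsets at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(len(good), len(bad))            # 8 8
print(differing_offsets(good, bad))   # [0, 5, 6], i.e. Byte 1, Byte 6, Byte 7
```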
 

ditrate

A hash sum (MD5, SHA-1) is a must for the server, I see.
 
Also note: many file types are in fact zip files (MS Office documents, OpenDocument Format, etc.), so you can often run a consistency check on them with the same tools you would use for zip files (only the file extension differs).

So you can use the 7-Zip program to test them as zip files.
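
If you would rather script that check than click through 7-Zip, Python's standard `zipfile` module can do a similar test. A minimal sketch, assuming the file really is a zip container; `first_bad_member` is a made-up wrapper name, while `ZipFile.testzip()` is the real library call and verifies the CRC-32 stored for each member.

```python
import zipfile


def first_bad_member(archive):
    """Return the name of the first member whose CRC check fails, or None if all pass.

    archive may be a filename or a binary file-like object. If the file is not
    a readable zip container at all, zipfile.BadZipFile is raised instead.
    """
    with zipfile.ZipFile(archive) as zf:
        return zf.testzip()
```

Calling it on, say, a .docx file returns None when every member passes. Note that this only shows the archive is internally consistent; it does not tell you the contents are the same as some earlier version, which still needs a recorded checksum.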
 
And how to know?

A major point of a checksum is the key word "change". If you don't know what something was before a change, you have no way to know there was a change. Checksums are computed and written down somewhere. Then, at some later point in time, the checksum is computed again and compared to the original. How do you know the checksum has changed or is wrong? That is key to the whole question. Did you have a previous checksum? Was this a download whose checksum doesn't match the one the web site publishes? Or is this (as others mention) some sort of built-in checksum mechanism (e.g., btrfs in Linux)?

Unless you describe exactly which checksum you are talking about, and how you know that checksum is wrong, it will be very difficult to give you a useful answer. Suggestions are being given regarding operating system filesystems, compression/decompression tools, reinstalling part or all of the software, and so on. Please give a very specific description of what the checksum applies to, and how you know the checksum is wrong.
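
The "written down, then compared later" step might look roughly like this in Python. It is a sketch under assumptions: `recorded` is a hypothetical dictionary of filename to SHA-256 hex digest saved at an earlier time, and `sha256_of` and `verify` are illustrative names, not part of any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(recorded, directory):
    """Compare recorded {filename: digest} against the files now in directory.

    Returns a dict mapping each filename to "ok", "changed", or "missing".
    """
    directory = Path(directory)
    report = {}
    for name, old_digest in recorded.items():
        path = directory / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_of(path) == old_digest:
            report[name] = "ok"
        else:
            report[name] = "changed"
    return report
```

Without the `recorded` baseline there is nothing to pass in, which is exactly the point made above: a checksum computed today says nothing on its own.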

Note that in some cases a checksum can be used to reverse out the error for small errors, e.g., certain RAID systems have checksums designed to correct up to 4 bits of error. The earlier mentioned "cosmic ray" bit flip (@Tugrul_512bit) is actually very common in RAM, less common in disk drives, but not rare. Somewhere I recall seeing that at sea level there is usually 1 bit flip every three weeks for a continuously run computer, and at high altitude, e.g., in Colorado mountains, an average of one bit flip every two weeks (the cosmic rays have some absorption by atmosphere, and more atmosphere is protective; Colorado also has the highest skin cancer rates in the USA because of this). If a bit flip in RAM is written to a file (e.g., as it is unpacked and written to disk), then the disk would contain a corrupt file due to one byte being wrong.
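
The "reverse out the error" idea can be shown with a toy error-correcting code. The sketch below uses a Hamming(7,4) code, chosen only because it is small; it is not a claim about what any particular RAID system or memory controller uses, and `encode` and `decode` are illustrative names. It stores 4 data bits in 7 bits and can repair any single flipped bit.

```python
def encode(data):
    """Encode 4 data bits (a list of 0/1 values) as a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 1-based position; 0 means no error seen
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = encode([1, 0, 1, 1])
word[4] ^= 1          # simulate a single bit flip
print(decode(word))   # [1, 0, 1, 1]
```

Two flipped bits in the same codeword are beyond what this small code can repair, which is why a plain hash comparison is still needed to detect larger damage.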
 