A small Linux app that searches the supplied directories, compares files using SHA-256 checksums, and produces CSV reports of duplicate and unique files that can be used with spreadsheets or scripts.
This repository contains only the x86-64 executable. For the ARM version, go to duplicateFF_for_ARM_aarch64
duplicateFF was created from the bash script duplicate_FF (private GitHub repository) using shc.
duplicateFF requires a Linux bash environment to run. It will run in Microsoft WSL2 (Linux), but will not run in MSYS2, Git Bash, or Cygwin environments. An executable created with the shc utility always requires bash. More: GitHub shc
- At least one directory to search is required; many directories can be compared.
- Searches can cover only the current directory, all subdirectories, or any subdirectory depth required.
- File names and maximum file sizes can be used as filters to narrow searches and save time.
- Reports are in CSV format and can be imported into a spreadsheet.
Directories with the name '$RECYCLE.BIN' are ignored. Linux sees some MS Windows directories as executable only; a user or app can go into them but cannot read them. If the Windows "executable only" directory is user accessible, it can easily be corrected by responding to "You don't currently have permission to access this folder". If the directory is not user accessible, then it is probably a system directory that is not worth checking for duplicate files.
Linux hidden files are ignored; hidden directories are processed.
For moving or removing duplicate files, there are suggestions in 'How to delete duplicate files', which uses the duplicateFF_KeepCopy and duplicateFF_RemoveALL scripts in this repository.
Foreign Language Characters: If Microsoft Excel is the default application for CSV files, Microsoft Excel will not display foreign language characters correctly. To fix this problem, change the default app for CSV files to Notepad or WordPad and manually import into Excel; alternatively, rename the *.csv file to *.txt and manually import into Microsoft Excel. Do not attempt to use 'Open with' and select Excel; it always has to be an Import.
duplicateFF -l <search level> [-f <filter>] [-k|K|m|M|g|G <size>] -s <source directory> ... -s <source directory>
Check all mp4 files smaller than 300 MiB in Fred's downloads and ./video; all subdirectories will be processed:
duplicateFF -f '.mp4' -m 300 -s './video/' -s '/home/fred/downloads' -l 0
Check all image (jpg) files in my home directory, ignoring directory paths deeper than three directories, i.e. only search three levels deep:
duplicateFF -l 3 -f '.jpg' -s .
Note: -s . could be replaced with -s $PWD or the full path. The search filter will process both .jpg and .JPG files.
Inputs for 'filter', 'source directory', and 'output directory' should be enclosed in single or double quotes; otherwise any names containing white space will not be processed.
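For example, a source directory containing spaces must be quoted (the path below is only illustrative):
duplicateFF -l 0 -f '.mkv' -s '/home/fred/My Videos'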
-l Search level
- -l 0 Process the current directory and all subdirectories.
- -l 1 Process only the current directory.
- -l 2 Process the current directory and one level below.
- -l n Process n levels deep in the directory structure, including the current directory (see the example below).
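For example, to process a directory (illustrative path) and one level below it:
duplicateFF -l 2 -s '/home/fred/documents'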
-f File name filter
Optional case-insensitive filter; it filters the source by part of, or the whole of, a file name. This option can only be used once.
-s Directories to process
One or many directories can be entered; each must be preceded by -s.
Maximum file size filter
The maximum file size filter is optional; the default is 20 GiB. Ignoring large files can save time, as shown in the example after the unit options below.
-k or -K kilobytes (KiB)
-m or -M megabytes (MiB)
-g or -G gigabytes (GiB)
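For example, to ignore files larger than 2 GiB while checking an illustrative backup directory and all of its subdirectories:
duplicateFF -l 0 -g 2 -s '/mnt/backup'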
Additional Notes
- The default output directory is created in the directory from which the script is run.
- Temporary files are created in the directory from which the script is run; all are removed or moved on script termination.
- All hidden (dot) files are ignored; hidden directories are processed.
- The 'filter' and 'source directory' inputs require single or double quotes if they contain spaces.
- Filtering example: -f 'rar' will pick up both the word 'LIBRARY' and the suffix '.rar', because case is ignored.
- Only the last instance of the -f filter is used; the rest are ignored.
- Only the last instance of the file size filter is used; the rest are ignored.
- Files with identical names but differing checksums are considered different files; file contents determine whether they are copies.
- Files with different contents and the same SHA-256 checksum (a hash collision) are theoretically possible, but extremely unlikely.
- The app is designed to be thorough, not for speed.
- Windows file systems mounted on Linux occasionally contain odd entries that cannot be processed.
The output directory is created in the current directory with the name duplicate_chk_<date-time>, where date-time = yymmdd-HHMMSS. The output directory contains:
- duplicate_FILES1_<date-time>.csv
- duplicate_FILES2_<date-time>.csv
- all_files_<date-time>.csv
- unique_files_<date-time>.csv
- log_<date-time>.txt
duplicate_FILES1_<date-time>.csv Format – One file per row
CSV Columns
- sha256 checksum
- fully pathed file name
- full path of containing directory
- file size in KiB
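An illustrative row (the checksum, paths, and size below are invented; the exact field quoting may differ):
3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b,"/home/fred/video/holiday.mp4","/home/fred/video",204800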
duplicate_FILES2_<date-time>.csv Format – Every row is a unique sha256 value with the file size and all files that share that sha256 value.
If more files match the checksum, they are added as column pairs, i.e. repeats of columns 5 and 6.
CSV Columns
- sha256 checksum
- file size in KiB
- fully pathed file name (file #1)
- full path of containing directory (file #1)
- fully pathed file name (file #2)
- full path of containing directory (file #2)
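An illustrative row with two matching files (the checksum, paths, and size below are invented; the exact field quoting may differ):
3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b,204800,"/home/fred/video/holiday.mp4","/home/fred/video","/home/fred/downloads/holiday.mp4","/home/fred/downloads"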
A CSV list of all files processed: all_files_<date-time>.csv. Format: check_sum,"<full path>/<file name>"
A CSV list of all unique files: unique_files_<date-time>.csv. Format: check_sum,"<full path>/<file name>"
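An illustrative line from either report (the checksum and path below are invented):
3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b,"/home/fred/video/holiday.mp4"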
If the unique files report is missing, then there are no unique files.
Basic logging is written to log_<date-time>.txt.
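Because the reports are plain CSV, they can be post-processed with standard command line tools as well as spreadsheets. A minimal sketch (the report file name below is illustrative, and it assumes the checksum is the first comma-separated field of duplicate_FILES1):
# list the checksums with the most duplicate copies
cut -d, -f1 duplicate_FILES1_240101-120000.csv | sort | uniq -c | sort -rn | head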