Thursday, December 1, 2011

Looking for duplicate files using freeware and open source tools

Recently I have been consolidating my files and backups in an attempt to centralise my backup efforts, and to also optimise hard drive usage and reduce data duplication. For me, this spans Windows, Macs and Linux systems.
Documented here are some of my findings, useful commands, and programs I have been using to look for duplicate files.

The open source

I tried a few open source tools including Duplicate Files Finder (windows) and DUMP3 (java). With all respect to the authors' work on their projects, they didn't do it for me. I would be happy to learn of other open tools for this kind of work.

DFF seemed to work, but personally I didn't get on very well with the results window, especially with a lot of results. The file counts seemed to be a little off too :( I didn't use it in the end.
I cannot recommend DUMP3: out of the box it was sooooo slow I gave up. On the positive side, the GUI has a lot of promise.
There is also DUFF (windows), which looks like it has potential but sounded buggy and I didn't try it.

Googling also revealed a duplicate file finder thread on superuser; feel free to check it out.
This su thread led me to discover Michael Thummerer's AllDup. What a great tool, and it's freeware. Props to su user Synetech inc for introducing me!

Yet Another Duplicate File Remover was also mentioned on the su thread and I came across it when Googling too. I did not test YADFR but it looks like it has potential.

The freeware

AllDup is everything I can think of that a duplicate file manager needs to be, with whistles and bells on! AllDup even stopped me overzealously nuking some MP3s by intelligently checking "are you really sure you want to do that?"

The GUI is fantastic, and made perfect sense to me. The results window is the best I've seen, very powerful. There are many features and implementations in this program where I have thought "that's probably how I would have done it", which is why I like it so much and highly recommend you check it out.

I've checked tens of thousands of files with AllDup, of many flavours and sizes, and it works FAST! The results are so easy to work with too!

Freeware also worth mentioning, but not tested, is Duplicate Cleaner.

The Homebrew

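Before I found AllDup, my approach was as follows. It should work with bash under Cygwin and Linux. On OS X the stock BSD find lacks GNU options such as -printf and -regextype, so the find commands below need GNU findutils installed there.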

The first challenge was to generate a list of files to work with. I went with generating a list of file extensions for a given location, so I could see what types of files I was dealing with.

First, some sanity checks to verify results against later: how many and what types of objects are we dealing with?
$ find . | wc -l && find . -type f | wc -l && find . -type d | wc -l
12840 (all objects)
11354 (files)
1486 (dirs)
The last two numbers should sum to the first number. If not, you've probably got some links or special file types in the location you're analysing. Good to know this up-front.
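If the numbers don't add up, a per-type breakdown shows where the difference comes from. The following is a minimal sketch rather than something from my original notes; count_by_type is a hypothetical helper name, and it assumes no file names contain newlines (which would throw off wc -l).

```shell
# Hypothetical helper: count objects under a directory, broken down by type.
# The body runs in a subshell ( ... ) so its variables don't leak out.
count_by_type() (
    dir="${1:-.}"
    all=$(find "$dir" | wc -l)
    files=$(find "$dir" -type f | wc -l)
    dirs=$(find "$dir" -type d | wc -l)
    links=$(find "$dir" -type l | wc -l)
    # Whatever is left over is a FIFO, socket, device node, etc.
    other=$((all - files - dirs - links))
    echo "all=$((all)) files=$((files)) dirs=$((dirs)) links=$((links)) other=$((other))"
)
```

Run it as count_by_type . (or give it another path). A non-zero links or other count explains why files plus dirs falls short of the total.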

Now to figure out what kinds of file are in the location you're analysing. First up, check how many files don't have a regular extension:
$ find . -type f | egrep -vi '\..{1,5}$' | wc -l
32
If the result is non-zero, you'll want to inspect those files to decide whether they can be ignored.
$ find . -type f | egrep -vi '\..{1,5}$' | less
The following command should give a complete list of file extensions for the given location. From the results one can choose the extensions to focus on.
$ find . -type f -and -printf "%f\n" | egrep -io '\.[^.]*$' | sort | uniq
.jpeg
.jpg
.m3u
.m4a
.m4v
.mp3
.mpeg
.ogg
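It can also help to know how many files there are of each extension. Here is a rough sketch of that idea, assuming file names without newlines; ext_histogram is a made-up name. It differs from the command above in two ways: it excludes "/" in the pattern instead of using -printf, so it should not depend on GNU find, and it folds case so that .MP3 and .mp3 are counted together.

```shell
# Hypothetical helper: print a count per file extension, most common first.
ext_histogram() {
    find "${1:-.}" -type f \
        | grep -Eo '\.[^./]+$' \
        | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c | sort -rn
}
```

Called as ext_histogram . it prints one line per extension with the count in the first column. Files without an extension are simply left out, and dot-files such as .bashrc are counted as an extension, just as with the command above.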
Now manually inspect the list of files for the extensions that are interesting:
$ find . -regextype posix-egrep -and -type f -and -iregex '.*\.(mp3|m4a|ogg)$'  | less
Once you're happy with the list, it's time to generate some checksums. There are many hashing algorithms available for doing this. Two common ones are MD5 and CRC. On my i7 system under Cygwin, scanning ~9000 files took 44 minutes with md5sum vs. 20 minutes with cksum.
$ find . -regextype posix-egrep -and -type f -and -iregex '.*\.(mp3|m4a|ogg)$' -and -print0 | xargs -0 -P1 -I{} -- cksum {} > my-stuff.cksum
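One thing to be aware of with the command above: xargs -I{} starts a separate cksum process for every file. Dropping -I lets xargs pass many file names to each cksum call, which may cut the process start-up overhead (I have not timed it, so treat any speed-up as an assumption). The sketch below also swaps the regex for -iname tests, which should work with non-GNU find too; checksum_audio is a hypothetical name.

```shell
# Hypothetical helper: checksum all mp3/m4a/ogg files under a directory,
# handing file names to cksum in batches rather than one at a time.
# Note: if nothing matches, GNU xargs still runs cksum once on empty input.
checksum_audio() {
    find "${1:-.}" -type f \
        \( -iname '*.mp3' -o -iname '*.m4a' -o -iname '*.ogg' \) -print0 \
        | xargs -0 cksum
}
```

Usage would be along the lines of checksum_audio . > my-stuff.cksum, giving the same three-column output (checksum, size, path) as before.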
So after generating a list of checksums with your preferred hashing algorithm, it's time to put the hashes to work. The concept is as follows:
Isolate the hashes, sort them, find duplicates, output to file, use the list of duplicate hashes to match the files in your main checksum list.
It looks something like this:
$ awk '{print $1}' my-stuff.cksum | sort | uniq -d > my-stuff-dups.txt
Now search for the dups in your cksum file:
$ fgrep -f my-stuff-dups.txt my-stuff.cksum | less
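A caveat with fgrep -f is that it matches each hash anywhere on the line, so a hash that happens to appear inside a file size or a file name produces a false hit. As a hedged alternative, the sketch below reads the cksum list twice with awk and keeps only lines whose checksum and size columns occur more than once. list_duplicates is a made-up name, and it assumes the usual "checksum size path" layout that cksum prints.

```shell
# Hypothetical helper: print only the lines of a cksum list whose
# (checksum, size) pair occurs more than once, grouped by checksum.
list_duplicates() {
    # First pass (NR == FNR) counts each pair; second pass prints repeats.
    awk 'NR == FNR { count[$1 " " $2]++; next }
         count[$1 " " $2] > 1' "$1" "$1" | sort -n
}
```

For example: list_duplicates my-stuff.cksum | less. Bear in mind that cksum is only a 32-bit CRC, so two different files can share a checksum. Compare suspected duplicates with cmp before deleting anything.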

1 comment:

ivanden said...

Here is my suggestion for this: "DuplicateFilesDeleter", which does the rest of the work. Google it and run it on your Windows PC.
Good Luck.