Finding files with identical contents

Need to search a filesystem for all the files which have identical contents? Read on for a Perl script that does that.

Note that this isn’t the (much simpler) problem of finding files in different directories that share a name; I’m talking about the actual data inside the files being duplicated. The script is also reasonably efficient, so it remains useful when you have an extremely large number of files, very large files, or both.

The basic idea is to start by building a hash table keyed by file size in bytes, where each bucket holds the files of that size. (It’s cheap to find the size of a file: you don’t have to read or even open it, just fetch the metadata.) Two identical files must necessarily have the same size, so any bucket containing exactly one file can be thrown out. For the buckets that remain, compare the contents of the files within each bucket, and only against other files in the same bucket.
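To make the size-bucket idea concrete, here is a minimal sketch in Python. It is not the Perl script itself, just an illustration of the approach described above. The function name find_duplicates and the way identical files are clustered inside a bucket are my own assumptions for the example; the real script may compare files differently.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths):
    """Return lists of paths whose files have identical contents.

    Hypothetical helper for illustration; not taken from the Perl script.
    """
    # Step 1: bucket by size. This only reads metadata; no file is opened.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        # Step 2: a file with a unique size cannot have a duplicate.
        if len(same_size) < 2:
            continue

        # Step 3: compare contents, but only within this bucket.
        # Each cluster is a list of files already known to be identical,
        # so comparing against its first member is enough.
        clusters = []
        for path in same_size:
            for cluster in clusters:
                # shallow=False forces a byte-by-byte comparison.
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])

        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Called with a list of file paths (for example, one gathered by walking a directory tree), it returns one list per set of identical files. Files whose size is unique never get opened at all, which is where most of the savings come from.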

As with most of the useful shell scripts I’ve written, this one was created in an attempt to use technology to avoid the deserved consequences of my own carelessness. In this particular case, I’d made several copies of a big batch of digital camera pictures, renamed some of the pictures in some of the copies, and shuffled them around a fair amount before realizing I was duplicating a lot of data. The script was a good remedy for that problem, and I’ve frequently found it handy since.

While it is written in Perl and is reasonably portable, it’s designed to be used as part of a pipeline. Thus, it is only really useful as-is on a UNIX-like system, and then only to users with a reasonable degree of experience with the shell.

download script source (3K plain text)

The “same” script is my original work, released under GPL v2. That means it’s free to use and share, but comes with no warranty. Please see the license for full details.

By dhenke

Email: dhenke@mythopoeic.org
