[plug] Managing files across diverse media? file deduplication, checksums
Robert Parker
rlp1938 at gmail.com
Tue Sep 23 12:53:08 UTC 2014
On Mon, Sep 22, 2014 at 2:58 PM, Nick Bannon <nick at ucc.gu.uwa.edu.au> wrote:
> On Mon, Sep 01, 2014 at 08:10:05PM +0800, John McCabe-Dansted wrote:
> > [...] backup everything onto new media yet again. I'd like to be able to
> > quickly maintain a list of sha256 or md5 sums that could be used to:
>
> I know just how you feel!
>
> > 1) To list all files on X that are not duplicated/backed up on other
> > media
> > 2) Deduplicate files on X quickly (using existing md5 hashes).
> > 3) To list all files that are not duplicated onto offline or WORM storage
> > 4) To list all files that are not duplicated onto offsite storage
> > 5) Match JPGs by EXIF date.
> but!
> > md5deep wants to regenerate hashes for unmodified files on every run.
>
> I haven't found the perfect thing yet; but something might be close to
> it.
>
> First, I've found "rdfind" pretty simple. If you want to straightforwardly
> remove duplicates or consolidate them with links/symlinks, it will do
> that efficiently across multiple directories. If you just take a deep
> breath and do that, it can make manual cleanups and de-duplication much
> more tractable.
>
> "summain" looks interesting, though at first glance I expect it to have
> the same problem as "md5deep". The same author's rather neat backup
> program "obnam" has done that in the past, but I need to check it out
> again.
>
> "shatag" looks promising!
> Maybe "cfv"?
>
> https://tracker.debian.org/rdfind
> https://tracker.debian.org/summain
> https://tracker.debian.org/obnam
> https://tracker.debian.org/shatag
> https://tracker.debian.org/cfv
>
> > I am looking at writing a tool to record and manage file IDs across
> > media [1], but doing this right could take quite a while.
> > [1] https://github.com/gmatht/joshell/tree/master/mass_file_management
>
> Nick.
>
I have written something that may help:
https://github.com/rlp1938/Duplicates
It's written in C, so the usual ./configure && make && sudo make install is
needed.
It does work across file systems if required.
There are some optimisations:
1. Zero-size files are never considered.
2. Files with unique sizes are dropped from consideration.
3. Files of identical size are dropped if they don't match on their first
min(filesize, 128 KB) bytes (a rough sketch of this step follows below).
4. The rest are md5summed, and those with matching sums are written to
stdout. Redirect to a file, of course.
Broken symlinks discovered along the way are output on stderr.
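To make step 3 concrete, here is a rough standalone sketch, not the actual
Duplicates source: it just illustrates comparing two same-sized files on
their leading bytes before bothering with md5. The 128 KB cut-off and the
names are only for illustration.

#include <stdio.h>
#include <string.h>

#define CHUNK (128 * 1024)   /* 128 KB cut-off, as in step 3 above */

/* Return 1 if the two (same-sized) files match on their first
 * min(filesize, CHUNK) bytes, 0 if they differ, -1 on error. */
static int leading_bytes_match(const char *path_a, const char *path_b)
{
    unsigned char buf_a[8192], buf_b[8192];
    size_t total = 0;
    int result = 1;

    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1;
    }
    while (total < CHUNK) {
        size_t want = sizeof buf_a;
        if (CHUNK - total < want)
            want = CHUNK - total;
        size_t na = fread(buf_a, 1, want, fa);
        size_t nb = fread(buf_b, 1, want, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            result = 0;               /* differ within the leading chunk */
            break;
        }
        if (na == 0)                  /* EOF before the cut-off: done */
            break;
        total += na;
    }
    fclose(fa);
    fclose(fb);
    return result;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }
    int m = leading_bytes_match(argv[1], argv[2]);
    if (m < 0) {
        perror("leading_bytes_match");
        return 2;
    }
    puts(m ? "possible duplicates - worth md5summing" : "different");
    return 0;
}

Only the files that survive that cheap comparison go on to the full md5 pass.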
Also:
https://github.com/rlp1938/Processdups
That is an interactive program to help you deal with the list of duplicates
once you have it. For each duplicated group it gives you the option of
preserving only one file, hard linking them together (if on one filesystem;
symlinking otherwise), deleting them all, or just ignoring the group.
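The hard-link/symlink option works roughly like this. Again, just a sketch
under my own assumptions rather than the Processdups code; a real tool would
be more careful about error recovery (e.g. rename the duplicate aside instead
of unlinking it first, and use an absolute path as the symlink target).

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Replace 'dup' with a hard link to 'keep', falling back to a
 * symlink when the two paths sit on different filesystems. */
static int relink_duplicate(const char *keep, const char *dup)
{
    if (unlink(dup) == -1)        /* remove the duplicate first */
        return -1;
    if (link(keep, dup) == 0)     /* same filesystem: hard link */
        return 0;
    if (errno == EXDEV)           /* crosses filesystems: symlink */
        return symlink(keep, dup);
    return -1;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s keep duplicate\n", argv[0]);
        return 2;
    }
    if (relink_duplicate(argv[1], argv[2]) == -1) {
        perror("relink_duplicate");
        return 1;
    }
    return 0;
}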
Good luck.
Bob Parker