[plug] Managing files across diverse media? file deduplication, checksums

Nick Bannon nick at ucc.gu.uwa.edu.au
Mon Sep 22 07:58:46 UTC 2014


On Mon, Sep 01, 2014 at 08:10:05PM +0800, John McCabe-Dansted wrote:
> [...] backup everything onto new media yet again. I'd like to be able to
> quickly maintain a list of sha256 or md5 sums that could be used to:

I know just how you feel!

> 1) To list all files on X that are not duplicated/backed up on other media
> 2) Deduplicate files on X quickly (using existing md5 hashes).
> 3) To list all files that are not duplicated onto offline or WORM storage
> 4) To list all files that are not duplicated onto offsite storage
> 5) Match JPGs by EXIF date.
but!
> md5deep wants to regenerate hashes for unmodified files on every run.

I haven't found the perfect thing yet, but some of these might be
close.

First, I've found "rdfind" pretty simple. If you want to straightforwardly
remove duplicates or consolidate them with links/symlinks, it will do
that efficiently across multiple directories. If you just take a deep
breath and do that, it can make manual cleanups and de-duplication much
more tractable.
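For reference, the core idea is simple enough to sketch in Python:
group files by size, hash only the groups with more than one candidate,
and hardlink later copies to the first. (A simplified sketch of my own,
not rdfind's actual algorithm, which as I understand it also does
cheaper first/last-byte prechecks and ranks which copy to keep.)

```python
#!/usr/bin/env python3
"""Sketch of size-then-hash deduplication using hardlinks.

Simplified illustration of the rdfind idea: only files that share a
size are hashed at all, and later duplicates are replaced by hardlinks
to the first copy found.
"""
import hashlib
import os
from collections import defaultdict


def sha256_of(path, bufsize=1 << 20):
    """Stream a file through sha256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(roots):
    """Replace duplicate files under the given roots with hardlinks.

    Returns a list of (duplicate, kept) path pairs that were linked.
    """
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                by_size[os.stat(path).st_size].append(path)

    linked = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue                    # unique size: cannot be a duplicate
        by_hash = {}
        for path in sorted(paths):
            digest = sha256_of(path)
            keep = by_hash.setdefault(digest, path)
            if keep != path and not os.path.samefile(keep, path):
                os.unlink(path)         # replace duplicate with a hardlink
                os.link(keep, path)
                linked.append((path, keep))
    return linked
```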

"summain" looks interesting, though at first glance I expect it to have
the same problem as "md5deep". The same author's rather neat backup
program "obnam" has done that in the past, but I need to check it out again.
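The usual workaround for that rehashing problem is to cache each hash
keyed by size and mtime, and rehash a file only when those change. A
minimal sketch, with a TSV manifest format of my own devising (not
md5deep's or summain's):

```python
#!/usr/bin/env python3
"""Incrementally maintain a sha256 manifest, skipping unchanged files.

Sketch only: the manifest is a TSV of path, size, mtime, sha256, and a
file is rehashed only if its recorded size or mtime no longer matches.
"""
import hashlib
import os


def sha256_of(path, bufsize=1 << 20):
    """Stream a file through sha256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()


def load_manifest(manifest_path):
    """Read the manifest back into {path: (size, mtime, digest)}."""
    entries = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            for line in f:
                path, size, mtime, digest = line.rstrip("\n").split("\t")
                entries[path] = (int(size), float(mtime), digest)
    return entries


def update_manifest(root, manifest_path):
    """Rescan root, reusing cached hashes for unchanged files."""
    old = load_manifest(manifest_path)
    new = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            cached = old.get(path)
            if cached and cached[0] == st.st_size and cached[1] == st.st_mtime:
                new[path] = cached      # size and mtime match: skip rehash
            else:
                new[path] = (st.st_size, st.st_mtime, sha256_of(path))
    with open(manifest_path, "w") as f:
        for path in sorted(new):
            size, mtime, digest = new[path]
            f.write(f"{path}\t{size}\t{mtime}\t{digest}\n")
    return new
```

Note the trade-off: like any mtime-based cache, a file rewritten with
the same size and a restored mtime keeps its stale hash.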

"shatag" looks promising!
Maybe "cfv"?

https://tracker.debian.org/rdfind
https://tracker.debian.org/summain
https://tracker.debian.org/obnam
https://tracker.debian.org/shatag
https://tracker.debian.org/cfv

> I am looking at writing a tool to record and manage file IDs across
> media [1], but doing this right could take quite a while.
> [1] https://github.com/gmatht/joshell/tree/master/mass_file_management
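For what it's worth, once the sums exist, points 1, 3 and 4 above
reduce to a set difference over checksum manifests. A minimal sketch,
assuming manifests with one hash as the first whitespace-separated
field per line (the format and function names are my assumption, not
any existing tool's):

```python
def load_hashes(manifest_path):
    """Read a manifest: one line per file, hash in the first field."""
    with open(manifest_path) as f:
        return {line.split()[0] for line in f if line.strip()}


def not_backed_up(local_manifest, backup_manifests):
    """Hashes present locally but absent from every backup manifest."""
    backed_up = set()
    for manifest in backup_manifests:
        backed_up |= load_hashes(manifest)
    return load_hashes(local_manifest) - backed_up
```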

Nick.

-- 
   Nick Bannon   | "I made this letter longer than usual because
nick-sig at rcpt.to | I lack the time to make it shorter." - Pascal

