<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 22, 2014 at 2:58 PM, Nick Bannon <span dir="ltr"><<a href="mailto:nick@ucc.gu.uwa.edu.au" target="_blank">nick@ucc.gu.uwa.edu.au</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">On Mon, Sep 01, 2014 at 08:10:05PM +0800, John McCabe-Dansted wrote:<br>

> [...] backup everything onto new media yet again. I'd like to be able to<br>

> quickly maintain a list of sha256 or md5 sums that could be used to:<br>

<br>

I know just how you feel!<br>

<br>

> 1) To list all files on X that are not duplicated/backed up on other media<br>

> 2) Deduplicate files on X quickly (using existing md5 hashes).<br>

> 3) To list all files that are not duplicated onto offline or WORM storage<br>

> 4) To list all files that are not duplicated onto offsite storage<br>

> 5) Match JPGs by EXIF date.<br>

but!<br>

> md5deep wants to regenerate hashes for unmodified files on every run.<br>

<br>

I haven't found the perfect thing yet; but something might be close to<br>

it.<br>

<br>

First, I've found "rdfind" pretty simple. If you want to straightforwardly<br>

remove duplicates or consolidate them with links/symlinks, it will do<br>

that efficiently across multiple directories. If you just take a deep<br>

breath and do that, it can make manual cleanups and de-duplication much<br>

more tractable.<br>

<br>

"summain" looks interesting, though at first glance I expect it to have<br>

the same problem as "md5deep". The same author's rather neat backup<br>

program "obnam" has done that in the past, but I need to check it out again.<br>

<br>

"shatag" looks promising!<br>

Maybe "cfv"?<br>

<br>

<a href="https://tracker.debian.org/rdfind" target="_blank">https://tracker.debian.org/rdfind</a><br>

<a href="https://tracker.debian.org/summain" target="_blank">https://tracker.debian.org/summain</a><br>

<a href="https://tracker.debian.org/obnam" target="_blank">https://tracker.debian.org/obnam</a><br>

<a href="https://tracker.debian.org/shatag" target="_blank">https://tracker.debian.org/shatag</a><br>

<a href="https://tracker.debian.org/cfv" target="_blank">https://tracker.debian.org/cfv</a><br>

<br>

> I am looking at writing a tool to record and manage file IDs across<br>

> media [1], but doing this right could take quite a while.<br>

> [1] <a href="https://github.com/gmatht/joshell/tree/master/mass_file_management" target="_blank">https://github.com/gmatht/joshell/tree/master/mass_file_management</a><br>

<span class=""><font color="#888888"><br>

Nick.<br>

<br></font></span></blockquote><div>I have written something that may help:</div><div><a href="https://github.com/rlp1938/Duplicates">https://github.com/rlp1938/Duplicates</a></div><div>It's written in C, so the usual ./configure &amp;&amp; make &amp;&amp; sudo make install is needed.</div><div><br></div><div>It does work across file systems if required.</div><div>There are some optimisations:</div><div>1. Zero-size files are never considered.</div><div>2. Files with unique sizes are dropped from consideration.</div><div>3. Files of identical size are dropped if their first 128 KB (or the whole file, if it is smaller) don't match.</div><div>4. The rest are md5summed, and those with matching sums are output to stdout. Redirect that to a file, of course.</div><div><br></div><div>Broken symlinks discovered along the way are output on stderr.</div><div><br></div><div>Also:</div><div><a href="https://github.com/rlp1938/Processdups">https://github.com/rlp1938/Processdups</a><br></div><div>That is an interactive program to help you deal with the consequences of finding your list of duplicates. For each duplicated group it gives you the option of preserving only one file, linking the files together (hard links if they are on one filesystem, symlinks otherwise), deleting all of them, or just ignoring the group.</div><div><br></div><div>Good luck.</div><div>Bob Parker</div><div><br></div><div><br></div></div><br>
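<div>For anyone who wants to see the shape of the four filtering steps listed above, here is a rough Python sketch. It is not taken from the Duplicates source and may well differ from what the C program does; the function names are made up for illustration, and the 128 kilobyte prefix is assumed to mean 128 * 1024 bytes.</div>
<pre>
```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 128 * 1024  # assumption: the prefix limit is 128 * 1024 bytes


def read_prefix(path, limit=PREFIX_BYTES):
    """Return the first `limit` bytes of a file (or the whole file if smaller)."""
    with open(path, "rb") as handle:
        return handle.read(limit)


def md5_of(path, chunk=1024 * 1024):
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def split_by(paths, key):
    """Group paths by key(path) and keep only groups with two or more members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) != 1]


def find_duplicates(root):
    """Return lists of paths under `root` whose contents appear identical."""
    candidates = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:  # step 1: ignore empty files
                continue
            candidates.append(path)

    duplicates = []
    for same_size in split_by(candidates, os.path.getsize):    # step 2
        for same_prefix in split_by(same_size, read_prefix):   # step 3
            duplicates.extend(split_by(same_prefix, md5_of))   # step 4
    return sorted(duplicates)
```
</pre>
<div>Calling find_duplicates on a directory returns one list of paths per duplicated group. The cheap size and prefix checks come first so that only the few remaining candidates are read in full for the MD5 pass.</div>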

</div></div>