[plug] speed: find vs ls

Brad Campbell brad at fnarfbargle.com
Fri Jul 29 09:56:43 AWST 2022


On 29/7/22 02:19, Thomas Cuthbert wrote:
> Also https://github.com/sharkdp/fd is a find clone with an emphasis on performance which might be useful!
> 

G'day Thomas,

Appreciate the input. This process just collects the list of files to then run a verification step on, so the cost is in finding the files rather than checking them.

In this instance it's not appreciably faster than find. An strace shows it's doing what find does and running stat on each file in each directory.

root at rpi31:/mnt/backup/work# echo 2 > /proc/sys/vm/drop_caches 
root at rpi31:/mnt/backup/work# time fdfind --maxdepth 2 --type f bkb.rhash.crc32 .

real	6m34.952s
user	0m6.893s
sys	0m24.713s
root at rpi31:/mnt/backup/work# echo 2 > /proc/sys/vm/drop_caches 
root at rpi31:/mnt/backup/work# time fdfind --maxdepth 2 --type f --glob bkb.rhash.crc32 .

real	6m31.966s
user	0m6.958s
sys	0m24.489s
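
For anyone who wants to quantify that rather than eyeball the trace, strace's syscall summary mode gives a quick comparison; roughly something like this (counts will obviously vary with the tree, and -f is there because fdfind runs worker threads):

# summarise syscall counts instead of logging every call
strace -c -f fdfind --maxdepth 2 --type f bkb.rhash.crc32 . > /dev/null
strace -c find . -maxdepth 2 -type f -name bkb.rhash.crc32 > /dev/null

Both end up dominated by the per-file stat calls here, which lines up with what the full trace showed.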


> On Fri, 29 July 2022, 2:15 am Thomas Cuthbert, <tcuthbert90 at gmail.com> wrote:
> 
>     As a guess I'd say the excessive metadata syscalls are due to your -type predicate and maybe the format string (find has a number of other fmt parameters that reference stat info). It sounds like you have lots of directories too; limiting the number of directories will reduce the rate of dentry and metadata reads. Squid does something similar to group objects together with its L1/L2 cache_dir hierarchy.
> 
>     Also, do you need to hash the whole file? Seeing as you already have the metadata in cache, you could probably get a quick performance win by comparing the metadata to a previous value, or by only hashing the metadata.
> 
>     On Thu, 28 July 2022, 5:22 pm Brad Campbell, <brad at fnarfbargle.com> wrote:
> 
>         G'day all,
> 
>         An observation while I'm still playing with my sizeable set of backup directories.
>         I've been adding a bit that creates a file of crc32s of the updated files, and then toying around with a script to crawl the drive and check them all.
> 
>         I started using find to give me a list of dirs that contain the files. It was spending a *lot* of time just creating the list. In fact it spent more time looking for the files than on the subsequent iteration and check of each one.
>         I must qualify that with the fact that I'm about 10 days into creating the crcs, and most directories already have ~800 days' worth of backups.
> 
>         The script run with:
>         for j in `find . -maxdepth 2 -type f -name bkb.rhash.crc32 -printf "%h\n"` ; do
> 
>         Checked 170 directories with 0 errors in 0:00:34:58
> 
>         stracing find, it's dropping into each directory and performing a stat on every file. Some dirs have a *lot* of files.
> 
>         I thought about trying a bit of globbing with ls instead, and blow me down if it wasn't "a bit faster".
> 
>         The script run with:
>         for j in `ls ??????-????/bkb.rhash.crc32 2>/dev/null` ; do j=$(dirname $j)
> 
>         Checked 170 directories with 0 errors in 0:00:09:49
> 
>         I know premature optimisation is the root of all evil, but this one might have been a case of "using the right tool".
> 
>         Regards,
>         Brad
>         -- 
>         An expert is a person who has found out by his own painful
>         experience all the mistakes that one can make in a very
>         narrow field. - Niels Bohr
>         _______________________________________________
>         PLUG discussion list: plug at plug.org.au
>         http://lists.plug.org.au/mailman/listinfo/plug
>         Committee e-mail: committee at plug.org.au
>         PLUG Membership: http://www.plug.org.au/membership
> 
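
For reference, a slightly tidied sketch of that glob loop, using the shell glob directly rather than ls and with the verification step left as a placeholder (the rhash --check invocation is an assumption on my part; substitute whatever check actually gets run):

for f in ??????-????/bkb.rhash.crc32 ; do
    [ -f "$f" ] || continue          # skip the literal pattern if the glob matched nothing
    d=$(dirname "$f")
    # placeholder verification step; adjust to suit
    ( cd "$d" && rhash --check bkb.rhash.crc32 )
done

The subshell keeps each cd from leaking into the next iteration.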


