[plug] Find similarly named files in directories
Timothy White
weirdit at gmail.com
Sat Jan 13 08:56:09 WST 2007
On 1/13/07, Carl Gherardi <carl.gherardi at gmail.com> wrote:
> On 1/12/07, Timothy White <weirdit at gmail.com> wrote:
> > Ok, so 2 simple sed's and I solve the space problem!! Not sure if
> > there are any other "bugs"
> >
> > find| sed 's/\ /:::/g' |sed -r 's/.*\/(.*)/\0 \1/'|sort -i -k 2|uniq
> > -i --all-repeated=separate -f 1| sed 's/[^ ]*$//' | sed 's/:::/\ /g'
> >
> > Rather simple, first check no file name has 3 :'s in a row, if it does,
> > find another "uniq" sequence to replace it with.
>
> Was a challenge for golf originally and I'm bored. 138 chars is benchmark.
>
> Slight shave - remove long options and a couple of seds:
> find . -printf "%h/%f:::%f\n" | sort -i -k 2 | uniq -i -D -f 1 | sed
> 's/:::.*$//'
> 99 chars
>
> Pretty sure that's still good for whitespace. Same qualification for
> files with ::: in the name.
I wasn't able to get yours to reproduce the results of mine... Your
original one places ::: between the whole path/file and the basename.
Problem is, most of the utilities require the field separator to be
whitespace, and not an arbitrary character (or set of characters). When
I first ran yours, I got a whole heap of false matches at the top of
the output, and I didn't seem to get that many true matches. I tried to
modify it to use a tab as the separator, and all I could do was reduce
the number of hits, and the more I tried with the tab char, the fewer
hits I got. The whole problem is the uniq command, as with -f 1 it
skips up to the first whitespace, then compares. Also, the reason my
uniq command was so long was to separate each block of similar files
with a newline between the blocks.
I'm not sure, but maybe your command somehow sees the same directory
name and thinks that they are similar (probably a space in the name?).
tim@linjeni:/data$ find . | sed 's/\ /:::/g' |sed -r 's/.*\/(.*)/\0 \1/'|sort -i -k 2|uniq -i -D -f 1| sed 's/[^ ]*$//' | sed 's/:::/\ /g'|wc -l
63115
tim@linjeni:/data$ find . -printf "%h/%f:::%f\n" | sort -i -k 2 | uniq -i -D -f 1 | sed 's/:::.*$//'| wc -l
66304
tim@linjeni:/data$ find . -printf "%h/%f\t%f\n"|sort -i -k 2|uniq -i -D -f 1|sed 's/\t.*$//'| wc -l
44835
So the problem, as the uniq man page says:
"A field is a run of whitespace, then non-whitespace characters.
Fields are skipped before chars."
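A minimal sketch of that rule in action, assuming GNU uniq (for -D) and
made-up sample paths rather than anything from the /data tree above.
With no whitespace at all, the whole line is field 1, so -f 1 leaves
nothing to compare and every line looks like a repeat. With a tab
separator, a space inside a directory name shifts the fields and a real
match gets missed.

```shell
# Hypothetical sample data, for illustration only.
# Case 1: ":::" separator, no whitespace anywhere, different basenames.
colons=$(printf '%s\n' './a/one.txt:::one.txt' './b/two.txt:::two.txt')
# The whole line is field 1, so after -f 1 both lines compare as empty
# and uniq reports them as duplicates of each other.
false_hits=$(printf '%s\n' "$colons" | uniq -D -f 1 | wc -l)

# Case 2: the same lines with a space separator. Field 2 is now the
# basename, the basenames differ, so nothing is reported.
clean_hits=$(printf '%s\n' "$colons" | sed 's/:::/ /' | uniq -D -f 1 | wc -l)

# Case 3: tab separator and identical basenames, but one directory name
# contains a space. Field 1 of the first line ends at "./my", so the
# text left to compare differs and the real match is missed.
missed_hits=$(printf './my dir/foo.txt\tfoo.txt\n./other/foo.txt\tfoo.txt\n' |
    uniq -D -f 1 | wc -l)

echo "false: $false_hits  clean: $clean_hits  missed: $missed_hits"
```

Case 1 would account for the heap of false matches, and case 3 for the
hit count dropping once a tab is used, though that is only a guess about
what happened on the real data.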
I did realise that I could put the basename before the path, and
compare to a limit of chars, but that was getting messy.
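For what it's worth, the basename-first idea might look something like
the sketch below. It assumes GNU uniq (for -D and -w) and awk, and the
function name and the column width of 40 are arbitrary choices for
illustration. It also shows why it gets messy: names longer than the
column are truncated before comparison, and multi-byte characters would
upset the padding.

```shell
# Reads one path per line on stdin and prints the paths whose basename
# (first 40 characters, case folded) also appears on another path.
# The width 40 is an arbitrary assumption, not a recommended value.
similar_by_width() {
    awk -F/ '{ printf "%-40.40s %s\n", $NF, $0 }' |  # fixed-width basename column, then the path
    LC_ALL=C sort -f |                               # bring equal basenames together, folding case
    LC_ALL=C uniq -i -D -w 40 |                      # compare only the basename column
    cut -c 42-                                       # drop the column and separator, keep the path
}

printf '%s\n' './a b/Foo.txt' './c/foo.TXT' './d/bar.txt' | similar_by_width
```

Because uniq -w counts characters rather than fields, spaces in the path
or the basename don't matter here.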
If you can make find "escape" whitespace, then you're cooking with gas.
I just gave up on find, escaped the whitespace (with sed), then did
the basename extraction and paste, and went from there. You can
probably combine my last 2 sed commands, I'm not sure.
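A sketch of how the two trailing seds could be folded into one, assuming
GNU sed, sort and uniq. The function name is made up, and it reads paths
on stdin instead of calling find so it can be tried on a few sample
lines. Two deliberate differences from the pipeline above: it uses
sort -f, since in GNU sort -i means "ignore nonprinting characters" and
-f is the option that folds case, and the combined sed removes the
separating space along with the basename so no trailing space is left.

```shell
# Reads one path per line on stdin; prints groups of paths whose
# basenames match case-insensitively, with a blank line between groups.
# Same ":::" caveat as before: no file name may already contain ":::".
similar_names() {
    sed 's/ /:::/g' |                        # hide spaces so each line has exactly two fields
    sed -E 's/.*\/(.*)/& \1/' |              # append the basename as field 2
    sort -f -k 2 |                           # sort on the basename, folding case
    uniq -i --all-repeated=separate -f 1 |   # keep repeated basenames, one block per group
    sed 's/ [^ ]*$//; s/:::/ /g'             # drop field 2, then restore the spaces
}

printf '%s\n' './a b/Foo.txt' './c/foo.TXT' './d/bar.txt' | similar_names
```

On a real tree it would be fed from find, e.g. find . | similar_names.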
(Btw, I didn't just run wc -l on the command output, I did have a peek
around to get an idea of what was different.)
Tim
--
Linux Counter user #273956
Don't email joeblogs at scouts.org.au