Zsh Mailing List Archive
How do I find shortest match?
- X-seq: zsh-users 17089
- From: TJ Luoma <luomat@xxxxxxxxx>
- To: Zsh-Users List <zsh-users@xxxxxxx>
- Subject: How do I find shortest match?
- Date: Wed, 16 May 2012 15:18:31 -0400
I have a folder which has a lot of txt files, and in that folder
are a lot of duplicate files. Most of the duplicates are
numbered like this:
10-6- Make a universal 10-6-7 Snow Leopard installer-1.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-2.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-3.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-4.txt
10-6- Make a universal 10-6-7 Snow Leopard installer.txt
But not all of them. For example, I might have another identical
file named
todo-make-snowleopardinstaller.txt
What I want to do is go through the entire folder and find all
duplicate files (files with identical md5sum).
Then I want to keep ONLY the one with the shortest filename.
Here's what I have so far:
#!/bin/zsh
DIR=/Users/luomat/Dropbox/txt/
# to avoid 'arg list too long'
# note that 'gmd5sum' prints the sum
# and then two spaces, and then the filename
ALL=$(find $DIR -type f -print0 | xargs -0 gmd5sum)
# these are all the MD5 sums which occur MORE than one time
# ('uniq -d' prints one copy of each sum that is repeated)
SUMS=($(echo $ALL | awk '{print $1}' | sort | uniq -d))
for SUM in $SUMS
do
# for each unique MD5 sum, do this:
# get a list of all of the matching filenames MINUS the
# sum itself; (f) splits on newlines only, so filenames
# containing spaces stay intact
MATCHES=("${(@f)$(echo "$ALL" | egrep "^$SUM" | sed "s#^${SUM}  ##")}")
# ???
done
I don't know what to do in the ??? to compare the filenames and
choose the shortest one.
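One possible way to fill in the ??? step is sketched below. It assumes
MATCHES holds one filename per array element, and that "shortest" means
the fewest characters in the whole string (for a flat folder that is the
same as the shortest filename). The helper name shortest_name and the
KEEP variable are made up for this illustration, and the sample array
just stands in for one group of identical files.

```shell
# Hypothetical helper: print the shortest of its arguments.
# On a tie, the first one seen wins.
shortest_name() {
  local shortest="$1" f
  for f in "$@"; do
    if (( ${#f} < ${#shortest} )); then
      shortest="$f"
    fi
  done
  printf '%s\n' "$shortest"
}

# Sample data standing in for one group of identical files.
MATCHES=(
  "10-6- Make a universal 10-6-7 Snow Leopard installer-1.txt"
  "10-6- Make a universal 10-6-7 Snow Leopard installer.txt"
  "todo-make-snowleopardinstaller.txt"
)

KEEP=$(shortest_name "${MATCHES[@]}")

for f in "${MATCHES[@]}"; do
  # Dry run: only print the rm command for everything except $KEEP.
  [[ "$f" == "$KEEP" ]] || echo rm -- "$f"
done
```

The loop only echoes the rm commands, so the list can be checked before
anything is actually deleted.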
Any ideas?
Or is there a better way to do this?
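A possible alternative, offered only as a sketch: make a single pass
over the checksum output and remember the shortest name seen so far for
each sum in an associative array. It assumes every input line is the
sum, then two spaces, then the filename (as noted for gmd5sum above),
and that no filename contains a newline. The function name
list_redundant is made up for this illustration, and the here-document
uses fake sums as sample input.

```shell
# Hypothetical filter: reads "md5sum  filename" lines on stdin and
# prints every file that duplicates a file with a shorter name.
list_redundant() {
  local -A best          # md5 sum -> shortest filename seen so far
  local line sum name
  while IFS= read -r line; do
    sum="${line%% *}"    # text before the first space
    name="${line#*  }"   # text after the first two-space separator
    if [[ -z "${best[$sum]}" ]]; then
      best[$sum]="$name"             # first file with this sum
    elif (( ${#name} < ${#best[$sum]} )); then
      printf '%s\n' "${best[$sum]}"  # old candidate is longer: redundant
      best[$sum]="$name"
    else
      printf '%s\n' "$name"          # current file is not shorter: redundant
    fi
  done
}

# Sample input with fake sums; prints notes-1.txt and notes-2.txt.
list_redundant <<'EOF'
aaa  notes-1.txt
bbb  other.txt
aaa  notes.txt
aaa  notes-2.txt
EOF
```

With real data the input would come from the same pipeline as in the
script, that is, find $DIR -type f -print0 | xargs -0 gmd5sum, piped
into list_redundant. The output is only a list of candidates for
removal; nothing is deleted.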
Thanks
TjL