Zsh Mailing List Archive

How do I find shortest match?

X-seq: zsh-users 17089
From: TJ Luoma <luomat@xxxxxxxxx>
To: Zsh-Users List <zsh-users@xxxxxxx>
Subject: How do I find shortest match?
Date: Wed, 16 May 2012 15:18:31 -0400
List-help: <mailto:zsh-users-help@zsh.org>
List-id: Zsh Users List <zsh-users.zsh.org>
List-post: <mailto:zsh-users@zsh.org>
Mailing-list: contact zsh-users-help@xxxxxxx; run by ezmlm

I have a folder which has a lot of txt files, and in that folder are a lot of duplicate files. Most of the duplicates are numbered like this:


10-6- Make a universal 10-6-7 Snow Leopard installer-1.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-2.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-3.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-4.txt
10-6- Make a universal 10-6-7 Snow Leopard installer.txt

But not all of them. For example, I might have another identical file named


    todo-make-snowleopardinstaller.txt

What I want to do is go through the entire folder and find all duplicate files (files with identical md5sum).


Then I want to keep ONLY the one with the shortest filename.

Here's what I have so far

#!/bin/zsh

DIR=/Users/luomat/Dropbox/txt/

    # to avoid 'arg list too long'
    # note that 'gmd5sum' prints the sum
    # and then two spaces, and then the filename
ALL=$(find $DIR -type f -print0 | xargs -0 gmd5sum)

    # these are all the MD5 sums which occur MORE than one time
    # (which we get by removing any sums that appear only once)
SUMS=($(echo $ALL | awk '{print $1}' | sort  | uniq -c |\
        egrep -v '^   1 ' | awk '{print $2}'))



for SUM in $SUMS
do

    # for each unique MD5 sum, do this:


    # get a list of all of the matching filenames MINUS the
    # sum itself; the (f) flag splits on newlines only, so
    # filenames containing spaces stay in one piece
    MATCHES=(${(f)"$(echo "$ALL" | egrep "^$SUM" | sed "s#^${SUM}  ##")"})

    # ???

done

I don't know what to do in the ??? to compare the filenames and choose the shortest one.
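
One possible way to fill in the ??? step, as a minimal sketch: loop over the candidate names and remember the one whose basename has the fewest characters. The function name shortest_name is hypothetical, and comparing only the part after the last slash (rather than the whole path) is an assumption about what "shortest filename" should mean here. It sticks to plain sh constructs, so it should behave the same in zsh.

```shell
# Print whichever argument has the shortest basename.
# On a tie, the argument that appears first wins.
# (shortest_name is a hypothetical helper, not part of the script above.)
shortest_name() {
    local shortest shortest_base f base
    shortest=$1
    shortest_base=${1##*/}      # strip the directory part
    for f in "$@"; do
        base=${f##*/}
        # ${#base} is the length of the basename in characters
        if [ "${#base}" -lt "${#shortest_base}" ]; then
            shortest=$f
            shortest_base=$base
        fi
    done
    printf '%s\n' "$shortest"
}
```

Inside the loop this could be called as KEEP=$(shortest_name "${MATCHES[@]}"), assuming MATCHES holds one complete filename per element. Every other element of MATCHES would then be a candidate for removal; printing those names first and checking them by hand seems safer than deleting straight away.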


Any ideas?

Or is there a better way to do this?

Thanks

TjL

Follow-Ups:
- Re: How do I find shortest match?
  - From: Peter Stephenson
