Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

How do I find shortest match?



I have a folder which has a lot of txt files, and in that folder are a lot of duplicate files. Most of the duplicates are numbered like this:

10-6- Make a universal 10-6-7 Snow Leopard installer-1.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-2.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-3.txt
10-6- Make a universal 10-6-7 Snow Leopard installer-4.txt
10-6- Make a universal 10-6-7 Snow Leopard installer.txt

But not all of them. For example, I might have another identical file named

    todo-make-snowleopardinstaller.txt

What I want to do is go through the entire folder and find all duplicate files (files with identical md5sum).

Then I want to keep ONLY the one with the shortest filename.

Here's what I have so far:

#!/bin/zsh

DIR=/Users/luomat/Dropbox/txt/

    # to avoid 'arg list too long'
    # note that 'gmd5sum' prints the sum
    # and then two spaces, and then the filename
ALL=$(find $DIR -type f -print0 | xargs -0 gmd5sum)

    # these are all the MD5 sums which occur MORE than one time
    # ('uniq -d' prints one copy of each line that is repeated)
SUMS=($(echo "$ALL" | awk '{print $1}' | sort | uniq -d))



for SUM in $SUMS
do

    # for each duplicated MD5 sum, do this:


    # get a list of all of the matching filenames MINUS the
    # sum itself; (@f) splits on newlines only, so filenames
    # with spaces stay in one piece
    MATCHES=("${(@f)$(echo "$ALL" | egrep "^$SUM" | sed "s#^${SUM}  ##")}")

    # ???

done


I don't know what to do in the ??? to compare the filenames and choose the shortest one.
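
One way the ??? step could go, as a minimal sketch: compare the length of each name with ${#name} and remember the shortest one seen so far. The helper name shortest_name is hypothetical, not something zsh provides, and the sketch sticks to portable shell syntax so it should behave the same under zsh. On a tie, the first name given wins.

```shell
# Hypothetical helper: print the shortest of the names given as arguments.
shortest_name() {
    local shortest="$1" name
    for name in "$@"
    do
        # ${#name} expands to the length of the string in characters
        if [ "${#name}" -lt "${#shortest}" ]
        then
            shortest="$name"
        fi
    done
    printf '%s\n' "$shortest"
}

# Example with three of the filenames from above
KEEP=$(shortest_name \
    "10-6- Make a universal 10-6-7 Snow Leopard installer-1.txt" \
    "10-6- Make a universal 10-6-7 Snow Leopard installer.txt" \
    "todo-make-snowleopardinstaller.txt")
echo "keep: $KEEP"
```

Inside the loop this would presumably be called as KEEP=$(shortest_name "${MATCHES[@]}"), assuming MATCHES holds one filename per element. Note that it compares the whole string, so if the files sit in different subfolders the directory part counts toward the length too.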

Any ideas?

Or is there a better way to do this?
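
One possible alternative, sketched under assumptions: sort the "sum, two spaces, filename" lines so that identical sums end up next to each other, then walk them once and print every name that is not the shortest in its group. The helper name list_longer_duplicates and the sample sums below are made up for illustration, and the sketch assumes no filename contains a newline.

```shell
# Hypothetical helper: reads "sum  filename" lines that are already sorted
# by sum, and prints every name that is not the shortest in its group.
list_longer_duplicates() {
    local line sum name keep_sum="" keep_name=""
    while IFS= read -r line
    do
        sum="${line%% *}"     # everything before the first space
        name="${line#*  }"    # everything after the first two spaces
        if [ "$sum" != "$keep_sum" ]
        then
            # first file of a new group: keep it for now
            keep_sum="$sum"
            keep_name="$name"
        elif [ "${#name}" -lt "${#keep_name}" ]
        then
            # shorter duplicate found: the previous keeper loses
            printf '%s\n' "$keep_name"
            keep_name="$name"
        else
            # same sum, but not shorter than the current keeper
            printf '%s\n' "$name"
        fi
    done
}

# Made-up sample data in the same layout that gmd5sum prints
ALL='aaa111  notes-1.txt
bbb222  unique.txt
aaa111  notes.txt
aaa111  notes-2.txt'

REMOVE=$(printf '%s\n' "$ALL" | sort | list_longer_duplicates)
printf 'would remove:\n%s\n' "$REMOVE"
```

This only prints the candidates, which seems safer than deleting right away; the list can be checked by eye before anything is handed to rm.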

Thanks

TjL




