Zsh Mailing List Archive
Re: Compare two (or more) filenames and return what is common between them
On Tue, 18 Mar 2014 20:23:09 +0000
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx> wrote:
> The upshot is that for the input
>
> "One Two Nineteen"
> "One Two Three"
> "One Two Buckle My Shoe"
> "One Two Buckle My Belt"
> "One Three Four"
> "Two Three Sixteen"
> "Two Three Seventeen"
> "Three Forty Five"
>
> it prints
>
> Extracting common prefixes 'One Two Buckle My'...
> 'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
> 'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
> Extracting common prefixes 'One Two', 'Two Three'...
> 'One Two Nineteen' goes in directory 'One Two'
> 'One Two Three' goes in directory 'One Two'
> 'Two Three Sixteen' goes in directory 'Two Three'
> 'Two Three Seventeen' goes in directory 'Two Three'
> Unmatched files:
> 'One Three Four'
> 'Three Forty Five'
Let me try to make this more adaptable by annotating it.
##start
# Sanitise the options in use for this function.
emulate -L zsh
# We'll need extendedglob for matching.
setopt extendedglob
local -a words match mbegin mend split restwords
# Here's what we're going to apply the algorithm to.
# You'd probably get these from a "*" or something similar.
words=(
"One Two Nineteen"
"One Two Three"
"One Two Buckle My Shoe"
"One Two Buckle My Belt"
"One Three Four"
"Two Three Sixteen"
"Two Three Seventeen"
"Three Forty Five"
)
# We'll use two associative arrays for storing results.
# $groups holds the initial prefixes we're going to group;
# they're stored in the keys of the hash for easy access. We don't
# really need the value so we'll just stick 1 there so it has a
# non-zero length for testing. $foundgroups has the same keys but we'll
# only stick something there if there are at least two matching names,
# i.e. it's a real group. We could do this other ways e.g. by
# sticking a count in $groups, but this way it was dead easy
# to get all the groups out later without looping over the array again.
typeset -A groups foundgroups
# We're going to use spaces in the file name to divide it into words
# so we match whole words only. maxwords counts the max number of
# words in a file name --- in the example that's 5 in the case of
# "One Two Buckle My Shoe". (I'm being very confusing here because
# I also use "word" to describe a complete element of the array
# "words", but that's what you get with buggy software from the net.)
integer maxwords
local word initial pat make
# First we'll count the space-delimited words in each input word.
for word in $words; do
# We're not interested in anything from the first "." on.
# Remove the longest trailing string beginning with a ".".
initial=${word%%.*}
# Split into space-separated words.
split=(${=initial})
# ${#split} is the number of such words in the, er, word.
if (( ${#split} > maxwords )); then
# So this is the largest number of space-separated words we've found.
maxwords=${#split}
fi
done
# Helper function to take an input word and extract its first $maxwords
# space-separated words into $initial.
words_getinitial() {
# Complete name passed in as argument.
local word=$1
# As before, remove anything that looks like a suffix.
initial=${word%%.*}
if (( maxwords > 1 )); then
# This is a rather verbose way of matching space-separated words.
# [[:blank:]] and [^[:blank:]] match any single character that is or is
# not a space or tab, respectively. Appending ## says we want one or
# more such characters. The (#b) says parentheses are live, so I could
# refer to them later, but I'm not using that feature so I should take
# it out :-). (#c<num>) says match the previous expression (what's in
# parentheses) <num> times. We match it maxwords-1 times because we're
# repeating a word followed by blanks; together with the final word we
# get maxwords words.
pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
else
# Simple case where we're just matching one word.
pat="(#b)([^[:blank:]]##)"
fi
# ${initial##${~pat}} says "remove the longest string matching $pat from
# the start of $initial". The ~ is so the pattern characters in $pat are
# live, rather than just ordinary string characters. By putting in the (M)
# we get the matched string, rather than the original string with the
# matched bit removed. We match against $initial, not $word, so that
# the suffix we stripped above stays stripped.
initial=${(M)initial##${~pat}}
# So at this point, $initial contains the initial $maxwords space-separated
# words from the string passed in, ignoring suffixes. If the name has
# fewer than $maxwords words, nothing matches and $initial is empty.
}
# functions -T words_getinitial
# We're going to start by looking for the longest possible matches,
# then gradually decrease maxwords until it reaches 0 or we run out
# of words.
while (( maxwords && ${#words} )); do
restwords=()
groups=()
foundgroups=()
# For all test words (file names)
for word in $words; do
words_getinitial $word
# $initial is the first $maxwords words from $word
[[ -z $initial ]] && continue
# Got something...
if [[ -n $groups[$initial] ]]; then
# ... for the second time, so record there's something to group.
foundgroups[$initial]=1
else
# ... for the first time, remember it.
groups[$initial]=1
fi
done
if (( ${#foundgroups} )); then
# Found some groups. The group names to use are the keys of
# $foundgroups. Say what these are just for info.
# The expression is just a smarmy way of joining the keys together
# with a comma and a space.
print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
# Now see which words fit any of these groups.
for word in $words; do
words_getinitial $word
# As before, $initial contains the initial $maxwords words from $word.
if [[ -z $initial ]]; then
# Nothing here, so stick this one in the remainder to handle later.
restwords+=($word)
elif [[ -n $foundgroups[$initial] ]]; then
# Yes, this is one of the words to group.
# In real life we'd stick it in the directory, making
# sure the directory existed. For now just report.
print "'$word' goes in directory '$initial'"
else
# No, so stick this in the remainder list.
restwords+=($word)
fi
done
# Next time, we only need to consider the words we didn't
# handle this time, so assign these back to the original array.
# (Once after the loop is enough; the result is the same.)
words=($restwords)
fi
# Decrease the number of space-separated words to look for next time
(( maxwords-- ))
done
# If there are some words we didn't group, tell the user what these are.
if (( ${#words} )); then
print "Unmatched files:"
print "'${(pj.'\n'.)words}'"
fi
##end
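
The script only reports where each file would go. As a rough sketch of
the "in real life" step mentioned in the comments, something like the
following could replace the print of "goes in directory". The function
name move_to_group is made up for illustration, and using mkdir -p and
mv (and refusing to overwrite an existing file) is my assumption about
what you'd want; adjust to taste. Everything is quoted so it doesn't
depend on which shell options are in effect.

```shell
# Hypothetical helper, not part of the script above.
# Usage: move_to_group FILE DIR
# Moves FILE into DIR, creating DIR first if it doesn't exist yet.
move_to_group() {
  local file="$1" dir="$2"
  # -p: create missing parents and don't complain if DIR already exists.
  mkdir -p -- "$dir" || return 1
  # Assumption: never overwrite a file already sitting in the group.
  if [ -e "$dir/${file##*/}" ]; then
    printf '%s\n' "'$file' already exists in '$dir', skipping" >&2
    return 1
  fi
  mv -- "$file" "$dir/"
}
```

In the loop you would then call it as

  move_to_group $word $initial

in place of the print. Note that mkdir will fail if a plain file with
the same name as the group already exists, in which case the helper
just returns non-zero and leaves the file where it was.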
--
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/