Zsh Mailing List Archive

Re: Compare two (or more) filenames and return what is common between them



On Tue, 18 Mar 2014 20:23:09 +0000
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx> wrote:
> The upshot is that for the input
> 
>   "One Two Nineteen"
>   "One Two Three"
>   "One Two Buckle My Shoe"
>   "One Two Buckle My Belt"
>   "One Three Four"
>   "Two Three Sixteen"
>   "Two Three Seventeen"
>   "Three Forty Five"
> 
> it prints
> 
>   Extracting common prefixes 'One Two Buckle My'...
>   'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
>   'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
>   Extracting common prefixes 'One Two', 'Two Three'...
>   'One Two Nineteen' goes in directory 'One Two'
>   'One Two Three' goes in directory 'One Two'
>   'Two Three Sixteen' goes in directory 'Two Three'
>   'Two Three Seventeen' goes in directory 'Two Three'
>   Unmatched files:
>   'One Three Four'
>   'Three Forty Five'

Let me try to make this more adaptable by annotating it.

##start
# Sanitise the options in use for this function.
emulate -L zsh
# We'll need extendedglob for matching.
setopt extendedglob

local -a words match mbegin mend split restwords

# Here's what we're going to apply the algorithm to.
# You'd probably get these from a "*" or something similar.
words=(
	"One Two Nineteen"
	"One Two Three"
	"One Two Buckle My Shoe"
	"One Two Buckle My Belt"
	"One Three Four"
	"Two Three Sixteen"
	"Two Three Seventeen"
	"Three Forty Five"
)

# We'll use two associative arrays for storing results.
# $groups holds the initial prefixes we're going to group;
# they're stored in the keys of the hash for easy access.  We don't
# really need the value so we'll just stick 1 there so it has a
# non-zero length for testing.  $foundgroups has the same keys but we'll
# only stick something there if there are at least two matching names,
# i.e. it's a real group.  We could do this other ways e.g. by
# sticking a count in $groups, but this way it was dead easy
# to get all the groups out later without looping over the array again.
typeset -A groups foundgroups
# We're going to use spaces in the file name to divide it into words
# so we match whole words only.  maxwords counts the max number of
# words in a file name --- in the example that's 5 in the case of
# "One Two Buckle My Shoe".  (I'm being very confusing here because
# I also use "word" to describe a complete element of the array
# "words", but that's what you get with buggy software from the net.)
integer maxwords
local word initial pat make

# First we'll count the space-delimited words in each input word.
for word in $words; do
  # We're not interested in anything from the first "." on.
  # Remove the longest trailing string beginning with a ".".
  initial=${word%%.*}
  # Split into space-separated words.
  split=(${=initial})
  # ${#split} is the number of such words in the, er, word.
  if (( ${#split} > maxwords )); then
    # So this is the largest number of space-separated words we've found.
    maxwords=${#split}
  fi
done

# Helper function to take an input word and extract its first $maxwords
# space-separated words into $initial.
words_getinitial() {
  # Complete name passed in as argument.
  local word=$1
  # As before, remove anything that looks like a suffix.
  initial=${word%%.*}
  if (( maxwords > 1 )); then
    # This is a rather verbose way of matching space-separated words.
    # [[:blank:]] and [^[:blank:]] match any character that is or is not a
    # space or tab, respectively.  Appending ## says we want any number
    # of such characters that's at least one.  The (#b) says parentheses
    # are live, so I could refer to them later, but I'm not using that
    # feature so I should take it out :-).  (#c<num>) says match the
    # previous expression (what's in parentheses) <num> times.  We match
    # it maxwords-1 times because we're repeating a word followed by
    # blanks; together with the final word we get maxwords words.
    pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
  else
    # Simple case where we're just matching one word.
    pat="(#b)([^[:blank:]]##)"
  fi
  # ${initial##${~pat}} says "remove the longest string matching $pat from
  # the start of $initial".  The ~ is so the pattern characters in $pat are
  # live, rather than just ordinary string characters.  By putting in the
  # (M) we get the matched string, rather than the original string with the
  # matched bit removed.  We match against $initial rather than $word so
  # the suffix we removed above stays removed.
  initial=${(M)initial##${~pat}}
  # So at this point, $initial contains the initial $maxwords space-separated
  # words from the string passed in, ignoring suffixes.
}
# functions -T words_getinitial

# We're going to start by looking for the longest possible matches,
# then gradually decrease maxwords until it reaches 0 or we run out
# of words.
while (( maxwords && ${#words} )); do
  restwords=()
  groups=()
  foundgroups=()
  # For all test words (file names)
  for word in $words; do
    words_getinitial $word
    # $initial is the first $maxwords words from $word
    [[ -z $initial ]] && continue
    # Got something...
    if [[ -n $groups[$initial] ]]; then
      # ... for the second time, so record there's something to group.
      foundgroups[$initial]=1
    else
      # ... for the first time, remember it.
      groups[$initial]=1
    fi
  done
  if (( ${#foundgroups} )); then
    # Found some groups.  The group names to use are the keys of
    # $foundgroups.  Say what these are just for info.
    # The expression is just a smarmy way of joining the keys together
    # with a comma and a space.
    print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
    # Now see which words fit any of these groups.
    for word in $words; do
      words_getinitial $word
      # As before, $initial contains the initial $maxwords words from $word.
      if [[ -z $initial ]]; then
        # Nothing here, so stick this one in the remainder to handle later.
        restwords+=($word)
      elif [[ -n $foundgroups[$initial] ]]; then
        # Yes, this is one of the words to group.
        # In real life we'd stick it in the directory, making
        # sure the directory existed.  For now just report.
        print "'$word' goes in directory '$initial'"
      else
        # No, so stick this in the remainder list.
        restwords+=($word)
      fi
    done
    # Next time, we only need to consider the words we didn't
    # handle this time, so assign these back to the original array.
    # This only needs doing once, after the loop has finished.
    words=($restwords)
  fi
  # Decrease the number of space-separated words to look for next time
  (( maxwords-- ))
done

# If there are some words we didn't group, tell the user what these are.
if (( ${#words} )); then
  print "Unmatched files:"
  print "'${(pj.'\n'.)words}'"
fi
##end
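
The script only reports where each file would go.  A minimal sketch of
the "real life" step might look like the following; the function name
file_into_group is made up for illustration and isn't part of the
script above, and it assumes the group name is used directly as a
directory under the current directory.

```zsh
# Hypothetical helper, not part of the original script.
# $1 is the file name, $2 the group (directory) name.
file_into_group() {
  local file=$1 dir=$2
  # mkdir -p creates the directory if needed and is quiet if it
  # already exists.  The -- stops names starting with "-" being
  # taken as options.  Only move the file if mkdir succeeded.
  mkdir -p -- "$dir" && mv -- "$file" "$dir/"
}

# Example use, in place of the print in the script:
#   file_into_group $word $initial
```

The quotes aren't needed by zsh but do no harm, and they keep the
function usable if you paste it into another shell.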


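The comments mention that you could stick a count in $groups instead
of using a second hash.  A rough sketch of that variant follows; the
function name real_groups and the array name count are made up here,
and this is not what the script above does.  The cost is the extra
loop over the keys that the two-hash version avoids.

```zsh
# Hypothetical variant, not the code from the script: keep one count
# per prefix, then pick out the prefixes seen at least twice.
real_groups() {
  emulate -L zsh
  typeset -A count
  local prefix
  for prefix in "$@"; do
    count[$prefix]=$(( ${count[$prefix]:-0} + 1 ))
  done
  # (k) gives the keys, (o) sorts them so the output order is stable.
  for prefix in ${(ok)count}; do
    (( ${count[$prefix]} >= 2 )) && print -r -- $prefix
  done
  return 0
}

real_groups "One Two" "One Two" "One Three" "Two Three" "Two Three"
# prints:
#   One Two
#   Two Three
```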
-- 
Peter Stephenson <p.w.stephenson@xxxxxxxxxxxx>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


