Zsh Mailing List Archive
Messages sorted by: Reverse Date, Date, Thread, Author

Suggestion: Option to ignore unmatched quotes when (Q) parameter-expansion flag



I have been in brief discussions about this on IRC, and was asked to inquire of the mailing list:


*** First I will describe my application, so that my suggestion for improvement of the shell doesn't seem contrived, and because I think this syntax situation is one that other shell users might benefit from learning about.

In my application, the user will be presented with a list of strings that might contain any printable character that the user can reasonably generate from the keyboard.  These strings might *not* be any sort of valid zsh syntax.

I want the user to be able to type a shell pattern (as opposed to a regex), composed mentally by the user, that matches any group of those strings.

I prefer that the user be able to use not only backslash quoting, but also other forms of quoting (double quotes, single quotes, dollar single-quoting) to disable the pattern-matching meaning of characters the user may type, such as [].

Suppose the user wants to enter *[abc]* in which both of the '*' are pattern-matching characters; but the user will give double-quotes to indicate that the [] should be matched literally.  If variable ${u_input} contains the user's input, we start with something equivalent to:

u_input='*"[abc]"*'
a_string='one[abc]two'
if [[ "${a_string}" == ${~u_input} ]] then  # the ~ (or setopt GLOB_SUBST) makes the expanded value act as a pattern
  print "it matches"
fi

The first problem was that the above [[ ]] pattern-matching statement didn't match, but instead would match something such as:
a_string='one"a"two'
... since the [abc] is taken as a character class that matches 1 character, rather than taken as the 5-character literal string that the user wants; and the "" are taken literally rather than as being quote characters.

If the double-quotes are put into the pattern comparison literally:

a_string='one[abc]two'                    # Same string as above
if [[ "${a_string}" == *"[abc]"* ]] then  # Same pattern as in ${u_input} above!
  print "it matches"
fi

... Now it matches, which is what we wish would happen when a variable expansion ${...} is on the right-hand side of the == operator.

Even with shell option GLOB_SUBST enabled, the only quoting honoured when substituting the contents of a variable into a shell pattern is '\' backslash.

The solution found so far is to use parameter-expansion flags, first (b), then (Q), as follows:

u_input='*"[abc]"*'  # Same shell pattern as above
u_input="${(Q)${(b)u_input}}"

# u_input is now *\[abc\]*
# ... the [] are quoted with \, the " are removed, the * are unquoted

a_string='one[abc]two'
if [[ "${a_string}" == ${~u_input} ]] then  # the ~ (or setopt GLOB_SUBST) makes the expanded value act as a pattern
  print "it matches"  # Now it matches!
fi

How it works:

The (b) parameter-expansion flag uses backslashes to quote all the pattern-matching characters including those that were already quoted by the user.

The definition of (Q) is that (Q) removes only the outermost level of quotes.  A backslash that stands outside "" or '' is itself at that outermost level and is removed; a backslash in front of a pattern-matching character inside "" or '' belongs to an inner level and is kept.

So chars that were backslashed by (b) and NOT quoted by the user get unquoted by (Q), while chars that WERE quoted by the user have the "" or '' removed by (Q) but the backslashes [that were added by (b)] left in place.

This technique works if the user gave '\' as well as '' "".


*** The current problem is the shell's handling of unbalanced quotation marks:

a_string='one[abc]"two'  # unbalanced doublequote

u_input='*[abc]"*'
# the user tries to match that doublequote and the [], all 3 literally

u_input="${(Q)${(b)u_input}}"

# u_input is now \*\[abc\]"\*
# the (Q) was skipped

If we add the (X) flag to make (QX) so the shell will print an error message, we find that the shell detects the unbalanced (").

The zshexpn documentation for X says "Without the [X] flag, errors are silently ignored."  It seems that, without (X), the unbalanced (") isn't 'ignored,' but rather causes the (Q) flag to fail entirely.


*** I looked at the source code a little bit, and I have the following suggestion, whose details are NOT terribly firm:

[I wrote these in the order in which the functions are called in C.  The outline of an idea for ignoring unbalanced quote marks is listed under gettokstr().]

-- subst.c, paramsubst(), starting ~line 1619:
~line 2028 decrements variable 'quotemod' if (Q) flag
So we could state in the documentation that issuing (Q) twice, in other words (QQ), will activate the suggested error-handling mode.

~line 3807 calls parse_subst_string() for arrays
~line 3847 calls parse_subst_string() for scalars
Those 2 blocks of code seem like good places to check for variable 'quotemod' < -1

-- lex.c, parse_subst_string(), starting ~line 1728:
  Add a parameter to this function:  ignore_unbalanced_quotes
Back in paramsubst() in subst.c, if quotemod < -1, then pass 2 to parse_subst_string() as ignore_unbalanced_quotes
  (All other calls to parse_subst_string() will pass 0 for ignore_unbalanced_quotes)
In parse_subst_string() in lex.c again:
~line 1744 currently calls gettokstr(, sub=1)
  If ignore_unbalanced_quotes != 0 then call gettokstr(, sub=ignore_unbalanced_quotes)

-- lex.c, gettokstr(), starting ~line 937:
~line 1314, case LX2_QUOTE, detects unbalanced (')
  if sub == 2 then:
    change the Snull that was added ~line 1284 to be a (')
~line 1335, case LX2_DQUOTE, detects unbalanced (")
  if sub == 2 then:
    instead of adding a Dnull (~line 1333), add a (")
~line 1377, case LX2_BQUOTE, detects unbalanced (`)
  if sub == 2 then:
    instead of adding a Tick (~line 1347), add a (`)
Using literal (')(") instead of Snull,Dnull will keep Snull,Dnull from being removed by remnulargs() or altered by untokenize()
Using literal (`) instead of Tick will keep Tick from being altered by untokenize() [Tick wouldn't be removed by remnulargs() ]

I've not found how variable 'quoteerr' [which means the (X) flag, which determines how errors such as unbalanced quotes are handled] in paramsubst() in subst.c is propagated through parse_subst_string() in lex.c to gettokstr() in lex.c - perhaps by means of the 'lexflags' variable.  The error message including the word 'unmatched' is sent to zerr() in gettokstr(), thus being evidence that 'quoteerr' must currently be propagated to gettokstr() somehow.  It might be suitable to use the same means of propagating a (QQ) flag, since flags are parsed in paramsubst() but the different error-handling that we want is in gettokstr()

I suggest that QQX should act like QX:  If quoteerr == 1, then the existing code in gettokstr() that calls zerr() would still run, even if quotemod == -2.  This is so a script author can simply change (QQ) to (QQX) to get diagnostic output.

Manpage zshexpn, section Parameter Expansion Flags:
  change last sentence of description of X to:
  Without the flag, errors silently cause the current processing step to be skipped.

Add to the description of Q flag:
  Handling of unbalanced quotes depends on whether the X flag is present.
  Also, if the Q flag is given twice and the X flag is *not* given, then unbalanced quotation marks are silently ignored; other forms of quoting are still removed.  (For example, if a string contains an unbalanced double-quote but the outermost level of quoting within the string includes balanced single-quotes, then the single-quotes will be removed.)

I don't think I'm up to making these changes myself, but I'd appreciate feedback that might result in a solution.

Many thanks for your time.

-dg1727



