Zsh Mailing List Archive

should we use PCRE2_MATCH_INVALID_UTF ?



Zsh can generally cope with sequences of bytes that can't be
decoded into characters: in most places, each byte that can't
be decoded is treated as if it were a character.

[[ $'\x80' = ? ]] returns true in UTF-8 locales, as bytes that
cannot be decoded into characters are still considered
characters internally. With var=$'é\x80', $#var is still 2 and
$var[2] is that $'\x80' byte.
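
To make that counting rule concrete, here is a minimal C sketch of
"each undecodable byte counts as one character". It is only a
simplified model with a hand-rolled UTF-8 decoder, not zsh's actual
code; the names utf8_seq_len and count_chars_lenient are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the well-formed UTF-8
 * sequence starting at s (with n bytes available), or 0 if the
 * bytes at s do not start a valid sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned long cp, min;
    size_t need, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                          /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4; cp = s[0] & 0x07; min = 0x10000;
    } else
        return 0;                          /* stray continuation or illegal lead byte */

    if (n < need)
        return 0;                          /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return need;
}

/* Count "characters" in the n bytes at s: a valid UTF-8 sequence
 * counts as one, and so does each byte that cannot be decoded. */
size_t count_chars_lenient(const char *s, size_t n)
{
    size_t count = 0, i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_seq_len((const unsigned char *)s + i, n - i);
        i += len ? len : 1;                /* skip one byte if undecodable */
        count++;
    }
    return count;
}
```

Under this model, count_chars_lenient("\xc3\xa9\x80", 3) gives 2 for
the bytes of $'é\x80', which mirrors the $#var result described above.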

For =~, the behaviour depends on the regexp engine in use: the
system's regexp library without rematchpcre, and PCRE2
(formerly PCRE) with rematchpcre.

On a GNU system (here Debian forky):

  [[ $'x\x80' =~ '^.*$' ]]

returns false as "." won't match that $'\x80' which is not a
character.

But [[ $'x\x80' =~ '^.' ]] does return true as x is matched even
if what follows is not valid text in the user's locale.
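
The same two checks can be reproduced outside zsh with the POSIX
regcomp()/regexec() interface. The sketch below is only an
illustration; ere_matches is a made-up helper name, and the result
for patterns that have to consume the $'\x80' byte depends on the C
library and on the locale the calling program has selected.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match subject against a POSIX extended regular
 * expression in the program's current locale.
 * Returns 1 on match, 0 on no match, -1 on any regcomp()/regexec() error. */
int ere_matches(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                 /* pattern did not compile */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller would first run setlocale(LC_ALL, "") under a UTF-8 locale.
ere_matches("^.", "x\x80") then only has to match the leading x,
while ere_matches("^.*$", "x\x80") corresponds to the case reported
as false above.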

With rematchpcre enabled:

$ [[ $'x\x80' =~ '^.*$' ]]
zsh: pcre_exec() error [-22]
$ [[ $'x\x80' =~ '^.' ]]
zsh: pcre_exec() error [-22]

Neither matches, and in both cases a (not very helpful) error is
output.

PCRE2 has a:

> PCRE2_MATCH_INVALID_UTF  Enable support for matching invalid UTF

flag.

That would not make "." match that $'\x80' byte but would at
least align the behaviour with that of GNU EREs.

Would it be worth adding? Patch below.

diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c
index 98b52f8de..409ee90c2 100644
--- a/Src/Modules/pcre.c
+++ b/Src/Modules/pcre.c
@@ -82,7 +82,7 @@ bin_pcre_compile(char *nam, char **args, Options ops, UNUSED(int func))
     if (OPT_ISSET(ops, 's')) pcre_opts |= PCRE2_DOTALL;
     
     if (zpcre_utf8_enabled())
-	pcre_opts |= PCRE2_UTF;
+	pcre_opts |= PCRE2_UTF|PCRE2_MATCH_INVALID_UTF;
 
     if (pcre_pattern)
 	pcre2_code_free(pcre_pattern);
@@ -431,7 +431,7 @@ cond_pcre_match(char **a, int id)
     int return_value = 0;
 
     if (zpcre_utf8_enabled())
-	pcre_opts |= PCRE2_UTF;
+	pcre_opts |= PCRE2_UTF|PCRE2_MATCH_INVALID_UTF;
     if (isset(REMATCHPCRE) && !isset(CASEMATCH))
 	pcre_opts |= PCRE2_CASELESS;
 



