Zsh Mailing List Archive
should we use PCRE2_MATCH_INVALID_UTF ?
- X-seq: zsh-workers 54708
- From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
- To: Zsh hackers list <zsh-workers@xxxxxxx>
- Subject: should we use PCRE2_MATCH_INVALID_UTF ?
- Date: Mon, 8 Jun 2026 19:33:07 +0100
- Archived-at: <https://zsh.org/workers/54708>
- List-id: <zsh-workers.zsh.org>
- Mail-followup-to: Zsh hackers list <zsh-workers@xxxxxxx>
Zsh can generally cope with sequences of bytes that can't be
decoded into characters: in most places, each byte that can't be
decoded is treated as if it were a character.
[[ $'\x80' = ? ]] returns true in UTF-8 locales, as bytes that
cannot be decoded into characters are still considered
characters internally. With var=$'é\x80', $#var is still 2 and
$var[2] is that $'\x80' byte.
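To make that counting rule concrete, here is a minimal C sketch. It is not taken from the zsh source: utf8_seq_len() and count_chars() are hypothetical helpers, and the sketch hard-codes UTF-8 rather than asking the locale. It only illustrates the idea that every byte that cannot start a valid sequence is counted as one character on its own.

```c
#include <stddef.h>

/* Length in bytes of the valid UTF-8 sequence starting at s (n bytes
 * available), or 0 if the bytes there do not decode to a character.
 * Hypothetical helper for illustration only. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                           /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;       /* lone continuation byte such as 0x80, or 0xF8-0xFF */

    if (n < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* missing continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (len == 2 && cp < 0x80)
        return 0;
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
        return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;
    return len;
}

/* Count characters in the first n bytes of str, treating each byte
 * that cannot be decoded as one character on its own. */
size_t count_chars(const char *str, size_t n)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t count = 0, pos = 0;

    while (pos < n) {
        size_t len = utf8_seq_len(s + pos, n - pos);
        pos += len ? len : 1;               /* skip one byte if undecodable */
        count++;
    }
    return count;
}
```

With this sketch, count_chars("\xc3\xa9\x80", 3) gives 2, mirroring the length of 2 reported for var=$'é\x80' above.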
For =~, the behaviour depends on the regexp engine in use: the
system's regexp library without rematchpcre, and PCRE2 (formerly
PCRE) with rematchpcre.
On a GNU system (here Debian forky):
[[ $'x\x80' =~ '^.*$' ]]
returns false as "." won't match that $'\x80' which is not a
character.
But [[ $'x\x80' =~ '^.' ]] does return true as x is matched even
if what follows is not valid text in the user's locale.
With rematchpcre enabled:
$ [[ $'x\x80' =~ '^.*$' ]]
zsh: pcre_exec() error [-22]
$ [[ $'x\x80' =~ '^.' ]]
zsh: pcre_exec() error [-22]
Neither matches, and in both cases a (not very helpful) error is
output.
PCRE2 has a flag for this:
> PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
That would not make "." match that $'\x80' byte, but it would at
least align the behaviour with that of GNU's EREs.
Would it be worth adding? Patch below.
diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c
index 98b52f8de..409ee90c2 100644
--- a/Src/Modules/pcre.c
+++ b/Src/Modules/pcre.c
@@ -82,7 +82,7 @@ bin_pcre_compile(char *nam, char **args, Options ops, UNUSED(int func))
if (OPT_ISSET(ops, 's')) pcre_opts |= PCRE2_DOTALL;
if (zpcre_utf8_enabled())
- pcre_opts |= PCRE2_UTF;
+ pcre_opts |= PCRE2_UTF|PCRE2_MATCH_INVALID_UTF;
if (pcre_pattern)
pcre2_code_free(pcre_pattern);
@@ -431,7 +431,7 @@ cond_pcre_match(char **a, int id)
int return_value = 0;
if (zpcre_utf8_enabled())
- pcre_opts |= PCRE2_UTF;
+ pcre_opts |= PCRE2_UTF|PCRE2_MATCH_INVALID_UTF;
if (isset(REMATCHPCRE) && !isset(CASEMATCH))
pcre_opts |= PCRE2_CASELESS;