Zsh Mailing List Archive
MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).)
- X-seq: zsh-workers 52719
- From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
- To: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
- Subject: MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).)
- Date: Sat, 9 Mar 2024 09:21:11 +0000
- Archived-at: <https://zsh.org/workers/52719>
- In-reply-to: <20240309084158.jiyx2is3tbrwyzia@chazelas.org>
- List-id: <zsh-workers.zsh.org>
- Mail-followup-to: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
- References: <20191216212706.i3xvf6hn5h3jwkjh@chaz.gmail.com> <20191217073846.4usg2hnsk66bhqvl@chaz.gmail.com> <20191217111113.z242f4g6sx7xdwru@chaz.gmail.com> <2ea6feb3-a686-4d83-ab27-6a582424487c@www.fastmail.com> <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com> <20210430061117.buyhdhky5crqjrf2@chazelas.org> <CAH+w=7bHxSbFr60ZU0+oZ6+qEejhfBYTzvL7=aXadY5XzWtSzw@mail.gmail.com> <20210505114521.bemoiekpophssbug@chazelas.org> <20240308153050.u63fqtcjyr2yewye@chazelas.org> <20240309084158.jiyx2is3tbrwyzia@chazelas.org>
2024-03-09 08:41:58 +0000, Stephane Chazelas:
[...]
> + while [[ $subject =~ $regexp ]]; do
> + # append initial part and substituted match
> + result+=$subject[1,MBEGIN-1]${(Xe)replacement}
[...]
BTW, this is likely not zsh's fault, but here on Ubuntu 22.04,
with:
$ a=$'ABC/\U0010fffe/DEF'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/DEF
That is, a string containing a 4-byte multibyte character.
$ regexp-replace a $'\276' $'\277'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/D$'\277'F
Note that $'\277' replaced E instead of $'\276'.
It's my own fault as a user for doing that with multibyte enabled
in a locale with a multibyte charset. With multibyte disabled:
$ a=$'ABC/\U0010fffe/DEF'
$ set +o multibyte
$ regexp-replace a $'\276' $'\277'
$ print -r - ${(q+)a}
$'ABC/\U0010ffff/DEF'
$ set -o multibyte
$ print -r - ${(q)a}
ABC/$'\364\217\277\277'/DEF
That is OK.
The problem here is:
$ [[ $a =~ $'\276' ]]
$ echo $MBEGIN $MEND
8 8
$ [[ $a =~ D ]]
$ echo $MBEGIN $MEND
7 7
That could very well be caused by a bug in my system's regex
library, maybe a variation of
https://sourceware.org/bugzilla/show_bug.cgi?id=31075 but for regex.
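For reference, the offsets above can be compared with plain byte positions. The following is only a sketch in portable POSIX sh (it should also run in zsh or bash); byte_offset is a hypothetical helper, not something zsh provides, and it relies on LC_ALL=C making ${#var} and pattern matching work on bytes:

```shell
# Same bytes as a=$'ABC/\U0010fffe/DEF' in a UTF-8 locale.
a=$(printf 'ABC/\364\217\277\276/DEF')

# byte_offset SUBJECT NEEDLE (hypothetical helper)
# Prints the 1-based byte position of the first NEEDLE in SUBJECT.
# The body runs in a subshell so the LC_ALL change does not leak.
byte_offset() (
  LC_ALL=C
  case $1 in
    *"$2"*) ;;
    *) exit 1 ;;          # no match: print nothing and fail
  esac
  prefix=${1%%"$2"*}      # everything before the first match
  echo "$(( ${#prefix} + 1 ))"
)

byte_offset "$a" "$(printf '\276')"   # prints 8
byte_offset "$a" D                    # prints 10
```

The 8 reported in $MBEGIN for $'\276' coincides with that byte's position, whereas the 7 reported for D is a character position (D is the 10th byte). Character 8 of the string is E, which is where the replacement landed.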
If the problem is in the system's regexps, I can't think of
anything zsh could do about it, except maybe checking that
subject and regexp decode as text properly and erroring out if
not, as it does in pcre mode.
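Such a check could look roughly like the following sketch. It is written in portable sh and assumes that the locale's charset is UTF-8 and that the POSIX iconv utility is available; is_valid_utf8 is a hypothetical name, and this is not how zsh or its pcre mode implements the check:

```shell
# is_valid_utf8 STRING (hypothetical helper)
# Succeeds if STRING decodes as UTF-8, by round-tripping it through iconv,
# which exits non-zero on invalid or truncated input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

regexp=$(printf '\276')     # a lone continuation byte, as in the example
if is_valid_utf8 "$regexp"; then
  echo "regexp is valid text"
else
  echo "regexp does not decode as text" >&2
fi
```

With the regexp from the example, the check takes the error branch, so the match would be refused rather than reporting a misleading offset.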
zsh pattern matching seems to handle it better:
$ [[ $a = (#b)*($'\276')* ]] && echo match; typeset mbegin mend
mbegin=( -1 -1 )
mend=( -1 -1 )
$ [[ $a = (#b)*(D)* ]] && echo match; typeset mbegin mend
match
mbegin=( 7 )
mend=( 7 )
I wonder if PCRE2_MATCH_INVALID_UTF/PCRE2_NO_UTF_CHECK could be
used to improve matching with invalid UTF-8 in pcre mode, at
least for the pcre builtins, where offsets are counted in bytes
rather than in characters.
--
Stephane