Zsh Mailing List Archive
Messages sorted by:
Reverse Date,
Date,
Thread,
Author
Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
- X-seq: zsh-workers 42045
- From: Stephane Chazelas <stephane.chazelas@xxxxxxxxx>
- To: Zsh hackers list <zsh-workers@xxxxxxx>
- Subject: Re: please consider using PCRE_DOLLAR_ENDONLY (and PCRE_DOTALL) for rematchpcre
- Date: Wed, 22 Nov 2017 21:40:25 +0000
- In-reply-to: <20171122122519.GA13771@chaz.gmail.com>
- References: <20171122122519.GA13771@chaz.gmail.com>
2017-11-22 12:25:19 +0000, Stephane Chazelas:
[...]
> It can be worked around ([[ $a =~ 'a\z' ]], [[ $a =~ '(?s).'
> ]]), but IMO at least PCRE_DOLLAR_ENDONLY (if not PCRE_DOTALL)
> should be the default at least for [[ $string =~ ... ]] as
> in shells, $string usually do not include the newline delimiter.
[...]
The situation in other tools and languages:
ksh93:
$ ksh93 -c "[[ $'a\n' = ~(P:a$) ]] || echo no; [[ $'\n' = ~(P:.) ]] && echo yes"
no
yes
(both PCRE_DOLLAR_ENDONLY and PCRE_DOTALL (or equivalent as
ksh93 comes with its own pcre-like implementation))
php:
$ php -r 'echo preg_match("/a$/", "a\n") . "\n" . preg_match("/./", "\n") . "\n";'
1
0
neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL. Clearly documented
and has a "D" flag to enable PCRE_DOLLAR_ENDONLY
https://secure.php.net/manual/en/reference.pcre.pattern.modifiers.php
$ php -r 'echo preg_match("/a$/D", "a\n") . "\n";'
0
ssed:
$ printf 'a\n\n' | ssed -Rn 'N;/a$/=;/a./!='
neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL
GNU grep:
$ printf 'a\n\0' | ltrace -e 'pcre_compile' grep -zP 'a$'
grep->pcre_compile("a$", 2080, 0x7ffcaf25aff8, 0x7ffcaf25aff4, 0x1e89280)
PCRE_DOLLAR_ENDONLY (32) but not PCRE_DOTALL
python (not PCRE)
neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL. Documented:
https://docs.python.org/3/library/re.html
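Note that \Z does not mean the same as in perl/PCRE: in python, \Z
matches at the very end only (like perl/PCRE's \z), whereas perl/PCRE's
\Z also matches before a trailing newline.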
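For comparison with the transcripts above, here is a minimal sketch of the python behaviour using only the standard re module. The variable names are mine, for illustration only.

```python
import re

subject = "a\n"

# By default, $ also matches just before a trailing newline.
dollar_matches = re.search(r"a$", subject) is not None

# In python, \Z matches only at the very end of the string.
end_only_matches = re.search(r"a\Z", subject) is not None

# By default, . does not match a newline; re.DOTALL changes that.
dot_matches = re.search(r".", "\n") is not None
dotall_matches = re.search(r".", "\n", re.DOTALL) is not None

print(dollar_matches, end_only_matches, dot_matches, dotall_matches)
```

So the defaults correspond to neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL, and \Z or re.DOTALL have to be requested explicitly.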
fish (string match -r pcre strings...)
neither PCRE_DOLLAR_ENDONLY nor PCRE_DOTALL
So I'd understand if you leave it as it is as many other tools
do not use PCRE_DOLLAR_ENDONLY.
I still find the idea of $ not matching only at the end of the
subject dangerous, as most people assume it does (like it does
in BRE and ERE). If not changed, it would be worth clearly
documenting (if only to flag the difference with ERE and warn of
potential implications). See how the documentation currently has
this misleading example:
[[ "$text" -pcre-match ^d+$ ]] &&
print text variable contains only "d's".
Should be:
print text variable contains only "d's" optionally followed by a newline character
or:
[[ "$text" -pcre-match '^d+\z' ]]
It affects perl and co already. Like, many people do:
rename 's/\.back$//i' ./*
When they meant:
rename 's/\.back\z//i' ./*
Same for PCRE_DOTALL:
rename 's/-.*//' ./*-*
when they meant
rename 's/(?s)-.*//' ./*-*
for instance.
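The same pitfalls can be reproduced outside perl. Below is a rough sketch in python, whose default $ and . behave like perl's here. It is not the rename script itself: the helper names and the sample file names are hypothetical, and python's \Z stands in for perl's \z.

```python
import re

def strip_back_loose(name):
    # Mirrors s/\.back$//i: $ also matches before a trailing newline.
    return re.sub(r"\.back$", "", name, count=1, flags=re.IGNORECASE)

def strip_back_strict(name):
    # Mirrors s/\.back\z//i: python's \Z matches only at the very end.
    return re.sub(r"\.back\Z", "", name, count=1, flags=re.IGNORECASE)

def cut_from_dash_loose(name):
    # Mirrors s/-.*//: . stops at the first newline.
    return re.sub(r"-.*", "", name, count=1)

def cut_from_dash_strict(name):
    # Mirrors s/(?s)-.*//: with (?s), . also matches newlines.
    return re.sub(r"(?s)-.*", "", name, count=1)

# Hypothetical file names containing newline characters.
print(repr(strip_back_loose("file.back\n")))
print(repr(strip_back_strict("file.back\n")))
print(repr(cut_from_dash_loose("a-b\nc")))
print(repr(cut_from_dash_strict("a-b\nc")))
```

With the loose patterns, a name ending in ".back" plus a newline loses the ".back" but keeps the newline, and the part of the name after an embedded newline survives the "-.*" substitution.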
--
Stephane