Zsh Mailing List Archive
Re: =~ doesn't work with NUL characters
- X-seq: zsh-workers 41307
- From: Stephane Chazelas <stephane.chazelas@xxxxxxxxx>
- To: Phil Pennock <zsh-workers+phil.pennock@xxxxxxxxxxxx>
- Subject: Re: =~ doesn't work with NUL characters
- Date: Thu, 15 Jun 2017 10:50:34 +0100
- Cc: Zsh hackers list <zsh-workers@xxxxxxx>
- In-reply-to: <20170614204938.GA76510@tower.spodhuis.org>
- List-help: <mailto:zsh-workers-help@zsh.org>
- List-id: Zsh Workers List <zsh-workers.zsh.org>
- List-post: <mailto:zsh-workers@zsh.org>
- Mail-followup-to: Phil Pennock <zsh-workers+phil.pennock@xxxxxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
- Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
- References: <20170613100217.GA9529@chaz.gmail.com> <20170614204938.GA76510@tower.spodhuis.org>
2017-06-14 16:49:38 -0400, Phil Pennock:
[...]
> Without rematchpcre, this is ERE per POSIX APIs, which don't portably
> support size-supplied strings, relying instead upon C-string
> null-termination.
>
> Current macOS has regnexec() but this is not in the system regexp
> library I see on Ubuntu Trusty or FreeBSD 10.3. It appears to be an
> extension from when they switched to the TRE implementation in macOS
> 10.8. <https://laurikari.net/tre/>
>
> Trying to support this would result in variations in behaviour across
> systems in a way which I think might be undesirable. The whole point of
> adding the non-PCRE implementation was to match Bash behaviour by
> default, and Bash does the same thing.
[...]
A dirty trick in UTF-8 locales (the norm these days) may be to
encode NUL as U+7FFFFF00 (and bytes 0x80 -> 0xff that don't
form part of valid characters as U+7FFFFF{80..FF}), in both the
string and the regexp.
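
To make the byte-level effect of that trick concrete, here is a minimal sketch in Python. It assumes the pre-2003 UTF-8 scheme (RFC 2279), which allowed 6-byte sequences for values up to 0x7FFFFFFF. The names extended_utf8, escape_bytes and ESCAPE_BASE are made up for illustration; none of this is zsh code.

```python
ESCAPE_BASE = 0x7FFFFF00  # base value proposed in the message


def extended_utf8(code_point: int) -> bytes:
    """Encode a value in the 6-byte range of the original UTF-8 scheme."""
    if not 0x4000000 <= code_point <= 0x7FFFFFFF:
        raise ValueError("this sketch only handles the 6-byte range")
    # Lead byte 1111110x carries the top bit; five 10xxxxxx bytes carry the rest.
    tail = [0x80 | ((code_point >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([0xFC | (code_point >> 30)] + tail)


def escape_bytes(data: bytes) -> bytes:
    """Replace NUL and undecodable bytes with out-of-range 6-byte sequences."""
    out = bytearray()
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN,
    # which gives a convenient way to spot bytes that are not valid UTF-8.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if cp == 0:
            out += extended_utf8(ESCAPE_BASE)
        elif 0xDC80 <= cp <= 0xDCFF:
            out += extended_utf8(ESCAPE_BASE + (cp - 0xDC00))
        else:
            out += ch.encode("utf-8")  # valid characters pass through unchanged
    return bytes(out)


print(escape_bytes(b"a\0b").hex(" "))  # 61 fd bf bf bf bc 80 62
```

The escaped result contains no NUL byte, so it survives an API that relies on C-string termination. The same transformation would have to be applied to both the subject string and the pattern.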
That wouldn't work with every regexp implementation though, as
some would treat those as invalid characters if they go by
the newer definition where valid characters are only
U+0000..U+D7FF and U+E000..U+10FFFF.
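
As an illustration of that caveat, the sketch below uses Python's built-in UTF-8 decoder as a stand-in for a strict implementation (not for any particular regexp library). The helper names and the hard-coded byte string are assumptions made for the example; the byte string is the 6-byte form of 0x7FFFFF00 under the original UTF-8 scheme.

```python
# 6-byte sequence for the value 0x7FFFFF00 under the original UTF-8 scheme
ESCAPED_NUL = bytes.fromhex("fdbfbfbfbc80")


def in_current_range(code_point: int) -> bool:
    """True if the value is a valid character under the newer definition."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def accepted_by_strict_utf8(data: bytes) -> bool:
    """True if a strict decoder (here: Python's) accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(in_current_range(0x7FFFFF00))          # False
print(accepted_by_strict_utf8(ESCAPED_NUL))  # False
```

A library that validates its input this way would reject or mishandle the escaped bytes, so the trick only helps with implementations that still accept the longer sequences.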
But with those that do accept them, that would also make the
behaviour more consistent in cases like:
[[ $'\x80' = ? ]] vs [[ $'\x80' =~ '^.$' ]]
That wouldn't help in things like [[ x =~ $'[\0-\177]' ]] (which
anyway doesn't make sense in locales other than C/POSIX) though.
--
Stephane