Zsh Mailing List Archive
UTF-8 and PCRE and metafy
- X-seq: zsh-workers 28870
- From: Phil Pennock <zsh-workers+phil.pennock@xxxxxxxxxxxx>
- To: zsh-workers@xxxxxxx
- Subject: UTF-8 and PCRE and metafy
- Date: Tue, 8 Mar 2011 01:52:16 -0500
- List-help: <mailto:zsh-workers-help@zsh.org>
- List-id: Zsh Workers List <zsh-workers.zsh.org>
- List-post: <mailto:zsh-workers@zsh.org>
- Mail-followup-to: zsh-workers@xxxxxxx
- Mailing-list: contact zsh-workers-help@xxxxxxx; run by ezmlm
zsh 4.3.11 with the rematch_pcre option set:
% [[ 'foo→bar' =~ ^f.* ]]
zsh: pcre_exec() error: -10
The same happens with -pcre-match.
% locale charmap
UTF-8
Error -10 is PCRE_ERROR_BADUTF8.
In the pcre.c module, we explicitly enable PCRE_UTF8 if UTF8 is in
effect and supported.
Next to the existing:
zwarn("pcre_exec() error: %d", r);
I shoved in a couple more zwarn()s to confirm that the string is in
non-meta form:
zwarn("pcre_exec() error: %d", r);
zwarn("lhstr: %s", lhstr);
zwarn("rhre: /%s/", rhre);
which gives:
zsh: pcre_exec() error: -10
zsh: lhstr: foo→bar
zsh: rhre: /^f.*/
pcretest(1):
% pcretest
PCRE version 8.12 2011-01-15
re> /^f.*/
data> foo→bar
0: foo\xe2\x86\x92bar
Okay, so as long as the char is making it through intact as UTF-8 then
PCRE should be handling it.
Dumping each char of lhstr as an int shows that it is *not* in non-meta
form after all. Why does it print just fine, then? :(
% [[ 'foo→bar' =~ ^f.* ]]
zsh: pcre_exec() error: -10
zsh: lhstr: foo→bar
zsh: lhstr/%l: foo→bar
zsh: rhre: /^f.*/
zsh: utf-8 enabled? 1
zsh: lhstr char* item: 102
zsh: lhstr char* item: 111
zsh: lhstr char* item: 111
zsh: lhstr char* item: -30
zsh: lhstr char* item: -125
zsh: lhstr char* item: -90
zsh: lhstr char* item: -125
zsh: lhstr char* item: -78
zsh: lhstr char* item: 98
zsh: lhstr char* item: 97
zsh: lhstr char* item: 114
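The dump above can be read by hand. Here is a minimal standalone sketch, assuming zsh's Meta marker is the byte 0x83 (the -125 entries) and that the byte following it is the original byte XOR 32. The names demo_unmetafy, demo_utf8_ok and demo_selfcheck are made up for illustration and are not zsh or PCRE functions. Under that assumption the five middle bytes decode to e2 86 92, the same sequence pcretest printed, and the metafied form fails a simple structural UTF-8 check at the second 0x83.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: zsh's Meta marker byte is 0x83, and the byte that follows
 * it is the original byte XOR 32. */
#define DEMO_META 0x83

/* Hypothetical stand-in for unmetafy(): decode in place, return new length. */
static size_t demo_unmetafy(unsigned char *s, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        if (s[in] == DEMO_META && in + 1 < len) {
            s[out++] = (unsigned char)(s[in + 1] ^ 32);
            in += 2;
        } else {
            s[out++] = s[in++];
        }
    }
    return out;
}

/* Structural UTF-8 check only: each lead byte must be followed by the
 * right number of continuation bytes. Overlong forms and surrogates are
 * not rejected. Returns 1 if the buffer passes, 0 otherwise. */
static int demo_utf8_ok(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i++];
        size_t extra;

        if (c < 0x80)
            extra = 0;
        else if ((c & 0xE0) == 0xC0)
            extra = 1;
        else if ((c & 0xF0) == 0xE0)
            extra = 2;
        else if ((c & 0xF8) == 0xF0)
            extra = 3;
        else
            return 0;           /* stray continuation or invalid lead byte */

        while (extra--) {
            if (i >= len || (s[i++] & 0xC0) != 0x80)
                return 0;
        }
    }
    return 1;
}

/* The eleven bytes from the debug output, written as unsigned values. */
static void demo_selfcheck(void)
{
    unsigned char buf[] = { 102, 111, 111, 0xE2, 0x83, 0xA6,
                            0x83, 0xB2, 98, 97, 114 };
    size_t n;

    assert(!demo_utf8_ok(buf, sizeof buf));     /* metafied: rejected */
    n = demo_unmetafy(buf, sizeof buf);
    assert(n == 9);
    assert(memcmp(buf, "foo\xe2\x86\x92" "bar", 9) == 0);
    assert(demo_utf8_ok(buf, n));               /* decoded: accepted */
}
```

Note that e2 83 a6 on its own is a well-formed three-byte sequence, so a validator would only object at the following 0x83, which cannot start a character.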
So after line 336 of pcre.c I add:
unmetafy(lhstr, NULL);
Test:
% unset preexec_functions ; unfunction precmd
% [[ 'foo→bar' =~ ^f.* ]] ; print -l $? $MATCH foo $match
pattern.c:1403: BUG: - missing from numeric glob
0
foo?^<bar
foo
zefram
I'm guessing I need a bunch of calls to metafy() to process the results
of extraction in zpcre_get_substrings()? Where does the string
"zefram" come from? I mean, Andrew is capable and all, but springing
into existence like that was surprising.
Is there guidance on correct API usage here for calling metafy() and
having lengths all match up?
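One way to think about the length question, as a sketch rather than the real zsh API: offsets returned by the matcher refer to the unmetafied bytes, so each substring would be sliced out of the raw copy first and only then re-encoded, with the encoded length tracked separately. The helpers below (demo_needs_meta, demo_metafy, demo_capture) are hypothetical. The set of bytes treated as needing a Meta prefix (NUL and 0x83 through 0xA2) is an assumption and may differ between zsh versions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: Meta is 0x83 and an encoded byte is stored as (byte XOR 32). */
#define DEMO_META 0x83

/* Assumption: NUL and the token range 0x83..0xA2 need a Meta prefix. */
static int demo_needs_meta(unsigned char c)
{
    return c == 0 || (c >= DEMO_META && c <= 0xA2);
}

/* Hypothetical stand-in for metafy(): returns a malloc'd, NUL-terminated
 * encoded copy of raw[0..len) and stores the encoded length in *outlen.
 * Worst case every byte doubles, hence the 2 * len + 1 allocation. */
static unsigned char *demo_metafy(const unsigned char *raw, size_t len,
                                  size_t *outlen)
{
    unsigned char *out = malloc(2 * len + 1);
    size_t i, n = 0;

    if (!out)
        return NULL;
    for (i = 0; i < len; i++) {
        if (demo_needs_meta(raw[i])) {
            out[n++] = DEMO_META;
            out[n++] = (unsigned char)(raw[i] ^ 32);
        } else {
            out[n++] = raw[i];
        }
    }
    out[n] = '\0';
    if (outlen)
        *outlen = n;
    return out;
}

/* Slice one capture out of the raw subject using matcher-style
 * [start, end) byte offsets, then re-encode just that slice. */
static unsigned char *demo_capture(const unsigned char *raw,
                                   int start, int end, size_t *outlen)
{
    return demo_metafy(raw + start, (size_t)(end - start), outlen);
}
```

For the subject in this report, the raw string is 9 bytes and a whole-string match has offsets 0 and 9, while the re-encoded capture is 11 bytes, so the two lengths cannot be used interchangeably. The caller owns the returned buffer and frees it.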
-Phil