The patch below is believed to be solid: it guards hard against the previous problems and should be good to go. It is also available at <https://github.com/philpennock/zsh-code/tree/re2_cmds> (the `re2_raw_wip` branch holds the unsquashed progress towards this single commit).

----------------------------8< git patch >8-----------------------------
Add support for Google's BSD-licensed RE2 library, via the cre2 C-language bindings (also BSD-licensed). Guard with --enable-re2 for now.

Adds four infix conditions and two commands, but not yet any support for changing how =~ binds. Includes docs and many tests.

Includes a guard against a cre2 library that is unusable in practice, which bit with hard errors (C++ linkage problems with different compilers).

The commands are re2_compile and re2_match; they're feature-complete in what they have, but there's a TODO to add named regexps instead of just one anonymous regexp preserved between them. (That functionality would be well suited to the other regexp engines too.) I left the framework in, so that people can see where it would go and either say "sure, go for it" or "er, no". (Includes Yodl comments in the docs.)

There's no "multiline" support for changing the ^...$ matching on newlines vs start/end of text, as I couldn't get it working (and the cre2 docs describe the opposite of every other regexp engine and the RE2 docs).
--- Doc/Makefile.in | 2 +- Doc/Zsh/mod_re2.yo | 126 +++++++++++ INSTALL | 13 ++ Src/Modules/re2.c | 585 ++++++++++++++++++++++++++++++++++++++++++++++++++++ Src/Modules/re2.mdd | 7 + Test/V11re2.ztst | 317 ++++++++++++++++++++++++++++ configure.ac | 42 ++++ 7 files changed, 1091 insertions(+), 1 deletion(-) create mode 100644 Doc/Zsh/mod_re2.yo create mode 100644 Src/Modules/re2.c create mode 100644 Src/Modules/re2.mdd create mode 100644 Test/V11re2.ztst diff --git a/Doc/Makefile.in b/Doc/Makefile.in index 2752096..8c00876 100644 --- a/Doc/Makefile.in +++ b/Doc/Makefile.in @@ -65,7 +65,7 @@ Zsh/mod_datetime.yo Zsh/mod_db_gdbm.yo Zsh/mod_deltochar.yo \ Zsh/mod_example.yo Zsh/mod_files.yo Zsh/mod_langinfo.yo \ Zsh/mod_mapfile.yo Zsh/mod_mathfunc.yo Zsh/mod_newuser.yo \ Zsh/mod_parameter.yo Zsh/mod_pcre.yo Zsh/mod_private.yo \ -Zsh/mod_regex.yo Zsh/mod_sched.yo Zsh/mod_socket.yo \ +Zsh/mod_re2.yo Zsh/mod_regex.yo Zsh/mod_sched.yo Zsh/mod_socket.yo \ Zsh/mod_stat.yo Zsh/mod_system.yo Zsh/mod_tcp.yo \ Zsh/mod_termcap.yo Zsh/mod_terminfo.yo \ Zsh/mod_zftp.yo Zsh/mod_zle.yo Zsh/mod_zleparameter.yo \ diff --git a/Doc/Zsh/mod_re2.yo b/Doc/Zsh/mod_re2.yo new file mode 100644 index 0000000..bfabfd1 --- /dev/null +++ b/Doc/Zsh/mod_re2.yo @@ -0,0 +1,126 @@ +COMMENT(!MOD!zsh/re2 +Interface to the RE2 regular expression library. +!MOD!) +cindex(regular expressions) +cindex(re2) +The tt(zsh/re2) module provides regular expression handling using the +RE2 library. +This engine assumes UTF-8 strings by default and zsh never disables this. +Canonical documentation for this syntax accepted by this regular expression +engine can be found at: +uref(https://github.com/google/re2/wiki/Syntax) + +The tt(zsh/re2) module makes available some commands and test conditions. + +Regular expressions can be pre-compiled and given explicit names; these +are not shell variables and do not share a namespace with them. There +is currently no mechanism to enumerate them. 
+ +The supported commands are: + +startitem() +findex(re2_compile) +item(tt(re2_compile) COMMENT(TODO: [ tt(-R) var(NAME) ]) [ tt(-acilwLP) ] var(REGEX))( +Compiles an RE2-syntax regular expression, defaulting to case-sensitive. + +COMMENT(TODO: Option tt(-R) stores the regular expression with the given name, +instead of in anonymous global state.) +Option tt(-L) will interpret the pattern as a literal, not a regex. +Option tt(-P) will enable POSIX syntax instead of the full language. +Option tt(-a) will force the pattern to be anchored. +Option tt(-c) will re-enable Perl class support in POSIX mode. +Option tt(-i) will compile a case-insensitive pattern. +Option tt(-l) will use a longest-match not first-match algorithm for +selecting which branch matches. +Option tt(-w) will re-enable Perl word-boundary support in POSIX mode. +) +enditem() + +startitem() +findex(re2_match) +item(tt(re2_match) [ tt(-v) var(var) ] [ tt(-a) var(arr) ] \ +COMMENT(TODO:[ tt(-R) var(REGNAME) ]|)[ tt(-P) var(PATTERN) ] var(string))( +Matches a regular expression against the supplied string, storing matches in +variables. +Returns success if var(string) matches the tested regular expression. + +Without option+COMMENT(TODO: s tt(-R) or) tt(-P) will match against an implicit current regular +expression object, which must have been compiled with tt(re2_compile). +COMMENT(TODO: Option tt(-R) will use the regular expression with the given name.) +Option tt(-P) will take a regular expression as a parameter and compile and +use it, without changing the implicit current regular expression object as +set by calling tt(re2_compile). + +Without a successful match, no variables are modified, even those explicitly +specified. + +Upon successful match: the entire matched portion of var(string) is stored in +the var(var) of option tt(-v) if given, else in tt(MATCH); any captured +sub-expressions are stored in the array var(arr) of option tt(-a) if given, +else in tt(match). 
+ +No offset variables are currently mutated; this may change in a future release +of Zsh. +) +enditem() + +The supported test conditions are: + +startitem() +findex(re2-match) +item(var(expr) tt(-re2-match) var(regex))( +Matches a string against an RE2 regular expression. +Upon successful match, the +matched portion of the string will normally be placed in the tt(MATCH) +variable. If there are any capturing parentheses within the regex, then +the tt(match) array variable will contain those. +If the match is not successful, then the variables will not be altered. + +In addition, the tt(MBEGIN) and tt(MEND) variables are updated to point +to the offsets within var(expr) for the beginning and end of the matched +text, with the tt(mbegin) and tt(mend) arrays holding the beginning and +end of each substring matched. + +If the option tt(BASH_REMATCH) is set, then the array tt(BASH_REMATCH) will be set +instead of all of the other variables. + +The tt(NO_CASE_MATCH) option may be used to make matching case-insensitive. + +For finer-grained control, use the tt(re2_match) builtin. +) +enditem() + +startitem() +findex(re2-match-posix) +item(var(expr) tt(-re2-match-posix) var(regex))( +Matches as per tt(-re2-match) but configuring the RE2 engine to use +POSIX syntax. +) +enditem() + +startitem() +findex(re2-match-posixperl) +item(var(expr) tt(-re2-match-posixperl) var(regex))( +Matches as per tt(-re2-match) but configuring the RE2 engine to use +POSIX syntax, with the Perl classes and word-boundary extensions re-enabled +too. + +This thus adds support for: +tt(\d), tt(\s), tt(\w), tt(\D), tt(\S), tt(\W), tt(\b), and tt(\B). +) +enditem() + +startitem() +findex(re2-match-longest) +item(var(expr) tt(-re2-match-longest) var(regex))( +Matches as per tt(-re2-match) but configuring the RE2 engine to find +the longest match, instead of the first alternative that matches. 
+ +For example, given + +example([[ abb -re2-match-longest ^a+LPAR()b|bb+RPAR() ]]) + +This will match the right-branch, thus tt(abb), where tt(-re2-match) would +instead match only tt(ab). +) +enditem() diff --git a/INSTALL b/INSTALL index 99895bd..e7f7782 100644 --- a/INSTALL +++ b/INSTALL @@ -558,6 +558,19 @@ only be searched for if the option --enable-pcre is passed to configure. (Future versions of the shell may have a better fix for this problem.) +--enable-re2: + +The RE2 library is written in C++, so a C-library shim layer is needed for +use by Zsh. We use https://github.com/marcomaggi/cre2 for this, which is +currently at version 0.3.1. Both re2 and cre2 need to be installed for +this option to successfully enable the zsh/re2 module. The Zsh +functionality is currently experimental. + +Warning: compile cre2 with the same C++ compiler as is used for re2; the +Zsh developer who wrote this module finds that clang++ for both works +on MacOS, while g++ runs into various problems. Once cre2 is built and +installed, any C compiler can use the shim. + --enable-cap: This searches for POSIX capabilities; if found, the `cap' library diff --git a/Src/Modules/re2.c b/Src/Modules/re2.c new file mode 100644 index 0000000..1d04040 --- /dev/null +++ b/Src/Modules/re2.c @@ -0,0 +1,585 @@ +/* + * re2.c + * + * This file is part of zsh, the Z shell. + * + * Copyright (c) 2016 Phil Pennock + * All Rights Reserved. + * + * Permission is hereby granted, without written agreement and without + * license or royalty fees, to use, copy, modify, and distribute this + * software and to distribute modified versions of this software for any + * purpose, provided that the above copyright notice and the following + * two paragraphs appear in all copies of this software. 
+ * + * In no event shall Phil Pennock or the Zsh Development Group be liable + * to any party for direct, indirect, special, incidental, or consequential + * damages arising out of the use of this software and its documentation, + * even if Phil Pennock and the Zsh Development Group have been advised of + * the possibility of such damage. + * + * Phil Pennock and the Zsh Development Group specifically disclaim any + * warranties, including, but not limited to, the implied warranties of + * merchantability and fitness for a particular purpose. The software + * provided hereunder is on an "as is" basis, and Phil Pennock and the + * Zsh Development Group have no obligation to provide maintenance, + * support, updates, enhancements, or modifications. + * + */ + +/* This is heavily based upon my earlier regex module, with Peter's fixes + * for the tougher stuff I had skipped / gotten wrong. */ + +#include "re2.mdh" +#include "re2.pro" + +/* + * re2 itself is a C++ library; zsh needs C language bindings. + * These come from <https://github.com/marcomaggi/cre2>. + */ +#include <cre2.h> + +/* the conditions we support */ +#define ZRE2_COND_RE2 0 +#define ZRE2_COND_POSIX 1 +#define ZRE2_COND_POSIXPERL 2 +#define ZRE2_COND_LONGEST 3 + +struct zcre2_regexp_wrapper { + cre2_regexp_t *rex; + cre2_anchor_t anchoring; + +/* We don't need to keep the cre2_options_t* around while the cre2_regexp_t* exists; + * from the cre2 documentation for the cre2_new() initializer: + * + * "The options object opt is duplicated in the internal state of the regular + * expression instance, so opt can be safely mutated or finalised after this + * call. If opt is NULL: the regular expression object is built with the + * default set of options." */ +}; + +/* collect the names of all the variables which we might set as side-effects; + * if name is NULL, do not set it. 
Inline comments give 'typical' vars */ +struct match_varset { + /* bash non-NULL inhibits use of all others */ + const char *bash; /* BASH_REMATCH */ + const char *total; /* MATCH */ + const char *substrings; /* match */ + const char *offset_start; /* MBEGIN */ + const char *offset_end; /* MEND */ + const char *subs_offset_starts; /* mbegin */ + const char *subs_offset_ends; /* mend */ +}; + +/* The implicit anonymous regular expression object carried between compile & match */ +static struct zcre2_regexp_wrapper anonymous; + +/* TODO: tree of zcre2_regexp_wrapper, for named regexps */ + +/* returns 1 if matched, 0 if did not match; sets vars as side-effect */ +static int zre2_match_capture(cre2_regexp_t *, cre2_anchor_t, + const char *, size_t, struct match_varset *); + +/**/ +static int +bin_re2_compile(char *nam, char **args, Options ops, UNUSED(int func)) +{ + struct zcre2_regexp_wrapper *target; + cre2_options_t *zre2_opt; + cre2_anchor_t anchoring; + char *pattern; + int return_value = 1; + + zre2_opt = cre2_opt_new(); + if (!zre2_opt) { + zwarnnam(nam, "re2 opt memory allocation failure"); + return 1; + } + /* nb: we can set encoding here; re2 assumes UTF-8 by default */ + cre2_opt_set_log_errors(zre2_opt, 0); /* don't hit stderr by default */ + + anchoring = OPT_ISSET(ops,'a') ? CRE2_ANCHOR_BOTH : CRE2_UNANCHORED; + if(OPT_ISSET(ops,'L')) cre2_opt_set_literal(zre2_opt, 1); + if(OPT_ISSET(ops,'P')) cre2_opt_set_posix_syntax(zre2_opt, 1); + if(OPT_ISSET(ops,'c')) cre2_opt_set_perl_classes(zre2_opt, 1); + if(OPT_ISSET(ops,'i')) cre2_opt_set_case_sensitive(zre2_opt, 0); + if(OPT_ISSET(ops,'l')) cre2_opt_set_longest_match(zre2_opt, 1); + if(OPT_ISSET(ops,'w')) cre2_opt_set_word_boundary(zre2_opt, 1); + + /* The cre2 docs on cre2_opt_set_one_line are misleading and led me to believe + * that RE2 is doing the opposite of every other regexp engine when it + * comes to multiline; the Syntax docs make clear that RE2's defaults + * match every other regexp engine. 
+ * I couldn't get cre2_opt_set_one_line() to do anything sane, so for now + * we're dropping support for -m/multiline. A feature for the future. */ + + if(OPT_ISSET(ops,'R')) { + /* we don't enable R as an option below. */ + target = /*TODO*/NULL; + zwarnnam(nam, "-R unimplemented, TODO"); + goto CLEANUP_OPT; + } else { + target = &anonymous; + } + + if (target->rex != NULL) { + cre2_delete(target->rex); + target->rex = NULL; + } + + pattern = ztrdup(*args); + unmetafy(pattern, NULL); + + target->anchoring = anchoring; + target->rex = cre2_new(pattern, strlen(pattern), zre2_opt); + if (!target->rex) { + zwarnnam(nam, "re2 regexp memory allocation failure"); + goto CLEANUP_OPT; + } + if (cre2_error_code(target->rex)) { + zwarnnam(nam, "re2 rexexp compilation failed: %s", cre2_error_string(target->rex)); + cre2_delete(target->rex); + target->rex = NULL; + goto CLEANUP_OPT; + } + + return_value = 0; + +CLEANUP_OPT: + cre2_opt_delete(zre2_opt); + return return_value; +} + +/**/ +static int +bin_re2_match(char *nam, char **args, Options ops, UNUSED(int func)) +{ + struct match_varset varnames; + struct zcre2_regexp_wrapper short_lived; + struct zcre2_regexp_wrapper *zrex; + cre2_options_t *zre2_opt; + char *pattern, *plaintext; + int c, return_value; + + return_value = 1; + memset((void *)&varnames, 0, sizeof(varnames)); + varnames.total = OPT_HASARG(ops,c='v') ? OPT_ARG(ops,c) : "MATCH"; + varnames.substrings = OPT_HASARG(ops,c='a') ? 
OPT_ARG(ops,c) : "match"; + + short_lived.rex = NULL; + if (OPT_HASARG(ops,c='P')) { + zrex = &short_lived; + zrex->anchoring = CRE2_UNANCHORED; + zre2_opt = cre2_opt_new(); + if (!zre2_opt) { + zwarnnam(nam, "re2 opt memory allocation failure"); + return 1; + } + cre2_opt_set_log_errors(zre2_opt, 0); /* don't hit stderr by default */ + pattern = ztrdup(OPT_ARG(ops,c)); + unmetafy(pattern, NULL); + zrex->rex = cre2_new(pattern, strlen(pattern), zre2_opt); + cre2_opt_delete(zre2_opt); + if (!zrex->rex) { + zwarnnam(nam, "re2 regexp memory allocation failure"); + return 1; + } + if (cre2_error_code(zrex->rex)) { + zwarnnam(nam, "re2 rexexp compilation failed: %s", cre2_error_string(zrex->rex)); + cre2_delete(zrex->rex); + return 1; + } + } else if (OPT_HASARG(ops,c='R')) { + /*TODO: get the target from the regexp named*/ + zwarnnam(nam, "-R unimplemented, BUG FIXME XXX"); + return 1; + } else { + zrex = &anonymous; + + if (zrex->rex == NULL) { + zwarnnam(nam, "no anonymous re2 object; did you call re2_compile?"); + return 1; + } + } + + plaintext = ztrdup(*args); + unmetafy(plaintext, NULL); + + /* beware bool sense */ + return_value = zre2_match_capture(zrex->rex, zrex->anchoring, + plaintext, strlen(plaintext), + &varnames) ? 
0 : 1; + + if (short_lived.rex != NULL) { + cre2_delete(short_lived.rex); + } + return return_value; +} + +/**/ +static int +zcond_re2_match(char **a, int id) +{ + struct match_varset varnames; + cre2_regexp_t *rex; + cre2_options_t *opt; + char *lhstr, *lhstr_zshmeta, *rhre, *rhre_zshmeta; + int return_value; + + return_value = 0; /* 1 => matched successfully */ + + lhstr_zshmeta = cond_str(a,0,0); + rhre_zshmeta = cond_str(a,1,0); + lhstr = ztrdup(lhstr_zshmeta); + unmetafy(lhstr, NULL); + rhre = ztrdup(rhre_zshmeta); + unmetafy(rhre, NULL); + + opt = cre2_opt_new(); + if (!opt) { + zwarn("re2 opt memory allocation failure"); + goto CLEANUP_UNMETAONLY; + } + /* nb: we can set encoding here; re2 assumes UTF-8 by default */ + cre2_opt_set_log_errors(opt, 0); /* don't hit stderr by default */ + if (!isset(CASEMATCH)) { + cre2_opt_set_case_sensitive(opt, 0); + } + + /* "The following options are only consulted when POSIX syntax is enabled; + * when POSIX syntax is disabled: these features are always enabled and + * cannot be turned off." + * Seems hard to mis-parse, but I did. Okay, Perl classes \d,\w and friends + * always on normally, can _also_ be enabled in POSIX mode. 
*/ + + switch (id) { + case ZRE2_COND_RE2: + /* nothing to do, this is default */ + break; + case ZRE2_COND_POSIX: + cre2_opt_set_posix_syntax(opt, 1); + break; + case ZRE2_COND_POSIXPERL: + cre2_opt_set_posix_syntax(opt, 1); + /* we enable Perl classes (\d, \s, \w, \D, \S, \W) + * and boundaries/not (\b \B) */ + cre2_opt_set_perl_classes(opt, 1); + cre2_opt_set_word_boundary(opt, 1); + break; + case ZRE2_COND_LONGEST: + cre2_opt_set_longest_match(opt, 1); + break; + default: + DPUTS(1, "bad re2 option"); + goto CLEANUP_OPT; + } + + rex = cre2_new(rhre, strlen(rhre), opt); + if (!rex) { + zwarn("re2 regular expression memory allocation failure"); + goto CLEANUP_OPT; + } + if (cre2_error_code(rex)) { + zwarn("re2 rexexp compilation failed: %s", cre2_error_string(rex)); + goto CLEANUP; + } + + memset((void *)&varnames, 0, sizeof(varnames)); + varnames.bash = isset(BASHREMATCH) ? "BASH_REMATCH" : NULL; + varnames.total = "MATCH"; + varnames.substrings = "match"; + varnames.offset_start = "MBEGIN"; + varnames.offset_end = "MEND"; + varnames.subs_offset_starts = "mbegin"; + varnames.subs_offset_ends = "mend"; + + return_value = zre2_match_capture(rex, CRE2_UNANCHORED, + lhstr, strlen(lhstr), + &varnames); + +CLEANUP: + cre2_delete(rex); +CLEANUP_OPT: + cre2_opt_delete(opt); +CLEANUP_UNMETAONLY: + free(lhstr); + free(rhre); + return return_value; +} + +/* This does the matching and capturing logic. + * Returns 1 if matched, 0 if did not match; sets vars as side-effect. + * Should guard against _any_ varname being NULL, let callers decide + * what they want. + * If .bash varname non-NULL, _only_ that varname set (because that's the + * logic used by the infix-operator; if you need something else, change this + * constraint). 
*/ +static int +zre2_match_capture( + cre2_regexp_t *rex, + cre2_anchor_t anchoring, + const char *plaintext, + size_t plain_size, + struct match_varset *varnames +) { + cre2_string_t *m, *matches = NULL; + char **result_array, **x; + const char *s; + char **mbegin, **mend, **bptr, **eptr; + size_t matchessz = 0; + int return_value, ncaptures, matched, nelem, start, n, indexing_base; + int remaining_len, charlen; + zlong offs; + + if (rex == NULL) { + zwarn("BUG: got NULL re2 regexp"); + return 0; + } + + return_value = 0; + + ncaptures = cre2_num_capturing_groups(rex); + /* the nmatch for cre2_match follows the usual pattern of index 0 holding + * the entire matched substring, index 1 holding the first capturing + * sub-expression, etc. So we need ncaptures+1 elements. */ + matchessz = (ncaptures + 1) * sizeof(cre2_string_t); + matches = zalloc(matchessz); + + matched = cre2_match(rex, + plaintext, plain_size, /* text to match against */ + 0, plain_size, /* substring of text to consider */ + anchoring, + matches, (ncaptures+1)); + if (!matched) + goto CLEANUP; + return_value = 1; + + /* We have a match, we will return success, we have array of cre2_string_t + * items, each with .data and .length fields pointing into the matched text, + * all in unmetafied format. + * + * We need to collect the results, put together various arrays and offset + * variables, while respecting options to change the array set, the indexing + * of that array and everything else that 26 years of history has endowed + * upon us. */ + /* For the condition case: + * option BASHREMATCH set: + * set $BASH_REMATCH instead of $MATCH/$match + * entire matched portion in index 0 (useful with option KSH_ARRAYS) + * option _not_ set: + * $MATCH scalar gets entire string + * $match array gets substrings + * $MBEGIN $MEND scalars get offsets of entire match + * $mbegin $mend arrays get offsets of substrings + * all of the offsets depend upon KSHARRAYS to determine indexing! 
+ * + * Our caller sets up the varnames bundle, sets ->bash non-NULL only + * if option BASHREMATCH set, and we use non-NULL string pointers to + * decide whether to set data. + */ + + if (varnames->bash != NULL) { + start = 0; + nelem = ncaptures + 1; + } else { + start = 1; + nelem = ncaptures; + } + result_array = NULL; + if (nelem) { + result_array = x = (char **) zalloc(sizeof(char *) * (nelem + 1)); + for (m = matches + start, n = start; n <= ncaptures; ++n, ++m, ++x) { + /* .data is (const char *), metafy can modify in-place so takes + * (char *) but doesn't modify given META_DUP, so safe to drop + * the const. */ + *x = metafy((char *)m->data, m->length, META_DUP); + } + *x = NULL; + } + + if (varnames->bash != NULL) { + setaparam((char *)varnames->bash, result_array); + goto CLEANUP; + } + + indexing_base = isset(KSHARRAYS) ? 0 : 1; + + if (varnames->total != NULL) { + setsparam((char *)varnames->total /* typically: MATCH */, + metafy((char *)matches[0].data, matches[0].length, META_DUP)); + } + + if ((varnames->offset_start != NULL) || (varnames->offset_end != NULL)) { + /* count characters before the match */ + s = plaintext; + remaining_len = matches[0].data - plaintext; + offs = 0; + MB_CHARINIT(); + while (remaining_len) { + offs++; + charlen = MB_CHARLEN(s, remaining_len); + s += charlen; + remaining_len -= charlen; + } + if (varnames->offset_start != NULL) { + setiparam((char *)varnames->offset_start, /* typically: MBEGIN */ + offs + indexing_base); + } + /* then the characters within the match */ + remaining_len = matches[0].length; + while (remaining_len) { + offs++; + charlen = MB_CHARLEN(s, remaining_len); + s += charlen; + remaining_len -= charlen; + } + if (varnames->offset_end != NULL) { + /* zsh ${foo[a,b]} is inclusive of end-points, [a,b] not [a,b) */ + setiparam((char *)varnames->offset_end, /* typically: MEND */ + offs + indexing_base - 1); + } + } + + if (!ncaptures) { + goto CLEANUP; + } + + if (varnames->substrings != NULL) { + 
setaparam((char *)varnames->substrings, /* typically: match */ + result_array); + } + + if ((varnames->subs_offset_starts == NULL) && (varnames->subs_offset_ends == NULL)) { + goto CLEANUP; + } + + bptr = mbegin = (char **)zalloc(sizeof(char *)*(ncaptures+1)); + eptr = mend = (char **)zalloc(sizeof(char *)*(ncaptures+1)); + for (m = matches + start, n = 0; + n < ncaptures; + ++n, ++m, ++bptr, ++eptr) + { + char buf[DIGBUFSIZE]; + if (m->data == NULL) { + /* FIXME: have assumed this is the API for non-matching substrings; confirm! */ + *bptr = ztrdup("-1"); + *eptr = ztrdup("-1"); + continue; + } + s = plaintext; + remaining_len = m->data - plaintext; + offs = 0; + /* Find the start offset */ + MB_CHARINIT(); + while (remaining_len) { + offs++; + charlen = MB_CHARLEN(s, remaining_len); + s += charlen; + remaining_len -= charlen; + } + convbase(buf, offs + indexing_base, 10); + *bptr = ztrdup(buf); + /* Continue to the end offset */ + remaining_len = m->length; + while (remaining_len) { + offs++; + charlen = MB_CHARLEN(s, remaining_len); + s += charlen; + remaining_len -= charlen; + } + convbase(buf, offs + indexing_base - 1, 10); + *eptr = ztrdup(buf); + } + *bptr = *eptr = NULL; + + if (varnames->subs_offset_starts != NULL) { + setaparam((char *)varnames->subs_offset_starts, /* typically: mbegin */ + mbegin); + } + if (varnames->subs_offset_ends != NULL) { + setaparam((char *)varnames->subs_offset_ends, /* typically: mend */ + mend); + } + +CLEANUP: + if (matches) + zfree(matches, matchessz); + return return_value; +} + +static struct builtin bintab[] = { + /* TODO: add "R:" in here for named regexps: */ + BUILTIN("re2_compile", 0, bin_re2_compile, 1, 1, 0, "acilwLP", NULL), + BUILTIN("re2_match", 0, bin_re2_match, 1, 1, 0, "a:v:P:", NULL), +}; + + +static struct conddef cotab[] = { + CONDDEF("re2-match", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_RE2), + CONDDEF("re2-match-posix", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_POSIX), + 
CONDDEF("re2-match-posixperl", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_POSIXPERL), + CONDDEF("re2-match-longest", CONDF_INFIX, zcond_re2_match, 0, 0, ZRE2_COND_LONGEST), +}; + + +static struct features module_features = { + bintab, sizeof(bintab)/sizeof(*bintab), + cotab, sizeof(cotab)/sizeof(*cotab), + NULL, 0, + NULL, 0, + 0 +}; + + +/**/ +int +setup_(UNUSED(Module m)) +{ + return 0; +} + +/**/ +int +features_(Module m, char ***features) +{ + *features = featuresarray(m, &module_features); + return 0; +} + +/**/ +int +enables_(Module m, int **enables) +{ + return handlefeatures(m, &module_features, enables); +} + +/**/ +int +boot_(UNUSED(Module m)) +{ + anonymous.rex = NULL; + return 0; +} + +/**/ +int +cleanup_(Module m) +{ + if (anonymous.rex != NULL) { + cre2_delete(anonymous.rex); + anonymous.rex = NULL; + } + /* TODO: + * when implement -R named variables, also clean those up. + * (isolating them, to allow for clean module unloading, is why I moved + * away from storing regexps in named variables and think we should use + * a separate tree instead. */ + return setfeatureenables(m, &module_features, NULL); +} + +/**/ +int +finish_(UNUSED(Module m)) +{ + return 0; +} diff --git a/Src/Modules/re2.mdd b/Src/Modules/re2.mdd new file mode 100644 index 0000000..5b23699 --- /dev/null +++ b/Src/Modules/re2.mdd @@ -0,0 +1,7 @@ +name=zsh/re2 +link='if test "x$enable_re2" = xyes && test "x$ac_cv_lib_cre2_cre2_version_string" = xyes; then echo dynamic; else echo no; fi' +load=no + +autofeatures="b:re2_compile b:re2_match C:re2-match" + +objects="re2.o" diff --git a/Test/V11re2.ztst b/Test/V11re2.ztst new file mode 100644 index 0000000..860233e --- /dev/null +++ b/Test/V11re2.ztst @@ -0,0 +1,317 @@ +%prep + + if ! 
zmodload -F zsh/re2 C:re2-match 2>/dev/null + then + ZTST_unimplemented="the zsh/re2 module is not available" + return 0 + fi +# Load the rest of the builtins + zmodload zsh/re2 + # TODO: use future mechanism to switch =~ to use re2 and test =~ too +# Find a UTF-8 locale. + setopt multibyte +# Don't let LC_* override our choice of locale. + unset -m LC_\* + mb_ok= + langs=(en_{US,GB}.{UTF-,utf}8 en.UTF-8 + $(locale -a 2>/dev/null | egrep 'utf8|UTF-8')) + for LANG in $langs; do + if [[ é = ? ]]; then + mb_ok=1 + break; + fi + done + if [[ -z $mb_ok ]]; then + ZTST_unimplemented="no UTF-8 locale or multibyte mode is not implemented" + else + print -u $ZTST_fd Testing RE2 multibyte with locale $LANG + mkdir multibyte.tmp && cd multibyte.tmp + fi + +%test + + [[ 'foo→bar' -re2-match .([^[:ascii:]]). ]] + print $MATCH + print $match[1] +0:Basic non-ASCII regexp matching +>o→b +>→ + + MATCH='' + [[ ÷x -re2-match '^(\p{Sm})(\p{Latin})$' ]] + print "$? <$MATCH> .${match[1]}|${match[2]}." +0:Unicode character class names & extracting correct widths +>0 <÷x> .÷|x. + + [[ alphabeta -re2-match a([^a]+)a ]] + echo "$? basic" + print $MATCH + print $match[1] + [[ ! alphabeta -re2-match a(.+)a ]] + echo "$? negated op" + [[ alphabeta -re2-match ^b ]] + echo "$? failed match" +# default matches on first, then takes longest substring +# -longest keeps looking + [[ abb -re2-match a(b|bb) ]] + echo "$? first .${MATCH}.${match[1]}." + [[ abb -re2-match-longest a(b|bb) ]] + echo "$? longest .${MATCH}.${match[1]}." + [[ alphabeta -re2-match ab ]]; echo "$? unanchored" + [[ alphabeta -re2-match ^ab ]]; echo "$? anchored" + [[ alphabeta -re2-match '^a(\w+)a$' ]] + echo "$? perl class used" + echo ".${MATCH}. .${match[1]}." + [[ alphabeta -re2-match-posix '^a(\w+)a$' ]] + echo "$? POSIX-mode, should inhibit Perl class" + [[ alphabeta -re2-match-posixperl '^a(\w+)a$' ]] + echo "$? POSIX-mode with Perl classes enabled .${match[1]}." 
+ unset MATCH match + [[ alphabeta -re2-match ^a([^a]+)a([^a]+)a$ ]] + echo "$? matched, set vars" + echo ".$MATCH. ${#MATCH}" + echo ".${(j:|:)match[*]}." + unset MATCH match + [[ alphabeta -re2-match fr(.+)d ]] + echo "$? unmatched, not setting MATCH/match" + echo ".$MATCH. ${#MATCH}" + echo ".${(j:|:)match[*]}." +0:Basic matching & result codes +>0 basic +>alpha +>lph +>1 negated op +>1 failed match +>0 first .ab.b. +>0 longest .abb.bb. +>0 unanchored +>1 anchored +>0 perl class used +>.alphabeta. .lphabet. +*?\(eval\):*: re2 rexexp compilation failed: invalid escape sequence: \w +>1 POSIX-mode, should inhibit Perl class +>0 POSIX-mode with Perl classes enabled .lphabet. +>0 matched, set vars +>.alphabeta. 9 +>.lph|bet. +>1 unmatched, not setting MATCH/match +>.. 0 +>.. + + m() { + unset MATCH MBEGIN MEND match mbegin mend + [[ $2 -re2-match $3 ]] + print $? $1: m:${MATCH}: ma:${(j:|:)match}: MBEGIN=$MBEGIN MEND=$MEND mbegin="(${mbegin[*]})" mend="(${mend[*]})" + } + data='alpha beta gamma delta' + m uncapturing $data '\b\w+\b' + m capturing $data '\b(\w+)\b' + m 'capture 2' $data '\b(\w+)\s+(\w+)\b' + m 'capture repeat' $data '\b(?:(\w+)\s+)+(\w+)\b' +0:Beginning and end testing +>0 uncapturing: m:alpha: ma:: MBEGIN=1 MEND=5 mbegin=() mend=() +>0 capturing: m:alpha: ma:alpha: MBEGIN=1 MEND=5 mbegin=(1) mend=(5) +>0 capture 2: m:alpha beta: ma:alpha|beta: MBEGIN=1 MEND=10 mbegin=(1 7) mend=(5 10) +>0 capture repeat: m:alpha beta gamma delta: ma:gamma|delta: MBEGIN=1 MEND=22 mbegin=(12 18) mend=(16 22) + + + unset match mend + s=$'\u00a0' + [[ $s -re2-match '^.$' ]] && print OK + [[ A${s}B -re2-match .(.). && $match[1] == $s ]] && print OK + [[ A${s}${s}B -re2-match A([^[:ascii:]]*)B && $mend[1] == 3 ]] && print OK + unset s +0:Raw IMETA characters in input string +>OK +>OK +>OK + + [[ foo -re2-match f.+ ]] ; print $? + [[ foo -re2-match x.+ ]] ; print $? + [[ ! foo -re2-match f.+ ]] ; print $? + [[ ! foo -re2-match x.+ ]] ; print $? 
+  [[ foo -re2-match f.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match x.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match f.+ && bar -re2-match x.+ ]] ; print $?
+  [[ ! foo -re2-match f.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match f.+ && ! bar -re2-match b.+ ]] ; print $?
+  [[ ! ( foo -re2-match f.+ && bar -re2-match b.+ ) ]] ; print $?
+  [[ ! foo -re2-match x.+ && bar -re2-match b.+ ]] ; print $?
+  [[ foo -re2-match x.+ && ! bar -re2-match b.+ ]] ; print $?
+  [[ ! ( foo -re2-match x.+ && bar -re2-match b.+ ) ]] ; print $?
+0:Regex result inversion detection
+>0
+>1
+>1
+>0
+>0
+>1
+>1
+>1
+>1
+>1
+>0
+>1
+>0
+
+# Subshell because crash on failure
+  ( [[ test.txt -re2-match '^(.*_)?(test)' ]]
+    echo $match[2] )
+0:regression for segmentation fault (pcre, dup for re2), workers/38307
+>test
+
+  setopt BASH_REMATCH KSH_ARRAYS
+  unset MATCH MBEGIN MEND match mbegin mend BASH_REMATCH
+  [[ alphabeta -re2-match '^a([^a]+)(a)([^a]+)a$' ]]
+  print "$? bash_rematch"
+  print "m:${MATCH}: ma:${(j:|:)match}:"
+  print MBEGIN=$MBEGIN MEND=$MEND mbegin="(${mbegin[*]})" mend="(${mend[*]})"
+  print "BASH_REMATCH=[${(j:, :)BASH_REMATCH[@]}]"
+  print "[0]=${BASH_REMATCH[0]} [1]=${BASH_REMATCH[1]}"
+0:bash_rematch works
+>0 bash_rematch
+>m:: ma::
+>MBEGIN= MEND= mbegin=() mend=()
+>BASH_REMATCH=[alphabeta, lph, a, bet]
+>[0]=alphabeta [1]=lph
+
+  unsetopt BASH_REMATCH KSH_ARRAYS
+  m() {
+    local label="$1" text="$2" rc out
+    shift 2
+    unset MATCH match
+    # can't capture stderr sanely for amalgamation, need compile to happen in parent
+    re2_compile "$@"
+    rc=$?
+    if (( rc )); then print "${rc}-NoCompile $label"; return 1; fi
+    print -n "$rc:"
+    re2_match "$text"
+    print $? $label: m:${MATCH}: ma:${(j:|:)match}:
+  }
+  #
+  m cmd-clean alphabeta lph
+  m cmd-anchored-nomatch alphabeta -a lph.+
+  m cmd-anchored-match alphabeta -a alp.+
+  m case-mismatch alphabeta 'A\w+'
+  m case-insensitive-pattern alphabeta -i 'A\w+'
+  m case-insensitive-text Alphabeta -i 'a\w+'
+  m case-sensitive-text Alphabeta 'a\w+'
+  m non-posix-okay-normal ÷1 '^(\p{Sm})\d$'
+  m non-posix-reject-normal ÷x '^(\p{Sm})\d$'
+  print -u2 'stderr start non-posix-posixmode'
+  m non-posix-posixmode ÷1 -P '^(\p{Sm})\d$'
+  print -u2 'stderr end non-posix-posixmode'
+  m literal-match x1 -L x1
+  m literal-nomatch x1 -L .1
+  m literal-match-substr abcd -L bc
+  m literal-nomatch-anchored abcd -aL bc
+  m not-longest abb 'a(b|bb)'
+  m longest abb -l 'a(b|bb)'
+0:re2 compile/match testing with anonymous var
+>0:0 cmd-clean: m:lph: ma::
+>0:1 cmd-anchored-nomatch: m:: ma::
+>0:0 cmd-anchored-match: m:alphabeta: ma::
+>0:1 case-mismatch: m:: ma::
+>0:0 case-insensitive-pattern: m:alphabeta: ma::
+>0:0 case-insensitive-text: m:Alphabeta: ma::
+>0:0 case-sensitive-text: m:abeta: ma::
+>0:0 non-posix-okay-normal: m:÷1: ma:÷:
+>0:1 non-posix-reject-normal: m:: ma::
+>1-NoCompile non-posix-posixmode
+?stderr start non-posix-posixmode
+*?m:re2_compile:*: re2 rexexp compilation failed: invalid escape sequence: \p
+?stderr end non-posix-posixmode
+>0:0 literal-match: m:x1: ma::
+>0:1 literal-nomatch: m:: ma::
+>0:0 literal-match-substr: m:bc: ma::
+>0:1 literal-nomatch-anchored: m:: ma::
+>0:0 not-longest: m:ab: ma:b:
+>0:0 longest: m:abb: ma:bb:
+
+### We've dropped multi-line support for now, rather than debug RE2/cre2
+### interactions and figure out how I (pdp) am mis-reading docs.  Should
+### we add it, this is the test which exposed the presence of problems:
+# m multiline-reject-nom $'ab\ncd' '^cd'
+# set -x
+# m multiline-okay $'ab\ncd' -m '^cd'
+# set +x
+#0:re2 multiline matching
+#>0:1 multiline-reject-nom: m:: ma::
+#>0:0 multiline-okay: m:cd: ma::
+
+  m posix-simple a1d -Pa '([[:alpha:]])([[:digit:]])([[:alpha:]])'
+  #
+  print -u2 'stderr start posix-reject-perlclass'
+  m posix-reject-perlclass a1d -Pa '(\w)(\d)(\w)'
+  print -u2 'stderr end posix-reject-perlclass'
+  m posix-perlclass-enabled a1d -Pac '(\w)(\d)(\w)'
+  m boundaries-normal 'def efg' '\be(.)'
+  print -u2 'stderr start posix-reject-boundaries'
+  m posix-reject-boundaries 'def efg' -P '\be(.)'
+  print -u2 'stderr end posix-reject-boundaries'
+  m posix-boundaries-enabled 'def efg' -Pw '\be(.)'
+  m posix-perlclass-boundaries 'de1g e2h' -Pcw '\be(\d)(\w)'
+  m posix-pcb-mattered 'de1g e2h' -Pcw 'e(\d)(\w)'
+0:re2 POSIX mode with various features added back
+>0:0 posix-simple: m:a1d: ma:a|1|d:
+?stderr start posix-reject-perlclass
+*?m:re2_compile:*: re2 rexexp compilation failed: invalid escape sequence: \\w
+?stderr end posix-reject-perlclass
+>1-NoCompile posix-reject-perlclass
+>0:0 posix-perlclass-enabled: m:a1d: ma:a|1|d:
+>0:0 boundaries-normal: m:ef: ma:f:
+?stderr start posix-reject-boundaries
+*?m:re2_compile:*: re2 rexexp compilation failed: invalid escape sequence: \\b
+?stderr end posix-reject-boundaries
+>1-NoCompile posix-reject-boundaries
+>0:0 posix-boundaries-enabled: m:ef: ma:f:
+>0:0 posix-perlclass-boundaries: m:e2h: ma:2|h:
+>0:0 posix-pcb-mattered: m:e1g: ma:1|g:
+
+  re2_compile -i '^([aeiou])(\w{2})'
+  mintov() {
+    local label="$1"; shift
+    unset MATCH match T1 t1
+    re2_match "$@"
+    print "$? $label MATCH=<$MATCH> match=<${(j:|:)match}> T1=<$T1> t1=<${(j:|:)t1}>"
+  }
+  mintov not-first not_first
+  mintov simple orange
+  mintov redir-arr -a t1 orange
+  mintov redir-var -v T1 orange
+  mintov redir-both -v T1 -a t1 orange
+  mintov normal-after orange
+0:re2_match capturing to named vars
+>1 not-first MATCH=<> match=<> T1=<> t1=<>
+>0 simple MATCH=<ora> match=<o|ra> T1=<> t1=<>
+>0 redir-arr MATCH=<ora> match=<> T1=<> t1=<o|ra>
+>0 redir-var MATCH=<> match=<o|ra> T1=<ora> t1=<>
+>0 redir-both MATCH=<> match=<> T1=<ora> t1=<o|ra>
+>0 normal-after MATCH=<ora> match=<o|ra> T1=<> t1=<>
+
+
+  re2_compile '^([aeiou])(\w{2})'
+  re2_match orange && echo "yes-1"
+  re2_match -P '^t.{3}' orange || echo "no-2"
+  re2_match -P '^t.{3}' tangerine && echo "yes-3"
+  re2_match tangerine || echo "no-4"
+  re2_match orange && echo "yes-5 ${match[2]}"
+0:re2_match -P pattern works & doesn't mess with anonymous
+>yes-1
+>no-2
+>yes-3
+>no-4
+>yes-5 ra
+
+
+  re2_compile '^(\p{Sm})(?!\d+)(?:.)$'
+1:re2 check no crash on unsupported syntax
+?(eval):re2_compile:1: re2 rexexp compilation failed: invalid perl operator: (?!
+
+  re2_compile '(fred'
+1:re2 complain parens not closed
+?(eval):re2_compile:1: re2 rexexp compilation failed: missing ): (fred
+
+
+%clean
+  unfunction -m 'm*'
diff --git a/configure.ac b/configure.ac
index 0e0bd53..c000d6a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -442,6 +442,11 @@ AC_ARG_ENABLE(pcre,
 AC_HELP_STRING([--enable-pcre],
 [enable the search for the pcre library (may create run-time library dependencies)]))
 
+dnl Do you want to look for re2 support?
+AC_ARG_ENABLE(re2,
+AC_HELP_STRING([--enable-re2],
+[enable the search for cre2 C-language bindings and re2 library]))
+
 dnl Do you want to look for capability support?
 AC_ARG_ENABLE(cap,
 AC_HELP_STRING([--enable-cap],
@@ -683,6 +688,43 @@ if test "x$ac_cv_prog_PCRECONF" = xpcre-config; then
   fi
 fi
 
+if test x$enable_re2 = xyes; then
+AC_CHECK_LIB([re2],[main],,
+  [AC_MSG_FAILURE([test for RE2 library failed])])
+AC_CHECK_LIB([cre2],[cre2_version_string],,
+  [AC_MSG_FAILURE([test for CRE2 library failed])])
+AC_CHECK_HEADERS([cre2.h],,
+  [AC_MSG_ERROR([test for RE2 header failed])])
+dnl The C++ libraries cre2 & re2 need to be built with the same compiler
+dnl to avoid chaotic failures; in testing, a mismatch was sufficient to
+dnl cause this to either error or segfault during cre2_pattern().
+AC_CACHE_CHECK([whether cre2 is broken at runtime],
+  [zsh_cv_cre2_runtime_broken],
+  AC_TRY_RUN([
+#include <stdint.h>
+#include <string.h>
+#include <cre2.h>
+#define Pattern "a.{2}(c|d)$"
+int main(int argc, char **argv) {
+  cre2_options_t *opts; cre2_regexp_t *rex;
+  opts = cre2_opt_new();
+  if (!opts) { return 1; }
+  rex = cre2_new(Pattern, strlen(Pattern), opts);
+  if (!rex) { return 2; }
+  if (cre2_error_code(rex)) { return 3; }
+  if (strcmp(cre2_pattern(rex), Pattern) != 0) { return 4; }
+  cre2_delete(rex);
+  cre2_opt_delete(opts);
+  return 0;
+}],
+  zsh_cv_cre2_runtime_broken=no,
+  zsh_cv_cre2_runtime_broken=yes
+  zsh_cv_cre2_runtime_broken=yes))
+  if test x$zsh_cv_cre2_runtime_broken = xyes; then
+    AC_MSG_ERROR([cre2 library hard-unusable, rebuild with same compiler as for RE2])
+  fi
+fi
+
 AC_CHECK_HEADERS(sys/time.h sys/times.h sys/select.h termcap.h termio.h \
 		 termios.h sys/param.h sys/filio.h string.h memory.h \
 		 limits.h fcntl.h libc.h sys/utsname.h sys/resource.h \
-- 
2.10.0