On Wed, Oct 18, 2006 at 06:20:19PM +0100, Peter Stephenson wrote:
> Alexey Tourbin <at@xxxxxxxxxxx> wrote:
> > Thanks for the clue. git-bisect now blames 22544.
>
> That patch made the shell smarter about finding the end of
> special types of string known to the shell (identifiers in particular),
> using the multibyte code.
>
> I wonder if it's part of the problem Andrey noted? At some points the
> string we apply this to may contain tokenized characters, which
> aren't valid multibyte characters. Since the string must be metafied,
> these are easy to detect.
>
> The simplest fix is just to ensure we don't try to handle these as
> multibyte characters, telling the caller they're invalid. Most callers
> will just handle it as a single-byte character and move on, which
> is the right thing to do; some callers which really need valid characters
> will abort, but they shouldn't be getting a tokenized string. So
> this might actually work. If not, we need to be smarter, but probably at a
> higher level.
>
> We need some fix like this even if it isn't the root of the present
> problem. (If I could reproduce that it ought now to be easy to trace.)
>
> Index: Src/utils.c
> ===================================================================
> RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
> retrieving revision 1.142
> diff -u -r1.142 utils.c
> --- Src/utils.c 10 Oct 2006 09:37:19 -0000 1.142
> +++ Src/utils.c 18 Oct 2006 17:09:16 -0000
> @@ -4003,6 +4003,21 @@
>          *wcp = (wint_t)(*s == Meta ? s[1] ^ 32 : *s);
>          return 1 + (*s == Meta);
>      }
> +    /*
> +     * We have to handle tokens here, since we may be looking
> +     * through a tokenized input. Obviously this isn't
> +     * a valid multibyte character, so just return WEOF
> +     * and let the caller handle it as a single character.
> +     *
> +     * TODO: I've a sneaking suspicion we could do more here
> +     * to prevent the caller always needing to handle invalid
> +     * characters specially, but sometimes it may need to know.
> +     */
> +    if (itok(*s)) {
> +        if (wcp)
> +            *wcp = EOF;
> +        return 1;
> +    }
>
>      ret = MB_INVALID;
>      for (ptr = s; *ptr; ) {

Thanks Peter! This patch resolves the problem.
(I quote the whole message because apparently it was not CC'ed to
zsh-workers.)
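
For anyone following the patch, here is a rough sketch of what "handle it
as a single-byte character and move on" could look like in a caller. It is
not zsh code: the sketch_* names are hypothetical, the byte values for Meta
and the token range are assumptions standing in for zsh's own Meta and
itok(), and real multibyte decoding is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Stand-ins for zsh internals; these values are assumptions for the sketch. */
#define SKETCH_META      0x83   /* assumed marker for a metafied byte pair */
#define SKETCH_TOK_FIRST 0x84   /* assumed start of the token byte range */
#define SKETCH_TOK_LAST  0xa2   /* assumed end of the token byte range */

/* Hypothetical stand-in for itok(): is this byte a token? */
static int sketch_is_token(unsigned char c)
{
    return c >= SKETCH_TOK_FIRST && c <= SKETCH_TOK_LAST;
}

/*
 * Simplified analogue of the patched function: returns the number of
 * bytes consumed and stores the character in *wcp, or WEOF for a token
 * byte. Real multibyte decoding is deliberately left out.
 */
static int sketch_charlen(const char *s, wint_t *wcp)
{
    unsigned char c;

    assert(s != NULL && *s != '\0');
    c = (unsigned char)*s;
    if (c == SKETCH_META && s[1] != '\0') {
        /* Metafied pair: the real byte is the next one XOR 32. */
        if (wcp)
            *wcp = (wint_t)((unsigned char)s[1] ^ 32);
        return 2;
    }
    if (sketch_is_token(c)) {
        /* Token: not a valid character, but still exactly one byte. */
        if (wcp)
            *wcp = WEOF;
        return 1;
    }
    if (wcp)
        *wcp = (wint_t)c;
    return 1;
}

/*
 * Example caller: walk the string, counting each step as one unit.
 * A WEOF result is treated as a single byte and stepped over.
 */
static size_t sketch_count_units(const char *s, size_t *invalid)
{
    size_t units = 0;

    *invalid = 0;
    while (*s != '\0') {
        wint_t wc;

        s += sketch_charlen(s, &wc);
        if (wc == WEOF)
            (*invalid)++;
        units++;
    }
    return units;
}
```

With the bytes 'a', Meta, 'A', one token byte, 'b', this loop takes four
steps and flags one of them as invalid, without ever stalling on the token.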

Unfortunately I don't quite understand Unicode issues in zsh. I build
the zsh rpm package because I use it (and a few others use it, too). The
latest stable 4.2 release had problems in a UTF-8 console, so I decided
to move to the then-current CVS snapshot. I got my first decently working
UTF-8-enabled zsh with the 20050926 snapshot.

So for now, just about the only thing I can provide is feedback.
This will change as I grok the zsh code.

BTW, a git archive is available at
git://git.altlinux.org/people/at/packages/zsh.git
The 'master' branch is for my own cooking, but the "cvs" branch, as well
as "zsh-4_0-patches" and "zsh-4_2-patches", have pristine zsh sources.
I verified the "cvs" branch against a checkout, and it's almost zero-diff
(the only exception is that a very old Completion/Core/_closequotes
is in there, but is not in the checkout). I used Keith Packard's "parsecvs"
(with my changes, some of which are already merged into mainline).
> --
> Peter Stephenson <pws@xxxxxxx> Software Engineer
> CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
> Cambridge, CB4 0WZ, UK Tel: +44 (0)1223 692070
>
>
> To access the latest news from CSR copy this link into a web browser: http://www.csr.com/email_sig.php