Zsh Mailing List Archive

Re: D07multibyte.ztst failure on HP-UX 11.11

X-seq: zsh-workers 26944
From: Peter Stephenson <pws@xxxxxxx>
Subject: Re: D07multibyte.ztst failure on HP-UX 11.11
Date: Thu, 7 May 2009 16:38:19 +0100
Cc: zsh-workers@xxxxxxxxxx
In-reply-to: <20090506215026.GA5565@otaku>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
Organization: CSR
References: <20090501145253.GA5070@svalbard> <200905011518.n41FIlHi005089@xxxxxxxxxxxxxx> <20090505193931.GA2944@svalbard> <20090506202206.63bc26b0@pws-pc> <20090506215026.GA5565@otaku>

On Wed, 6 May 2009 21:50:26 +0000
Paul Ackersviller <pda@xxxxxxxxxxxxxxxx> wrote:
> On Wed, May 06, 2009 at 08:22:06PM +0100, Peter Stephenson wrote:
> > On Tue, 5 May 2009 19:39:31 +0000
> > Paul Ackersviller <pda@xxxxxxxxxxxxxxxx> wrote:
> > > I can get read to silently fail on the HP box with
> > > 
> > > env -i LANG=en_US.utf8 ../Src/zsh -fc \
> > > 	"(LC_ALL=C; print \$'\\u00e9') | read || print failure"
> > 
> > Taking out the LC_ALL should produce some sensible output if you omit
> > the read.  (Replacing it with xxd or failing that od -x might make it
> > clearer what's going on.)
> 
> Not quite: "zsh:1: cannot do charset conversion (iconv failed)"

It's not clear why it should fail, but the error message is OK and allowed
for by the test.

> > If you're simply taking out the subshell and not replacing it with
> > anything then the LC_ALL=C covers the "read" as well as the "print".
> > So possibly something strange is happening in the read.  Replacing it
> > with xxd might be even more instructive here.
> 
> This gives
> 	0000000 c50a
> Does this mean the 0a should be the second byte, but is perhaps being
> interpreted as newline?

So this comes from

 env -i LANG=en_US.utf8 ../Src/zsh -fc \
   "LC_ALL=C; print \$'\\u00e9' | read || print failure"

I get "character not in range" here.  It looks like your system is
outputting 0xc5, which I wouldn't expect to be a valid character in the C
locale, and I can't work out how it is derived from Unicode character
U+00E9.  In UTF-8 that would be the two bytes 0xc3 0xa9; in ISO-8859-1 or
-15 it would be the single byte 0xe9.
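
For reference, the expected encodings can be checked by hand with a small
sketch like the one below.  It assumes only a POSIX shell with printf, od
and tr; the helper names are made up for illustration and are not part of
zsh or its test suite.

```shell
# hex_of: hypothetical helper, prints stdin as a run of hex byte values.
hex_of() {
    od -An -tx1 | tr -d '[:space:]'
}

# U+00E9 written out by hand as octal escapes, so no locale is involved.
utf8_e_acute()   { printf '\303\251'; }   # UTF-8: 0xc3 0xa9
latin1_e_acute() { printf '\351'; }       # ISO-8859-1 / -15: 0xe9

utf8_e_acute   | hex_of; echo   # prints c3a9
latin1_e_acute | hex_of; echo   # prints e9
```

Neither encoding contains 0xc5, which is why that byte is unexpected.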

The 0x0a really is a newline.

In the test you show, read is running with UTF-8.  I can confirm that
on my system (where I happen already to be in the en_GB.UTF-8 locale)

  (unsetopt multibyte; print $'\xc5') | xxd

gives what you're sending to read, and

  (unsetopt multibyte; print $'\xc5') | read

returns status 1 with no output.
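
One way to see why a UTF-8 read gives up on this input: 0xc5 has the bit
pattern of a two-byte lead byte, so it has to be followed by a continuation
byte (0x80 to 0xbf), and 0x0a is not one.  The sketch below classifies a
byte by its high bits.  It is plain POSIX shell arithmetic, not a
description of what zsh does internally, and the function name is invented
for the example.

```shell
# utf8_byte_class: hypothetical helper; takes one byte as two hex digits
# and prints its role in a UTF-8 stream, judged by its high bits alone.
utf8_byte_class() {
    b=$((0x$1))
    if   [ $((b & 0x80)) -eq 0 ];         then echo ascii
    elif [ $((b & 0xc0)) -eq $((0x80)) ]; then echo continuation
    elif [ $((b & 0xe0)) -eq $((0xc0)) ]; then echo lead2
    elif [ $((b & 0xf0)) -eq $((0xe0)) ]; then echo lead3
    elif [ $((b & 0xf8)) -eq $((0xf0)) ]; then echo lead4
    else echo invalid
    fi
}

utf8_byte_class c5   # lead2: one continuation byte must follow
utf8_byte_class 0a   # ascii: the newline, so the sequence is broken
utf8_byte_class a9   # continuation: what follows 0xc3 in a valid U+00E9
```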

So this all tallies, and I think we've found out all we need, but I'm not
sure about the fix; possibly read should output an error on an invalid
character in MULTIBYTE mode (which we could add to the test)?  Does anyone
see a problem with that?

I'm fairly happy this isn't a shell bug, but I'd still like the shell to
have enough facilities to be able to detect the problem.
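
Until read reports this itself, a script could validate its input
separately.  A rough sketch, assuming an iconv command that accepts the
name UTF-8 and exits non-zero on an invalid sequence (worth verifying on
HP-UX, given the iconv failure reported above); is_valid_utf8 is a
hypothetical name, not an existing zsh facility.

```shell
# is_valid_utf8: hypothetical helper; succeeds only if stdin decodes as UTF-8.
# Assumes iconv exits non-zero when it meets an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The byte pair seen in the xxd output: 0xc5 followed by a newline.
if printf '\305\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```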

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR PLC, Churchill House, Cambridge Business Park, Cowley Road
Cambridge, CB4 0WZ, UK                          Tel: +44 (0)1223 692070

