Zsh Mailing List Archive

Wide-character prompts, etc. (Re: Named directories)



On Feb 27, 11:30am, Andrey Borzenkov wrote:
} Subject: Re: Named directories
}
} On Sunday 27 February 2005 01:17, Bart Schaefer wrote:
} >
} > Hey, PWS and the UTF-8 gang: If "bin" were in fact a name consisting
} > of three wide characters, i.e., they display as three characters
} > but occupy 6 or more bytes, would the prompt code treat ~three as a
} > shorter string for purposes of the %~ expansion?

[ I'm going to reorder Andrey's followup a bit here ... ]
 
} I got a closer look at SUS. All description of shell is in terms of
} characters where characters are specifically defined (elsewhere in
} SUS) as multibyte characters in current locale as opposed to single
} byte. It means that answer to your question is - zsh should behave
} the same in single byte or multibyte locale as long as number of
} characters is the same.

Yes, my question was in part rhetorical -- to emphasize that the UNICODE
interpretation can't be entirely isolated to ZLE.

} Some other weird effects - older UNICODE versions did not rule out
} multiple encoding for the same character:
[...]
} Checking under Linux, apparently the latter won't be accepted on
} conversion to wc; it means that a file name containing such pattern
} will be shown correctly by simply "ls ." but can not be edited at all
} currently.

This seems to me to be an issue with the underlying wc libraries and not
something the shell is in a position to deal with.  If the library can't
identify a multibyte sequence, zsh shouldn't attempt second-guessing it.

} See 20870 for some more problems.

You listed:

- regexps ([[:print:]] et al.)?

Pattern matching of character classes and the ? metacharacter will, I
think, need to examine the current position in the scanned string to
see if it's the start of a multibyte character, and if so, convert the
next N bytes to a wide char and do a locale-based comparison against it.  
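
As a rough illustration of that scan step (in Python rather than zsh's C, and assuming UTF-8 as the only multibyte encoding), the sketch below inspects the byte at the current position, works out how many bytes the character should occupy, and converts them. The names next_char and matches_print are hypothetical, and str.isprintable() merely stands in for a locale-based [[:print:]] test.

```python
def next_char(data: bytes, pos: int):
    """Return (char, length) for the character starting at data[pos].

    char is None when the bytes at pos do not form a valid UTF-8
    sequence; the caller can then fall back to handling one raw byte.
    """
    lead = data[pos]
    if lead < 0x80:
        n = 1                      # plain ASCII
    elif 0xC2 <= lead <= 0xDF:
        n = 2                      # start of a 2-byte sequence
    elif 0xE0 <= lead <= 0xEF:
        n = 3                      # start of a 3-byte sequence
    elif 0xF0 <= lead <= 0xF4:
        n = 4                      # start of a 4-byte sequence
    else:
        return None, 1             # continuation byte or invalid lead byte
    try:
        return data[pos:pos + n].decode("utf-8"), n
    except UnicodeDecodeError:
        return None, 1             # truncated or malformed sequence


def matches_print(data: bytes, pos: int) -> bool:
    """Stand-in for matching [[:print:]] at data[pos] (assumption)."""
    ch, _ = next_char(data, pos)
    return ch is not None and ch.isprintable()
```

Here matches_print(b"\xd0\xb0", 0) looks at the whole two-byte character, while starting at offset 1 lands on a continuation byte and fails to convert.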

The real question is what to do if someone embeds a literal $'\xd0' in
a pattern.  Reject the pattern as bad?  Or allow it to match one byte of
a multibyte character in the scanned string?

Similarly, does $'*\xb0' as a pattern match $'\xd0\xb0' as the scanned
string?

We may need an option (and a corresponding globbing flag to toggle it)
to force the pattern to be interpreted as wide chars or bytes.  That
option would, ideally, be set by default to the most sensible value for
the current locale.  If set for wide characters, the answer to my first
question would be "reject as bad" and to the second would be "no".
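
To make the two interpretations concrete, here is a small Python sketch, again assuming UTF-8. The function glob_match and its wide and reject_bad parameters are hypothetical stand-ins for the proposed option, not anything zsh provides; Python's fnmatch module plays the role of the glob matcher. With reject_bad turned off, stray bytes in the pattern are kept as unmatchable placeholders, which shows why the second answer is "no".

```python
import fnmatch


def glob_match(pattern: bytes, subject: bytes, wide: bool,
               reject_bad: bool = True) -> bool:
    """Match subject against a glob pattern, bytewise or character-wise."""
    if not wide:
        # Byte interpretation: '?' and '*' consume individual bytes.
        return fnmatch.fnmatchcase(subject, pattern)
    try:
        pat = pattern.decode("utf-8")
    except UnicodeDecodeError:
        if reject_bad:
            raise ValueError("bad pattern: incomplete multibyte sequence")
        # Keep stray bytes as lone surrogates; they cannot equal any
        # validly decoded character in the subject.
        pat = pattern.decode("utf-8", errors="surrogateescape")
    text = subject.decode("utf-8", errors="surrogateescape")
    return fnmatch.fnmatchcase(text, pat)


subject = b"\xd0\xb0"                                   # one 2-byte character
bytewise = glob_match(b"*\xb0", subject, wide=False)    # True
charwise = glob_match(b"*\xb0", subject, wide=True, reject_bad=False)  # False
```

Calling glob_match(b"\xd0", subject, wide=True) raises ValueError, which corresponds to "reject as bad".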

- $foo[n,m] for scalar?

Peter previously suggested that this would operate bytewise except in
the presence of an expansion flag to force it to operate character-wise.
See also my previous remark about an option.  (Could the same option
apply to both patterns and expansion?)
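
The difference between the two subscripting modes can be sketched as follows, in Python and assuming UTF-8. The function subscript and its by_char parameter are hypothetical; by_char stands in for the suggested expansion flag, and only positive indices are handled.

```python
def subscript(value: bytes, n: int, m: int, by_char: bool = False) -> bytes:
    """Emulate $foo[n,m]: 1-based, inclusive, positive indices only."""
    if by_char:
        # Character-wise: index into decoded characters, then re-encode.
        return value.decode("utf-8")[n - 1:m].encode("utf-8")
    # Bytewise: index directly into the raw bytes.
    return value[n - 1:m]


foo = "\u03b6sh".encode("utf-8")              # b'\xce\xb6sh', zeta + "sh"
first_byte = subscript(foo, 1, 1)             # b'\xce', half a character
first_char = subscript(foo, 1, 1, by_char=True)   # b'\xce\xb6'
```

The bytewise result splits the first character in two, which is the behaviour an expansion flag would let the user avoid.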

- Upper/Lower conversion?

Conversion or comparison?  I think conversion only occurs in parameter
expansion, so the answer there is the same as for $foo[n,m].  Comparison
occurs in the completion matching code as well as in globbing.  My guess
is that completion could in fact attempt the comparison both with and
without locale, because it is by nature a heuristic anyway.
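
For the conversion half, the same bytewise versus character-wise split applies. A minimal Python sketch, assuming UTF-8 and using Python's own Unicode case tables in place of the locale's (the function upcase and its by_char parameter are hypothetical):

```python
def upcase(value: bytes, by_char: bool = False) -> bytes:
    """Upper-case a string either bytewise or character-wise."""
    if by_char:
        # Character-wise: decode, convert whole characters, re-encode.
        return value.decode("utf-8").upper().encode("utf-8")
    # Bytewise: only ASCII letters change; other bytes pass through.
    return value.upper()


mixed = "a\u0430".encode("utf-8")        # Latin 'a' + Cyrillic 'а'
bytewise = upcase(mixed)                  # b'A\xd0\xb0'
charwise = upcase(mixed, by_char=True)    # b'A\xd0\x90'
```

Bytewise conversion leaves the Cyrillic letter untouched, while the character-wise version upper-cases both.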

- comparison (collating)?

We already use strcoll() there.  Can we rely upon it to work properly in
multibyte locales?

-- 
Bart Schaefer                                 Brass Lantern Enterprises
http://www.well.com/user/barts              http://www.brasslantern.com

Zsh: http://www.zsh.org | PHPerl Project: http://phperl.sourceforge.net   


