Zsh Mailing List Archive
Re: Named directories
- X-seq: zsh-workers 20880
- From: Andrey Borzenkov <arvidjaar@xxxxxxxxxx>
- To: zsh-workers@xxxxxxxxxx
- Subject: Re: Named directories
- Date: Sun, 27 Feb 2005 11:30:11 +0300
- In-reply-to: <Pine.LNX.4.61.0502261405550.18748@xxxxxxxxxxxxxxxxxx>
- Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
- References: <2e24629722633f291ea82664e14cbc5d@xxxxxxxxxx> <Pine.LNX.4.61.0502261405550.18748@xxxxxxxxxxxxxxxxxx>
On Sunday 27 February 2005 01:17, Bart Schaefer wrote:
>
> Hey, PWS and the UTF-8 gang: If "bin" were in fact a name consisting of
> three wide characters, i.e., they display as three characters but occupy
> 6 or more bytes, would the prompt code treat ~three as a shorter string
> for purposes of the %~ expansion?
>
Yes, I'd like to know too :) See 20870 for some more problems. Apparently, in
addition to the byte vs. character problem, we also get a character length vs.
character width problem.
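To make the three measures concrete, here is a minimal Python sketch (not zsh code) that reports bytes, characters, and an approximate display width for a string. The helper names `display_width` and `measures` are made up for illustration, and the width rule (2 columns for East Asian wide/fullwidth characters, 0 for combining marks, 1 otherwise) is only a rough stand-in for what a C library wcwidth() would report.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal columns; an assumed, simplified width rule."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        # Wide and fullwidth East Asian characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def measures(text: str, encoding: str = "utf-8") -> tuple:
    """Return (bytes, characters, approximate columns) for text."""
    return len(text.encode(encoding)), len(text), display_width(text)


if __name__ == "__main__":
    # ASCII, Cyrillic, CJK, and "e" followed by a combining acute accent.
    for sample in ("bin", "\u0431\u0438\u043d", "\u65e5\u672c\u8a9e", "e\u0301"):
        print(repr(sample), measures(sample))
```

All three numbers can differ for the same string, which is why a prompt that compares lengths has to say which one it means.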
I took a closer look at SUS. The whole description of the shell is in terms of
characters, where characters are specifically defined (elsewhere in SUS) as
multibyte characters in the current locale, as opposed to single bytes. This
means the answer to your question is: zsh should behave the same in a single
byte or a multibyte locale as long as the number of characters is the same.
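As a small illustration of that reading, the Python sketch below encodes one hypothetical three-letter name in a single-byte encoding (KOI8-R) and in UTF-8. The name and the helper `char_count` are assumptions made for the example; the point is only that the byte counts differ while the character count, which is what SUS talks about, does not.

```python
# Hypothetical three-letter Cyrillic directory name (U+0431 U+0438 U+043D).
NAME = "\u0431\u0438\u043d"

SINGLE_BYTE = NAME.encode("koi8-r")  # one byte per character
MULTI_BYTE = NAME.encode("utf-8")    # two bytes per character here


def char_count(raw: bytes, encoding: str) -> int:
    """Number of characters in raw when interpreted in the given encoding."""
    return len(raw.decode(encoding))


if __name__ == "__main__":
    print(len(SINGLE_BYTE), char_count(SINGLE_BYTE, "koi8-r"))
    print(len(MULTI_BYTE), char_count(MULTI_BYTE, "utf-8"))
```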
Some other weird effects: older Unicode versions did not rule out multiple
encodings of the same character:
{pts/2}% print -- $'\xd0\xb0'
а
{pts/2}% print -- $'\xe0\x90\xb0'
а
Checking under Linux, apparently the latter is not accepted on conversion to
wc; this means that a file name containing such a pattern will be shown
correctly by a simple "ls ." but currently cannot be edited at all.
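The two byte sequences above can be compared with a short Python sketch. `naive_utf8_decode` is a hypothetical helper that only shuffles bits and skips every validity check, so it shows that both sequences carry the same code point; `strict_decode` uses Python's built-in UTF-8 decoder, which here stands in for the C library conversion (an assumption of the sketch, not a claim about what zsh or glibc do internally).

```python
def naive_utf8_decode(raw: bytes) -> int:
    """Extract the code point from a 2- or 3-byte sequence, with no checks."""
    if len(raw) == 2:
        return ((raw[0] & 0x1F) << 6) | (raw[1] & 0x3F)
    if len(raw) == 3:
        return ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)
    raise ValueError("this sketch handles 2- and 3-byte sequences only")


def strict_decode(raw: bytes):
    """Return the decoded string, or None if the decoder rejects the bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for seq in (b"\xd0\xb0", b"\xe0\x90\xb0"):
        print(seq, hex(naive_utf8_decode(seq)), strict_decode(seq))
```

Both sequences carry the bits of U+0430, but a strict decoder accepts only the two-byte form; the three-byte form is a longer-than-necessary encoding and is rejected.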
Looking at some other implementations: most keep the input in its original
multibyte form and convert to wc only to apply wctype functions. They treat
the bytes of an invalid mb sequence as characters of size 1 and build an extra
vector of character sizes to speed up iteration over the input.
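A rough Python sketch of that approach follows. It is not taken from any of those implementations: the function name `char_sizes`, the 4-byte limit, and the use of Python's decoder in place of mbrtowc() are all assumptions. The input stays as bytes, and a parallel list records how many bytes each character occupies, with any byte that does not start a valid sequence counted as a character of size 1.

```python
def char_sizes(data: bytes, encoding: str = "utf-8", max_len: int = 4) -> list:
    """Return the byte length of each character in data, in order."""
    sizes = []
    i = 0
    while i < len(data):
        size = 1  # fallback: an undecodable byte counts as one character
        for n in range(1, max_len + 1):
            chunk = data[i:i + n]
            if len(chunk) < n:
                break  # ran off the end of the input
            try:
                if len(chunk.decode(encoding)) == 1:
                    size = n  # shortest prefix that is one valid character
                    break
            except UnicodeDecodeError:
                continue  # incomplete or invalid so far, try a longer prefix
        sizes.append(size)
        i += size
    return sizes


if __name__ == "__main__":
    # "a", the valid two-byte form, then the rejected three-byte form.
    print(char_sizes(b"a\xd0\xb0\xe0\x90\xb0"))
```

With such a vector, moving the cursor by one character is a matter of adding or subtracting one entry, and the original bytes are never lost, so a file name containing an invalid sequence could still be edited.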
Perl internally tags each string as either Unicode or a plain character
string; but it has a completely different implementation: Unicode is
implemented internally and does not depend on the locale.
-andrey