Zsh Mailing List Archive

Re: Named directories

X-seq: zsh-workers 20880
From: Andrey Borzenkov <arvidjaar@xxxxxxxxxx>
To: zsh-workers@xxxxxxxxxx
Subject: Re: Named directories
Date: Sun, 27 Feb 2005 11:30:11 +0300
In-reply-to: <Pine.LNX.4.61.0502261405550.18748@xxxxxxxxxxxxxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
References: <2e24629722633f291ea82664e14cbc5d@xxxxxxxxxx> <Pine.LNX.4.61.0502261405550.18748@xxxxxxxxxxxxxxxxxx>

On Sunday 27 February 2005 01:17, Bart Schaefer wrote:
>
> Hey, PWS and the UTF-8 gang:  If "bin" were in fact a name consisting of
> three wide characters, i.e., they display as three characters but occupy
> 6 or more bytes, would the prompt code treat ~three as a shorter string
> for purposes of the %~ expansion?
>

Yes, I'd like to know too :) See 20870 for some more problems. Apparently, in
addition to byte vs. character, we also get a character length vs. character
width problem.

I took a closer look at SUS. The whole description of the shell is in terms of
characters, where characters are specifically defined (elsewhere in SUS) as
multibyte characters in the current locale, as opposed to single bytes. It means
the answer to your question is: zsh should behave the same in a single-byte or
multibyte locale as long as the number of characters is the same.
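
To make the three measures concrete (bytes, characters, display columns), here
is a rough sketch. It is Python purely for illustration and has nothing to do
with zsh's prompt code. The helper name measures() is made up, and the column
rule (2 columns for East Asian wide/fullwidth, 0 for combining marks, 1
otherwise) is a simplifying assumption, not what any terminal guarantees.

```python
import unicodedata


def measures(s):
    """Return (UTF-8 byte count, character count, approximate column count)."""
    n_bytes = len(s.encode("utf-8"))
    n_chars = len(s)
    cols = 0
    for ch in s:
        if unicodedata.combining(ch):
            # Assumption: combining marks occupy no column of their own.
            continue
        # Assumption: wide/fullwidth characters take 2 columns, the rest 1.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return n_bytes, n_chars, cols


print(measures("bin"))     # (3, 3, 3)
print(measures("бин"))     # (6, 3, 3)
print(measures("日本語"))  # (9, 3, 6)
```

The second string is the case from the question: three characters in 6 bytes.
The third shows that the character count and the column count can differ too.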

Some other weird effects: older Unicode versions did not rule out multiple
encodings of the same character:

{pts/2}% print -- $'\xd0\xb0'
а
{pts/2}% print -- $'\xe0\x90\xb0'
а

Checking under Linux, apparently the latter is not accepted on conversion to
wc; it means that a file name containing such a pattern will be shown correctly
by a simple "ls ." but currently cannot be edited at all.
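
The two byte sequences can be compared outside the shell. The sketch below is
Python, for illustration only. unpack_utf8_bits() is a hypothetical, deliberately
naive decoder that skips the shortest-form check, and strict_decode() just wraps
Python's own strict UTF-8 decoder. Neither is the libc conversion zsh would use.

```python
def unpack_utf8_bits(seq):
    """Naively extract the payload bits of a 2- or 3-byte UTF-8-style sequence.

    Deliberately performs no shortest-form (overlong) check.
    """
    if len(seq) == 2:
        value = seq[0] & 0x1F   # lead byte 110xxxxx
    elif len(seq) == 3:
        value = seq[0] & 0x0F   # lead byte 1110xxxx
    else:
        raise ValueError("only 2- and 3-byte sequences are handled here")
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)   # continuation byte 10xxxxxx
    return value


def strict_decode(seq):
    """Decode with Python's strict UTF-8 decoder; return None if rejected."""
    try:
        return seq.decode("utf-8")
    except UnicodeDecodeError:
        return None


short = b"\xd0\xb0"
overlong = b"\xe0\x90\xb0"

print(hex(unpack_utf8_bits(short)))     # 0x430
print(hex(unpack_utf8_bits(overlong)))  # 0x430
print(strict_decode(short) == "\u0430") # True
print(strict_decode(overlong))          # None
```

Both sequences carry the same payload bits (U+0430), but a strict decoder only
accepts the shortest form.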

Looking at some other implementations: most keep input in its original
multibyte form and convert to wc only to apply wctype functions. They treat
each byte of an invalid mb sequence as a character of size 1 and build an extra
vector of character sizes to speed up iteration over the input.
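
A minimal sketch of that size-vector idea, assuming UTF-8 input. It is Python
for illustration only; the implementations mentioned above would do this in C
with the locale's multibyte functions. The name char_sizes() is made up.

```python
def char_sizes(data):
    """Return the byte length of each character in `data`.

    Bytes that do not start a valid UTF-8 sequence are treated as
    characters of size 1, so the sizes always sum to len(data).
    """
    sizes = []
    i = 0
    while i < len(data):
        size = 1  # fallback: an invalid byte counts as one character
        for n in range(1, 5):  # UTF-8 sequences are 1 to 4 bytes long
            try:
                data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            size = n
            break
        sizes.append(size)
        i += size
    return sizes


# 'a', the valid 2-byte form, the rejected 3-byte form, then 'z'
buf = b"a\xd0\xb0\xe0\x90\xb0z"
print(char_sizes(buf))  # [1, 2, 1, 1, 1, 1]
```

The input buffer stays in its original bytes, and the vector lets the caller
step forward or backward by whole characters without rescanning.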

Perl internally tags strings either as Unicode or as character strings, but it
has a completely different implementation: Unicode is implemented internally
and does not depend on the locale.

-andrey

Follow-Ups:
- Wide-character prompts, etc. (Re: Named directories)
  - From: Bart Schaefer
