Zsh Mailing List Archive
RE: UTF-8 fonts
- X-seq: zsh-workers 17729
- From: Borzenkov Andrey <Andrej.Borsenkow@xxxxxxxxxxxxxx>
- To: "'Zsh hackers list'" <zsh-workers@xxxxxxxxxx>
- Subject: RE: UTF-8 fonts
- Date: Wed, 25 Sep 2002 15:11:39 +0400
- Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
Just to make it clear: is the aim to use UTF-8 internally, or to support
(arbitrary) multibyte encodings?
> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
>
> My first thought about using UTF-8 instead of eight bit characters
This sounds like you want to convert input to UTF-8 internally?
> was
> that we would have to replace the current `Meta' system. However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
>
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules. There may also be some extra places where counting
> the length needs changing.
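For reference, the `Meta' system mentioned above can be sketched roughly like
this -- the marker value and the "special" byte range here are illustrative,
not copied from zsh's source. The point is that metafication is a reversible
byte-level transform, so UTF-8 byte sequences pass through it unchanged in
meaning:

```c
#include <stddef.h>

#define Meta 0x83                       /* illustrative marker byte */

/* Treat bytes 0x80..0x9f as "special" for this sketch. */
static int needs_meta(unsigned char c)
{
    return c >= 0x80 && c <= 0x9f;
}

/* Metafy src (n bytes) into dst; returns the metafied length.
 * dst must be able to hold up to 2*n bytes. */
size_t metafy(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (needs_meta(src[i])) {
            dst[j++] = Meta;
            dst[j++] = src[i] ^ 32;     /* shift out of the special range */
        } else
            dst[j++] = src[i];
    }
    return j;
}

/* Undo the transform; returns the unmetafied length. */
size_t unmetafy(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] == Meta && i + 1 < n)
            dst[j++] = src[++i] ^ 32;
        else
            dst[j++] = src[i];
    }
    return j;
}
```

Since the round trip is exact, the shell can carry UTF-8 bytes through its
internals without caring what they encode -- which is the "seamless
translation" the quoted text refers to.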
>
You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. That is by definition locale dependent -
collating order is different in different languages even when they use the
same character set. This means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it, which in effect
means using the wc* and mb* function suites anyway.
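A sketch of what that looks like in practice: a locale-aware case-insensitive
comparison built on mbrtowc() and towlower(). This is not zsh code --
mb_casecmp is a hypothetical helper, and it assumes setlocale() has already
been called:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Compare two multibyte strings case-insensitively in the current
 * locale.  Returns <0, 0, >0 like strcmp, or -2 on an invalid
 * multibyte sequence. */
int mb_casecmp(const char *a, const char *b)
{
    mbstate_t sa, sb;
    memset(&sa, 0, sizeof sa);
    memset(&sb, 0, sizeof sb);
    for (;;) {
        wchar_t wa, wb;
        size_t na = mbrtowc(&wa, a, MB_CUR_MAX, &sa);
        size_t nb = mbrtowc(&wb, b, MB_CUR_MAX, &sb);
        if (na == (size_t)-1 || na == (size_t)-2 ||
            nb == (size_t)-1 || nb == (size_t)-2)
            return -2;                  /* invalid sequence in this locale */
        if (na == 0 && nb == 0)
            return 0;                   /* both strings ended together */
        if (na == 0) return -1;         /* a is a prefix of b */
        if (nb == 0) return 1;
        wa = towlower((wint_t)wa);      /* locale-dependent case mapping */
        wb = towlower((wint_t)wb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
        a += na;                        /* advance by the multibyte length */
        b += nb;
    }
}
```

Note that both the decoding (mbrtowc) and the case mapping (towlower) consult
the current locale, which is exactly why none of this works unless the locale
supports the encoding in use.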
But this also means you cannot assume anything about the current character
set, and cannot assume that it is transparent w.r.t. the current string
handling in zsh.
> Unicode characters are up to 6 bytes in UTF-8, so either with 64-bit
> integers we can do a direct comparison or some bit arithmetic, or we can
> just use strncmp. (I don't fancy relying on internationalisation support
> for this, but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes. Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
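A sketch of the byte-counting scheme described above, using the original
1-to-6-byte definition of UTF-8 (RFC 2279, current when this was written).
No decoding to a code point is needed, only the lead byte; and an overlong
form such as 0xC0 0xAF for '/' still differs byte-for-byte from the canonical
0x2F, so byte-wise comparison tells them apart:

```c
#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, determined from the lead
 * byte alone.  Returns 0 for a byte that cannot start a sequence. */
size_t utf8_seqlen(unsigned char lead)
{
    if (lead < 0x80) return 1;          /* 0xxxxxxx: ASCII */
    if (lead < 0xc0) return 0;          /* 10xxxxxx: continuation byte */
    if (lead < 0xe0) return 2;          /* 110xxxxx */
    if (lead < 0xf0) return 3;          /* 1110xxxx */
    if (lead < 0xf8) return 4;          /* 11110xxx */
    if (lead < 0xfc) return 5;          /* 111110xx */
    if (lead < 0xfe) return 6;          /* 1111110x */
    return 0;                           /* 0xfe/0xff never appear in UTF-8 */
}
```

(RFC 3629 later capped sequences at 4 bytes, but the lead-byte logic is the
same.)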
>
How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character -
you can only read a wide character.
> Probably we need a configuration option to switch this on or off.
>
Yes; either we rely on standard locale support (and do not care what
character set is being used), or we must provide some out-of-band means to
define the character set in use.
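If we do rely on standard locale support, the closest thing to a standard
query is nl_langinfo(CODESET) from XPG4 -- though how meaningful the returned
name is varies by system, which is part of the argument for an out-of-band
definition (current_codeset is a hypothetical wrapper):

```c
#define _XOPEN_SOURCE 700
#include <langinfo.h>
#include <locale.h>

/* Returns the character-set name of the current locale, e.g.
 * "UTF-8" or "ISO-8859-1".  The set of names is not standardised,
 * so matching against them is inherently fragile. */
const char *current_codeset(void)
{
    setlocale(LC_ALL, "");              /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}
```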
> Zle might be a bit more of a problem. The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver. So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
> from the locale
Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.
> - how UTF-8 encoded characters interfere with meta-bindings. May be
> good enough simply not to use these, at least while we work out what's
> what
> - reading multi-byte characters --- timeouts and the like
Use the standard OS interfaces to read wide characters.
> - getting the right length for displaying, deleting, copying
> etc. multi-byte characters. Apart from counting continuation
> bytes, we may be stuck with using wcwidth for display. This is a pain
> because it involves explicit wchar_t's, and I have no experience at
> all with these (except that they mess up compilation of otherwise
> trivial string-handling functions).
> - all the stuff I've forgotten.
>
> Any comments?
>
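On the display-width point above, a sketch of what wcwidth-based counting
looks like -- it counts terminal columns, not bytes or characters
(string_display_width is a hypothetical helper; assumes setlocale() has been
called):

```c
#define _XOPEN_SOURCE 700
#include <wchar.h>

/* Total number of terminal columns needed to display a wide string.
 * Returns -1 if any character is non-printable in the current
 * locale (wcwidth() reports those as -1). */
int string_display_width(const wchar_t *ws)
{
    int total = 0;
    for (; *ws; ws++) {
        int w = wcwidth(*ws);
        if (w < 0)
            return -1;                  /* non-printable character */
        total += w;                     /* 0, 1 or 2 columns per char */
    }
    return total;
}
```

This is where the explicit wchar_t's the quoted mail complains about become
unavoidable: there is no byte-oriented equivalent of wcwidth.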
-andrey