Zsh Mailing List Archive
RE: UTF-8 fonts
- X-seq: zsh-workers 17729
- From: Borzenkov Andrey <Andrej.Borsenkow@xxxxxxxxxxxxxx>
- To: "'Zsh hackers list'" <zsh-workers@xxxxxxxxxx>
- Subject: RE: UTF-8 fonts
- Date: Wed, 25 Sep 2002 15:11:39 +0400
- Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
Just to make it clear: is the aim to use UTF-8 internally, or to support
(arbitrary) multibyte encodings?
> See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
> the subject.
>
> My first thought about using UTF-8 instead of eight bit characters
This sounds like you want to convert input to UTF-8 internally?
> was
> that we would have to replace the current `Meta' system. However, I
> don't think we do since the current system will seamlessly translate
> from UTF-8 input to UTF-8 output.
>
> Therefore, all we have to do is modify the shell's internals at the
> point where it actually compares characters --- or, more generally,
> tries to turn metafied sequences into a single character --- to use the
> normal UTF8 rules. There may also be some extra places where counting
> the length needs changing.
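For reference, the `Meta' system mentioned above can be sketched roughly like
this -- the marker value and the "special" byte range here are illustrative,
not copied from zsh's source. The point is that metafication is a reversible
byte-level transform, so UTF-8 byte sequences pass through it unchanged in
meaning:

```c
#include <stddef.h>

#define Meta 0x83                       /* illustrative marker byte */

/* Treat bytes 0x80..0x9f as "special" for this sketch. */
static int needs_meta(unsigned char c)
{
    return c >= 0x80 && c <= 0x9f;
}

/* Metafy src (n bytes) into dst; returns the metafied length.
 * dst must be able to hold up to 2*n bytes. */
size_t metafy(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (needs_meta(src[i])) {
            dst[j++] = Meta;
            dst[j++] = src[i] ^ 32;     /* shift out of the special range */
        } else
            dst[j++] = src[i];
    }
    return j;
}

/* Undo the transform; returns the unmetafied length. */
size_t unmetafy(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] == Meta && i + 1 < n)
            dst[j++] = src[++i] ^ 32;
        else
            dst[j++] = src[i];
    }
    return j;
}
```

Since the round trip is exact, the shell can carry UTF-8 bytes through its
internals without caring what they encode -- which is the "seamless
translation" the quoted text refers to.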
>
You also need to modify any place where the shell compares or translates
(upper <-> lower) characters. That is by definition locale dependent -
collating order is different in different languages even when they use the
same character set. This means you can use UTF-8 (or, more generally, any
multibyte encoding) only if your current locale supports it, which in effect
means using the wc* and mb* function suites anyway.
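A sketch of what that looks like in practice: a locale-aware case-insensitive
comparison built on mbrtowc() and towlower(). This is not zsh code --
mb_casecmp is a hypothetical helper, and it assumes setlocale() has already
been called:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Compare two multibyte strings case-insensitively in the current
 * locale.  Returns <0, 0, >0 like strcmp, or -2 on an invalid
 * multibyte sequence. */
int mb_casecmp(const char *a, const char *b)
{
    mbstate_t sa, sb;
    memset(&sa, 0, sizeof sa);
    memset(&sb, 0, sizeof sb);
    for (;;) {
        wchar_t wa, wb;
        size_t na = mbrtowc(&wa, a, MB_CUR_MAX, &sa);
        size_t nb = mbrtowc(&wb, b, MB_CUR_MAX, &sb);
        if (na == (size_t)-1 || na == (size_t)-2 ||
            nb == (size_t)-1 || nb == (size_t)-2)
            return -2;                  /* invalid sequence in this locale */
        if (na == 0 && nb == 0)
            return 0;                   /* both strings ended together */
        if (na == 0) return -1;         /* a is a prefix of b */
        if (nb == 0) return 1;
        wa = towlower((wint_t)wa);      /* locale-dependent case mapping */
        wb = towlower((wint_t)wb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
        a += na;                        /* advance by the multibyte length */
        b += nb;
    }
}
```

Note that both the decoding (mbrtowc) and the case mapping (towlower) consult
the current locale, which is exactly why none of this works unless the locale
supports the encoding in use.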
But this also means you cannot assume anything about the current character
set, and cannot assume that it is transparent w.r.t. the current string
handling in zsh.
> Unicode characters are up to 6 bytes in UTF-8, so either with 64-bit
> integers we can do a direct comparison or some bit arithmetic, or we can
> just use strncmp. (I don't fancy relying on internationalisation support
> for this, but in principle that's probably the right thing to do.)
> Hence I don't see the necessity for actually decoding UTF-8 into Unicode
> at any point, just deciding the number of bytes. Not doing this avoids
> problems with overlong encodings (ones which illegally represent a
> character using too many bytes): an overlong encoding will always
> compare differently to the standard encoding.
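A sketch of the byte-counting scheme described above, using the original
1-to-6-byte definition of UTF-8 (RFC 2279, current when this was written).
No decoding to a code point is needed, only the lead byte; and an overlong
form such as 0xC0 0xAF for '/' still differs byte-for-byte from the canonical
0x2F, so byte-wise comparison tells them apart:

```c
#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, determined from the lead
 * byte alone.  Returns 0 for a byte that cannot start a sequence. */
size_t utf8_seqlen(unsigned char lead)
{
    if (lead < 0x80) return 1;          /* 0xxxxxxx: ASCII */
    if (lead < 0xc0) return 0;          /* 10xxxxxx: continuation byte */
    if (lead < 0xe0) return 2;          /* 110xxxxx */
    if (lead < 0xf0) return 3;          /* 1110xxxx */
    if (lead < 0xf8) return 4;          /* 11110xxx */
    if (lead < 0xfc) return 5;          /* 111110xx */
    if (lead < 0xfe) return 6;          /* 1111110x */
    return 0;                           /* 0xfe/0xff never appear in UTF-8 */
}
```

(RFC 3629 later capped sequences at 4 bytes, but the lead-byte logic is the
same.)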
>
How do you know your input (and the strings you are processing) are UTF-8?
Besides, the standards do not provide a way to input a multibyte character -
you can only read a wide character.
> Probably we need a configuration option to switch this on or off.
>
Yes; either we rely on standard locale support (and do not care what
character set is being used), or we must provide some out-of-band means to
define the character set in use.
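If we do rely on standard locale support, the closest thing to a standard
query is nl_langinfo(CODESET) from XPG4 -- though how meaningful the returned
name is varies by system, which is part of the argument for an out-of-band
definition (current_codeset is a hypothetical wrapper):

```c
#define _XOPEN_SOURCE 700
#include <langinfo.h>
#include <locale.h>

/* Returns the character-set name of the current locale, e.g.
 * "UTF-8" or "ISO-8859-1".  The set of names is not standardised,
 * so matching against them is inherently fragile. */
const char *current_codeset(void)
{
    setlocale(LC_ALL, "");              /* adopt the environment's locale */
    return nl_langinfo(CODESET);
}
```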
> Zle might be a bit more of a problem. The web page I referred to above
> gives the hopeful message that all encoding to/decoding from UTF-8 at
> the terminal is handled by the terminal driver. So for zle we have to
> worry about things like
> - determining whether the terminal is actually in UTF-8 mode, probably
> from the locale
Impossible. Locale names are just arbitrarily chosen strings; there is no
"character set code" defined in any locale definition, at least on Unix.
> - how UTF-8 encoded characters interfere with meta-bindings. May be
> good enough simply not to use these, at least while we work out what's
> what
> - reading multi-byte characters --- timeouts and the like
Use the standard OS interfaces to read wide characters.
> - getting the right length for displaying, deleting, copying
> etc. multi-byte characters. Apart from counting continuation
> bytes, we may be stuck with using wcwidth for display. This is a pain
> because it involves explicit wchar_t's, and I have no experience at
> all with these (except that they mess up compilation of otherwise
> trivial string-handling functions).
> - all the stuff I've forgotten.
>
> Any comments?
>
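On the display-width point above, a sketch of what wcwidth-based counting
looks like -- it counts terminal columns, not bytes or characters
(string_display_width is a hypothetical helper; assumes setlocale() has been
called):

```c
#define _XOPEN_SOURCE 700
#include <wchar.h>

/* Total number of terminal columns needed to display a wide string.
 * Returns -1 if any character is non-printable in the current
 * locale (wcwidth() reports those as -1). */
int string_display_width(const wchar_t *ws)
{
    int total = 0;
    for (; *ws; ws++) {
        int w = wcwidth(*ws);
        if (w < 0)
            return -1;                  /* non-printable character */
        total += w;                     /* 0, 1 or 2 columns per char */
    }
    return total;
}
```

This is where the explicit wchar_t's the quoted mail complains about become
unavoidable: there is no byte-oriented equivalent of wcwidth.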
-andrey