Zsh Mailing List Archive

UTF-8 fonts

X-seq: zsh-workers 17712
From: Peter Stephenson <pws@xxxxxxx>
To: zsh-workers@xxxxxxxxxx (Zsh hackers list)
Subject: UTF-8 fonts
Date: Thu, 19 Sep 2002 17:56:57 +0100
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for a nice summary of
the subject.

My first thought about using UTF-8 instead of eight bit characters was
that we would have to replace the current `Meta' system.  However, I
don't think we do since the current system will seamlessly translate
from UTF-8 input to UTF-8 output.

Therefore, all we have to do is modify the shell's internals at the
point where it actually compares characters --- or, more generally,
tries to turn metafied sequences into a single character --- to use the
normal UTF-8 rules.  There may also be some extra places where counting
the length needs changing.
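
As a rough illustration of the length-counting change, here is a minimal
sketch in C.  It assumes the string has already been unmetafied and is
well-formed UTF-8; the function names are hypothetical and are not taken
from the zsh source.  The idea is only that every byte which is not a
continuation byte (bit pattern 10xxxxxx) starts a new character.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: count the characters in a NUL-terminated,
 * unmetafied UTF-8 string by counting every byte that is not a
 * continuation byte.  No validation is attempted.
 */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (!is_continuation((unsigned char)*s))
            n++;
    return n;
}
```

This gives the number of characters, not the number of screen columns,
so it would not be enough on its own for display purposes.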

UTF-8 encoded characters are up to 6 bytes, so either with 64-bit
integers we can do a direct comparison using some bit arithmetic, or we
can just use strncmp.  (I don't fancy relying on internationalisation
support for this, but in principle that's probably the right thing to do.)
Hence I don't see the necessity for actually decoding UTF-8 into Unicode
at any point, just deciding the number of bytes.  Not doing this avoids
problems with overlong encodings (ones which illegally represent a
character using too many bytes): an overlong encoding will always
compare differently to the standard encoding.
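
A minimal sketch of that comparison in C might look like the following.
It assumes the 6-byte form of UTF-8 mentioned above and an unmetafied
input; utf8_seq_len and utf8_pack are hypothetical names, not existing
zsh functions.  The sequence length is read from the lead byte alone,
and the raw bytes are packed into a 64-bit integer without being decoded
to a code point, so two characters are equal exactly when their packed
values are equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sequence length implied by the lead byte, using the original
 * scheme of up to 6 bytes.  Returns 0 for a continuation byte or
 * for the bytes 0xFE and 0xFF, which never start a sequence.
 */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    if ((lead & 0xFC) == 0xF8) return 5;
    if ((lead & 0xFE) == 0xFC) return 6;
    return 0;
}

/*
 * Pack the raw bytes of the sequence starting at s into one 64-bit
 * integer, without decoding them.  Returns 0 for the terminating NUL
 * and also for a malformed or truncated sequence.
 */
static uint64_t utf8_pack(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint64_t v;

    assert(s != NULL);
    len = utf8_seq_len(p[0]);
    if (len == 0)
        return 0;
    v = p[0];
    for (i = 1; i < len; i++) {
        /* A NUL terminator fails this test too, so we never overrun. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 8) | p[i];
    }
    return v;
}
```

With this, the overlong two-byte form 0xC0 0xAF packs to 0xC0AF while a
plain "/" packs to 0x2F, so the two compare as different without any
explicit check for overlong encodings.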

Probably we need a configuration option to switch this on or off.

Zle might be a bit more of a problem.  The web page I referred to above
gives the hopeful message that all encoding to/decoding from UTF-8 at
the terminal is handled by the terminal driver.  So for zle we have to
worry about things like
- determining whether the terminal is actually in UTF-8 mode, probably
  from the locale
- how UTF-8 encoded characters interfere with meta-bindings.  May be
  good enough simply not to use these, at least while we work out what's
  what
- reading multi-byte characters --- timeouts and the like
- getting the right length for displaying, deleting, copying
  etc. multi-byte characters.  Apart from counting continuation
  bytes, we may be stuck with using wcwidth for display.  This is a pain
  because it involves explicit wchar_t's, and I have no experience at
  all with these (except that they mess up compilation of otherwise trivial
  string-handling functions).
- all the stuff I've forgotten.
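
For the first item in the list, one possible approach is sketched below
in C.  It assumes a POSIX system that provides nl_langinfo(CODESET); the
two function names are hypothetical, and the set of spellings accepted
for the codeset name is an assumption, since systems differ in how they
report it.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/*
 * Hypothetical helper: does this codeset name denote UTF-8?
 * Accepts "UTF-8" and "UTF8" in any letter case (an assumption
 * about the spellings a system may report).
 */
static int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "UTF8") == 0;
}

/*
 * Select LC_CTYPE from the environment, then ask which character
 * encoding that locale uses.  If setlocale fails, the locale is left
 * unchanged and the codeset of the current locale is tested.
 */
static int locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

This only reports what the locale claims; it cannot confirm that the
terminal itself is really in UTF-8 mode.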

Any comments?

-- 
Peter Stephenson <pws@xxxxxxx>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070


Follow-Ups:
- Re: UTF-8 fonts
  - From: Oliver Kiddle
- Re: UTF-8 fonts
  - From: Clint Adams
