Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: UTF-8 fonts

X-seq: zsh-workers 17732
From: Oliver Kiddle <okiddle@xxxxxxxxxxx>
To: Peter Stephenson <pws@xxxxxxx>
Subject: Re: UTF-8 fonts
Date: Wed, 25 Sep 2002 18:29:35 +0100
Cc: zsh-workers@xxxxxxxxxx (Zsh hackers list)
In-reply-to: <10303.1032953780@xxxxxxx>
Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
References: <10303.1032953780@xxxxxxx>
Sender: kiddleo@xxxxxxxxxx

On 25 Sep, Peter Stephenson wrote:
> Borzenkov Andrey wrote:
> > Just to make it clear. Is the aim to use UTF-8 internally or to support
> > (arbitrary) multibyte encoding?
> 
> The first with as much of the second as we can get in without too much

So is your aim to use UTF-8 internally in all cases or only when it is
the selected character set? I would have thought it would be easier to
just use whatever LC_CTYPE (the locale's selected encoding) is
internally and use the mb* functions so things work regardless of
whether or not LC_CTYPE is a multi-byte character encoding. I don't
know much about other multi-byte character encodings that can be used
for the input/output locale but I had gathered they at least have the
level of compatibility with basic ASCII that allows you to use ASCII
characters in string literals. To convert everything to UTF-8
internally, you would have to either use iconv or do messy stuff: the
mb* functions deal with whatever LC_CTYPE is and not UTF-8 (unless
that's what LC_CTYPE happens to be of course).
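
To give an idea of what the iconv route involves, here is a minimal sketch. The helper name to_utf8 and its buffer interface are made up for illustration; in practice the source encoding name would presumably come from the current locale. It does not handle a full output buffer or stateful source encodings specially; it just reports failure.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes at `in`, encoded in the
 * character set named `from`, to UTF-8 in `out` (capacity outcap).
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", from);   /* (tocode, fromcode) */
    char *inp = (char *)in;   /* iconv() wants char **, but does not modify the input */
    char *outp = out;
    size_t outleft = outcap;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;    /* conversion not supported */

    rc = iconv(cd, &inp, &inlen, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;    /* invalid input or output buffer too small */
    return outcap - outleft;
}
```

Every string entering or leaving the shell would need a pass through something like this, which is the overhead I would rather avoid.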

> We are going to assume that bytes without the top-bit set are ASCII, and
> the remainder require mb* handling.

Isn't it easier to just do mb* handling on everything and not go around
checking the top bit? The mb*() functions should do that sort of stuff
for us. mbrtowc() can be used, discarding the resulting wchar_t, to
consume one character of a string, for example. It then worries about
the top bit of each byte, or whatever else the underlying multi-byte
character encoding requires.

> > Impossible. Locale names are just arbitrarily chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.

As has been mentioned: nl_langinfo(CODESET)
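
For reference, a minimal sketch of that call; the wrapper name current_codeset is made up for illustration. The exact strings returned vary between systems, so I wouldn't rely on any particular spelling.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical wrapper: name of the character encoding used by the
 * current LC_CTYPE locale, as reported by the C library. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```

A program would call setlocale(LC_CTYPE, "") once at startup so that this reflects the user's environment rather than the default "C" locale.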

> Read the document at the link I gave which suggests otherwise.  However,
> I now think we can in any case leave this to the mb* suite to decide.

Yes, I think we can.

I'm sure you can all use Google, but other possibly useful links I had
in my bookmarks are these:

  IBM's patches to various GNU stuff:
    https://www-124.ibm.com/developer/opensource/linux/patches/i18n/
  IBM article that serves as a basic intro:
    http://www-106.ibm.com/developerworks/library/l-linuni.html
  Unicode HOWTO:
    http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html

Oliver

