Zsh Mailing List Archive
Re: UTF-8 fonts
- X-seq: zsh-workers 17732
- From: Oliver Kiddle <okiddle@xxxxxxxxxxx>
- To: Peter Stephenson <pws@xxxxxxx>
- Subject: Re: UTF-8 fonts
- Date: Wed, 25 Sep 2002 18:29:35 +0100
- Cc: zsh-workers@xxxxxxxxxx (Zsh hackers list)
- In-reply-to: <10303.1032953780@xxxxxxx>
- Mailing-list: contact zsh-workers-help@xxxxxxxxxx; run by ezmlm
- References: <10303.1032953780@xxxxxxx>
- Sender: kiddleo@xxxxxxxxxx
On 25 Sep, Peter Stephenson wrote:
> Borzenkov Andrey wrote:
> > Just to make it clear. Is the aim to use UTF-8 internally or to support
> > (arbitrary) multibyte encoding?
>
> The first with as much of the second as we can get in without too much
So is your aim to use UTF-8 internally in all cases or only when it is
the selected character set? I would have thought it would be easier to
just use whatever LC_CTYPE (the locale's selected encoding) is
internally and use the mb* functions so things work regardless of
whether or not LC_CTYPE is a multi-byte character encoding. I don't
know much about other multi-byte character encodings that can be used
for the input/output locale but I had gathered they at least have the
level of compatibility with basic ASCII that allows you to use ASCII
characters in string literals. To convert everything to UTF-8
internally, you would have to either use iconv or do messy stuff: the
mb* functions deal with whatever LC_CTYPE is and not UTF-8 (unless
that's what LC_CTYPE happens to be of course).
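To illustrate the iconv route mentioned above, here is a rough sketch and nothing more. The helper name to_utf8 and its interface are made up for this example and are not anything in zsh; it assumes the system iconv knows the source encoding by the name passed in, and it makes no attempt to handle output buffers that are too small beyond reporting failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not zsh code): convert inlen bytes of `in`,
 * encoded as from_code, into UTF-8 in `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t to_utf8(const char *from_code, const char *in, size_t inlen,
               char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes char **, but does not modify the input */
    char *outp = out;
    size_t outleft = outsize;
    size_t r;

    assert(in != NULL && out != NULL);

    cd = iconv_open("UTF-8", from_code);   /* arguments are (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    r = iconv(cd, &inp, &inlen, &outp, &outleft);
    if (r != (size_t)-1)
        /* Flush any pending shift state; matters for stateful encodings. */
        r = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (r == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

In practice from_code would presumably come from nl_langinfo(CODESET) rather than being hard-wired, and every string crossing the input/output boundary would need this treatment, which is the "messy" part.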
> We are going to assume that bytes without the top-bit set are ASCII, and
> the remainder require mb* handling.
Isn't it easier to just do mb* handling on everything and not go around
checking the top bit? The mb*() functions should do that sort of stuff
for us. mbrtowc() can be used, discarding the returned wchar_t, to
consume one character of a string, for example. That way it worries
about the top bit of the bytes, or whatever else the underlying
multi-byte character encoding requires.
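As a minimal sketch of what I mean (count_chars is a made-up name for this example, not a zsh function), the loop below steps through a string one character at a time by passing NULL as the wchar_t pointer, so the converted character is discarded and only the byte length is used. It assumes setlocale(LC_CTYPE, "") has already been called so that the mb* functions follow the user's locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not zsh code): count the characters in s
 * according to the current LC_CTYPE encoding.
 * Returns (size_t)-1 on an invalid or incomplete sequence. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t left, count = 0;

    assert(s != NULL);
    left = strlen(s);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    while (left > 0) {
        /* NULL as the first argument discards the wide character;
         * the return value is the number of bytes consumed. */
        size_t r = mbrtowc(NULL, s, left, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (r == 0)
            break;                      /* embedded null; cannot happen with strlen */
        s += r;
        left -= r;
        count++;
    }
    return count;
}
```

Nothing here looks at the top bit: in a single-byte locale mbrtowc() just consumes one byte each time, and in a multi-byte one it consumes however many the encoding needs.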
> > Impossible. Local names are just arbitrary chosen strings; there is no
> > "character set code" defined in any locale definition, at least on Unix.
As has been mentioned: nl_langinfo(CODESET)
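For what it's worth, a small sketch of that call (current_codeset is a made-up wrapper name, not zsh code). It assumes the program has called setlocale(LC_CTYPE, "") at startup; without that you get the codeset of the "C" locale, and the exact name returned varies between systems.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper (not zsh code): return the name of the character
 * encoding for the current LC_CTYPE locale, e.g. "UTF-8".
 * The string is owned by the C library and may be overwritten by later
 * calls to nl_langinfo() or setlocale(), so copy it if it must persist. */
const char *current_codeset(void)
{
    const char *cs = nl_langinfo(CODESET);

    assert(cs != NULL);
    return cs;
}

/* Convenience check; the spelling "UTF-8" is what glibc reports, and
 * other systems may spell it differently. */
int codeset_is_utf8(void)
{
    return strcmp(current_codeset(), "UTF-8") == 0;
}
```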
> Read the document at the link I gave which suggests otherwise. However,
> I now think we can in any case leave this to the mb* suite to decide.
Yes, I think we can.
I'm sure you can all use Google, but other possibly useful links I had
in my bookmarks are these:
IBM's patches to various GNU stuff:
https://www-124.ibm.com/developer/opensource/linux/patches/i18n/
IBM article that serves as a basic intro:
http://www-106.ibm.com/developerworks/library/l-linuni.html
The Unicode HOWTO:
http://www.tldp.org/HOWTO/Unicode-HOWTO-6.html
Oliver