Zsh Mailing List Archive

Re: Unicode support in Zle



On 30 Apr, Phillip Vandry wrote:

> One difference between what's suggested there and what I am doing is that
> I chose not to use the libc/locale functions such as wcwidth() and
> mblen(). It is debatable whether I should have, but I did this for a
> couple of reasons:

The main advantage of the libc functions is that they work for
multi-byte encodings other than UTF-8. They also do a lot of work for
you, but don't let that stop you reproducing it if you want.

> - To enable the functionality to work on systems where Unicode is not
> handled at all in the system's libc & associated libraries. I still

On the basis that such systems won't have xterms, filesystems or
anything else that handles UTF-8, I'm sceptical about the value of that.

> use lots of older systems that run things like Solaris 2.5.1. These
> wouldn't be able to support it if I depended on the libraries. I
> will use the locale information from the environment as a hint to
> turn on UTF-8 mode, but you can also do it manually (currently by
> typing "setopt utf8"). The alternative to using libc functions is
> to use glib functions, but I don't really want to add glib to the soup.

I'd agree that adding glib into the mix would not be what we want. I'm
not sure that a utf8 option achieves anything. Assigning to LC_CTYPE
ought to be sufficient.

I'd add a --disable-multibyte option to configure to cut out the
multibyte support, though.

> - To convince myself that the handling of overlong UTF-8 encodings is
> handled securely to my satisfaction. Encoding a character in UTF-8
> with an overlong encoding can be a security problem (example:
> software attempts to purify filenames by stripping slashes and other
> special characters but misses [0xc0 0xaf], an overlong encoding of
> the slash character in UTF-8).

Would zsh code actually do the encoding anywhere as opposed to getting
it from the terminal or wherever else? I can't particularly think of an
example where an encoding wouldn't have come from an input somewhere.
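
For reference, the overlong case is a decoding-side check. A minimal
sketch of what such a check might look like follows; utf8_decode is a
hypothetical name, not anything in zsh, and it only handles sequences of
up to four bytes. It does not reject surrogates or values above 0x10ffff.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of zsh: decode one UTF-8 sequence from
 * s (at most n bytes available) into *out.  Returns the number of bytes
 * consumed, or 0 if the sequence is truncated, malformed or overlong.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* smallest value that legitimately needs 1, 2, 3 or 4 bytes */
    static const unsigned long min_val[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long val;
    size_t len, i;

    if (n == 0)
	return 0;
    if (s[0] < 0x80) {
	len = 1; val = s[0];
    } else if ((s[0] & 0xe0) == 0xc0) {
	len = 2; val = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
	len = 3; val = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
	len = 4; val = s[0] & 0x07;
    } else
	return 0;	/* stray continuation or unsupported lead byte */
    if (len > n)
	return 0;	/* truncated */
    for (i = 1; i < len; i++) {
	if ((s[i] & 0xc0) != 0x80)
	    return 0;	/* not a continuation byte */
	val = (val << 6) | (s[i] & 0x3f);
    }
    if (val < min_val[len])
	return 0;	/* overlong, e.g. [0xc0 0xaf] for '/' */
    *out = val;
    return len;
}
```

With this, the two-byte sequence [0xc0 0xaf] is refused rather than
being decoded to a slash, while a plain 0x2f still decodes normally.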

> - Both the function to calculate the length in bytes of a UTF-8 character
> and its Unicode value and the function to guess whether a character
> occupies a double width cell are easy enough to implement in under
> 30 lines of code each.

Do these functions map fairly closely onto the libc equivalents? If so,
we could perhaps apply them on systems where configure doesn't find
functions like wctomb in libc. Systems with wctomb and friends would
then get a little less bloat and support for other multi-byte encodings.
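
As a rough illustration of that fallback arrangement, here is a sketch
for the width function. HAVE_WCWIDTH is an assumed configure macro and
char_width a hypothetical name. The ranges in the fallback are only an
approximate guess at the East Asian wide characters, and it makes no
attempt at zero-width combining characters, which wcwidth() does handle.

```c
#ifdef HAVE_WCWIDTH		/* assumed to be defined by configure */
# define _XOPEN_SOURCE 700	/* so that <wchar.h> declares wcwidth() */
# include <wchar.h>
#endif

/*
 * Hypothetical wrapper, not zsh code: number of terminal cells occupied
 * by the character c.
 */
int
char_width(unsigned long c)
{
#ifdef HAVE_WCWIDTH
    /* libc knows the current locale; treat unprintables as one cell */
    int w = wcwidth((wchar_t)c);
    return w < 0 ? 1 : w;
#else
    /* Fallback: assume c is a Unicode value and guess at wide ranges */
    if ((c >= 0x1100 && c <= 0x115f) ||		/* Hangul Jamo */
	(c >= 0x2e80 && c <= 0xa4cf && c != 0x303f) ||	/* CJK */
	(c >= 0xac00 && c <= 0xd7a3) ||		/* Hangul syllables */
	(c >= 0xf900 && c <= 0xfaff) ||		/* CJK compatibility */
	(c >= 0xfe30 && c <= 0xfe6f) ||		/* CJK compat. forms */
	(c >= 0xff00 && c <= 0xff60) ||		/* fullwidth forms */
	(c >= 0xffe0 && c <= 0xffe6) ||
	(c >= 0x20000 && c <= 0x3fffd))		/* supplementary CJK */
	return 2;
    return 1;
#endif
}
```

Callers would only ever see char_width(), so which branch was compiled
in would be invisible to the rest of the code.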

Besides these comments, this all sounds very good. I look forward to
hearing about further progress.

Oliver

PS. Just in case it is any use to you, I've attached UCS4 to UTF-8
conversion code which I meant to put into the \u/\U code as a fallback
for systems like Solaris 8. I had to do a bit of searching to find
examples of this that were not GPL'd.

#  if defined(HAVE_NL_LANGINFO) && defined(CODESET)
		if (!strcmp(nl_langinfo(CODESET), "UTF-8")) {
		    int len;

		    /* number of bytes needed to encode wval */
		    if (wval < 0x80)
			len = 1;
		    else if (wval < 0x800)
			len = 2;
		    else if (wval < 0x10000)
			len = 3;
		    else if (wval < 0x200000)
			len = 4;
		    else if (wval < 0x4000000)
			len = 5;
		    else
			len = 6;

		    switch (len) { /* falls through except to the last case */
		    case 6: t[5] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 5: t[4] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 4: t[3] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 3: t[2] = (wval & 0x3f) | 0x80; wval >>= 6;
		    case 2: t[1] = (wval & 0x3f) | 0x80; wval >>= 6;
			/* lead byte: top len bits set, then what is left of wval */
			*t = wval | ((0xfc << (6 - len)) & 0xfc);
			break;
		    case 1: *t = wval;
		    }
		    t += len;
		    continue;
		}
#  endif
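
The fragment above sits inside a loop and so won't compile on its own.
For trying it out in isolation, a standalone sketch of the same
conversion might look like the following. ucs4_to_utf8 is a hypothetical
name, the nl_langinfo() check is left out, and the switch is replaced by
an equivalent loop.

```c
#include <stddef.h>

/*
 * Hypothetical standalone version of the fragment above: write the
 * UTF-8 form of wval to t, which must have room for 6 bytes, and return
 * the number of bytes written.  This keeps the original 1 to 6 byte
 * scheme, so it is only meaningful for values up to 0x7fffffff.
 */
size_t
ucs4_to_utf8(unsigned long wval, unsigned char *t)
{
    size_t len, i;

    if (wval < 0x80) {
	t[0] = (unsigned char)wval;
	return 1;
    } else if (wval < 0x800)
	len = 2;
    else if (wval < 0x10000)
	len = 3;
    else if (wval < 0x200000)
	len = 4;
    else if (wval < 0x4000000)
	len = 5;
    else
	len = 6;

    /* continuation bytes, last one first, six bits at a time */
    for (i = len - 1; i > 0; i--) {
	t[i] = (unsigned char)((wval & 0x3f) | 0x80);
	wval >>= 6;
    }
    /* lead byte: top len bits set, followed by the remaining bits */
    t[0] = (unsigned char)(wval | ((0xfc << (6 - len)) & 0xfc));
    return len;
}
```

For example, passing 0x20ac (the euro sign) should fill the buffer with
the three bytes 0xe2 0x82 0xac.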



