Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Three-byte UTF-8 chars and $functions

X-seq: zsh-users 20234
From: Andrew Janke <janke@xxxxxxxxx>
To: Zsh Users <zsh-users@xxxxxxx>
Subject: Three-byte UTF-8 chars and $functions
Date: Thu, 28 May 2015 18:05:35 -0400
List-help: <mailto:zsh-users-help@zsh.org>
List-id: Zsh Users List <zsh-users.zsh.org>
List-post: <mailto:zsh-users@zsh.org>
Mailing-list: contact zsh-users-help@xxxxxxx; run by ezmlm

Hi, ZSH users and workers,

I'm seeing some odd behavior related to multibyte UTF-8 characters that I don't understand. Maybe one of you could help me figure it out? I don't know if it's user error, incorrect behavior expectations on my part, or an actual character handling issue.

The deal: If I define a function that contains a multibyte UTF-8 character that's three bytes long, the function itself and its representation from `which` work as expected. But if I grab the function definition to a string using `str="${functions[funcname]}"`, then that multibyte character seems to get garbled and come back as 3 characters, the bytes of which do not look like the original UTF-8 bytes. The ${#str} operator reports a different string length than I expected.

Is there something going on with character encodings and $functions? This sounds almost like it's getting re-encoded in a different encoding. Or maybe my expectation that ${functions[func]} can be extracted directly to a string is wrong?

Here's a zsh script that will reproduce the problem. This is in Unicode, and this email message should be encoded in UTF-8, but I'm not sure it won't get messed up on the way to the list. This code listing contains exactly one non-ASCII character, the character echoed by function foo(). It is U+25B8. The function should look exactly like "function bar () { echo 'x' }", but with the character "x" replaced by U+25B8 (as a single character, not an escape sequence of some sort).


Thanks in advance for any assistance or info you can provide.


#!/bin/zsh
# weird_unicode_func.zsh
#
# Shows weird behavior of multibyte UTF-8 Unicode characters
# in function definitions exposed through $functions

export LC_ALL=en_US.UTF-8

uname -a
echo zsh $ZSH_VERSION

locale

# Function that echoes U+25B8 BLACK RIGHT-POINTING SMALL TRIANGLE
# (Chosen because that's single-width, at least in Menlo/Meslo)
function foo () { echo '▸' }

# And a function that echoes a normal character
function bar () { echo 'x' }


# I'd think these *should* be the same length
# And they are, for "which": 21 chars each, on OS X
# And their output should be 2 chars each
which foo
echo `which foo | wc -m` chars
echo "foo output: $(foo | wc -m) chars $(foo | wc -c) bytes"
which bar
echo `which bar | wc -m` chars
echo "bar output: $(bar | wc -m) chars $(bar | wc -c) bytes"

# But capture their string representations...
foo_str="${functions[foo]}"
bar_str="${functions[bar]}"

# ...and in those strings, I see foo as 2 chars longer than bar
# (11 chars vs 9 chars)
echo '${functions[foo]} is:'
echo "$foo_str"
echo ${#foo_str} chars
# And wc doesn't even like it; looks like it's invalid UTF-8?
echo -n "$foo_str" | wc -m
echo '${functions[bar]} is:'
echo "$bar_str"
echo ${#bar_str} chars
echo

# Plus, re-evaling it to define another function ends up
# producing weird output when that function is run
eval "function qux { ${functions[foo]} }"
which qux
echo qux output:
qux
echo $(qux | wc -c) bytes
qux | wc -m

# end of script

And here's the output I get, on Mac OS X 10.9.5, using either zsh 5.0.2 (shipped with OS X) or 5.0.7 (installed via Homebrew). If the output gets garbled, know that I'm seeing the triangle as a single character in the output of `foo` and `which foo`, but once it comes out of $functions, I'm seeing "?~?", where "?" is actually the "unsupported/invalid/no-glyph character" placeholder.

I see almost the same behavior on Debian Linux 7 with zsh 4.3.7. zsh's output seems the same, but there, wc (GNU wc 8.13) is happy with the later outputs, seeing them as 4 bytes and 2 chars. On Windows 7/Cygwin with zsh 5.0.7 and GNU wc 8.23, I see the same behavior as on Debian.

I get the same behavior if I don't explicitly assign LC_ALL; I'm just doing that for consistency (and not sure if I should).


I'm running zsh with an empty ~/.zshrc and all the default settings (AFAIK).


eilonwy% zsh weird_unicode_func.zsh

Darwin eilonwy.local 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64

zsh 5.0.2
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
foo () {
    echo '▸'
}
21 chars
foo output:        2 chars        4 bytes
bar () {
    echo 'x'
}
21 chars
bar output:        2 chars        2 bytes
${functions[foo]} is:
    echo '�~�'
11 chars
wc: stdin: Illegal byte sequence
      11
${functions[bar]} is:
    echo 'x'
9 chars

qux () {
    echo '�~�'
}
qux output:
�~�
4 bytes
wc: stdin: Illegal byte sequence
       4
eilonwy%
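
Since the terminal rendering above may not survive the trip through the list, a hex dump would show exactly which bytes change. This is only a minimal sketch, assuming od, tr and sed are available; hex_bytes is a hypothetical helper that is not part of the script above, and the triangle is written as octal escapes so that the listing stays pure ASCII:

```shell
# Hypothetical helper (not part of weird_unicode_func.zsh):
# print stdin as space-separated hex bytes on a single line.
hex_bytes() {
    od -An -tx1 -v | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# U+25B8 in UTF-8 is the three bytes E2 96 B8 (octal 342 226 270).
triangle=$(printf '\342\226\270')

# Reference dump of the intact character.
printf '%s' "$triangle" | hex_bytes; echo
# prints: e2 96 b8

# In zsh, the captured definition could be dumped the same way
# for comparison against the reference bytes:
#   printf '%s' "${functions[foo]}" | hex_bytes
```

Comparing the dump of "${functions[foo]}" against the reference bytes would show whether the middle byte (0x96) is the one being altered.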




Cheers,
Andrew Janke
janke@xxxxxxxxx

Follow-Ups:
- Re: Three-byte UTF-8 chars and $functions
  - From: Peter Stephenson
