Zsh Mailing List Archive
Messages sorted by:
Reverse Date,
Date,
Thread,
Author
Re: Globbing autocorrects misencoded filenames?
On Tue, 19 Jan 2010, waba@xxxxxxx wrote:
> Hello,
>
> While writing a script that deals with files named using an incorrect
> encoding (eg. latin1 on a utf8 system), I came across a strange zsh
> feature:
>
> [waba@waba]~ % zsh -f
> waba% locale
> LANG=POSIX
> LC_CTYPE=fr_BE.utf8
> LC_NUMERIC=POSIX
> LC_TIME=POSIX
> LC_COLLATE=fr_BE.utf8
> LC_MESSAGES=POSIX
> LC_ALL=
> waba% touch $( echo /tmp/àbc |iconv -tlatin1 )
> waba% echo /tmp/àb?
> /tmp/???bc
>
> IOW, I create a file named "àbc" (a-grave b c) using latin1 encoding
> on my utf8 system, but it still matches as utf8 during globbing.
>
> Expected behavior: the first character of that filename is not valid
> utf8 and should not match anything except itself. I expected the shell
> to return "zsh: no matches found: /tmp/àb?".
>
> So,
> 1) Is there a list of the encodings tried by ZSH ? latin1/iso-8859-1 is
> not mentioned anywhere in my configuration...
> 2) Can this feature be disabled at all ? I'm sure that it can be useful
> at the prompt, but it's a nuisance inside my script.
>
Huh, interesting. Just as another data point, a test with some
ISO-2022-JP- and UTF-8-encoded kanji-containing filenames shows that it
does glob the latin-1-ish chars, but not the Chinese chars.
Personally, I wouldn't do this in Zsh (I'm more a perler at heart
anyway, and Perl's Encode module handles all of this reeeeeally well).
But obviously there are larger implications than your script.
I just found the 'u' and 'U' globbing flags in zshall (they need
EXTENDED_GLOB set). Perhaps '(#u)' or '(#U)' is all you need (see the
last examples below), but it still seems odd that it's showing the 'é'
in the filename for me, whether or not it's encoded properly.
Best,
Ben
# run in empty dir -- creates some test files (all one line)
$ perl -Mbytes -C0 -lwe 'open my $f, ">", $_ for map $_."=[".join("",map sprintf("%02x", ord $_), split//)."]", map {; "a${_}a", "b${_}b" } "\xe9","\xc3\xa9", "\xe6\x97\xa5\x0a", "\x1b\x24\x42\x46\x7c\x1b\x28\x42\x0a"'
# show that the files indeed have poorly-encoded names
$ rsync -r --list-only ./ | cut -c44-
.
a\#033$BF|\#033(B\#012a=[611b2442467c1b28420a61]
aéa=[61c3a961]
a日\#012a=[61e697a50a61]
a\#351a=[61e961]
b\#033$BF|\#033(B\#012b=[621b2442467c1b28420a62]
béb=[62c3a962]
b日\#012b=[62e697a50a62]
b\#351b=[62e962]
# zsh globs 'em all:
$ for l in a* ; print -lr "got:$l"
got:aBF|
a=[611b2442467c1b28420a61]
got:aéa=[61c3a961]
got:a日
a=[61e697a50a61]
got:aéa=[61e961]
# zsh globs 'em regardless of Latin-1 correctness
$ for l in aé* ; print -lr "got:$l"
got:aéa=[61c3a961]
got:aéa=[61e961]
# zsh globs only the UTF-8 encoded Japanese filename (w/ Chinese char)
$ for l in a日* ; print -lr "got:$l"
got:a日
a=[61e697a50a61]
# zsh has extended glob options to control this (somewhat?)
$ for l in (#u)aé* ; print -lr "got:$l"
got:aéa=[61c3a961]
got:aéa=[61e961]
$ for l in (#U)aé* ; print -lr "got:$l"
got:aéa=[61c3a961]
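
For the scripting side of the question, one way to sidestep the glob
behaviour entirely might be to test the raw bytes of each name instead of
pattern-matching on them. The following is only a rough sketch in portable
shell, not anything taken from the zsh docs: the helper names
is_valid_utf8 and list_non_utf8_names are made up for illustration, and it
assumes an iconv that exits non-zero on input that is not valid UTF-8.

```shell
#!/bin/sh
# Sketch: flag file names whose bytes are not valid UTF-8.
# is_valid_utf8 and list_non_utf8_names are hypothetical helper names.

# Succeeds (exit 0) when the argument is a well-formed UTF-8 byte sequence.
# Assumption: iconv fails when its input is not valid in the source encoding.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Prints every non-hidden entry of directory $1 whose name is not valid UTF-8.
list_non_utf8_names() {
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue    # skip the literal pattern if nothing matched
        is_valid_utf8 "${entry##*/}" || printf '%s\n' "$entry"
    done
}
```

Something like `list_non_utf8_names /tmp` could then be used to find the
latin1-named file from the original report. Names containing newlines would
be printed across several lines, so this is a starting point rather than a
robust tool.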