Zsh Mailing List Archive

Re: Globbing autocorrects misencoded filenames?



On Tue, 19 Jan 2010, waba@xxxxxxx wrote:

> Hello,
> 
> While writing a script that deals with files named using an incorrect 
> encoding (eg. latin1 on a utf8 system), I came across a strange zsh 
> feature:
> 
>   [waba@waba]~ % zsh -f
>   waba% locale
>   LANG=POSIX
>   LC_CTYPE=fr_BE.utf8
>   LC_NUMERIC=POSIX
>   LC_TIME=POSIX
>   LC_COLLATE=fr_BE.utf8
>   LC_MESSAGES=POSIX
>   LC_ALL=
>   waba% touch $( echo /tmp/àbc |iconv -tlatin1 )
>   waba% echo /tmp/àb?
>   /tmp/???bc
> 
> IOW, I create a file named "àbc" (a-grave b c) using latin1 encoding 
> on my utf8 system, but it still matches as utf8 during globbing.
> 
> Expected behavior: the first character of that filename is not valid 
> utf8 and should not match anything except itself. I expected the shell 
> to return "zsh: no matches found: /tmp/àb?".
> 
> So,
> 1) Is there a list of the encodings tried by ZSH ? latin1/iso-8859-1 is
>    not mentioned anywhere in my configuration...
> 2) Can this feature be disabled at all ? I'm sure that it can be useful
>    at the prompt, but it's a nuisance inside my script.
> 

Huh, interesting.  Just as another data point, a test with some 
ISO-2022-JP- and UTF-8-encoded kanji-containing filenames shows that it 
does glob the latin-1-ish chars, but not the Chinese chars.

Personally, I wouldn't do this in Zsh (I'm more a perler at heart 
anyway, and Perl's Encode module handles all of this reeeeeally well).  
But, obviously there are larger implications than your script.

I just found the 'u' and 'U' globbing flags in zshall.  Perhaps '(#u)' 
or '(#U)' is all you need (see last examples below), but it still seems 
odd that it's showing the 'é' in the filename for me, whether or not 
it's encoded properly.

Best,
Ben

# run in empty dir -- creates some test files (all one line)
$ perl -Mbytes -C0 -lwe 'open my $f, ">", $_ for map $_."=[".join("",map sprintf("%02x", ord $_), split//)."]", map {; "a${_}a", "b${_}b" } "\xe9","\xc3\xa9", "\xe6\x97\xa5\x0a", "\x1b\x24\x42\x46\x7c\x1b\x28\x42\x0a"'

# show that the files indeed have poorly-encoded names
$ rsync -r --list-only ./ | cut -c44-
.
a\#033$BF|\#033(B\#012a=[611b2442467c1b28420a61]
aéa=[61c3a961]
a日\#012a=[61e697a50a61]
a\#351a=[61e961]
b\#033$BF|\#033(B\#012b=[621b2442467c1b28420a62]
béb=[62c3a962]
b日\#012b=[62e697a50a62]
b\#351b=[62e962]
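
A listing like this can also be double-checked without rsync by dumping 
each name's bytes and asking iconv whether they decode as UTF-8.  What 
follows is only a sketch in portable sh: name_hex, is_valid_utf8 and 
report_names are made-up helper names, and it assumes an iconv that exits 
non-zero on undecodable input (GNU iconv does).  Note that the ISO-2022-JP 
names contain only 7-bit bytes, so a check like this accepts them as valid 
UTF-8 even though they are not UTF-8 text.

```shell
# Hypothetical helpers, not part of zsh or rsync.

# Print the bytes of a name as lowercase hex, e.g. 61e961.
# -v stops od from collapsing repeated lines into "*".
name_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# Succeed if the name decodes cleanly as UTF-8.
# Assumes iconv returns a non-zero status on invalid input.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Report every entry of a directory (default: current directory),
# one line per entry: "ok" or "INVALID", then the name in hex.
report_names() {
    for f in "${1:-.}"/*; do
        [ -e "$f" ] || continue
        n=${f##*/}
        if is_valid_utf8 "$n"; then s=ok; else s=INVALID; fi
        printf '%s %s\n' "$s" "$(name_hex "$n")"
    done
}
```

Running report_names in the test directory should print one line per 
file; each hex string starts with the same digits that appear inside the 
[..] suffix of that name.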

# zsh globs 'em all:
$ for l in a* ; print -lr "got:$l" 
got:aBF|
a=[611b2442467c1b28420a61]
got:aéa=[61c3a961]
got:a日
a=[61e697a50a61]
got:aéa=[61e961]

# zsh globs 'em regardless of Latin-1 correctness
$ for l in aé* ; print -lr "got:$l"
got:aéa=[61c3a961]
got:aéa=[61e961]

# zsh globs only the UTF-8 encoded Japanese filename (w/ Chinese char)
$ for l in a日* ; print -lr "got:$l"
got:a日
a=[61e697a50a61]

# zsh has extended glob options to control this (somewhat?)
$ for l in (#u)aé* ; print -lr "got:$l"
got:aéa=[61c3a961]
got:aéa=[61e961]
$ for l in (#U)aé* ; print -lr "got:$l" 
got:aéa=[61c3a961]
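
For the script case in the original question, here is a small sketch of 
forcing byte-wise matching with (#U).  The (#...) globbing flags are only 
recognised when EXTENDED_GLOB is set, so the sketch turns it on inside a 
clean "zsh -f".  zglob is a made-up wrapper name, and zsh is assumed to be 
on PATH.

```shell
# Hypothetical wrapper: expand a zsh pattern inside DIR using a clean zsh.
#   $1 = directory, $2 = pattern, e.g. '(#U)a?a'
# -f skips startup files (as in the original report); extendedglob is
# needed for the (#U)/(#u) flags to be parsed.  The pattern is pasted into
# the zsh command unquoted on purpose so that zsh expands it, so only pass
# trusted text.
zglob() {
    ( cd "$1" && zsh -f -c 'setopt extendedglob; print -rl -- '"$2" )
}
```

With (#U) every character is treated as a single byte, so in a directory 
holding a\351a (Latin-1) and a\303\251a (UTF-8), zglob DIR '(#U)a?a' 
should list only the former and zglob DIR '(#U)a??a' only the latter.  
Unsetting the MULTIBYTE option should have a similar effect for a whole 
script, since (#u) and (#U) are documented as overriding that option.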

