Zsh Mailing List Archive

zsh generates invalid UTF-8 encoding in the history



With Debian's zsh 5.2-5 + some patches, when I execute commands with
some particular Unicode characters, the UTF-8 sequences are rewritten
incorrectly in the history. For instance:

cventin:~> unicode ─
U+2500 BOX DRAWINGS LIGHT HORIZONTAL
UTF-8: e2 94 80 UTF-16BE: 2500 Decimal: &#9472; Octal: \022400
─
Category: So (Symbol, Other)
Unicode block: 2500..257F; Box Drawing
Bidi: ON (Other Neutrals)

But in the history, instead of getting e2 94 80, I get: e2 83 b4 80.
Concerning "e2 83 b4 80":

cventin:~> unicode --fromcp utf-8 -x e283b4
U+20F4  - No such unicode character name in database
UTF-8: e2 83 b4 UTF-16BE: 20f4 Decimal: &#8436; Octal: \020364
⃴ (⃴)
Uppercase: 20F4
Category: Cn (Other, Not Assigned)
Unicode block: 20D0..20FF; Combining Diacritical Marks for Symbols

and the 80 on its own is not a valid UTF-8 sequence.
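The byte-level claim above can be checked independently of the `unicode` tool. The following is a minimal Python sketch, not part of the original report; the helper name `decode_report` is hypothetical. It decodes the expected bytes and the bytes found in the history, and reports where strict UTF-8 decoding fails.

```python
# Bytes typed on the command line versus bytes found in the history file,
# both taken from the report above.
expected = bytes.fromhex("e29480")
observed = bytes.fromhex("e283b480")


def decode_report(data: bytes) -> str:
    """Describe data as a list of code points, or say where decoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        text = data.decode("utf-8")  # strict decoding: raises on invalid input
    except UnicodeDecodeError as exc:
        return f"invalid at byte offset {exc.start}: 0x{data[exc.start]:02x}"
    return "valid: " + " ".join(f"U+{ord(ch):04X}" for ch in text)


print(decode_report(expected))      # valid: U+2500
print(decode_report(observed[:3]))  # valid: U+20F4
print(decode_report(observed))      # invalid at byte offset 3: 0x80

# Plain byte arithmetic on the two sequences: the single byte 0x94 has become
# the pair 0x83 0xb4, and 0xb4 equals 0x94 XOR 0x20. This is an observation
# only; the message itself does not state a cause.
assert observed[2] == expected[1] ^ 0x20
```

The first three bytes of the history sequence decode to the unassigned code point U+20F4, and the trailing 80 is rejected as an invalid start byte, which matches the description above.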

This breaks various tools that process the history (grep, lesspipe,
etc.): first because the expected character is no longer present,
and second because the invalid UTF-8 byte is not regarded as a
character. For instance:

cventin:~> grep -av '^.*$' .zhistory | tail -n 1 | hd
00000000  3a 20 31 34 37 35 36 36  36 34 31 38 3a 30 3b 75  |: 1475666418:0;u|
00000010  6e 69 63 6f 64 65 20 e2  83 b4 80 0a              |nicode .....|
0000001c
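For readers without a UTF-8 locale at hand, the same check as the grep command above (print the lines that do not consist entirely of valid characters) can be sketched in Python. This is an illustration under assumptions, not the reporter's method: the function name `invalid_utf8_lines` is hypothetical, the first sample line reproduces the bytes from the hex dump, and the second sample line is invented to provide a valid entry for contrast.

```python
def invalid_utf8_lines(raw: bytes):
    """Yield (line_number, line_bytes) for each line that is not valid UTF-8.

    Hypothetical helper; line numbers start at 1.
    """
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            yield number, line


# Line 1: bytes from the hex dump above. Line 2: made-up valid entry.
sample = (
    b": 1475666418:0;unicode \xe2\x83\xb4\x80\n"
    b": 1475666419:0;ls\n"
)

for number, line in invalid_utf8_lines(sample):
    # bytes.hex with a separator needs Python 3.8 or later
    print(number, line.hex(" "))
```

To scan a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function; the path of the history file depends on the local zsh configuration.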

-- 
Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


