Zsh Mailing List Archive
Re: multi-byte text decoding error can break word splitting by read at least
- X-seq: zsh-workers 54697
- From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
- To: Mikael Magnusson <mikachu@xxxxxxxxx>
- Cc: Zsh hackers list <zsh-workers@xxxxxxx>
- Subject: Re: multi-byte text decoding error can break word splitting by read at least
- Date: Sun, 7 Jun 2026 10:23:03 +0100
- Archived-at: <https://zsh.org/workers/54697>
- In-reply-to: <CAHYJk3RFi9im87qoeMeVTptOZG=GJ2QCdyvqhqxjOxWJU7kWKw@mail.gmail.com>
- List-id: <zsh-workers.zsh.org>
- Mail-followup-to: Mikael Magnusson <mikachu@xxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
- References: <yzyhuwlykbdtojs6fsbyb6iynwri7pwe3wtk5rsgo52spni5ry@g5o5fzavdboh> <CAHYJk3RFi9im87qoeMeVTptOZG=GJ2QCdyvqhqxjOxWJU7kWKw@mail.gmail.com>
2026-06-06 23:03:03 +0200, Mikael Magnusson:
> On Sun, Apr 27, 2025 at 5:41 PM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
> >
> > There was a recent bug report on the bash mailing list about
> > "read" missing the delimiter when it follows a truncated
> > character, but zsh has similar issues when it comes to doing
> > IFS splitting on the record once it has been read:
> >
> > $ print 'a\302×b' | IFS=× read -rA a; typeset a
> > a=( $'a\M-B×b' )
> >
> > Wasn't split on ×.
> >
> > One might argue that doing reliable word splitting on non-text
> > is illusory anyway, but note that the latest version of the
> > POSIX standard now requires that splitting be done by looking
> > for the byte encodings of the characters in $IFS which would
> > make the behaviour above non-conformant.
> >
> > See https://www.austingroupbugs.net/view.php?id=1920 for some
> > discussion on that though.
[...]
> I think we can probably leave it to users to do this if they want to:
> % sbread() { setopt localoptions nomultibyte; read "$@" }
> % print 'a\302×b' | IFS=× sbread -rA a
> % typeset -p a
> typeset -a a=( $'a\M-B' '' b )
[...]
Thanks (and for fixing those other bugs I reported in the past!),
but that doesn't split on × characters; it splits on any of the
bytes of the encoding of × (in UTF-8: 0xc3 and 0x97):
$ echo 'Stéphane×Chazelas' | IFS=× sbread -rA a; typeset a
a=( St $'\M-)phane' '' Chazelas )
Stéphane was also split, as é (0xc3 0xa9 in UTF-8) also contains the 0xc3 byte.
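One way to see the overlap is to dump the bytes of each character. This is only a sketch in portable sh, assuming the script itself is UTF-8 encoded and that the standard od and tr utilities are available; hex_bytes is a hypothetical helper name, not part of zsh:

```sh
# hex_bytes (hypothetical helper): print the bytes of $1 as hex digits.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

hex_bytes '×'; echo    # c397
hex_bytes 'é'; echo    # c3a9
```

Both encodings start with 0xc3, so a byte-wise IFS that contains 0xc3 cuts Stéphane right after "St".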
From POSIX 2024 (https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/V3_chap02.html#tag_19_06_05):
> The shell shall use the byte sequences that form the
> characters in the value of the IFS variable as delimiters
[...]
> Note that the shell processes arbitrary bytes from the input
> fields; there is no requirement that those bytes form valid
> characters
$IFS itself must contain valid characters or the behaviour is
unspecified:
> If the value of IFS includes any bytes that do not form part
> of a valid character, the results of field splitting,
> expansion of '*', and use of the read utility are unspecified.
In the next Technical Corrigendum, this will be updated per
https://www.austingroupbugs.net/view.php?id=1924#c7196 to
require that behaviour only in locales with a self-synchronising
character encoding such as UTF-8.
So a\302×b with IFS=× is, and will still be, required to be split
into a\302 and b in a UTF-8 locale.
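As a rough illustration of what splitting on the byte sequence of × (rather than on its individual bytes) means, here is a sketch in portable sh. split_fields is a hypothetical helper that handles a single delimiter string only and assumes a UTF-8 encoded script; it is not how zsh or any other shell implements read:

```sh
# split_fields DELIM STRING (hypothetical helper): print one field per line,
# treating DELIM as a whole byte sequence. Runs in a subshell so that the
# LC_ALL=C assignment (byte-wise pattern matching) does not leak out.
split_fields() (
  LC_ALL=C
  delim=$1 rest=$2
  while :; do
    case $rest in
      *"$delim"*)
        # Emit everything before the first delimiter, then drop it.
        printf '%s\n' "${rest%%"$delim"*}"
        rest=${rest#*"$delim"}
        ;;
      *)
        printf '%s\n' "$rest"
        break
        ;;
    esac
  done
)

split_fields '×' 'Stéphane×Chazelas'      # Stéphane, then Chazelas
split_fields '×' "$(printf 'a\302×b')"    # two fields: a\302, then b
```

Because the delimiter is matched as the full 0xc3 0x97 sequence, the 0xc3 inside é and the stray 0xc2 (\302) byte are left alone.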
Note that splitting seems to work fine with the s parameter
expansion flag:
$ str=$'×a\302×b'; a=( "${(A@s[×])str}" ); typeset -p1 a
typeset -a a=(
  ''
  $'a\M-B'
  b
)
--
Stephane