Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: multi-byte text decoding error can break word splitting by read at least

X-seq: zsh-workers 54697
From: Stephane Chazelas <stephane@xxxxxxxxxxxx>
To: Mikael Magnusson <mikachu@xxxxxxxxx>
Cc: Zsh hackers list <zsh-workers@xxxxxxx>
Subject: Re: multi-byte text decoding error can break word splitting by read at least
Date: Sun, 7 Jun 2026 10:23:03 +0100
Archived-at: <https://zsh.org/workers/54697>
In-reply-to: <CAHYJk3RFi9im87qoeMeVTptOZG=GJ2QCdyvqhqxjOxWJU7kWKw@mail.gmail.com>
List-id: <zsh-workers.zsh.org>
Mail-followup-to: Mikael Magnusson <mikachu@xxxxxxxxx>, Zsh hackers list <zsh-workers@xxxxxxx>
References: <yzyhuwlykbdtojs6fsbyb6iynwri7pwe3wtk5rsgo52spni5ry@g5o5fzavdboh> <CAHYJk3RFi9im87qoeMeVTptOZG=GJ2QCdyvqhqxjOxWJU7kWKw@mail.gmail.com>

2026-06-06 23:03:03 +0200, Mikael Magnusson:
> On Sun, Apr 27, 2025 at 5:41 PM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
> >
> > There was a recent bug report on the bash mailing list about
> > "read" missing the delimiter when it follows a truncated
> > character, but zsh has similar issues when it comes to doing
> > IFS splitting on the record once it has been read:
> >
> > $ print 'a\302×b' | IFS=× read -rA a; typeset a
> > a=( $'a\M-B×b' )
> >
> > Wasn't split on ×.
> >
> > One might argue that doing reliable word splitting on non-text
> > is illusory anyway, but note that the latest version of the
> > POSIX standard now requires that splitting be done by looking
> > for the byte encodings of the characters in $IFS which would
> > make the behaviour above non-conformant.
> >
> > See https://www.austingroupbugs.net/view.php?id=1920 for some
> > discussion on that though.
[...]
> I think we can probably leave it to users to do this if they want to:
> % sbread() { setopt localoptions nomultibyte; read "$@" }
> % print 'a\302×b' | IFS=× sbread -rA a
> % typeset -p a
> typeset -a a=( $'a\M-B' '' b )
[...]

Thanks (and for fixing those other bugs I reported in the past!),
but that doesn't split on × characters; it splits on any of the
bytes of the encoding of × (in UTF-8: 0xc3 and 0x97):

$ echo 'Stéphane×Chazelas' | IFS=× sbread -rA a; typeset a
a=( St $'\M-)phane' '' Chazelas )

Stéphane was also split, as é (0xc3 0xa9 in UTF-8) also contains
the 0xc3 byte.
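
To make the overlap concrete, here is a small portable-shell sketch
that dumps the UTF-8 encodings of × and é. hex_bytes is a helper name
made up for this illustration, and the characters are written as octal
escapes so the result does not depend on the script's own encoding:

```shell
# hex_bytes (hypothetical helper): print stdin as space-separated hex bytes.
hex_bytes() {
  # Unquoted on purpose: split od's output into one word per byte.
  set -- $(od -An -tx1)
  printf '%s\n' "$*"
}

printf '\303\227' | hex_bytes   # × (U+00D7): c3 97
printf '\303\251' | hex_bytes   # é (U+00E9): c3 a9
```

Both encodings start with 0xc3, which is why a byte-wise IFS
containing 0xc3 and 0x97 also cuts é in half.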

From POSIX 2024 (https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/V3_chap02.html#tag_19_06_05):

> The shell shall use the byte sequences that form the
> characters in the value of the IFS variable as delimiters
[...]
> Note that the shell processes arbitrary bytes from the input
> fields; there is no requirement that those bytes form valid
> characters

$IFS itself must contain valid characters or the behaviour is
unspecified:

> If the value of IFS includes any bytes that do not form part
> of a valid character, the results of field splitting,
> expansion of '*', and use of the read utility are unspecified. 

In the next Technical Corrigendum, that text will be updated per
https://www.austingroupbugs.net/view.php?id=1924#c7196 so as to
require that behaviour only in locales with a self-synchronising
character encoding such as UTF-8.

So a\302×b with IFS=× is and will be required to be split into
a\302 and b in a UTF-8 locale.
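
As a rough illustration of what "splitting on the byte sequence" means
(as opposed to splitting on each individual byte), here is a minimal
portable-shell sketch. split_on_seq is a hypothetical name, it handles
a single delimiter only, and it does not reproduce the full IFS rules
(no whitespace handling, and a trailing delimiter yields an empty last
field). It is not how zsh or any other shell implements read:

```shell
# split_on_seq STRING DELIM (hypothetical helper)
# Print the fields of STRING, one per line, cutting only where the
# complete byte sequence DELIM occurs.
split_on_seq() (
  # Body runs in a subshell, so the locale change stays local.
  LC_ALL=C                     # match bytes, not decoded characters
  [ -n "$2" ] || exit 1        # an empty delimiter would loop forever
  rest=$1 delim=$2
  while :; do
    case $rest in
      *"$delim"*)
        field=${rest%%"$delim"*}   # bytes before the first DELIM
        printf '%s\n' "$field"
        rest=${rest#*"$delim"}     # bytes after the first DELIM
        ;;
      *)
        printf '%s\n' "$rest"      # no DELIM left: last field
        break
        ;;
    esac
  done
)

x=$(printf '\303\227')         # the two bytes of × in UTF-8
e=$(printf '\303\251')         # the two bytes of é in UTF-8
split_on_seq "$(printf 'a\302')${x}b" "$x" | od -An -tx1
split_on_seq "St${e}phane${x}Chazelas" "$x"
```

Because the whole 0xc3 0x97 sequence has to match, the stray 0xc2 byte
stays attached to "a" and the é in Stéphane is left alone.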

Note that splitting seems to work fine with the s parameter
expansion flag:

$ str=$'×a\302×b'; a=( "${(A@s[×])str}" ); typeset -p1 a
typeset -a a=(
  ''
  $'a\M-B'
  b
)

-- 
Stephane

