Zsh Mailing List Archive

Re: multi-byte text decoding error can break word splitting by read at least



2026-06-06 23:03:03 +0200, Mikael Magnusson:
> On Sun, Apr 27, 2025 at 5:41 PM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
> >
> > There was some recent bug report on the bash mailing list about
> > "read" missing the delimiter when it follows a truncated
> > character, but zsh has similar issues when it comes to doing
> > IFS splitting on the record once it has been read:
> >
> > $ print 'a\302×b' | IFS=× read -rA a; typeset a
> > a=( $'a\M-B×b' )
> >
> > Wasn't split on ×.
> >
> > One might argue that doing reliable word splitting on non-text
> > is illusory anyway, but note that the latest version of the
> > POSIX standard now requires that splitting be done by looking
> > for the byte encodings of the characters in $IFS which would
> > make the behaviour above non-conformant.
> >
> > See https://www.austingroupbugs.net/view.php?id=1920 for some
> > discussion on that though.
[...]
> I think we can probably leave it to users to do this if they want to:
> % sbread() { setopt localoptions nomultibyte; read "$@" }
> % print 'a\302×b' | IFS=× sbread -rA a
> % typeset -p a
> typeset -a a=( $'a\M-B' '' b )
[...]

Thanks (and for fixing those other bugs I reported in the past!),
but that doesn't split on × characters; it splits on any of the
bytes of the encoding of × (in UTF-8: 0xc3 and 0x97):

$ echo 'Stéphane×Chazelas' | IFS=× sbread -rA a; typeset a
a=( St $'\M-)phane' '' Chazelas )

Stéphane was also split as é also contains the 0xc3 byte.
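
To make the byte overlap visible, here is a minimal sketch that dumps
the UTF-8 encodings of × and é. The hex_bytes helper is a hypothetical
name introduced only for this illustration, and the characters are
written as octal escapes so the result does not depend on the locale
or the encoding of the script:

```shell
# Hypothetical helper: print the bytes of its argument as contiguous hex digits.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

mult=$(printf '\303\227')      # × (U+00D7) encoded in UTF-8
eacute=$(printf '\303\251')    # é (U+00E9) encoded in UTF-8

hex_bytes "$mult"      # c397
hex_bytes "$eacute"    # c3a9
```

Both encodings start with 0xc3, which is why splitting byte by byte
on the bytes of × also cuts é in two.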

From POSIX 2024 (https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/V3_chap02.html#tag_19_06_05):

> The shell shall use the byte sequences that form the
> characters in the value of the IFS variable as delimiters
[...]
> Note that the shell processes arbitrary bytes from the input
> fields; there is no requirement that those bytes form valid
> characters

$IFS itself must contain valid characters or the behaviour is
unspecified:

> If the value of IFS includes any bytes that do not form part
> of a valid character, the results of field splitting,
> expansion of '*', and use of the read utility are unspecified. 

In the next Technical Corrigendum, that text will be updated per
https://www.austingroupbugs.net/view.php?id=1924#c7196 to require
that behaviour only in locales with a self-synchronising
character encoding such as UTF-8.

So a\302×b with IFS=× is and will be required to be split into
a\302 and b in a UTF-8 locale.
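
As a rough illustration of splitting on the whole byte sequence of a
delimiter rather than on its individual bytes, here is a sketch in
portable sh. The split_on_bytes function is a hypothetical helper, not
something zsh or POSIX provides, and it is not a reimplementation of
IFS splitting: it keeps empty leading and trailing fields, which makes
it closer to the s parameter expansion flag than to read.

```shell
# Hypothetical helper: split $1 on every occurrence of the byte
# sequence $2 and print one field per line.
# The body runs in a subshell so the locale change does not leak out.
split_on_bytes() (
  LC_ALL=C              # match byte-wise, so invalid characters are harmless
  rest=$1 delim=$2
  while :; do
    case $rest in
      *"$delim"*)
        # Emit everything before the first delimiter, then drop it
        # together with that delimiter.
        printf '%s\n' "${rest%%"$delim"*}"
        rest=${rest#*"$delim"}
        ;;
      *)
        # No delimiter left: the remainder is the last field.
        printf '%s\n' "$rest"
        break
        ;;
    esac
  done
)

# a, a stray 0xc2 byte, ×, b; the delimiter is × (0xc3 0x97)
split_on_bytes "$(printf 'a\302\303\227b')" "$(printf '\303\227')" | od -An -c
```

Because the two bytes of × are only matched together, the stray 0xc2
byte stays attached to the first field and an é elsewhere in the input
would be left intact.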

Note that splitting seems to work fine with the s parameter
expansion flag:

$ str=$'×a\302×b'; a=( "${(A@s[×])str}" ); typeset -p1 a
typeset -a a=(
  ''
  $'a\M-B'
  b
)

-- 
Stephane



