Zsh Mailing List Archive

Re: multi-byte text decoding error can break word splitting by read at least



On Sun, Jun 7, 2026 at 11:23 AM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
>
> 2026-06-06 23:03:03 +0200, Mikael Magnusson:
> > On Sun, Apr 27, 2025 at 5:41 PM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
> > >
> > > There was some recent bug report on the bash mailing list about
> > > "read" missing the delimiter when it follows a truncated
> > > character, but zsh has similar issues when it comes to doing
> > > IFS splitting on the record once it has been read:
> > >
> > > $ print 'a\302×b' | IFS=× read -rA a; typeset a
> > > a=( $'a\M-B×b' )
> > >
> > > Wasn't split on ×.
> > >
> > > One might argue that doing reliable word splitting on non-text
> > > is illusory anyway, but note that the latest version of the
> > > POSIX standard now requires that splitting be done by looking
> > > for the byte encodings of the characters in $IFS which would
> > > make the behaviour above non-conformant.
> > >
> > > See https://www.austingroupbugs.net/view.php?id=1920 for some
> > > discussion on that though.
> [...]
> > I think we can probably leave it to users to do this if they want to:
> > % sbread() { setopt localoptions nomultibyte; read "$@" }
> > % print 'a\302×b' | IFS=× sbread -rA a
> > % typeset -p a
> > typeset -a a=( $'a\M-B' '' b )
> [...]
>
> Thanks (and for fixing those other bugs I reported in the past!)
> but then that doesn't split on × characters, but on any of the
> bytes of the encoding of × (in UTF-8: 0xc3 and 0x97).
>
> $ echo 'Stéphane×Chazelas' | IFS=× sbread -rA a; typeset a
> a=( St $'\M-)phane' '' Chazelas )
>
> Stéphane was split as well, because the UTF-8 encoding of é
> (0xc3 0xa9) also contains the 0xc3 byte.
>
> From POSIX 2024 (https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/V3_chap02.html#tag_19_06_05):
>
> > The shell shall use the byte sequences that form the
> > characters in the value of the IFS variable as delimiters
> [...]
> > Note that the shell processes arbitrary bytes from the input
> > fields; there is no requirement that those bytes form valid
> > characters
>
> $IFS itself must contain valid characters or the behaviour is
> unspecified:
>
> > If the value of IFS includes any bytes that do not form part
> > of a valid character, the results of field splitting,
> > expansion of '*', and use of the read utility are unspecified.
>
> In the next Technical Corrigendum, that text will be amended per
> https://www.austingroupbugs.net/view.php?id=1924#c7196 so that
> this behaviour is required only in locales with a
> self-synchronising character encoding such as UTF-8.
>
> So a\302×b with IFS=× is, and will remain, required to be split
> into a\302 and b in a UTF-8 locale.
>
> Note that splitting seems to work fine with the s parameter
> expansion flag:
>
> $ str=$'×a\302×b'; a=( "${(A@s[×])str}" ); typeset -p1 a
> typeset -a a=(
>   ''
>   $'a\M-B'
>   b
> )
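
For illustration, the POSIX wording quoted above (split on the byte
sequence that encodes the IFS character, not on each of its bytes) can
be sketched in portable sh. This is not how zsh's read is implemented:
split_bytes is a hypothetical helper, and the sketch assumes that
setting LC_ALL=C makes the shell's pattern matching byte-wise.

```shell
# split_bytes SEP STRING (hypothetical helper)
# Print the fields of STRING, one per line, splitting on the whole
# byte sequence SEP. Runs in a subshell so LC_ALL=C does not leak.
split_bytes() (
  LC_ALL=C                      # assumption: patterns now match bytes
  sep=$1 rest=$2
  [ -n "$sep" ] || exit 1       # an empty separator would loop forever
  while :; do
    case $rest in
      *"$sep"*)
        printf '%s\n' "${rest%%"$sep"*}"   # text before the first SEP
        rest=${rest#*"$sep"}               # drop that text and the SEP
        ;;
      *)
        printf '%s\n' "$rest"              # last field
        break
        ;;
    esac
  done
)

# \303\227 is the UTF-8 encoding of ×; \302 is a stray lead byte.
split_bytes "$(printf '\303\227')" "$(printf 'a\302\303\227b')" |
  od -An -tx1
```

The od dump shows the two fields as raw bytes: a\302 and b, each
followed by a newline. With the same separator, Stéphane×Chazelas
stays in two pieces, since é shares only its first byte with ×.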

Ah, for some reason I had assumed × was one byte but that's obviously
not the case now that I look again.
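
A quick way to check that, as a small sketch using only printf, od and
tr (hexbytes is a hypothetical helper name; the octal escapes are the
UTF-8 encodings of × and é):

```shell
# hexbytes: print stdin as a run of lowercase hex byte values.
hexbytes() {
  od -An -tx1 | tr -d ' \n'
}

printf '\303\227' | hexbytes; echo    # ×  ->  c397
printf '\303\251' | hexbytes; echo    # é  ->  c3a9
```

Both start with 0xc3, which is why splitting on the individual bytes
of × also cuts é in half.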

PS: when I reply-all to your mails in Gmail, it fills in To: me & the
list (and not you), Cc: the list, which is less than super helpful.

-- 
Mikael Magnusson



