Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: multi-byte text decoding error can break word splitting by read at least

X-seq: zsh-workers 54699
From: Mikael Magnusson <mikachu@xxxxxxxxx>
To: Zsh hackers list <zsh-workers@xxxxxxx>
Subject: Re: multi-byte text decoding error can break word splitting by read at least
Date: Sun, 7 Jun 2026 12:41:21 +0200
Archived-at: <https://zsh.org/workers/54699>
In-reply-to: <aiUxwVDCdA9S_OZg@chazelas.org>
List-id: <zsh-workers.zsh.org>
References: <yzyhuwlykbdtojs6fsbyb6iynwri7pwe3wtk5rsgo52spni5ry@g5o5fzavdboh> <CAHYJk3RFi9im87qoeMeVTptOZG=GJ2QCdyvqhqxjOxWJU7kWKw@mail.gmail.com> <aiUxwVDCdA9S_OZg@chazelas.org>

On Sun, Jun 7, 2026 at 11:23 AM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
>
> 2026-06-06 23:03:03 +0200, Mikael Magnusson:
> > On Sun, Apr 27, 2025 at 5:41 PM Stephane Chazelas <stephane@xxxxxxxxxxxx> wrote:
> > >
> > > There was some recent bug report on the bash mailing list about
> > > "read" missing the delimiter when it follows a truncated
> > > character, but zsh has similar issues when it comes to doing
> > > IFS splitting on the record once it has been read:
> > >
> > > $ print 'a\302×b' | IFS=× read -rA a; typeset a
> > > a=( $'a\M-B×b' )
> > >
> > > Wasn't split on ×.
> > >
> > > One might argue that doing reliable word splitting on non-text
> > > is illusory anyway, but note that the latest version of the
> > > POSIX standard now requires that splitting be done by looking
> > > for the byte encodings of the characters in $IFS which would
> > > make the behaviour above non-conformant.
> > >
> > > See https://www.austingroupbugs.net/view.php?id=1920 for some
> > > discussion on that though.
> [...]
> > I think we can probably leave it to users to do this if they want to:
> > % sbread() { setopt localoptions nomultibyte; read "$@" }
> > % print 'a\302×b' | IFS=× sbread -rA a
> > % typeset -p a
> > typeset -a a=( $'a\M-B' '' b )
> [...]
>
> Thanks (and for fixing those other bugs I reported in the past!)
> but then that doesn't split on × characters, but on any of the
> bytes of the encoding of × (in UTF-8: 0xc3 and 0x97)
>
> $ echo 'Stéphane×Chazelas' | IFS=× sbread -rA a; typeset a
> a=( St $'\M-)phane' '' Chazelas )
>
> Stéphane was also split as é also contains the 0xc3 byte.
>
> From POSIX 2024 (https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/utilities/V3_chap02.html#tag_19_06_05):
>
> > The shell shall use the byte sequences that form the
> > characters in the value of the IFS variable as delimiters
> [...]
> > Note that the shell processes arbitrary bytes from the input
> > fields; there is no requirement that those bytes form valid
> > characters
>
> $IFS itself must contain valid characters or the behaviour is
> unspecified:
>
> > If the value of IFS includes any bytes that do not form part
> > of a valid character, the results of field splitting,
> > expansion of '*', and use of the read utility are unspecified.
>
> In the next Technical Corrigendum, this will be updated per
> https://www.austingroupbugs.net/view.php?id=1924#c7196 so that the
> behaviour is required only in locales with a self-synchronising
> character encoding such as UTF-8.
>
> So a\302×b with IFS=× is, and will still be, required to be split into
> a\302 and b in a UTF-8 locale.
>
> Note that splitting seems to work fine with the s parameter
> expansion flag:
>
> $ str=$'×a\302×b'; a=( "${(A@s[×])str}" ); typeset -p1 a
> typeset -a a=(
>   ''
>   $'a\M-B'
>   b
> )

Ah, for some reason I had assumed × was one byte but that's obviously
not the case now that I look again.
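
A quick byte dump makes this easy to check. This is only a sketch: it
assumes the text is stored as UTF-8, and bytes_of is a made-up helper
name, not anything zsh provides.

```shell
# Hypothetical helper: print the bytes of its first argument as
# lowercase hex on a single line, using only POSIX printf, od and tr.
bytes_of() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

bytes_of '×'   # c397 when this text is encoded as UTF-8
bytes_of 'é'   # c3a9: same 0xc3 lead byte as ×
```

The shared 0xc3 byte is why splitting byte by byte under nomultibyte
also cut "Stéphane" in two.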

PS when I reply to all to your mails in gmail, it fills out To: me &
the list (and not you), cc: the list, which is less than super
helpful.

-- 
Mikael Magnusson

