Zsh Mailing List Archive

Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])

X-seq: zsh-workers 51239
From: "Jun. T" <takimoto-j@xxxxxxxxxxxxxxxxx>
To: zsh-workers@xxxxxxx
Subject: Re: read -d $'\200' doesn't work with set +o multibyte (and [PATCH])
Date: Sun, 18 Dec 2022 19:51:22 +0900
Archived-at: <https://zsh.org/workers/51239>
In-reply-to: <18686-1671179384.136789@8qJu.Y1PF.BJgr>
List-id: <zsh-workers.zsh.org>
References: <20221209154225.2z3lbtf422ypnmjx@chazelas.org> <99492-1670616302.663548@1brw.o7tP.wgJL> <20221210090626.mkv7bxeqnap6awah@chazelas.org> <1FF79E35-0103-4B80-BA4A-ECC6FD2ADF7E@kba.biglobe.ne.jp> <46661-1671054174.401235@OHsn.sB58.XThR> <C63F0FA2-1730-40DB-9C12-28FECE3EC406@kba.biglobe.ne.jp> <18686-1671179384.136789@8qJu.Y1PF.BJgr>

> 2022/12/16 17:29, Oliver Kiddle <opk@xxxxxxx> wrote:
> 
>>> +  read -ed $'\xc2'
>>> +0:read delimited by a single byte terminates if the byte is part of a multibyte character
>>> +<one£two
>>> +>one
>> 
>> Is this really what the standard requires (or will require)?
>> Breaking in the middle of a valid multibyte character looks
>> rather odd to me.
> 
> The proposed standard wording appears to only talk about the case of the
> delimiter consisting of "one single-byte character". $'\xc2' is not a
> valid UTF-8 character so my interpretation is that they are leaving this
> undefined.

I thought the "one single-byte character" etc. applies only when the C or
POSIX locale is in use.

> Behaviour that treats the input as raw bytes for a raw byte delimiter
> is consistent. This retains compatibility with the way things
> work for a non-multibyte locale. Not all files are valid UTF-8 and it
> can be useful to force things to work at a raw byte level.

I was thinking it would be enough if we could do 'byte-by-byte' analysis by
using the C/POSIX locale (or by turning the MULTIBYTE option off).
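
As a rough illustration of what byte-level processing under the C locale means for the test input above, here is a minimal sketch. It deliberately uses the standard od and cut utilities rather than the zsh read builtin under discussion, and the helper names hex_bytes and before_byte_c2 are made up for this example.

```shell
# '£' (U+00A3) is encoded in UTF-8 as the two bytes 0xC2 0xA3 (octal 302 243).

# hex_bytes (hypothetical helper): dump stdin as space-separated hex bytes.
hex_bytes() {
  LC_ALL=C od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# before_byte_c2 (hypothetical helper): print what precedes the first 0xC2
# byte on each input line. LC_ALL=C makes cut treat every byte as one
# character, so the lone byte 0xC2 is accepted as a delimiter.
before_byte_c2() {
  LC_ALL=C cut -d "$(printf '\302')" -f 1
}

echo "$(printf 'one\302\243two' | hex_bytes)"   # 6f 6e 65 c2 a3 74 77 6f
printf 'one\302\243two\n' | before_byte_c2      # one
```

The second command splits in the middle of '£', which is the same result ("one") that the proposed test case expects from read -ed $'\xc2'.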

On the web page Stephane mentioned:
https://austingroupbugs.net/view.php?id=243#c6091

"When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input."

But I'm not familiar with this type of document.
